We already have seen that in the Gridworld example in the policy iteration section , we may not need to reach the optimal state value function v∗(s) to obtain an optimal policy result. The value function for the k=3 iteration results the same policy as the policy from a far more accurate value function (large k).We could have stopped early and taking the argument to the limit, we can question the need to start from a policy altogether. Instead of letting the policy π dictate which actions are selected, we will select those actions that maximize the expected reward. We therefore do the improvement step in each iteration computing the optimal value function without a policy.In this section we will look at an algorithm called value iteration that does that.Value Iteration.The basic principle behind value-iteration is the principle that underlines dynamic programming and is called the principle of optimality as applied to policies. According to this principle an optimal policy can be divided into two components.
An optimal first action a∗.
An optimal policy from the successor state s′.
More formally, a policy π(a∣s) achieves the optimal value from state s, vπ(s)=v∗(s) iff for any state s′ reachable from s, vπ(s′)=v∗(s).Effectively this principle allows us to decompose the problem into two sub-problems with one of them being straightforward to determine and use the Bellman optimality equation that provides the one step backup induction at each iteration.v∗(s)=amax(Rsa+γs′∈S∑Pss′av∗(s′))As an example if I want to move optimally towards a location in the room, I can make a optimal first step and at that point I can follow the optimal policy, that I was magically given, towards the desired final location. That optimal first step, think about making it by walking backwards from the goal state. We start at the end of the problem where we know the final rewards and work backwards to all the states that connect to it in our look-ahead tree.One step look-ahead tree representation of value iteration algorithm.The “start from the end” intuition behind the equation is usually applied with no consideration as to if we are at the end or not. We just do the backup inductive step for each state. In value iteration for synchronous backups, we start at k=0 from the value function v0(s)=0.0 and at each iteration k+1 for all states s∈S we update the vk+1(s) from vk(s). As the iterations progress, the value function will converge to v∗.The equation of value iteration is taken straight out of the Bellman optimality equation, by turning the later into an update rule called the Bellman update.vk+1(s)=amax(Rsa+γs′∈S∑Pss′avk(s′))The value iteration can be written in a vector form as,vk+1=amax(Ra+γPavk)Notice that we are not building an explicit policy at every iteration and also, importantly, the intermediate value functions may not correspond to a feasible policy.
Initial value estimates for iteration 2 (A). States with next state positive value (B).v2(s33)=−0.04+P(s43∣s33,→)v1(s43)+P(s33∣s33,→)v1(s33)+P(s32∣s33,→)v1(s32)v2(s33)=−0.04+0.8×1+0.1×0.76+0.1×0=0.836v2(s23)=−0.04+P(s33∣s23,→)v1(s23)+P(s23∣s23,→)v1(s23)=−0.04+0.8×0.76=0.568v2(s32)=−0.04+P(s33∣s32,↑)v1(s33)+P(s42∣s32,↑)v1(s42)+P(s32∣s32,↑)v1(s32)v2(s32)=−0.04+0.8×0.76+0.1×−1+0.1×0=0.468
Initial value estimates for iteration 3 (A). States with next state positive values (B).
Notice how information propagates outward from terminal states and eventually all states have correct value estimates. s32 has a lower value compared to s23 due to the red oval state with negative reward next to s32
After many iterations, the value estimates converge to the optimal value function v∗(s) and a policy π∗ can be derived from the value function using the following equation:π∗(s)=argamaxs′∑P(s′∣s,a)v∗(s′)Optimal value function and optimal policy π∗ for the maze problem.
The rate of convergence depends on the maximum reward value and more importantly on the discount factor γ. The policy that we get from coarse estimates is close to the optimal policy long before v has converged. This means that after a reasonable number of iterations, we could extract the policy in a greedy fashion and use it to guide the agent’s actions.Convergence of value for the maze problem. For this problem, convergence is reached within 5 to 10 iterations.
Run the Value Iteration Demo
Open the value iteration gridworld notebook in Google Colab to implement this algorithm.