Skip to main content
We already have seen that in the Gridworld example in the policy iteration section , we may not need to reach the optimal state value function v(s)v_*(s) to obtain an optimal policy result. The value function for the k=3k=3 iteration results the same policy as the policy from a far more accurate value function (large k). We could have stopped early and taking the argument to the limit, we can question the need to start from a policy altogether. Instead of letting the policy π\pi dictate which actions are selected, we will select those actions that maximize the expected reward. We therefore do the improvement step in each iteration computing the optimal value function without a policy. In this section we will look at an algorithm called value iteration that does that. Value Iteration Value Iteration. The basic principle behind value-iteration is the principle that underlines dynamic programming and is called the principle of optimality as applied to policies. According to this principle an optimal policy can be divided into two components.
  1. An optimal first action aa_*.
  2. An optimal policy from the successor state ss^\prime.
More formally, a policy π(as)\pi(a|s) achieves the optimal value from state ss, vπ(s)=v(s)v_\pi(s) = v_*(s) iff for any state ss^\prime reachable from ss, vπ(s)=v(s)v_\pi(s^\prime)=v_*(s). Effectively this principle allows us to decompose the problem into two sub-problems with one of them being straightforward to determine and use the Bellman optimality equation that provides the one step backup induction at each iteration. v(s)=maxa(Rsa+γsSPssav(s))v_*(s) = \max_a \left( \mathcal R_s^a + \gamma \sum_{s^\prime \in \mathcal S} \mathcal{P}^a_{ss^\prime} v_*(s^\prime) \right) As an example if I want to move optimally towards a location in the room, I can make a optimal first step and at that point I can follow the optimal policy, that I was magically given, towards the desired final location. That optimal first step, think about making it by walking backwards from the goal state. We start at the end of the problem where we know the final rewards and work backwards to all the states that connect to it in our look-ahead tree. Value iteration look-ahead tree One step look-ahead tree representation of value iteration algorithm. The “start from the end” intuition behind the equation is usually applied with no consideration as to if we are at the end or not. We just do the backup inductive step for each state. In value iteration for synchronous backups, we start at k=0k=0 from the value function v0(s)=0.0v_0(s)=0.0 and at each iteration k+1k+1 for all states sSs \in \mathcal{S} we update the vk+1(s)v_{k+1}(s) from vk(s)v_k(s). As the iterations progress, the value function will converge to vv_*. The equation of value iteration is taken straight out of the Bellman optimality equation, by turning the later into an update rule called the Bellman update. vk+1(s)=maxa(Rsa+γsSPssavk(s))v_{k+1}(s) = \max_a \left( \mathcal R_s^a + \gamma \sum_{s^\prime \in \mathcal S} \mathcal{P}^a_{ss^\prime} v_k(s^\prime) \right) The value iteration can be written in a vector form as, vk+1=maxa(Ra+γPavk)\mathbf v_{k+1} = \max_a \left( \mathcal R^a + \gamma \mathcal P^a \mathbf v_k \right) Notice that we are not building an explicit policy at every iteration and also, importantly, the intermediate value functions may not correspond to a feasible policy.

Value Iteration in Gridworld Example

Some of the figures quote utility - this is synonymous to value
  • To solve the non-linear equations for v(s)v^{*}(s) we use an iterative approach.
  • Steps:
    • Initialize estimates for the utilities of states with arbitrary values: v(s)0sϵSv(s) \leftarrow 0 \forall s \epsilon S
    • Next use the iteration step below repeatedly. Since we are taking max\max we only need to consider states whose next states have a positive value.
vk+1(s)R(s)+γmaxa[sP(ss,a)vk(s)]sϵSv_{k+1}(s) \leftarrow R(s) + \gamma \underset{a}{ \max} \left[ \sum_{s^{'}} P(s^{'}| s,a) v_k(s^{'}) \right] \forall s \epsilon S
  • Let us apply this to the maze example. Assume that γ=1\gamma = 1 for simplicity.
Initial state value estimates Initialize state value estimates.
  • For the remaining states, the value is equal to the immediate reward in the first iteration.
Step 1 states States to consider — the only state that has as next state a positive value is shown.

Value Iteration (k=1)

v1(s33)=R(s33)+γmaxa[sP(ss33,a)v(s)]sSv_1(s_{33}) = R(s_{33}) + \gamma \max_a \left[\sum_{s^{'}} P(s^\prime| s_{33},a)v(s{'}) \right] \forall s \in S v1(s33)=0.04+maxa[sP(ss33,)v0(s),sP(ss33,)v0(s),sP(ss33,)v0(s),sP(ss33,)v0(s)]v_1(s_{33}) = -0.04 + \max_a \left[ \sum_{s'} P(s'| s_{33},\uparrow) v_0(s'), \sum_{s'} P(s'| s_{33},\downarrow)v_0(s'), \sum_{s'} P(s'| s_{33},\rightarrow) v_0(s'), \sum_{s'} P(s'| s_{33}, \leftarrow)v_0(s') \right] v1(s33)=0.04+sP(ss33,)v0(s)v_1(s_{33}) = -0.04 + \sum_{s^{'}} P(s^{'}| s_{33},\rightarrow) v_0(s^\prime) v1(s33)=0.04+P(s43s33,)v(s43)+P(s33s33,)v(s33)+P(s32s33,)v0(s32)v_1(s_{33}) = -0.04 + P(s_{43}|s_{33},\rightarrow)v(s_{43})+P(s_{33}|s_{33},\rightarrow)v(s_{33})+P(s_{32}|s_{33},\rightarrow)v_0(s_{32}) v1(s33)=0.04+0.8×1+0.1×0+0.1×0=0.76v_1(s_{33}) = -0.04 + 0.8 \times 1 + 0.1 \times 0 + 0.1 \times 0 = 0.76

Value Iteration (k=2)

Step 2 initial state values Initial value estimates for iteration 2 (A). States with next state positive value (B). v2(s33)=0.04+P(s43s33,)v1(s43)+P(s33s33,)v1(s33)+P(s32s33,)v1(s32)v_2(s_{33}) = -0.04 + P(s_{43}|s_{33},\rightarrow)v_1(s_{43})+P(s_{33}|s_{33},\rightarrow)v_1(s_{33}) +P(s_{32}|s_{33},\rightarrow)v_1(s_{32}) v2(s33)=0.04+0.8×1+0.1×0.76+0.1×0=0.836v_2(s_{33}) = -0.04 + 0.8 \times 1 + 0.1 \times 0.76 + 0.1 \times 0 = 0.836 v2(s23)=0.04+P(s33s23,)v1(s23)+P(s23s23,)v1(s23)=0.04+0.8×0.76=0.568v_2(s_{23}) = -0.04 + P(s_{33}|s_{23},\rightarrow)v_1(s_{23})+P(s_{23}|s_{23},\rightarrow)v_1(s_{23}) = -0.04 + 0.8 \times 0.76 = 0.568 v2(s32)=0.04+P(s33s32,)v1(s33)+P(s42s32,)v1(s42)+P(s32s32,)v1(s32)v_2(s_{32}) = -0.04 + P(s_{33}|s_{32},\uparrow)v_1(s_{33})+P(s_{42}|s_{32},\uparrow)v_1(s_{42}) +P(s_{32}|s_{32},\uparrow)v_1(s_{32}) v2(s32)=0.04+0.8×0.76+0.1×1+0.1×0=0.468v_2(s_{32}) = -0.04 + 0.8 \times 0.76 + 0.1 \times -1 + 0.1 \times 0= 0.468

Value Iteration (k=3)

Step 3 initial state values Initial value estimates for iteration 3 (A). States with next state positive values (B).
Notice how information propagates outward from terminal states and eventually all states have correct value estimates. s32s_{32} has a lower value compared to s23s_{23} due to the red oval state with negative reward next to s32s_{32}
After many iterations, the value estimates converge to the optimal value function v(s)v^*(s) and a policy π\pi^* can be derived from the value function using the following equation: π(s)=argmaxa[sP(ss,a)v(s)]\pi^*(s) = \arg \max_a \left[ \sum_{s^{'}} P(s^{'}| s,a)v^*(s^{'}) \right] Optimal policy Optimal value function and optimal policy π\pi^* for the maze problem.

Value Iteration - Convergence

The rate of convergence depends on the maximum reward value and more importantly on the discount factor γ\gamma. The policy that we get from coarse estimates is close to the optimal policy long before vv has converged. This means that after a reasonable number of iterations, we could extract the policy in a greedy fashion and use it to guide the agent’s actions. Value iteration convergence Convergence of value for the maze problem. For this problem, convergence is reached within 5 to 10 iterations.

Run the Value Iteration Demo

Open the value iteration gridworld notebook in Google Colab to implement this algorithm.