Value Iteration

We already have seen that in the Gridworld example in the policy iteration section , we may not need to reach the optimal state value function \(v_*(s)\) to obtain an optimal policy result. The value function for the \(k=3\) iteration results the same policy as the policy from a far more accurate value function (large k).

We can therefore stop early and taking the argument to the limit, do the policy improvement step in each iteration. In this section we will look at an algorithm called value iteration that does that.

Summary of Value Iteration

The basic principle behind value-iteration is the principle that underlines dynamic programming and is called the principle of optimality as applied to policies. According to this principle an optimal policy can be divided into two components.

An optimal first action \(a_*\).
An optimal policy from the successor state \(s^\prime\).

More formally, a policy \(\pi(a|s)\) achieves the optimal value from state \(s\), \(v_\pi(s) = v_*(s)\) iff for any state \(s^\prime\) reachable from \(s\), \(v_\pi(s^\prime)=v_*(s)\).

Effectively this principle allows us to decompose the problem into two sub-problems with one of them being straightforward to determine and use the Bellman optimality equation that provides the one step backup induction at each iteration.

\[v_*(s) = \max_a \left( \mathcal R_s^a + \gamma \sum_{s^\prime \in \mathcal S} \mathcal{P}^a_{ss^\prime} v_*(s^\prime) \right)\]

As an example if I want to move optimally towards a location in the room, I can make a optimal first step and at that point I can follow the optimal policy, that I was magically given, towards the desired final location. That optimal first step, think about making it by walking backwards from the goal. We start at the end of the problem where we know the final rewards and work backwards to all the states that connect to it in our look-ahead tree.

One step look-ahead tree representation of value iteration algorithm

The “start from the end” intuition behind the equation is usually applied with no consideration as to if we are at the end or not. We just do the backup inductive step for each state. In value iteration for synchronous backups, we start at \(k=0\) from the value function \(v_0(s)=0.0\) and at each iteration \(k+1\) for all states \(s \in \mathcal{S}\) we update the \(v_{k+1}(s)\) from \(v_k(s)\). As the iterations progress, the value function will converge to \(v_*\).

The equation of value iteration is taken straight out of the Bellman optimality equation, by turning the later into an update rule.

\[v_{k+1}(s) = \max_a \left( \mathcal R_s^a + \gamma \sum_{s^\prime \in \mathcal S} \mathcal{P}^a_{ss^\prime} v_k(s^\prime) \right) \]

The value iteration can be written in a vector form as,

\[\mathbf v_{k+1} = \max_a \left( \mathcal R^a + \gamma \mathcal P^a \mathbf v_k \right) \]

Notice that we are not building an explicit policy at every iteration and also, importantly, the intermediate value functions may not correspond to a feasible policy. Before going into a more elaborate example, we can go back to the same simple world we have looked at in the policy iteration section and focus only on the state-value calculation using the formula above.

State values for an MDP with random policy (0.25 prob of taking any of the four available actions), \(\gamma=1\), that rewards the agent with -1 at each transition except towards the goal states that are in the top left and bottom right corners

# Look at the backup tree above and the vector form Bellman equation to understand this code. 
# Have them side by side while you are reading. 
# Each of the 11 rows of the "matrix" P[s][a] has 4 tuples - one for each of the allowed actions. Each tuple / action is written in the format (probability, s') and is associated with the 3 possible next states that the agent may end up despite its intention to go to the desired state. The states are numbered sequentially from top left to bottom right. 
import numpy as np
P = {
 0: {0: [(0.9,0),(0.1,1),(0,4)], 1: [(0.8,1),(0.1,4),(0.1,0)], 2: [(0.8,4),(0.1,1),(0.1,0)], 3: [(0.9,0),(0.1,4)]},
 1: {0: [(0.8,1),(0.1,2),(0.1,0)], 1: [(0.8,2),(0.2,1)], 2: [(0.8,1),(0.1,0),(0.1,2)], 3: [(0.8,0),(0.2,1)]},
 2: {0: [(0.8,2),(0.1,3),(0.1,1)], 1: [(0.8,3),(0.1,5),(0.1,2)], 2: [(0.8,5),(0.1,1),(0.1,3)], 3: [(0.8,1),(0.1,2),(0.1,5)]},
 3: {0: [(0.9,3),(0.1,2)], 1: [(0.9,3),(0.1,6)], 2: [(0.8,6),(0.1,2),(0.1,3)], 3: [(0.8,2),(0.1,3),(0.1,6)]},
 4: {0: [(0.8,0),(0.2,4)], 1: [(0.8,4),(0.1,7),(0.1,0)], 2: [(0.8,7),(0.2,4)], 3: [(0.8,4),(0.1,0),(0.1,7)]},
 5: {0: [(0.8,2),(0.1,6),(0.1,5)], 1: [(0.8,6),(0.1,9),(0.1,2)], 2: [(0.8,9),(0.1,5),(0.1,6)], 3: [(0.8,5),(0.1,2),(0.1,9)]},
 6: {0: [(0.8,3),(0.1,6),(0.1,5)], 1: [(0.8,6),(0.1,10),(0.1,3)], 2: [(0.8,10),(0.1,5),(0.1,6)], 3: [(0.8,5),(0.1,3),(0.1,10)]},
 7: {0: [(0.8,4),(0.1,8),(0.1,7)], 1: [(0.8,8),(0.1,7),(0.1,4)], 2: [(0.9,7),(0.1,8)], 3: [(0.9,7),(0.1,4)]},
 8: {0: [(0.8,8),(0.1,9),(0.1,7)], 1: [(0.8,9),(0.2,8)], 2: [(0.8,8),(0.1,7),(0.1,9)], 3: [(0.8,7),(0.2,8)]},
 9: {0: [(0.8,5),(0.1,10),(0.1,8)], 1: [(0.8,9),(0.1,9),(0.1,5)], 2: [(0.8,9),(0.1,8),(0.1,10)], 3: [(0.8,8),(0.1,5),(0.1,9)]},
 10: {0: [(0.8,6),(0.1,10),(0.1,9)], 1: [(0.9,10),(0.1,6)], 2: [(0.9,10),(0.1,9)], 3: [(0.8,9),(0.1,6),(0.1,10)]}
}

R = [0, 0, 0, 1, 0, 0, -100, 0, 0, 0, 0]
gamma = 0.9

States = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Actions = [0, 1, 2, 3] # [north, east, south, west]

v = [0]*11

# once v computed, we can calculate the optimal policy 
optPolicy = [0]*11

# value iteration

for i in range(100):
    for s in States:
        # trans[1] = s'
        # trans[0] = P_ss'
        q_0 = sum(trans[0]*v[trans[1]] for trans in P[s][0])
        q_1 = sum(trans[0]*v[trans[1]] for trans in P[s][1])
        q_2 = sum(trans[0]*v[trans[1]] for trans in P[s][2])
        q_3 = sum(trans[0]*v[trans[1]] for trans in P[s][3])

        v[s] = R[s] + gamma*max(q_0, q_1, q_2, q_3)
    
    print(v)

    # [5.46991289990088, 6.313016781079707, 7.189835364530538, 8.668832766371658, 4.8028486314273, 3.346646443535637, -96.67286272722137, 4.161433444369266, 3.6539401768050603, 3.2220160316109103, 1.526193402980731]

    
    for s in States:       
        optPolicy[s] = np.argmax([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])

    print(optPolicy)
    # [1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]

[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, -99.28, 0.0, 0.0, 0.0, 0.0]
[0, 0, 1, 0, 0, 3, 3, 0, 0, 0, 2]
[0.0, 0.0, 0.7200000000000001, 1.8748, 0.0, 0.06480000000000001, -99.784612, 0.0, 0.0, 0.04665600000000001, 0.004199040000000001]
[0, 1, 1, 0, 0, 3, 3, 0, 1, 0, 2]
[0.0, 0.5184000000000001, 1.4204880000000002, 2.6464319200000004, 0.0, 0.17869896000000002, -99.6327799624, 0.0, 0.03359232000000001, 0.1320644736, 0.015287025024000004]
[1, 1, 1, 0, 0, 3, 3, 1, 1, 0, 2]
[0.3732480000000001, 1.1160633600000003, 2.0493578088000004, 3.3280520579920005, 0.26873856000000007, 0.32499125661600003, -99.46510577776505, 0.19651507200000007, 0.14753746944000007, 0.24864790926528008, 0.03476080210331521]
[1, 1, 1, 0, 0, 3, 3, 0, 1, 0, 2]
[0.8613444096000002, 1.6764290271360005, 2.6098888976416807, 3.9306121677612715, 0.6685409157120002, 0.4912620173851466, -99.28940778019488, 0.5123141880422402, 0.39542295988961296, 0.3924251910966691, 0.06347451690238555]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 0, 2]
[1.3447185788160003, 2.18087723118649, 3.10914434314053, 4.463618846769277, 1.0885347415756803, 0.6688499105986535, -99.11098966163851, 0.8654413572483566, 0.6942939099989471, 0.595406374351821, 0.10500093238259618]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[1.789224405289524, 2.63114182867475, 3.553825052510406, 4.9353755206090515, 1.48417782529208, 0.8550027640486311, -98.93076412911574, 1.208984208262555, 0.9954415337488502, 0.847254726755213, 0.1613036806378721]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[2.1890283173981646, 3.032359566968948, 3.9502648783288303, 5.353178010742926, 1.8432523970792531, 1.0473787545725786, -98.74958394448348, 1.525540042678089, 1.2775683068030173, 1.0903661942176737, 0.22878893879626708]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[2.5462041525206103, 3.390015434451169, 4.3040760946960335, 5.723441037224414, 2.165052421289105, 1.2396125092944903, -98.5717782954661, 1.8111174947814555, 1.5339668914671913, 1.3141542451724726, 0.3035929224904989]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[2.864824204447716, 3.7091375663823545, 4.619809521160725, 6.051770097056241, 2.452382863034395, 1.4265777456620208, -98.40088135136413, 2.0667732561471426, 1.7641907848900371, 1.5168832442959312, 0.38242975920393796]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[3.1491276838686857, 3.993907617184546, 4.901449323894542, 6.343064217766064, 2.708800847731645, 1.6047859080137974, -98.23925968830277, 2.2951233740601307, 1.970043170603501, 1.6993813065423966, 0.4627124225440055]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[3.402827052216903, 4.2479468842972885, 5.152567407663318, 6.601613083080211, 2.9376196301878665, 1.7721212380484486, -98.08828341309894, 2.498951122754991, 2.153852579092224, 1.8632090859595776, 0.5424858799970065]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[3.629161958110477, 4.4744789726911005, 5.376383397931811, 6.831181103108833, 3.1417681432733597, 1.9274906149451079, -97.94857672876, 2.6808253963230686, 2.31788774958921, 2.0100421527856533, 0.6203173565482841]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[3.8310085694621376, 4.676402261595302, 5.575799055397284, 7.035078608503911, 3.323844435801944, 2.070518951496942, -97.82024071806751, 2.843052176909505, 2.4642173623009014, 2.1414870002420825, 0.6951908888258975]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[4.0109463988223855, 4.856327726973199, 5.7534252187432955, 7.216221942575064, 3.486173405596468, 2.2013157447864824, -97.70302550892765, 2.9876991105583937, 2.594702484816206, 2.259038036120239, 0.7664180431997986]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[4.171296745818401, 5.016605148350349, 5.911606485371727, 7.377184357169258, 3.630844869996613, 2.3203053431805443, -97.5964559368768, 3.1166244499812756, 2.7110160512534356, 2.364072461039544, 0.8335651364853959]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[4.314148452235603, 5.159345596170706, 6.052444801731571, 7.520239361462941, 3.7597389622090245, 2.4281064007393924, -97.49992098665228, 3.2314996979016217, 2.8146626717147862, 2.4578532211947506, 0.8963945504606982]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[4.441378696542925, 5.286442464557458, 6.177821948475705, 7.647397858147795, 3.8745456747085303, 2.525447373802704, -97.41273657408729, 3.333827499055619, 2.906995080228708, 2.5415335113144413, 0.9548176018914653]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[4.554671767894001, 5.399591446522851, 6.289420696871469, 7.760440127818147, 3.976781894331216, 2.6131079878746792, -97.33418905305636, 3.4249569960540653, 2.9892281516000945, 2.616162004079089, 1.0088568378992049]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[4.655536671096723, 5.500309362121571, 6.38874447365622, 7.860943506161759, 4.06780714416926, 2.691879334265947, -97.26356484836303, 3.5060978070907414, 3.062451488393351, 2.6826887920942664, 1.05861602998684]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[4.74532368410147, 5.589951706214362, 6.477135467149462, 7.950306432034478, 4.148838338503525, 2.762537304003417, -97.20017011953563, 3.5783330403161067, 3.1276410569384003, 2.7419719096444397, 1.10425645615734]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[4.825239810508791, 5.669728843466198, 6.555791180468583, 8.0297694161901, 4.220963564496964, 2.8258255369926326, -97.14334328485404, 3.64263143519072, 3.1856700235862307, 2.794784187179423, 1.1459783063335938]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[4.896363071046181, 5.740720841761297, 6.6257794842283815, 8.100433380694536, 4.285154852762704, 2.882445117061398, -97.09246246388327, 3.6998586252790737, 3.237318814446455, 2.8418201837831214, 1.184006244670692]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[4.959655619210934, 5.803890980161469, 6.688052248216145, 8.163275740702026, 4.34227991932916, 2.9330490031641405, -97.04694933903828, 3.750787511492293, 3.283284394874813, 2.883702991135119, 1.2185783273854214]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.015975704184867, 5.860097995144689, 6.743457645929686, 8.219164538102314, 4.393112892492354, 2.9782397396140134, -97.00627052958401, 3.7961077541675348, 3.3241887740780913, 2.920990763103648, 1.2499376138615197]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.066088530205127, 5.910107144195419, 6.7927512321326, 8.268870886754808, 4.438344062396316, 3.0185693920933527, -96.96993727263731, 3.8364344124674545, 3.360586756310624, 2.954182878511379, 1.278325926293855]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.1106760771548325, 5.954600173090649, 6.836605894643799, 8.313079948789337, 4.4785887067828165, 3.0545409518911804, -96.93750398588087, 3.872315774073655, 3.392972973468944, 2.98372568563387, 1.303979312005071]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.150345955179655, 5.994184275299852, 6.875620779316471, 8.352400628657845, 4.514395054950258, 3.086610667207181, -96.90856612495116, 3.9042404268430198, 3.421788242551384, 3.0100178063926917, 1.3271248452994497]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.185639369127586, 6.029400130661833, 6.910329282820777, 8.387374144666724, 4.54625145566291, 3.115190918418383, -96.88275762964182, 3.932643628322792, 3.4474252960516596, 3.033414998390192, 1.3479784745476715]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.217038268307665, 6.06072910715009, 6.941206202271567, 8.418481615384486, 4.574592815200843, 3.140653369320794, -96.85974816599513, 3.957913030138308, 3.470233934988881, 3.054234586285983, 1.3667436771493524]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.244971754663831, 6.088599704922545, 6.968674124520143, 8.446150779668248, 4.59980637009411, 3.163332209883523, -96.83924030777028, 3.9803938133292065, 3.4905256538950273, 3.0727594824596753, 1.3836107319123463]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.2698218187724475, 6.113393316540561, 6.9931091314574685, 8.470761953362453, 4.622236856133102, 3.1835273663686796, -96.82096675453982, 4.000393288466015, 3.508577785396636, 3.08924182188013, 1.3987564568182123]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.291928468650704, 6.135449371626679, 7.01484589122532, 8.492653312433866, 4.642191131532466, 3.20150759796494, -96.80468765023255, 4.018185011351015, 3.5246372095441254, 3.103906238657827, 1.4123442915019566]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.311594311587696, 6.155069928575034, 7.034182198979507, 8.512125580979587, 4.6599423080189855, 3.217513429922117, -96.79018804193274, 4.034012461654233, 3.53892367010899, 3.1169528126506685, 1.424524629255145]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.3290886443386265, 6.172523770408751, 7.05138302490645, 8.529446192835046, 4.675733439367229, 3.231759894924065, -96.77727550166655, 4.048092328203095, 3.551632736925847, 3.1285597142683366, 1.435435323980818]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.344651102227827, 6.18805005660622, 7.066684121625979, 8.544852987142725, 4.689780812690137, 3.244439069575815, -96.76577792190429, 4.060617440998505, 3.5629384501655763, 3.1388855746651885, 1.4452023141443295]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.358494913099095, 6.201861577759825, 7.080295237950924, 8.558557491001192, 4.702276883715573, 3.255722403230037, -96.75554148721127, 4.07175938647998, 3.5729956792953894, 3.1480716071032506, 1.4539403190961995]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.370809797700395, 6.214147655321434, 7.092402981227146, 8.570747836021411, 4.713392893413088, 3.265762843275362, -96.74642881888116, 4.081670839177208, 3.5819422264807597, 3.1562435036002223, 1.4617535737919416]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.381764554031647, 6.225076724441403, 7.103173366140642, 8.581591350130001, 4.7232811997171416, 3.2746967654349386, -96.73831728573387, 4.090487639705559, 3.5899007013545394, 3.163513129188433, 1.4687365763984317]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.391509359435203, 6.2347986340207155, 7.112754083935403, 8.591236861159487, 4.732077354742432, 3.2826457202943016, -96.73109747200789, 4.09833064610996, 3.5969801914429884, 3.169980034292398, 1.4749748299690457]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.400177820770902, 6.24344669455722, 7.1212765224155055, 8.59981674455658, 4.739901954808688, 3.289718008715609, -96.72467179201746, 4.10530738284202, 3.6032777501059927, 3.1757328039470356, 1.4805455646301604]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.407888799883362, 6.251139501159464, 7.128857563882539, 8.607448743840258, 4.746862287781585, 3.2960100993799, -96.71895324068413, 4.111513509168064, 3.608879721620085, 3.1808502608658857, 1.4855184308283598]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.41474803872466, 6.257982556204132, 7.135601185258607, 8.614237589183885, 4.753053799682442, 3.3016079017047324, -96.71386426897149, 4.117034126542292, 3.6138629210020654, 3.1854025377528434, 1.4899561573687274]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.420849605923615, 6.264069713502941, 7.141599882039098, 8.620276436622467, 4.758561400207842, 3.3065879070086823, -96.70933577349454, 4.121944942428639, 3.6182956843289924, 3.189452032745412, 1.4939151704157563]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.42627718427395, 6.269484463498681, 7.146935935382477, 8.625648147848622, 4.763460624714655, 3.311018210177761, -96.70530619002822, 4.126313306202739, 3.6222388036451916, 3.1930542604876244, 1.4974461714806488]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.431105216528025, 6.2743010769051475, 7.151682539551429, 8.630426428317012, 4.767818668348817, 3.314959423331503, -96.70172068121953, 4.130199131097463, 3.625746359046308, 3.196258610057064, 1.5005946738044613]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.435399925010622, 6.278585622319956, 7.155904805047714, 8.634676839391073, 4.771695306310436, 3.3184654921581123, -96.69853040945856, 4.133655714656453, 3.6288664591809816, 3.1991090198096725, 1.5034014975644843]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.439220218889265, 6.282396871651947, 7.159660651110097, 8.638457698506677, 4.775143712736149, 3.3215844247366206, -96.69569188654323, 4.136730468815397, 3.631641900199663, 3.201644578152924, 1.5059032250609954]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.442618501435689, 6.285787105696621, 7.163001599751013, 8.641820879768, 4.778211189326203, 3.3243589418217216, -96.69316639245375, 4.139465569526222, 3.63411075209482, 3.203900058305989, 1.5081326175469452]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.445641388270138, 6.288802830846122, 7.165973482174507, 8.644812526007787, 4.780939813633216, 3.326827056754884, -96.69091945621656, 4.141898534761809, 3.6363068804055705, 3.2059063942474895, 1.5101189956952996]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.448330346380511, 6.291485416717947, 7.168617067229252, 8.647473682116942, 4.783367015847947, 3.329022592396423, -96.68892039247147, 4.144062738775586, 3.6382604103914256, 3.2076911042797787, 1.5118885858983728]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.450722262637484, 6.293871663414293, 7.17096862049051, 8.64984085835887, 4.78552609195162, 3.3309756417547507, -96.68714188795343, 4.145987869630197, 3.6399981400041987, 3.209278667946131, 1.5134648346928339]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.45284994957131, 6.29599430616774, 7.17306040162046, 8.651946531416527, 4.787446660242635, 3.332712978324414, -96.68555963265658, 4.147700336241794, 3.641543907294848, 3.21069086141664, 1.5148686936286933]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.454742595324029, 6.297882464276925, 7.1749211068149386, 8.653819590060731, 4.789155067476975, 3.3342584215344204, -96.68415199096317, 4.149223630501721, 3.6429189172743124, 3.2119470559031007, 1.5161188768705207]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.456426163931477, 6.299562040476602, 7.176576262395169, 8.655485731564758, 4.790674750176519, 3.3356331621516273, -96.68289970849165, 4.150578649426937, 3.6441420326967715, 3.213064483166601, 1.517232093750116]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.457923751412874, 6.301056076210311, 7.178048574935837, 8.656967814311678, 4.792026556049043, 3.3368560519783914, -96.68178565085, 4.151783981746445, 3.6452300327428593, 3.214058471737908, 1.5182232583940058]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.459255902542997, 6.3023850676716595, 7.179358242726689, 8.658286171437862, 4.793229029919787, 3.337943861726256, -96.68079457087222, 4.152856162846285, 3.64619784314304, 3.2149426570747637, 1.5191056784358736]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.460440892645246, 6.303567246944115, 7.180523232836027, 8.659458889819911, 4.794298668090139, 3.3389115105348752, -96.67991290127186, 4.15380990156394, 3.6470587408917847, 3.2157291685269525, 1.5198912247004834]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.461494978265947, 6.30461883209188, 7.181559527573717, 8.660502058235764, 4.795250144607707, 3.339772270234171, -96.67912856996713, 4.154658281938565, 3.647824536356289, 3.21642879566503, 1.5205904836172444]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.462432620164783, 6.3055542496296155, 7.18248134373246, 8.66142998810689, 4.796096512548032, 3.3405379471143775, -96.67843083562248, 4.155412942681121, 3.648505735274539, 3.217051136247815, 1.5212128939922713]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.463266681677476, 6.306386332420702, 7.183301327613178, 8.662255409851767, 4.796849383066429, 3.3412190436698412, -96.67781014121175, 4.156084236823839, 3.649111682862581, 3.2176047278536477, 1.5217668696405682]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.464008605169857, 6.307126495717215, 7.184030728508745, 8.662989647545718, 4.797519084674255, 3.3418249025149014, -96.6772579836425, 4.156681373737242, 3.649650692006079, 3.2180971649775465, 1.5222599092568394]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.464668569002366, 6.307784893755396, 7.184679553025046, 8.663642774284286, 4.7981148049230695, 3.3423638344309627, -96.676766797691, 4.157212545461509, 3.6501301572933813, 3.2185352031980003, 1.52269869478586]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.465255627157174, 6.308370559054005, 7.185256702355726, 8.664223750382288, 4.798644716439318, 3.342843232290129, -96.67632985268597, 4.1576850390842495, 3.650556656453469, 3.2189248518404296, 1.5230891794421855]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.465777833442568, 6.308891526325844, 7.185770094393375, 8.664740546305056, 4.799116089037727, 3.3432696724099356, -96.6759411605476, 4.1581053367055585, 3.6509360405896265, 3.2192714564070646, 1.523436666424806]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.466242351977835, 6.309354942701883, 7.186226772351939, 8.66520025201877, 4.799535389450832, 3.3436490047234644, -96.67559539393919, 4.158479204361166, 3.6512735144461725, 3.2195797719029917, 1.5237458792753622]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.4666555554739364, 6.3097671657797365, 7.186633001390302, 8.665609174260332, 4.799908370042385, 3.343986432997291, -96.67528781342374, 4.158811771123179, 3.651573707809, 3.2198540280635055, 1.524021024738759]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.46702311265788, 6.310133850841371, 7.186994354562323, 8.66597292306148, 4.800240147721303, 3.3442865861943747, -96.67501420263802, 4.159107599463235, 3.65184073901915, 3.2200979873769975, 1.5242658489023246]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.467350066039915, 6.310460028436321, 7.187315789272368, 8.666296488714313, 4.800535274138573, 3.344553581958393, -96.67477081060446, 4.159370747843187, 3.652078271470542, 3.2203149966989755, 1.5244836873137908]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.467640901090215, 6.310750173394643, 7.187601715285075, 8.666584310234251, 4.800797798129898, 3.3447910830886074, -96.67455430039688, 4.159604826391764, 3.6522895638667676, 3.2205080331649554, 1.5246775097090164]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.467899607773954, 6.311008266216289, 7.187856055222292, 8.66684033625975, 4.801031321260629, 3.3450023477786495, -96.67436170346218, 4.159813046430921, 3.6524775149262814, 3.2206797450318474, 1.5248499599171697]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.468129735288841, 6.311237847678983, 7.188082298377105, 8.667068079224336, 4.80123904723488, 3.3451902743074338, -96.67419037897591, 4.1599982645312625, 3.65264470314924, 3.2208324880079884, 1.5250033914536265]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.468334440756003, 6.311442067413733, 7.18828354858313, 8.667270663544194, 4.801423825846601, 3.345357440794553, -96.67403797767811, 4.160163021700798, 3.652793422191438, 3.2209683575700643, 1.5251398992587435]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.4685165325321226, 6.311623727114326, 7.1884625667958115, 8.66745086848242, 4.801588192075517, 3.3455061405650075, -96.6739024096965, 4.160309578244674, 3.6529257123306245, 3.2210892177102064, 1.5252613479935009]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.468678508737003, 6.3117853189735635, 7.188621808969818, 8.667611166278045, 4.801734400864237, 3.3456384136080084, -96.67378181591779, 4.160439944774028, 3.6530433884568128, 3.221196726507545, 1.5253693972604148]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.468822591525078, 6.311929059873511, 7.188763459752197, 8.667753756062915, 4.801864458053619, 3.345756074561143, -96.67367454251688, 4.160555909789382, 3.6531480649705816, 3.221292358875001, 1.5254655240796862]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.468950757571011, 6.312056921798814, 7.1888894624535, 8.667880594031775, 4.801980147900779, 3.345860737603589, -96.67357911829538, 4.160659064216958, 3.653241177930915, 3.2213774267933317, 1.5255510429159458]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469064765187608, 6.312170658890307, 7.189001545708017, 8.66799342027946, 4.802083057557218, 3.345953838599706, -96.67349423452063, 4.160750823234506, 3.653324004756409, 3.2214530973099884, 1.525627123519815]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469166178448055, 6.312271831510027, 7.1891012471889075, 8.668093782673363, 4.802174598842899, 3.3460366547966895, -96.673418726989, 4.1608324456860695, 3.653397681750124, 3.2215204085496905, 1.5256948068205223]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469256388643407, 6.312361827647819, 7.189189934703526, 8.668183058088742, 4.802256027614975, 3.3461103223464064, -96.67335156006875, 4.160905051352039, 3.6534632196884904, 3.2215802839563623, 1.5257550190806959]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469336633369684, 6.312441881963146, 7.189268824958388, 8.668262471298135, 4.802328460996868, 3.346175851891741, -96.67329181250385, 4.1609696363113935, 3.653521517688132, 3.2216335449617843, 1.5258085845019245]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469408013506456, 6.312513092723406, 7.18933900025117, 8.668333111774096, 4.802392892704084, 3.3462341424312196, -96.67323866478468, 4.161027086606898, 3.653573375540831, 3.221680922254769, 1.5258562364494883]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469471508319802, 6.312576436871056, 7.189401423318765, 8.668395948635707, 4.802450206676993, 3.3462859936520966, -96.67319138791282, 4.161078190400731, 3.653619504685876, 3.2217230658054494, 1.525898627446576]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469527988896872, 6.312632783426301, 7.189456950545087, 8.66845184394398, 4.802501189207607, 3.346332116901058, -96.67314933340609, 4.161123648787272, 3.653660537970294, 3.221760553782198, 1.5259363380721245]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.4695782300963405, 6.312682905409198, 7.1895063437098194, 8.668501564528508, 4.802546539726734, 3.346373144943044, -96.67311192440695, 4.16116408541143, 3.653697038330883, 3.2217939004835077, 1.5259698848819365]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.4696229211787, 6.3127274904447255, 7.189550280439284, 8.668545792507627, 4.802586880399476, 3.3464096406420434, -96.67307864777267, 4.161200055024431, 3.65372950651715, 3.2218235633936474, 1.5259997274597967]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469662675262239, 6.312767150196335, 7.1895893635028125, 8.668585134646431, 4.802622764660718, 3.3464421046829527, -96.67304904703872, 4.161232051094459, 3.653758387961098, 3.221849949458885, 1.5260262746937352]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469698037734428, 6.312802428757366, 7.18962412908215, 8.668620130681003, 4.8026546848077185, 3.3464709824404193, -96.67302271615917, 4.161260512576558, 3.653784078888119, 3.221873420670383, 1.52604989036226]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469729493734098, 6.312833810115475, 7.189655054127353, 8.668651260723074, 4.80268307875394, 3.3464966700888983, -96.67299929393832, 4.161285829934658, 3.6538069317528157, 3.221894299030363, 1.5260708981061633]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469757474807065, 6.31286172479248, 7.189682562900076, 8.668678951846697, 4.802708336036797, 3.3465195200377464, -96.67297845907706, 4.1613083504983654, 3.65382726007433, 3.221912870969647, 1.5260895858532606]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469782364826533, 6.312886555750702, 7.189707032794027, 8.668703583947288, 4.802730803161728, 3.346539845765909, -96.67295992576649, 4.161328383227986, 3.6538453427375304, 3.221929391277222, 1.5261062097560911]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469804505259449, 6.312908643646827, 7.189728799512442, 8.668725494953424, 4.802750788355914, 3.3465579261225242, -96.67294343976792, 4.161346202953156, 3.6538614278190273, 3.221944086595677, 1.526120997696045]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469824199851098, 6.312928291505388, 7.189748161673613, 8.668744985462897, 4.802768565796856, 3.346574009152454, -96.67292877492592, 4.161362054143233, 3.6538757359905527, 3.2219571585305298, 1.5261341524015442]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469841718792195, 6.312945768875973, 7.189765384907632, 8.668762322866634, 4.802784379373815, 3.3465883154992015, -96.67291573006644, 4.161376154261188, 3.653888463546355, 3.2219687864160518, 1.5261458542226956]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469857302425641, 6.312961315531171, 7.189780705500592, 8.668777745017028, 4.802798446033749, 3.346601041431923, -96.67290412623744, 4.161388696746979, 3.653899785096169, 3.2219791297755593, 1.526156263600184]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.46987116454379, 6.312975144756037, 7.189794333636187, 8.66879146349105, 4.802810958757604, 3.3466123615380416, -96.6728938042544, 4.161399853671358, 3.653909855960688, 3.22198833050992, 1.5261655232620417]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469883495321473, 6.312987446274143, 7.189806456279237, 8.668803666492883, 4.8028220892078295, 3.3466224311184143, -96.6728846225168, 4.161409778096521, 3.6539188143024197, 3.2219965148442924, 1.5261737601782401]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.46989446392502, 6.312998388850397, 7.189817239740665, 8.668814521435895, 4.8028319900834235, 3.3466313883179053, -96.67287645506583, 4.16141860617597, 3.6539267830211344, 3.222003795059815, 1.5261810872997579]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.469904220833046, 6.313008122606352, 7.189826831959116, 8.668824177239394, 4.8028407972148095, 3.346639356020596, -96.67286918985664, 4.161426459022403, 3.6539338714399343, 3.22201027103399, 1.526187605105863]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]
[5.46991289990088, 6.313016781079707, 7.189835364530538, 8.668832766371658, 4.8028486314273, 3.346646443535637, -96.67286272722137, 4.161433444369266, 3.6539401768050603, 3.2220160316109103, 1.526193402980731]
[1, 1, 1, 0, 0, 3, 3, 0, 3, 3, 2]

Value Iteration in Gridworld Example

To solve the non-linear equations for \(U^{*}(s)\) we use an iterative approach.
Steps:
- Initialize estimates for the utilities of states with arbitrary values: \(U(s) \leftarrow 0 \forall s \epsilon S\)
- Next use the iteration step below which is also called Bellman Update:

\[v_{t+1}(s) \leftarrow R(s) + \gamma \underset{a}{ \max} \left[ \sum_{s^{'}} P(s^{'}| s,a) v_t(s^{'}) \right] \forall s \epsilon S\]

This step is repeated and updated

Let us apply this to the maze example. Assume that \(\gamma = 1\)

Initialize value estimates to \(0\)

Value Iteration

Next we want to apply Bellman Update: \[v_{t+1}(s) \leftarrow R(s) + \gamma \max_{a} \left[\sum_{s^\prime} P(s^\prime | s,a)v_t(s^\prime) \right] \forall s \epsilon S\]
Since we are taking \(\max\) we only need to consider states whose next states have a positive utility value.
For the remaining states, the utility is equal to the immediate reward in the first iteration.

Value Iteration (t=0)

\[ v_{t+1}(s_{33}) = R(s_{33}) + \gamma \max_a \left[\sum_{s^{'}} P(s^{'}| s_{33},a)U(s^{'}) \right] \forall s \in S \]

\[ v_{t+1}(s_{33}) = -0.04 + \max_a \left[ \sum_{s'} P(s'| s_{33},\uparrow) v_t(s'), \sum_{s'} P(s'| s_{33},\downarrow)v_t(s'), \sum_{s'} P(s'| s_{33},\rightarrow) v_t(s'), \sum_{s'} P(s'| s_{33}, \leftarrow)v_t(s') \right]\]

\[v_{t+1}(s_{33}) = -0.04 + \sum_{s^{'}} P(s^{'}| s_{33},\rightarrow) v_t(s^\prime) \]

\[v_{t+1}(s_{33}) = -0.04 + P(s_{43}|s_{33},\rightarrow)U(s_{43})+P(s_{33}|s_{33},\rightarrow)U(s_{33})+P(s_{32}|s_{33},\rightarrow)v_t(s_{32}) \]

\[v_{t+1}(s_{33}) = -0.04 + 0.8 \times 1 + 0.1 \times 0 + 0.1 \times 0 = 0.76 \]

Value Iteration (t=1)

(A) Initial utility estimates for iteration 2. (B) States with next state positive utility

\[v_{t+1}(s_{33}) = -0.04 + P(s_{43}|s_{33},\rightarrow)v_t(s_{43})+P(s_{33}|s_{33},\rightarrow)v_t(s_{33}) +P(s_{32}|s_{33},\rightarrow)v_t(s_{32}) \]

\[v_{t+1}(s_{33}) = -0.04 + 0.8 \times 1 + 0.1 \times 0.76 + 0.1 \times 0 = 0.836\]

\[v_{t+1}(s_{23}) = -0.04 + P(s_{33}|s_{23},\rightarrow)v_t(s_{23})+P(s_{23}|s_{23},\rightarrow)v_t(s_{23}) = -0.04 + 0.8 \times 0.76 = 0.568\]

\[v_{t+1}(s_{32}) = -0.04 + P(s_{33}|s_{32},\uparrow)v_t(s_{33})+P(s_{42}|s_{32},\uparrow)v_t(s_{42}) +P(s_{32}|s_{32},\uparrow)v_t(s_{32})\] \[v_{t+1}(s_{32}) = -0.04 + 0.8 \times 0.76 + 0.1 \times -1 + 0.1 \times 0= 0.468\]

Value Iteration (t=2)

(A)Initial utility estimates for iteration 3. (B) States with next state positive utility

Information propagates outward from terminal states and eventually all states have correct value estimates
Notice that \(s_{32}\) has a lower utility compared to \(s_{23}\) due to the red oval state with negative reward next to \(s_{32}\)

Value Iteration - Convergence

Rate of convergence depends on the maximum reward value and more importantly on the discount factor \(\gamma\).
The policy that we get from coarse estimates is close to the optimal policy long before \(V\) has converged.
This means that after a reasonable number of iterations, we could use: \[\pi(s) = \arg \max_a \left[ \sum_{s^{'}} P(s^{'}| s,a)V_{est}(s^{'}) \right]\]
Note that this is a form of greedy policy.

Convergence of utility for the maze problem (Norvig chap 17)

For the maze problem, convergence is reached within 5 to 10 iterations