SARSA

SARSA - TD Learning

In the TD prediction section, we met the TD prediction step for $V(s)$, but for control we need to predict $Q(s,a)$.

The backup tree for SARSA under TD(0), also known as single-step TD, is shown below:

SARSA action-value backup update tree. The name comes from the State-Action-Reward-State-Action quintuple $(S, A, R, S^\prime, A^\prime)$ that we need to know before performing an update.

Following the value estimate of temporal difference (TD) learning, we can write the action-value update equation as:

$$Q(S,A) \leftarrow Q(S,A) + \alpha \left(R + \gamma Q(S^\prime, A^\prime) - Q(S,A)\right)$$

Effectively, the equation above moves $Q(S,A)$ by $\alpha$ times the TD error. SARSA follows the policy iteration diagram we saw in the control discussion above, but with a twist: instead of evaluating the policy over complete episodes as in MC, SARSA performs policy improvement on an estimate that is updated at every time step, which significantly increases the iteration rate. This is shown schematically below:

SARSA on-policy control
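To make the update rule above concrete, here is a minimal sketch of a single SARSA update on a tabular Q function. The function name `sarsa_update`, the dictionary representation of Q, the state and action labels, and the values of `alpha` and `gamma` are all assumptions chosen for illustration, not part of the original notes.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one SARSA update to the table q in place and return the TD error.

    q maps (state, action) pairs to estimated action values.
    """
    td_target = r + gamma * q[(s_next, a_next)]   # R + gamma * Q(S', A')
    td_error = td_target - q[(s, a)]              # target minus current estimate
    q[(s, a)] += alpha * td_error                 # move Q(S, A) by alpha * TD error
    return td_error


# Hypothetical two-entry table: Q(s0, right) = 0 and Q(s1, right) = 1.
q = {("s0", "right"): 0.0, ("s1", "right"): 1.0}
delta = sarsa_update(q, "s0", "right", r=0.5, s_next="s1", a_next="right")
```

With these illustrative numbers the TD target is 0.5 + 0.9 × 1.0 = 1.4, so the TD error is 1.4 and Q(s0, right) moves from 0 to about 0.14, that is, one tenth of the way toward the target.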

The idea is to increase the frequency of the so-called $\epsilon$-greedy policy improvement step, where with probability $\epsilon$ we select a random action instead of the greedy action that maximizes $Q(s,a)$. We do so in order to reach new states and thereby improve the agent's degree of exploration, which gives the agent opportunities to reduce both the variance and the bias of its estimates.
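The $\epsilon$-greedy selection described above can be sketched as follows. This is one possible implementation under stated assumptions: the helper name `epsilon_greedy`, the dictionary-based Q table, the default value of 0 for unseen pairs, and the random tie-breaking are illustrative choices rather than anything prescribed by the notes.

```python
import random


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """Return a random action with probability epsilon, else a greedy one.

    q maps (state, action) pairs to values; missing pairs are treated as 0.0.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)                # explore
    best = max(q.get((state, a), 0.0) for a in actions)
    # Break ties at random so no action is favoured by its list position.
    return rng.choice([a for a in actions if q.get((state, a), 0.0) == best])


# Hypothetical table for a single state with two actions.
q_example = {("s", "left"): 0.2, ("s", "right"): 0.7}
greedy_action = epsilon_greedy(q_example, "s", ["left", "right"], epsilon=0.0)
random_action = epsilon_greedy(q_example, "s", ["left", "right"], epsilon=1.0)
```

Setting `epsilon=0.0` recovers the purely greedy policy, while `epsilon=1.0` makes every choice uniformly random; values in between trade off exploitation against exploration.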

The SARSA algorithm is summarized below:

SARSA algorithm for on-policy control
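As a rough illustration of how these pieces fit together, the sketch below runs tabular SARSA on a hypothetical five-state corridor. The environment (`step`), the helper `choose`, and all hyperparameter values are assumptions made up for this example; the loop structure follows the standard on-policy pattern of choosing $A^\prime$ with the same $\epsilon$-greedy policy that is being improved.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (-1, +1)    # move left or move right


def step(state, action):
    """Hypothetical environment: reward 1 on reaching the right end, else 0."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose(q, state, epsilon, rng):
    """Epsilon-greedy action selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = choose(q, s, epsilon, rng)                 # initial S, A
        done = False
        while not done:
            s_next, r, done = step(s, a)               # observe R, S'
            a_next = choose(q, s_next, epsilon, rng)   # choose A' on-policy
            # A terminal state has value 0, so the bootstrap term is dropped.
            target = r + (0.0 if done else gamma * q[(s_next, a_next)])
            q[(s, a)] += alpha * (target - q[(s, a)])  # SARSA update
            s, a = s_next, a_next
    return q


q = sarsa()
# Greedy action in each non-terminal state under the learned table.
greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
```

In this toy corridor one would expect the learned greedy policy to favour moving right, since that is the only way to collect the reward. Note that the update uses the action actually selected for the next step rather than the maximizing one, which is what makes the method on-policy.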

Key references: (Rafati & Noelle, 2019; Szepesvári et al., 2010; Tu & Recht, 2018; O’Donoghue et al., 2016; Ma & Yu, 2016)

References

Ma, S., Yu, J. (2016). Transition-based versus State-based Reward Functions for MDPs with Value-at-Risk.

O’Donoghue, B., Munos, R., Kavukcuoglu, K., Mnih, V. (2016). Combining policy gradient and Q-learning.

Rafati, J., Noelle, D. (2019). Learning sparse representations in reinforcement learning.

Szepesvári, C., Cochran, J., Cox, L., Keskinocak, P., Kharoufeh, J., et al. (2010). Reinforcement Learning Algorithms for MDPs. Wiley Encyclopedia of Operations Research and Management Science.

Tu, S., Recht, B. (2018). The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint.
