Skip to main content
Q-learning (Watkins, 1989) replaces SARSA’s on-policy TD target with the greedy one: Q(S,A)    Q(S,A)+α[R+γmaxaQ(S,a)Q(S,A)]Q(S, A) \;\leftarrow\; Q(S, A) + \alpha \Bigl[ R + \gamma \max_{a'} Q(S', a') - Q(S, A) \Bigr] Compare to the SARSA update, Q(S,A)    Q(S,A)+α[R+γQ(S,A)Q(S,A)],Q(S, A) \;\leftarrow\; Q(S, A) + \alpha \bigl[ R + \gamma\, Q(S', A') - Q(S, A) \bigr], The single change, Q(S,A)Q(S', A') becomes maxaQ(S,a)\max_{a'} Q(S', a'), is the entire difference between on-policy and off-policy TD control:
  • Behavior policy (the policy that picks AtA_t during training): ϵ\epsilon-greedy in QQ. This is what visits states and explores.
  • Target policy (the policy whose value is being learned): the greedy policy argmaxaQ(s,a)\arg\max_a Q(s, a). Q-learning bootstraps from maxaQ(S,a)\max_{a'} Q(S', a'), i.e. from the target policy’s choice at SS', even though the behavior policy may pick a different AA' next.
SARSA’s TD target uses the action the behavior policy actually selected, so it tracks the value of the ϵ\epsilon-greedy policy itself, including the cost of the random exploratory steps. Q-learning’s TD target uses the greedy choice instead, so it tracks the value of the underlying optimal policy regardless of how exploration is conducted.

Algorithm

Initialize Q(s, a) for all s, a (Q(terminal, ·) = 0)
For each episode:
    Initialize S
    For each step:
        Choose A from S using policy π derived from Q (e.g. ε-greedy)
        Take action A, observe R, S'
        Q(S, A) ← Q(S, A) + α [ R + γ max_{a'} Q(S', a') − Q(S, A) ]
        S ← S'
    until S is terminal
Two structural consequences of the off-policy target:
  • The Q-table converges (in tabular settings, under the standard Robbins-Monro step-size conditions) to qq_*, the optimal action-value function, independent of the behavior policy used to gather data, as long as that behavior policy keeps every state-action pair visited infinitely often.
  • The greedy path read off from the converged QQ is the optimal path under the true dynamics, not the safer path that an on-policy method like SARSA would learn while still exploring.

Practical consequence

The two methods learn different things while training is ongoing because their targets differ. Sutton & Barto Example 6.6 (the cliff-walking gridworld used in the SARSA example) is the textbook demonstration:
  • Q-learning converges to the optimal greedy path that runs along the cliff edge, minimum number of steps to the goal.
  • SARSA converges to a safer path along the top of the grid, longer, but with smaller penalty when ϵ\epsilon-greedy exploration occasionally pushes the agent off the cliff.
The next page works through this side-by-side, training both on the same Gymnasium environment. Key references: Watkins (1989), Learning from delayed rewards; Sutton & Barto, Reinforcement Learning: An Introduction, Chapter 6.5.