Q-learning

Q-learning (Watkins, 1989) replaces SARSA’s on-policy TD target with the greedy one:

Q(S, A) \;\leftarrow\; Q(S, A) + \alpha \Bigl[ R + \gamma \max_{a'} Q(S', a') - Q(S, A) \Bigr]

Compare to the SARSA update,

Q(S, A) \;\leftarrow\; Q(S, A) + \alpha \bigl[ R + \gamma\, Q(S', A') - Q(S, A) \bigr],

The single change,

Q(S', A')

becomes

\max_{a'} Q(S', a')

, is the entire difference between on-policy and off-policy TD control:

Behavior policy (the policy that picks $A_t$ during training): $\epsilon$ -greedy in $Q$ . This is what visits states and explores.
Target policy (the policy whose value is being learned): the greedy policy $\arg\max_a Q(s, a)$ . Q-learning bootstraps from $\max_{a'} Q(S', a')$ , i.e. from the target policy’s choice at $S'$ , even though the behavior policy may pick a different $A'$ next.

SARSA’s TD target uses the action the behavior policy actually selected, so it tracks the value of the

\epsilon

-greedy policy itself, including the cost of the random exploratory steps. Q-learning’s TD target uses the greedy choice instead, so it tracks the value of the underlying optimal policy regardless of how exploration is conducted.

Algorithm

Initialize Q(s, a) for all s, a (Q(terminal, ·) = 0)
For each episode:
    Initialize S
    For each step:
        Choose A from S using policy π derived from Q (e.g. ε-greedy)
        Take action A, observe R, S'
        Q(S, A) ← Q(S, A) + α [ R + γ max_{a'} Q(S', a') − Q(S, A) ]
        S ← S'
    until S is terminal

Two structural consequences of the off-policy target:

The Q-table converges (in tabular settings, under the standard Robbins-Monro step-size conditions) to $q_*$ , the optimal action-value function, independent of the behavior policy used to gather data, as long as that behavior policy keeps every state-action pair visited infinitely often.
The greedy path read off from the converged $Q$ is the optimal path under the true dynamics, not the safer path that an on-policy method like SARSA would learn while still exploring.

Practical consequence

The two methods learn different things while training is ongoing because their targets differ. Sutton & Barto Example 6.6 (the cliff-walking gridworld used in the SARSA example) is the textbook demonstration:

Q-learning converges to the optimal greedy path that runs along the cliff edge, minimum number of steps to the goal.
SARSA converges to a safer path along the top of the grid, longer, but with smaller penalty when $\epsilon$ -greedy exploration occasionally pushes the agent off the cliff.

The next page works through this side-by-side, training both on the same Gymnasium environment. Key references: Watkins (1989), Learning from delayed rewards; Sutton & Barto, Reinforcement Learning: An Introduction, Chapter 6.5.

Edit this page on GitHub or file an issue.

Reinforcement Learning

Prediction

Value-based Control

Policy-based Control

Policy-Value-based Control

Model-Based

Algorithm

Practical consequence

​Algorithm

​Practical consequence

Algorithm

Practical consequence