- Behavior policy (the policy that picks actions during training): $\varepsilon$-greedy in $Q$. This is the policy that actually visits states and explores.
- Target policy (the policy whose value is being learned): the greedy policy $\pi(s) = \arg\max_a Q(s, a)$. Q-learning bootstraps from $\max_{a'} Q(s', a')$, i.e. from the target policy’s choice at $s'$, even though the behavior policy may pick a different $a'$ next.
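The split between the two policies can be made concrete with a small sketch. The function names, the dictionary-based Q-table keyed by `(state, action)`, and the toy numbers below are assumptions chosen for illustration, not part of the original text.

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Behavior policy: a random action with probability epsilon, else greedy in Q."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def q_learning_update(Q, s, a, reward, s_next, n_actions, alpha, gamma):
    """Target policy is greedy: bootstrap from the best next action,
    regardless of which action the behavior policy takes next."""
    best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

# Toy Q-table (hypothetical values): two states, two actions.
Q = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 5.0}
rng = random.Random(0)

# The update uses max(Q[(1, 0)], Q[(1, 1)]) = 5.0 as the bootstrap value.
new_value = q_learning_update(Q, s=0, a=0, reward=-1.0, s_next=1,
                              n_actions=2, alpha=0.5, gamma=1.0)

# The behavior policy may still explore at the next state; that choice
# does not enter the update that was just applied.
next_action = epsilon_greedy(Q, state=1, n_actions=2, epsilon=0.1, rng=rng)
```

Here the target is $-1 + 5 = 4$, so `Q[(0, 0)]` moves halfway from 0 to 4. Whatever `next_action` turns out to be only affects which pair is updated on the following step.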
Algorithm
- The Q-table converges (in tabular settings, under the standard Robbins-Monro step-size conditions) to $Q^*$, the optimal action-value function, independent of the behavior policy used to gather data, as long as that behavior policy keeps visiting every state-action pair infinitely often.
- The greedy path read off from the converged $Q$ is the optimal path under the true dynamics, not the safer path that an on-policy method like SARSA would learn while still exploring.
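For reference, the update and the step-size conditions mentioned above can be written out as follows. The notation ($\alpha_t$ for the step size at the $t$-th update of a given state-action pair, $\gamma$ for the discount factor, $r$ for the reward) is assumed here, since the original text does not fix symbols.

```latex
% Tabular Q-learning update for a transition (s, a, r, s'):
Q(s, a) \leftarrow Q(s, a) + \alpha_t \Bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Bigr]

% Robbins-Monro conditions on the step sizes used for each pair (s, a):
\sum_{t=1}^{\infty} \alpha_t = \infty,
\qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty
```

A schedule such as $\alpha_t = 1/t$ satisfies both conditions. A constant step size satisfies the first but not the second, which is why constant-$\alpha$ implementations are usually described as tracking $Q^*$ rather than converging to it exactly when rewards or transitions are noisy.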
Practical consequence
The two methods learn different things while training is ongoing because their targets differ. Sutton & Barto Example 6.6 (the cliff-walking gridworld used in the SARSA example) is the textbook demonstration:

- Q-learning converges to the optimal greedy path that runs along the cliff edge and reaches the goal in the minimum number of steps.
- SARSA converges to a safer path along the top of the grid: longer, but with a smaller expected penalty when $\varepsilon$-greedy exploration occasionally pushes the agent off the cliff.
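The comparison above can be reproduced with a rough sketch like the following. The 4x12 grid layout, the rewards (-1 per step, -100 for the cliff), and the hyperparameters (`alpha=0.5`, `eps=0.1`, 3000 episodes, fixed seed) are assumptions modeled on the usual description of the cliff-walking task; all function names are hypothetical.

```python
import random

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    r = min(max(state[0] + ACTIONS[action][0], 0), ROWS - 1)
    c = min(max(state[1] + ACTIONS[action][1], 0), COLS - 1)
    if r == 3 and 0 < c < 11:            # stepped into the cliff
        return START, -100, False        # big penalty, back to the start
    return (r, c), -1, (r, c) == GOAL

def eps_greedy(Q, s, eps, rng):
    """Behavior policy shared by both methods."""
    if rng.random() < eps:
        return rng.randrange(4)
    return max(range(4), key=lambda a: Q[s][a])

def train(method, episodes=3000, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * 4 for r in range(ROWS) for c in range(COLS)}
    for _ in range(episodes):
        s, done = START, False
        a = eps_greedy(Q, s, eps, rng)
        while not done:
            s2, reward, done = step(s, a)
            if method == "sarsa":
                # On-policy: bootstrap from the action actually taken next.
                a2 = eps_greedy(Q, s2, eps, rng)
                bootstrap = Q[s2][a2]
            else:
                # Off-policy: bootstrap from the greedy (target) action.
                bootstrap = max(Q[s2])
            target = reward + (0.0 if done else gamma * bootstrap)
            Q[s][a] += alpha * (target - Q[s][a])
            if method != "sarsa":
                # Q-learning picks its next behavior action after the update.
                a2 = eps_greedy(Q, s2, eps, rng)
            s, a = s2, a2
    return Q

def greedy_path_length(Q, max_steps=200):
    """Steps taken by the greedy policy from START (capped at max_steps)."""
    s, steps, done = START, 0, False
    while not done and steps < max_steps:
        a = max(range(4), key=lambda a: Q[s][a])
        s, _, done = step(s, a)
        steps += 1
    return steps

q_len = greedy_path_length(train("q_learning"))
sarsa_len = greedy_path_length(train("sarsa"))
print("Q-learning greedy path length:", q_len)
print("SARSA greedy path length:", sarsa_len)
```

In this layout the shortest route from start to goal takes 13 steps (one up, eleven right, one down), which is the cliff-edge path. The only difference between the two methods in `train` is the bootstrap term, which is exactly the difference in targets described above. SARSA's estimates keep fluctuating under a constant `eps`, so its greedy rollout can vary with the seed.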

