Skip to main content
Open In Colab Run SARSA and Q-learning on the same CliffWalking-v1 environment, with the same hyperparameters, and read off the difference in learned policies. This is Sutton & Barto Example 6.6.
  • Environment: CliffWalking-v1 (4×12 grid, reward 1-1 per step, 100-100 on the cliff with reset to start, deterministic, γ=1\gamma = 1).
  • Behavior policy (both methods): ϵ\epsilon-greedy with ϵ=0.1\epsilon = 0.1.
  • TD target: SARSA uses Q(S,A)Q(S', A') where AA' is the next action sampled from ϵ\epsilon-greedy; Q-learning uses maxaQ(S,a)\max_{a'} Q(S', a').
The expected outcome: Q-learning’s greedy policy walks along the cliff edge (shortest path), SARSA’s greedy policy walks along the top row (safer under ϵ\epsilon-greedy exploration).
from collections import defaultdict

import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from tqdm import tqdm


def epsilon_greedy(q_row, n_actions, epsilon):
    if np.random.rand() < epsilon:
        return int(np.random.randint(n_actions))
    return int(np.argmax(q_row))


def train(method, n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Train tabular SARSA or Q-learning on CliffWalking-v1.

    method: 'sarsa' or 'qlearning'.
    Returns (q_table, episode_returns).
    """
    env = gym.make("CliffWalking-v1")
    n_actions = env.action_space.n
    q = defaultdict(lambda: np.zeros(n_actions))
    rng = np.random.default_rng(seed)
    np.random.seed(seed)
    returns = np.empty(n_episodes)

    for ep in range(n_episodes):
        s, _ = env.reset(seed=int(rng.integers(0, 2**31 - 1)))
        a = epsilon_greedy(q[s], n_actions, epsilon)
        total_r = 0.0
        while True:
            s_next, r, terminated, truncated, _ = env.step(a)
            if method == "sarsa":
                a_next = epsilon_greedy(q[s_next], n_actions, epsilon)
                target = r + (0.0 if terminated else gamma * q[s_next][a_next])
            elif method == "qlearning":
                target = r + (0.0 if terminated else gamma * np.max(q[s_next]))
                a_next = epsilon_greedy(q[s_next], n_actions, epsilon)
            else:
                raise ValueError(method)
            q[s][a] += alpha * (target - q[s][a])
            total_r += r
            s, a = s_next, a_next
            if terminated or truncated:
                break
        returns[ep] = total_r
    env.close()
    return q, returns
# Average each method over multiple seeds to smooth the learning curves.
n_episodes = 500
n_seeds = 20

sarsa_returns = np.empty((n_seeds, n_episodes))
qlearn_returns = np.empty((n_seeds, n_episodes))
sarsa_q_last = qlearn_q_last = None

for seed in tqdm(range(n_seeds), desc="Seeds"):
    sarsa_q_last, sarsa_returns[seed] = train("sarsa", n_episodes=n_episodes, seed=seed)
    qlearn_q_last, qlearn_returns[seed] = train("qlearning", n_episodes=n_episodes, seed=seed)

print(f"Final 50-episode mean return, SARSA:    {sarsa_returns[:, -50:].mean():.1f}")
print(f"Final 50-episode mean return, Q-learning: {qlearn_returns[:, -50:].mean():.1f}")

Seeds:   0%|          | 0/20 [00:00<?, ?it/s]

Seeds:   5%|▌         | 1/20 [00:00<00:05,  3.46it/s]

Seeds:  10%|█         | 2/20 [00:00<00:05,  3.40it/s]

Seeds:  15%|█▌        | 3/20 [00:00<00:05,  3.30it/s]

Seeds:  20%|██        | 4/20 [00:01<00:04,  3.28it/s]

Seeds:  25%|██▌       | 5/20 [00:01<00:04,  3.31it/s]

Seeds:  30%|███       | 6/20 [00:01<00:04,  3.37it/s]

Seeds:  35%|███▌      | 7/20 [00:02<00:03,  3.30it/s]

Seeds:  40%|████      | 8/20 [00:02<00:03,  3.35it/s]

Seeds:  45%|████▌     | 9/20 [00:02<00:03,  3.44it/s]

Seeds:  50%|█████     | 10/20 [00:02<00:02,  3.43it/s]

Seeds:  55%|█████▌    | 11/20 [00:03<00:02,  3.37it/s]

Seeds:  60%|██████    | 12/20 [00:03<00:02,  3.44it/s]

Seeds:  65%|██████▌   | 13/20 [00:03<00:02,  3.40it/s]

Seeds:  70%|███████   | 14/20 [00:04<00:01,  3.48it/s]

Seeds:  75%|███████▌  | 15/20 [00:04<00:01,  3.56it/s]

Seeds:  80%|████████  | 16/20 [00:04<00:01,  3.59it/s]

Seeds:  85%|████████▌ | 17/20 [00:04<00:00,  3.58it/s]

Seeds:  90%|█████████ | 18/20 [00:05<00:00,  3.52it/s]

Seeds:  95%|█████████▌| 19/20 [00:05<00:00,  3.57it/s]

Seeds: 100%|██████████| 20/20 [00:05<00:00,  3.61it/s]

Seeds: 100%|██████████| 20/20 [00:05<00:00,  3.46it/s]
Final 50-episode mean return, SARSA:    -26.6
Final 50-episode mean return, Q-learning: -51.8
# Sutton & Barto Example 6.6, right panel: episode return averaged over seeds.
window = 20
sarsa_mean = sarsa_returns.mean(axis=0)
qlearn_mean = qlearn_returns.mean(axis=0)

def smooth(x, w):
    return np.convolve(x, np.ones(w) / w, mode="valid")

plt.figure(figsize=(9, 4))
plt.plot(np.arange(window - 1, n_episodes), smooth(sarsa_mean, window), label="SARSA")
plt.plot(np.arange(window - 1, n_episodes), smooth(qlearn_mean, window), label="Q-learning")
plt.xlabel("Episode")
plt.ylabel(f"Return (mean over {n_seeds} seeds, smoothed window={window})")
plt.title("SARSA vs Q-learning on CliffWalking-v1")
plt.ylim(-100, 0)
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig(f"{images_dir}/sarsa_vs_q_returns.png", dpi=150, bbox_inches="tight")
plt.show()
Output from cell 4
# Greedy-policy plots: trace the deterministic argmax-Q path from start to goal.
N_ROWS, N_COLS = 4, 12
START = (3, 0)
GOAL = (3, 11)
CLIFF_COLS = list(range(1, 11))
ARROWS = ["\u2191", "\u2192", "\u2193", "\u2190"]
ACTION_DELTAS = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def greedy_path(q_table, max_steps=200):
    r, c = START
    path = [(r, c)]
    for _ in range(max_steps):
        if (r, c) == GOAL:
            break
        if r == 3 and c in CLIFF_COLS:
            break  # fell off the cliff during greedy rollout
        s = r * N_COLS + c
        a = int(np.argmax(q_table[s]))
        dr, dc = ACTION_DELTAS[a]
        r = max(0, min(N_ROWS - 1, r + dr))
        c = max(0, min(N_COLS - 1, c + dc))
        path.append((r, c))
    return path


def plot_grid(ax, q_table, title):
    for r in range(N_ROWS):
        for c in range(N_COLS):
            ax.add_patch(plt.Rectangle((c, r), 1, 1, fill=False, edgecolor="#cccccc", lw=1))
    for c in CLIFF_COLS:
        ax.add_patch(plt.Rectangle((c, 3), 1, 1, color="#ffd6d6"))
    ax.add_patch(plt.Rectangle((START[1], START[0]), 1, 1, fill=False, edgecolor="green", lw=3))
    ax.add_patch(plt.Rectangle((GOAL[1], GOAL[0]), 1, 1, fill=False, edgecolor="goldenrod", lw=3))
    for r in range(N_ROWS):
        for c in range(N_COLS):
            if (r, c) == START or (r, c) == GOAL or (r == 3 and c in CLIFF_COLS):
                continue
            s = r * N_COLS + c
            if s in q_table:
                a = int(np.argmax(q_table[s]))
                ax.text(c + 0.5, r + 0.5, ARROWS[a], ha="center", va="center", fontsize=14)
    path = greedy_path(q_table)
    xs = [c + 0.5 for r, c in path]
    ys = [r + 0.5 for r, c in path]
    ax.plot(xs, ys, color="#1f77b4", lw=2.5, alpha=0.85)
    ax.text(START[1] + 0.5, START[0] + 0.5, "S", ha="center", va="center",
            color="green", fontsize=14, fontweight="bold")
    ax.text(GOAL[1] + 0.5, GOAL[0] + 0.5, "G", ha="center", va="center",
            color="goldenrod", fontsize=14, fontweight="bold")
    ax.set_xlim(0, N_COLS); ax.set_ylim(N_ROWS, 0)
    ax.set_aspect("equal")
    ax.set_xticks([]); ax.set_yticks([])
    ax.set_title(title)


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
plot_grid(ax1, sarsa_q_last, "SARSA, greedy policy")
plot_grid(ax2, qlearn_q_last, "Q-learning, greedy policy")
plt.savefig(f"{images_dir}/sarsa_vs_q_paths.png", dpi=150, bbox_inches="tight")
plt.show()
Output from cell 5 Two readings of the figures. Returns curve. Q-learning’s training-time return is worse (lower) than SARSA’s, even though Q-learning learns the optimal greedy policy. The reason: training-time return is collected under the ϵ\epsilon-greedy behavior policy. Q-learning’s greedy target is the cliff-edge path, so its ϵ\epsilon-greedy rollouts occasionally take a random step into the cliff and absorb the 100-100 penalty. SARSA’s TD target accounts for the exploration cost, it learns a Q that places lower value on cells adjacent to the cliff, so its ϵ\epsilon-greedy rollouts walk the safer top-row path and rarely fall off. Greedy paths. Read off the deterministic argmaxaQ(s,a)\arg\max_a Q(s, a) from each Q-table:
  • Q-learning: shortest path from start to goal, hugging the cliff edge.
  • SARSA: longer path along the top row, away from the cliff.
The two methods are minimizing different objectives. Q-learning’s target policy is the optimal one (for the actual MDP), exposed only at evaluation time when ϵ=0\epsilon = 0. SARSA’s target policy is the ϵ\epsilon-greedy policy itself, so it embeds the exploration cost into the value function. If you set ϵ0\epsilon \to 0 over training, both methods converge to the same greedy policy. The interesting regime is the one shown here: a fixed exploration rate during training, and the difference in behavior under that exploration that the two TD targets capture. Key references: (O’Donoghue et al., 2016; Ma & Yu, 2016; Li, 2017; Mansour & Singh, 2013; Lillicrap et al., 2015)

References

  • Li, Y. (2017). Deep Reinforcement Learning: An Overview.
  • Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., et al. (2015). Continuous control with deep reinforcement learning.
  • Ma, S., Yu, J. (2016). Transition-based versus State-based Reward Functions for MDPs with Value-at-Risk.
  • Mansour, Y., Singh, S. (2013). On the Complexity of Policy Iteration.
  • O’Donoghue, B., Munos, R., Kavukcuoglu, K., Mnih, V. (2016). Combining policy gradient and Q-learning.