Skip to main content
Open In Colab It is instructive to see the difference between MC and TD approaches in the following example. Random walk MRP from Sutton and Barto, Example 6.2.
# Parameters (injected by papermill)
output_dir = "."
images_dir = "./images"
videos_dir = "./videos"
audio_dir = "./audio"
text_dir = "./text"

import os

os.makedirs(images_dir, exist_ok=True)
import numpy as np
import matplotlib.pyplot as plt

states = ["A", "B", "C", "D", "E"]
state_to_index = {s: i for i, s in enumerate(states)}
n_states = len(states)


def generate_episode(start="C"):
    state = state_to_index[start]
    episode = []
    while True:
        action = np.random.choice([-1, 1])  # left or right
        next_state = state + action
        if next_state < 0:
            return episode + [(state, 0)]  # Left terminal, reward 0
        elif next_state >= n_states:
            return episode + [(state, 1)]  # Right terminal, reward 1
        else:
            episode.append((state, 0))
            state = next_state


def mc_prediction(episodes=1000, alpha=0.1):
    V = np.full(n_states, 0.5)
    # V = np.zeros(n_states)
    for _ in range(episodes):
        episode = generate_episode()
        G = episode[-1][1]  # Only final reward matters
        for s, _ in episode:
            V[s] += alpha * (G - V[s])
    return V


def td0_prediction(episodes=10000, alpha=0.01):
    # Episode tuples follow the (S_t, R_{t+1}) convention:
    # episode[t][1] is the reward for the OUTGOING transition from episode[t][0].
    V = np.full(n_states, 0.5)
    for _ in range(episodes):
        episode = generate_episode()
        for t in range(len(episode)):
            s, r = episode[t]
            v_next = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
            V[s] += alpha * (r + v_next - V[s])
    return V


# Run both methods (Sutton & Barto Example 6.2 uses constant-alpha MC)
alpha_mc = 0.1
alpha_td = 0.01
V_mc = mc_prediction(alpha=alpha_mc)
V_td = td0_prediction(alpha=alpha_td)

# Ground truth (analytical solution): [1/6, 2/6, 3/6, 4/6, 5/6]
true_V = np.array([1, 2, 3, 4, 5]) / 6

# Plot results
plt.plot(true_V, label="True V", linestyle="--", marker="o")
plt.plot(V_mc, label=f"MC (α={alpha_mc})", marker="x")
plt.plot(V_td, label=f"TD(0) (α={alpha_td})", marker="s")
plt.xticks(ticks=range(n_states), labels=states)
plt.ylabel("Estimated Value")
plt.title("Monte Carlo vs TD(0) on Random Walk")
plt.legend()
plt.grid(True)
plt.savefig(
    f"{images_dir}/monte_carlo_vs_td0_on_random_walk.png", dpi=150, bbox_inches="tight"
)
plt.show()
Output from cell 2

RMS error vs. episodes (Sutton & Barto Example 6.2, right panel)

The plot above shows the final value estimates after a fixed number of episodes. Sutton & Barto’s right panel instead shows how the RMS error of the value estimates from the true values, averaged over the five non-terminal states, evolves across episodes, for several values of α\alpha, comparing MC against TD(0). RMS(V)  =  1SsS(V(s)vπ(s))2\text{RMS}(V) \;=\; \sqrt{\frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \bigl(V(s) - v_\pi(s)\bigr)^2} Curves are averaged over 100 independent runs to smooth out per-walk variance.
def mc_rms_history(episodes, alpha, true_V):
    V = np.full(n_states, 0.5)
    history = np.empty(episodes)
    for k in range(episodes):
        episode = generate_episode()
        G = episode[-1][1]
        for s, _ in episode:
            V[s] += alpha * (G - V[s])
        history[k] = np.sqrt(np.mean((V - true_V) ** 2))
    return history


def td0_rms_history(episodes, alpha, true_V):
    V = np.full(n_states, 0.5)
    history = np.empty(episodes)
    for k in range(episodes):
        episode = generate_episode()
        for t in range(len(episode)):
            s, r = episode[t]
            v_next = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
            V[s] += alpha * (r + v_next - V[s])
        history[k] = np.sqrt(np.mean((V - true_V) ** 2))
    return history


def averaged_rms(method_fn, episodes, alpha, runs, true_V):
    return np.mean([method_fn(episodes, alpha, true_V) for _ in range(runs)], axis=0)


n_episodes = 100
n_runs = 100
mc_alphas = [0.01, 0.02, 0.03, 0.04]
td_alphas = [0.05, 0.10, 0.15]

plt.figure(figsize=(8, 5))
for a in mc_alphas:
    rms = averaged_rms(mc_rms_history, n_episodes, a, n_runs, true_V)
    plt.plot(rms, linestyle="--", label=f"MC \u03b1={a}")
for a in td_alphas:
    rms = averaged_rms(td0_rms_history, n_episodes, a, n_runs, true_V)
    plt.plot(rms, label=f"TD(0) \u03b1={a}")
plt.xlabel("Episodes (walks)")
plt.ylabel(f"RMS error (averaged over {n_states} states, {n_runs} runs)")
plt.title("Sutton & Barto Example 6.2 \u2014 RMS error vs. episodes")
plt.legend()
plt.grid(True)
plt.savefig(f"{images_dir}/rms_error_vs_episodes.png", dpi=150, bbox_inches="tight")
plt.show()
Output from cell 3

The MC vs TD(0) trade-off

The two curves above are not just two implementations of the same idea — they sit at opposite ends of a real bias / variance / structure trade-off, and that trade-off explains the gap visible in the RMS plot. Bias. The MC target is the actual return GtG_t, an unbiased sample of vπ(St)v_\pi(S_t). The TD(0) target is Rt+1+γV(St+1)R_{t+1} + \gamma V(S_{t+1}), which is biased whenever V(St+1)V(S_{t+1}) is wrong — and at episode 0 it is wrong everywhere because VV is initialised to 0.5. Early in training MC has the more honest target. Variance. GtG_t depends on the entire random trajectory from tt to termination, so its variance grows with horizon. The TD(0) target depends on only one stochastic transition (Rt+1,St+1)(R_{t+1}, S_{t+1}), so its variance is much lower. On the random walk, where episodes can take dozens of steps to terminate, MC’s per-update target swings wildly between 0 and 1 while TD(0)‘s target swings only by the local change in VV. Lower variance is what lets TD(0) take larger learning rates without going unstable, and what produces the smoother curves in the figure above. Use of the Markov property. TD(0) treats V(St+1)V(S_{t+1}) as a sufficient statistic for everything that happens after St+1S_{t+1} — it bootstraps. That is exactly the Markov assumption. MC ignores the Markov structure entirely and just averages observed returns. When the environment really is Markov (as the random walk is), TD(0) gets to reuse all the information it has already accumulated about successor states; MC has to rediscover it from scratch each episode. This is the deepest reason TD(0) wins on this MRP. Asymptotic fixed points. Run on a finite batch of episodes until the parameters stop moving:
  • MC converges to the value function that minimises mean-squared error on the observed returns.
  • TD(0) converges to the value function of the maximum-likelihood MDP fitted to the data — the certainty-equivalence solution.
These are different fixed points. On a Markov environment with limited data, the TD(0) fixed point is usually the better predictor of new returns; on a non-Markov environment it can be worse. When to apply each.
  • Episodic vs continuing. MC needs GtG_t, which requires the episode to terminate. TD(0) updates after every step and works on continuing tasks.
  • Long horizons. MC’s variance scales with horizon; TD(0) does not. The longer the episode, the bigger the gap.
  • Sample efficiency. TD(0) starts learning from the first transition of the first episode; MC waits until the episode ends. On the random walk this is why TD(0) drops below MC within the first few walks.
  • Function approximation. With non-linear function approximation (deep RL), TD’s bootstrapping interacts with the off-policy / function-approximation parts of the deadly triad and can diverge; MC remains a safe (if higher-variance) baseline.
nn-step TD and TD(λ\lambda) interpolate between TD(0) (n=1n=1) and MC (nn \to \infty), letting you tune the bias / variance balance per task. Key references: (Mansour & Singh, 2013; Ma & Yu, 2016; Mandt et al., 2017; Schulman et al., 2017; Bayesian Optimization for Rein, 2016)

References

  • Ma, S., Yu, J. (2016). Transition-based versus State-based Reward Functions for MDPs with Value-at-Risk.
  • Mandt, S., Hoffman, M., Blei, D. (2017). Stochastic Gradient Descent as approximate Bayesian inference.
  • Mansour, Y., Singh, S. (2013). On the Complexity of Policy Iteration.
  • (2016). Bayesian Optimization for Reinforcement Learning.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal Policy Optimization Algorithms.