This notebook explains PPO using the simplest possible environments:
  1. Single-state, two-action system
  2. Two-state trajectory system
The goal is conceptual clarity: understand why PPO exists and what clipping does.

1. From REINFORCE to PPO

REINFORCE update:

$\nabla J(\theta) = \mathbb{E}\left[R(\tau)\,\nabla \log \pi_\theta(\tau)\right]$

Problem:
  • Updates can be too large
  • Policy may collapse or diverge
PPO introduces a trust region approximation using probability ratios.
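To see the problem concretely, here is a minimal REINFORCE sketch (not one of this notebook's cells) on the same single-state, two-action setup used below: $\theta = \pi(a_1)$, reward 1 for $a_1$ and 0 for $a_2$. Nothing bounds how far $\theta$ moves in a single update; with a small $\theta$ the score $1/\theta$ is large, and one sample can slam the parameter against its bound.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.4   # P(a1); reward is 1 for a1, 0 for a2
lr = 0.5      # deliberately large to expose the unconstrained step

for _ in range(20):
    took_a1 = rng.random() < theta
    reward = 1.0 if took_a1 else 0.0
    # Score function d log pi(a) / d theta for the sampled action.
    score = 1.0 / theta if took_a1 else -1.0 / (1.0 - theta)
    theta += lr * reward * score          # REINFORCE: no limit on step size
    theta = float(np.clip(theta, 1e-3, 1 - 1e-3))
```

For example, from $\theta = 0.4$ a single sampled $a_1$ moves $\theta$ by $0.5 \cdot 1/0.4 = 1.25$, far outside $[0, 1]$, before the clip to the simplex rescues it.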

2. PPO Objective

Define the probability ratio:

$r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}$

PPO objective:

$L^{CLIP} = \min\big(r_t A_t,\ \text{clip}(r_t,\, 1-\epsilon,\, 1+\epsilon)\, A_t\big)$

Interpretation:
  • If policy changes too much → clip it
  • Prevents overly aggressive updates
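The objective above can be written as a small helper (a sketch, separate from the notebook's later cells). For a positive advantage, pushing the ratio beyond $1+\epsilon$ no longer increases the objective, so the incentive to move further vanishes; for a negative advantage the min keeps the full penalty rather than the clipped one, so the objective stays pessimistic.

```python
import numpy as np

def ppo_surrogate(ratio, adv, eps=0.2):
    """Per-sample PPO clipped surrogate L^CLIP."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

ratios = np.array([0.5, 0.9, 1.0, 1.1, 1.5])
print(ppo_surrogate(ratios, adv=1.0))
# With adv = 1 the surrogate tracks the ratio, then flattens at 1 + eps = 1.2.
```

Note the asymmetry: `ppo_surrogate(1.5, adv=-1.0)` returns the unclipped $-1.5$, because `min` always takes the worse (more pessimistic) of the two branches.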
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)

3. Single-State Example

Policy: $\pi(a_1) = \theta$, $\pi(a_2) = 1 - \theta$

Reward:
  • $a_1$ → 1
  • $a_2$ → 0
We compare PPO dynamics with REINFORCE intuition.
theta = 0.4
lr = 0.05
eps = 0.2
outer_iters = 50          # number of (collect rollout, update) cycles
inner_epochs = 4          # number of PPO mini-epochs per rollout

theta_hist = []
ratio_hist = []
clipped_hist = []
obj_hist = []                 # sampled PPO clipped surrogate
unclipped_hist = []           # sampled r_t · R (unclipped surrogate)
reward_hist = []              # sampled reward
expected_reward_hist = []     # J(θ_old) = E[R | π_old] = θ_old
expected_ratio_hist = []      # E_a[π_new/π_old] = 1 (standard identity)

for i in range(outer_iters):
    # PPO structure: freeze π_old, collect a rollout under it, then take
    # several mini-epochs of surrogate-objective gradient ascent.
    old_theta = theta

    # Collect a single rollout under π_old.
    if np.random.rand() < old_theta:
        action = 1
        reward = 1
    else:
        action = 2
        reward = 0

    for _ in range(inner_epochs):
        # Importance ratio r_t(θ) = π_θ(a) / π_{θ_old}(a)
        if action == 1:
            ratio = theta / old_theta
            grad_ratio = 1.0 / old_theta          # d ratio / d θ
        else:
            ratio = (1 - theta) / (1 - old_theta)
            grad_ratio = -1.0 / (1 - old_theta)

        clipped = np.clip(ratio, 1 - eps, 1 + eps)
        obj = min(ratio * reward, clipped * reward)

        # Gradient of the clipped surrogate. With reward >= 0 the min picks
        # the unclipped branch whenever ratio <= its clipped value, so the
        # gradient vanishes only once the upper bound binds (ratio > 1 + eps).
        # Below 1 - eps the unclipped branch is still the min, so gradient
        # keeps flowing. This is exactly the "don't move too far" property.
        if ratio < 1 + eps:
            grad = grad_ratio * reward
        else:
            grad = 0.0

        theta += lr * grad
        theta = np.clip(theta, 1e-3, 1 - 1e-3)

        theta_hist.append(theta)
        ratio_hist.append(ratio)
        clipped_hist.append(clipped)
        obj_hist.append(obj)
        unclipped_hist.append(ratio * reward)
        reward_hist.append(reward)
        expected_reward_hist.append(old_theta)
        expected_ratio_hist.append(1.0)

# 1. theta trajectory
plt.figure(figsize=(8, 4))
plt.plot(theta_hist)
plt.title(r"Single-State PPO: parameter $\theta$ across mini-epochs")
plt.xlabel("Mini-epoch step")
plt.ylabel(r"$\theta = \pi(a_1)$")
plt.show()

# 2. Sampled importance ratio vs expected (=1) with clip bounds overlaid
plt.figure(figsize=(8, 4))
plt.plot(ratio_hist, color="#cbd5e1", linewidth=1, label=r"sampled $r_t(\theta)$")
plt.plot(expected_ratio_hist, color="#2563eb", linewidth=2, linestyle="--",
         label=r"expected $\mathbb{E}[r_t] = 1$")
plt.axhline(1 + eps, color="#dc2626", linewidth=1, linestyle=":",
            label=fr"clip bounds $1 \pm \epsilon = ({1-eps:.1f},\,{1+eps:.1f})$")
plt.axhline(1 - eps, color="#dc2626", linewidth=1, linestyle=":")
plt.title("Single-State PPO: sampled importance ratio vs clip window")
plt.xlabel("Mini-epoch step")
plt.ylabel(r"$r_t(\theta)$")
plt.legend(loc="upper right", fontsize=9)
plt.show()

# 3. Clipped vs unclipped surrogate
plt.figure(figsize=(8, 4))
plt.plot(unclipped_hist, color="#f59e0b", linewidth=1, linestyle=":",
         label=r"unclipped $r_t \cdot R$")
plt.plot(obj_hist, color="#2563eb", linewidth=2, label=r"clipped $L^{CLIP}$")
plt.title("Single-State PPO: clipped vs unclipped surrogate objective")
plt.xlabel("Mini-epoch step")
plt.ylabel("objective")
plt.legend(loc="lower right", fontsize=9)
plt.show()

# 4. Sampled reward vs expected objective J(θ)
plt.figure(figsize=(8, 4))
plt.plot(reward_hist, color="#cbd5e1", linewidth=1, label="sampled $R$")
plt.plot(expected_reward_hist, color="#2563eb", linewidth=2, linestyle="--",
         label=r"expected $J(\theta_{old})=\theta_{old}$")
plt.title(r"Single-State PPO: sampled reward vs expected objective $J(\theta)$")
plt.xlabel("Mini-epoch step")
plt.ylabel("Reward")
plt.legend(loc="lower right", fontsize=9)
plt.show()

Interpretation

  • The ratio measures how far the current policy has drifted from the one that collected the rollout
  • Clipping caps the update once the ratio leaves the $[1-\epsilon,\, 1+\epsilon]$ window
  • Learning is smoother than with the raw policy gradient
Key idea:
PPO limits how far the new policy can move from the old one.
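As a concrete check (a sketch reusing the single-state setup above, with the sampled action fixed to $a_1$ and reward 1): the min-based surrogate stops providing gradient once the ratio $\theta/\theta_{old}$ exceeds $1+\epsilon$, so no matter how many inner epochs run on one rollout, $\theta$ can grow by at most a factor of $1+\epsilon$ plus a single overshooting step.

```python
import numpy as np

old_theta, theta = 0.4, 0.4
lr, eps = 0.05, 0.2
grad_ratio = 1.0 / old_theta      # d(theta / old_theta) / d theta for action a1

for _ in range(100):              # many inner epochs on a single rollout
    ratio = theta / old_theta
    # Positive reward: gradient flows only while the upper clip is not binding.
    grad = grad_ratio if ratio < 1 + eps else 0.0
    theta += lr * grad

print(theta, (1 + eps) * old_theta)
# theta stalls just past (1 + eps) * old_theta = 0.48, bounded by one extra step.
```

Here $\theta$ takes one step of $0.05 \cdot 2.5 = 0.125$ to $0.525$, the ratio jumps past $1.2$, and every remaining epoch is a no-op: the clip, not the epoch count, decides how far the policy moves.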

4. Two-State Example (Credit Assignment + PPO)

Now we introduce a two-step trajectory:
  • an action in state 1
  • an action in state 2
  • reward depends only on the state-2 action
The same trajectory reward is applied at both steps, so the state-1 action is credited for an outcome it did not influence. Its expected gradient is still zero, but individual samples are noisy.
theta1 = 0.5
theta2 = 0.5
lr = 0.05
eps = 0.2
outer_iters = 50
inner_epochs = 4

theta1_hist = []
theta2_hist = []
ratio1_hist = []
ratio2_hist = []
g1_hist = []                      # sampled clipped-surrogate gradient on θ1
g2_hist = []                      # sampled clipped-surrogate gradient on θ2
reward_hist = []
expected_J_hist = []              # J(θ1, θ2) = θ2 (reward only at s2)
expected_g1_hist = []             # E[∂L/∂θ1] = 0 (score-function trick — same as REINFORCE)
expected_g2_hist = []             # E[∂L/∂θ2] = 1 when clip is inactive (mirrors REINFORCE)

for i in range(outer_iters):
    old_theta1 = theta1
    old_theta2 = theta2

    # Rollout under (π_old_1, π_old_2)
    a1 = 1 if np.random.rand() < old_theta1 else 2
    a2 = 1 if np.random.rand() < old_theta2 else 2
    reward = 1 if a2 == 1 else 0

    for _ in range(inner_epochs):
        # Importance ratios for each decision
        if a1 == 1:
            r1 = theta1 / old_theta1
            gr1 = 1.0 / old_theta1
        else:
            r1 = (1 - theta1) / (1 - old_theta1)
            gr1 = -1.0 / (1 - old_theta1)

        if a2 == 1:
            r2 = theta2 / old_theta2
            gr2 = 1.0 / old_theta2
        else:
            r2 = (1 - theta2) / (1 - old_theta2)
            gr2 = -1.0 / (1 - old_theta2)

        # Clipped-surrogate gradients (same rule as single-state: with
        # reward >= 0 the gradient is zeroed only when the upper clip binds)
        g1 = gr1 * reward if r1 < 1 + eps else 0.0
        g2 = gr2 * reward if r2 < 1 + eps else 0.0

        theta1 += lr * g1
        theta2 += lr * g2
        theta1 = np.clip(theta1, 1e-3, 1 - 1e-3)
        theta2 = np.clip(theta2, 1e-3, 1 - 1e-3)

        theta1_hist.append(theta1)
        theta2_hist.append(theta2)
        ratio1_hist.append(r1)
        ratio2_hist.append(r2)
        g1_hist.append(g1)
        g2_hist.append(g2)
        reward_hist.append(reward)
        expected_J_hist.append(old_theta2)
        expected_g1_hist.append(0.0)
        expected_g2_hist.append(1.0)

# 1. Policy parameters
plt.figure(figsize=(8, 4))
plt.plot(theta1_hist, label=r"$\theta_1$ (state 1)")
plt.plot(theta2_hist, label=r"$\theta_2$ (state 2)")
plt.title("Two-State PPO: policy parameters across mini-epochs")
plt.xlabel("Mini-epoch step")
plt.ylabel("Parameter value")
plt.legend()
plt.show()

# 2. Sampled ratios vs clip window (one ratio per state)
plt.figure(figsize=(8, 4))
plt.plot(ratio1_hist, color="#cbd5e1", linewidth=1, label=r"sampled $r_1$")
plt.plot(ratio2_hist, color="#fcd34d", linewidth=1, label=r"sampled $r_2$")
plt.axhline(1, color="#2563eb", linewidth=2, linestyle="--",
            label=r"expected $\mathbb{E}[r_i] = 1$")
plt.axhline(1 + eps, color="#dc2626", linewidth=1, linestyle=":",
            label=fr"clip $1 \pm \epsilon$")
plt.axhline(1 - eps, color="#dc2626", linewidth=1, linestyle=":")
plt.title("Two-State PPO: sampled importance ratios vs clip window")
plt.xlabel("Mini-epoch step")
plt.ylabel(r"$r_i(\theta)$")
plt.legend(loc="upper right", fontsize=9)
plt.show()

# 3. Sampled vs expected gradients (key pedagogical plot)
plt.figure(figsize=(8, 4))
plt.plot(g1_hist, color="#cbd5e1", linewidth=1, label="sampled $g_1$")
plt.plot(expected_g1_hist, color="#2563eb", linewidth=2, linestyle="--",
         label=r"expected $\mathbb{E}[g_1] = 0$")
plt.plot(g2_hist, color="#fcd34d", linewidth=1, label="sampled $g_2$")
plt.plot(expected_g2_hist, color="#dc2626", linewidth=2, linestyle="--",
         label=r"expected $\mathbb{E}[g_2] = 1$")
plt.title("Two-State PPO: sampled vs expected clipped-surrogate gradients")
plt.xlabel("Mini-epoch step")
plt.ylabel("gradient value")
plt.legend(loc="upper right", fontsize=9)
plt.show()

# 4. Sampled reward vs expected objective J(θ1, θ2) = θ2
plt.figure(figsize=(8, 4))
plt.plot(reward_hist, color="#cbd5e1", linewidth=1, label="sampled $R$")
plt.plot(expected_J_hist, color="#2563eb", linewidth=2, linestyle="--",
         label=r"expected $J(\theta_1,\theta_2) = \theta_2$")
plt.title("Two-State PPO: sampled reward vs expected objective $J$")
plt.xlabel("Mini-epoch step")
plt.ylabel("Reward")
plt.legend(loc="lower right", fontsize=9)
plt.show()

5. Final Interpretation

PPO modifies the policy gradient in three ways:
  1. It optimizes a surrogate built on probability ratios instead of raw log-prob gradients
  2. It clips the ratio to rule out overly large updates
  3. It reuses each rollout for several mini-epochs, which the clipping keeps stable even for high-dimensional policies (like LLMs)
Conceptually:
  • REINFORCE = push toward good trajectories
  • PPO = push, but not too far at once
This makes PPO suitable for LLM training where distributions are highly sensitive.
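In LLM-scale implementations the ratio is usually computed from per-token log-probabilities rather than raw probabilities, for numerical stability. A minimal sketch (the function name and inputs here are illustrative, not from any particular library):

```python
import numpy as np

def ppo_loss_from_logprobs(logp_new, logp_old, adv, eps=0.2):
    """Clipped PPO surrogate computed from per-token log-probs.

    ratio = exp(logp_new - logp_old) avoids dividing tiny probabilities.
    Returns the negative mean surrogate, i.e. a loss to minimize.
    """
    ratio = np.exp(logp_new - logp_old)
    surr = np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
    return -np.mean(surr)

# Three tokens: one improved, one unchanged, one made less likely.
logp_old = np.log(np.array([0.2, 0.5, 0.8]))
logp_new = np.log(np.array([0.3, 0.5, 0.6]))
adv = np.array([1.0, 1.0, -1.0])
print(ppo_loss_from_logprobs(logp_new, logp_old, adv))
```

The first token's ratio (1.5) is clipped to 1.2, the second contributes its unclipped value, and the third keeps its full negative surrogate, mirroring the scalar examples above.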