Skip to main content
Given that RL can be posed as an MDP, in this section we continue with a policy-based algorithm that learns the policy directly by optimizing the objective function. The algorithm we treat here, called REINFORCE, is important as it’s the basis of many other algorithms, including modern variants like PPO and GRPO that we use to fine-tune VLMs (see the browser-control GRPO tutorial). It took its name from the fact that during training actions that resulted in good outcomes should become more probable, these actions are positively reinforced. Conversely, actions which resulted in bad outcomes should become less probable. If learning is successful, over the course of many iterations, action probabilities produced by the policy, shift to a distribution that results in good performance in the environment we interact with. Action probabilities are changed by following the policy gradient, therefore REINFORCE is known as a policy gradient algorithm. The algorithm needs three components:
ComponentDescription
Parametrized policy πθ(as)\pi_\theta (a \mid s)Neural networks are powerful and flexible function approximators, so we can represent a policy using a neural network with learnable parameters θ\theta - this is the policy network πθ\pi_\theta. Each specific set of parameters of the policy network represents a particular policy - this means that for θ1θ2\theta_1 \neq \theta_2 and a state ss, πθ1(as)πθ2(as)\pi_{\theta_1}(a \mid s) \neq \pi_{\theta_2}(a \mid s) and a single policy network architecture is therefore capable of representing many different policies.
The objective to be maximized J(πθ)J(\pi_\theta)1The expected discounted return, just like in MDP.
Policy GradientA method for updating the policy parameters θ\theta. The policy gradient algorithm searches for a local maximum in J(πθ)J(\pi_\theta): maxθJ(πθ)\max_\theta J(\pi_\theta). This is the common gradient ascent algorithm that adjusts the parameters according to θθ+αθJ(πθ)\theta ← \theta + \alpha \nabla_\theta J(\pi_\theta) where α\alpha is the learning rate. Note that we can pose the objective as a loss that we try to minimize by negating it.
Out of the three components, the most complicated one is the policy gradient that can be shown to be given by the differentiable quantity: θJ(πθ)=Eπθ[θlogπθ(as)Qπθ(s,a)] \nabla_\theta J(\pi_\theta)= \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (a|s)\, Q^{\pi_\theta}(s, a) \right ] where Qπθ(s,a)Q^{\pi_\theta}(s, a) is the action-value under the current policy. In REINFORCE we estimate Qπθ(s,a)Q^{\pi_\theta}(s, a) by the Monte Carlo return GtG_t (see the algorithm below); using a learned vπ(s)v_\pi(s) instead would make this an actor-critic method, which we treat in a later section. To see where this expression comes from we derive it here using the log-derivative trick; readers who want a textbook-style treatment can cross-check with chapter 2 of Foundations of Deep Reinforcement Learning.

The log-derivative trick

We begin with a function f(x)f(x) and a parametrized distribution p(xθ)p(x\mid\theta). We want to compute the gradient of the expectation Exp(xθ)[f(x)]\mathbb{E}_{x\sim p(x\mid\theta)}[f(x)] and rewrite it in a form suitable for Monte Carlo estimation.

Step-by-step derivation

θExp(xθ)[f(x)]  =  θf(x)p(xθ)dx\nabla_\theta \mathbb{E}_{x\sim p(x\mid\theta)}[f(x)] \;=\; \nabla_\theta \int f(x)\,p(x\mid\theta)\,dxExplanation: This uses the definition of an expectation under a density: E[f(x)]=f(x)p(x)dx.\mathbb{E}[f(x)] = \int f(x)p(x)\,dx.
=θ ⁣(f(x)p(xθ))dx= \int \nabla_\theta\!\left( f(x)\,p(x\mid\theta) \right)\,dxExplanation: Under mild regularity conditions, differentiation and integration commute (Leibniz rule), allowing us to move θ\nabla_\theta inside the integral.
=(f(x)θp(xθ)  +  p(xθ)θf(x))dx= \int \big( f(x)\,\nabla_\theta p(x\mid\theta) \;+\; p(x\mid\theta)\,\nabla_\theta f(x) \big)\,dxExplanation: This applies the product rule to θ(fp)\nabla_\theta(f p).
=f(x)θp(xθ)dx= \int f(x)\,\nabla_\theta p(x\mid\theta)\,dxExplanation: We assume f(x)f(x) does not depend on θ\theta, so θf(x)=0\nabla_\theta f(x)=0 and the second term vanishes. This is the standard situation in policy-gradient and score-function estimators: in our setting, ff is the return GtG_t, which depends on the sampled trajectory but not directly on the policy parameters θ\theta. The θ\theta-dependence is captured by the distribution p(xθ)p(x \mid \theta) itself.
=f(x)p(xθ)  θp(xθ)p(xθ)dx= \int f(x)\,p(x\mid\theta)\; \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)}\,dxExplanation: Multiply and divide by p(xθ)p(x\mid\theta); this is valid whenever p(xθ)>0p(x\mid\theta)>0 on the support of interest.
=f(x)p(xθ)  θlogp(xθ)dx= \int f(x)\,p(x\mid\theta)\; \nabla_\theta \log p(x\mid\theta)\,dxExplanation: We use the log-derivative identity:θlogp(xθ)=θp(xθ)p(xθ).\nabla_\theta \log p(x\mid\theta) = \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)}.
=Exp(xθ)[f(x)θlogp(xθ)]= \mathbb{E}_{x\sim p(x\mid\theta)} \left[\, f(x)\,\nabla_\theta \log p(x\mid\theta)\,\right]Explanation: We rewrite the integral back into expectation form.
This transformation is crucial because it converts the problem of differentiating through an expectation into an expectation of a tractable quantity, enabling Monte Carlo estimation even when f(x)f(x) is a black-box function without a closed-form gradient. We can approximate the value at state ss with the return over many sample trajectories τ\tau that are sampled from the policy network. θJ(πθ)=Eτπθ[Gtθlogπθ(as)] \nabla_\theta J(\pi_\theta)= \mathbb{E}_{\tau \sim \pi_\theta} \left[ G_t \nabla_\theta \log \pi_\theta (a|s) \right ] where GtG_t is the return - a quantity we have seen earlier albeit now the return is limited by the length of each trajectory just like in MC method, Gt(τ)=k=0Tt1γkRt+k+1G_t(\tau) = \sum_{k=0}^{T-t-1}\gamma^k R_{t+k+1} The γ\gamma is a hyperparameter that is tuned, typically by searching over values in [0.01, …, 0.99] and selecting the one with the best results. We also have an expectation in the gradient expression that we need to address. The expectation Eτπθ\mathbb E_{\tau \sim \pi_\theta} is approximated with a summation over each trajectory, a Monte Carlo (MC) approximation. Effectively, we are generating the right-hand side as in line 8 in the code below, by sampling a trajectory (line 4) and estimating its return (line 7) in a completely model-free fashion, i.e. without assuming any knowledge of the transition and reward functions. This is implemented next:
Algorithm: Monte Carlo Policy Gradient (REINFORCE)
LineStatement
1Initialize learning rate α\alpha
2Initialize policy parameters θ\theta of policy network πθ\pi_\theta
3for episode =0= 0 to MAX_EPISODE do
4\quad Sample trajectory τ=(s0,a0,r0,,sT,aT,rT)\tau = (s_0, a_0, r_0, \ldots, s_T, a_T, r_T) using πθ\pi_\theta
5\quad θJ(πθ)0\nabla_\theta J(\pi_\theta) \leftarrow 0
6\quad for t=0t = 0 to T1T-1 do
7\quad\quad Compute return Gt(τ)G_t(\tau)
8\quad\quad θJ(πθ)θJ(πθ)+Gt(τ)θlogπθ(atst)\nabla_\theta J(\pi_\theta) \leftarrow \nabla_\theta J(\pi_\theta) + G_t(\tau) \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t)
9\quad end for
10\quad θθ+αθJ(πθ)\theta \leftarrow \theta + \alpha \cdot \nabla_\theta J(\pi_\theta)
11end for
REINFORCE on-policy Monte Carlo loop: rollout trajectory, compute returns, estimate policy gradient, update parameters, discard trajectory Editable Mermaid source: images/reinforce-loop.mermaid.md It is important that a trajectory is discarded after each parameter update, it cannot be reused. This is because REINFORCE is an on-policy algorithm just like the MC it “learns on the job”. This is evidently seen in line 10 where the parameter update equation uses the policy gradient that itself (line 8) directly depends on action probabilities πθ(atst)\pi_\theta(a_t | s_t) generated by the current policy πθ\pi_\theta only and not some other policy πθ\pi_{\theta'}. Correspondingly, the return Gt(τ)G_t(\tau) where τπθ\tau \sim \pi_\theta must also be generated from πθ\pi_\theta, otherwise the action probabilities will be adjusted based on returns that the policy wouldn’t have generated.

Policy Network

One of the key ingredients that REINFORCE introduces is the policy network that is approximated with a NN eg. a fully connected neural network (e.g. two RELU-layers). 1: Given a policy network net, a Categorical (multinomial) distribution class, and a state 2: Compute the output pdparams = net(state) 3: Construct an instance of an action probability distribution  pd = Categorical(logits=pdparams) 4: Use pd to sample an action, action = pd.sample() 5: Use pd and action to compute the action log probability, log_prob = pd.log_prob(action) Other discrete distributions can be used and many actual libraries parametrize continuous distributions such as Gaussians.

Applying the REINFORCE algorithm

A runnable end-to-end implementation on the CartPole-v1 environment lives in its own section so you can train, inspect episode returns, and experiment with hyperparameters directly. It uses the same policy architecture described above and plots a learning curve. CartPole environment

REINFORCE on CartPole-v1

Runnable policy-gradient agent with a learning-curve plot, modernized for the current Gymnasium API.

Why the learning curve is not monotonic

Train the model above and the episode returns rise on average but oscillate sharply; bursts to the 500 ceiling are routinely followed by drops to 50–150. Monotonic improvement is not expected for vanilla REINFORCE. The reasons are structural:
  1. Gradient variance: a single trajectory is used per update. One lucky or unlucky episode dominates the gradient at that step.
  2. Uncentered returns: without a baseline, every action inside a good episode gets reinforced, including suboptimal ones. The gradient carries “absolute goodness,” not “advantage.”
  3. Unbounded step size: the per-step loss logπθ(as)Gt-\log\pi_\theta(a\mid s)\, G_t scales with the return GtG_t. A lucky 500-return episode produces a gradient step many times larger than a typical one, and a single oversized update can overshoot the current good region of parameter space; the very next episode’s policy is then markedly worse. Later methods (e.g. PPO) explicitly cap how much the policy is allowed to change per update.
  4. On-policy non-stationarity: the trajectory distribution itself changes with θ\theta, so yesterday’s gradient is stale today.
  5. Stochastic policy at eval time: even a well-trained policy samples from πθ(s)\pi_\theta(\cdot\mid s), so episode returns inherently fluctuate.
Each of these motivates a specific fix downstream of REINFORCE.

Reducing variance with a baseline

The first and simplest fix targets reasons 1 and 2 above: subtract a state-dependent baseline b(s)b(s) from the return: θJ(πθ)=Eτπθ[(Gtb(st))θlogπθ(atst)]\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \left( G_t - b(s_t) \right) \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] Any baseline that does not depend on the action ata_t leaves the gradient unbiased but can dramatically reduce its variance. The most common choice is b(st)=vπ(st)b(s_t) = v_\pi(s_t), the state-value function estimated by a separate network. This gives rise to the advantage A(st,at)=Gtvπ(st)A(s_t, a_t) = G_t - v_\pi(s_t) and leads directly to actor-critic methods, and from there to PPO and GRPO (which also adds ratio clipping to address reason 3). See section 2.5.1 of Foundations of Deep Reinforcement Learning for further improvements. Key references: (Schulman et al., 2017; Schulman et al., 2015; Szepesvári et al., 2010; Wang et al., 2016; Rafati & Noelle, 2019)

References

  • Rafati, J., Noelle, D. (2019). Learning sparse representations in reinforcement learning.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal Policy Optimization Algorithms.
  • Schulman, J., Levine, S., Moritz, P., Jordan, M., Abbeel, P. (2015). Trust Region Policy Optimization.
  • Szepesvári, C., Cochran, J., Cox, L., Keskinocak, P., Kharoufeh, J., et al. (2010). Reinforcement Learning Algorithms for MDPs. Wiley Encyclopedia of Operations Research and Management Science.
  • Wang, J., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J., et al. (2016). Learning to reinforcement learn.

Footnotes

  1. Notation wise, since we need to have a bit more flexibility in RL problems, we will use the symbol J(πθ)J(\pi_\theta) as the objective function.