| Component | Description |
|---|---|
| Parametrized policy | Neural networks are powerful and flexible function approximators, so we can represent a policy with a neural network that has learnable parameters $\theta$; this is the policy network $\pi_\theta$. Each specific set of parameters $\theta$ represents a particular policy: for $\theta_1 \neq \theta_2$ and a state $s$, in general $\pi_{\theta_1}(a \mid s) \neq \pi_{\theta_2}(a \mid s)$. A single policy network architecture is therefore capable of representing many different policies. |
| The objective to be maximized $J(\pi_\theta)$ ¹ | The expected discounted return, just like in an MDP: $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$. |
| Policy gradient | A method for updating the policy parameters $\theta$. The policy gradient algorithm searches for a local maximum of $J(\pi_\theta)$: $\max_\theta J(\pi_\theta)$. This is ordinary gradient ascent, which adjusts the parameters according to $\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta)$, where $\alpha$ is the learning rate. Note that we can equivalently pose the objective as a loss to minimize by negating it. |
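The gradient ascent update in the last row can be sketched on a toy problem. The quadratic objective below is a hypothetical stand-in for $J$ (it is not an RL objective), chosen only because its maximum is known. The second loop illustrates that minimizing the negated objective as a loss gives the same parameters.

```python
def grad_J(theta):
    """Gradient of the hypothetical objective J(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0)


def gradient_ascent(grad, theta, alpha=0.1, steps=100):
    """Repeat theta <- theta + alpha * grad(theta)."""
    for _ in range(steps):
        theta = theta + alpha * grad(theta)
    return theta


theta_ascent = gradient_ascent(grad_J, theta=0.0)

# Same problem posed as a loss L(theta) = -J(theta), minimized by gradient descent.
theta_descent = 0.0
for _ in range(100):
    grad_loss = -grad_J(theta_descent)
    theta_descent = theta_descent - 0.1 * grad_loss

print(theta_ascent, theta_descent)  # both approach the maximizer theta = 3
```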
The log-derivative trick
We begin with a function $f(x)$ and a parametrized distribution $p(x \mid \theta)$. We want to compute the gradient of the expectation, $\nabla_\theta \mathbb{E}_{x \sim p(x \mid \theta)}[f(x)]$, and rewrite it in a form suitable for Monte Carlo estimation.
Step-by-step derivation
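Step 1: Definition of expectation
$$\nabla_\theta \mathbb{E}_{x \sim p(x \mid \theta)}[f(x)] = \nabla_\theta \int f(x)\, p(x \mid \theta)\, dx$$
Explanation:
This uses the definition of an expectation under a density: $\mathbb{E}_{x \sim p(x \mid \theta)}[f(x)] = \int f(x)\, p(x \mid \theta)\, dx$.
Step 2: Leibniz rule
$$= \int \nabla_\theta \big( f(x)\, p(x \mid \theta) \big)\, dx$$
Explanation:
Under mild regularity conditions, differentiation and integration commute (Leibniz rule), allowing us to move $\nabla_\theta$ inside the integral.
Step 3: Product rule
$$= \int \big( f(x)\, \nabla_\theta p(x \mid \theta) + p(x \mid \theta)\, \nabla_\theta f(x) \big)\, dx$$
Explanation:
This applies the product rule to $f(x)\, p(x \mid \theta)$.
Step 4: Simplification
$$= \int f(x)\, \nabla_\theta p(x \mid \theta)\, dx$$
Explanation:
We assume $f(x)$ does not depend on $\theta$, so $\nabla_\theta f(x) = 0$ and the second term vanishes. This is the standard situation in policy-gradient and score-function estimators: in our setting, $f$ is the return $R(\tau)$, which depends on the sampled trajectory $\tau$ but not directly on the policy parameters $\theta$. The $\theta$-dependence is captured by the distribution itself.
Step 5: Multiply and divide
$$= \int f(x)\, p(x \mid \theta)\, \frac{\nabla_\theta p(x \mid \theta)}{p(x \mid \theta)}\, dx$$
Explanation:
Multiply and divide by $p(x \mid \theta)$; this is valid whenever $p(x \mid \theta) > 0$ on the support of interest.
Step 6: Log-derivative identity
$$= \int f(x)\, p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)\, dx$$
Explanation:
We use the log-derivative identity: $\nabla_\theta \log p(x \mid \theta) = \dfrac{\nabla_\theta p(x \mid \theta)}{p(x \mid \theta)}$.
Step 7: Back to expectation form
$$= \mathbb{E}_{x \sim p(x \mid \theta)}\big[ f(x)\, \nabla_\theta \log p(x \mid \theta) \big]$$
Explanation:
We rewrite the integral back into expectation form.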
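The final expectation can be checked numerically on a case where the answer is known in closed form. The sketch below assumes a unit-variance Gaussian $p(x \mid \theta) = \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, both chosen only for illustration. Under these assumptions $\nabla_\theta \log p(x \mid \theta) = x - \theta$ and $\mathbb{E}[x^2] = \theta^2 + 1$, so the exact gradient is $2\theta$. The function name is hypothetical.

```python
import random


def score_function_gradient(f, theta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)].

    Averages f(x) * grad_theta log p(x | theta); for a unit-variance
    Gaussian the score grad_theta log p(x | theta) equals x - theta.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)
        total += f(x) * (x - theta)
    return total / n_samples


theta = 1.5
estimate = score_function_gradient(lambda x: x * x, theta)
exact = 2.0 * theta  # derivative of E[x^2] = theta^2 + 1
print(estimate, exact)
```

Note that the estimator only needs samples of $x$ and the score $\nabla_\theta \log p(x \mid \theta)$; it never differentiates $f$, which is what makes it usable when $f$ is a return produced by an environment.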
Algorithm: Monte Carlo Policy Gradient (REINFORCE)
| Line | Statement |
|---|---|
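| 1 | Initialize learning rate $\alpha$ |
| 2 | Initialize policy parameters $\theta$ of policy network $\pi_\theta$ |
| 3 | for episode $= 0$ to MAX_EPISODE do |
| 4 | Sample trajectory $\tau = s_0, a_0, r_0, \ldots, s_T, a_T, r_T$ using $\pi_\theta$ |
| 5 | Set $\nabla_\theta J(\pi_\theta) = 0$ |
| 6 | for $t = 0$ to $T$ do |
| 7 | Compute return $R_t(\tau) = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ |
| 8 | $\nabla_\theta J(\pi_\theta) = \nabla_\theta J(\pi_\theta) + R_t(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ |
| 9 | end for |
| 10 | $\theta = \theta + \alpha \nabla_\theta J(\pi_\theta)$ |
| 11 | end for |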
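The return in line 7 does not have to be recomputed from scratch for every $t$. A minimal sketch, assuming the rewards of one episode are available as a Python list, accumulates the sum backwards using $R_t = r_t + \gamma R_{t+1}$; the function name `returns_to_go` is an illustrative choice.

```python
def returns_to_go(rewards, gamma):
    """Return [R_0, ..., R_T] with R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}."""
    returns = [0.0] * len(rewards)
    future = 0.0
    # Walk the episode backwards so each step reuses the return of the next one.
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns


print(returns_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```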
Figure: REINFORCE training loop (images/reinforce-loop.mermaid.md)
It is important that a trajectory is discarded after each parameter update; it cannot be reused. This is because REINFORCE is an on-policy algorithm: just like Monte Carlo (MC) methods, it “learns on the job”. This is evident in line 10, where the parameter update uses the policy gradient, which itself (line 8) directly depends on the action probabilities $\pi_\theta(a_t \mid s_t)$ generated by the current policy $\pi_\theta$ only, and not by some other policy $\pi_{\theta'}$. Correspondingly, the return $R_t(\tau)$, where $\tau \sim \pi_\theta$, must also be generated from $\pi_\theta$; otherwise the action probabilities would be adjusted based on returns that the policy would not have generated.
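The whole loop, including the on-policy requirement, can be sketched on a deliberately tiny problem. Everything below is an illustrative assumption rather than the CartPole setup: a stateless task with two actions, a reward of 1 for action 1 and 0 for action 0, and a "policy network" reduced to two logits passed through a softmax. For a softmax policy, $\partial \log \pi_\theta(a) / \partial \theta_k = \mathbb{1}[a = k] - \pi_\theta(k)$.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_reinforce(episodes=500, horizon=5, alpha=0.02, gamma=1.0, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # one logit per action (hypothetical toy policy)
    for _ in range(episodes):
        # Line 4: sample a fresh trajectory with the CURRENT policy.
        probs = softmax(theta)
        actions = [0 if rng.random() < probs[0] else 1 for _ in range(horizon)]
        rewards = [float(a) for a in actions]  # toy reward: 1 for action 1
        # Lines 5-9: accumulate R_t * grad_theta log pi(a_t).
        grad = [0.0, 0.0]
        ret = 0.0
        for t in reversed(range(horizon)):
            ret = rewards[t] + gamma * ret
            for k in range(2):
                indicator = 1.0 if actions[t] == k else 0.0
                grad[k] += ret * (indicator - probs[k])
        # Line 10: gradient ascent step; the trajectory is not reused afterwards.
        theta = [th + alpha * g for th, g in zip(theta, grad)]
    return softmax(theta)


final_probs = run_reinforce()
print(final_probs)  # probability mass should shift toward action 1
```

Because `probs` is recomputed from `theta` at the start of every episode, each gradient is built only from actions and returns that the current policy produced.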
Policy Network
One of the key ingredients that REINFORCE introduces is the policy network: the policy is approximated by a neural network, e.g. a fully connected network with two ReLU layers. Sampling an action from it proceeds as follows:
1: Given a policy network `net`, a `Categorical` (multinomial) distribution class, and a `state`
2: Compute the output `pdparams = net(state)`
3: Construct an instance of an action probability distribution, `pd = Categorical(logits=pdparams)`
4: Use `pd` to sample an action, `action = pd.sample()`
5: Use `pd` and `action` to compute the action log probability, `log_prob = pd.log_prob(action)`
Other discrete distributions can be used, and many libraries also provide parametrized continuous distributions such as Gaussians.
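The five steps can be traced with a dependency-free sketch. The `Categorical` class below is a minimal stand-in written for this example that mirrors the `logits=`, `sample()` and `log_prob()` interface used in the steps (PyTorch's `torch.distributions.Categorical` exposes methods with the same names). `make_net` is a hypothetical helper that builds a tiny network with random, untrained weights and a single ReLU hidden layer, which is simpler than the two-layer network mentioned above.

```python
import math
import random


class Categorical:
    """Minimal stand-in for a categorical distribution parametrized by logits."""

    def __init__(self, logits):
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
        self.log_probs = [z - log_norm for z in logits]  # log-softmax

    def sample(self, rng=random):
        u, cumulative = rng.random(), 0.0
        for action, lp in enumerate(self.log_probs):
            cumulative += math.exp(lp)
            if u < cumulative:
                return action
        return len(self.log_probs) - 1  # guard against floating-point round-off

    def log_prob(self, action):
        return self.log_probs[action]


def make_net(n_in, n_hidden, n_out, rng):
    """Hypothetical MLP: linear -> ReLU -> linear, with random weights."""
    w1 = [[rng.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[rng.gauss(0, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]

    def net(state):
        hidden = [max(0.0, sum(w * s for w, s in zip(row, state))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return net


rng = random.Random(0)
net = make_net(n_in=4, n_hidden=8, n_out=2, rng=rng)  # step 1
state = [0.1, -0.2, 0.05, 0.0]                        # made-up 4-dimensional state
pdparams = net(state)                                 # step 2
pd = Categorical(logits=pdparams)                     # step 3
action = pd.sample(rng)                               # step 4
log_prob = pd.log_prob(action)                        # step 5
print(action, log_prob)
```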
Applying the REINFORCE algorithm
A runnable end-to-end implementation on the CartPole-v1 environment lives in its own section so you can train, inspect episode returns, and experiment with hyperparameters directly. It uses the same policy architecture described above and plots a learning curve.

REINFORCE on CartPole-v1
Runnable policy-gradient agent with a learning-curve plot, modernized for the current Gymnasium API.
Why the learning curve is not monotonic
When you train the model above, episode returns rise on average but oscillate sharply: bursts to the 500 ceiling are routinely followed by drops to 50–150. Monotonic improvement is not expected for vanilla REINFORCE. The reasons are structural:
- Gradient variance: a single trajectory is used per update, so one lucky or unlucky episode dominates the gradient at that step.
- Uncentered returns: without a baseline, every action inside a good episode gets reinforced, including suboptimal ones. The gradient carries “absolute goodness,” not “advantage.”
- Unbounded step size: the per-step loss scales with the return $R_t(\tau)$. A lucky 500-return episode produces a gradient step many times larger than a typical one, and a single oversized update can overshoot the current good region of parameter space; the very next episode’s policy is then markedly worse. Later methods (e.g. PPO) explicitly cap how much the policy is allowed to change per update.
- On-policy non-stationarity: the trajectory distribution itself changes with $\theta$, so yesterday’s gradient is stale today.
- Stochastic policy at eval time: even a well-trained policy samples from $\pi_\theta(a \mid s)$, so episode returns inherently fluctuate.
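The first two effects can be reproduced in one dimension. The sketch below is an illustrative assumption and not part of the CartPole experiment: $x \sim \mathcal{N}(\theta, 1)$ with $\theta = 0$ plays the role of the action, and $f(x) = x + 10$ plays the role of a return with a large constant offset. The true gradient of $\mathbb{E}[f(x)]$ with respect to $\theta$ is 1. Subtracting the constant 10 before multiplying by the score leaves the mean of the estimator unchanged but shrinks its variance considerably.

```python
import random
import statistics


def score_function_samples(n, offset, baseline, seed=0):
    """Per-sample gradient estimates (f(x) - baseline) * score for x ~ N(0, 1).

    f(x) = x + offset, and the score of N(theta, 1) at theta = 0 is x itself.
    """
    rng = random.Random(seed)
    theta = 0.0
    samples = []
    for _ in range(n):
        x = rng.gauss(theta, 1.0)
        score = x - theta
        samples.append((x + offset - baseline) * score)
    return samples


plain = score_function_samples(50_000, offset=10.0, baseline=0.0)
centered = score_function_samples(50_000, offset=10.0, baseline=10.0)

print("uncentered:", statistics.mean(plain), statistics.variance(plain))
print("centered:  ", statistics.mean(centered), statistics.variance(centered))
```

Both means should land near 1, while the uncentered variance is dominated by the offset: every sample is pushed in the direction of its own score regardless of whether it was better or worse than average.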
Reducing variance with a baseline
The first and simplest fix targets the first two reasons above (gradient variance and uncentered returns): subtract a state-dependent baseline $b(s_t)$ from the return:
$$\nabla_\theta J(\pi_\theta) \approx \sum_{t=0}^{T} \big( R_t(\tau) - b(s_t) \big)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Any baseline that does not depend on the action leaves the gradient unbiased but can dramatically reduce its variance. The most common choice is $V^\pi(s_t)$, the state-value function estimated by a separate network. This gives rise to the advantage $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$, which $R_t(\tau) - V^\pi(s_t)$ estimates, and leads directly to actor-critic methods, and from there to PPO and GRPO (which also add ratio clipping to address the third reason, unbounded step size). See section 2.5.1 of Foundations of Deep Reinforcement Learning for further improvements. Key references: (Schulman et al., 2017; Schulman et al., 2015; Szepesvári et al., 2010; Wang et al., 2016; Rafati & Noelle, 2019)
References
- Rafati, J., Noelle, D. (2019). Learning sparse representations in reinforcement learning.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal Policy Optimization Algorithms.
- Schulman, J., Levine, S., Moritz, P., Jordan, M., Abbeel, P. (2015). Trust Region Policy Optimization.
- Szepesvári, C., Cochran, J., Cox, L., Keskinocak, P., Kharoufeh, J., et al. (2010). Reinforcement Learning Algorithms for MDPs. Wiley Encyclopedia of Operations Research and Management Science.
- Wang, J., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J., et al. (2016). Learning to reinforcement learn.
Footnotes
- Notation-wise, since we need a bit more flexibility in RL problems, we will use the symbol $J$ for the objective function.

