Policy Gradient Algorithms - REINFORCE

Given that RL can be posed as an MDP, in this section we continue with a policy-based algorithm that learns the policy directly by optimizing the objective function. The algorithm we treat here, called REINFORCE, is important because it is the basis of many other algorithms. It takes its name from the fact that, during training, actions that resulted in good outcomes should become more probable: these actions are positively reinforced. Conversely, actions that resulted in bad outcomes should become less probable. If learning is successful, then over the course of many iterations the action probabilities produced by the policy shift to a distribution that results in good performance in the environment we interact with.

Note that the ability of RL to generalize between environments is an active research area. We make no claims of generalization for any algorithm we present. To dig further into RL generalization, see Robert Kirk’s post and Google Research’s work.

Action probabilities are changed by following the policy gradient, which is why REINFORCE is known as a policy gradient algorithm. The algorithm needs three components:

Component	Description
Parametrized policy $\pi_\theta (a \mid s)$	Neural networks are powerful and flexible function approximators, so we can represent a policy using a neural network with learnable parameters $\theta$: this is the policy network $\pi_\theta$. Each specific set of parameters of the policy network represents a particular policy. This means that for $\theta_1 \neq \theta_2$ and a state $s$, in general $\pi_{\theta_1}(a \mid s) \neq \pi_{\theta_2}(a \mid s)$, so a single policy network architecture is capable of representing many different policies.
The objective to be maximized $J(\pi_\theta)$ ¹	The expected discounted return, just as in the MDP formulation.
Policy gradient	A method for updating the policy parameters $\theta$. The policy gradient algorithm searches for a local maximum of $J(\pi_\theta)$: $\max_\theta J(\pi_\theta)$. This is the familiar gradient ascent algorithm, which adjusts the parameters according to $\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta)$, where $\alpha$ is the learning rate. Note that we can equivalently pose the negated objective as a loss that we try to minimize.

Of the three components, the most involved is the policy gradient, which can be shown to be given by the following expectation, where $q_\pi(s, a)$ is the value of taking action $a$ in state $s$ and following the policy afterwards:

\nabla_\theta J(\pi_\theta)= \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (a \mid s)\, q_\pi (s, a) \right ]

This expression may seem to come out of nowhere; the interested reader can find its detailed derivation in chapter 2 of this reference. The derivation relies on the log-derivative trick.

Log-Derivative Trick

We begin with a function $f(x)$ and a parametrized distribution $p(x\mid\theta)$. We want to compute the gradient of the expectation $\mathbb{E}_{x\sim p(x\mid\theta)}[f(x)]$ and rewrite it in a form suitable for Monte Carlo estimation.

Step-by-step derivation

Step 1: Definition of expectation

\nabla_\theta \mathbb{E}_{x\sim p(x\mid\theta)}[f(x)] \;=\; \nabla_\theta \int f(x)\,p(x\mid\theta)\,dx

Explanation: This uses the definition of an expectation under a density:

\mathbb{E}[f(x)] = \int f(x)p(x)\,dx.

Step 2: Leibniz rule

= \int \nabla_\theta\!\left( f(x)\,p(x\mid\theta) \right)\,dx

Explanation: Under mild regularity conditions, differentiation and integration commute (Leibniz rule), allowing us to move $\nabla_\theta$ inside the integral.

Step 3: Product rule

= \int \big( f(x)\,\nabla_\theta p(x\mid\theta) \;+\; p(x\mid\theta)\,\nabla_\theta f(x) \big)\,dx

Explanation: This applies the product rule to $\nabla_\theta(f p)$.

Step 4: Simplification

= \int f(x)\,\nabla_\theta p(x\mid\theta)\,dx

Explanation: We assume $f(x)$ does not depend on $\theta$, so $\nabla_\theta f(x)=0$ and the second term vanishes. This is the standard situation in policy-gradient and score-function estimators.

Step 5: Multiply and divide

= \int f(x)\,p(x\mid\theta)\; \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)}\,dx

Explanation: Multiply and divide by $p(x\mid\theta)$; this is valid whenever $p(x\mid\theta)>0$ on the support of interest.

Step 6: Log-derivative identity

= \int f(x)\,p(x\mid\theta)\; \nabla_\theta \log p(x\mid\theta)\,dx

Explanation: We use the log-derivative identity:

\nabla_\theta \log p(x\mid\theta) = \frac{\nabla_\theta p(x\mid\theta)}{p(x\mid\theta)}.

Step 7: Back to expectation form

= \mathbb{E}_{x\sim p(x\mid\theta)} \left[\, f(x)\,\nabla_\theta \log p(x\mid\theta)\,\right]

Explanation: We rewrite the integral back into expectation form.
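As a quick numerical sanity check of the Step 7 identity, the sketch below estimates the gradient of $\mathbb{E}_{x\sim\mathcal{N}(\theta,1)}[x^2]$ by averaging $f(x)\,\nabla_\theta \log p(x\mid\theta)$ over samples and compares it with the closed form $2\theta$. The Gaussian, the choice $f(x)=x^2$, the sample count, and the function name `score_function_gradient` are illustrative assumptions, not part of the derivation above.

```python
import random

def score_function_gradient(theta, f, num_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)].

    Averages f(x) * d/dtheta log p(x | theta), as in Step 7.
    For a unit-variance Gaussian, d/dtheta log p(x | theta) = x - theta.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(theta, 1.0)   # sample x ~ p(x | theta)
        score = x - theta           # gradient of the log-density w.r.t. theta
        total += f(x) * score       # f is only evaluated, never differentiated
    return total / num_samples

theta = 1.5
estimate = score_function_gradient(theta, f=lambda x: x ** 2)
exact = 2 * theta                   # because E[x^2] = theta^2 + 1 for x ~ N(theta, 1)
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

The estimate only matches the exact value up to sampling noise, which is a first hint of the variance that such estimators carry.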

This transformation is crucial because it converts the problem of differentiating through an expectation into an expectation of a tractable quantity, enabling Monte Carlo estimation even when $f(x)$ is a black-box function without a closed-form gradient. We can approximate the value term in the policy gradient with the return over many sample trajectories $\tau$ that are sampled from the policy network:

\nabla_\theta J(\pi_\theta)= \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} G_t(\tau)\, \nabla_\theta \log \pi_\theta (a_t \mid s_t) \right ]

where $G_t$ is the return, a quantity we have seen earlier, although now the return is truncated at the end of each trajectory, just as in the MC method:

G_t(\tau) = \sum_{k=0}^{T-t-1}\gamma^k R_{t+1+k}
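The returns of one finished trajectory can be computed in a single backward pass using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, with the return after the last step taken to be zero. The sketch below assumes the rewards are stored in a plain list in which `rewards[t]` holds $R_{t+1}$; the function name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for one finished trajectory.

    rewards[t] is the reward received after the action taken at step t
    (R_{t+1} in the notation above).
    """
    returns = [0.0] * len(rewards)
    future = 0.0
    # Work backwards: G_t = R_{t+1} + gamma * G_{t+1}, with nothing after the end.
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # prints [1.75, 1.5, 1.0]
```

Working backwards costs one multiplication and one addition per step, instead of re-summing the tail of the trajectory for every $t$.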

The discount factor $\gamma$ is usually a hyper-parameter that we need to tune, typically by iterating over many values in [0.01,…,0.99] and selecting the one with the best results. We also have an expectation in the gradient expression that we need to address. The expectation $\mathbb E_{\tau \sim \pi_\theta}$ is approximated by sampling: each sampled trajectory yields one Monte Carlo estimate, obtained by summing over its time steps. Effectively, we generate the right-hand side as in line 8 of the algorithm below, by sampling a trajectory (line 4) and estimating its return (line 7) in a completely model-free fashion, i.e. without assuming any knowledge of the transition and reward functions. This is implemented next:

Algorithm: Monte Carlo Policy Gradient (REINFORCE)

Line	Statement
1	Initialize learning rate $\alpha$
2	Initialize policy parameters $\theta$ of policy network $\pi_\theta$
3	for episode $= 0$ to MAX_EPISODE do
4	$\quad$ Sample trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T, a_T, r_T)$ using $\pi_\theta$
5	$\quad$ $\nabla_\theta J(\pi_\theta) \leftarrow 0$
6	$\quad$ for $t = 0$ to $T-1$ do
7	$\quad\quad$ Compute return $G_t(\tau)$
8	$\quad\quad$ $\nabla_\theta J(\pi_\theta) \leftarrow \nabla_\theta J(\pi_\theta) + G_t(\tau) \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
9	$\quad$ end for
10	$\quad$ $\theta \leftarrow \theta + \alpha \cdot \nabla_\theta J(\pi_\theta)$
11	end for

It is important that a trajectory is discarded after each parameter update; it cannot be reused. This is because REINFORCE is an on-policy algorithm: just like MC, it “learns on the job”. This is evident in line 10, where the parameter update uses the policy gradient, which itself (line 8) depends directly on the action probabilities $\pi_\theta(a_t \mid s_t)$ generated by the current policy $\pi_\theta$ only and not by some other policy $\pi_{\theta'}$. Correspondingly, the return $G_t(\tau)$, where $\tau \sim \pi_\theta$, must also be generated from $\pi_\theta$; otherwise the action probabilities would be adjusted based on returns that the policy would not have generated.
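To make the algorithm table concrete, here is a minimal sketch of the same loop in plain Python. It makes several simplifying assumptions that are not in the text: the policy is a tabular softmax (one logit per state-action pair) rather than a neural network, and the environment is a hypothetical four-state corridor in which every step costs 1 and reaching the right end pays 2. The names `sample_trajectory` and `reinforce`, and all hyper-parameter values, are illustrative only.

```python
import math
import random

N_STATES, GOAL, MAX_STEPS = 4, 3, 20   # hypothetical corridor: states 0..3
LEFT, RIGHT = 0, 1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_trajectory(theta, rng):
    """Roll out one episode with the current policy; returns [(s, a, r), ...]."""
    state, trajectory = 0, []
    for _ in range(MAX_STEPS):
        probs = softmax(theta[state])
        action = RIGHT if rng.random() < probs[RIGHT] else LEFT
        next_state = max(0, state - 1) if action == LEFT else state + 1
        reward = 2.0 if next_state == GOAL else -1.0
        trajectory.append((state, action, reward))
        state = next_state
        if state == GOAL:
            break
    return trajectory

def reinforce(episodes=3000, alpha=0.02, gamma=0.99, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]   # one logit per (state, action)
    for _ in range(episodes):
        trajectory = sample_trajectory(theta, rng)          # algorithm line 4
        grad = [[0.0, 0.0] for _ in range(N_STATES)]        # algorithm line 5
        g = 0.0
        for state, action, reward in reversed(trajectory):  # lines 6-9, run backwards
            g = reward + gamma * g                          # line 7: return G_t
            probs = softmax(theta[state])
            for b in (LEFT, RIGHT):
                # For a softmax policy: d log pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s)
                indicator = 1.0 if b == action else 0.0
                grad[state][b] += g * (indicator - probs[b])  # line 8
        for s in range(N_STATES):                           # line 10: gradient ascent
            for b in (LEFT, RIGHT):
                theta[s][b] += alpha * grad[s][b]
        # The trajectory goes out of scope here; the next one uses the updated policy.
    return theta

theta = reinforce()
for s in range(GOAL):
    print(f"state {s}: P(right) = {softmax(theta[s])[RIGHT]:.3f}")
```

Because each trajectory is sampled inside the loop and never stored, the on-policy requirement is satisfied by construction.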

Policy Network

One of the key ingredients that REINFORCE introduces is the policy network, i.e. a policy approximated with a neural network, e.g. a fully connected network with two ReLU layers. Sampling an action from it proceeds as follows:

1: Given a policy network net, a Categorical (multinomial) distribution class, and a state
2: Compute the output pdparams = net(state)
3: Construct an instance of an action probability distribution pd = Categorical(logits=pdparams)
4: Use pd to sample an action, action = pd.sample()
5: Use pd and action to compute the action log probability, log_prob = pd.log_prob(action)

Other discrete distributions can be used, and many libraries also parametrize continuous distributions such as Gaussians.
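The five steps can be traced with a small standard-library sketch. The `Categorical` class below is a minimal stand-in written for illustration, not the class of any particular library (PyTorch, for example, ships its own `torch.distributions.Categorical` with `sample` and `log_prob` methods), and `net` is a hypothetical fixed linear map standing in for a trained policy network.

```python
import math
import random

class Categorical:
    """Minimal stand-in for a library categorical distribution (illustrative only)."""

    def __init__(self, logits, rng=random):
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
        self.log_probs = [z - log_norm for z in logits]   # log-softmax of the logits
        self.rng = rng

    def sample(self):
        probs = [math.exp(lp) for lp in self.log_probs]
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def log_prob(self, action):
        return self.log_probs[action]

def net(state):
    """Hypothetical policy network: a fixed linear map from a 2-d state to 2 logits."""
    weights = [[0.5, -0.2], [-0.3, 0.8]]
    return [sum(w * x for w, x in zip(row, state)) for row in weights]

state = [1.0, 2.0]                   # step 1: a state (and the net and class above)
pdparams = net(state)                # step 2: unnormalized action scores (logits)
pd = Categorical(logits=pdparams)    # step 3: action probability distribution
action = pd.sample()                 # step 4: sample an action
log_prob = pd.log_prob(action)       # step 5: log probability of that action
print(action, log_prob)
```

The value `log_prob` is the quantity whose gradient with respect to the network parameters appears in line 8 of the algorithm.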

Applying the REINFORCE algorithm

It is now instructive to see a stand-alone example in Python for the so-called CartPole-v0 environment. ²

PROTECTED_0

The REINFORCE algorithm presented here can generally be applied to both continuous and discrete problems, but it has been shown to suffer from high variance and sample inefficiency. Several improvements have been proposed, and the interested reader can refer to section 2.5.1 of the suggested book.

Key references: (Schulman et al., 2017; Schulman et al., 2015; Szepesvári et al., 2010; Wang et al., 2016; Rafati & Noelle, 2019)

References

Rafati, J., Noelle, D. (2019). Learning sparse representations in reinforcement learning.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal Policy Optimization Algorithms.
Schulman, J., Levine, S., Moritz, P., Jordan, M., Abbeel, P. (2015). Trust Region Policy Optimization.
Szepesvári, C., Cochran, J., Cox, L., Keskinocak, P., Kharoufeh, J., et al. (2010). Reinforcement Learning Algorithms for MDPs. Wiley Encyclopedia of Operations Research and Management Science.
Wang, J., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J., et al. (2016). Learning to reinforcement learn.


¹ Notation-wise, since we need a bit more flexibility in RL problems, we will use the symbol $J(\pi_\theta)$ for the objective function.
² Please note that SLM-Lab is the library that accompanies this book. You will learn a lot by reviewing the implementations under the agents/algorithms directory to get a feel for how RL problems are abstracted.
