Policy Improvement

In the policy improvement step we are given the value function and simply apply the greedy heuristic to it.

\pi^\prime = \mathtt{greedy}(v_\pi)

It can be shown that this heuristic results into a policy that is better than the one the prediction step started (

\pi^\prime \ge \pi

) and this extends into multiple iterations. We can therefore converge into an optimal policy - the interested reader can follow this lecture for a justification. The step itself is a simple one-pass greedification: given

V_\pi

, compute

Q_\pi(s, a)

for every action and take the argmax.

import numpy as np

def greedify(V, P, R, gamma):
    """One-step policy improvement: π'(s) = argmax_a [ R(s,a) + γ Σ P(s'|s,a) V(s') ].

    Args:
        V: value function, shape (nS,)
        P: transition probabilities, shape (nS, nA, nS)
        R: expected immediate rewards, shape (nS, nA)
        gamma: discount factor

    Returns:
        pi: deterministic greedy policy, shape (nS,), values in [0, nA)
    """
    Q = R + gamma * P @ V              # shape (nS, nA)
    return np.argmax(Q, axis=1)

The full policy iteration algorithm alternates policy evaluation with this step until the policy is stable; see Policy Iteration for the outer loop and a gridworld demo. Key references: (Mansour & Singh, 2013; Ahmed et al., 2018; Schulman et al., 2015; Huang et al., 2022)

References

Ahmed, Z., Le Roux, N., Norouzi, M., Schuurmans, D. (2018). Understanding the impact of entropy on policy optimization.
Huang, S., Kanervisto, A., Raffin, A., Wang, W., Ontañón, S., et al. (2022). A2C is a special case of PPO.
Mansour, Y., Singh, S. (2013). On the Complexity of Policy Iteration.
Schulman, J., Levine, S., Moritz, P., Jordan, M., Abbeel, P. (2015). Trust Region Policy Optimization.

Edit this page on GitHub or file an issue.

Single Sweep Optimal Backup

Recycling Robot Example

Markov Decision Processes

Bellman Equations

Dynamic Programming

Policy Improvement

References

​References

References