Skip to main content
In the policy improvement step we are given the value function and simply apply the greedy heuristic to it. π=greedy(vπ)\pi^\prime = \mathtt{greedy}(v_\pi) It can be shown that this heuristic results into a policy that is better than the one the prediction step started (ππ\pi^\prime \ge \pi) and this extends into multiple iterations. We can therefore converge into an optimal policy - the interested reader can follow this lecture for a justification. The step itself is a simple one-pass greedification: given VπV_\pi, compute Qπ(s,a)Q_\pi(s, a) for every action and take the argmax.
import numpy as np

def greedify(V, P, R, gamma):
    """One-step policy improvement: π'(s) = argmax_a [ R(s,a) + γ Σ P(s'|s,a) V(s') ].

    Args:
        V: value function, shape (nS,)
        P: transition probabilities, shape (nS, nA, nS)
        R: expected immediate rewards, shape (nS, nA)
        gamma: discount factor

    Returns:
        pi: deterministic greedy policy, shape (nS,), values in [0, nA)
    """
    Q = R + gamma * P @ V              # shape (nS, nA)
    return np.argmax(Q, axis=1)
The full policy iteration algorithm alternates policy evaluation with this step until the policy is stable; see Policy Iteration for the outer loop and a gridworld demo. Key references: (Mansour & Singh, 2013; Ahmed et al., 2018; Schulman et al., 2015; Huang et al., 2022)

References

  • Ahmed, Z., Le Roux, N., Norouzi, M., Schuurmans, D. (2018). Understanding the impact of entropy on policy optimization.
  • Huang, S., Kanervisto, A., Raffin, A., Wang, W., Ontañón, S., et al. (2022). A2C is a special case of PPO.
  • Mansour, Y., Singh, S. (2013). On the Complexity of Policy Iteration.
  • Schulman, J., Levine, S., Moritz, P., Jordan, M., Abbeel, P. (2015). Trust Region Policy Optimization.