- Interacting with the environment and receiving a reward signal from it at each interaction.
- Knowing the model (dynamics) of the environment, and optimizing an internal objective function based on the experience they accumulate.

Topics
- MDP Introduction: states, actions, transitions, rewards, and value functions.
- Bellman Expectation: computing value functions using the Bellman expectation equations.
- Bellman Optimality: optimal value functions and the Bellman optimality equations.
- Policy Iteration: finding optimal policies by alternating evaluation and improvement.
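To make the last topic concrete, here is a minimal sketch of policy iteration on a hypothetical two-state, two-action MDP (the transition table and rewards below are made up for illustration; they are not from the course material). Policy evaluation applies the Bellman expectation equation until the value function converges, and policy improvement acts greedily with respect to it.

```python
import numpy as np

# A tiny illustrative MDP (numbers invented for this sketch):
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

def policy_evaluation(policy, tol=1e-8):
    """Sweep the Bellman expectation equation until V stops changing."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_iteration():
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(n_states, dtype=int)
    while True:
        V = policy_evaluation(policy)
        stable = True
        for s in range(n_states):
            # Action values under the current V (Bellman optimality backup).
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V

policy, V = policy_iteration()
print(policy, V)  # greedy action per state, and the converged values
```

On this toy MDP the loop converges in two improvement steps; the same structure carries over to the tabular algorithms covered in the notes.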
Resources
Apart from the notes here, which are largely based on David Silver's (DeepMind) course material and video lectures, you can consult these additional resources:
- Sutton & Barto's Reinforcement Learning book - David Silver's slides and video lectures are based on this book. The companion code in Python is here.
- Deep Reinforcement Learning in Python - written by Google researchers.
Many of the algorithms presented here, such as policy and value iteration, were developed in older repos such as rlcode and dennybritz. This site is being migrated to be compatible with Farama's Gymnasium tooling.

