Note that Markov processes are sometimes erroneously called memoryless, but in any MDP we can incorporate memory, i.e. dependence on more than one past state, by cleverly defining the state $S_t$ as a container of a number of states. For example, $\tilde{S}_t = [S_t = s, S_{t-1} = s']$ still defines a Markov transition over the augmented state. The transition model

$$p(\tilde{S}_t \mid \tilde{S}_{t-1}) = p(s_t, s_{t-1} \mid s_{t-1}, s_{t-2}) = p(s_t \mid s_{t-1}, s_{t-2})$$

is called a 2nd-order Markov chain.

DeepMind's Q-learning algorithm playing Pac-Man converts the non-MDP problem into an MDP by accumulating four frames instead of one. With a single frame the problem is not an MDP, since the state of all players cannot be known: from a single frame Pac-Man cannot tell, for example, whether a monster is moving towards it or not. With a stack of frames we have all the information needed to act optimally in this game.
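To make the frame-stacking idea concrete, here is a minimal sketch that keeps the last `k` observations in a buffer and treats their stack as the (approximately Markov) state. The class name `FrameStack` and the dummy 2x2 frames are illustrative only; only the "keep the last four frames" idea comes from the text above.

```python
from collections import deque

import numpy as np

K = 4  # number of most recent frames kept in the state, as in the four-frame example above

class FrameStack:
    """Maintains the last K observations and exposes their stack as the state."""

    def __init__(self, k=K):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the buffer with copies of the first frame so the state is well defined.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        self.frames.append(frame)  # the oldest frame is dropped automatically
        return self.state()

    def state(self):
        # The augmented state: the K most recent frames stacked together.
        return np.stack(self.frames, axis=0)

# Usage with dummy 2x2 "frames"
stack = FrameStack()
s = stack.reset(np.zeros((2, 2)))
s = stack.push(np.ones((2, 2)))
print(s.shape)  # (4, 2, 2)
```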
## MDP Loop

We define a Markov Decision Process as the 5-tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \mathcal{A}, \gamma \rangle$ that produces a sequence of experiences $(S_t, A_t, R_{t+1}), (S_{t+1}, A_{t+1}, R_{t+2}), \dots$. The MDP (event) loop is shown below:
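The following is a minimal Python sketch of this loop, assuming a toy one-dimensional corridor environment and a random agent; the `reset`, `act`, `step`, and `update` method names mirror the description in the next paragraph, while everything else (the environment, the limits `MAX_EPISODE` and `T`) is illustrative only.

```python
import random

class CorridorEnv:
    """Toy environment: walk left/right on a line, reward 1 on reaching position 3."""
    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos                              # initial state S_0

    def step(self, action):
        self.pos += action                           # action is -1 or +1
        self.t += 1
        reward = 1.0 if self.pos == 3 else 0.0       # instantaneous reward R_{t+1}
        done = self.pos == 3 or self.t >= 20         # termination condition
        return self.pos, reward, done

class RandomAgent:
    def reset(self):
        pass                                         # nothing to reset for a random policy

    def act(self, state):
        return random.choice([-1, +1])               # A_t sampled from a uniform policy

    def update(self, state, action, reward, next_state):
        pass                                         # a learning algorithm would update itself here

env, agent = CorridorEnv(), RandomAgent()
MAX_EPISODE, T = 3, 20
for episode in range(MAX_EPISODE):
    state = env.reset()                              # environment reset produces the initial state
    agent.reset()                                    # agent reset
    for t in range(T):
        action = agent.act(state)                    # agent produces an action given the state
        next_state, reward, done = env.step(action)  # environment produces next state and reward
        agent.update(state, action, reward, next_state)
        state = next_state
        if done:                                     # environment terminated
            break
```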
This generic interface between the agent and the environment captures many problems outside of pure MDPs, including RL. In non-MDP problems the environment's state can be experienced via sensor observations, and the agent builds its own state estimate internally.

At the beginning of each episode, the environment and the agent are reset (`env.reset()`, `agent.reset()`). On reset, the environment produces an initial state. Then they begin interacting: the agent produces an action given a state (`agent.act`), then the environment produces the next state and reward given the action (`env.step`), stepping into the next time step. The `agent.act`-`env.step` cycle continues until the maximum time step $T$ is reached or the environment terminates. Here we also see a new component, `agent.update`, which encapsulates the agent's learning algorithm. Over multiple time steps and episodes, this method collects data and performs learning internally to maximize the objective.

The four foundational ingredients of an MDP are:
- Policy,
- Reward,
- Value function and
- Model of the environment.
These are obtained from the dynamics of the finite MDP,

$$p(s', r \mid s, a) = \Pr\{S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a\}$$

where $s'$ simply denotes the successor state, whatever the new state happens to be. The dynamics probability function maps $\mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0,1]$, and by marginalizing over the appropriate random variables we can obtain the following distributions.

## State transition
The action that the agent takes changes the environment state to some other state. This can be represented via the environment state transition probabilistic model, which generically can be written as

$$p(s' \mid s, a) = \Pr[S_{t+1}=s' \mid S_t=s, A_t=a] = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$$

This function can be represented as a state transition probability tensor $\mathcal{P}$,

$$\mathcal{P}^a_{ss'} = \Pr[S_{t+1}=s' \mid S_t=s, A_t=a]$$

where one dimension represents the action space and the other two constitute a state transition probability matrix.

Example: Can you determine the state transition tensor for the 4x3 Gridworld?
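As a sketch of this marginalization (using a hypothetical two-state, two-action MDP rather than the Gridworld), the snippet below stores the joint dynamics $p(s', r \mid s, a)$ in a dictionary and sums out the reward to obtain the transition tensor $\mathcal{P}^a_{ss'}$.

```python
# Hypothetical joint dynamics p(s', r | s, a) for a 2-state, 2-action MDP:
# keys are (s, a), values map (s', r) -> probability.
dynamics = {
    (0, 0): {(0, 0.0): 0.9, (1, 1.0): 0.1},
    (0, 1): {(1, 1.0): 0.8, (0, 0.0): 0.2},
    (1, 0): {(1, 0.0): 1.0},
    (1, 1): {(0, 0.0): 0.5, (1, 2.0): 0.5},
}

n_states, n_actions = 2, 2

# Transition tensor P[a][s][s'] = sum_r p(s', r | s, a)
P = [[[0.0] * n_states for _ in range(n_states)] for _ in range(n_actions)]
for (s, a), outcomes in dynamics.items():
    for (s_next, r), prob in outcomes.items():
        P[a][s][s_next] += prob

for a in range(n_actions):
    print(f"P[a={a}] =", P[a])  # each P[a] is a state transition probability matrix
```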
## Reward function and Returns
The action will also cause the environment to send the agent a signal called the instantaneous reward $R_{t+1}$. The reward signal effectively defines the goal of the agent and is the primary basis for altering a policy. The agent's sole objective is to maximize the cumulative reward in the long run.

Please note that in the literature the reward is also denoted as $R_t$; this is a convention issue rather than something fundamental. The justification of the index $t+1$ is that the environment takes one step to respond to what it receives from the agent.

Another marginalization of the MDP dynamics gives us the reward function, which tells us, if we are in state $S_t=s$ and take action $a$, what reward $R_{t+1}$ we get in expectation. It is given by

$$r(s,a) = \mathbb{E}[R_{t+1} \mid S_t=s, A_t=a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)$$

This can be written as a matrix $\mathcal{R}^a_s$.
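Continuing the hypothetical two-state dynamics dictionary from the transition-tensor sketch above, the expected reward $r(s,a)$ is obtained by the analogous marginalization.

```python
# Hypothetical joint dynamics p(s', r | s, a), as in the transition-tensor sketch above.
dynamics = {
    (0, 0): {(0, 0.0): 0.9, (1, 1.0): 0.1},
    (0, 1): {(1, 1.0): 0.8, (0, 0.0): 0.2},
    (1, 0): {(1, 0.0): 1.0},
    (1, 1): {(0, 0.0): 0.5, (1, 2.0): 0.5},
}

# r(s, a) = sum over (s', r) of r * p(s', r | s, a)
r = {
    (s, a): sum(prob * reward for (_, reward), prob in outcomes.items())
    for (s, a), outcomes in dynamics.items()
}

print(r)  # e.g. r[(0, 1)] == 0.8: expected immediate reward for action 1 in state 0
```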
### Returns

To capture the objective, consider first the return, defined as a function of the reward sequence after time step $t$. Note that the return is also called utility in some texts. In the simplest case this function is the total discounted reward,

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+1+k}$$

The discount rate determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times what it would be worth if it were received immediately. If $\gamma < 1$, the infinite sum above has a finite value as long as the reward sequence $\{R_k\}$ is bounded. If $\gamma = 0$, the agent is "myopic", concerned only with maximizing immediate rewards: its objective in this case is to learn how to choose $A_t$ so as to maximize only $R_{t+1}$. If each of the agent's actions happened to influence only the immediate reward and not future rewards as well, then a myopic agent could maximize the objective by separately maximizing each immediate reward. But in general, acting to maximize immediate reward can reduce access to future rewards, so that the return is reduced. As $\gamma$ approaches 1, the return objective takes future rewards into account more strongly; the agent becomes more farsighted.

Notice the two indices needed for the definition: the time step $t$ marks where we are in the trajectory, while the second index $k$ indexes future rewards up to infinity. This is the case of infinite-horizon problems, where we are not constrained to optimize the agent's behavior within the limits of a finite horizon $T$. If the discount factor $\gamma < 1$ and the rewards are bounded ($|R| < R_{max}$), then the above sum is finite:

$$\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} < \sum_{k=0}^{\infty} \gamma^k R_{max} = \frac{R_{max}}{1-\gamma}$$

The return is itself a random variable: for each trajectory defined by sampling the policy (strategy) of the agent we get a different return. For the Gridworld of the MDP section:

$$\tau_1: S_0 = s_{11}, S_1 = s_{12}, \dots, S_T = s_{43} \rightarrow G_0^{\tau_1} = 5.6$$
$$\tau_2: S_0 = s_{11}, S_1 = s_{21}, \dots, S_T = s_{43} \rightarrow G_0^{\tau_2} = 6.9$$
…

Note that these are sample numbers, included only to make the point that the return depends on the specific trajectory.

The following is a useful recursion, a reminder that returns at successive time steps are related to each other:

$$\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots \\
    &= R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots \right) \\
    &= R_{t+1} + \gamma G_{t+1}
\end{aligned}$$
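As a quick numerical check of both the definition and the recursion, the sketch below computes discounted returns for a made-up finite reward sequence, once by the direct sum and once by the backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$.

```python
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # made-up rewards R_1, ..., R_5 of a finite episode

# Direct definition: G_t = sum_k gamma^k * R_{t+1+k}
def return_direct(rewards, t, gamma):
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# Backward recursion: G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0
def returns_recursive(rewards, gamma):
    G = [0.0] * (len(rewards) + 1)
    for t in reversed(range(len(rewards))):
        G[t] = rewards[t] + gamma * G[t + 1]
    return G[:-1]

print(return_direct(rewards, 0, gamma))   # G_0 by the direct sum
print(returns_recursive(rewards, gamma))  # [G_0, G_1, ...] by the recursion; G_0 matches
```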
## Policy function

The agent's behavior is expressed via a policy function $\pi$ that tells the agent what action to take in every possible state. The policy is a function of the state and can be:
- Deterministic functions of the state the environment is in and, by extension, the state that the agent is in or believes (think posterior belief) it is in:

  $$a = \pi(s)$$
- Stochastic functions of the state expressed as a conditional probability distribution function (conditional pdf) of actions given the current state:
  $$a \sim p(A_t = a \mid S_t = s) = \pi(a \mid s)$$

The policy is assumed to be stationary, i.e. it does not change with the time step $t$ and depends only on the state $S_t$, i.e. $A_t = a \sim \pi(\cdot \mid S_t = s), \forall t > 0$.
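A minimal sketch of a stationary stochastic policy represented as a table of action probabilities per state (the state names, actions, and probabilities below are made up for illustration):

```python
import random

# pi(a | s) as a table: for each state, a dict of action -> probability.
policy = {
    "s11": {"up": 0.7, "right": 0.2, "left": 0.1},
    "s12": {"up": 0.5, "right": 0.5},
}

def act(state):
    """Sample A_t ~ pi(. | S_t = state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act("s11"))  # e.g. "up" with probability 0.7

# A deterministic policy is the special case where one action has probability 1:
deterministic_policy = {"s11": "up", "s12": "right"}
print(deterministic_policy["s11"])
```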
## Value Functions

The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow, and the rewards available in those states.

### State value
The state-value function $v_\pi(s)$ provides a notion of the long-term value of state $s$. It is equivalent to what other literature calls *expected utility*. It is defined as the *expected* return starting at state $s$ and following policy $\pi(a|s)$,

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

The expectation is needed because $G_t$ is a random variable: the sequence of states in each potential trajectory starting from $s$ is dictated by the stochastic policy. As an example, assuming there are just two possible (and equally probable) trajectories from state $s_{11}$, whose returns were calculated above, the value function of state $s_{11}$ will be

$$v_\pi(s_{11}) = \frac{1}{2}\left(G_0^{\tau_1} + G_0^{\tau_2}\right)$$

One corner case is interesting: if we set $\gamma = 0$, then $v_\pi(s)$ becomes the average of the instantaneous rewards we can get from that state.
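A sketch of this averaging, estimating $v_\pi(s)$ as the mean of returns over sampled trajectories; the random-walk environment and policy below are hypothetical, and only the averaging of returns mirrors the text.

```python
import random

gamma = 0.9

def sample_episode(start_state):
    """Hypothetical random walk on states 0..3 with reward 1 on reaching state 3."""
    state, rewards, t = start_state, [], 0
    while state != 3 and t < 50:
        state = max(0, state + random.choice([-1, +1]))  # uniform random policy
        rewards.append(1.0 if state == 3 else 0.0)
        t += 1
    return rewards

def discounted_return(rewards):
    G = 0.0
    for r in reversed(rewards):      # G_t = R_{t+1} + gamma * G_{t+1}
        G = r + gamma * G
    return G

def v_estimate(start_state, n_episodes=5000):
    # v_pi(s) is approximated by the average of sampled returns G_0 from state s
    returns = [discounted_return(sample_episode(start_state)) for _ in range(n_episodes)]
    return sum(returns) / len(returns)

print(v_estimate(0), v_estimate(2))  # states closer to the goal have higher value
```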
### Action value

We also define the action-value function $q_\pi(s,a)$ as the expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi(a|s)$,

$$q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

This is an important quantity, as it helps us decide which action to take while in state $s$.
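For instance, once action values are available (here as a hypothetical hand-filled table rather than values learned from data), acting greedily in state $s$ simply means picking the action with the largest $q_\pi(s,a)$:

```python
# Hypothetical action-value table q_pi(s, a); in practice these are estimated, not hand-filled.
q = {
    ("s11", "up"): 0.8,
    ("s11", "right"): 0.6,
    ("s11", "left"): 0.1,
}

def greedy_action(state):
    """Pick argmax_a q(state, a) over the actions available in this state."""
    candidates = {a: v for (s, a), v in q.items() if s == state}
    return max(candidates, key=candidates.get)

print(greedy_action("s11"))  # "up"
```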