Markov Decision Processes
Markov decision processes formally describe the framework for reinforcement learning, enabling mathematical treatment of the algorithms. Almost all RL problems can be formalised as MDPs.
Formally, an MDP is a tuple $\langle S, A, P, R, \gamma \rangle$ where:
$S$ is a finite set of states
$A$ is a finite set of actions
$P$ is the state transition probability function, $P^a_{ss'} = \mathbb{P}[S_{t+1}=s' \mid S_t=s, A_t=a]$
$R$ is the expected reward function, $R^a_s = \mathbb{E}[R_{t+1} \mid S_t=s, A_t=a]$
$\gamma$ is a discount factor, $\gamma \in [0, 1]$
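The tuple above maps directly onto arrays. Below is a minimal sketch of a toy 2-state, 2-action MDP; all transition probabilities, rewards, and sizes are made-up numbers for illustration, not from any particular problem.

```python
import numpy as np

n_states, n_actions = 2, 2

# P[a, s, s'] = probability of landing in s' after taking action a in state s
P = np.array([
    [[0.9, 0.1],   # action 0 from state 0
     [0.2, 0.8]],  # action 0 from state 1
    [[0.5, 0.5],   # action 1 from state 0
     [0.0, 1.0]],  # action 1 from state 1
])

# R[s, a] = expected immediate reward E[R_{t+1} | S_t = s, A_t = a]
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

gamma = 0.9  # discount factor in [0, 1]

# Sanity check: each P[a, s, :] must be a probability distribution over s'
assert np.allclose(P.sum(axis=2), 1.0)
```

Indexing P as `[action, state, next_state]` keeps each `P[a]` a proper $|S| \times |S|$ transition matrix, which is convenient for the dynamic programming backups discussed below.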
Given the full MDP, an optimal policy can be computed with dynamic programming; see Dynamic Programming (RL) > Policy Iteration or Dynamic Programming (RL) > Value Iteration
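As a hedged sketch of what such dynamic programming looks like, here is value iteration run on a toy 2-state, 2-action MDP; the transition and reward numbers are illustrative assumptions, not part of the notes above.

```python
import numpy as np

# Toy MDP (illustrative numbers): P[a, s, s'] and R[s, a]
P = np.array([
    [[0.9, 0.1],
     [0.2, 0.8]],
    [[0.5, 0.5],
     [0.0, 1.0]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(2000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
    Q = R + gamma * np.einsum("asn,n->sa", P, V)
    V_new = Q.max(axis=1)            # greedy over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy policy w.r.t. the converged values
print("V* =", V, "policy =", policy)
```

Because the backup is a γ-contraction, the loop converges geometrically to the optimal value function regardless of the initial `V`.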
Examples of applications of MDPs:
- Robot navigation problem
- Inventory management
- Portfolio optimization
- Purchase and production optimization
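To make one of these concrete, inventory management can be cast as an MDP: the state is the current stock level, the action is how much to order, and the stochastic demand drives the transitions. The sizes, costs, and demand distribution below are illustrative assumptions.

```python
import numpy as np

MAX_STOCK = 3                     # states: stock level 0..3
demand_prob = [0.3, 0.4, 0.3]     # assumed P(demand = 0, 1, 2)
price, order_cost, holding_cost = 5.0, 2.0, 1.0  # assumed economics

n_s = MAX_STOCK + 1
n_a = MAX_STOCK + 1               # action: order 0..MAX_STOCK units
P = np.zeros((n_a, n_s, n_s))     # P[a, s, s']
R = np.zeros((n_s, n_a))          # R[s, a] = expected immediate reward

for s in range(n_s):
    for a in range(n_a):
        stock = min(s + a, MAX_STOCK)   # orders beyond capacity are truncated
        for d, pd in enumerate(demand_prob):
            sold = min(stock, d)
            s_next = stock - sold
            P[a, s, s_next] += pd
            R[s, a] += pd * (price * sold - order_cost * a - holding_cost * s_next)

# Each P[a, s, :] is a valid distribution over next stock levels
assert np.allclose(P.sum(axis=2), 1.0)
```

Once `P`, `R`, and a discount factor are in hand, the same dynamic programming machinery applies unchanged.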
Markov Property
In an MDP, all states satisfy the Markov property: the future is independent of the past given the present, i.e. $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$. The current state captures all the useful information from the agent's history.
Variants of MDP
References
- Chapter 3, Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition