Markov Decision Processes
Markov decision processes formally describe the framework for reinforcement learning, enabling mathematical treatment of the algorithms. Almost all RL problems can be formalised as MDPs.
Formally, an MDP is a tuple $\langle S, A, P, R, \gamma \rangle$ where:
$S$ is a finite set of states
$A$ is a finite set of actions
$P$ is the state transition probability function, $P^a_{ss'} = \mathbb{P}[S_{t+1}=s' \mid S_t=s, A_t=a]$
$R$ is the expected reward function, $R^a_s = \mathbb{E}[R_{t+1} \mid S_t=s, A_t=a]$
$\gamma$ is a discount factor, $\gamma \in [0, 1]$
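The tuple above maps directly onto arrays. Below is a minimal sketch of a toy 2-state, 2-action MDP; all transition probabilities, rewards, and sizes are made-up numbers for illustration, not from any particular problem.

```python
import numpy as np

n_states, n_actions = 2, 2

# P[a, s, s'] = probability of landing in s' after taking action a in state s
P = np.array([
    [[0.9, 0.1],   # action 0 from state 0
     [0.2, 0.8]],  # action 0 from state 1
    [[0.5, 0.5],   # action 1 from state 0
     [0.0, 1.0]],  # action 1 from state 1
])

# R[s, a] = expected immediate reward E[R_{t+1} | S_t = s, A_t = a]
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

gamma = 0.9  # discount factor in [0, 1]

# Sanity check: each P[a, s, :] must be a probability distribution over s'
assert np.allclose(P.sum(axis=2), 1.0)
```

Indexing P as `[action, state, next_state]` keeps each `P[a]` a proper $|S| \times |S|$ transition matrix, which is convenient for the dynamic programming backups discussed below.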
Given the full MDP, an optimal policy can be computed with dynamic programming; see Dynamic Programming (RL) > Policy Iteration or Dynamic Programming (RL) > Value Iteration
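As a hedged sketch of what such dynamic programming looks like, here is value iteration run on a toy 2-state, 2-action MDP; the transition and reward numbers are illustrative assumptions, not part of the notes above.

```python
import numpy as np

# Toy MDP (illustrative numbers): P[a, s, s'] and R[s, a]
P = np.array([
    [[0.9, 0.1],
     [0.2, 0.8]],
    [[0.5, 0.5],
     [0.0, 1.0]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(2000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
    Q = R + gamma * np.einsum("asn,n->sa", P, V)
    V_new = Q.max(axis=1)            # greedy over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy policy w.r.t. the converged values
print("V* =", V, "policy =", policy)
```

Because the backup is a γ-contraction, the loop converges geometrically to the optimal value function regardless of the initial `V`.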
Examples of applications of MDPs:
- Robot navigation problem
- Inventory management
- Portfolio optimization
- Purchase and production optimization
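To make one of these concrete, inventory management can be cast as an MDP: the state is the current stock level, the action is how much to order, and the stochastic demand drives the transitions. The sizes, costs, and demand distribution below are illustrative assumptions.

```python
import numpy as np

MAX_STOCK = 3                     # states: stock level 0..3
demand_prob = [0.3, 0.4, 0.3]     # assumed P(demand = 0, 1, 2)
price, order_cost, holding_cost = 5.0, 2.0, 1.0  # assumed economics

n_s = MAX_STOCK + 1
n_a = MAX_STOCK + 1               # action: order 0..MAX_STOCK units
P = np.zeros((n_a, n_s, n_s))     # P[a, s, s']
R = np.zeros((n_s, n_a))          # R[s, a] = expected immediate reward

for s in range(n_s):
    for a in range(n_a):
        stock = min(s + a, MAX_STOCK)   # orders beyond capacity are truncated
        for d, pd in enumerate(demand_prob):
            sold = min(stock, d)
            s_next = stock - sold
            P[a, s, s_next] += pd
            R[s, a] += pd * (price * sold - order_cost * a - holding_cost * s_next)

# Each P[a, s, :] is a valid distribution over next stock levels
assert np.allclose(P.sum(axis=2), 1.0)
```

Once `P`, `R`, and a discount factor are in hand, the same dynamic programming machinery applies unchanged.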
Markov Property
In an MDP, all states satisfy the Markov property: the future is independent of the past given the present, i.e. $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$. The current state captures all the useful information from the agent's history.
Variants of MDP
References
- Chapter 3, Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition