Model-Based Reinforcement Learning

Model-based reinforcement learning (MBRL) is widely seen as having the potential to be significantly more sample efficient than model-free RL.

The high sample complexity of model-free RL (MFRL) largely limits its application to simulated domains.

Advantages

  • The model can be learned efficiently with supervised learning methods (see the sketch after this list)
  • Can reason about model uncertainty (as in upper-confidence-bound methods for the exploration/exploitation trade-off), which helps balance exploration against exploitation
  • Generalization - if the dynamics (or reward) of the environment change, the agent can reuse the learned model and replan

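The first two advantages can be made concrete with a short sketch: a dynamics model is just a regressor from (state, action) to next state, and an ensemble of such regressors gives a cheap uncertainty estimate from member disagreement. This is a minimal sketch in PyTorch; the shapes, names, and hyperparameters are illustrative, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """A dynamics model is an ordinary regressor from (s, a) to s'."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),  # predicts the delta s' - s
        )

    def forward(self, state, action):
        return state + self.net(torch.cat([state, action], dim=-1))

def train_ensemble(models, states, actions, next_states, epochs=50):
    """Fit each ensemble member by plain supervised regression (MSE)."""
    for model in models:
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(epochs):
            loss = ((model(states, actions) - next_states) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

def predict_with_uncertainty(models, state, action):
    """Ensemble mean is the prediction; the spread across members is a
    cheap uncertainty estimate usable for exploration bonuses."""
    preds = torch.stack([m(state, action) for m in models])
    return preds.mean(dim=0), preds.std(dim=0)
```
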
Disadvantages

  • First learn a model, then construct a value function from it -> two sources of approximation error

However, there are significant challenges.

Challenges of MBRL

These are the challenges identified in Benchmarking Model-Based Reinforcement Learning (Wang et al., 2019; reference 2):

Dynamics bottleneck

  • Performance does not keep increasing as more data is collected.
  • Agents with learned dynamics get stuck at performance local minima significantly worse than those reached with ground-truth dynamics.
  • Prediction error accumulates over time, and MBRL inevitably involves prediction on unseen states (see the sketch after this list).
  • The learning of the policy and of the dynamics model is coupled, which makes agents more prone to performance local minima.
  • Exploration and off-policy learning are barely addressed in current model-based approaches.

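The error-accumulation point is easy to demonstrate: even a model with small one-step error drifts once it is fed its own predictions. A hypothetical sketch, assuming a Gymnasium-style `env` and a one-step predictor `model.predict(state, action)` (both names illustrative):

```python
import numpy as np

def k_step_prediction_error(env, model, policy, k=50):
    """Compare an open-loop model rollout against the real environment;
    the error typically grows with k because one-step errors compound."""
    real_state, _ = env.reset(seed=0)
    model_state = real_state.copy()
    errors = []
    for _ in range(k):
        action = policy(real_state)
        real_state, _, terminated, truncated, _ = env.step(action)
        # The model consumes its own previous prediction rather than the
        # real observation, so it inevitably visits unseen states.
        model_state = model.predict(model_state, action)
        errors.append(float(np.linalg.norm(model_state - real_state)))
        if terminated or truncated:
            break
    return errors
```
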
Planning horizon dilemma

  • While increasing the planning horizon provides more accurate reward estimation, it can also cause performance to drop.
  • A planning horizon between 20 and 40 works best, both for models using ground-truth dynamics and for those using learned dynamics.
  • This can be attributed to insufficient planning in a search space that grows exponentially with planning depth, i.e., the curse of dimensionality (see the MPC sketch after this list).

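The dilemma shows up directly in sampling-based planners such as random-shooting MPC, where the horizon controls both the accuracy of the return estimate and the size of the search space. A minimal sketch; `model.predict` (batched one-step prediction) and `reward_fn` are assumed interfaces, and all hyperparameters are illustrative:

```python
import numpy as np

def random_shooting_mpc(model, reward_fn, state, action_dim,
                        horizon=30, num_candidates=1000):
    """Sample random action sequences, score them under the learned model,
    and return only the first action of the best sequence (MPC)."""
    # Candidate action sequences: (num_candidates, horizon, action_dim).
    candidates = np.random.uniform(-1.0, 1.0,
                                   size=(num_candidates, horizon, action_dim))
    states = np.repeat(state[None, :], num_candidates, axis=0)
    returns = np.zeros(num_candidates)
    for t in range(horizon):
        actions = candidates[:, t, :]
        next_states = model.predict(states, actions)
        returns += reward_fn(states, actions, next_states)
        states = next_states
    # A longer horizon gives a better return estimate, but covering the
    # exponentially larger sequence space needs far more candidates.
    return candidates[np.argmax(returns), 0]
```
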
Early termination dilemma

  • Early termination, where the episode ends before the horizon is reached, is a standard technique in MFRL algorithms to keep the agent out of unpromising states, or states that would damage a real robot.
  • MBRL can correspondingly apply early termination in planned trajectories, or generate early-terminated imaginary data (see the sketch after this list), but this has proven hard to integrate into existing MB algorithms.
  • In practice, early termination decreases performance for MBRL algorithms of different types.
  • Efficient learning in complex environments such as Humanoid all but requires early termination, which makes this an important area for research.

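Applying early termination inside model rollouts amounts to checking a termination predicate on imagined states and cutting the rollout there. A sketch assuming a `termination_fn(state)` is available (for many benchmarks it is a known function of the state, e.g., a fallen Hopper); the other names are illustrative:

```python
def imagined_rollout(model, policy, termination_fn, start_state, max_horizon=200):
    """Generate an imagined trajectory, stopping as soon as the (known or
    learned) termination predicate fires, so no reward or training data is
    accumulated past termination."""
    state, trajectory = start_state, []
    for _ in range(max_horizon):
        action = policy(state)
        next_state = model.predict(state, action)
        done = termination_fn(next_state)
        trajectory.append((state, action, next_state, done))
        if done:  # early termination of the imagined trajectory
            break
        state = next_state
    return trajectory
```
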
Types of model-based techniques

The MBPO blog post (reference 1) groups model-based techniques into four families:

Analytic gradient based

  • Differentiate the expected return directly through the learned dynamics model, as in PILCO or stochastic value gradients

Sampling-based planning

  • Sample candidate action sequences, score them under the model, and execute the best one, typically inside a model-predictive control (MPC) loop using random shooting or the cross-entropy method (e.g., PETS)

Model-based data generation

  • Use the model to generate imagined transitions that augment the training data of a model-free learner, as in Dyna and MBPO (see the sketch below)
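As a concrete instance of this family, MBPO (reference 1) branches short imagined rollouts from real states and feeds the imagined transitions into a model-free agent's replay buffer. A minimal sketch; `agent`, `model`, and the buffer interface are all assumed names:

```python
def generate_model_data(model, agent, replay_buffer, real_states, rollout_length=5):
    """Dyna/MBPO-style data generation: branch short model rollouts from
    real states and store the imagined transitions for a model-free learner."""
    for state in real_states:
        for _ in range(rollout_length):
            action = agent.act(state)
            next_state = model.predict(state, action)
            reward = model.predict_reward(state, action)  # or a known reward_fn
            replay_buffer.add(state, action, reward, next_state)
            state = next_state  # continue the imagined branch
```

Keeping `rollout_length` small is exactly how MBPO limits the compounding model error described under the dynamics bottleneck above.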

Value-equivalence prediction

  • Train the model not to reconstruct states but to make predictions that are value-equivalent to the real environment, so that planning in the model yields the correct values; MuZero is a prominent example


References

  1. Model-Based Reinforcement Learning: Theory and Practice, BAIR Blog: https://bair.berkeley.edu/blog/2019/12/12/mbpo/
  2. Wang et al., Benchmarking Model-Based Reinforcement Learning, 2019: https://arxiv.org/abs/1907.02057