Model-Based Reinforcement Learning

Model-based reinforcement learning (MBRL) is widely seen as having the potential to be significantly more sample efficient than model-free RL.

The high sample complexity of model-free RL (MFRL) largely limits its application to simulated domains.

Advantages

  • The model can be learned efficiently with supervised learning methods (see the sketch after this list)
  • Can reason about model uncertainty (as in upper-confidence-bound methods for the exploration/exploitation trade-off), which helps balance exploration against exploitation
  • Generalization - if the dynamics (or reward) of the environment change, the agent can reuse the learned model and replan

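The first two advantages can be made concrete with a short sketch: a dynamics model is just a regressor from (state, action) to next state, and an ensemble of such regressors gives a cheap uncertainty estimate from member disagreement. This is a minimal sketch in PyTorch; the shapes, names, and hyperparameters are illustrative, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """A dynamics model is an ordinary regressor from (s, a) to s'."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),  # predicts the delta s' - s
        )

    def forward(self, state, action):
        return state + self.net(torch.cat([state, action], dim=-1))

def train_ensemble(models, states, actions, next_states, epochs=50):
    """Fit each ensemble member by plain supervised regression (MSE)."""
    for model in models:
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(epochs):
            loss = ((model(states, actions) - next_states) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

def predict_with_uncertainty(models, state, action):
    """Ensemble mean is the prediction; the spread across members is a
    cheap uncertainty estimate usable for exploration bonuses."""
    preds = torch.stack([m(state, action) for m in models])
    return preds.mean(dim=0), preds.std(dim=0)
```
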
Disadvantages

  • First learn a model, then construct a value function from it -> two sources of approximation error

However, there are significant challenges.

Challenges of MBRL

These are the challenges identified in Benchmarking Model-Based Reinforcement Learning (Wang et al., 2019; reference 2):

Dynamics bottleneck

  • Performance does not keep increasing as more data is collected.
  • Agents with learned dynamics get stuck at performance local minima significantly worse than those reached with ground-truth dynamics.
  • Prediction error accumulates over time, and MBRL inevitably involves prediction on unseen states (see the sketch after this list).
  • The learning of the policy and of the dynamics model is coupled, which makes agents more prone to performance local minima.
  • Exploration and off-policy learning are barely addressed in current model-based approaches.

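The error-accumulation point is easy to demonstrate: even a model with small one-step error drifts once it is fed its own predictions. A hypothetical sketch, assuming a Gymnasium-style `env` and a one-step predictor `model.predict(state, action)` (both names illustrative):

```python
import numpy as np

def k_step_prediction_error(env, model, policy, k=50):
    """Compare an open-loop model rollout against the real environment;
    the error typically grows with k because one-step errors compound."""
    real_state, _ = env.reset(seed=0)
    model_state = real_state.copy()
    errors = []
    for _ in range(k):
        action = policy(real_state)
        real_state, _, terminated, truncated, _ = env.step(action)
        # The model consumes its own previous prediction rather than the
        # real observation, so it inevitably visits unseen states.
        model_state = model.predict(model_state, action)
        errors.append(float(np.linalg.norm(model_state - real_state)))
        if terminated or truncated:
            break
    return errors
```
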
Planning horizon dilemma

  • While increasing the planning horizon provides more accurate reward estimation, it can also cause performance to drop.
  • A planning horizon between 20 and 40 works best, both for models using ground-truth dynamics and for those using learned dynamics.
  • This can be attributed to insufficient planning in a search space that grows exponentially with planning depth, i.e., the curse of dimensionality (see the MPC sketch after this list).

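The dilemma shows up directly in sampling-based planners such as random-shooting MPC, where the horizon controls both the accuracy of the return estimate and the size of the search space. A minimal sketch; `model.predict` (batched one-step prediction) and `reward_fn` are assumed interfaces, and all hyperparameters are illustrative:

```python
import numpy as np

def random_shooting_mpc(model, reward_fn, state, action_dim,
                        horizon=30, num_candidates=1000):
    """Sample random action sequences, score them under the learned model,
    and return only the first action of the best sequence (MPC)."""
    # Candidate action sequences: (num_candidates, horizon, action_dim).
    candidates = np.random.uniform(-1.0, 1.0,
                                   size=(num_candidates, horizon, action_dim))
    states = np.repeat(state[None, :], num_candidates, axis=0)
    returns = np.zeros(num_candidates)
    for t in range(horizon):
        actions = candidates[:, t, :]
        next_states = model.predict(states, actions)
        returns += reward_fn(states, actions, next_states)
        states = next_states
    # A longer horizon gives a better return estimate, but covering the
    # exponentially larger sequence space needs far more candidates.
    return candidates[np.argmax(returns), 0]
```
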
Early termination dilemma

  • Early termination, where the episode ends before the horizon is reached, is a standard technique in MFRL algorithms to keep the agent out of unpromising states, or states that would damage a real robot.
  • MBRL can correspondingly apply early termination in planned trajectories, or generate early-terminated imaginary data (see the sketch after this list), but this has proven hard to integrate into existing MB algorithms.
  • In practice, early termination decreases performance for MBRL algorithms of different types.
  • Efficient learning in complex environments such as Humanoid all but requires early termination, which makes this an important area for research.

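Applying early termination inside model rollouts amounts to checking a termination predicate on imagined states and cutting the rollout there. A sketch assuming a `termination_fn(state)` is available (for many benchmarks it is a known function of the state, e.g., a fallen Hopper); the other names are illustrative:

```python
def imagined_rollout(model, policy, termination_fn, start_state, max_horizon=200):
    """Generate an imagined trajectory, stopping as soon as the (known or
    learned) termination predicate fires, so no reward or training data is
    accumulated past termination."""
    state, trajectory = start_state, []
    for _ in range(max_horizon):
        action = policy(state)
        next_state = model.predict(state, action)
        done = termination_fn(next_state)
        trajectory.append((state, action, next_state, done))
        if done:  # early termination of the imagined trajectory
            break
        state = next_state
    return trajectory
```
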
Types of model-based techniques

The MBPO blog post (reference 1) groups model-based techniques into four families:

Analytic gradient based

  • Differentiate the expected return directly through the learned dynamics model, as in PILCO or stochastic value gradients

Sampling-based planning

  • Sample candidate action sequences, score them under the model, and execute the best one, typically inside a model-predictive control (MPC) loop using random shooting or the cross-entropy method (e.g., PETS)

Model-based data generation

  • Use the model to generate imagined transitions that augment the training data of a model-free learner, as in Dyna and MBPO (see the sketch below)
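As a concrete instance of this family, MBPO (reference 1) branches short imagined rollouts from real states and feeds the imagined transitions into a model-free agent's replay buffer. A minimal sketch; `agent`, `model`, and the buffer interface are all assumed names:

```python
def generate_model_data(model, agent, replay_buffer, real_states, rollout_length=5):
    """Dyna/MBPO-style data generation: branch short model rollouts from
    real states and store the imagined transitions for a model-free learner."""
    for state in real_states:
        for _ in range(rollout_length):
            action = agent.act(state)
            next_state = model.predict(state, action)
            reward = model.predict_reward(state, action)  # or a known reward_fn
            replay_buffer.add(state, action, reward, next_state)
            state = next_state  # continue the imagined branch
```

Keeping `rollout_length` small is exactly how MBPO limits the compounding model error described under the dynamics bottleneck above.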

Value-equivalence prediction

  • Train the model not to reconstruct states but to make predictions that are value-equivalent to the real environment, so that planning in the model yields the correct values; MuZero is a prominent example


References

  1. Model-Based Reinforcement Learning: Theory and Practice, BAIR Blog: https://bair.berkeley.edu/blog/2019/12/12/mbpo/
  2. Wang et al., Benchmarking Model-Based Reinforcement Learning, 2019: https://arxiv.org/abs/1907.02057