Deep Q-Learning

Deep Q-Learning extends Q-learning to high-dimensional state spaces by using neural networks to approximate the Q-function Q(s,a). This enables RL in complex environments like video games and robotics where tabular methods are infeasible.

Q-learning algorithms with function approximation, such as Deep Q-Network (DQN) and its many variants, as well as Deep Deterministic Policy Gradient (DDPG), are largely based on minimizing a mean squared Bellman error (MSBE) loss function:

L = (Q(s,a) - [r + γ max_a' Q(s',a')])²

The loss penalizes the squared difference between the current Q-estimate and the Bellman target r + γ max_a' Q(s',a').
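
As a minimal sketch of how this loss is typically computed (assuming PyTorch, a q_net that maps a batch of states to per-action Q-values, and a separate target_net as described below), the Bellman target is treated as a fixed regression target by detaching it from the computation graph:

```python
import torch
import torch.nn.functional as F

def msbe_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared Bellman error over a batch of (s, a, r, s', done) transitions."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target: r + gamma * max_a' Q_target(s', a'); no bootstrapping past
    # terminal states, and no gradient flows through the target.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q

    return F.mse_loss(q_sa, target)
```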

Essential Techniques

Experience Replay: Store past transitions in a buffer and sample them uniformly at random during training to break the correlation between consecutive experiences and improve stability.
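
A replay buffer can be sketched as a bounded deque with uniform random sampling (the names here are illustrative, not taken from any particular library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity=100_000):
        # Oldest transitions are evicted automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling uniformly at random breaks the temporal correlation
        # between consecutive environment steps.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```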

Target Networks: Compute the Bellman targets with a separate, slowly updated copy of the Q-network, so the regression target does not shift with every gradient step. The target network is either copied from the online network periodically (as in DQN) or updated as a moving average of its weights (Polyak averaging, as in DDPG).
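
The two common update schemes can be sketched as follows (assuming PyTorch modules): periodic hard copies in the DQN style, and the Polyak moving-average updates used by DDPG:

```python
import copy
import torch

def make_target(online_net):
    """Create the target network as a frozen copy of the online network."""
    target_net = copy.deepcopy(online_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def hard_update(target_net, online_net):
    """DQN-style: overwrite the target weights every N gradient steps."""
    target_net.load_state_dict(online_net.state_dict())

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """DDPG-style Polyak averaging: target <- tau * online + (1 - tau) * target."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.mul_(1.0 - tau).add_(tau * o_param)
```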

Key Algorithms

  • Deep Q-Network (DQN): Discrete action spaces, ε-greedy exploration
  • Double DQN: Addresses the overestimation bias of the max operator in the Bellman target (see the sketch after this list)
  • Deep Deterministic Policy Gradient (DDPG): Continuous action spaces, actor-critic architecture
  • Rainbow DQN: Combines multiple improvements (prioritized replay, dueling networks, etc.)
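
To make the Double DQN idea concrete, the following sketch (using the same assumed q_net/target_net as above) selects the greedy next action with the online network but evaluates it with the target network, rather than letting one network do both:

```python
import torch

@torch.no_grad()
def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN Bellman target.

    Vanilla DQN uses max_a' Q_target(s', a'), so the same network both selects
    and evaluates the greedy next action, which tends to overestimate values.
    """
    best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # selection
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation
    return rewards + gamma * (1.0 - dones) * next_q
```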

References

  1. OpenAI Spinning Up, "Deep Deterministic Policy Gradient (DDPG)." https://spinningup.openai.com/en/latest/algorithms/ddpg.html