Reinforcement Learning

Open-Endedness Reinforcement Learning Quality Diversity Exploration

Go-Explore

How to explore the RL search space more effectively than intrinsic-motivation methods do.

MAP-Elites Monte-Carlo Tree Search PPO - Proximal Policy Optimization

Intrinsic Motivation Reinforcement Learning Exploration

Intrinsically-Motivated Humans and Agents in Open-World Exploration

How do humans approach exploration? Turns out it's empowerment, with entropy early on.

Large Language Models (LLMs) Reinforcement Learning

Group Relative Policy Optimization (GRPO)

Avoid learning an explicit value function in the RL alignment setup. Score each sampled response relative to the other responses in its group instead.

TRPO - Trust-Region Policy Optimization RLHF - Reinforcement Learning with Human Feedback KL Divergence Control Variates PPO - Proximal Policy Optimization
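
A minimal sketch of the group-relative baseline behind GRPO, assuming one scalar reward per sampled response to the same prompt. The function name, the `eps` stabilizer, and the use of the population standard deviation are illustrative assumptions, not details taken from this note.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each response against the other samples drawn for the same prompt.

    No value network is involved: the group mean acts as the baseline and the
    group standard deviation rescales the result. `eps` (an assumption) guards
    against division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical responses to one prompt, rewarded 1.0 when correct.
group_advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses above the group mean get a positive advantage and the rest a negative one, so the policy update needs only relative rankings inside the group.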

Reinforcement Learning

Advantage Functions

Raw action values mix state quality with action quality. Subtract the state value baseline to isolate the advantage of each action.

Control Variates
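
The baseline subtraction in the Advantage Functions entry can be sketched for a single state as follows. The action values and policy probabilities are made-up numbers, and V(s) is taken to be the policy-weighted average of Q(s, a).

```python
def advantages(q_values, policy):
    """Return A(s, a) = Q(s, a) - V(s) for every action in one state.

    V(s) is computed as sum_a pi(a|s) * Q(s, a), so the advantages average
    to zero under the policy.
    """
    state_value = sum(p * q for p, q in zip(policy, q_values))
    return [q - state_value for q in q_values]

# Hypothetical state with three actions.
adv = advantages(q_values=[10.0, 12.0, 11.0], policy=[0.5, 0.25, 0.25])
```

All three raw action values are large because the state itself is good; the advantages keep only the part that distinguishes one action from another.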

Reinforcement Learning

Deep Q-Learning

Q-learning with tabular methods doesn't scale to large state spaces. Approximate the Q-function with a deep neural network.

Deep-Q-Network (DQN) Multi-Network Training with Moving Average Target
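
The moving-average target mentioned in the links above can be sketched as a soft (Polyak) update. Parameters are plain lists of floats here to keep the example dependency-free; the name `soft_update` and the default `tau` are assumptions for illustration.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a small step toward the online parameter.

    target <- (1 - tau) * target + tau * online
    A small tau makes the bootstrapping target change slowly.
    """
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Hypothetical two-parameter networks.
new_target = soft_update(target_params=[0.0, 1.0], online_params=[1.0, 1.0], tau=0.1)
```

Setting `tau=1.0` reduces this to a hard copy of the online network.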

Reinforcement Learning Partial Observability

Eligibility Trace

The RL task is partially observable or needs memory of past states (non-Markov).

Temporal Difference Learning Partial Observability
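
A small tabular sketch of how an eligibility trace carries memory of past states, assuming accumulating traces and a TD(lambda)-style update. The episode format and the state names are hypothetical.

```python
def td_lambda_episode(values, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Run one episode of tabular TD(lambda) with accumulating traces.

    `episode` is a list of (state, reward, next_state) tuples; a next_state of
    None marks the terminal transition. `values` is updated in place.
    """
    traces = {s: 0.0 for s in values}
    for state, reward, next_state in episode:
        next_value = 0.0 if next_state is None else values[next_state]
        delta = reward + gamma * next_value - values[state]
        traces[state] += 1.0                      # mark the visited state
        for s in values:
            values[s] += alpha * delta * traces[s]  # credit all recent states
            traces[s] *= gamma * lam                # decay the memory
    return values

values = td_lambda_episode({"A": 0.0, "B": 0.0}, [("A", 0.0, "B"), ("B", 1.0, None)])
```

State "A" receives part of the reward earned after leaving "B" because its trace is still nonzero when the reward arrives.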

Reinforcement Learning

Partial Observability

Agent cannot observe the full environment state. Maintain a belief state or use memory to act under incomplete information.

Eligibility Trace Recurrent Neural Networks (RNN)

Reinforcement Learning

Inverse Reinforcement Learning

The reward function is unknown but expert demonstrations are available. Infer the reward that explains observed expert behavior.

Reinforcement Learning Language Models (Classical)

RLHF - Reinforcement Learning with Human Feedback

Specifying a reward function for complex tasks like language generation is intractable. Learn a reward model from human preferences and optimize with RL.

Inverse Reinforcement Learning Transformers BERT TRPO - Trust-Region Policy Optimization Policy Gradient PPO - Proximal Policy Optimization

Reinforcement Learning Model Free Reinforcement Learning

PPO - Proximal Policy Optimization

TRPO's constrained optimization is complex to implement. Use a clipped surrogate objective to approximate trust region behavior simply.

Model Free Reinforcement Learning TRPO - Trust-Region Policy Optimization Policy Gradient KL Divergence
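
A sketch of the clipped surrogate objective, assuming the probability ratios pi_new(a|s) / pi_old(a|s) and the advantage estimates have already been computed. The function name and the sample numbers are illustrative.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Average clipped surrogate objective (to be maximized).

    For each sample, take the smaller of the unclipped and clipped terms so
    the policy gains nothing from moving the ratio outside
    [1 - clip_eps, 1 + clip_eps].
    """
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)

# Two hypothetical samples whose ratios drifted far from 1.
objective = ppo_clip_objective(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
```

The first sample is capped at 1.2 * 1.0 and the second is charged 0.8 * -1.0, which is the pessimistic bound that stands in for TRPO's explicit KL constraint.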

Reinforcement Learning Markov Decision Processes

State Update Functions in Partially Observable MDP

In POMDPs the agent receives observations, not states. Maintain and update a belief distribution over hidden states.

Markov Decision Processes
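
The belief update for a POMDP can be sketched as a discrete Bayes filter for one fixed action and one received observation. The matrix layout and the two-state example are assumptions made for illustration.

```python
def belief_update(belief, transition, observation_likelihood):
    """Return the posterior belief after acting and observing.

    belief[s]                  = P(s) before the step
    transition[s][s2]          = P(s2 | s, a) for the action taken
    observation_likelihood[s2] = P(o | s2, a) for the observation received
    """
    n = len(belief)
    # Predict: push the belief through the transition model.
    predicted = [sum(belief[s] * transition[s][s2] for s in range(n)) for s2 in range(n)]
    # Correct: weight by how well each state explains the observation.
    unnormalized = [observation_likelihood[s2] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return [u / total for u in unnormalized]

# Hypothetical two-state problem: the action does not move the state, and the
# observation is 85% reliable evidence for state 0.
new_belief = belief_update([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], [0.85, 0.15])
```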

Reinforcement Learning Model Based Reinforcement Learning

Dyna-Q - Planning and Learning

Pure model-free RL is sample-inefficient. Interleave real experience with simulated experience from a learned model.

Model Based Reinforcement Learning
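
A compact tabular sketch of the Dyna-Q loop: each real step is followed by several simulated updates replayed from a learned model. The environment `chain_step`, the hyperparameters, and the deterministic model are assumptions chosen to keep the example small.

```python
import random

def dyna_q(step, states, actions, start, episodes=50, planning_steps=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in states for a in actions}
    model = {}  # (state, action) -> (reward, next_state, done); deterministic

    def update(s, a, r, s2, done):
        best_next = 0.0 if done else max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

    def choose(s):
        if rng.random() < epsilon:
            return rng.choice(actions)
        best = max(q[(s, b)] for b in actions)
        return rng.choice([b for b in actions if q[(s, b)] == best])  # random tie-break

    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = choose(s)
            r, s2, done = step(s, a)
            update(s, a, r, s2, done)         # learn from real experience
            model[(s, a)] = (r, s2, done)     # remember what happened
            for _ in range(planning_steps):   # learn from simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q

def chain_step(state, action):
    """Hypothetical 4-state corridor: reaching state 3 pays 1 and ends the episode."""
    next_state = min(state + 1, 3) if action == "right" else max(state - 1, 0)
    done = next_state == 3
    return (1.0 if done else 0.0), next_state, done

q = dyna_q(chain_step, states=[0, 1, 2, 3], actions=["left", "right"], start=0)
```

Setting `planning_steps=0` turns this into plain one-step Q-learning, which is a convenient way to compare sample efficiency.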

Reinforcement Learning Policy Gradient

Generalized Advantage Estimate

Monte Carlo advantage has low bias but high variance; TD has low variance but high bias. Exponentially weight multi-step TD errors to control the tradeoff.

TRPO - Trust-Region Policy Optimization Policy Gradient
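
The exponential weighting of TD errors in the Generalized Advantage Estimate entry can be sketched with a single backward pass. This assumes `values` holds one more entry than `rewards`, the last being the bootstrap value (0.0 after a terminal state).

```python
def generalized_advantage_estimates(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    lam = 0 gives the one-step TD error; lam = 1 gives the Monte Carlo
    return minus the value baseline.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical two-step trajectory with an untrained (all-zero) critic.
mc_like = generalized_advantage_estimates([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
td_like = generalized_advantage_estimates([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=0.0)
```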

Reinforcement Learning Policy Gradient

Natural Policy Gradient

Vanilla policy gradient steps in parameter space distort the policy unevenly. Use the Fisher information to take equal-size steps in distribution space.

Policy Gradient KL Divergence

Reinforcement Learning

Prioritized Sweeping

Uniform random updates in model-based RL waste computation on low-priority states. Prioritize updates by expected value change.

Reinforcement Learning Model Free Reinforcement Learning

TRPO - Trust-Region Policy Optimization

Policy gradient updates can be too large and collapse performance. Constrain the KL divergence between old and new policies for monotonic improvement.

Model Free Reinforcement Learning Policy Gradient

Reinforcement Learning

Bellman Equation and Value Functions

How to recursively define the expected long-term return from a state? Express value as immediate reward plus discounted future value.

Markov Decision Processes

Reinforcement Learning

Dynamic Programming (RL)

How to compute optimal policies when the full MDP model is known? Iteratively apply Bellman updates to converge to optimal value functions.

Bellman Equation and Value Functions
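
A minimal value-iteration sketch for the known-model case. The dictionary layout for the MDP and the three-state example are assumptions for illustration.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Apply Bellman optimality backups until the values stop changing.

    transitions[state][action] is a list of (probability, next_state, reward).
    A state with no actions is treated as terminal with value 0.
    """
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:
                continue
            best = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values

# Hypothetical MDP: "a" leads to "b"; from "b" the agent can finish for reward 1 or stall.
mdp = {
    "a": {"go": [(1.0, "b", 0.0)]},
    "b": {"go": [(1.0, "end", 1.0)], "stay": [(1.0, "b", 0.0)]},
    "end": {},
}
optimal_values = value_iteration(mdp)
```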

Reinforcement Learning

Incremental Implementation of Estimating Action Values

Storing all past rewards to compute action value averages is memory-inefficient. Use incremental running averages instead.
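
The running average can be sketched with constant memory as below; the class name and the reward sequence are illustrative.

```python
class RunningMean:
    """Sample-average estimate that stores only a count and the current value."""

    def __init__(self):
        self.count = 0
        self.value = 0.0

    def update(self, reward):
        # Q_{n+1} = Q_n + (R_n - Q_n) / n
        self.count += 1
        self.value += (reward - self.value) / self.count
        return self.value

estimate = RunningMean()
for reward in [1.0, 0.0, 2.0, 1.0]:
    estimate.update(reward)
```

Replacing `1 / self.count` with a constant step size would weight recent rewards more heavily, which suits non-stationary problems.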

Reinforcement Learning

Markov Decision Processes

How to mathematically formalize sequential decision-making? Define states, actions, transitions, and rewards with the Markov property.

Semi-Markov Decision Processes Markov Reward Processes Dynamic Programming (RL) Partial Observability

Reinforcement Learning

Multi-Armed Bandits

How to balance exploring unknown options vs exploiting the best known option to maximize cumulative reward?

Incremental Implementation of Estimating Action Values
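
One simple answer to the explore/exploit question is epsilon-greedy action selection, sketched here on Bernoulli arms. The arm probabilities, step count, and seed are assumptions chosen for illustration; other strategies such as UCB or Thompson sampling follow the same loop shape.

```python
import random

def epsilon_greedy_bandit(arm_probs, steps=2000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit, exploring with probability epsilon."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)
    estimates = [0.0] * len(arm_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_probs))                               # explore
        else:
            arm = max(range(len(arm_probs)), key=lambda i: estimates[i])      # exploit
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]             # running mean
    return counts, estimates

counts, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```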

Reinforcement Learning

Off-policy learning with approximation

Off-policy learning with function approximation can diverge (the deadly triad). Use importance sampling corrections or gradient methods for stability.

Importance Sampling

Reinforcement Learning

On-policy learning with approximation

Tabular value functions don't scale to large state spaces. Use function approximation to generalize values across similar states on-policy.

Reinforcement Learning Model Free Reinforcement Learning

PGT Actor-Critic

REINFORCE has high variance from using full returns. Use a learned value function (critic) as baseline to reduce policy gradient variance.

REINFORCE - Monte Carlo Policy Gradient Temporal Difference Learning Model Free Reinforcement Learning

Reinforcement Learning REINFORCE - Score Function Estimator

REINFORCE - Monte Carlo Policy Gradient

How to optimize a policy when environment dynamics are unknown? Use sampled returns to estimate the policy gradient via Monte Carlo rollouts.

Multi-Armed Bandits REINFORCE - Score Function Estimator Stochastic Gradients Policy Gradient Control Variates

Reinforcement Learning

Reinforcement Learning Problem Setup

How to formalize sequential decision-making under uncertainty? Define agents, environments, states, actions, and rewards.

Model Based Reinforcement Learning Dynamic Programming (RL) PGT Actor-Critic Model Free Reinforcement Learning Markov Decision Processes

Reinforcement Learning

Temporal Difference Learning

Monte Carlo methods require waiting until episode end to update. Bootstrap from current value estimates to learn online from incomplete episodes.

Maximum Likelihood Estimation
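
The bootstrapped update in the Temporal Difference Learning entry can be sketched as a single TD(0) step on a value table. State names and step sizes are hypothetical.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9, done=False):
    """Update V(state) from one transition without waiting for the episode to end.

    The target bootstraps from the current estimate of the next state's value.
    Returns the TD error; `values` is modified in place.
    """
    target = reward + (0.0 if done else gamma * values[next_state])
    td_error = target - values[state]
    values[state] += alpha * td_error
    return td_error

table = {"A": 0.0, "B": 1.0}
error = td0_update(table, state="A", reward=0.5, next_state="B")
```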

Reinforcement Learning

Policy Gradient

Value-based methods struggle with continuous actions or stochastic policies. Directly differentiate expected return w.r.t. policy parameters.

Stochastic Gradients REINFORCE - Score Function Estimator

Reinforcement Learning

Monte-Carlo RL Methods

Model dynamics are unknown and bootstrapping introduces bias. Use complete episode returns for unbiased value estimates.

Dynamic Programming (RL) Multi-Armed Bandits

Reinforcement Learning

Model Free Reinforcement Learning

Learning the environment model is hard or unnecessary. Learn value functions or policies directly from interaction without modeling dynamics.

Reinforcement Learning Model Free Reinforcement Learning

Deep-Q-Network (DQN)

Q-learning with neural networks is unstable due to correlated samples and moving targets. Use experience replay and target networks for stability.

Loss Functions Model Free Reinforcement Learning Stochastic Gradient Descent
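
The two stabilizers named in the DQN entry can be sketched without a deep-learning library: a replay buffer that breaks sample correlation, and TD targets computed from a separate target network. `target_q` stands in for that network as any callable returning a list of action values; all names here are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def td_targets(batch, target_q, gamma=0.99):
    """y = r for terminal transitions, else r + gamma * max_a Q_target(s', a)."""
    return [r if done else r + gamma * max(target_q(s2)) for _, _, r, s2, done in batch]

buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.push(i, 0, 0.0, i + 1, False)
batch = buffer.sample(2)
targets = td_targets(batch, target_q=lambda s: [0.0, 2.0], gamma=0.5)
```

The online network would then be regressed toward `targets`, while `target_q` is refreshed only occasionally or by a slow moving average.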

Reinforcement Learning

Monte-Carlo Tree Search

Exhaustive game tree search is intractable for large state spaces. Use random simulations to selectively expand promising branches.

Dynamic Programming (RL) Multi-Armed Bandits

Reinforcement Learning Markov Decision Processes

Markov Reward Processes

How to evaluate expected long-term reward in a stochastic process without actions? Define value functions over Markov chains with rewards.

Markov Decision Processes

Reinforcement Learning Markov Decision Processes

Semi-Markov Decision Processes

Standard MDPs assume fixed time steps. Extend MDPs to handle actions with variable durations.

Markov Decision Processes

Reinforcement Learning

Model Based Reinforcement Learning

Model-free RL is sample-inefficient. Learn a model of environment dynamics and plan or generate synthetic experience from it.
