Reinforcement Learning

Linked

Intrinsically-Motivated Humans and Agents in Open-World Exploration

How do humans approach exploration? It turns out it's empowerment, with entropy early on.

Advantage Functions

Raw action values mix state quality with action quality. Subtract the state value baseline to isolate the advantage of each action.
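
The subtraction can be sketched for a single state as follows. This is a minimal illustration, not a reference implementation; the function name `advantages` and the list-based inputs are assumptions made for the example.

```python
def advantages(q_values, policy):
    """Return A(s, a) = Q(s, a) - V(s) for one state.

    q_values: action values Q(s, a), one entry per action.
    policy:   action probabilities pi(a | s), in the same order.
    """
    # V(s) is the policy-weighted average of the action values.
    v = sum(p * q for p, q in zip(policy, q_values))
    return [q - v for q in q_values]


adv = advantages([1.0, 2.0, 3.0], [0.5, 0.25, 0.25])
```

Under the policy's own action probabilities the advantages average to zero, which is what makes V(s) a natural baseline.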

Deep Q-Learning

Q-learning with tabular methods doesn't scale to large state spaces. Approximate the Q-function with a deep neural network.

Eligibility Trace

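One-step TD credits only the most recent state, which hurts most when the RL task is partially observable or needs memory of past states (non-Markov). Keep a decaying trace of recently visited states so each TD error updates all of them.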
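
One common concrete form of an eligibility trace is tabular TD(lambda) with accumulating traces. The sketch below assumes dictionary-based value and trace tables; the function name and parameters are illustrative, not taken from the note.

```python
def td_lambda_step(V, traces, s, r, s2, alpha=0.1, gamma=0.99, lam=0.9):
    """Apply one TD(lambda) update with accumulating traces, in place.

    V:      dict mapping state -> value estimate (must contain s and s2).
    traces: dict mapping state -> eligibility, carried between steps.
    """
    delta = r + gamma * V[s2] - V[s]      # one-step TD error
    for state in traces:                  # decay every existing trace
        traces[state] *= gamma * lam
    traces[s] = traces.get(s, 0.0) + 1.0  # mark the current state as eligible
    for state, e in traces.items():       # credit all recently visited states
        V[state] += alpha * delta * e


V = {0: 0.0, 1: 0.0, 2: 0.0}
traces = {}
td_lambda_step(V, traces, s=0, r=0.0, s2=1, alpha=1.0, gamma=1.0, lam=0.5)
td_lambda_step(V, traces, s=1, r=1.0, s2=2, alpha=1.0, gamma=1.0, lam=0.5)
```

After the second step, state 0 also receives part of the reward seen from state 1, scaled by its decayed trace.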

Partial Observability

Agent cannot observe the full environment state. Maintain a belief state or use memory to act under incomplete information.

Inverse Reinforcement Learning

The reward function is unknown but expert demonstrations are available. Infer the reward that explains observed expert behavior.

RLHF - Reinforcement Learning from Human Feedback

Specifying a reward function for complex tasks like language generation is intractable. Learn a reward model from human preferences and optimize with RL.
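
A reward model is often fit to pairwise comparisons with a Bradley-Terry style loss. The note does not say which loss is used, so treat this as one plausible choice; `preference_loss` and its scalar inputs are hypothetical names for illustration.

```python
import math


def preference_loss(r_chosen, r_rejected):
    """Return -log(sigmoid(r_chosen - r_rejected)) for one comparison.

    r_chosen / r_rejected: scalar reward-model scores for the response the
    human preferred and the one they rejected.
    """
    diff = r_chosen - r_rejected
    # log(1 + exp(-diff)), split by sign so exp() never overflows.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

The loss shrinks as the model scores the preferred response higher, and equals log 2 when the two scores tie.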

PPO - Proximal Policy Optimization

TRPO's constrained optimization is complex to implement. Use a clipped surrogate objective to approximate trust region behavior simply.
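
For a single sample the clipped surrogate reduces to a few lines. This is a sketch; `ratio` stands for pi_new(a|s) / pi_old(a|s), and the default `eps=0.2` is a commonly used value assumed here, not one stated in the note.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective (to be maximized).

    ratio:     probability ratio pi_new(a|s) / pi_old(a|s).
    advantage: advantage estimate for the sampled action.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum makes the objective a pessimistic bound: the policy
    # gains nothing from pushing the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage the gain is capped once the ratio exceeds 1 + eps; with a negative advantage a ratio that has already moved too far is still penalized in full.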

State Update Functions in Partially Observable MDP

In POMDPs the agent receives observations, not states. Maintain and update a belief distribution over hidden states.
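
For a discrete POMDP the belief update is a Bayes filter: predict with the transition model, then reweight by the observation likelihood. The table layouts `T[s][a][s2]` and `O[s2][a][o]` below are assumptions chosen for the sketch.

```python
def belief_update(belief, action, observation, T, O):
    """Return the posterior belief after taking `action` and seeing `observation`.

    belief: list of probabilities over hidden states.
    T:      T[s][a][s2] = P(s2 | s, a), the transition model.
    O:      O[s2][a][o] = P(o | s2, a), the observation model.
    """
    n = len(belief)
    # Prediction step: push the belief through the transition model.
    predicted = [sum(T[s][action][s2] * belief[s] for s in range(n))
                 for s2 in range(n)]
    # Correction step: weight by how likely the observation is in each state.
    unnormalized = [O[s2][action][observation] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]


# Two static hidden states, one action, a sensor that is right 80% of the time.
T = [[[1.0, 0.0]], [[0.0, 1.0]]]
O = [[[0.8, 0.2]], [[0.2, 0.8]]]
posterior = belief_update([0.5, 0.5], action=0, observation=0, T=T, O=O)
```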

Dyna-Q - Planning and Learning

Pure model-free RL is sample-inefficient. Interleave real experience with simulated experience from a learned model.
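
The interleaving can be sketched as one function called per real transition. It assumes a deterministic environment (the model stores a single outcome per state-action pair) and tabular Q-learning as the underlying update; names and defaults are illustrative.

```python
import random
from collections import defaultdict


def dyna_q_step(Q, model, s, a, r, s2, actions,
                alpha=0.1, gamma=0.95, n_planning=5, rng=random):
    """Process one real transition (s, a, r, s2), then run planning updates.

    Q:     defaultdict(float) keyed by (state, action).
    model: dict mapping (state, action) -> (reward, next_state).
    """
    def q_update(state, action, reward, next_state):
        best_next = max(Q[(next_state, b)] for b in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

    q_update(s, a, r, s2)        # direct RL from real experience
    model[(s, a)] = (r, s2)      # model learning (deterministic assumption)
    for _ in range(n_planning):  # planning from simulated experience
        ps, pa = rng.choice(list(model))
        pr, ps2 = model[(ps, pa)]
        q_update(ps, pa, pr, ps2)


Q = defaultdict(float)
model = {}
dyna_q_step(Q, model, s=0, a=0, r=1.0, s2=1, actions=[0], n_planning=1)
```

Each real step here yields two value updates, one from the environment and one replayed from the model, which is where the sample-efficiency gain comes from.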

Generalized Advantage Estimate

Monte Carlo advantage has low bias but high variance; TD has low variance but high bias. Exponentially weight multi-step TD errors to control the tradeoff.
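
The exponential weighting has a simple backward recursion. The sketch below assumes a single trajectory with no mid-trajectory episode boundaries, and that `values` carries one extra bootstrap entry for the state after the last reward.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: r_0 .. r_{T-1}.
    values:  V(s_0) .. V(s_T); the last entry is the bootstrap value
             (use 0.0 if s_T is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # discounted sum of TD errors
        advantages[t] = running
    return advantages
```

Setting `lam=0` recovers the one-step TD error, and `lam=1` recovers the Monte Carlo return minus the value baseline.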

Natural Policy Gradient

Vanilla policy gradient steps in parameter space distort the policy unevenly. Use the Fisher information to take equal-size steps in distribution space.

Prioritized Sweeping

Uniform random updates in model-based RL waste computation on low-priority states. Prioritize updates by expected value change.

TRPO - Trust-Region Policy Optimization

Policy gradient updates can be too large and collapse performance. Constrain the KL divergence between old and new policies for monotonic improvement.
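
The constrained quantity is the KL divergence from the old policy to the new one. The sketch below checks it for a single state with categorical action distributions; a full TRPO implementation averages over sampled states and solves a constrained step, which is omitted here. The threshold `max_kl=0.01` is an assumed example value.

```python
import math


def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    # Terms with p_i == 0 contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def within_trust_region(old_probs, new_probs, max_kl=0.01):
    """True if the proposed policy stays inside the KL trust region."""
    return kl_categorical(old_probs, new_probs) <= max_kl
```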

Bellman Equation and Value Functions

How to recursively define the expected long-term return from a state? Express value as immediate reward plus discounted future value.

Dynamic Programming (RL)

How to compute optimal policies when the full MDP model is known? Iteratively apply Bellman updates to converge to optimal value functions.
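
Value iteration is one such scheme. The sketch assumes the model is given as `P[s][a]`, a list of `(probability, next_state, reward)` triples; that layout is a convention chosen for the example.

```python
def value_iteration(P, gamma=0.9, tol=1e-8):
    """Return optimal state values for a known, finite MDP.

    P[s][a] is a list of (probability, next_state, reward) outcomes.
    """
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Bellman optimality backup: best expected one-step return.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


# State 0: action 0 stays put (reward 0), action 1 moves to state 1 (reward 1).
# State 1 is absorbing with zero reward.
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 0.0)]],
]
V_star = value_iteration(P)
```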

Incremental Implementation of Estimating Action Values

Storing all past rewards to compute action value averages is memory-inefficient. Use incremental running averages instead.
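
The running average needs only the current estimate and a count. A minimal sketch, with a hypothetical `RunningMean` class standing in for one action's estimate:

```python
class RunningMean:
    """Incremental sample average: Q_{n+1} = Q_n + (R_n - Q_n) / n."""

    def __init__(self):
        self.n = 0
        self.q = 0.0

    def update(self, reward):
        self.n += 1
        # Move the estimate toward the new reward by a 1/n step.
        self.q += (reward - self.q) / self.n
        return self.q


est = RunningMean()
for reward in [1.0, 2.0, 3.0]:
    est.update(reward)
```

Replacing the `1 / n` step with a constant step size gives a recency-weighted average, which is the usual choice for nonstationary rewards.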

Markov Decision Processes

How to mathematically formalize sequential decision-making? Define states, actions, transitions, and rewards with the Markov property.

Multi-Armed Bandits

How to balance exploring unknown options vs exploiting the best known option to maximize cumulative reward?
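
Epsilon-greedy is one simple answer to this tradeoff, shown here as an illustration; the note itself does not name a strategy, and the function and parameter names are assumptions.

```python
import random


def epsilon_greedy(q_estimates, epsilon, rng):
    """Pick an arm: explore uniformly with probability epsilon, else exploit.

    q_estimates: current value estimate for each arm.
    rng:         a random.Random instance, passed in for reproducibility.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))  # explore
    # Exploit: index of the highest current estimate (first one on ties).
    return max(range(len(q_estimates)), key=q_estimates.__getitem__)


rng = random.Random(0)
greedy_arm = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0, rng=rng)
```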

Off-policy learning with approximation

Off-policy learning with function approximation can diverge (the deadly triad). Use importance sampling corrections or gradient methods for stability.

On-policy learning with approximation

Tabular value functions don't scale to large state spaces. Use function approximation to generalize values across similar states on-policy.

PGT Actor-Critic

REINFORCE has high variance from using full returns. Use a learned value function (critic) as baseline to reduce policy gradient variance.

REINFORCE - Monte Carlo Policy Gradient

How to optimize a policy when environment dynamics are unknown? Use sampled returns to estimate the policy gradient via Monte Carlo rollouts.
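
The two ingredients can be sketched separately: discounted returns from a sampled episode, and the score-function gradient. The gradient below assumes a softmax policy over a single vector of logits that does not depend on the state, which keeps the example self-contained; a real policy would condition on the state.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step of an episode."""
    g = 0.0
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_logit_gradient(logits, action, ret):
    """G_t * grad log pi(action) with respect to the logits of a softmax policy."""
    probs = softmax(logits)
    # d log pi(a) / d logit_i = 1[i == a] - pi(i)
    return [ret * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(probs)]
```

Ascending this gradient raises the probability of actions that were followed by high returns and lowers it for the others.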

Reinforcement Learning Problem Setup

How to formalize sequential decision-making under uncertainty? Define agents, environments, states, actions, and rewards.

Temporal Difference Learning

Monte Carlo methods require waiting until episode end to update. Bootstrap from current value estimates to learn online from incomplete episodes.
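
The bootstrapped update for state values, TD(0), fits in a few lines. A sketch assuming a dictionary value table; the `terminal` flag is an addition for the example so that terminal states are not bootstrapped from.

```python
def td0_update(V, s, r, s2, alpha=0.1, gamma=0.99, terminal=False):
    """Update V[s] in place from one transition and return the new estimate."""
    # Bootstrap from the current estimate of the next state unless it is terminal.
    target = r + (0.0 if terminal else gamma * V[s2])
    V[s] += alpha * (target - V[s])
    return V[s]


V = {0: 0.0, 1: 1.0}
new_value = td0_update(V, s=0, r=0.5, s2=1, alpha=0.5, gamma=1.0)
```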

Policy Gradient

Value-based methods struggle with continuous actions or stochastic policies. Directly differentiate expected return w.r.t. policy parameters.

Monte-Carlo RL Methods

Model dynamics are unknown and bootstrapping introduces bias. Use complete episode returns for unbiased value estimates.

Model Free Reinforcement Learning

Learning the environment model is hard or unnecessary. Learn value functions or policies directly from interaction without modeling dynamics.

Deep-Q-Network (DQN)

Q-learning with neural networks is unstable due to correlated samples and moving targets. Use experience replay and target networks for stability.
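
Both stabilizers can be sketched without a neural network library. The buffer below stores transitions and samples them uniformly to break correlation; `td_target` computes the regression target from Q-values that are assumed to come from a separate, periodically synced target network. Class and function names are illustrative.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling decorrelates consecutive transitions.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(r, done, next_q_target, gamma=0.99):
    """Target for Q(s, a), using next-state Q-values from the target network."""
    return r if done else r + gamma * max(next_q_target)


buf = ReplayBuffer(capacity=2)
for i in range(3):
    buf.push(i, 0, 1.0, i + 1, False)
```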

Monte-Carlo Tree Search

Exhaustive game tree search is intractable for large state spaces. Use random simulations to selectively expand promising branches.
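
The selection step is commonly implemented with the UCB1 rule (the UCT variant of MCTS). The note does not specify a selection rule, so this is an assumed choice, and the `(total_value, visits)` child representation is hypothetical.

```python
import math


def ucb1(total_value, visits, parent_visits, c=math.sqrt(2)):
    """Upper confidence bound for one child node."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """Return the index of the child with the highest UCB1 score.

    children: list of (total_value, visits) pairs.
    """
    return max(range(len(children)),
               key=lambda i: ucb1(children[i][0], children[i][1], parent_visits))
```

The exploration term shrinks as a child accumulates visits, so simulation effort drifts toward branches whose average outcome stays high.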

Markov Reward Processes

How to evaluate expected long-term reward in a stochastic process without actions? Define value functions over Markov chains with rewards.
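
With no actions, the value function satisfies V = R + gamma * P V and can be found by repeated substitution. A sketch assuming a row-stochastic transition matrix `P` and a per-state expected reward vector `R`:

```python
def mrp_values(P, R, gamma=0.9, sweeps=1000):
    """Approximate state values of a Markov reward process by fixed-point iteration.

    P: P[s][s2] = probability of moving from s to s2.
    R: R[s]     = expected immediate reward on leaving state s.
    """
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        # Synchronous backup: every state uses the previous sweep's values.
        V = [R[s] + gamma * sum(P[s][s2] * V[s2] for s2 in range(n))
             for s in range(n)]
    return V


# State 0 pays 1 and moves to state 1; state 1 is absorbing and pays nothing.
chain_values = mrp_values([[0.0, 1.0], [0.0, 1.0]], [1.0, 0.0])
```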

Semi-Markov Decision Processes

Standard MDPs assume fixed time steps. Extend MDPs to handle actions with variable durations.

Model Based Reinforcement Learning

Model-free RL is sample-inefficient. Learn a model of environment dynamics and plan or generate synthetic experience from it.