Reinforcement Learning
How to explore the RL search space better than intrinsic motivation does.
How do humans approach exploration? Turns out it's empowerment, with entropy early on.
Avoid learning an explicit value function in an RL alignment setup.
Raw action values mix state quality with action quality. Subtract the state value baseline to isolate the advantage of each action.
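The baseline subtraction in the note above can be sketched for a single state. This is a minimal illustration, not a reference implementation: the function name `advantages`, the inputs `q_values` and `policy_probs`, and the numbers are all hypothetical, and the state value is assumed to be the policy-weighted average of the action values.

```python
def advantages(q_values, policy_probs):
    """Return A(s, a) = Q(s, a) - V(s) for one state.

    Assumption: V(s) is the policy-weighted average of Q(s, a).
    """
    state_value = sum(p * q for p, q in zip(policy_probs, q_values))
    return [q - state_value for q in q_values]


# Hypothetical numbers: every action looks good only because the state is good.
q = [10.0, 12.0, 11.0]
pi = [0.25, 0.5, 0.25]
adv = advantages(q, pi)  # V(s) = 11.25, so only the second action is above baseline
```

Under the same policy, the advantages average to zero, which is what makes them a measure of action quality relative to the state.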
Tabular Q-learning doesn't scale to large state spaces. Approximate the Q-function with a deep neural network.
The RL task is partially observable or needs memory of past states (non-Markov).
Agent cannot observe the full environment state. Maintain a belief state or use memory to act under incomplete information.
The reward function is unknown but expert demonstrations are available. Infer the reward that explains observed expert behavior.
Specifying a reward function for complex tasks like language generation is intractable. Learn a reward model from human preferences and optimize with RL.
TRPO's constrained optimization is complex to implement. Use a clipped surrogate objective to approximate trust region behavior simply.
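A per-sample sketch of the clipped surrogate objective mentioned above. The function name `clipped_surrogate` and the default `clip_eps=0.2` are assumptions for illustration; a real implementation would average this over a batch and differentiate it with an autodiff library.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sample (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); clip_eps is an assumed default.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum keeps the objective pessimistic: the policy gains
    # nothing from pushing the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = clipped_surrogate(1.5, 2.0)     # positive advantage, ratio too high: capped
loss_uncapped = clipped_surrogate(1.5, -1.0)  # negative advantage: penalty is not capped
```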
In POMDPs the agent receives observations, not states. Maintain and update a belief distribution over hidden states.
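The belief update described above can be sketched as a discrete Bayes filter. The function name, the nested-list layout of `transition` and `observation_model`, and the two-state "listen" example with 85% sensor accuracy are all assumptions made for illustration.

```python
def update_belief(belief, action, observation, transition, observation_model):
    """One Bayes-filter step over a discrete hidden state.

    Assumed layout:
      transition[a][s][s2]        = P(s2 | s, a)
      observation_model[a][s2][o] = P(o | s2, a)
    """
    n = len(belief)
    # Predict: push the current belief through the transition model.
    predicted = [
        sum(belief[s] * transition[action][s][s2] for s in range(n))
        for s2 in range(n)
    ]
    # Correct: weight each state by the likelihood of the observation.
    unnormalized = [
        observation_model[action][s2][observation] * predicted[s2]
        for s2 in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]


# Hypothetical two-state problem: action 0 ("listen") leaves the hidden state
# unchanged and reports it correctly 85% of the time.
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[[0.85, 0.15], [0.15, 0.85]]]
b0 = [0.5, 0.5]
b1 = update_belief(b0, 0, 0, T, O)
b2 = update_belief(b1, 0, 0, T, O)  # a second matching observation sharpens the belief
```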
Pure model-free RL is sample-inefficient. Interleave real experience with simulated experience from a learned model.
Monte Carlo advantage has low bias but high variance; TD has low variance but high bias. Exponentially weight multi-step TD errors to control the tradeoff.
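The exponential weighting of multi-step TD errors can be sketched as a backward pass over one episode. The function name `gae`, the defaults for `gamma` and `lam`, and the convention that `values` carries one extra bootstrap entry are assumptions for this sketch.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially weighted advantage estimates for one episode.

    Assumption: values has len(rewards) + 1 entries; the last is the
    bootstrap value of the final state (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Accumulate discounted, lambda-weighted TD errors from the future.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
td_like = gae(rewards, values, gamma=1.0, lam=0.0)  # reduces to one-step TD errors
mc_like = gae(rewards, values, gamma=1.0, lam=1.0)  # reduces to return minus value
```

Setting `lam` between 0 and 1 interpolates between the two extremes, trading bias for variance.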
Vanilla policy gradient steps in parameter space distort the policy unevenly. Use the Fisher information to take equal-size steps in distribution space.
Uniform random updates in model-based RL waste computation on low-priority states. Prioritize updates by expected value change.
Policy gradient updates can be too large and collapse performance. Constrain the KL divergence between old and new policies for monotonic improvement.
How to recursively define the expected long-term return from a state? Express value as immediate reward plus discounted future value.
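Written out, the recursion above takes the following form for a fixed policy. The symbols are assumptions not given in the note: \pi is the policy, p the environment dynamics, r the reward, and \gamma the discount factor.

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V^{\pi}(s') \right]
```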
How to compute optimal policies when the full MDP model is known? Iteratively apply Bellman updates to converge to optimal value functions.
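One way to apply Bellman updates iteratively is value iteration. The sketch below assumes a hypothetical model layout in which `transitions[s][a]` is a list of `(prob, next_state, reward)` tuples and a state with no actions is terminal; the three-state chain is made up for illustration.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Return optimal state values for a fully known, finite MDP.

    Assumption: transitions[s][a] is a list of (prob, next_state, reward)
    tuples, and a state with an empty action list is terminal.
    """
    values = [0.0] * len(transitions)
    while True:
        max_change = 0.0
        for s, actions in enumerate(transitions):
            if not actions:
                continue  # terminal state keeps value 0
            # Bellman optimality backup: best expected one-step lookahead.
            best = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions
            )
            max_change = max(max_change, abs(best - values[s]))
            values[s] = best
        if max_change < tol:
            return values


# Hypothetical chain: state 0 can advance or stay, state 1 advances for
# reward 1, state 2 is terminal.
mdp = [
    [[(1.0, 1, 0.0)], [(1.0, 0, 0.0)]],
    [[(1.0, 2, 1.0)]],
    [],
]
optimal_values = value_iteration(mdp)
```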
Storing all past rewards to compute action value averages is memory-inefficient. Use incremental running averages instead.
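The incremental form keeps only a count and the current mean. A minimal sketch, with the class name `RunningMean` chosen for illustration:

```python
class RunningMean:
    """Sample average that stores only a count and the current estimate."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, reward):
        # new_mean = old_mean + (reward - old_mean) / n
        self.count += 1
        self.mean += (reward - self.mean) / self.count
        return self.mean


estimate = RunningMean()
for r in [1.0, 0.0, 2.0, 5.0]:
    estimate.update(r)
# estimate.mean now equals the plain average of the four rewards
```

Replacing `1 / self.count` with a constant step size would weight recent rewards more heavily, which can help when reward distributions drift.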
How to mathematically formalize sequential decision-making? Define states, actions, transitions, and rewards with the Markov property.
How to balance exploring unknown options vs exploiting the best known option to maximize cumulative reward?
Off-policy learning with function approximation can diverge (the deadly triad). Use importance sampling corrections or gradient methods for stability.
Tabular value functions don't scale to large state spaces. Use on-policy function approximation to generalize values across similar states.
REINFORCE has high variance from using full returns. Use a learned value function (critic) as baseline to reduce policy gradient variance.
How to optimize a policy when environment dynamics are unknown? Use sampled returns to estimate the policy gradient via Monte Carlo rollouts.
How to formalize sequential decision-making under uncertainty? Define agents, environments, states, actions, and rewards.
Monte Carlo methods require waiting until episode end to update. Bootstrap from current value estimates to learn online from incomplete episodes.
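Bootstrapping from the current estimate can be sketched as a single tabular TD(0) update. The function name, the dictionary-based value table, and the default `alpha` and `gamma` are assumptions for illustration.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.99, done=False):
    """Apply one TD(0) update in place and return the new estimate.

    The target uses the current estimate of the next state instead of
    waiting for the full episode return.
    """
    bootstrap = 0.0 if done else gamma * values[next_state]
    target = reward + bootstrap
    values[state] += alpha * (target - values[state])
    return values[state]


# Hypothetical two-state table, updated after a single observed transition.
V = {"A": 0.0, "B": 1.0}
new_estimate = td0_update(V, "A", reward=0.5, next_state="B", alpha=0.5, gamma=1.0)
```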
Value-based methods struggle with continuous actions or stochastic policies. Directly differentiate expected return w.r.t. policy parameters.
Model dynamics are unknown and bootstrapping introduces bias. Use complete episode returns for unbiased value estimates.
Learning the environment model is hard or unnecessary. Learn value functions or policies directly from interaction without modeling dynamics.
Q-learning with neural networks is unstable due to correlated samples and moving targets. Use experience replay and target networks for stability.
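The experience replay half of this note can be sketched with the standard library alone. The class name `ReplayBuffer` and the transition tuple layout are assumptions; the target network half is omitted because it depends on the neural network library in use.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.storage = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the temporal correlation between
        # consecutive transitions.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(step, 0, 1.0, step + 1, False)
batch = buffer.sample(2)
```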
Exhaustive game tree search is intractable for large state spaces. Use random simulations to selectively expand promising branches.
How to evaluate expected long-term reward in a stochastic process without actions? Define value functions over Markov chains with rewards.
Standard MDPs assume fixed time steps. Extend MDPs to handle actions with variable durations.
Model-free RL is sample-inefficient. Learn a model of environment dynamics and plan or generate synthetic experience from it.