Go-Explore

Rewards can be sparse and even deceptive, constituting "hard-exploration" problems.

Many important tasks require effective exploration to be solved, i.e. to explore and learn about the world even when rewards are sparse or deceptive.

Deceptive-reward problems are even harder, because instead of feedback rarely being provided, the reward function actually provides misleading feedback for reaching the overall global objective, which can lead to getting stuck on local optima.

The go-to approach has been to construct an intrinsic-motivation reward signal to encourage exploration.

Prior to Go-Explore, the typical approach to sparse reward problems has been intrinsic motivation (IM) [4, 9–11], which supplies the RL agent with intrinsic rewards (IRs) that encourage exploration (augmenting or replacing the extrinsic reward that comes from the environment).

IM is often motivated by psychological concepts such as curiosity [12, 13] or novelty-seeking [7, 14], which play a role in how humans explore and learn.

They hypothesize that intrinsic-motivation algorithms are disappointing because of "detachment" and "derailment".

Detachment is the idea that an agent driven by IM could become detached from the frontiers of high intrinsic reward.

Once it has exhausted that IR, it is difficult for it to rediscover the frontier it detached from in the initial area, because it has already consumed the IR that led to that frontier (Fig. 1), and it likely will not remember how to return to that frontier due to catastrophic forgetting.

Replay buffers could help, but it is hard to retain data that may be needed in the future but not right now.

In theory a replay buffer could prevent detachment, but in practice it would have to be large to prevent data about the abandoned frontier from being purged before it becomes needed, and large replay buffers introduce their own optimization stability difficulties.

Derailment is when a policy can't return to a promising state to explore from there.

Derailment can occur when an agent has discovered a promising state and it would be beneficial to return to that state and explore from it. However, the longer, more complex, and more precise a sequence of actions needs to be in order to reach a previously-discovered high-IR state, the more likely it is that such stochastic perturbations will “derail” the agent from ever returning to that state. That is because the needed precise actions are naively perturbed by the basic exploration mechanism, causing the agent to only rarely succeed in reaching the known state to which it is drawn, and from which further exploration might be most effective.

To address derailment, an insight in Go-Explore is that effective exploration can be decomposed into first returning to a promising state (without intentionally adding any exploration) before then exploring further.

Go-Explore solves both of these problems explicitly.

Go-Explore is an explicit response to both detachment and derailment that is also designed to achieve robust solutions in stochastic environments.

The fundamental idea of Go-Explore is simple, so it is best thought of as a family of algorithms rather than a single method.

The insight that remembering and returning reliably to promising states is fundamental to effective exploration in sparse-reward problems is at the core of Go-Explore.

Two phases, although the crux of it comes in Phase 1, where everything is made deterministic so that returning to the exact same state is possible.

Similar to IM algorithms, Phase 1 focuses on exploring infrequently visited states, which forms the basis for dealing with sparse-reward and deceptive problems. In contrast to IM algorithms, Phase 1 addresses detachment and derailment by accumulating an archive of states and ways to reach them through two strategies: (a) add all interestingly different states visited so far into the archive, and (b) each time a state from the archive is selected to explore from, first Go back to that state (without adding exploration), and then Explore further from that state in search of new states (hence the name “Go-Explore”).
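The go-then-explore loop can be sketched end to end on a toy problem. Everything below is a stand-in for illustration, not the paper's implementation: `ChainEnv` is a made-up deterministic environment, cells are just positions, selection is uniform instead of scored, and returning is done by replaying actions.

```python
import random

class ChainEnv:
    """Tiny deterministic environment: walk left/right on a number line;
    the only reward (+1) is at position 10."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):          # action is -1 or +1
        self.pos += action
        return self.pos, (1.0 if self.pos == 10 else 0.0)

def phase1(env, iterations=200, k=10, seed=0):
    """Skeletal Phase 1: (a) archive every new cell (here, cell = position),
    (b) Go back to an archived cell by deterministically replaying its stored
    action sequence, then Explore with random actions."""
    rng = random.Random(seed)
    env.reset()
    archive = {0: {"traj": [], "score": 0.0}}    # cell -> best known way to reach it
    for _ in range(iterations):
        cell = rng.choice(sorted(archive))       # uniform stand-in for the scored selection
        env.reset()
        score = 0.0
        for a in archive[cell]["traj"]:          # Go: replay verbatim, no exploration
            _, r = env.step(a)
            score += r
        traj = list(archive[cell]["traj"])
        for _ in range(k):                       # Explore: k random actions
            a = rng.choice((-1, 1))
            obs, r = env.step(a)
            traj.append(a)
            score += r
            best = archive.get(obs)
            if (best is None or score > best["score"]
                    or (score == best["score"] and len(traj) < len(best["traj"]))):
                archive[obs] = {"traj": list(traj), "score": score}
    return archive
```

Note that the "Go" step replays the stored trajectory with no exploration noise at all, which is exactly what prevents derailment.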

Bombshell — Phase 1 doesn't even use neural networks, just purely random exploration!!!

Interestingly, such exploration does not require a neural network or other controller, and indeed no neural network was used for the exploration phase (Phase 1) in any of the experiments in this paper (we do not train a neural network until Phase 2). The fact that entirely random exploration works so well highlights the surprising power of simply returning to promising cells before exploring further.

Phase 2 mostly serves to make the policy robust, so that it works in the unmodified stochastic environment.

If necessary, the second phase of Go-Explore robustifies high-performing trajectories from the archive such that they are robust to the stochastic dynamics of the true environment. Go-Explore robustifies via imitation learning (aka learning from demonstrations or LfD [26–29]), a technique that learns how to solve a task from human demonstrations. The only difference with Go-Explore is that the solution demonstrations are produced automatically by Phase 1 of Go-Explore instead of being provided by humans. The input to this phase is one or more high-performing trajectories, and the output is a robust policy able to consistently achieve similar performance. The combination of both phases instantiates a powerful algorithm for hard-exploration problems, able to deeply explore sparse- and deceptive-reward environments and robustify high-performing trajectories into reliable solutions that perform well in the unmodified, stochastic test environment.

Two hard-exploration benchmarks are Montezuma's Revenge and Pitfall, both Atari games.

Montezuma’s Revenge has become an important benchmark for exploration algorithms (including intrinsic motivation algorithms) [4, 16, 32–39] because precise sequences of hundreds of actions must be taken in between receiving rewards. Pitfall is even harder because its rewards are sparser (only 32 positive rewards are scattered over 255 rooms) and because many actions yield small negative rewards that dissuade RL algorithms from exploring the environment.

Constant negative rewards are especially pernicious:

On Pitfall, the lack of positive rewards and frequent negative rewards causes RL algorithms to learn a policy that effectively does nothing, either standing completely still or moving back and forth near the start of the game

Low-dimensional Representation

A low-dimensional cell representation is important for maintaining a diverse archive. They use simple representations.

To be tractable in high-dimensional state spaces like Atari, Phase 1 of Go-Explore needs a lower-dimensional space within which to search (although the final policy will still play in the same original state space, in this case pixels). Thus, the cell representation should conflate similar states while not conflating states that are meaningfully different. In this way, a good cell representation should reduce the dimensionality of the observations into a meaningful low-dimensional space.

One was just image downsampling:

We found that a very simple dimensionality reduction procedure produces surprisingly good results on Montezuma’s Revenge. The main idea is simply to downsample the current game frame.

(1) convert each game frame image to grayscale, (2) downscale it to an 11 × 8 image with area interpolation (i.e. using the average pixel value in the area of the downsampled pixel), and (3) rescale pixel intensities so that they are integers between 0 and 8, instead of the original 0 to 255.
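A minimal sketch of that cell function in plain Python. The binning below approximates area interpolation by index (true area interpolation weights boundary pixels fractionally, and 210 × 160 Atari frames do not divide evenly into 8 × 11 cells), so treat it as an illustration of the idea rather than the paper's exact procedure:

```python
def downsample_cell(frame, out_h=8, out_w=11, levels=8):
    """Map a grayscale frame (list of rows, pixel values 0-255) to a coarse
    cell: average the source pixels falling in each of the out_h x out_w
    target cells, then quantize the averages to integers 0..levels."""
    h, w = len(frame), len(frame[0])
    sums = [[0.0] * out_w for _ in range(out_h)]
    counts = [[0] * out_w for _ in range(out_h)]
    for y in range(h):
        ty = y * out_h // h              # target row this source row falls into
        for x in range(w):
            tx = x * out_w // w          # target column this source pixel falls into
            sums[ty][tx] += frame[y][x]
            counts[ty][tx] += 1
    # Return a hashable tuple-of-tuples so it can serve as an archive key.
    return tuple(
        tuple(int(round(sums[ty][tx] / counts[ty][tx] * levels / 255))
              for tx in range(out_w))
        for ty in range(out_h))
```

Because the result is hashable, it can be used directly as a dictionary key in the archive, and many visually similar frames conflate into the same cell.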

The next representation uses actual state information extracted from the pixels (domain knowledge).

In Montezuma’s Revenge, domain knowledge is provided as unique combinations of the x, y position of the agent (discretized into a grid in which each cell is 16 × 16 pixels), room number, level number, and in which rooms the currently-held keys were found. In the case of Pitfall, only the x, y position of the agent and the room number were used. All this information was extracted directly from pixels with simple hand-coded classifiers to detect objects such as the main character’s location combined with our knowledge of the map structure in the two games
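As a sketch, such a domain-knowledge cell can be as simple as a hashable tuple of those attributes (`make_cell` and its argument names are hypothetical, not the paper's code):

```python
def make_cell(x, y, room, level, key_rooms):
    """Hypothetical cell key for Montezuma's Revenge: discretize the agent's
    pixel position into a 16x16 grid and combine it with the room number,
    level number, and the rooms where currently-held keys were found."""
    return (x // 16, y // 16, room, level, frozenset(key_rooms))
```

Nearby positions map to the same cell, so the archive stays tractable while still distinguishing meaningfully different game situations (e.g. holding a different set of keys).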

But final policy still plays directly from the pixels apparently.

While Go-Explore provides the opportunity to leverage domain knowledge in the cell representation in Phase 1, the robustified neural network produced by Phase 2 still plays directly from pixels only.

Cell Selection

Each cell is given a score, composed of several subscores, which determines its sampling probability.

Cells are selected at each iteration by first assigning them a score, which is then normalized across all cells in the archive, yielding the probability of each cell being selected. The score of a cell is the sum of separate subscores.

Counts of interactions with each cell are used heavily.

The count subscores are computed from attributes that represent the number of times a cell was interacted with in different ways. Specifically: the number of times a cell has already been chosen (i.e. selected as a cell to explore from), the number of times a cell was visited at any point during the exploration phase, and the number of times a cell has been chosen since exploration from it last produced the discovery of a new or better cell. In the case of each of these attributes, a lower count likely indicates a more promising cell to explore from.

$$ \operatorname{CntScore}(c, a)=w_a \cdot\left(\frac{1}{v(c, a)+\varepsilon_1}\right)^{p_a}+\varepsilon_2 $$

Here $c$ is the cell for which we are calculating the score, $v(c, a)$ is a function that returns the value of attribute $a$ for cell $c$, $w_a$ is the weight hyperparameter for attribute $a$, and $p_a$ is the power hyperparameter for attribute $a$. $\varepsilon_1$ helps prevent division by 0 and determines the relative weight of cells for which a given value is $0$. $\varepsilon_2$ helps guarantee that no cell ever has a 0 probability of being chosen. In our implementation, $\varepsilon_1=0.001$ and $\varepsilon_2=0.00001$, which we chose after preliminary experiments showed that they worked well.
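The formula translates directly into code (using the paper's $\varepsilon$ values; `v` here stands for $v(c, a)$, the cell's count for the attribute):

```python
EPS1, EPS2 = 0.001, 0.00001   # epsilon values from the paper

def cnt_score(v, w_a, p_a):
    """CntScore(c, a) for one count attribute: v is v(c, a), the cell's
    count for attribute a; w_a and p_a are its weight and power
    hyperparameters."""
    return w_a * (1.0 / (v + EPS1)) ** p_a + EPS2
```

Less-visited cells score higher: with `w_a = 1.0` and `p_a = 0.5`, a count of 0 scores about 31.6, while a count of 100 scores about 0.1.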

Neighborhood information is also used in experiments that use "domain knowledge". If a cell lacks neighbors in the archive, it is likely at the frontier of exploration.

When cell representations are informed by domain knowledge (Section 3.1.2), giving us the x, y position of the agent, it is possible to determine the possible neighbors of given cells, and whether these neighbors are already present in the archive. For those cases, we define a set of neighbor subscores.

Each neighbor subscore is defined as $w_n$ if neighbor $n$ does not exist in the archive, and is 0 otherwise. The motivation behind these neighbor subscores is that cells that are lacking neighbors are likely at the edge of the current frontier of knowledge and are thus more likely to yield new cells.

$$ \operatorname{NeighScore}(c, n)=w_n \cdot(1-\operatorname{HasNeighbor}(c, n)) $$

We consider 3 types of neighbors: vertical (2 neighbors), horizontal (2 neighbors), and (in the case of Montezuma’s Revenge) cells that are in the same level, room and x, y position, but are holding a larger number of keys (the intuition is that if a cell lacks a “more keys” neighbor, then it is the cell that is most capable of opening doors from its location).
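A sketch of the summed neighbor subscores; `neighbors_of` is a hypothetical helper that enumerates a cell's potential neighbors (for the x, y part, the four vertically and horizontally adjacent grid positions):

```python
def neigh_score_total(archive, cell, neighbors_of, w_n=1.0):
    """Sum of NeighScore(c, n): add w_n for every potential neighbor of
    `cell` that is missing from the archive (missing neighbors suggest the
    cell sits at the frontier of exploration)."""
    return sum(w_n for n in neighbors_of(cell) if n not in archive)

def grid_neighbors(cell):
    # Vertical and horizontal neighbors of an (x, y) grid cell.
    x, y = cell
    return [(x, y - 1), (x, y + 1), (x - 1, y), (x + 1, y)]
```

A cell surrounded by archived neighbors contributes nothing here, while a frontier cell with three missing neighbors gets a 3-times-larger bonus.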

There's also a "Level Weight" that favors progress in the furthest level reached.

Finally, in the case of Montezuma’s Revenge with domain knowledge, cells are exponentially downweighted based on the distance to the maximum level currently reached, thereby favoring progress in the furthest level reached, while still keeping open the possibility of improving previous levels’ trajectories:

$$ \operatorname{LevelWeight}(c)=0.1^{\operatorname{MaxLevel}-\operatorname{Level}(c)} $$

The final score is given as follows, and the selection probability is obtained by normalizing across all cells.
$$ \operatorname{CellScore}(c)=\operatorname{LevelWeight}(c) \cdot\left[\left(\sum_n \operatorname{NeighScore}(c, n)\right)+\left(\sum_a \operatorname{CntScore}(c, a)\right)+1\right] $$

Hyperparameters (the different values of $w_a$, $p_a$ and $w_n$) were found through separate grid searches on each game.
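Putting the pieces together, a sketch of the final score and its normalization into selection probabilities (the subscore totals are passed in precomputed; names are illustrative):

```python
def level_weight(level, max_level):
    # Exponentially downweight cells far from the furthest level reached.
    return 0.1 ** (max_level - level)

def cell_score(level, neigh_total, cnt_total, max_level):
    """CellScore(c) = LevelWeight(c) * (sum of NeighScores + sum of CntScores + 1)."""
    return level_weight(level, max_level) * (neigh_total + cnt_total + 1.0)

def selection_probs(scores):
    """Normalize CellScores across the archive into selection probabilities."""
    total = sum(scores.values())
    return {cell: s / total for cell, s in scores.items()}
```

With `max_level = 2`, a level-2 cell whose subscore totals are 1.0 and 0.5 scores 2.5, while an otherwise identical level-1 cell scores only 0.25, so cells in the furthest level dominate selection without earlier levels ever reaching probability zero.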

Returning to Cells

A major requirement for Go-Explore to work is that determinism during training is allowed, which has proved somewhat controversial.

The easiest way to return to a cell is if the world is deterministic and resettable, such that one can reset the state of the simulator to a previous visit to that cell. Whether performing such resets is allowable for RL research is an interesting subject of debate that was motivated by the initial announcement of Go-Explore [55].

They say it's fine, which I agree with. Purists will always find something else to nag about. Phase 2 is designed to produce the final stochasticity-friendly policy anyway.

An insight of Go-Explore is that we can take advantage of the fact that such simulators can be made deterministic to improve performance, especially on hard-exploration problems. For many types of problems, we want a reliable final solution (e.g. a robot that reliably finds survivors after a natural disaster) and there is no principled reason to care whether we obtain this solution via initially deterministic training

But what to do when this is simply not possible? They mention goal-conditioned policies, but leave it to future work.

There are also cases where a simulator is not available and where learning algorithms must confront stochasticity during training. To create and test algorithms for this second type of problem, we cannot exploit determinism and resettability. Examples of this class of problems include when we must learn directly in the real world (and an effective simulator is not available and cannot be learned), or when studying the learning of biological animals, including ourselves. We believe Go-Explore can handle such situations by training goal-conditioned policies [62, 63] that reliably return to cells in the archive during the exploration phase, which is an interesting area for future research.

But sometimes it's simply impossible.

We note that there are some problems where the environment has forms of stochasticity that prevent the algorithm from reliably returning to a particular cell, regardless of which action the agent takes (e.g. in poker, there is no sequence of actions that reliably leads you to a state where you have two aces).

But when possible, simply save the state of the simulator and save yourself a ton of compute.

For the experiments in this paper, because we harness deterministic training, we could return to a cell by storing the sequence of actions that leads to it and subsequently replaying those actions. However, simply saving the state of the emulator (in addition to this sequence of steps) and restoring that state when revisiting a cell gains additional efficiency. Doing so reduced the number of steps that needed to be simulated by at least one order of magnitude.
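Schematically, the two ways of returning look like this; `ToyEnv` is a stand-in whose entire state is one integer, not an Atari emulator:

```python
import copy

class ToyEnv:
    """Stand-in environment whose entire state is one integer position."""
    def reset(self):
        self.pos = 0
    def step(self, action):
        self.pos += action
    def set_state(self, state):
        self.pos = state

def return_to_cell(env, entry):
    """Return to an archived cell: restore a saved simulator state if one was
    stored (a single cheap operation), otherwise replay the stored action
    sequence from the start (one simulator step per action)."""
    if entry.get("sim_state") is not None:
        env.set_state(copy.deepcopy(entry["sim_state"]))
    else:
        env.reset()
        for action in entry["traj"]:
            env.step(action)
```

Restoring is a constant-time operation, while replaying costs as many simulator steps as the trajectory is long, which is where the order-of-magnitude savings come from.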

Exploration from Cells

Purely random exploration. Crazy, I know.

In this work the agent explores by taking random actions for k = 100 training frames, with a 95% probability of repeating the previous action at each training frame (frames at which the agent is allowed to take an action, thus not including any frames skipped due to frame skip, see Appendix A.1). Besides reaching the k = 100 training frame limit for exploration, exploration is also aborted at the episode’s end (defined in Appendix A.2), and the action that led to the episode ending is ignored because it does not produce a destination cell.
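The sticky random-action scheme is easy to reproduce (a sketch; 18 is the size of the full Atari action set, used here as an illustrative default):

```python
import random

def exploration_actions(n_actions=18, k=100, repeat_prob=0.95, rng=random):
    """Sample k exploration actions, repeating the previous action with 95%
    probability at each training frame, as described above."""
    actions = [rng.randrange(n_actions)]
    for _ in range(k - 1):
        if rng.random() < repeat_prob:
            actions.append(actions[-1])      # sticky: repeat previous action
        else:
            actions.append(rng.randrange(n_actions))
    return actions
```

The repetition produces long straight-line motions, which cover far more ground per exploration burst than independently sampled uniform actions would.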

They think exploring with a policy would likely improve results, which seems so obvious that it makes me wonder why they didn't try it.

though we believe exploring intelligently (e.g. via a trained policy) would likely improve our results and is an interesting avenue for future work

Updating the Archive

Update an archive cell when it's empty or when a "better" trajectory is found, i.e. one with a higher score, or the same score in fewer steps.

The archive updates in two conditions. The first condition is when the agent visits a cell that was not yet in the archive. The second condition is when a newly-encountered trajectory is “better” than that belonging to a cell already in the archive. For the experiments below, we define a new trajectory as better than an existing trajectory when the new trajectory either has a higher cumulative score or when it is a shorter trajectory with the same score. In either case, the existing cell in the archive is updated with the new trajectory, the new trajectory length, the new environment state, and the new score.

Metadata stored:

(1) how the agent got to that cell (here, a full trajectory from the starting state to that cell), (2) the state of the environment at the time of discovering the cell (if the environment supports such an operation, which is true for the two Atari-game domains in this paper), (3) the cumulative score of that trajectory, and (4) the length of that trajectory.
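The update rule and the stored metadata fit in a few lines (a sketch; the field names are illustrative):

```python
def update_archive(archive, cell, traj, score, env_state=None):
    """Add `cell` if new, or replace its entry when the new trajectory is
    better: a higher cumulative score, or an equal score in fewer steps.
    Visit counters (not shown here) would be incremented, never reset."""
    best = archive.get(cell)
    if (best is None or score > best["score"]
            or (score == best["score"] and len(traj) < len(best["traj"]))):
        archive[cell] = {"traj": list(traj),      # how the agent got there
                         "score": score,          # cumulative score of that trajectory
                         "env_state": env_state}  # emulator state, if supported
```

Trajectory length doubles as the fourth metadata item, since it is just `len(traj)`.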

Counters are not reset apparently:

We do not reset the counter that records the number of times the cell has been visited because that would make recently discovered cells indistinguishable from recently updated cells, and recently discovered cells (i.e. those with low visit counts) are more promising to explore because they are likely near the surface of our expanding sphere of knowledge.

Trajectories are all independent and cannot be linked together:

Because cells conflate many states, we cannot assume that a trajectory from start state A through cell B to cell C will still reach C if we substitute a different, better way to get from A to B; therefore, the better way of reaching a cell is not integrated into the trajectories of other cells that built upon the original trajectory.

Phase 2: Robustification

Phase 2 reintroduces stochasticity and uses the trajectories found during exploration as training data.

stochasticity is added during Phase 2 so that the final policy is robust to the stochasticity it will face during its evaluation in the test environment. Thus the policy being trained has to learn how to mimic and/or perform as well as the trajectory obtained from the Go-Explore exploration phase while simultaneously dealing with circumstances that were not present in the original trajectory. Depending on the stochasticity of the environment, this adjustment can be highly challenging, but nevertheless is far easier than attempting to solve a sparse-reward problem from scratch.

The imitation learning algorithm ideally should be able to improve upon the demonstrations, so they choose the Backward Algorithm.

LfD algorithms that try to closely mimic the behavior of the demonstration may struggle to improve upon it. For this reason, we chose an LfD algorithm that has been shown capable of improving upon its demonstrations: the Backward Algorithm from Salimans and Chen.

This iteratively runs normal RL (PPO, Proximal Policy Optimization) from progressively earlier points along the demonstration, using the exploration trajectory as a curriculum while still being able to improve on its actual scores.

It works by starting the agent near the last state in the trajectory, and then running an ordinary RL algorithm from there (in this case Proximal Policy Optimization (PPO) [64]). Once the algorithm has learned to obtain the same or a higher reward than the example trajectory from that starting place near the end of the trajectory, the algorithm backs the agent’s starting point up to a slightly earlier place along the trajectory, and repeats the process until eventually the agent has learned to obtain a score greater than or equal to the example trajectory all the way from the initial state.
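The backward curriculum can be sketched as follows. `train_from` is a stand-in for the real inner loop (running PPO updates from the given start state, then evaluating); here it simply reports the return the learner currently achieves from that state, and `demo_returns[i]` is the demonstration's return-to-go from state `i`:

```python
def backward_curriculum(demo_states, demo_returns, train_from):
    """Walk the start point backward along the demonstration: only after
    the learner matches (or beats) the demo's return-to-go from the current
    start state does the start point move one step earlier."""
    for start in reversed(range(len(demo_states))):
        while train_from(demo_states[start]) < demo_returns[start]:
            pass    # the real algorithm keeps running PPO updates here
    return True     # the learner now matches the demo from the initial state
```

Each earlier stage reuses everything learned at later stages, which is what makes a long, precise trajectory learnable piece by piece.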

However, it's not entirely reliable and required training with multiple trajectories.

However, we found it somewhat unreliable in learning from a single demonstration (Fig. 5a). Indeed, only 40% of our attempts at robustifying trajectories that solved level 1 were successful when using a single demonstration.

Why Go-Explore outperforms a learned model of the world is an interesting thread they touch on; they hypothesize that not keeping track of which states were already explored, and thus cycling through the same states, is a real problem.

In FMC, a planning process is initiated from each state the agent visits. Planning is done within a deterministic version of the game emulator. A fixed number of workers are started in the state from which planning is occurring, and they perform random walks in state space. Periodically, workers that have accumulated lower reward and/or are in less novel states are replaced by “clones” of more successful workers. Novelty is approximated as the Euclidean distance of the worker’s state (in the original, raw, observation space) to that of a randomly selected other worker.
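A heavily simplified sketch of one FMC cloning step as described above. The "virtual" fitness combining reward with distance-based novelty, and the bottom-half replacement, are illustrative simplifications of the actual FMC update, not its exact rule:

```python
import math
import random

def fmc_clone_step(workers, rng=random):
    """One cloning step in the spirit of FMC: score each worker by its
    accumulated reward plus a novelty proxy (distance in state space to a
    randomly chosen other worker), then replace the weaker half of the
    workers with clones of the stronger half."""
    n = len(workers)
    fitness = []
    for w in workers:
        other = workers[rng.randrange(n)]
        novelty = math.dist(w["state"], other["state"])  # Euclidean distance
        fitness.append(w["reward"] + novelty)
    ranked = sorted(range(n), key=lambda i: fitness[i])  # weakest first
    half = n // 2
    for weak, strong in zip(ranked[:half], ranked[-half:]):
        workers[weak] = dict(workers[strong])            # clone stronger worker
    return workers
```

Note that nothing here remembers which states have already been visited across planning episodes, which is the hypothesized weakness relative to Go-Explore's archive.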