OMNI - Open-endedness via Models of human Notions of Interestingness
The main issue in open-endedness is how to prioritize learning tasks so as to maximize learning efficiency.
An Achilles' heel of open-endedness research is the inability to quantify (and thus prioritize) tasks that are not just learnable, but also interesting (e.g., worthwhile and novel).
In pursuit of an algorithm that is applicable in any domain and enables perpetual learning, handcrafting curricula proves to be an impractical solution.
Instead of hand-crafted formulations of "interestingness", which are bound to have unforeseen issues, use foundation models.
There have been many attempts to quantify interestingness, but, as we detail in Section 2, such simple, hand-crafted formulas consistently fall short of truly capturing the essence of interestingness, creating crippling pathologies. This paper proposes a different path forward.
The idea: leverage the ineffable notions of interestingness that pretrained models picked up along the way, and use them to weight which tasks are good to learn from next.
OMNI leverages the power of FMs that have already been trained on extensive human-generated data and have an inherent understanding of human notions of interestingness.
There is no learning of this metric of any kind: just ask an LLM whether a task is interesting or boring given the RL agent's current progress, and use the LLM's knowledge and intuitions to guide the curriculum.
Ultimately, OMNI offers a general recipe for accelerated learning in open-ended environments with potentially infinite tasks, and steers open-ended learning towards meaningful and interesting progress, instead of meandering aimlessly amidst endless possibilities.
Learning Progress Curriculum (LP)
The goal is a curriculum at the frontier of the agent's current abilities, which requires quantifying "learning progress".
To automatically identify tasks at the frontier of the agent’s capabilities, we extend the learning-progress-based curriculum (without the dynamic exploration bonus) from Kanitscheider et al. (2021).
Define "learning progress" via two time-varying EMAs of task success rates, evaluated periodically during training. Raw success rates can be inflated by chance success, so normalize by the success rate of a random-action policy. Details from the paper:
For each task, we calculate the task success rate $t_{\text{rdn}}$ achieved by a random action policy. At fixed evaluation intervals during training, the evaluated task success rate $t_{\text{eval}}$ of the RL policy is normalized as such:
$$t_{\text{norm}} = \frac{t_{\text{eval}} - t_{\text{rdn}}}{1 - t_{\text{rdn}}}$$
The above is our contributed extension to the learning progress method in Kanitscheider et al. (2021). $t_{\text{norm}}$ is smoothed with an EMA function to obtain $p_{\text{recent}}$. $p_{\text{recent}}$ is smoothed with a second identical EMA function to obtain $p_{\text{gradual}}$. The exponential smoothing constant applied in all experiments is 0.1. The bidirectional learning progress measure is given by $LP = |f(p_{\text{recent}}) - f(p_{\text{gradual}})|$ (Figure 6, blue), where $f$ is the reweighting function
$$f(p) = \frac{(1 - p_\theta)\,p}{p + p_\theta(1 - 2p)}$$
with parameter $p_\theta = 0.1$.
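The double-EMA computation can be sketched as follows. This is a minimal sketch: the normalization and the reweighting function $f$ follow the descriptions above, but the class structure and variable names are mine, not the paper's, and initializing both EMAs at the first observation is an assumption.

```python
import numpy as np

def normalize(t_eval, t_rdn):
    """Normalize the RL policy's success rate by the random-policy baseline:
    maps the random policy's rate to 0 and perfect success to 1."""
    return np.clip((t_eval - t_rdn) / (1.0 - t_rdn), 0.0, 1.0)

def reweight(p, p_theta=0.1):
    """Reweighting function f, which magnifies changes at low success rates."""
    return (1.0 - p_theta) * p / (p + p_theta * (1.0 - 2.0 * p))

class LearningProgress:
    """Bidirectional learning progress via two stacked EMAs (alpha = 0.1)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.p_recent = None   # EMA of normalized success rate
        self.p_gradual = None  # EMA of p_recent (lags behind it)

    def update(self, t_norm):
        if self.p_recent is None:
            # initialize both EMAs at the first observation (an assumption)
            self.p_recent = self.p_gradual = t_norm
        else:
            self.p_recent = self.alpha * t_norm + (1 - self.alpha) * self.p_recent
            self.p_gradual = self.alpha * self.p_recent + (1 - self.alpha) * self.p_gradual
        # |f(recent) - f(gradual)| is large when success is moving either way
        return abs(reweight(self.p_recent) - reweight(self.p_gradual))
```

Because the two EMAs lag each other, $LP$ is large while a task's success rate is changing (up or down) and decays back to zero once it plateaus.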
They employ a sampling function to transform the measure of learning progress into task sampling weights, focusing mostly on tasks with the largest learning progress. The steps are as follows:
- Z-score the reweighted learning progress (subtract mean and divide by standard deviation).
- Apply a sigmoid to the result.
- Normalize resulting weights to sampling probabilities.
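The three steps above can be sketched as:

```python
import numpy as np

def lp_to_sampling_probs(lp):
    """Turn per-task learning progress into task sampling probabilities:
    z-score -> sigmoid -> normalize. (The epsilon guarding a zero standard
    deviation is my addition.)"""
    lp = np.asarray(lp, dtype=float)
    z = (lp - lp.mean()) / (lp.std() + 1e-8)  # z-score
    w = 1.0 / (1.0 + np.exp(-z))              # sigmoid squashes to (0, 1)
    return w / w.sum()                        # normalize to probabilities

probs = lp_to_sampling_probs([0.0, 0.05, 0.3, 0.01])
# tasks with the largest learning progress get the most probability mass
```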
Model of Interestingness (MoI)
An LP curriculum can be distracted by endless variations of uninteresting tasks. This is where a subjective notion of interestingness comes in, obtained by querying a pre-trained foundation/world/language model. The idea is:
We prompt the FM in a few-shot manner by providing it with examples of choosing which tasks are interesting. It takes into account the agent’s existing proficiency on a given set of tasks and suggests what humans would typically find interesting to learn next.
Basically: ask whether the LM thinks each task would be interesting or boring given the current context (a binary classification), then re-weight the sampling weights from the Learning Progress method above.
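A minimal sketch of the MoI step. The prompt wording and the down-weighting factor applied to boring tasks are illustrative assumptions, not the paper's exact choices, and the FM call itself is left to the caller:

```python
def build_moi_prompt(tasks, success_rates):
    """Build a prompt asking the FM to label each task interesting or boring,
    given the agent's current proficiency. Wording is illustrative."""
    lines = ["An RL agent has the following success rates on its tasks:"]
    for task, sr in zip(tasks, success_rates):
        lines.append(f"- {task}: {sr:.0%} success")
    lines.append("For each task, answer 'interesting' or 'boring' to learn next.")
    return "\n".join(lines)

def moi_reweight(lp_weights, interesting_flags, eps=1e-3):
    """Keep the LP sampling weight for tasks the FM labels interesting,
    down-weight boring ones (the factor eps is an assumption), renormalize."""
    w = [lp if flag else lp * eps
         for lp, flag in zip(lp_weights, interesting_flags)]
    total = sum(w)
    return [x / total for x in w]
```

In use, `interesting_flags` would come from parsing the FM's response to `build_moi_prompt`; here it is just a list of booleans.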
Crafter Experiment
They remove the survival component of the task, which is interesting. Would've liked to see it included.
We modify the game to focus on gathering and crafting skills by eliminating the survival component. This removes the need for the agent to learn and continually apply survival tactics against enemies or for food gathering. The “sleep” and “place plant” actions are important for survival in the original game and have been omitted due to their reduced relevance in our modified context, which excludes the survival aspect. The original game consists of 22 tasks, of which, the 15 tasks unrelated to survival are selected and considered interesting.
They also create an additional set of boring and extremely difficult tasks.
we dilute the 15 interesting tasks with 90 “boring” tasks and 1023 “extremely challenging” tasks that serve as potential distractors for learning-progress-based approaches. Boring tasks are generated as numerical repeats of interesting tasks, e.g., “collect N wood” where N ∈ [2, 10], analogous to how minor numerical variations of real-world tasks are less interesting than tasks that differ qualitatively.
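Generating the numeric-repeat distractors could look like this (the task names and the subset chosen are illustrative, not the paper's full task list):

```python
# "Boring" tasks are numerical repeats of interesting collection tasks,
# e.g. "collect N wood" for N in [2, 10].
collect_tasks = ["collect wood", "collect stone"]  # illustrative subset

boring_tasks = [
    task.replace("collect ", f"collect {n} ")
    for task in collect_tasks
    for n in range(2, 11)  # N in [2, 10], per the paper
]
# each collection task yields 9 numeric variants, e.g. "collect 3 wood"
```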
They use PPO (Proximal Policy Optimization) for the RL agent. Average success rate is pretty good with LP and MoI (training took 33 hours on an A10 GPU).

Discussions
Use an LLM judge for both learning progress and interestingness:
A different version of OMNI could have the FM judge both learning progress and interestingness by giving it a history of task selection and task performance. That could allow for a more flexible notion of learning progress, as it can potentially recognize not just success or failure, but also important stepping stones and patterns leading to a solution, echoing the complexity of human learning progression.
Another idea is to allow the MoI to autonomously analyze quantitative performance measures, make its own assessment of learning progress, and incorporate that into its notion of interestingness.
Goodhart's law can always strike though:
Critically, by Goodhart’s law, we expect that any model will have pathologies uncovered once it is a target metric being optimized against.
Their suggestion is to fine-tune with human feedback:
refining and updating the MoI with additional human feedback could lead to more effective learning systems, an algorithm within the OMNI paradigm we call Open-Endedness with Human Feedback (OEHF).