Foundation Model Self-Play (FMSP)

Self-play (SP) is an open-ended process with ever-shifting goals, allowing agents to bootstrap their way to superhuman-level play. But it easily gets stuck, because the curriculum is optimized toward the singular goal of winning. SP also requires a tremendous amount of compute, and it can saturate once agents become better than their opponents.

Self-play (SP) algorithms try to harness these dynamics by pitting agents against ever-improving opponents, thereby creating an implicit curriculum toward learning high-quality solutions. However, SP often fails to produce diverse solutions and can get stuck in locally optimal behaviors.

Traditional SP approaches can converge to local optima and have trouble learning a diversity of high-quality policies (Balduzzi et al., 2018), as the curriculum points only toward a singular goal of winning. This directed nature of the curriculum only incentivizes exploration that easily helps exploitation, which can lead to policies being stuck in local optima (Norman & Clune, 2024).

Training models up to superhuman levels with SP has historically required enormous computational resources, especially in complex or sparse-reward problems (Cusumano-Towner et al., 2025; OpenAI et al., 2019; Silver et al., 2016; Vinyals et al., 2019), limiting the applicability of SP to quickly simulatable domains.

Finally, learning can stagnate if a policy becomes detached from its opponents (i.e., it loses no matter what it does), depleting the losing agent of training signal (Bansal et al., 2018; Czarnecki et al., 2020; Ecoffet et al., 2021).

They propose to solve these issues of SP by using LLMs to write code policies for RL problems.

we propose Foundation-Model Self-Play (FMSP) as a new direction, merging SP with foundation models. To address this, FMSPs write code-based policies and continually improve the policies via the implicit SP curriculum.

This differs from other Open-Endedness approaches because it is not about generating/grading environments, but about improving the agents themselves.

While prior work has leveraged FMs in open-ended learning by having FMs generate environments for agents to learn in (Faldor et al., 2024; Zhang et al., 2023), we believe this paper introduces the first examples of FM-driven self-play wherein FMs endlessly improve agents that play against themselves.

They try three flavors: Vanilla (vFMSP), Novelty-Search (NSSP), and Quality-Diversity (QDSP), which outperforms the other two. Quality Diversity for the win again.

but QDSP’s blend of diversity and incremental improvement achieves the best overall performance, producing algorithms sampled from across multiple fields of computer science that learn high-performing policies over a series of episodes.

Method

Each evolution step performs "policy generation"; QDSP additionally performs "archive improvement".

Policy generation: the LLM's context is built from randomly sampled members of the two populations, the head-to-head performance of those samples against each other, and neighboring policies.

Given populations p1 and p2, when generating a new member of p1—px1 — NSSP provides the FM-search operator with context including randomly sampled members of both populations (pa1 , pa2 ), the score of how well they perform against each other in a head-to-head match (pa1 vs pa2 ), as well as neighboring policies similar to pa1 , (pb1, pc1, pd1).
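A minimal sketch of how that context could be assembled (the data layout, helper names, and neighbor selection by embedding distance are my assumptions, not the paper's code):

```python
import random

def build_context(pop1, pop2, head_to_head, k_neighbors=3):
    """Assemble the FM prompt context for generating a new member of pop1:
    one random policy from each population, their head-to-head score, and
    the pop1 policies most similar to the sampled one."""
    pa1, pa2 = random.choice(pop1), random.choice(pop2)
    score = head_to_head(pa1, pa2)  # e.g. mean win-rate of pa1 vs. pa2
    # Neighbors: other pop1 policies closest to pa1 in embedding space.
    others = [p for p in pop1 if p is not pa1]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p["emb"], pa1["emb"])))
    return {"sampled": (pa1, pa2), "score": score, "neighbors": others[:k_neighbors]}
```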

The LLM is then asked to generate a policy distinct from those in its context, which is then turned into an embedding vector.

The FM generates px1 to be distinct from those in the context (pa1, pa2, pb1, pc1, pd1) and then continually refines px1 to remove implementation bugs (Shinn et al., 2023). px1 is then embedded into an n-dimensional embedding vector (n=64) via a text-embedding model (Kusupati et al., 2024) (OpenAI's text-embedding-3-small).
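The real pipeline calls OpenAI's embedding API; as a runnable stand-in, here is a toy hash-based embedder that only mimics the interface (deterministic 64-d unit vectors; purely illustrative, not the paper's embedding model):

```python
import hashlib
import math

def toy_embed(text, n=64):
    """Stand-in for a text-embedding model (the paper uses OpenAI's
    text-embedding-3-small at n=64 dims): a deterministic, hash-derived
    unit vector, good enough to exercise the archive machinery offline."""
    h = hashlib.sha256(text.encode()).digest()
    # Repeat the 32-byte digest until we have n values, center and scale.
    v = [(b - 127.5) / 127.5 for b in (h * ((n // len(h)) + 1))[:n]]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]
```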

A KNN search retrieves the most similar policies from the archive, and the LLM is asked whether the new policy is "truly" novel. If yes, it is added to the archive. The NSSP variant doesn't remove or replace old policies because it's just maximizing diversity.

NSSP then asks the FM-as-judge (Zheng et al., 2023) whether px1 is truly novel versus its neighboring policies already in the archive (pg1, ph1, pi1). If px1 is novel, px1 is added to the archive.

Archive improvement: QDSP replaces the most similar policy if performance of new is better:

If the new policy is determined to be too similar to an existing policy in the archive, QDSP compares the performance of the new policy and its (singular) nearest neighbor in the archive (determined by embedding distance), retaining only the better-performing policy.
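The archive rule reduces to a few lines. A sketch, where `is_novel` stands in for the FM-as-judge verdict and policies are dicts carrying an embedding and a score (my layout, not the paper's):

```python
import math

def nearest(archive, emb):
    """Index of the archive entry whose embedding is closest (Euclidean)."""
    dists = [math.dist(emb, entry["emb"]) for entry in archive]
    return dists.index(min(dists))

def qdsp_update(archive, new, is_novel):
    """QDSP archive rule: add the policy if judged novel; otherwise keep
    only the better of the new policy and its single nearest neighbor."""
    if is_novel or not archive:
        archive.append(new)
        return archive
    i = nearest(archive, new["emb"])
    if new["score"] > archive[i]["score"]:
        archive[i] = new  # replace the similar-but-worse incumbent
    return archive
```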

They call it the first dimensionless MAP-Elites because there are no explicitly defined behaviors/features; the dimensions of variation emerge from the embeddings.

This FM-based update (add to archive if novel or replace neighbor if better) yields the first dimensionless MAP-Elites algorithm (whether in self-play or not, an important, independent contribution). By "dimensionless", we mean (1) that humans do not have to specify the dimensions of variation ahead of time, as is typically done (Cully et al., 2015; Mouret & Clune, 2015), nor (2) is the algorithm confined to dimensions of variation that already exist within the data generated up until that point in the run, which is true of methods that apply dimensionality reduction techniques like PCA to already-generated data (Cully, 2019), and (3) QDSP can pick an infinite number of conceptual dimensions of variation within its vast knowledge base.

Implementation and main experiment

Uses Car Tag, a two-player evader-pursuer game in which each agent has its own movement constraints; the evader maximizes its survival time while the pursuer minimizes its time-to-catch.

We first evaluate our family of FMSP approaches on Car Tag (Isaacs & Corporation, 1951), a classic 2-body evader-pursuer game implemented as a finite discrete-time system on the unbounded XY-plane, in which each agent outputs a continuous heading value to determine its next movement direction. The game is asymmetric because the pursuer moves more quickly but has a limited turning radius, while the evader is slower but has no such restriction.
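The asymmetry can be sketched as a per-step update; the time step, speeds, and turn limit below are illustrative assumptions (the paper's exact values aren't given here):

```python
import math

def step(state, heading, speed, max_turn=None, dt=0.1):
    """One discrete-time move on the unbounded XY-plane (illustrative
    dynamics). `heading` is the continuous action; a pursuer-style
    `max_turn` clamps how far the heading may change per step, while the
    evader (max_turn=None) can point anywhere instantly."""
    x, y, cur = state
    if max_turn is not None:
        # Wrap the requested change into [-pi, pi), then clamp it.
        delta = (heading - cur + math.pi) % (2 * math.pi) - math.pi
        heading = cur + max(-max_turn, min(max_turn, delta))
    return (x + speed * dt * math.cos(heading),
            y + speed * dt * math.sin(heading),
            heading)
```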

Uses manually designed unit tests, so the pipeline is not truly fit for out-of-the-box use.

The inner-loop search-operator FM (GPT-4o, here) iterates on a new policy class until it passes a set of manually designed unit tests (ensuring the policy can execute quickly and does not crash when executed against a simple heuristic opponent) and then outputs code implementing the policy class.

The two opponents' policies are kept as separate populations. The loop goes like this:

After a new evader and pursuer have both been created, they compete against each other; the pursuer has 1000 timesteps to catch the evader before it loses. The only reward signal comes at the end of the game; the evader receives a reward of n/1000 and the pursuer receives 1 − n/1000 (either n = 1000 if the evader wins, or n is the amount of time before the evader was caught). The simulation is run 100 times with random starting locations for each agent and the final score is the mean win-rate over those hundred sparse-reward games. Finally, each algorithm updates its archive according to its archiving rule and the loop begins anew.
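The reward scheme in code (helper names are mine):

```python
def episode_rewards(n, horizon=1000):
    """Sparse end-of-game rewards: the evader earns n/1000 and the pursuer
    1 - n/1000, where n is the capture timestep, or n = 1000 if the evader
    survives the whole game."""
    return n / horizon, 1 - n / horizon

def evader_win_rate(capture_times, horizon=1000):
    """Final score per the text: mean win-rate over the random-start games
    (a game counts as an evader win when it is never caught)."""
    wins = sum(1 for n in capture_times if n >= horizon)
    return wins / len(capture_times)
```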

Experiment in text-based game

Pretty cool game with AI-safety flavor.

We further test FMSPs in a new text-based AI safety puzzle game named Gandalf. In Gandalf, the objective is to extract a secret password from a large language model (LLM) across multiple levels of increasingly strict defenses. The game has a binary win-loss reward signal where to succeed, the attacker must extract an exact match for the password held by the LLM.

Increasing level of difficulty:

The Gandalf defenders are structured as follows. Level 1 has no protection and encourages the LLM to give away the password. Level 2 instructs the LLM to keep the password secret except to correct wrong guesses. Levels 3 and 4 implement output guards on the base LLM's response (a regex filter looking for the password, and LLM-as-judge (Zheng et al., 2023) asked to determine if the model's response gives away the password, respectively). Levels 5 and 6 examine incoming attack prompts (a regex filter on the incoming attack looking for the words "password" or "secret" and LLM-as-judge asked to determine if the attack prompt is attempting to extract the password, respectively). Finally, level 7 combines each of the guards from levels 3, 4, 5, and 6.
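Two of those guards are easy to sketch; the exact regexes and matching rules are my assumptions (the text only says the filters look for the password or the words "password"/"secret"):

```python
import re

def level5_input_guard(attack_prompt):
    """Illustrative level-5 defense: reject any incoming attack that
    mentions 'password' or 'secret'. Returns True if the prompt passes."""
    return not re.search(r"\b(password|secret)\b", attack_prompt, re.IGNORECASE)

def level3_output_guard(response, password):
    """Illustrative level-3 defense: block any model response containing
    the password itself (case-insensitive substring match, as an
    assumption). Returns True if the response may be shown."""
    return password.lower() not in response.lower()
```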

None of the variants defeat level 7. Open Loop (just repeatedly prompting the base model) fails from level 5 onward. QDSP has the highest win rate.

QDSP, NSSP, and vFMSP each discover a high-quality set of attackers that defeat levels 1–6 of Gandalf (Table 1).

They hypothesize that simply prompting the LLM fails because the solution is out of distribution (OOD).

The Open Loop algorithm never beats levels 5–7, indicating that the problem solutions likely are outside the model's training data and thus solving these challenges requires a more intelligent algorithm than asking the base model repeatedly.

They also try having the algorithms generate defenses, and the defenders seem to patch everything pretty quickly!

Interestingly, defense appears simpler than attack in this particular puzzle because (1) within 10 iterations of generating new defenses, none of the attackers tested against could extract the password and (2) even the Open-Loop algorithm was able to lock down the vulnerabilities. This result suggests that an automated self-play framework would likely lock down vulnerabilities as soon as they are discovered.

Evaluation

Evaluation was tricky, so lots of ideas were used.

There is no shared benchmark to compare each run against (besides the few human-designed policies) given the fact that each algorithm bootstraps its populations of policies; therefore, we do a post-hoc quality analysis where we construct a large shared population built out of the human-designed policies (to serve as baselines) and policies from each experiment.

Elo analysis to the rescue again:

Namely, we combine the human-designed policies, the top 3 policies from each experiment (calculated using ELO scores via an intra-treatment tournament), and 15 random policies from each experiment. We found this created a good mixture of high-quality policies while also providing a mixture of medium- to low-quality policies to evaluate against.

ELO (Elo, 1978) analyses reveal that QDSP is the only algorithm that generates strong policies for both the evader and pursuer populations (Figure 3); although statistical tests were unable to reject the hypothesis that the difference was not due to sampling.
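For reference, the standard Elo update behind such tournament rankings (k = 32 is a common default, not a value from the paper):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo rating update (Elo, 1978). `score_a` is 1 for an A win,
    0 for a loss, 0.5 for a draw; ratings move toward observed results
    in proportion to how surprising they were."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new
```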

Measuring diversity is done using QD-Score:

we calculate the QD-Score (Pugh et al., 2015), a holistic measure of both quality and diversity of an algorithm, which is the mean of the map (including unfilled cells as zeroes). Related policies will likely share the same cell, resulting in more cells being unfilled. Having unfilled cells will decrease the QD-Score of an algorithm because that algorithm did not explore new strategies very well. Under the QD-Score metric, the best algorithms are ones that balance filling the map (exploration) with improving policy types mapped to each cell (exploitation).
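QD-Score as described, assuming the archive map is a dict from cell to best fitness (my layout):

```python
def qd_score(archive_map, n_cells):
    """QD-Score (Pugh et al., 2015): mean fitness over the whole map,
    counting unfilled cells as zero, so both coverage and quality matter."""
    return sum(archive_map.values()) / n_cells
```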