Darwin Gödel Machine (DGM)

Self-improving system that modifies its own codebase and verifies its changes using benchmarks.

We introduce the Darwin Gödel Machine (DGM), a novel self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks.

And gets crazy gains:

increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%.

Based on Schmidhuber's concept of Gödel machines:

Schmidhuber (2007) presented a class of mathematically rigorous, self-referential, self-improving problem solvers. It relies on formal proofs to justify code rewrites, ensuring that any self-modification is provably beneficial. However, in practice and without restrictive assumptions about the system, it is impossible to formally prove whether a modification to an AI system will be beneficial.

Idea: Self-modify and validate if it works + maintain library of previous agents as stepping stones.

Instead of requiring formal proofs, we empirically validate self-modifications against a benchmark, allowing the system to improve and explore based on observed results.

We also take inspiration from Darwinian evolution (Darwin, 1859) and investigate the effectiveness of maintaining a library of previously discovered agents to serve as stepping stones for future generations.

The system alternates between two phases:

The DGM alternates between self-modification and evaluation phases. During the self-modification phase, selected coding agents from the archive generate modified versions of themselves. During the evaluation phase, each modified agent is tested on a coding benchmark, estimating the agent’s coding capabilities, and then added to the archive
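The alternating loop above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `select_parents`, `self_modify`, and `evaluate` are hypothetical stand-ins for the paper's parent sampling, agent self-editing, and benchmark runs.

```python
def dgm_loop(archive, iterations, k, select_parents, self_modify, evaluate):
    """Sketch of the DGM outer loop: alternate self-modification and evaluation.

    archive        -- list of agents discovered so far
    k              -- number of parents sampled per iteration
    select_parents -- hypothetical parent-sampling function
    self_modify    -- hypothetical: parent edits its own codebase, yielding a child
    evaluate       -- hypothetical: run the child on a coding benchmark
    """
    for _ in range(iterations):
        parents = select_parents(archive, k)   # sample k parents from the archive
        for parent in parents:
            child = self_modify(parent)        # self-modification phase
            child.score = evaluate(child)      # evaluation phase
            archive.append(child)              # keep it as a future stepping stone
    return archive
```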

They postulate, and show, that the archive leads to the discovery of better coding agents:

This open-ended exploration encourages the discovery of novel and potentially useful self-modifications beyond immediate performance gains.

The DGM outperforms the baseline of not having open-ended exploration (i.e., a baseline without the accumulation of an archive of interestingly different stepping stones), where the coding agent always builds off the most recent version of itself.

Basically a promising POC for self-improving AI through Open-Endedness:

Through self-improvement and open-ended exploration, the DGM shows a continuous increase in performance, bringing us one step closer to self-accelerating, self-improving AI systems.

Parent selection and improvement

Parent selection is roughly proportional to each agent’s performance score and the number of its children with codebase-editing functionality

Selection is inspired from Go-Explore. More details from the paper:

At each iteration $t$ of the DGM run, let

$$ \mathcal{A}^t=\left\{a_1^t, a_2^t, \ldots, a_N^t\right\} $$

be the archive of candidate agents. We first define the eligible set

$$ \mathcal{E}^t=\left\{a_i^t \in \mathcal{A}^t: \alpha_i<1\right\}, $$

i.e. only those agents whose performance score is not yet perfect. We then sample $k$ parents (with replacement) from $\mathcal{E}^t$ in proportion to a combined measure of performance and number of children with codebase-editing functionality. Concretely, for each agent $a_i^t \in \mathcal{E}^t$ :

$$ \begin{aligned} \text{(performance)} \quad & \alpha_i = \operatorname{performance}(a_i^t), \\ \text{(children count)} \quad & n_i = \operatorname{functioning\_children\_count}(a_i^t), \\ \text{(sigmoid-scaled performance)} \quad & s_i = \frac{1}{1+\exp(-\lambda(\alpha_i - \alpha_0))}, \\ \text{(novelty bonus)} \quad & h_i = \frac{1}{1+n_i}, \\ \text{(unnormalized weight)} \quad & w_i = s_i \, h_i, \\ \text{(selection probability)} \quad & p_i = \frac{w_i}{\sum_{a_j^t \in \mathcal{E}^t} w_j}, \\ \text{(sampling)} \quad & \{\text{parents}\} \sim \operatorname{Categorical}\!\left(\{p_i\}_{a_i^t \in \mathcal{E}^t}\right) \quad (\text{draw } k \text{ samples}). \end{aligned} $$

Here, $\lambda>0$ controls the sharpness of the sigmoid, and $\alpha_0$ is the midpoint accuracy. We set $\lambda=10$ and $\alpha_0=0.5$. $k$ determines how many self-improvements are done in parallel during each iteration of the DGM run. We set $k=2$ in SWE-bench experiments and $k=4$ in Polyglot experiments.
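The weighting above translates directly into code. A minimal sketch with $\lambda=10$ and $\alpha_0=0.5$ as in the paper; the input scores and child counts are illustrative:

```python
import math
import random

def selection_probs(scores, child_counts, lam=10.0, alpha0=0.5):
    """DGM parent-selection probabilities over the eligible set.

    scores[i]       -- performance alpha_i in [0, 1]
    child_counts[i] -- number of functioning children n_i
    Agents with a perfect score (alpha_i >= 1) are ineligible and get weight 0.
    """
    weights = []
    for alpha, n in zip(scores, child_counts):
        if alpha >= 1.0:                                      # not in the eligible set
            weights.append(0.0)
            continue
        s = 1.0 / (1.0 + math.exp(-lam * (alpha - alpha0)))   # sigmoid-scaled performance
        h = 1.0 / (1.0 + n)                                   # novelty bonus
        weights.append(s * h)                                 # unnormalized weight
    total = sum(weights)
    return [w / total for w in weights]

def sample_parents(scores, child_counts, k, rng=random):
    """Draw k parent indices with replacement from the categorical distribution."""
    probs = selection_probs(scores, child_counts)
    return rng.choices(range(len(scores)), weights=probs, k=k)
```

Note how the novelty bonus $h_i$ down-weights agents that have already produced many functioning children, pushing selection toward less-explored parts of the archive.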

Self-improvement

Each selected parent analyzes its own benchmark evaluation logs, proposes the next feature to implement, and receives this proposal as a problem statement to execute (Appendix C.3). The parent then implements the suggested feature into its own codebase, generating a new coding agent.
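One self-modification step as described above can be sketched like this. The `diagnose` and `implement` helpers are hypothetical stand-ins for the agent's log analysis and code-editing abilities, not names from the paper:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    codebase: str
    benchmark_logs: str = ""
    parent: "Agent | None" = None

def self_modify(parent, diagnose, implement):
    """One self-modification step: analyze logs, propose a feature, implement it.

    diagnose  -- hypothetical: turn benchmark logs into a feature proposal
                 (the problem statement the parent receives)
    implement -- hypothetical: apply the proposal to the parent's own codebase
    """
    proposal = diagnose(parent.benchmark_logs)       # propose the next feature
    new_code = implement(parent.codebase, proposal)  # edit own codebase
    return Agent(codebase=new_code, parent=parent)   # the new coding agent
```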

Run the loop for 80 iterations with parallel parent improvements:

We run the DGM for 80 iterations (generating one new agent per iteration), with two iterations running in parallel for SWE-bench and four for Polyglot.

The solutions archive

Interestingly, this is not a classical MAP-Elites archive with behavior dimensions, as in AlphaEvolve. It is simply a tree of all agents that still function (i.e., can edit code), kept regardless of their benchmark performance.
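The archive can be pictured as a plain parent-child tree. A minimal sketch (names and scores are illustrative, not from the paper):

```python
class ArchiveNode:
    """One agent in the DGM archive: kept as long as it still functions,
    regardless of score (no behavior-descriptor grid as in MAP-Elites)."""

    def __init__(self, name, score, parent=None):
        self.name = name
        self.score = score
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def lineage(self):
        """Path from the initial agent down to this agent."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))

root = ArchiveNode("agent_0", score=0.20)
child = ArchiveNode("agent_1", score=0.15, parent=root)   # kept despite a lower score
grandchild = ArchiveNode("agent_2", score=0.35, parent=child)
```

Keeping the lower-scoring `agent_1` is the point: it may be the stepping stone that makes `agent_2` reachable at all.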

darwin-godel-machine