AlphaEvolve
AlphaEvolve = LLMs' ability to write code and respond to feedback + verifiable solutions + evolutionary search.
An LLM code superoptimization agent called AlphaEvolve that takes on this challenge using a combination of evolutionary computation and LLM-based code generation... applied to discovery problems in which the candidates of discovery can be automatically evaluated.
It is applicable both to problems where discovering new algorithms is the intrinsic goal, as well as to the broad range of problems where the solution of interest is not an algorithm itself but an algorithm can describe how that solution is to be constructed or found. In the latter case, discovering the algorithm is only an instrumental goal, but it turns out to be a surprisingly effective strategy compared to searching for the solution directly
Main limitation: solutions must be automatically evaluable with code.
While the use of an automated evaluation metric offers AlphaEvolve a key advantage, it is also a limitation—in particular, it puts tasks that require manual experimentation out of our scope.
BUT, it can also use LLM-judge style evaluation:
While AlphaEvolve does allow for LLM-provided evaluation of ideas, this is not a setting we have optimized for. However, concurrent work shows this is possible [ 34], and a natural step would be to link the two settings, with LLMs providing feedback on high-level ideas before transitioning to an implementation stage, for which machine feedback is available through code execution.
Verifiable problems it successfully tackles:
for 4 × 4 matrices, AlphaEvolve improves Strassen (1969)’s algorithm by discovering an algorithm using 48 multiplications to multiply 4 × 4 complex-valued matrices.
On ∼20% of the problems, AlphaEvolve surpasses the SOTA and discovers new, provably better constructions. This includes an improvement on the Minimum Overlap Problem set by Erdős [25] and an improved construction on the Kissing Numbers problem in 11 dimensions
we use AlphaEvolve in four engineering problems spanning different layers of Google’s compute stack: discovering scheduling heuristics for Google’s cluster management system, optimizing matrix-multiplication kernels used to train LLMs, optimizing arithmetic circuits used within TPUs, and optimizing the runtime of attention in Transformers.
At a high level, the orchestrating procedure is an evolutionary algorithm that gradually develops programs that improve the score on the automated evaluation metrics associated with the task.
Architecture
The "API" is simply annotating the codebase to mark which parts to evolve.
To support evolving multiple components across a codebase, AlphaEvolve exposes an input API where blocks of code can be annotated as to-be-evolved-by-the-system ... simply by adding special markers (# EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END) as comments into the code.
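The marker comments above come straight from the paper; how the system extracts them is not spelled out, so here is a minimal sketch of pulling annotated blocks out of a source file (the regex helper and the example `heuristic` function are my assumptions, not the paper's code):

```python
# Sketch: extracting annotated evolve blocks from a source file.
# Marker strings are from the paper; the parsing helper is illustrative.
import re

EVOLVE_BLOCK = re.compile(
    r"# EVOLVE-BLOCK-START\n(.*?)# EVOLVE-BLOCK-END", re.DOTALL
)

source = '''\
def preprocess(x):  # frozen: not evolved
    return x / 255.0

# EVOLVE-BLOCK-START
def heuristic(job, machines):
    return min(machines, key=lambda m: m.load)
# EVOLVE-BLOCK-END
'''

# Only the annotated region is handed to the evolutionary loop;
# everything outside the markers stays fixed.
blocks = EVOLVE_BLOCK.findall(source)
```

Everything outside a block is treated as frozen scaffolding, which is what makes the annotation scheme work on whole codebases rather than single functions.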
Can be used in meta ways at different levels of abstraction:
For example, AlphaEvolve can evolve the solution in raw string representation (as in classical evolutionary algorithms); evolve a function of a definite form that specifies how to construct the solution from scratch (the approach taken in [ 83]); evolve a bespoke search algorithm to find the solution within some fixed compute budget; or even co-evolve intermediate solutions and search algorithms together, such that each search algorithm is specifically tailored to further improve upon a particular intermediate solution.
Prompt Sampling for LLM calls
Prompt sampling is basically a pipeline that constructs prompts from rich context drawn from different sources at each step, to get the best solutions out of LLMs:
This prompt comprises multiple previously discovered solutions sampled from the program database, as well as system instructions on how to propose changes to a particular solution.
Can get extremely creative with it:
Explicit context: details about the problem being solved, such as fixed human-written instructions, equations, code snippets, or relevant literature (e.g., pdf files). • Stochastic formatting: template placeholders with human-provided alternatives for increased diversity, instantiated using probability distributions provided in a separate config file. • Rendered evaluation results: usually this will include a program, the result of executing that program, and the scores assigned by the evaluate function. • Meta prompt evolution: instructions and context suggested by the LLM itself in an additional prompt-generation step, co-evolved in a separate database analogous to the solution programs.
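The four ingredients above can be sketched as a small assembly function. This is an illustrative sketch only: the function name, the template openers, and the parent format are my assumptions, standing in for explicit context, stochastic formatting, and rendered evaluation results respectively:

```python
# Sketch of prompt assembly from multiple context sources.
# All names and templates here are assumptions for illustration.
import random

def build_prompt(task_context, parents, rng=random.Random(0)):
    """Combine explicit context, a stochastically chosen template,
    and previously discovered solutions with their scores."""
    # Stochastic formatting: pick one of several human-provided openers.
    opener = rng.choice([
        "Act as an expert programmer.",
        "You are improving a scientific search heuristic.",
    ])
    parts = [opener, task_context]  # explicit context
    # Rendered evaluation results: prior programs plus their scores.
    for p in parents:
        parts.append(f"Prior solution (score={p['score']:.3f}):\n{p['code']}")
    parts.append("Propose an improved version as a diff.")
    return "\n\n".join(parts)

parents = [
    {"code": "def f(x): return x", "score": 0.50},
    {"code": "def f(x): return 2 * x", "score": 0.75},
]
prompt = build_prompt("Task: maximize f on the test set.", parents)
```

Meta prompt evolution would add a fourth source here: extra instructions that are themselves evolved in a separate database.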
Output Format
Mostly uses a specific diff format, BUT can also be asked for complete rewrites apparently.
In cases where the code being evolved is very short, or when a complete rewrite is more appropriate than a small modification, AlphaEvolve can be configured to instruct the LLM to output the entire code block directly, rather than using the diff format.
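The paper describes a search/replace-style diff format for the default (non-rewrite) case. As a sketch of how such a diff might be applied, assuming `SEARCH`/`REPLACE` markers of the kind the paper shows (the `apply_diff` helper is my own illustration):

```python
# Sketch: applying a search/replace-style diff to an evolved code block.
# The exact marker syntax is an assumption based on the paper's examples.
import re

DIFF_PATTERN = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)

def apply_diff(source, diff):
    """Replace each SEARCH chunk found in `source` with its REPLACE chunk."""
    for search, replace in DIFF_PATTERN.findall(diff):
        source = source.replace(search, replace)
    return source

code = "def step(x):\n    return x + 1\n"
diff = (
    "<<<<<<< SEARCH\n"
    "    return x + 1\n"
    "=======\n"
    "    return x + 2\n"
    ">>>>>>> REPLACE"
)
patched = apply_diff(code, diff)
```

A diff format keeps LLM outputs short and targeted, which matters when the evolved block is a large file rather than a single function.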
Evaluation
To track AlphaEvolve’s progress and to select which ideas to propagate in future generations, each new solution proposed by the LLMs is automatically evaluated.
Evaluation function can return multiple metrics too and is quite powerful.
a Python function, called evaluate, with a fixed input/output signature, returning a dictionary of scalars.
In more complicated cases, the function ℎ might involve performing an evolved search algorithm, or training and evaluating a machine learning model.
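A minimal sketch of such an `evaluate` function, taking candidate source as a string for simplicity (the paper passes programs through a fixed signature; the toy task and both metric names here are invented for illustration):

```python
# Sketch of an evaluate function: fixed signature, returns a dict of scalars.
# The toy task (match f(x) = 2x) and metric names are illustrative assumptions.
def evaluate(program_source: str) -> dict[str, float]:
    namespace: dict = {}
    exec(program_source, namespace)       # run the candidate program
    solve = namespace["solve"]            # candidate must define `solve`
    correct = sum(solve(x) == 2 * x for x in range(10))
    return {
        "accuracy": correct / 10,
        "neg_length": -float(len(program_source)),  # shorter is better
    }

scores = evaluate("def solve(x):\n    return 2 * x\n")
```

Returning a dictionary rather than a single number is what enables the multi-metric steering discussed below.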
Supports optional mechanisms to make this evaluation more flexible and efficient:
Evaluation cascade (hypothesis testing): the user can specify ensembles of test cases of increasing difficulty, such that new solutions are evaluated on the next stage only if they achieve sufficiently promising results in all earlier stages. This helps to prune out less promising solutions more quickly. Moreover, new solutions are initially evaluated on a small scale before being subjected to the main test cases, to filter out faulty programs early.
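The cascade idea can be sketched as a list of (evaluator, threshold) stages of increasing cost, where a candidate is only promoted to the next stage if it clears the current one (the stage contents below are toy assumptions):

```python
# Sketch of an evaluation cascade: cheap stages first, prune on failure.
# The stages and thresholds are illustrative assumptions.
def cascade(candidate, stages):
    """stages: list of (evaluate_fn, threshold) pairs in increasing cost."""
    scores = {}
    for i, (evaluate_fn, threshold) in enumerate(stages):
        score = evaluate_fn(candidate)
        scores[f"stage_{i}"] = score
        if score < threshold:
            return scores, False  # pruned early; later stages never run
    return scores, True

stages = [
    (lambda f: float(f(1) == 2), 1.0),                        # cheap smoke test
    (lambda f: sum(f(x) == 2 * x for x in range(100)) / 100, 0.95),  # main test
]
scores, promoted = cascade(lambda x: 2 * x, stages)
```

This mirrors the paper's point that faulty or weak programs should consume as little of the evaluation budget as possible.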
LLM-generated feedback: in some applications, desirable solutions have certain characteristics that are difficult to capture precisely in the user-provided evaluation function ℎ; for example, simplicity of the discovered program. These properties can be graded using separate LLM calls and added to the dictionary of scores to steer evolution, or they can be used to discard solutions when a criterion is not fulfilled.
Handling multiple scores:
While in multiple applications we genuinely care about developing solutions for multiple evaluation metrics (or one solution that is strong on all of them simultaneously), we find that even if one metric is of particular interest, optimizing for multiple metrics often improves results for the single target metric.
programs excelling under different evaluation criteria often possess distinct structures or logic and, by incorporating examples of these diverse, high-performing programs—each representing a different definition of “good”—into the prompts provided to the language model, we can stimulate the generation of more varied candidate solutions, increasing the chances of discovering novel approaches that are highly effective for the target metric.
Evolution
AlphaEvolve uses LLMs to automate the construction of these operators—it leverages the LLM’s world knowledge to mutate programs without the need to pre-define a set of allowed mutation operations.
Maintain a database of previous solutions with respective evaluation scores using MAP-Elites.
During its evolutionary procedure, AlphaEvolve continually generates a growing number of solutions with evaluation results (scores and program outputs) attached to them. These solutions are stored in an evolutionary database
The primary goal of this is to optimally resurface previously explored ideas in future generations.
So how is it actually implemented?
In AlphaEvolve, the evolutionary database implements an algorithm that is inspired by a combination of the MAP-Elites algorithm [74] and island-based population models [83, 97].
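The paper does not give implementation details, but the combination can be sketched as a set of islands, each holding a MAP-Elites-style grid where a cell (keyed by some behavioral feature) only keeps its best-scoring program. Everything below — class name, feature keys, sampling rule — is my illustrative assumption:

```python
# Toy sketch of a MAP-Elites + islands database. All details are assumptions.
import random

class EvolutionaryDatabase:
    def __init__(self, num_islands=4, rng=None):
        # Each island evolves semi-independently, preserving diversity.
        self.islands = [{} for _ in range(num_islands)]
        self.rng = rng or random.Random(0)

    def add(self, island, feature_cell, program, score):
        """Keep a program only if it beats the current elite of its cell."""
        grid = self.islands[island]
        best = grid.get(feature_cell)
        if best is None or score > best[1]:
            grid[feature_cell] = (program, score)

    def sample_parent(self, island):
        """Resurface a previously explored elite to seed the next prompt."""
        grid = self.islands[island]
        return self.rng.choice(list(grid.values())) if grid else None

db = EvolutionaryDatabase()
db.add(0, feature_cell=("short",), program="v1", score=0.4)
db.add(0, feature_cell=("short",), program="v2", score=0.9)  # displaces v1
db.add(0, feature_cell=("long",), program="v3", score=0.2)   # kept: new cell
```

The cell structure is what lets weak-but-different programs (like `v3`) survive alongside strong ones, serving the resurfacing goal mentioned above.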
Ablations
Components considered in ablation study:
- Evolutionary approach: the alternative "repeatedly feeds the same initial program to the language model. We refer to this approach as 'No evolution'."
- Context in prompts: "To test the importance of context, we consider an alternative approach where no explicit context is added to the prompt. We refer to this approach as 'No context in the prompt'."
- Meta prompts: "AlphaEvolve also uses meta prompts in order to improve the prompts that are provided to the language model. This allows it to potentially surpass the performance one can obtain using a human prompter."
- Full-file evolution: "Unlike previous approaches such as FunSearch, AlphaEvolve can evolve an entire codebase instead of focusing on a single function. To test the importance of full-file evolution, we consider an alternative in the context of tensor decomposition where only the loss function is evolved."
- Powerful language models: "AlphaEvolve relies on a mixture of small and large language models in order to obtain highly diverse samples. To understand the importance of this component, we consider an alternative where only a single small base model is used."
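The small/large model mixture from the last ablation is essentially a biased coin flip per sample; a trivial sketch (the model names and the 80/20 ratio are pure assumptions for illustration):

```python
# Sketch of mixing a fast small model with a stronger large one.
# Model names and the mixing ratio are illustrative assumptions.
import random

def pick_model(rng, p_small=0.8):
    # Small model: cheap, high-throughput samples; large model: stronger ones.
    return "small-fast-model" if rng.random() < p_small else "large-strong-model"

rng = random.Random(42)
draws = [pick_model(rng) for _ in range(1000)]
```

The point of the mixture is throughput-weighted diversity: most samples come cheap, while occasional large-model calls inject higher-quality mutations.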
They show all of these contribute significantly.
Curiously though, they don't ablate on the "diff generation rather than rewrite" part which I think is doing heavy lifting as it preserves the entire history of how a solution has been arrived at, a rich source of information for what works and what doesn't.
Related Work
Surina et al. [96] complement the evolution process by continuously finetuning the LLM through reinforcement learning.
Grayeli et al. [35] enhance the evolution process with an LLM-directed concept learning step that summarizes high-performing programs in the pool into natural language.
While AI Co-Scientist represents scientific hypotheses and their evaluation criteria in natural language, AlphaEvolve focuses on evolving code, and directs evolution using programmatic evaluation functions. This choice enables us to substantially sidestep LLM hallucinations, which allows AlphaEvolve to carry on the evolution process for a large number of time steps.
it is possible in principle to combine the two approaches, leading to a method that allows a flexible combination of natural-language and programmatic idioms.
AlphaEvolve can also be described as a method for "code superoptimization", a line of work going back to the 1980s. It is also part of the wider "AI for Science" stream.
There are numerous recent LLM-based methods that target scientific problems in multiple disciplines, such as materials science [45, 71, 94, 119], chemistry [12, 64], bioinformatics [67, 85], geoscience [79], and quantum physics [30, 78] (for surveys on the topic, see [36, 65, 81]).