Mathematical Exploration and Discovery at Scale (with AlphaEvolve)

AlphaEvolve isn't suited to every type of mathematical problem, but its unprecedented scale is its greatest strength.

We have found that AlphaEvolve excels at discovering constructions that were already within reach of current mathematics, but had not yet been discovered due to the amount of time and effort required to find the right combination of standard ideas that works well for a particular problem. On the other hand, for problems where genuinely new, deep insights are required to make progress, AlphaEvolve is likely not the right tool to use.

Its power comes from meta-level evolution: it does not just optimize a single program, it can optimize over the space of techniques.

This meta-level evolution represents a new form of recursion where the optimization process itself becomes the object of optimization. For example, AlphaEvolve might evolve a program that uses a set of heuristics, a SAT solver, a second-order method without convergence guarantees, or a combination of these. This hierarchical approach is particularly evident in AlphaEvolve’s treatment of complex mathematical problems (suggested by the user), where the system often discovers specialized search heuristics for different phases of the optimization process. Early-stage heuristics excel at making large improvements from random or simple initial states, while later-stage heuristics focus on fine-tuning near-optimal configurations. This emergent specialization mirrors the intuitive approaches employed by human mathematicians.
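To make the phase specialization concrete, here is a minimal, self-contained sketch. The binary-string objective and both heuristics are illustrative assumptions, not AlphaEvolve's actual output; the point is only the shape of an "evolved program" that chains an early-stage random-restart heuristic with a late-stage local-refinement heuristic.

```python
import random

def score_fn(bits):
    """Toy objective (an illustrative assumption): length of the longest
    run of 1s in a binary string; higher is better."""
    best = run = 0
    for b in bits:
        run = run + 1 if b else 0
        best = max(best, run)
    return best

def early_stage(n, tries):
    """Early-stage heuristic: large improvements via random restarts."""
    return max(([random.randint(0, 1) for _ in range(n)] for _ in range(tries)),
               key=score_fn)

def late_stage(bits, steps):
    """Late-stage heuristic: fine-tuning via single-bit flips."""
    bits = list(bits)
    for _ in range(steps):
        i = random.randrange(len(bits))
        flipped = bits[:i] + [1 - bits[i]] + bits[i + 1:]
        if score_fn(flipped) >= score_fn(bits):  # never lose ground
            bits = flipped
    return bits

def evolved_program(n=32):
    """The kind of pipeline that might emerge: phase-specialized
    heuristics chained together."""
    candidate = early_stage(n, tries=50)
    return late_stage(candidate, steps=500)
```

The late-stage heuristic only accepts non-worsening moves, so the score is monotone non-decreasing after the restart phase hands over its best candidate.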

Search in program space is not only extremely powerful, but itself is a great prior.

Searching in program space might act as a powerful prior for simplicity and structure, helping us navigate away from messy local maxima towards elegant, often optimal, solutions.

Search Mode

A very powerful mode of operation for AlphaEvolve is "search mode", where the evolved program is itself a search program.

Still, for problems where the scoring function is cheap to compute, the sheer brute-force advantage of traditional methods can be hard to overcome. Our proposed solution to this problem is as follows. Instead of evolving programs that directly generate a construction, AlphaEvolve evolves programs that search for a construction.

The performance metric then becomes not just the best-scoring solution found by this search, but the search's ability to improve on the best-scoring solution found so far! This creates a dynamic and adaptive search process in which exploration and exploitation happen naturally.

Each program in AlphaEvolve’s population is a search heuristic. It is given a fixed time budget (say, 100 seconds) and tasked with finding the best possible construction within that time. The score of the heuristic is the score of the best object it finds. This resolves the speed disparity: a single, slow LLM call to generate a new search heuristic can trigger a massive cheap computation, where that heuristic explores millions of candidate constructions on its own.
We emphasize that the search does not have to start from scratch each time. Instead, a new heuristic is evaluated on its ability to improve the best construction found so far.
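Concretely, the evaluation loop in search mode might look like the following sketch. The one-dimensional objective and the Gaussian-step heuristic are toy stand-ins for a real scoring function and an LLM-evolved heuristic; the time-budget and warm-start structure are the point.

```python
import random
import time

def score_fn(x):
    """Toy objective (an illustrative assumption): higher is better,
    with a unique peak at x = 3."""
    return -(x - 3.0) ** 2

def gaussian_step(x):
    """Stand-in for an LLM-evolved search heuristic: propose a small
    random perturbation of the current construction."""
    return x + random.gauss(0.0, 0.5)

def evaluate_heuristic(heuristic, best_so_far, budget_seconds):
    """Score an evolved heuristic by the best construction it finds within
    a fixed time budget, warm-starting from the best construction so far."""
    deadline = time.monotonic() + budget_seconds
    best, best_score = best_so_far, score_fn(best_so_far)
    while time.monotonic() < deadline:
        candidate = heuristic(best)   # one cheap search step
        s = score_fn(candidate)
        if s > best_score:            # keep improvements only
            best, best_score = candidate, s
    return best, best_score

random.seed(42)  # reproducibility of the demo run
best, score = evaluate_heuristic(gaussian_step, best_so_far=0.0,
                                 budget_seconds=0.1)
```

A single expensive LLM call produces `heuristic`; the loop then executes it on millions of cheap candidates, which is where the speed disparity with brute-force methods is recovered.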

This is an extremely efficient way of utilizing LLMs for search.

The evaluator gives one of these evolved heuristics a fixed time budget and scores it based on the quality of the best construction it can find in that time. This method turns the expensive, creative power of the LLM towards designing efficient search strategies, which can then be executed cheaply and at scale. This allows AlphaEvolve to effectively navigate vast and complex mathematical landscapes, discovering the novel constructions we detail in this paper.

Generalizer Mode

Instead of solving for a single n, have the process find solutions for any n. This is another finding showing that Quality Diversity is very important for maintaining stepping stones toward globally optimal solutions.

We have experimented with a more ambitious generalizer mode. Here, we tasked AlphaEvolve with writing a program that can solve the problem for any given n. We evaluate the program based on its performance across a range of n values. The hope is that by seeing its own (often optimal) solutions for small n, AlphaEvolve can spot a pattern and generalize it into a construction that works for all n.
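A toy sketch of generalizer-mode evaluation follows. The problem (largest subset of {0, ..., n-1} with no two consecutive elements) and the closed-form "evolved" program are illustrative assumptions; what matters is that a single program is scored by aggregating over a whole range of n rather than on one instance.

```python
def score_fn(subset, n):
    """Toy problem (an illustrative assumption): size of a subset of
    {0, ..., n-1} with no two consecutive elements; invalid subsets score 0."""
    s = sorted(subset)
    if any(b - a == 1 for a, b in zip(s, s[1:])):
        return 0
    return len(s)

def evaluate_generalizer(program, ns):
    """Generalizer mode: score one program across a whole range of n."""
    return sum(score_fn(program(n), n) for n in ns)

def evolved_generalizer(n):
    """The kind of closed-form pattern AlphaEvolve might extract from its
    own optimal small-n solutions: take every other element."""
    return list(range(0, n, 2))

total = evaluate_generalizer(evolved_generalizer, ns=range(1, 11))
```

Here the pattern "take every other element" achieves the optimum ceil(n/2) for every n, which is exactly the kind of generalization from small-n solutions the mode is designed to elicit.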

This mode created solutions that led to TWO Terence Tao papers!

This mode is more challenging, but it has produced some of our most exciting results. In one case, AlphaEvolve’s proposed construction for the Nikodym problem (see Problem 6.1) inspired a new paper by the third author [281]. On the other hand, when using the search mode, the evolved programs cannot easily be interpreted. Still, the final constructions themselves can be analyzed, and in the case of the arithmetic Kakeya problem (Problem 6.30) they inspired another paper by the third author.

The Trade-off Between Speed of Discovery and Evaluation Cost

Increasing the number of parallel threads significantly accelerates time-to-discovery, but at a higher total cost.

Increasing the number of parallel threads significantly accelerated the time-to-discovery. Runs with 20 threads consistently surpassed the state-of-the-art bound much faster than those with 2 threads. However, this speed comes at a higher total cost. Since each thread operates semi-independently and makes its own calls to the LLM to generate new heuristics, doubling the threads roughly doubles the rate of LLM queries. The optimal strategy depends on the researcher’s priority: for rapid exploration, high parallelism is effective; for minimizing direct costs, fewer threads over a longer period is the more economical choice.
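A back-of-envelope model of this trade-off can be sketched as follows. All constants, and the sublinear-speedup exponent, are made-up illustrative assumptions, not measurements from the experiments.

```python
def hours_to_discovery(threads, base_hours=100.0, scaling=0.8):
    """Illustrative assumption: parallelism yields a sublinear speedup
    in wall-clock time to reach the target bound."""
    return base_hours / threads ** scaling

def total_cost(threads, queries_per_thread_hour=60, usd_per_query=0.01):
    """Each thread makes its own LLM calls, so the query rate scales
    linearly with thread count (constants are hypothetical)."""
    return (threads * hours_to_discovery(threads)
            * queries_per_thread_hour * usd_per_query)
```

With these made-up constants, 20 threads reach the target roughly six times faster than 2 threads, but the total spend is noticeably higher, matching the qualitative trade-off described above.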


The Role of Model Choice: Large vs. Cheap LLMs

The best LLM is NOT always the best strategy; it depends on the problem.

We observed that the more capable LLM tends to produce higher-quality suggestions (see Figure 2), often leading to better scores with fewer evolutionary steps. However, the most effective strategy was not always to use the most powerful model exclusively. For this simple autocorrelation problem, the most cost-effective strategy to beat the literature bound was to use the cheapest model across many runs. The total LLM cost for this was remarkably low: a few USD. However, for the more difficult problem of Nikodym sets (see Problem 6.1), the cheap model was not able to get the most elaborate constructions.

In fact it's sometimes worse!

We also observed that an experiment using only high-end models can sometimes perform worse than a run that occasionally used cheaper models as well.

Again, diversity is key.

One explanation for this is that different models might suggest very different approaches; even though a worse model generally suggests lower-quality ideas, it does add variance. This suggests a potential benefit to injecting a degree of randomness or “naive creativity” into the evolutionary process.
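One simple way to realize such a mix is a sampling policy over models. The model names and the probability below are hypothetical, purely to illustrate the idea of occasionally routing queries to a cheaper model for variance.

```python
import random

def pick_model(rng, p_cheap=0.3):
    """Mixed-model policy (model names and probability are hypothetical):
    mostly query the strong model, but occasionally the cheap one to
    inject variance into the pool of suggested approaches."""
    return "cheap" if rng.random() < p_cheap else "strong"

rng = random.Random(0)
draws = [pick_model(rng) for _ in range(10_000)]
```

In a real run, `p_cheap` would itself be a tunable knob trading suggestion quality against diversity and cost.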


Verifier Insights

Verifier-hacking remains a possibility.

We have found that the selection of the verifier is a critical component that significantly influences the system’s performance and the quality of the discovered results. For example, the optimizer is sometimes drawn towards more stable but trivial solutions, which we want to avoid. Designing a clever verifier that avoids this behavior is key to discovering new results.

During our experiments, we also observed a “cheating phenomenon”, where the system would find loopholes in the verifier or exploit its artifacts rather than making genuine progress.

Continuous fitness scores are better than discrete.

Similarly, employing continuous (as opposed to discrete) loss functions proved to be a more effective strategy for guiding the evolutionary search in some cases. Using a continuous scoring function based on distances led to a more successful and faster optimization process.
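A minimal illustration of why continuous scores help (the integer target and the greedy loop are toy assumptions): with a discrete 0/1 score, local search gets no signal from any single move and never leaves the start, while a distance-based score guides it straight to the target.

```python
import random

TARGET = 10  # hypothetical toy target

def discrete_score(x):
    """Discrete fitness: full credit only at the target, zero elsewhere."""
    return 1.0 if x == TARGET else 0.0

def continuous_score(x):
    """Continuous fitness: graded by distance, so every step has a signal."""
    return -abs(x - TARGET)

def hill_climb(score, start=0, steps=200, seed=0):
    """Greedy local search: accept a +/-1 move only if it strictly
    improves the score."""
    rng = random.Random(seed)
    x = start
    for _ in range(steps):
        cand = x + rng.choice([-1, 1])
        if score(cand) > score(x):
            x = cand
    return x
```

Under the discrete score, every candidate move from 0 scores exactly 0, so no move is ever accepted; under the continuous score, every step towards the target is accepted and the search converges.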

Prompting Insights

The prompt, and the experience of the person writing it, matter a lot, unfortunately; the system is not fully autonomous.

Moreover, AlphaEvolve has always performed much better in the hands of a subject expert in the particular problem being attempted than in the hands of a non-expert: we have found that the advice one gives to AlphaEvolve in the prompt has a significant impact on the quality of the final construction. Giving AlphaEvolve an insightful piece of expert advice in the prompt almost always led to significantly better results; AlphaEvolve will simply try to squeeze the most out of the advice it was given, while retaining the gist of the original advice. We stress that, in general, it was the combination of human expertise and the computational capabilities of AlphaEvolve that led to the best results overall.

More data in the prompt is NOT always good.

Having access to a large amount of data does not necessarily imply better generalization performance. Instead, when we were looking for interpretable programs that generalize across a wide range of parameters, we constrained AlphaEvolve's access to data by showing it the previous best solutions only for small values of n.

This “less is more” approach appears to encourage the emergence of more fundamental ideas.

Related and correlated problems within a single experiment help.

Results also improve significantly when the system works on correlated problems, or a family of related problem instances, within a single experiment.