Towards an AI co-scientist
Mirror scientific process with LLM agents:
The system’s design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute.
Key contributions:
- multi-agent architecture with an asynchronous task execution framework for flexible compute scaling;
- a tournament evolution process for self-improving hypothesis generation.
Collection of several ideas:
Underpinning the system are thinking and reasoning steps—notably a self-play based scientific debate step for generating novel research hypotheses; tournaments that compare and rank hypotheses via the process of finding win and loss patterns, and a hypothesis evolution process to improve their quality. Finally, the agentic nature of the system enables it to recursively self-critique its output and use tools such as web-search to provide itself with feedback to iteratively refine its hypotheses and research proposals.
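The tournament ranking above is Elo-based. The excerpt doesn't give the exact rating parameters, so here is a minimal sketch using the standard Elo update; the K-factor and initial rating of 1200 are illustrative assumptions, not values from the paper.

```python
# Standard Elo update for a pairwise hypothesis tournament match.
# K-factor and initial ratings are illustrative, not from the paper.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that hypothesis A beats hypothesis B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return both hypotheses' updated ratings after a single debate match."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# Example: two hypotheses start at 1200; A wins the scientific debate.
ra, rb = elo_update(1200.0, 1200.0, a_won=True)  # ra rises to 1216, rb falls to 1184
```

Repeated matches like this let win/loss patterns accumulate into a ranking without any gradient updates.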
They define the criteria for hypotheses and research proposal generation as below:
- Alignment with the provided research goal. The generated outputs must precisely align with the research goals, preferences, and constraints defined by the scientist.
- Plausibility. The system outputs should be free of readily apparent flaws. Any potential contradictions with prior literature or established knowledge must be explicitly stated and justified.
- Novelty. A key objective of the co-scientist system is to generate novel hypotheses, conjectures, and research plans grounded in prior literature, rather than simply synthesizing existing information (a capability already addressed by existing “deep research” tools).
- Testability. The system outputs should be amenable to empirical validation within the constraints specified by the scientist.
- Safety. The system outputs will be controlled to prevent enabling unsafe, unethical, or harmful research.
Architecture and Specialized Agents
It seems like they opt for a "supervisor agent" architecture:
Based on the research plan configuration, the Supervisor agent initiates the creation of a task queue and begins orchestrating the specialized agents. The system operates continuously and asynchronously. Periodically, the Supervisor agent calculates a comprehensive set of summary statistics, reflecting the system’s state and progress toward the specified research goal. These statistics inform decisions regarding resource allocation and the determination of whether a terminal state for the overall computation has been reached. The state is periodically written to the associated context memory of the system and leveraged as feedback in subsequent rounds of computation. It also enables easy restarts in case of any failure in the system components.
Then there are 6 different specialized agents (system prompts) that have tool access.
Generation agent curates an initial list of research hypotheses satisfying a research goal. These are then reviewed by the Reflection agent and evaluated in a tournament by the Ranking agent. The Evolution, Proximity, and Meta-review agents operate on the tournament state to help improve the quality of the system outputs.
The Supervisor agent periodically computes and writes to the context memory, a comprehensive suite of statistics, including the number of hypotheses generated and requiring review, and the progress of the tournament. These statistics also include analyses of the effectiveness of different hypothesis generation methodologies (e.g., generating new ideas via the Generation agent vs. improving existing ideas via the Evolution agent). Based on these statistics, the Supervisor agent then orchestrates subsequent system operations, i.e., generating new hypotheses, reviews, tournaments, and improvements to existing hypotheses, by strategically weighting and sampling the specialized agents for execution via the worker processes.
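The orchestration described above can be sketched as a loop that tracks summary statistics and samples the next agent to run with weights derived from those statistics. The agent names follow the paper, but the weighting heuristics and function names here are invented for illustration.

```python
# Minimal sketch of the Supervisor loop: maintain a task queue and sample
# which specialized agent to schedule next, biased by system state.
# The weighting heuristics are invented, not from the paper.
import random
from collections import deque

AGENTS = ["generation", "reflection", "ranking", "evolution", "proximity", "meta_review"]

def supervisor_step(queue: deque, stats: dict, rng: random.Random) -> str:
    """Pick the next agent to schedule, weighted by summary statistics."""
    weights = {a: 1.0 for a in AGENTS}
    # Hypothetical heuristics: favor Reflection when reviews are backlogged,
    # favor Evolution once the tournament has enough ranked hypotheses.
    weights["reflection"] += stats.get("unreviewed", 0)
    weights["evolution"] += stats.get("ranked", 0) * 0.5
    agent = rng.choices(AGENTS, weights=[weights[a] for a in AGENTS])[0]
    queue.append(agent)
    return agent

rng = random.Random(0)
queue = deque()
chosen = supervisor_step(queue, {"unreviewed": 10, "ranked": 4}, rng)
```

In the real system the statistics also feed termination decisions and are checkpointed to context memory; this sketch covers only the weighted-sampling step.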
Importantly, the Meta-review agent enables feedback propagation and learning without back-propagation techniques (e.g., fine-tuning or reinforcement learning) [65]. The Meta-review agent generates feedback applicable to all agents, which is simply appended to their prompts in the next iteration—a capability facilitated by the long-context search and reasoning capabilities of the underlying Gemini 2.0 models. Through this feedback loop, the co-scientist continuously learns and improves in subsequent iterations with more compute scaling.
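Mechanically, this "learning without back-propagation" is just prompt concatenation: each agent's next-round prompt carries the meta-review critique. A minimal sketch, with the prompt text and helper name being illustrative assumptions:

```python
# Sketch of the no-backprop feedback loop: the Meta-review agent's critiques
# are appended to every agent's prompt for the next iteration.
# Prompt wording and the helper name are illustrative, not from the paper.

def build_prompt(base_prompt: str, meta_review_feedback: list[str]) -> str:
    """Append accumulated meta-review critiques to an agent's base prompt."""
    if not meta_review_feedback:
        return base_prompt
    feedback_block = "\n".join(f"- {item}" for item in meta_review_feedback)
    return f"{base_prompt}\n\nLessons from previous tournament rounds:\n{feedback_block}"

prompt = build_prompt(
    "You are the Generation agent. Propose novel hypotheses for the research goal.",
    ["Cite prior literature explicitly.", "State testable predictions."],
)
```

Because the critiques only ever grow the prompt, this scheme leans on the long-context capability of the underlying model rather than on any weight updates.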
Tournament states and reasoning patterns are used quite heavily!
The Reflection agent adapts its full reviews based on the co-scientist’s growing knowledge. By analyzing reviewed hypotheses and results of the tournament conducted by the Ranking agent, the Reflection agent identifies recurring issues and improvement opportunities, refining its reviews accordingly.
The Proximity agent is a curious addition that clusters the hypotheses and informs pairings in tournaments.
The Proximity agent calculates the similarity between research hypotheses and proposals, and builds a proximity graph, taking into account the specific research goal.
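A proximity graph like this can be sketched as pairwise similarity over hypothesis embeddings, keeping edges above a threshold. The bag-of-words embedding and threshold below are toy stand-ins; a real system would use a learned text embedding model.

```python
# Sketch of a Proximity agent: embed hypotheses, compute pairwise cosine
# similarity, and keep edges above a threshold to form a proximity graph.
# The bag-of-words embedding and 0.3 threshold are toy assumptions.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def proximity_graph(hypotheses: list[str], threshold: float = 0.3) -> list[tuple[int, int, float]]:
    """Return edges (i, j, similarity) between sufficiently similar hypotheses."""
    vecs = [Counter(h.lower().split()) for h in hypotheses]
    edges = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            sim = cosine(vecs[i], vecs[j])
            if sim >= threshold:
                edges.append((i, j, sim))
    return edges

edges = proximity_graph([
    "drug repurposing for liver fibrosis",
    "drug repurposing for cardiac fibrosis",
    "novel antibiotic targets in gram-negative bacteria",
])
# The two fibrosis hypotheses are linked; the antibiotic one stays isolated.
```

Pairing nearby hypotheses in tournament matches makes the debates more informative, since the Ranking agent is comparing genuinely competing ideas.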
The Evolution agent perhaps does the heaviest lifting, creating new hypotheses from current ones using lots of different "scientific method" techniques.
And they claim the Meta-review agent is critical! It has three main duties:
- Provides feedback to the Reflection agent:
The Meta-review agent plays a crucial role in the co-scientist’s feedback loop, enabling self-improvement in scientific reasoning. This agent operates on the tournament state and summarizes common patterns identified in reviews and scientific debates in the tournament matches into a meta-review critique. By synthesizing insights from all reviews, the meta-review provides valuable feedback to the Reflection agent, leading to more thorough and reliable future reviews. This helps prevent oversight of critical details.
- Generates a periodic research overview, which they say is pretty important:
The Meta-review agent periodically synthesizes top-ranked hypotheses into a research overview, providing a roadmap for future research. This overview outlines potential research areas and directions relevant to the research goal, justifying their importance and suggesting specific experiments within each. Each area includes illustrative example topics. The research overview also serves as an additional input to the Generation agent in subsequent iterations.
- Identifies other potential collaborator scientists.
Validation of Elo ratings
I find this quite interesting. It's hard to show quantitatively that Elo ratings are a good proxy for hypothesis quality.
To assess this, we analyzed the concordance between the Elo rating and the system’s accuracy on the GPQA benchmark dataset. Ideally, higher Elo ratings should correlate with a higher probability of correct answers.
They use concordance. According to Google, it's a fairly loose concept, but there are concrete tools like concordant and discordant pairs, the concordance index, the concordance correlation coefficient (CCC), etc.
concordance refers to the degree of agreement or consistency between two or more sets of measurements, rankings, or predictions.
Their analysis still uses some sort of accuracy measure instead of pure correlation, which makes sense for categorical/non-numeric ground truth.
For each question, we first compared each co-scientist response against the ground truth answer to evaluate its correctness. Then, we categorized all generated responses across all considered questions based on their Elo rating into discrete buckets: Elo rating of 1001-1050, 1051-1100, 1101-1150, etc. in 50 point increments, until the highest rating achieved. Finally, we calculated the average accuracy for each Elo rating bucket, as the percentage of correct responses within each bucket.
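The bucketing procedure above is straightforward to sketch: group responses by Elo rating in 50-point increments and compute the fraction of correct answers per bucket. The data values below are made up; only the bucketing scheme follows the quoted description.

```python
# Sketch of the Elo-vs-accuracy concordance analysis: bucket responses by
# Elo rating in 50-point increments (1001-1050, 1051-1100, ...) and compute
# the fraction of correct answers per bucket. Data values are made up.

def accuracy_by_elo_bucket(responses: list[tuple[float, bool]], width: int = 50) -> dict:
    """responses: (elo_rating, is_correct) pairs. Returns {bucket_start: accuracy}."""
    buckets: dict[int, list[bool]] = {}
    for elo, correct in responses:
        # e.g. a rating of 1050 falls in bucket 1001 (1001-1050),
        # while 1051 falls in bucket 1051 (1051-1100).
        start = int((elo - 1) // width) * width + 1
        buckets.setdefault(start, []).append(correct)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}

acc = accuracy_by_elo_bucket([
    (1010, False), (1040, True),   # 1001-1050 bucket: 50% correct
    (1060, True), (1090, True),    # 1051-1100 bucket: 100% correct
])
```

If the resulting per-bucket accuracies increase with the bucket's rating range, the Elo rating is concordant with correctness on the benchmark.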