Automated Design of Agentic Systems (ADAS)

Inspiration: ML history shows everything is learned eventually.

The history of machine learning reveals a recurring theme: manually created artifacts are eventually replaced by learned, more efficient solutions (Clune, 2019) as we get more compute and data.

For example, the current best-performing CNN models come from Neural Architecture Search (Elsken et al., 2019; Shen et al., 2023) instead of manual design; in LLM alignment, learned loss functions (Lu et al., 2024a) outperform most hand-designed ones such as DPO (Rafailov et al., 2024); The AI Scientist (Lu et al., 2024b) demonstrates an automated research pipeline, including the development of novel ML algorithms; and an endless number of robotics learning environments can be automatically generated in works like OMNI-EPIC (Faldor et al., 2024), which demonstrate surprising creativity in generated environments and allow more efficient environment creation than the manual approach.

Define entire agentic systems in code and try to improve them with a diversity archive. A proven recipe, very similar to the Darwin Gödel Machine (DGM).

We can define the entire agentic system in code and new agents can be automatically discovered by a “meta” agent programming ever better ones in code.

The core concept of Meta Agent Search is to instruct a meta agent to iteratively create interestingly new agents, evaluate them, add them to an archive that stores discovered agents, and use this archive to help the meta agent in subsequent iterations create yet more interestingly new agents.
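The loop above can be sketched in a few lines. This is a toy illustration, not the paper's actual implementation: `propose` and `evaluate` are hypothetical stand-ins for the meta agent's LLM call (which writes a new agent as code, conditioned on the archive) and the benchmark evaluation.

```python
def meta_agent_search(propose, evaluate, iterations=10):
    """Toy sketch of the Meta Agent Search loop (hypothetical helpers,
    not the paper's API). Every discovered agent is archived, and the
    archive conditions the meta agent's next proposal."""
    archive = []  # (agent_code, score) for every agent discovered so far
    for _ in range(iterations):
        agent_code = propose(archive)        # meta agent writes a new agent in code
        score = evaluate(agent_code)         # run it on validation tasks
        archive.append((agent_code, score))  # keep it, good or bad
    return max(archive, key=lambda entry: entry[1])  # best agent found

# Stand-ins for illustration: a "proposal" is just a versioned string,
# and its "score" is read back out of the name.
best = meta_agent_search(
    propose=lambda archive: f"agent_v{len(archive)}",
    evaluate=lambda code: int(code.rsplit("v", 1)[1]),
)
# best == ("agent_v9", 9)
```

The key design choice mirrored here is that the archive keeps *all* discovered agents, not just the best one, so the meta agent can draw on diverse prior designs rather than hill-climbing a single lineage.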

The discovered agents substantially outperform SOTA hand-designed baselines, on benchmarks including Chollet's ARC and math tasks.

For instance, our agents improve F1 scores on reading comprehension tasks in DROP (Dua et al., 2019) by 13.6/100 and accuracy rates on math tasks in MGSM (Shi et al., 2023) by 14.4%. Additionally, they improve accuracy over baselines by 25.9% and 13.2% on GSM8K (Cobbe et al., 2021) and GSM-Hard (Gao et al., 2023) math tasks, respectively, after transferring across domains.

Baselines are:

(1) Chain-of-Thought (COT, Wei et al. (2022)), which instructs the agent to output the reasoning before answering to improve complex problem-solving through intermediate steps; (2) Self-Consistency with Chain-of-Thought (COT-SC, Wang et al. (2023b)), which ensembles multiple parallel answers from COT to produce a more accurate answer; (3) Self-Refine (Madaan et al., 2024; Shinn et al., 2023), which allows iterative self-reflection to correct mistakes made in previous attempts; (4) LLM-Debate (Du et al., 2023), which enables different LLMs to debate with each other, leveraging diverse perspectives to find better answers; (5) Quality-Diversity, a simplified version of Intelligent Go-Explore.
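Of these baselines, COT-SC is the easiest to pin down concretely: sample several chain-of-thought answers in parallel and keep the majority answer. A minimal sketch, where `sample_answer` is a hypothetical stand-in for one CoT-prompted LLM call:

```python
from collections import Counter

def self_consistency(sample_answer, n=5):
    """Toy sketch of Self-Consistency with CoT (COT-SC): draw n
    independent chain-of-thought samples and return the most common
    final answer. `sample_answer` is a hypothetical stand-in for an
    LLM call, keyed by sample index."""
    answers = [sample_answer(i) for i in range(n)]
    return Counter(answers).most_common(1)[0][0]  # majority vote

# Stand-in "sampler": three of five simulated samples agree on 42,
# so the ensemble recovers 42 despite two divergent chains.
votes = {0: 42, 1: 7, 2: 42, 3: 13, 4: 42}
print(self_consistency(votes.get))  # prints 42
```

This is exactly the kind of hand-designed building block that Meta Agent Search can rediscover, recombine, or improve on automatically.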

A few more good baselines for reasoning tasks:

(1) Step-back Abstraction (Zheng et al., 2023), which instructs agents to first consider the principles involved in solving the task for better reasoning; (2) Role Assignment (Xu et al., 2023), which assigns different roles to FMs to obtain better answers. Furthermore, we compare our approach with the state-of-the-art prompt optimization baseline OPRO (Yang et al., 2024) to highlight the advantages of learning all possible components of agents rather than focusing solely on prompts.

Results are impressive:

We want to highlight the substantial gap between the learned agents and hand-designed agents in the Reading Comprehension and Math domains, with improvements in F1 scores by 13.6/100 and accuracy rates by 14.4%, respectively. While Meta Agent Search also outperforms baselines in the Multi-task and Science domains, the gap is smaller. We hypothesize that for challenging questions in the Science and Multi-task domains, the knowledge in FMs is not sufficient to solve the questions, limiting the improvement through optimizing agentic systems, which is a problem that will diminish as FMs improve.

Even outperforms prompt optimization:

Additionally, when compared to prompt optimization methods, the results demonstrate that our proposed Meta Agent Search consistently outperforms them across all domains.

Could even do higher-order ADAS:

Since the meta agent used in ADAS to program new agents in code is also an agent, ADAS can become self-referential, where the meta agent can be improved through ADAS as well. It would be an exciting direction to have a higher order of meta-learning to allow the learning of the meta agent and even the meta-meta agent, etc.