The Software That Powers Discovery
Imagine a world where the pace of scientific discovery is not limited by human effort, but accelerated by artificial intelligence. A world where the painstaking process of writing, testing, and refining the complex software that underpins modern research is handled not by teams of PhDs over months or years, but by an AI system in a matter of hours or days. This is not a distant science fiction fantasy; it is the reality presented in a groundbreaking new preprint paper from a large team of researchers at Google and other institutions.
The paper, titled "An AI system to help scientists write expert-level empirical software," details the development of a novel AI agent that leverages the creative power of Large Language Models (LLMs) and the rigorous, strategic exploration of Tree Search (TS) to automatically generate high-performing code for scientific problems.
This system isn't just a code autocomplete tool; it's a "scientific discovery engine" capable of innovating, combining research ideas from disparate fields, and producing software that consistently meets and exceeds expert-level performance across a stunning range of domains—from predicting COVID-19 hospitalizations to analyzing the neural activity of an entire zebrafish brain.
This article will provide a comprehensive, deep dive into this fascinating work. We'll unpack the problem it aims to solve, demystify the complex technology that makes it work, explore its extraordinary results across six scientific fields, and ponder the profound implications this system has for the future of scientific progress itself.
The Ubiquity and Burden of "Empirical Software"
To understand the significance of this AI system, we must first appreciate the role of software in modern science. The authors introduce a crucial term: "empirical software." This is not just any software; it is specifically designed to maximize a definable, measurable quality score. This "score" is typically a metric of how well the software's output fits existing observations or predicts new ones.
Examples of empirical software are everywhere:
- A deforestation detector that analyzes satellite imagery to assess land cover change over time.
- A protein structure prediction tool (like AlphaFold, a Nobel Prize-winning achievement rooted in empirical software).
- A climate model that parameterizes the Earth's atmosphere to forecast weather.
- A single-cell RNA sequencing pipeline that removes technical "batch effects" to allow comparisons of data from different labs.
The authors posit two key hypotheses about this software:
- Scorable tasks are ubiquitous in science. Nearly every sub-field of science, applied mathematics, and engineering now relies on software designed to maximize a quality metric.
- Empirical software for science is slow and difficult to create. Developing this software is a tedious, years-long process. When testing complex hypotheses, it becomes nearly impossible to write perfect code from first principles. Choices are often governed by intuition or expediency rather than exhaustive experimentation. This creation bottleneck severely limits the number of ideas and possibilities a scientist can productively explore.
The cycle of scientific discovery is thus frequently stuck, waiting for the manual creation of the computational tools needed to test the next great idea. This AI system aims to break that bottleneck.
The Core Innovation: LLM + Tree Search
So, how does the system work? It's a sophisticated fusion of two powerful AI paradigms: the generative creativity of LLMs and the strategic, goal-oriented search of algorithms.
- The Large Language Model (The "Ideator"): The system uses a powerful LLM (like Gemini) as its code-writing engine. You give it a prompt that includes:
- A description of the scorable task (e.g., "Predict COVID-19 hospitalizations for the next 4 weeks").
- The evaluation metric (e.g., "Weighted Interval Score (WIS)").
- Relevant data information.
- (Crucially) Optional text containing "research ideas"—summaries of methods from highly cited papers, textbooks, or search results.
The LLM's job is to generate Python code that attempts to solve this task.
- The Code Sandbox (The "Reality Check"): The generated code is then executed in a secure, sandboxed environment on real or representative data. Its performance is rigorously scored based on the predefined metric.
- Tree Search (The "Strategic Explorer"): This is where the magic happens. Instead of just generating one piece of code, the system generates many, creating a "tree" of candidate solutions. Each node on the tree is a different version of the code. A specialized Tree Search algorithm (a variant of the PUCT algorithm used in AlphaZero) is employed to decide which nodes to "explore" further. It makes this decision by balancing:
- Exploitation: Focusing on nodes (code versions) that already have high scores.
- Exploration: Sampling from nodes that haven't been explored much but might have high potential.
The algorithm selects a node and asks the LLM to rewrite that existing code, attempting to improve it. This creates a new child node. The new node's score is evaluated, and the process repeats. Over hundreds or thousands of iterations, the system intelligently hill-climbs toward better and better solutions.
This iterative refinement loop is key. Unlike tools like ChatGPT or Codex that perform "one-shot" generation, this system uses the score as a feedback signal to guide the LLM's rewriting process, creating a powerful evolutionary feedback loop where only the best code "mutations" survive.
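To make the loop concrete, here is a heavily simplified sketch of how such an LLM-guided tree search might be wired together. Every name here (the Node class, the puct_select rule, the llm_rewrite and run_in_sandbox callables) is a hypothetical illustration of the idea, not the paper's actual implementation.

```python
# Simplified sketch of an LLM-guided tree search; names are illustrative only.
import math

class Node:
    def __init__(self, code, score, parent=None):
        self.code, self.score, self.parent = code, score, parent
        self.children, self.visits = [], 0

def puct_select(nodes, c=1.0):
    # Simplified PUCT/UCB-style rule: prefer high-scoring nodes (exploitation)
    # but give under-visited nodes a chance (exploration).
    total_visits = sum(n.visits for n in nodes) + 1
    return max(nodes, key=lambda n: n.score + c * math.sqrt(total_visits) / (1 + n.visits))

def tree_search(task_prompt, llm_rewrite, run_in_sandbox, iterations=1000):
    # Root: a first attempt generated from the task description alone.
    root_code = llm_rewrite(task_prompt, previous_code=None)
    nodes = [Node(root_code, run_in_sandbox(root_code))]
    for _ in range(iterations):
        parent = puct_select(nodes)
        parent.visits += 1
        # The LLM "mutation": rewrite the selected node's code to improve it.
        child_code = llm_rewrite(task_prompt, previous_code=parent.code)
        child = Node(child_code, run_in_sandbox(child_code), parent=parent)
        parent.children.append(child)
        nodes.append(child)
    return max(nodes, key=lambda n: n.score)  # best-scoring program found
```

The only feedback the LLM ever receives is the sandbox score, which is what turns open-ended code generation into a guided search.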
The Secret Sauce: Injecting Research Ideas
A critical component that elevates the system from a good code optimizer to a true scientific innovator is its ability to ingest and utilize external knowledge. The system can be prompted with research ideas from a multitude of sources:
- Human Guidance: A scientist can directly provide a summary of a method from a paper.
- Automated Literature Review: The system can use tools like Gemini Deep Research or an "AI Co-Scientist" to automatically scour scientific literature, understand new concepts, and generate novel ideas for solving the problem.
- Idea Recombination: In one of its most powerful modes, the system can take two existing successful methods it has generated (e.g., Method A and Method B), use an LLM to analyze their core principles and differences, and then explicitly prompt itself to create a hybrid strategy that combines the best of both approaches.
This ability to read, understand, and recombine scientific concepts allows the AI to make leaps that would be time-consuming and non-obvious even for human experts.
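A minimal sketch of what such a recombination step could look like in practice is shown below; the prompt wording and the llm callable are assumptions for illustration, not the paper's actual interface.

```python
# Hypothetical sketch of an idea-recombination prompt; wording and `llm` are
# illustrative assumptions, not the paper's interface.
RECOMBINATION_PROMPT = """\
You are given two methods that both solve the task below.

Task: {task}

Method A:
{code_a}

Method B:
{code_b}

First, summarise the core principle of each method and how they differ.
Then write a single new Python program that combines the strengths of both.
"""

def recombine(llm, task, code_a, code_b):
    # The returned hybrid program is then scored in the sandbox like any other candidate.
    return llm(RECOMBINATION_PROMPT.format(task=task, code_a=code_a, code_b=code_b))
```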
Benchmarking on Kaggle – Proving the Core Concept
Before tackling grand scientific challenges, the team first proved their system's mettle on a more constrained but highly competitive playground: Kaggle competitions.
Kaggle's "Playground" competitions are ideal for testing because they offer fast iteration, clear leaderboards ranked by percentile against thousands of human data scientists, and tasks that require complex code without the overhead of deep scientific domain knowledge.
The results were clear (see Fig. 1b in the paper):
- A single call to the LLM performed decently but was not elite.
- Taking the "best of 1000" random LLM calls was better, but still relied on brute force.
- Tree Search (TS) substantially outperformed both, demonstrating that strategic exploration is far more efficient and effective than random generation.
- TS with expert advice (e.g., "try model ensembling") and TS with specific algorithmic guidance (e.g., "implement a boosted decision tree from scratch") performed even better, showing the system's ability to effectively integrate high-level strategic advice.
This success on Kaggle provided the validation needed to apply the method to more impactful, real-world scientific problems.
Having proven its core functionality on Kaggle competitions, the research team unleashed their AI system on a carefully selected set of six "scorable tasks" from across the scientific landscape.
The selection criteria were twofold: the tasks had to be important to practicing scientists (including the co-authors themselves) and areas where progress had been historically slow. The results were not merely incremental; they were, in many cases, state-of-the-art.
Case Study 1: Genomics – Single-Cell RNA Sequencing Batch Integration
The Problem: Single-cell RNA sequencing (scRNA-seq) has revolutionized biology by allowing researchers to measure gene expression in individual cells. This enables the discovery of new cell types, the understanding of developmental trajectories, and much more. However, data from different labs or experiments contain technical variations known as "batch effects." These non-biological differences can obscure real biological signals. Removing them—a process called batch integration—is a monumental challenge. There are nearly 300 existing tools for this task, and public benchmarks like OpenProblems.bio constantly evaluate them.
The Experiment: The team used their AI system to tackle the OpenProblems v2.0.0 batch integration benchmark. To avoid overfitting, they trained the system on a separate dataset from CELLxGENE with similar characteristics and then evaluated the final, best solution on the official benchmark's holdout dataset containing over 1.7 million cells.
The Results: The outcomes were extraordinary:
- Unguided Discovery: Even without any guidance ("No advice"), the system's Tree Search discovered a method conceptually similar to an established algorithm called ComBat, yet it still improved on the existing benchmark leaderboard.
- Improving Existing Methods: The team then selected nine top-performing methods from the literature. For each, they obtained the original paper, used an LLM to summarize it, and fed that summary into their system as a starting point. The AI then proceeded to optimize that method. In pairwise comparisons, the AI's optimized versions outperformed the original human-written implementations in 8 out of 9 cases.
- A New State-of-the-Art: The top-performing method was the AI's implementation of BBKNN (TS) (Batch Balanced K-Nearest Neighbors). It achieved a 14% overall improvement over the best published method (ComBat) and matched or beat the original BBKNN in every dataset and across 11 out of 13 metrics. The key innovation? The AI had intelligently combined two ideas: it used ComBat first to remove global batch effects and then applied BBKNN on the corrected data, a recombination that proved significantly more powerful than either method alone.
- Mass-Scale Innovation via Recombination: The most striking demonstration of the system's power came next. The team programmed the AI to generate 55 "recombinations" of all pairs of 11 base methods. Of these 55 novel hybrid methods generated by the AI, 24 (44%) outperformed both of their parent methods. Furthermore, when combined with ideas from Gemini Deep Research and an AI co-scientist, the system produced a total of 40 distinct methods that outperformed all previously published methods on the OpenProblems leaderboard.
This case study shows the system is not just an optimizer but a true innovator, capable of understanding the strengths of existing approaches and synthesizing them into superior new ones.
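For readers who want a feel for the winning recombination (ComBat first for global correction, then BBKNN on the corrected data), here is a minimal sketch using standard scanpy APIs. It approximates the idea rather than reproducing the AI-generated code, and it assumes an AnnData file with a "batch" column.

```python
# Minimal sketch of the ComBat-then-BBKNN recombination, using standard scanpy
# APIs. An approximation of the idea, not the AI-generated code.
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")          # hypothetical input file

# Standard preprocessing.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Step 1: ComBat removes global, per-gene batch effects in expression space.
sc.pp.combat(adata, key="batch")

# Step 2: BBKNN builds a batch-balanced neighbour graph on the corrected data.
sc.tl.pca(adata, n_comps=50)
sc.external.pp.bbknn(adata, batch_key="batch")

# Downstream embedding for visual inspection of the integration.
sc.tl.umap(adata)
```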
Case Study 2: Public Health – Forecasting COVID-19 Hospitalizations
The Problem: During the COVID-19 pandemic, accurate forecasting of hospitalizations was critical for public health policy and resource allocation. The U.S. Centers for Disease Control and Prevention (CDC) coordinated a large forecasting hub ("CovidHub") where dozens of expert-led teams from top institutions submitted weekly forecasts. The gold standard was the "CovidHub-ensemble," an aggregate forecast that integrated all these submissions.
The Experiment: The team designed a rigorous retrospective study. For every forecasting period, they ran the AI system to optimize a model using data from the preceding six weeks, creating a rolling validation window throughout the 2024-2025 season. The resulting model, named 'Google Retrospective,' was compared against all teams on the CovidHub leaderboard.
The Results: The AI system demonstrated formidable performance:
- Beating the Gold Standard: The 'Google Retrospective' model achieved an average Weighted Interval Score (WIS) of 26, outperforming the official CovidHub-ensemble's average WIS of 29. Lower scores are better, indicating both greater accuracy and better-calibrated uncertainty.
- Winning in Most States: A direct jurisdiction-level comparison showed the AI model achieved a lower (better) WIS in a majority of U.S. states and territories.
- Replicating and Surpassing Experts: The system was tasked with replicating eight existing models from other teams using only their brief public descriptions. Its implementations not only adhered to the instructions but exceeded the performance of the original human submissions in six of the eight cases.
- Creating Superior Hybrids: Just as in the genomics task, the AI was asked to recombine pairs of existing models. Of the 28 hybrid models generated, 11 achieved a WIS score superior to both of their parent models. The most successful hybrids often combined a simple, climatology-based model with a complex statistical or machine learning model, providing a robust foundation that could be enhanced by more complex patterns.
- Novel Concept Generation: Using Gemini Deep Research, the system also invented entirely new strategies. For example, the DEEP-RESEARCH-CounterfactualSimulation model introduced unconditional uncertainty quantification by running thousands of Monte Carlo simulations over plausible future scenarios (e.g., new variant emergence). Another, CO-SCIENTIST-STGNN-AgACI, implemented a complex Spatio-Temporal Graph Neural Network to model inter-state dynamics explicitly.
In one of the most competitive forecasting environments in the world, the AI system developed 14 distinct strategies that beat the CDC's ensemble, demonstrating its ability to be both a meticulous optimizer and a bold innovator.
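Since the Weighted Interval Score drives every comparison in this case study, a rough sketch of how WIS is computed from quantile forecasts may help. It follows the standard definition used by forecasting hubs (absolute error of the median plus weighted scores of the central prediction intervals), not any code from the paper.

```python
# Sketch of the Weighted Interval Score (WIS); lower is better.
import numpy as np

def interval_score(lower, upper, y, alpha):
    # Interval width plus penalties when the observation falls outside the interval.
    return (upper - lower) \
        + (2 / alpha) * np.maximum(lower - y, 0) \
        + (2 / alpha) * np.maximum(y - upper, 0)

def weighted_interval_score(y, median, intervals):
    """intervals: dict mapping alpha -> (lower, upper) of the central (1 - alpha) interval."""
    K = len(intervals)
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2) * interval_score(lower, upper, y, alpha)
    return total / (K + 0.5)

# Example: 120 observed hospitalizations, median forecast 100,
# 50% interval [80, 125] and 90% interval [60, 160].
print(weighted_interval_score(120, 100, {0.5: (80, 125), 0.1: (60, 160)}))
```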
Case Study 3: Geospatial Analysis – Segmenting Satellite Imagery
The Problem: Semantic segmentation of high-resolution remote sensing images is a core computer vision task for monitoring land use, assessing environmental impact, and managing disasters. The difficulty lies in the extreme visual heterogeneity: images of the same location vary by time, season, and weather, and objects like "buildings" come in all shapes, sizes, and colors.
The Experiment: The team used the Dense Labeling Remote Sensing Dataset (DLRSD) and prompted their AI to train a model to classify every pixel into one of 17 land cover classes.
The Results: The top three solutions generated by the AI significantly outperformed recent academic papers on the DLRSD benchmark, achieving a mean Intersection over Union (mIoU) score greater than 0.80—a bar that had not been crossed before.
- The solutions leveraged state-of-the-art architectures like U-Net, UNet++, and SegFormer, but their true strength came from the AI's optimization of the surrounding pipeline.
- A key differentiator was the extensive use of Test-Time Augmentation (TTA), where the model predicts masks for multiple augmented versions of a test image (e.g., flipped, rotated) and then averages the results to produce a final, more robust prediction.
- The AI also expertly paired these architectures with powerful pre-trained encoders and tuned extensive data augmentation strategies from libraries like Albumentations.
This shows the system's strength in not just selecting models but also expertly engineering the entire training and inference process for maximum performance.
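As an illustration of the test-time augmentation trick mentioned above, the sketch below averages segmentation logits over a few flipped views of an image before taking the per-pixel argmax. The model interface and the choice of transforms are assumptions for illustration, not the paper's pipeline.

```python
# Test-time augmentation (TTA) for segmentation: predict on flipped views,
# undo each flip on the logits, and average before the argmax.
import torch

def predict_with_tta(model, image):
    """image: (C, H, W) tensor; returns (H, W) predicted class indices."""
    model.eval()
    batch = image.unsqueeze(0)  # (1, C, H, W)
    # Pairs of (forward transform, inverse transform for the predicted logits).
    views = [
        (lambda x: x, lambda y: y),
        (lambda x: torch.flip(x, [-1]), lambda y: torch.flip(y, [-1])),  # horizontal flip
        (lambda x: torch.flip(x, [-2]), lambda y: torch.flip(y, [-2])),  # vertical flip
    ]
    with torch.no_grad():
        logits = [inv(model(fwd(batch))) for fwd, inv in views]
    return torch.stack(logits).mean(dim=0).argmax(dim=1).squeeze(0)
```

Averaging in logit space keeps the final mask consistent across the augmented views and typically smooths out prediction noise at object boundaries.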
Case Study 4: Neuroscience – Whole-Brain Neural Activity Prediction
The Problem: Understanding how the brain works requires predicting the activity of vast networks of neurons. The Zebrafish Activity Prediction Benchmark (ZAPBench) is a monumental challenge that involves modeling and predicting the activity of over 70,000 neurons across an entire vertebrate brain (a larval zebrafish) over time. State-of-the-art approaches included both time-series forecasting on extracted neural signals and computationally expensive video-based models that process 3D brain volumes directly to exploit spatial information.
The Experiment: The AI was prompted to solve the multivariate time-series forecasting problem: predict the future activity of all neurons for up to 32 time steps ahead, given only their past 4 time steps of activity.
The Results: The AI system's solution was remarkably effective and efficient:
- Outperforming Video Models: The best model produced by Tree Search used a rich feature set combining temporal convolutions, a learned "global brain state" vector, and neuron-specific embeddings. Remarkably, this model outperformed all other baselines, including the best-performing video-based model, on all prediction horizons except 1-step-ahead.
- Specialization and Speed: The team then ran a separate Tree Search specifically tuned for 1-step-ahead predictions, which also achieved leading performance. Both solutions were orders of magnitude faster to train than the video model—less than two hours on a single T4 GPU compared to 36 hours on 16 A100 GPUs.
- Incorporating Biophysics: In an exploratory step, the system was prompted to use a JAX-based differentiable biophysical neuron simulator. The resulting solution simulated each neuron with a Hodgkin-Huxley model and modulated its parameters based on recent activity, then processed the outputs through a latent autoencoder to model a "functional connectome." While it didn't outperform the top video model, it was competitive with the time-series baselines, showing the system's ability to integrate domain-specific simulators into a coherent solution.
This demonstrates the system's prowess in creating highly efficient, specialized solutions that can leverage cross-neuron information—a major challenge in neuroscience.
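The description above (temporal convolutions, a global brain-state vector, and neuron-specific embeddings) can be pictured with a toy PyTorch model like the one below. The shapes, layer sizes, and overall wiring are guesses for illustration, not the architecture discovered by the system.

```python
# Speculative toy forecaster: per-neuron temporal convolution, a shared global
# brain-state vector, and per-neuron embeddings. Illustration only.
import torch
import torch.nn as nn

class NeuronForecaster(nn.Module):
    def __init__(self, n_neurons, context=4, horizon=32, emb_dim=16, hidden=64):
        super().__init__()
        self.neuron_emb = nn.Embedding(n_neurons, emb_dim)          # neuron-specific features
        self.temporal = nn.Conv1d(1, hidden, kernel_size=context)   # per-neuron temporal conv
        self.global_state = nn.Linear(n_neurons * context, hidden)  # shared brain-state vector
        self.head = nn.Linear(2 * hidden + emb_dim, horizon)

    def forward(self, past):  # past: (batch, n_neurons, context)
        b, n, t = past.shape
        local = self.temporal(past.reshape(b * n, 1, t)).squeeze(-1).reshape(b, n, -1)
        g = self.global_state(past.reshape(b, -1)).unsqueeze(1).expand(-1, n, -1)
        emb = self.neuron_emb.weight.unsqueeze(0).expand(b, -1, -1)
        return self.head(torch.cat([local, g, emb], dim=-1))        # (batch, n_neurons, horizon)

# Tiny smoke test (ZAPBench itself has over 70,000 neurons).
model = NeuronForecaster(n_neurons=1000)
print(model(torch.randn(2, 1000, 4)).shape)  # torch.Size([2, 1000, 32])
```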
Case Study 5: Time Series Forecasting – The GIFT-Eval Benchmark
The Problem: General time series forecasting is notoriously difficult due to the diverse semantics of data (from ECG readings to stock prices) and vastly different time scales (seconds to years). The GIFT-Eval benchmark comprises 28 datasets from seven domains and is a fiercely competitive arena with new submissions from foundation models and deep learning approaches every month.
The Experiment: The team applied their method in two distinct ways:
- Per-Dataset Solution: The AI was allowed to create an independent, highly specialized solution for each of the benchmark's 92 dataset configurations, with access to a full suite of machine learning libraries.
- Unified Solution: A more audacious challenge: could the AI create a single, general-purpose forecasting library from scratch that could perform well across all datasets, using only basic libraries like NumPy and Pandas?
The Results: The AI excelled at both tasks:
- The Specialist (Per-Dataset): The discovered solutions converged towards powerful gradient boosting and ensemble models. The aggregate performance of these 92 individual solutions topped the official GIFT-Eval leaderboard, outperforming specialized foundation models like Chronos and TimesFM.
- The Generalist (Unified): The unified solution was a masterpiece of algorithmic engineering. The AI developed a library that could adapt its strategy by selecting from 8 preset configurations. The core algorithm was an Iterative Decomposition Model that sequentially forecasts and subtracts individual components of a time series: a base level, a trend, seasonality, datetime-based features (including holidays for specific countries), and a final residual correction. This unified solution was highly competitive on the leaderboard, demonstrating that the AI could indeed reason from first principles to create a general, robust, and configurable forecasting tool.
This case highlights the system's flexibility, capable of producing both finely-tuned specialists and elegant, powerful generalists.
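The iterative-decomposition idea behind the unified library can be illustrated with a toy NumPy-only forecaster that fits and subtracts one component at a time and then sums each component's extrapolation. This is a simplified sketch of the concept, not the AI's library.

```python
# Toy iterative-decomposition forecaster: level -> trend -> seasonality -> residual.
import numpy as np

def decompose_forecast(y, horizon, season=7):
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    future_t = np.arange(len(y), len(y) + horizon)
    residual = y.copy()
    forecast = np.zeros(horizon)

    # 1. Base level.
    level = np.median(residual)
    residual -= level
    forecast += level

    # 2. Linear trend on the de-leveled series.
    slope, intercept = np.polyfit(t, residual, 1)
    residual -= slope * t + intercept
    forecast += slope * future_t + intercept

    # 3. Seasonality: mean residual at each position of the seasonal cycle.
    seasonal = np.array([residual[i::season].mean() for i in range(season)])
    residual -= seasonal[t % season]
    forecast += seasonal[future_t % season]

    # 4. Residual correction: carry forward the mean of the most recent residuals.
    forecast += residual[-season:].mean()
    return forecast

# Example: weekly-seasonal series with a mild upward trend.
series = np.sin(np.arange(60) * 2 * np.pi / 7) + 0.1 * np.arange(60)
print(decompose_forecast(series, horizon=14))
```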
Case Study 6: Numerical Analysis – Solving "Unsolvable" Integrals
The Problem: Even well-established numerical algorithms can fail. The scipy.integrate.quad() function is a gold-standard routine for numerical integration in Python, but it can fail to converge, miss important features, or lose precision in problems with delicate cancellations, particularly with oscillatory integrals on infinite domains.
The Experiment: The team built a benchmark of 38 particularly pathological integrals from a standard reference text (Gradshteyn and Ryzhik) where quad() returned an incorrect answer. They split these into training and hold-out sets and tasked the AI with improving upon a simple quad() invocation.
The Results: The evolved code was a clever and robust enhancement of the standard algorithm:
- Novel Algorithm: The best solution partitioned the infinite domain into a sequence of finite subintervals whose lengths increased geometrically to handle the "tail" efficiently. It calculated the integral over each segment using quad(), transforming the problem into an infinite series.
- Intelligent Acceleration: For slowly converging series, the algorithm applied Euler's transformation, a powerful series acceleration technique, to extrapolate an accurate estimate of the integral's true value from a finite number of terms.
- Performance: The resulting code correctly evaluated 17 out of 19 held-out integrals to within a fractional error of less than 3%. Meanwhile, scipy.integrate.quad() failed on every single one.
- Practical Design: Crucially, the evolved code was designed as a drop-in replacement: it always tries quad() first and only falls back to its more specialized method if quad() fails or returns a large error estimate. This means it is as accurate as quad() on easy problems and superior on hard ones.
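A simplified sketch of the general strategy is shown below: split the tail of an oscillatory integral into half-period segments, integrate each with quad(), and accelerate the resulting alternating series by repeatedly averaging its partial sums (a simple stand-in for Euler's transformation). The evolved solution used geometrically growing segments and a quad()-first fallback; this toy version only illustrates the core trick.

```python
# Toy version of the segment-and-accelerate strategy for oscillatory integrals on [a, inf).
import numpy as np
from scipy.integrate import quad

def oscillatory_integral(f, a, period=np.pi, n_terms=30):
    # Head: integrate up to the first half-period boundary past `a`.
    first_node = period * np.ceil(a / period)
    head, _ = quad(f, a, first_node)

    # Tail: one half-period per segment, so successive terms alternate in sign.
    nodes = first_node + period * np.arange(n_terms + 1)
    terms = [quad(f, lo, hi)[0] for lo, hi in zip(nodes[:-1], nodes[1:])]

    # Repeated averaging of partial sums extrapolates the alternating series' limit.
    partials = np.cumsum(terms)
    while len(partials) > 1:
        partials = 0.5 * (partials[:-1] + partials[1:])
    return head + partials[0]

# Integral of sin(x)/x over [1, inf): exact value is pi/2 - Si(1), about 0.6247.
print(oscillatory_integral(lambda x: np.sin(x) / x, 1.0))
```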
This application shows the system's ability to innovate even in mature, mathematically rigorous fields, improving upon decades-old algorithmic standards.
How It Fits Into the Larger AI Landscape
The authors thoughtfully contextualize their work within several related fields:
- Genetic Programming (GP): Their system is a modern evolution of GP. The key difference is that instead of using random mutations or swapping syntax tree sub-parts, it uses an LLM to perform intelligent, semantic-aware "mutations" by rewriting code, leading to more complex and meaningful variations.
- Generative Programming: This can be viewed as an AI-driven realization of this concept, using an LLM instead of templates or domain-specific languages, offering far greater flexibility.
- LLMs for Code (e.g., Codex, AlphaCode): Unlike these "one-shot" generators, this system uses the LLM in an iterative refinement loop, guided by a search algorithm using a quality score as feedback.
- Automated Machine Learning (AutoML): While AutoML focuses on optimizing within fixed ML frameworks, this system is far more general. It can rewrite any software, including pre-processing steps, complex simulations, or mathematical heuristics.
- AI Agents for Science: Unlike agents that automate workflows in a single domain, this system demonstrates a general problem-solving capability, achieving expert-level performance across multiple, unrelated fields.
The most similar work is Google DeepMind's FunSearch, which also pairs an LLM with an evaluator for mathematical discovery. This system generalizes that concept using a more robust Tree Search algorithm and explicitly incorporates knowledge from the scientific literature.
Limitations and Ethical Considerations
The system is powerful, but not without limits:
- The Need for a Score: It requires a well-defined, automatable quality metric. Not all scientific problems can be easily scored by a machine (e.g., some theoretical or interpretative tasks).
- Computational Cost: Running thousands of iterations of Tree Search is computationally expensive, though likely still cheaper than months of a scientist's salary.
- Sandbox Constraints: The code must be executable in a sandbox, limiting problems that require specialized hardware, extremely long runtimes, or human-in-the-loop evaluation.
- Interpretability: The code it produces can be complex and non-intuitive. While it works, understanding why it works so well may require significant human effort and analysis.
The Dawn of a New Scientific Paradigm
This research represents more than an incremental advance; it points toward a paradigm shift in how scientific inquiry is conducted. The system is a prototype for a new kind of scientific instrument: an automated hypothesis tester and algorithmic inventor.
Its ability to tirelessly explore vast solution spaces, integrate knowledge from across the literature, and recombine ideas in novel ways fundamentally accelerates the cycle of trial and error that is central to scientific progress. It can reduce the exploration of a set of ideas from "weeks or months, to hours or days."
This is not about replacing scientists. It is about augmenting and empowering them. By offloading the arduous task of coding and initial optimization, it frees researchers to focus on what they do best: asking profound questions, designing elegant experiments, interpreting complex results, and providing the creative direction that guides tools like this one.
The authors conclude with a powerful vision: "Based on this work, we believe that progress in scientific fields where solutions can be scored by machines is on the precipice of a revolutionary acceleration." We are entering an era of AI-augmented science, where the synergy between human intuition and machine-scale exploration will unlock discoveries at a pace and scale we are only beginning to imagine.

