Auto Research: Multi-Agent Autonomous Scientific Discovery

Jingxuan Kang · March 2026

The Starting Point

Two heavyweight voices form the premise:

"...there is a very wide spread in capability (several orders of magnitude) depending on what resources and assistance one gives the tool." — Terence Tao, Mathstodon (July 2025)

"[LLMs] are extremely useful and I don't think the industry has realized anywhere near 10% of their potential even at present capability." — Andrej Karpathy, 2025 LLM Year in Review

The models are capable enough. The bottleneck is how we use them.

Five Core Problems in Auto Research

💡

1. Where Do Ideas Come From?

LLMs generate plausible ideas, not truly novel ones. They recombine training data, so an idea can sound new without being new. Literature coverage has blind spots, and there is no reliable way to verify novelty.

🧪

2. How to Discover & Validate Ideas?

Pilot experiments are expensive. Ablation design is often skipped. No ground truth for "novelty" — unlike code bugs, you can't automatically test whether an idea is truly new.

⚙️

3. Process Stability

CUDA version conflicts, dependency hell, SSH tunnel failures. Dataset downloads fail, corrupt, or get blocked. OOM crashes, silent NaN losses, checkpoint corruption. These are the most mundane failures, and also the most frequent.
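Two of these failures, flaky downloads and silent NaN losses, can be guarded against with very little code. The following is a minimal sketch, not part of the talk: the helper names `check_loss` and `retry` are hypothetical, and the retry policy (exponential backoff, three attempts) is an assumption chosen for illustration.

```python
import math
import time


def check_loss(step, loss):
    """Raise immediately if the loss is NaN or infinite, rather than training on silently."""
    if not math.isfinite(loss):
        raise FloatingPointError(f"non-finite loss {loss!r} at step {step}")
    return loss


def retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**i seconds, then try again."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the real error
            sleep(base_delay * 2 ** i)
```

Wrapping a download in `retry(lambda: fetch(url))` (with `fetch` standing in for whatever downloader is used) and calling `check_loss` once per training step turns two silent failures into loud, recoverable ones.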

📊

4. How to Evaluate Results?

Models can't judge their own output — self-evaluation converges to "looks good" after 3–4 rounds. Lower loss ≠ better paper. Baselines often have subtle bugs. No clear stopping criteria.

🔄

5. How to Keep Running?

Session lifecycle management, state persistence across failures, resource scheduling (GPU allocation, API rate limits, cost budgets), and decision loops without human involvement.
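State persistence and a cost budget can be combined in a small resumable loop. This is a sketch under stated assumptions, not the system described in the talk: the state file layout (`next_step`, `spent_usd`), the flat per-step cost, and the function names are all hypothetical.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write state atomically so a crash mid-write never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_state(path):
    """Resume from the last saved state, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "spent_usd": 0.0}


def run(path, steps, cost_per_step, budget_usd):
    """Run steps in order, persisting after each; stop before exceeding the budget."""
    state = load_state(path)
    while state["next_step"] < len(steps):
        if state["spent_usd"] + cost_per_step > budget_usd:
            break  # budget exhausted: leave the remaining steps for a later run
        steps[state["next_step"]]()
        state["next_step"] += 1
        state["spent_usd"] += cost_per_step
        save_state(path, state)
    return state
```

Because the state is saved after every step, a killed session restarts at the first unfinished step instead of from scratch. A step that crashes partway through will be re-run, so steps should be safe to repeat.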

Human in the Loop

Humans aren't an optional final check — they're a core component of the system.

🧭

Direction

Which problems matter?

🎯

Judgment

Surprising or mundane?

⚖️

Kill / Go

When to cut losses

✅

Verification

Every claim, every number

reproduce.md: The Future of Reproducibility

📄

Open-Source the Prompt, Not Just the Code

If AI wrote the code, AI should verify the results. Every paper ships a reproduce.md — a prompt containing everything an AI needs to reproduce the experiment: environment, data, training, expected metrics, verification criteria. We don't need open-source code anymore — we need open-source prompts.
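The "expected metrics" and "verification criteria" parts of a reproduce.md lend themselves to a mechanical check. The sketch below is illustrative only: the talk does not specify a format, so the schema (metric name mapped to a target value and an absolute tolerance) and the function name `verify` are assumptions.

```python
def verify(expected, observed):
    """Compare observed metrics with expected ones and return a list of failure messages.

    expected maps metric name -> (target value, absolute tolerance).
    An empty list means the reproduction passed.
    """
    failures = []
    for name, (target, tol) in expected.items():
        if name not in observed:
            failures.append(f"{name}: missing")
        elif abs(observed[name] - target) > tol:
            failures.append(f"{name}: got {observed[name]}, expected {target} +/- {tol}")
    return failures
```

For example, with `expected = {"accuracy": (0.9, 0.01)}`, a reproduced run reporting `{"accuracy": 0.905}` passes, while one reporting `{"accuracy": 0.8}` produces a failure message that names the metric and the gap.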

The Ultimate Vision

Every GPU contributes. Every cycle counts. Science never sleeps.

The Auto Research wave is unstoppable. The final state: some GPUs drive LLM inference (agents that think, plan, write, review), while others explore every sub-direction (continuous training, testing, improving across every domain).

🎓

Talk Complete

This talk was presented in March 2026. The slides above contain the full visual presentation with architecture diagrams, model comparisons, and the complete ARIS framework.
