On March 7, Andrej Karpathy went to sleep. When he woke up, an AI agent had made 700 changes to his training code, found 20 improvements he'd missed after weeks of manual tuning, and committed each one to a git branch. He open-sourced the setup — a 630-line script, a prompt file, a five-minute clock — and within ten days, independent builders had pointed the same loop at financial markets, chess engines, rendering pipelines, and argumentative reasoning. Nobody coordinated this. The structural conditions just happened to be right.
The Loop
Karpathy's setup is deliberately minimal. A single GPU. A 630-line training script. A prompt file that describes the research direction. The agent modifies the code, trains a small language model for exactly five minutes, checks the validation loss, commits the result to a git branch, and loops. Every dot in his visualization is a complete training run. The constraint — five minutes, no exceptions — is what makes it work. Without a fixed clock, the agent would spend hours on dead ends. With it, each experiment is cheap enough to be disposable.
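The mechanics of that loop can be sketched in a few lines. This is a hedged illustration, not Karpathy's actual script: `propose_change` stands in for the agent editing the training code, and `train_for_budget` stands in for a fixed five-minute training run that returns a validation loss. The function and parameter names are invented for the sketch.

```python
import random

def propose_change(params):
    # Stand-in for the agent editing the training script;
    # here it just perturbs one hyperparameter at random.
    new = dict(params)
    key = random.choice(sorted(new))
    new[key] *= random.uniform(0.8, 1.2)
    return new

def train_for_budget(params):
    # Stand-in for a fixed-budget (five-minute) training run.
    # Returns a validation loss; lower is better.
    return (params["lr"] - 0.01) ** 2 + (params["wd"] - 0.1) ** 2

def autoresearch(steps=100, seed=0):
    random.seed(seed)
    params = {"lr": 0.05, "wd": 0.5}
    best_loss = train_for_budget(params)
    history = [best_loss]                    # the git-branch ledger, in miniature
    for _ in range(steps):
        candidate = propose_change(params)
        loss = train_for_budget(candidate)   # every experiment costs the same
        if loss < best_loss:                 # keep
            params, best_loss = candidate, loss
        history.append(best_loss)            # commit the result either way
    return params, best_loss, history
```

The fixed budget is doing the real work: because every run costs the same, a failed candidate is simply discarded and the loop moves on, which is what makes hundreds of experiments per night disposable rather than precious.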
Karpathy left the loop running for two days on a depth-12 model. Seven hundred autonomous changes. The agent didn't just try random mutations — it examined the sequence of prior results and planned subsequent experiments around what had worked. It found that his attention was too diffuse (missing a scalar multiplier), his regularization was incomplete, his banded attention was too conservative, and his optimizer betas were misconfigured. Twenty of those changes transferred to larger models and improved the leaderboard by 11%. All on top of tuning he'd already done manually over weeks.
The key line is the one Karpathy almost tosses off: "any metric you care about that is reasonably efficient to evaluate can be autoresearched by an agent swarm." If you have a number and a fast evaluation function, you have a loop. If you don't, you don't. Everything that followed was an exploration of that boundary.
The Instantiation Cascade
What Lior Alexander called "the quiet genius" of autoresearch is the fixed five-minute clock. It's the constraint that makes the loop legible — every experiment costs the same, results are directly comparable, and the agent can never disappear into a multi-hour rabbit hole. This matters because it makes the pattern portable. You don't need Karpathy's specific setup. You need a metric, a clock, and a loop.
Chris Worsey was the first to demonstrate portability at scale. He took the autoresearch loop and pointed it at financial markets. Twenty-five agents debating macro, rates, commodities, sectors, and single stocks. Every recommendation scored against real outcomes. The worst agent by rolling Sharpe ratio gets its prompt rewritten by the system. Keep or revert. Same loop, same logic — prompts are the weights, Sharpe is the loss function.
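The "prompts are the weights" substitution can be made concrete with a small sketch. Everything here is hypothetical: `rolling_sharpe` stands in for scoring an agent's recommendations against real market outcomes (replaced by a deterministic pseudo-score so the sketch runs), and `rewrite_prompt` stands in for the system rewriting the worst performer.

```python
import random
import zlib

def rolling_sharpe(prompt, period):
    # Stand-in for evaluating a prompt's recommendations against
    # real outcomes over a period; deterministic pseudo-score here.
    rng = random.Random(zlib.crc32(f"{prompt}|{period}".encode()))
    return rng.gauss(0.0, 1.0)

def rewrite_prompt(prompt, rng):
    # Stand-in for the system rewriting the worst agent's prompt.
    return f"{prompt} (revised v{rng.randint(0, 999)})"

def evolve(prompts, rounds=20, seed=0):
    rng = random.Random(seed)
    for period in range(rounds):
        scores = {p: rolling_sharpe(p, period) for p in prompts}
        worst = min(prompts, key=scores.get)     # lowest rolling Sharpe
        candidate = rewrite_prompt(worst, rng)
        # keep the rewrite only if it outscores the incumbent; else revert
        if rolling_sharpe(candidate, period) > scores[worst]:
            prompts[prompts.index(worst)] = candidate
    return prompts
```

Structurally this is the same keep-or-revert loop as the training version: only the mutation operator (edit English instead of Python) and the loss function (Sharpe instead of validation loss) have been swapped out.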
But the structural insight isn't the return. It's the substitution. In Karpathy's version, the agent edits Python code and the loss function evaluates the result. In Worsey's version, the agent edits English prompts and the market evaluates the result. The loop is identical. Only the substrate changed.
The domains kept multiplying. Tobi Lütke, Shopify's CEO, ran autoresearch on his company's Liquid rendering engine and got 53% faster parse+render time and 61% fewer object allocations — then open-sourced the plugin. Deedy Das pointed it at a vibe-coded Rust chess engine and pushed it from "expert" level to a top-50 grandmaster — engine #311 on the leaderboard — through 70 autonomous experiments. Kaspars Dancis used it to optimize a canvas rendering engine and saw a 10x improvement on the slowest test in hours.
Varun Mathur's team made the pattern fully generic. Their system lets anyone propose an optimization problem in plain English, and a distributed swarm spins up to solve it. Two hundred thirty-seven agents, 14,832 experiments across five domains, zero human intervention. The abstraction had fully separated from Karpathy's specific implementation. The pattern was replicating itself — not through coordination but through structural inevitability: if you have the pieces (cheap agents, fast evaluation, a score), the loop assembles itself.
The Structural Force
The cost drop that enables this is not one thing but three things converging. First, inference costs have fallen far enough that running hundreds of agent calls per hour is economically viable for individuals, not just companies. Second, evaluation infrastructure — loss functions, benchmarks, scoring APIs — has matured to the point where "check if this worked" can be automated. Third, git provides a free, universal ledger for experiment tracking. The agent commits every attempt. The human reviews a log, not a process.
Together, these three drops cross a threshold. Below that threshold, running experiments overnight requires a team: someone to design them, someone to monitor them, someone to interpret the results. Above the threshold, a single person writes a prompt and goes to sleep. The organizational unit for research shrinks from a lab to a laptop.
The overnight shift isn't about speed. It's about what compounds while the org chart sleeps.
Heinrich (@arscontexta) articulated this most vividly. He proposed that agents should "dream" — literally process their sessions during idle time, maintain and evolve their notes, synthesize and explore, even hallucinate. The biological metaphor is imprecise but the structural intuition is correct: downtime is wasted capacity, and wasted capacity is a cost that compounds against you when your competitor's agents don't sleep.
Where the Loop Breaks
Everything described above works because there is a number. Validation loss. Sharpe ratio. Render quality score. Frame rate. The loop compounds because the fitness function is unambiguous — lower is better, higher is better, and the agent doesn't need to understand why.
Most real work doesn't have a score.
Writing doesn't have a loss function. Strategy doesn't have a Sharpe ratio. Design doesn't have a validation metric an agent could optimize without gaming it. The overnight shift is powerful precisely where work is reducible to a number, and silent where it isn't. Karpathy named this boundary himself — "reasonably efficient to evaluate" — and it's a harder constraint than the cascade of instantiations suggests.
Alibaba tested this boundary empirically. They ran 18 AI coding agents on 100 real codebases spanning 233 days each. The agents could pass tests — the equivalent of a validation score — on first attempt. But maintaining code over eight months, where the metric is "does everything still work after this change and the next fifty changes," proved catastrophic. Seventy-five percent of models broke previously working code during maintenance. The loss function for a single commit is tractable. The loss function for a codebase over time is not.
This is not a temporary limitation waiting for better models. It's a structural property of the domains. In ML training, the validation loss is a sufficient statistic — it captures everything you need to know about whether the change was good. In code maintenance, no single metric captures "this change is good for the system over the next six months." In strategy, no metric captures "this is the right direction." The loop needs a score. Some work resists scoring.
The First Attempt at Subjective Compounding
Which is why @SHL0MS's AutoReason project is the most interesting development in this cascade — not because it works, but because it identifies the right problem. AutoReason extends the autoresearch loop to subjective domains by constructing a synthetic fitness function through adversarial debate.
The mechanism: generate version A. A fresh agent attacks it as a strawman. A separate author produces version B incorporating the critique. A third agent synthesizes A and B. A blind judge panel picks the strongest. The winner becomes the new A. The loop repeats until judges consistently pick the incumbent — convergence through argumentation rather than optimization.
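The debate loop above can be sketched as code. This is a toy rendering of the described mechanism, not @SHL0MS's implementation: `critique`, `revise`, `synthesize`, and `quality` are stubs standing in for LLM calls, and the panel's noisy votes stand in for blind judging.

```python
import random

def critique(draft, rng):
    # Stand-in for a fresh agent attacking the draft.
    return f"weakness-{rng.randint(0, 9)}"

def revise(draft, crit):
    # Stand-in for a separate author incorporating the critique.
    return f"{draft}+fix({crit})"

def synthesize(a, b):
    # Stand-in for a third agent merging both versions.
    return f"synth({a},{b})"

def quality(draft):
    # Hidden "true" quality for the toy: revisions accumulate fixes.
    return draft.count("fix")

def judges_prefer(challenger, incumbent, rng, panel=5):
    # Each blind judge votes for the draft it scores higher, with noise.
    votes = sum(
        quality(challenger) + rng.gauss(0, 0.5)
        > quality(incumbent) + rng.gauss(0, 0.5)
        for _ in range(panel)
    )
    return votes > panel // 2

def autoreason(draft, rounds=6, patience=3, seed=0):
    rng = random.Random(seed)
    stale = 0
    for _ in range(rounds):
        b = revise(draft, critique(draft, rng))      # A attacked, B produced
        candidate = synthesize(draft, b)             # A and B merged
        if judges_prefer(candidate, draft, rng):     # blind panel picks winner
            draft, stale = candidate, 0              # winner becomes the new A
        else:
            stale += 1
            if stale >= patience:                    # judges keep the incumbent:
                break                                # treat as convergence
    return draft
```

Note where the Goodhart risk lives in this sketch: the loop optimizes `judges_prefer`, not `quality`. The two coincide only as long as the judges' votes track the real thing, which is exactly the gap the section below examines.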
Instead of "lower loss is better," the fitness function becomes "survives adversarial scrutiny from independent evaluators." The number doesn't exist in the domain, so AutoReason manufactures a proxy by simulating the process humans use to evaluate subjective work: argument, counter-argument, synthesis, judgment.
But the structural question remains: does the proxy capture what matters? In autoresearch, validation loss correlates with model quality because that's what validation loss measures. In AutoReason, "survives blind judging" correlates with... what? Persuasiveness? Logical consistency? Rhetorical polish? The gap between "the judges picked it" and "it's actually good" is the gap between a real fitness function and a synthetic one. Every proxy metric in history has eventually been Goodharted. The question is whether adversarial debate is robust enough to resist it, or whether AutoReason loops will converge on outputs that are maximally judge-friendly rather than maximally good.
The Therefore
Seven hundred experiments while you sleep isn't an incremental improvement. It's a different production function. The overnight shift restructures every domain where the score is honest — where validation loss, Sharpe ratio, or render quality can stand in for "better" without being gamed. That covers more territory than most people realize.
The boundary is everything else. Subjective work doesn't lack evaluation — it lacks evaluation that stays honest under optimization pressure. AutoReason is the first serious attempt to manufacture that honesty through adversarial debate. Whether it works depends on whether "survives blind judging" can resist the same Goodhart dynamics that corrupt every other proxy metric. History suggests it can't. The structure of the attempt suggests it might.
The org chart hasn't priced any of this in yet. The companies running overnight loops aren't announcing it — they're shipping the results at 9am. The gap will widen quietly, the way structural advantages always do: invisible until the delta is too large to close.