Self-Improving AI Agents: From Karpathy's Lab to a Live Hedge Fund
TeamDay · 12 min read · 2026-03-12
agents research business

In early March 2026, Andrej Karpathy released autoresearch — a framework where AI agents autonomously run ML experiments overnight, keep what works, revert what doesn’t. Within days, the repo had 28,000 stars. Shopify’s CEO ran it and got a 0.8B parameter model that outperformed his previous 1.6B model. Karpathy called the experience of watching agents catch optimization oversights he’d missed over two decades “wild.”

Then Chris Worsey, founder of General Intelligence Capital, did something nobody expected. He took the same loop and pointed it at financial markets.

25 AI agents. Darwinian selection. Real money.

+22% returns in 173 days. Best pick: AVGO at $152, held for +128%.

This isn’t a research paper. It’s running live with his own capital.

The Autoresearch Loop

Karpathy’s insight is deceptively simple. The core loop:

  1. Agent reads the code (train.py) and forms a hypothesis
  2. Agent modifies the code
  3. System runs a 5-minute training experiment
  4. System evaluates the result (validation bits-per-byte)
  5. If improved: git commit. If worse: git reset
  6. Repeat

That’s it. No human in the loop. No review process. Just: try something, measure the outcome, keep or revert.
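The loop is small enough to fit in a screenful of code. Below is a minimal sketch, not Karpathy's implementation: the function names are invented, and the `commit`/`revert` callbacks stand in for the actual `git commit` / `git reset` calls.

```python
def keep_or_revert(best, candidate, commit, revert, lower_is_better=True):
    """One autoresearch decision: keep the change if the metric improved."""
    improved = candidate < best if lower_is_better else candidate > best
    if improved:
        commit()                        # stands in for `git commit -am "keep"`
        return candidate, True
    revert()                            # stands in for `git reset --hard`
    return best, False

def overnight_session(n_experiments, run_experiment, best):
    """Run a batch of experiments, keeping only the ones that improve the metric."""
    kept = 0
    for _ in range(n_experiments):
        candidate = run_experiment()    # agent edits train.py, trains ~5 minutes
        best, improved = keep_or_revert(best, candidate,
                                        commit=lambda: None,  # no-op stubs here
                                        revert=lambda: None)
        kept += improved
    return best, kept
```

Feed it a sequence of validation scores and only the improvements survive; everything else is silently discarded, which is where keep rates like 18% come from.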

At ~12 experiments per hour, the system runs 100+ experiments overnight. In one documented session, 126 experiments produced 23 improvements — an 18% keep rate. The validation metric dropped steadily from 0.9979 to 0.9697 while the human slept.

The human’s job? Writing program.md — a markdown file that tells the agent what to optimize and what constraints to respect. Karpathy describes it as “essentially a super lightweight skill.” You’re not running experiments anymore. You’re programming the research strategy.

“One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of ‘group meeting’.”

— Karpathy, autoresearch README

From Code to Capital

Chris Worsey saw the pattern and asked: what if the thing being optimized isn’t code — but prompts? And what if the loss function isn’t validation loss — but Sharpe ratio?

Prompts are the weights. Real-world outcomes are the loss function.

He built ATLAS — 25 AI agents organized into four layers, each with a specific role in the investment process.

Layer 1: Macro (10 agents)

The foundation. These agents set the regime — risk-on or risk-off?

Central Bank policy. Geopolitical risk. China dynamics. Dollar strength. Yield curve. Commodities. Volatility. Emerging markets. News sentiment. Institutional flow.

Each agent analyzes its domain and signals to the layers above. Not one agent trying to know everything — ten specialists with narrow focus.

Layer 2: Sector Desks (7 agents)

Given the macro regime, which sectors and which names?

Semiconductor. Energy. Biotech. Consumer. Industrials. Financials. Plus a relationship mapper that tracks supply chains and analyst coverage — the Bloomberg terminal as an agent.

Layer 3: Superinvestors (4 agents)

This is where it gets interesting. Four agents with distinct investment philosophies, named after real investors:

  • Druckenmiller — macro/momentum, asymmetric trades
  • Aschenbrenner — AI and compute capex cycles
  • Baker — deep tech and biotech IP moats
  • Ackman — quality compounders: pricing power, free cash flow, catalyst-driven

Each filters the same picks through a fundamentally different lens. The Druckenmiller agent and the Ackman agent will look at the same stock and reach different conclusions for different reasons.

Layer 4: Decision (4 agents)

Chief Risk Officer (adversarial — finds correlated risks). Alpha Discovery (surfaces overlooked names). Execution (converts signals to trades). CIO (synthesizes everything, weighted by agent performance).

The Darwinian Loop

Here’s where autoresearch meets natural selection:

  1. Score every agent by rolling Sharpe ratio
  2. Identify the worst performer
  3. Rewrite its prompt — one targeted modification
  4. Test for 5 trading days
  5. Evaluate: did the Sharpe improve?
  6. Keep (git commit) or revert (git reset)
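In code, one evolutionary step might look like the sketch below. This is a toy reconstruction, not ATLAS source: `rolling_sharpe` is a naive daily-returns Sharpe with the risk-free rate at zero, and `mutate_prompt` and `backtest_5_days` are hypothetical callbacks supplied by the caller.

```python
def rolling_sharpe(returns):
    """Naive Sharpe over a window of daily returns (risk-free rate ~ 0)."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    return 0.0 if var == 0 else mean / var ** 0.5

def evolve_worst_agent(agents, mutate_prompt, backtest_5_days):
    """One Darwinian step: rewrite the worst agent's prompt, keep it only if
    its Sharpe improves over the 5-trading-day test window."""
    worst = min(agents, key=lambda a: rolling_sharpe(a["returns"]))
    old_prompt = worst["prompt"]
    old_sharpe = rolling_sharpe(worst["returns"])
    worst["prompt"] = mutate_prompt(old_prompt)   # one targeted modification
    new_returns = backtest_5_days(worst)          # paper-trade the new prompt
    if rolling_sharpe(new_returns) > old_sharpe:
        worst["returns"] = new_returns            # the "git commit" branch
        return True
    worst["prompt"] = old_prompt                  # the "git reset" branch
    return False
```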

Daily, the system adjusts weights. Top quartile agents get a 1.05x multiplier. Bottom quartile gets 0.95x. Weights range from 0.3 (nearly silenced) to 2.5 (highly trusted). The CIO layer uses these weights to determine how much each agent’s opinion matters.
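The reweighting rule is mechanical enough to sketch directly. The multipliers (1.05x/0.95x), clamp range (0.3 to 2.5), and quartile logic come from the article; the function and field names are illustrative, not ATLAS code.

```python
def update_weights(agents, sharpe_of, lo=0.3, hi=2.5):
    """Daily reweighting: boost the top quartile, fade the bottom, clamp to [lo, hi]."""
    ranked = sorted(agents, key=sharpe_of, reverse=True)
    q = max(1, len(ranked) // 4)                   # quartile size
    for a in ranked[:q]:
        a["weight"] = min(hi, a["weight"] * 1.05)  # trusted a little more each day
    for a in ranked[-q:]:
        a["weight"] = max(lo, a["weight"] * 0.95)  # compounds toward the 0.3 floor
    return {a["name"]: a["weight"] for a in agents}

def cio_view(signals, weights):
    """CIO synthesis: performance-weighted average of agent signals (-1 short ... +1 long)."""
    total = sum(weights[name] for name in signals)
    return sum(signals[name] * weights[name] for name in signals) / total
```

Run daily, the multipliers compound: an agent stuck in the bottom quartile for roughly 24 straight days drifts from 1.0 down to the 0.3 floor.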

Over 378 trading days: 54 prompt modifications attempted. 16 survived. 38 reverted. A keep rate of roughly 30%, in the same ballpark as Karpathy’s 18% on code modifications.

The agents that rose to the top: Geopolitical, Commodities, and the Ackman quality compounder. The system learned which investment philosophies to trust — not from human conviction, but from market feedback.

The Orchestration Bottleneck

The most surprising discovery: the system downweighted its own CIO — the chief decision-maker — to minimum weight (0.3).

Worsey calls this “the orchestration bottleneck”:

In any multi-agent system, the synthesis/decision layer is the bottleneck. Improving individual agent intelligence without improving orchestration yields diminishing returns.

The agents figured out their portfolio manager was the weakest link before the humans did.

The Numbers

| Metric | Karpathy autoresearch | Worsey ATLAS |
| --- | --- | --- |
| What evolves | Python code | Agent prompts |
| Fitness metric | val_bpb (lower = better) | Rolling Sharpe (higher = better) |
| Time per experiment | 5 minutes (GPU) | 5 trading days |
| Keep/revert | git commit / git reset | git commit / git reset |
| Keep rate | ~18% | ~30% |
| Infrastructure cost | H100 GPU time | $20/month Azure VM |
| Human role | Write program.md | Design architecture, initial prompts |

The infrastructure cost for ATLAS is striking: $20/month Azure VM, $50-80 total for the 18-month backtest. Claude Sonnet via API. No GPU required.

The value isn’t in the compute. It’s in the loop.

Specific Agent Improvements

Before and after prompt evolution, measured by Sharpe ratio:

| Agent | Sharpe before | Sharpe after |
| --- | --- | --- |
| Financials | -4.14 | 0.45 |
| Emerging Markets | -0.45 | -0.06 |
| Semiconductor | -0.26 | -0.06 |

The Financials agent went from catastrophically bad to positive. Not by a human rewriting its prompt based on intuition — by the system testing modifications against real market outcomes and keeping what worked.

“The final prompts are evolutionary products — shaped by market feedback, not human intuition.”

— Chris Worsey

The Principle Generalizes

Strip away the finance specifics and the pattern is universal:

Agents + Measurable Outcomes + Keep/Revert Loop = Self-Improvement

The domain doesn’t matter. What matters is:

  1. Specialized agents with distinct roles (not one agent trying to do everything)
  2. A measurable fitness function (Sharpe ratio, validation loss, or… ranking improvements, conversion rates, engagement metrics)
  3. A loop that tests modifications and keeps what works
  4. Time for the evolutionary pressure to compound
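Stripped to essentials, the whole pattern is one higher-order function; everything domain-specific hides in the `mutate` and `fitness` callbacks. A minimal, domain-agnostic sketch (the names are mine, not from either project):

```python
from typing import Callable

def evolve(state, mutate: Callable, fitness: Callable, steps: int,
           higher_is_better: bool = True):
    """Generic keep/revert loop: propose a change, measure, keep only improvements."""
    best_score = fitness(state)
    for _ in range(steps):
        candidate = mutate(state)                 # e.g. edit code, rewrite a prompt
        score = fitness(candidate)                # e.g. val_bpb, Sharpe, ROAS
        better = score > best_score if higher_is_better else score < best_score
        if better:
            state, best_score = candidate, score  # keep (the "commit" branch)
        # else: drop the candidate (the "revert" branch)
    return state, best_score
```

The same function evolves code when `fitness` is validation loss and prompts when it is a rolling Sharpe; only the two callbacks change between domains.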

This works for SEO — agents scored by actual ranking changes over weeks. For content — agents scored by engagement and conversion. For sales — agents scored by pipeline generated. For ad creative — agents scored by ROAS and creative fatigue curves.

Any business function where you can measure outcomes is a candidate for this loop.

What Karpathy Sees Next

Karpathy’s own roadmap for autoresearch points to something bigger:

“The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it’s to emulate a research community of them.”

Not one agent getting better. A community of agents, exploring in parallel, sharing discoveries, evolving together. The jump from single-agent to multi-agent mirrors exactly what Worsey built — 25 agents, each exploring a different angle, weighted by proven performance.

What This Means

We’re watching a shift happen in real time.

Karpathy releases a loop for evolving code. Within a week, someone applies it to financial markets and deploys it with real capital. The framework is open source. The infrastructure costs $20/month. The results beat most human fund managers.

The question isn’t whether AI agents can improve themselves. They already are. The question is: what else can this loop optimize?

The keep rate is 30%. The other 70% get reverted. That’s not failure — that’s how evolution works. Most mutations don’t help. The ones that do compound.

Prompts are the new weights. Outcomes are the new loss function. And the agents are getting better while we sleep.


Sources: Karpathy autoresearch (28.1K stars), ATLAS-GIC by Chris Worsey / General Intelligence Capital. ATLAS paper: “Adaptive Trading with LLM Agents Through Dynamic Prompt Optimization” (alphaXiv 2510.15949).