AI Organizations Are Emerging: What Tech Leaders Are Teaching Us
February 2026. Four signals arrive in the same week.
The Claude Code engineering team publishes their hard-won lessons on prompt caching — the invisible infrastructure that makes long-running agents economically viable. Andrej Karpathy shares screenshots of eight AI agents running simultaneously in tmux windows, each working on separate git branches of a research project. Lance Martin reveals programmatic tool calling — a new capability where Claude orchestrates tool calls in code rather than round-tripping each one through its context window. And Pieter Levels clears his entire todo board across eight products in a single week, running Claude Code directly on production servers.
Four different people. Four different angles. All converging on the same conclusion.
Convergence
Something is converging in AI engineering right now. The people who understand this technology most deeply — the ones actually building production systems — are arriving at the same conclusion:
Single agents aren’t enough. The future is coordinated groups of agents working together.
Not as a theoretical concept. As working systems. With real engineering challenges that nobody talks about in the “AI will change everything” thinkpieces.
The Claude Code team’s insights come from production — millions of developers use their tool daily. They’ve learned what breaks at scale. Karpathy’s come from experimentation — pushing the boundaries of what multi-agent coordination can do today. Lance Martin’s come from the API layer — rethinking how agents interact with tools to slash the overhead that makes multi-step work expensive. And Levelsio’s come from the trenches — one founder running agents at full speed across a portfolio of live products.
Together, their lessons form a blueprint for what AI organizations actually need.
Lesson 1: The Infrastructure Is Everything
From Claude Code: Cache Rules Everything Around Me
The Claude Code engineering team shared a deep technical breakdown of what makes long-running agents feasible. Their core message is blunt:
Prompt caching is what makes long-running agentic products like Claude Code feasible. It allows us to reuse computation from previous roundtrips and significantly decrease latency and cost.
Prompt caching works by prefix matching — the API caches everything from the start of a request up to each breakpoint. This means the order you arrange your prompt matters enormously. Claude Code structures every request as:
- Static system prompt & tools (globally cached)
- CLAUDE.md project context (cached within a project)
- Session context (cached within a session)
- Conversation messages (new each turn)
Static content first, dynamic content last. Maximum prefix sharing across sessions.
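The ordering can be sketched as a request builder. This is a minimal sketch: the cache_control field mirrors Anthropic's prompt-caching API, but the helper, prompt contents, and section values are illustrative, not Claude Code's actual code.

```python
def build_request(system_prompt, claude_md, session_context, messages):
    """Assemble a request so the most stable content forms the longest
    shared prefix. A cache breakpoint after each stable section lets the
    API reuse everything before it on the next request."""
    return {
        "system": [
            # 1. Static system prompt: identical across all sessions.
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            # 2. Project context: stable within a project.
            {"type": "text", "text": claude_md,
             "cache_control": {"type": "ephemeral"}},
            # 3. Session context: stable within a session.
            {"type": "text", "text": session_context,
             "cache_control": {"type": "ephemeral"}},
        ],
        # 4. Conversation messages: the only part that changes each turn.
        "messages": messages,
    }

req = build_request(
    "You are a coding agent.",      # illustrative values
    "# Project notes",
    "cwd=/repo",
    [{"role": "user", "content": "Fix the failing test."}],
)
```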
The team monitors their prompt cache hit rate the same way they monitor uptime — alerting on drops and declaring incidents when rates fall too low.
This isn’t a performance optimization. It’s a foundational architecture decision that determines whether the product is economically viable.
The Fragility Nobody Expects
What’s striking is how easily this breaks. The Claude Code team lists the ways they’ve accidentally destroyed their cache hit rates:
- Putting a timestamp in the static system prompt — one dynamic value in the wrong position invalidated everything after it
- Shuffling tool definitions non-deterministically — the order changed between requests, breaking prefix matching
- Updating tool parameters — changing which agents the AgentTool can call mid-session
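A toy measure of shared-prefix length shows why one misplaced dynamic value is so destructive. This is illustrative code, not the real caching machinery: prefix matching here is character-by-character on a serialized prompt string.

```python
from datetime import datetime, timezone

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two serialized prompts — a rough
    proxy for how much of the previous request's cache can be reused."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

STATIC = "You are a coding agent. Tools: read, edit, bash. "
CONTEXT = "Project: nanochat. " * 50  # stands in for a large cached prefix

# Bad: a timestamp at the top differs on every request, so the shared
# prefix ends after a handful of characters and everything after it
# is re-billed at full price.
bad_1 = f"[{datetime(2026, 2, 1, tzinfo=timezone.utc)}] " + STATIC + CONTEXT
bad_2 = f"[{datetime(2026, 2, 2, tzinfo=timezone.utc)}] " + STATIC + CONTEXT

# Good: dynamic values go last; the entire static prefix still matches.
good_1 = STATIC + CONTEXT + "now=2026-02-01"
good_2 = STATIC + CONTEXT + "now=2026-02-02"

assert shared_prefix_len(bad_1, bad_2) < len(STATIC)
assert shared_prefix_len(good_1, good_2) >= len(STATIC + CONTEXT)
```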
Their solution is counterintuitive: never change tools mid-session. Plan Mode in Claude Code doesn’t swap out the toolset for read-only tools. Instead, EnterPlanMode and ExitPlanMode are tools themselves — the model calls them to transition states. The tool definitions never change.
Changing the tool set in the middle of a conversation is one of the most common ways people break prompt caching. It seems intuitive — you should only give the model tools it needs right now. But because tools are part of the cached prefix, adding or removing a tool invalidates the cache for the entire conversation.
The same principle applies to their Tool Search feature. Instead of removing unused tools (which breaks the cache), they use lightweight stubs with defer_loading: true. The model discovers full tool schemas only when needed, through a ToolSearch tool. The cached prefix stays stable.
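The stub pattern might look roughly like this. The defer_loading flag matches the field named above; the tool names, schemas, and ToolSearch shape are illustrative assumptions.

```python
# Full schemas live outside the prompt until the model asks for them.
FULL_SCHEMAS = {
    "jira_create_ticket": {  # hypothetical tool for illustration
        "name": "jira_create_ticket",
        "description": "Create a ticket in Jira.",
        "input_schema": {
            "type": "object",
            "properties": {"title": {"type": "string"},
                           "body": {"type": "string"}},
        },
    },
}

def tool_prefix():
    """The stable, cacheable tool list: every tool is present from the
    first request, most as lightweight stubs, and the list never mutates
    mid-session — so the cached prefix stays valid."""
    return [
        {"name": "ToolSearch", "description": "Find tools by task."},
        {"name": "jira_create_ticket", "defer_loading": True},
    ]

def tool_search(query: str):
    """Called by the model when it needs a full schema. The result comes
    back as a message, not as an edit to the tool list, so nothing in
    the cached prefix changes."""
    return [s for name, s in FULL_SCHEMAS.items() if query in name]
```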
Lesson 2: Agent Intelligence Isn’t the Bottleneck
From Karpathy: The Multi-Agent Research Org
Andrej Karpathy — co-founder of OpenAI, former Tesla AI lead — is experimenting with multi-agent systems at a completely different level. He’s running eight agents simultaneously (four Claude, four Codex), each with a GPU, trying to collectively optimize a neural network training codebase.
His setup looks like a real research organization:
- Each agent works on its own git branch (isolation via worktrees)
- Communication through simple files (no Docker/VMs)
- Agents run in tmux window grids — like a virtual open-plan office
- He tested multiple structures: solo researchers, a chief scientist directing junior researchers
The results are illuminating — and honest:
The reason it doesn’t work so far is that the agents’ ideas are just pretty bad out of the box, even at highest intelligence. They don’t think carefully through experiment design, they run a bit non-sensical variations, they don’t create strong baselines and ablate things properly.
One agent “discovered” that increasing the hidden size of a neural network improves validation loss — a completely spurious result (bigger networks always do this given enough data and training time). The model intelligence was there. The methodology wasn’t.
But his framing of the problem is what matters most:
The goal is that you are now programming an organization and its individual agents, so the “source code” is the collection of prompts, skills, tools, and processes that make it up. A daily standup in the morning is now part of the “org code.”
This is a fundamental reframe. You’re not building a smarter agent. You’re designing an organization — with roles, processes, communication patterns, and quality controls.
Lesson 3: Context Management Is the Core Problem
Both teams are independently fighting the same battle: how do you maintain coherent context across multiple agents and long sessions?
Claude Code’s Approach: Cache-Safe Forking
When a Claude Code session hits the context window limit, it needs to compact — summarize the conversation and continue. The naive implementation (separate API call with different system prompt) destroys the cache, making the user pay full price for all input tokens.
Their solution: use the exact same system prompt, tools, and conversation prefix for the compaction call, and append the compaction prompt as a new user message at the end. From the API’s perspective, the request looks nearly identical to the parent conversation’s last request — same prefix, same tools, same history — so the cached prefix is reused. The only new tokens are the compaction prompt itself.
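A cache-safe fork can be sketched as: reuse the parent request verbatim and append exactly one message. The request shape and the compaction prompt below are illustrative, not Claude Code's actual implementation.

```python
# Hypothetical compaction prompt, for illustration only.
COMPACTION_PROMPT = ("Summarize this conversation so it can continue in a "
                     "fresh context. Preserve open tasks and key decisions.")

def compaction_request(parent_request: dict) -> dict:
    """Build the compaction call from the parent request. Everything the
    cache keys on (system prompt, tools, message history) is reused
    verbatim; the only new tokens are one appended user message."""
    return {
        "system": parent_request["system"],          # identical prefix
        "tools": parent_request["tools"],            # identical tools
        "messages": parent_request["messages"] + [   # same history + 1 turn
            {"role": "user", "content": COMPACTION_PROMPT},
        ],
    }

parent = {"system": "static prompt", "tools": [{"name": "bash"}],
          "messages": [{"role": "user", "content": "hi"},
                       {"role": "assistant", "content": "hello"}]}
fork = compaction_request(parent)
```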
Karpathy’s Approach: Git as Shared Memory
Karpathy uses git as the coordination layer. Each agent gets a worktree (isolated copy of the repo), works on a feature branch, and the results are merged back. Simple files serve as communication channels.
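The worktree-per-agent layout can be sketched with plain git commands driven from Python. This is a minimal sketch under stated assumptions: git is installed, and the repo path and agent names are made up.

```python
import subprocess
import tempfile
from pathlib import Path

def run(*args, cwd):
    """Run a git command, failing loudly if it errors."""
    subprocess.run(args, cwd=cwd, check=True, capture_output=True)

root = Path(tempfile.mkdtemp())
repo = root / "project"
repo.mkdir()

# One shared repository that every agent commits into.
run("git", "init", "-q", cwd=repo)
run("git", "-c", "user.email=agent@example.com", "-c", "user.name=agent",
    "commit", "--allow-empty", "-q", "-m", "init", cwd=repo)

# One isolated worktree + branch per agent: shared history,
# but no two agents ever edit the same working directory.
agents = ["agent-1", "agent-2", "agent-3"]
for name in agents:
    run("git", "worktree", "add", "-q", str(root / name), "-b", name, cwd=repo)

worktrees = sorted(p.name for p in root.iterdir() if p != repo)
```

Merging a branch back into the shared repo is then an ordinary `git merge`, which is exactly the structured channel the coordination relies on.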
This is strikingly similar to how human research teams work — each person has their own workspace, they commit to a shared repository, and coordination happens through structured communication rather than everyone working in the same document.
The Pattern
Both approaches converge on the same principle: isolate agent work, share through structured channels, keep the shared context stable.
- Claude Code: Stable prompt prefix + messages for updates
- Karpathy: Shared git repo + individual worktrees
Neither puts agents in the same context window. Both use structured protocols for coordination.
Lesson 4: Roles Matter More Than Intelligence
Karpathy’s most revealing finding is that model capability isn’t the constraint. The agents are “very good at implementing any given well-scoped and described idea but they don’t creatively generate them.”
The Claude Code team arrived at a similar conclusion about model switching:
If you’re 100k tokens into a conversation with Opus and want to ask a question that is fairly easy to answer, it would actually be more expensive to switch to Haiku than to have Opus answer, because we would need to rebuild the prompt cache for Haiku.
Their solution: subagents. The main Opus agent prepares a “handoff” message to a cheaper model for specific tasks. The Explore agents in Claude Code use Haiku — not because they need less intelligence, but because they operate in isolated contexts where cache economics work differently.
This is organizational design. Not every person in a company needs to be a senior executive. You need specialists, generalists, reviewers, and executors — each operating at the right level with the right context.
Lesson 5: The Composition Tax Is Real
From Lance Martin: Programmatic Tool Calling
While the Claude Code team optimizes how agents maintain context over long sessions, Lance Martin — who leads developer relations at Anthropic — identified a different bottleneck: the overhead of individual tool calls.
Every time an agent calls a tool, three things happen:
- Latency — a round trip through the API
- Context bloat — the tool result is serialized into the conversation (thousands of rows even if the next step only needs five)
- A reasoning step — the model has to decide what to do next
Martin calls this the composition tax. And it grows with every additional tool call.
Tools trade-off control with composability. Consider three actions as tool calls. The context from each tool call is returned back to Claude. Each round trip costs latency, serializes the tool result into context, and introduces a reasoning step. The composition tax grows with the number of actions.
The solution is programmatic tool calling (PTC) — a new capability in Claude Opus/Sonnet 4.6. Instead of calling tools one at a time, Claude writes code that orchestrates tool calls inside a container. When the code calls a tool (await web_search(query)), the container pauses, the call crosses the sandbox boundary as a typed tool-use event, gets fulfilled normally — but the result returns to the running code, not to Claude’s context window.
The code can then parse, filter, cross-reference, and accumulate results programmatically. Only the final output reaches Claude.
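The pattern can be simulated with a short orchestration script of the kind the model would write inside the container. Everything here is a stand-in: query_sales mimics a tool whose result returns to the running code rather than to the model's context, and the data is invented.

```python
import asyncio

async def query_sales(region: str) -> list[dict]:
    """Stand-in for a tool call that crosses the sandbox boundary.
    In the real feature the result comes back to this code, not to
    the model's context window; here it is simulated locally."""
    data = {"emea": [{"sku": "A", "units": 120}, {"sku": "B", "units": 8}],
            "apac": [{"sku": "A", "units": 95}, {"sku": "C", "units": 310}]}
    return data[region]

async def main() -> dict:
    # Read-only calls run concurrently — no reasoning step between them.
    emea, apac = await asyncio.gather(query_sales("emea"),
                                      query_sales("apac"))
    # Aggregate in code; arbitrarily large tool results can flow through
    # here without ever entering the model's context.
    totals: dict[str, int] = {}
    for row in emea + apac:
        totals[row["sku"]] = totals.get(row["sku"], 0) + row["units"]
    # Only this small summary would reach the model.
    return {"top_sku": max(totals, key=totals.get), "totals": totals}

summary = asyncio.run(main())
```

Two round trips and a filtering step collapse into one block of code; the model sees a few dozen tokens of summary instead of every raw row.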
The results are concrete: across BrowseComp and DeepSearchQA benchmarks, PTC improved accuracy by an average of 11% while using 24% fewer input tokens. Opus 4.6 with PTC is currently #1 on LMArena’s Search Arena.
PTC represents a fundamental shift in how agents work: instead of being the bottleneck for every decision, the model becomes an architect — writing the plan as code, then letting the code execute it.
When to Promote Actions to Tools
Martin also offers a clear framework for thinking about tools — directly relevant to anyone building agent systems:
- UX — when actions need to be caught and rendered to the user in a specific way
- Guardrails — when actions need safety checks (e.g., staleness checks before file edits)
- Concurrency — when read-only actions can safely run in parallel
- Observability — when you need to measure latency or token usage for specific actions
- Autonomy — when you want to group actions by how freely the system can approve them
This isn’t just API design. It’s the beginnings of an agent operating system — where the tool layer becomes a control surface for an entire organization of agents.
Lesson 6: Trust Is the Unlock
From Levelsio: One Person, Eight Products, One Week
While Karpathy experiments with multi-agent research orgs and the Claude Code team optimizes infrastructure, Pieter Levels — the solo founder behind Nomad List, Photo AI, and half a dozen other products — is living the result. This week, he ran Claude Code on his production servers in bypass mode and cleared his entire todo board for the first time in his career.
The output is staggering. In a single week across eight products:
- Photo AI — new image viewer, batch remix, security overhaul migrating from hash auth to session tokens, multi-model selection
- Interior AI — revived a 6-month-old feature, built a Gaussian Splat viewer for 3D, added .skp file support
- Nomad List — launched AI-generated newsletter, rebuilt profile editing, added hundreds of profile tags
- Hoodmaps — revived write mode (broken for years), built heatmap with sentiment-scored tags from 50K+ entries, fixed root database issues
- Plus four more products with significant updates
His assessment: 10x his normal output.
The real bottleneck is becoming myself and my creativity, not how fast I can ship. I ship faster now than I can come up with new ideas.
This is the same observation Karpathy made from the opposite direction. Karpathy found that agents implement well but don’t think creatively. Levelsio found that with agents implementing everything, his creative capacity became the constraint. They’re describing the same boundary from both sides.
The Speed Comes from Trust
Levelsio’s most counterintuitive insight is about trust:
You start going really fast the more you let it just go loose. Before I was slow because I didn’t trust it and I was scared it would destroy my code, now I just let it go.
He noticed that friends who code with Claude Code are slow because they still check everything manually. His alternative: create tests, let the agent run freely, check if the result works. In 99% of cases, it just does.
His workflow is tight: Claude Code runs directly on the production server. No deployment step. Code → refresh → test → iterate. The feedback loop is as compressed as it can possibly be.
The Human Context Window
Perhaps the most fascinating observation is about a new bottleneck nobody predicted:
Another limit is becoming my own mental context window — how many things, features, bugs, projects I can keep in my mind in parallel while building on all of them.
When shipping speed is no longer the constraint, the constraint becomes how many parallel threads a human can manage. This is exactly where AI organizations — with structured task management, role specialization, and coordination protocols — become essential. Not because any single agent can’t do the work, but because a human can’t oversee unlimited parallel agents without organizational structure.
Levelsio is operating at the edge of what one human + one agent can do. The next step is what Karpathy is building: structured groups of agents that can coordinate without requiring a human to hold every thread.
What AI Organizations Actually Need
Combining the lessons from all four, the requirements for functional AI organizations become clear:
1. Stable Infrastructure (Not Smarter Models)
The Claude Code team’s cache architecture isn’t glamorous. It’s plumbing. But it’s the plumbing that makes million-token conversations economically viable. Without it, the product doesn’t exist.
2. Structured Roles, Not Flat Hierarchies
Karpathy tested both “eight independent researchers” and “one chief scientist + eight juniors.” Neither worked perfectly, but the structured version produced more coherent results. Agents need well-scoped tasks and clear roles.
3. Shared Context with Isolated Execution
Git worktrees. Stable prompt prefixes. Message-based updates instead of shared state. The teams independently converged on the same principle: let agents work in isolation, coordinate through structured protocols.
4. Efficient Composition (Minimize the Tax)
PTC shows where individual agents are heading: instead of one tool call at a time with the model reasoning between each step, agents write code that orchestrates entire workflows. The model architects; the code executes. This is essential for AI organizations — when you have multiple agents each making multiple tool calls, the composition tax scales quadratically unless you solve it at the architecture level.
5. Process as Code
Karpathy’s insight is the most forward-looking: the “source code” of an AI organization is its prompts, skills, tools, and processes. A daily standup is code. A review process is code. Quality gates are code.
This is a genuine paradigm shift. We’ve spent decades building tools that help humans coordinate. Now we’re building the coordination layer itself.
6. A Control Surface for Everything
Martin’s framework for when to promote actions to tools — UX, guardrails, concurrency, observability, autonomy — is really a framework for organizational governance. In a human company, you have approval workflows, audit trails, and permission levels. In an AI organization, the tool layer is that control surface.
7. Monitoring and Quality Control
The Claude Code team alerts on cache miss rates. Karpathy manually reviews agent experiments and catches spurious results. Martin instruments tool calls for latency and token measurement. All three recognize that unmonitored agents produce unreliable work.
AI organizations need the equivalent of code review, performance reviews, and incident response — automated and embedded in the system.
The Emerging Picture
We’re witnessing the transition from “AI agent” to “AI organization” in real time. The people building it are discovering what management theorists have known for decades:
- Organizations need structure. Flat hierarchies fail at scale. Roles, processes, and communication protocols matter.
- Infrastructure determines capability. The best workers can’t function in a broken office. The smartest model can’t function with broken context management.
- Coordination is harder than execution. Individual agent capability is impressive. Getting agents to work together coherently is the real engineering challenge.
- Efficiency compounds. A 24% token reduction per agent multiplied across an organization of agents isn’t incremental — it’s the difference between viable and not.
The tools are becoming available. Anthropic is building prompt caching, compaction, and programmatic tool calling directly into their API. Claude Code is proving that long-running agent sessions can work at production scale. Karpathy is mapping the failure modes of multi-agent coordination.
The question is no longer “can AI agents do useful work?” — it’s “how do you organize them?”
What Comes Next
AI organizations won’t look like human organizations with AI substituted in. They’ll be something new — faster coordination cycles, perfect memory within sessions, zero ego in role assignment, but also new failure modes around context loss, spurious reasoning, and coordination overhead.
The developments covered here are complementary layers of the same stack:
| Layer | Problem | Solution | Who |
|---|---|---|---|
| Session | Long conversations are expensive | Prompt caching + cache-safe compaction | Claude Code team |
| Execution | Multi-step tool use is wasteful | Programmatic tool calling | Lance Martin / Anthropic |
| Coordination | Multiple agents need to collaborate | Organizational design as code | Andrej Karpathy |
| Trust | Humans bottleneck the loop | Let agents run, verify results | Pieter Levels |
Each layer makes the ones above it possible. Without efficient sessions, you can’t have efficient execution. Without efficient execution, coordinating multiple agents is economically absurd. Without trusting the system enough to let it run, none of the efficiency gains matter. All four layers had to be solved — and all four are being solved simultaneously.
Skills: The Reusable Building Blocks
One more signal worth noting. Boris Cherny, creator of Claude Code, just announced two new skills — /simplify (parallel agents improving code quality) and /batch (parallel agents executing code migrations across dozens of files using git worktrees).
What’s revealing isn’t the features themselves. It’s how they were built: Cherny used them daily for his own work first. Then packaged them for everyone.
This is the same pattern emerging everywhere: build a capability for yourself, validate it works, then share it as a reusable skill. The “source code” of an AI organization — to use Karpathy’s framing — is assembled from battle-tested building blocks, not designed from scratch.
Individual skills compose into agent capabilities. Agent capabilities compose into organizational workflows. The hierarchy mirrors how human organizations evolve: proven practices become standard operating procedures become institutional knowledge.
The agents aren’t quite ready to run the whole research org. As Karpathy put it, they implement well but don’t think creatively. But the gap is closing fast, and the infrastructure being built right now will determine how quickly it closes.
The question isn’t whether AI organizations will exist. They’re already being built — one cache optimization, one programmatic tool call, one reusable skill at a time. The question is what they’ll be capable of when the layers snap together.