From Vibe Coding to Agent Engineering: What Actually Changed
In February 2025, Andrej Karpathy posted a tweet that defined an era: “There’s a new kind of coding I call vibe coding, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.”
Thirteen months later, on the No Priors podcast, he retired his own term.
“I went from 80/20 of writing code myself versus delegating to agents to like 2/98. I don’t think I’ve typed a line of code since December.”
The man who gave us “vibe coding” now calls it passé. His new term: agentic engineering. Not because AI got worse at generating code — but because generating code was never the hard part.
What Karpathy Actually Said
The No Priors interview is worth watching in full, but three shifts stand out:
1. The ratio flipped. Karpathy went from writing 80% of his code to writing 2%. Agents handle the rest. The bottleneck moved from typing speed to orchestration skill — how well you direct multiple agents working in parallel.
2. “Code’s not even the right verb anymore.” Software development became macro-action orchestration. You don’t write functions; you delegate features. You don’t debug line by line; you review at the architecture level. Peter Steinberger runs dozens of agents simultaneously, each on 20-minute tasks across multiple repositories.
3. AutoResearch removes humans from the loop entirely. Karpathy built an autonomous research loop for nanoGPT that runs overnight, optimizing hyperparameters. Despite his years of hand-tuning, the agent found improvements he’d missed — forgotten weight decay on value embeddings, insufficiently tuned Adam betas. His conclusion: “To get the most out of the tools, you have to remove yourself as the bottleneck.”
The consistent thread: the value shifted from execution to judgment. Vibe coding was about execution — prompt, generate, ship. Agentic engineering is about judgment — architecture, verification, orchestration.
The Engineering Manual for What Comes Next
The same week, Tw93, a prolific open-source engineer and the creator of Pake and Mole, published “You Don’t Know AI Agents,” a deep technical guide covering what it actually takes to make agents reliable in production. Where Karpathy provides the vision, Tw93 provides the engineering manual.
His central thesis: harnesses matter more than models.
“Using a more expensive model doesn’t always yield the massive improvements you’d expect. Instead, the quality of your harness and validation tests has a far greater impact on success rates.”
This isn’t theoretical. OpenAI’s own engineering team demonstrated it: three engineers wrote a million lines of code in five months — ten times traditional speed. The key wasn’t a better model. It was correct engineering decisions about constraints, validation, and agent infrastructure.
Five Principles That Separate Vibe Coding from Agent Engineering
1. Context Engineering, Not Prompt Engineering
The attention complexity of a Transformer is O(n²). The longer the context, the easier crucial signals get diluted. The most common failure mode isn’t “the model can’t do it” — it’s Context Rot: irrelevant content accumulating until the agent’s decision quality visibly degrades.
The solution is layered context management:
- Permanent layer: Identity, conventions, hard constraints. Short, stable, always loaded.
- On-demand layer: Skills and domain knowledge. Descriptors stay resident; full content loads only when triggered.
- Runtime injection: Timestamps, user preferences, dynamic state. Appended per turn.
- Memory layer: Cross-session experience. Read only when relevant, not stuffed into every prompt.
The key insight: don’t put deterministic logic into the context. Anything expressible as code rules, linters, or hooks should be handled by external systems. The model should think, not read rulebooks.
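The layering above can be made concrete. Here is a minimal sketch of a context builder, in Python, with all names and content hypothetical: the permanent layer is always emitted first, skills and memory load only when triggered or relevant, and runtime state is appended per turn.

```python
from dataclasses import dataclass, field

@dataclass
class ContextBuilder:
    """Assembles a prompt from the four layers: permanent, on-demand, runtime, memory."""
    permanent: str                                          # identity, conventions, hard constraints
    skills: dict[str, str] = field(default_factory=dict)    # skill name -> full skill text
    memory: dict[str, str] = field(default_factory=dict)    # topic -> stable cross-session facts

    def build(self, task: str, runtime_state: str,
              triggered_skills: list[str], relevant_topics: list[str]) -> str:
        parts = [self.permanent]                 # short, stable, always loaded
        for name in triggered_skills:            # full skill content loads only when triggered
            parts.append(self.skills[name])
        for topic in relevant_topics:            # memory read only when relevant
            parts.append(self.memory[topic])
        parts.append(runtime_state)              # injected per turn, never cached
        parts.append(task)
        return "\n\n".join(parts)

# Hypothetical usage: only the triggered skill and relevant memory enter the prompt.
builder = ContextBuilder(
    permanent="You are a release engineer. Never push to main directly.",
    skills={"changelog": "To write a changelog: group commits by type, newest first."},
    memory={"repo": "Default branch is 'develop'; CI runs on every push."},
)
prompt = builder.build(
    task="Draft the v2.1 changelog.",
    runtime_state="Current time: 2025-06-01T09:00Z",
    triggered_skills=["changelog"],
    relevant_topics=["repo"],
)
```

Deterministic rules (lint configs, commit hooks) stay outside this builder entirely, per the principle above.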
2. Tool Design Following ACI Principles
Most tool failures aren’t about the model picking the wrong tool — they’re about the tool being designed for engineers, not agents. The Agent-Computer Interface (ACI) framework changes the design perspective:
| Aspect | Bad Tool Design | Good Tool Design |
|---|---|---|
| Granularity | Maps to API endpoints | Maps to agent goals |
| Returns | Complete raw data | Fields relevant to next decision |
| Errors | Generic string | Structured with fix suggestions |
| Description | What it does | When to use and when NOT to use |
A practical example: instead of providing get_post + update_content + update_title as separate tools, provide update_yuque_post that expresses the complete action in one call. Counter-examples in tool descriptions boost accuracy from 53% to 85%.
When debugging agents, check tool definitions first. Most tool selection errors stem from inaccurate descriptions, not model capability.
3. Memory as Infrastructure, Not Afterthought
Agents lack native temporal continuity. When a session ends, the context is gone. Four types of memory solve different problems:
- Working memory (context window): Current task state. Actively managed.
- Procedural memory (skills): How to do things. Loaded on demand.
- Episodic memory (session logs): What happened. Persisted, searchable.
- Semantic memory (MEMORY.md): Stable facts. Injected at startup.
The critical design choice: memory consolidation must be reversible. When compacting long conversations, don’t delete raw messages — archive them. Move a pointer, don’t destroy data. If consolidation produces a bad summary, the agent can still fall back to the raw history.
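A pointer-based compaction scheme makes the “archive, don’t delete” rule concrete. This is a minimal sketch with invented names; the summarizer is passed in as a function, standing in for an LLM call.

```python
class SessionMemory:
    """Reversible consolidation: compacting moves a pointer, never deletes raw messages."""

    def __init__(self):
        self.messages: list[str] = []   # full raw history, append-only
        self.compacted_upto = 0         # messages before this index live behind the summary
        self.summary = ""

    def add(self, msg: str) -> None:
        self.messages.append(msg)

    def compact(self, summarize) -> None:
        # Summarize everything so far; raw messages stay archived in self.messages.
        self.summary = summarize(self.messages)
        self.compacted_upto = len(self.messages)

    def working_context(self) -> list[str]:
        head = [f"[summary] {self.summary}"] if self.summary else []
        return head + self.messages[self.compacted_upto:]

    def rollback(self) -> None:
        # Bad summary? Move the pointer back and the raw history reappears.
        self.compacted_upto = 0
        self.summary = ""
```

Because `compact` only advances an index, `rollback` is a two-line operation instead of a data-recovery problem.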
4. Evaluation Before Optimization
Agent evaluation is fundamentally harder than traditional testing. The input space is infinite, LLMs are sensitive to prompt phrasing, and the same task may produce different results across runs.
Two metrics, two purposes:
- Pass@k: At least one correct run out of k. Tests capability boundaries. Use during development.
- Pass^k: All k runs correct. Tests reliability. Use before deployment.
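The two metrics can be computed from the same set of run outcomes. A minimal sketch, with hypothetical results:

```python
def pass_at_k(runs: list[bool]) -> bool:
    # Capability boundary: did at least one of the k runs succeed?
    return any(runs)

def pass_hat_k(runs: list[bool]) -> bool:
    # Reliability: did every one of the k runs succeed?
    return all(runs)

runs = [True, False, True, True, False]   # 3 of 5 runs correct
print(pass_at_k(runs))    # True  -> the agent *can* do the task
print(pass_hat_k(runs))   # False -> not reliable enough to deploy
```

An agent with a high Pass@k and a low Pass^k is exactly the case the section warns about: it looks capable in development and fails in production.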
The most dangerous anti-pattern: tuning the agent when evaluation is broken. If your scoring is flawed, you’re optimizing against distorted signals. When performance drops, check infrastructure first — resource limits causing crashes, buggy graders, or test cases disconnected from reality — before modifying the agent.
5. Multi-Agent Coordination Requires Protocols
Running multiple agents isn’t about parallelism — it’s about isolation and coordination. Sub-agents should return only summaries; their search, trial-and-error, and debugging process stays in their own context. The main agent’s context receives only conclusions.
The order matters: define protocols first, establish isolation next, then talk about collaboration. Without structured communication (JSONL message queues, task graphs, workspace isolation), errors amplify across agents. Agent A drifts, Agent B reinforces the bias, Agent C stacks on it, and all three converge on a wrong conclusion with high confidence.
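A JSONL message queue is the simplest of those protocols. Here is a sketch, with all paths and field names hypothetical: sub-agents append one summary-only record per line, and the main agent reads conclusions without ever seeing the sub-agents’ raw trial-and-error transcripts.

```python
import json
from pathlib import Path

QUEUE = Path("workspace/messages.jsonl")   # hypothetical shared workspace path

def publish(agent_id: str, task_id: str, status: str, summary: str) -> None:
    """Sub-agent side: append one conclusion per line. Raw work stays in its own context."""
    QUEUE.parent.mkdir(parents=True, exist_ok=True)
    record = {"agent": agent_id, "task": task_id, "status": status, "summary": summary}
    with QUEUE.open("a") as f:              # append-only: one JSON object per line
        f.write(json.dumps(record) + "\n")

def read_conclusions() -> list[dict]:
    """Main-agent side: consume summaries only."""
    if not QUEUE.exists():
        return []
    return [json.loads(line) for line in QUEUE.read_text().splitlines() if line]

publish("agent-a", "t1", "done", "Refactored auth module; all tests pass.")
for msg in read_conclusions():
    print(msg["agent"], msg["status"])      # the main context receives only conclusions
```

The append-only, one-record-per-line shape is what keeps concurrent writers from corrupting each other and keeps any one agent’s drift from leaking its reasoning into the shared channel.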
The Progression in One Picture
| Phase | Developer Role | Code Quality | Verification | Scale |
|---|---|---|---|---|
| Manual coding | Writer | High (your code) | You test it | One person |
| Vibe coding | Prompter | Variable | You check it | One agent |
| Agentic coding | Architect | Structured | Agent tests itself | Multiple agents |
| Agent engineering | Orchestrator | Harnessed | Automated eval | Agent teams |
Each phase didn’t replace the previous one — it subsumed it. You still need taste, still need architectural thinking, still need to understand code. But the execution layer keeps moving further from your fingertips.
What This Means in Practice
Karpathy built a home automation agent called “Dobby the House Elf” — three prompts that scanned his local network, reverse-engineered smart device APIs, and replaced six separate apps with WhatsApp commands. “Dobby, sleepy time” turns everything off.
His conclusion about software: “These apps shouldn’t even exist. Shouldn’t it just be APIs and agents are the glue of intelligence that tool-calls all the parts?”
This is the trajectory. Software moves from products you operate to agents that operate products on your behalf. The interface collapses from GUIs to natural language. The complexity doesn’t disappear — it moves into the harness, the tools, the evaluation, the memory systems that make agents reliable.
Vibe coding got us comfortable with the idea that AI writes code. Agent engineering is about building the infrastructure that makes AI-written code trustworthy, maintainable, and autonomous.
The vibes were step 1. The engineering is everything after.
TeamDay runs autonomous AI agents in the cloud — SEO, content, social media, analytics, and more. The same agent engineering principles that Karpathy and Tw93 describe power our AI workforce. Start building your agent team.