Agent Harness
/ˈeɪdʒənt ˈhɑːrnɪs/
What is an Agent Harness?
An agent harness (also called a scaffold) is the infrastructure layer that enables a language model to function as an autonomous agent. While the AI model provides reasoning and language understanding, the harness provides everything else: processing inputs, orchestrating tool calls, managing context, handling errors, and returning results.
Think of it this way: the model is the brain, but the harness is the nervous system, skeleton, and muscles that let the brain interact with the world.
Why Harnesses Matter
Models alone can't be agents. A language model by itself:
- Can't persist state between conversations
- Can't call external APIs directly
- Can't recover from errors gracefully
- Can't manage its own context window limits
The harness fills these gaps, transforming a stateless text predictor into a reliable worker.
Core Harness Responsibilities
1. Context Management
Long-running agents quickly exhaust their context windows. Harnesses implement strategies like the following (see the sketch after this list):
- Compaction: Summarizing older context to free up tokens
- Sliding windows: Dropping the oldest messages while keeping recent ones
- Selective memory: Storing important facts externally and retrieving when relevant
- Checkpointing: Saving state at key milestones for recovery
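A minimal sketch of compaction plus a sliding window, assuming a hypothetical `summarize` helper in place of a real model call and a crude word count instead of a real tokenizer:

```python
# Sketch of a context manager that compacts older messages when the
# (approximate) token budget is exceeded. Names here are illustrative,
# not a specific framework's API.
from dataclasses import dataclass, field

@dataclass
class ContextManager:
    max_tokens: int = 4000
    messages: list = field(default_factory=list)

    def _count(self, text: str) -> int:
        return len(text.split())  # rough stand-in for the model's tokenizer

    def _total(self) -> int:
        return sum(self._count(m["content"]) for m in self.messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if self._total() > self.max_tokens and len(self.messages) > 3:
            self._compact()

    def _compact(self) -> None:
        # Summarize everything except the two most recent messages.
        old, recent = self.messages[:-2], self.messages[-2:]
        summary = summarize(old)  # hypothetical model-backed summarizer
        self.messages = [{"role": "system", "content": summary}] + recent

def summarize(messages: list) -> str:
    # Placeholder: a real harness would ask the model for this summary.
    return f"[Summary of {len(messages)} earlier messages]"
```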
2. Tool Orchestration
When the model decides to use a tool, the harness (see the sketch below):
- Validates the tool call parameters
- Executes the actual API/function call
- Handles timeouts, retries, and errors
- Returns results to the model
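A sketch of that sequence, assuming a toy tool registry rather than any particular framework's API:

```python
# Validate, execute, and package a tool call. The registry and the
# result shape are assumptions made for this sketch.
import json

TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def execute_tool_call(name: str, raw_args: str) -> dict:
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        args = json.loads(raw_args)           # 1. validate parameters
    except json.JSONDecodeError as exc:
        return {"error": f"invalid arguments: {exc}"}
    try:
        result = TOOLS[name](**args)          # 2. execute the call
    except Exception as exc:                  # 3. catch tool failures
        return {"error": str(exc)}
    return {"result": result}                 # 4. return result to the model
```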
3. Error Recovery
Production agents encounter failures constantly:
- API rate limits
- Network timeouts
- Invalid tool responses
- Model hallucinations
Good harnesses detect these issues and apply recovery strategies: retrying, falling back, or gracefully degrading. A retry-with-backoff sketch follows.
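```python
# Sketch of retry with exponential backoff, one common recovery strategy.
# TransientError is a stand-in for rate-limit and timeout exceptions.
import random
import time

class TransientError(Exception):
    """Stands in for rate limits, network timeouts, and similar failures."""

def call_with_retries(fn, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Exponential backoff with jitter to avoid synchronized retries.
            time.sleep(2 ** attempt + random.random())
```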
4. State Persistence
For agents that work across multiple sessions (a checkpoint sketch follows the list):
- Maintaining progress files and logs
- Git integration for code agents
- Database state for business process agents
- Session handoff between agent instances
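A sketch of the simplest variant, file-based checkpointing; the file path and state shape are assumptions, and a production agent might use a database instead:

```python
# Save and restore agent state between sessions via a JSON file.
import json
from pathlib import Path

CHECKPOINT = Path("agent_state.json")

def save_checkpoint(state: dict) -> None:
    CHECKPOINT.write_text(json.dumps(state, indent=2))

def load_checkpoint() -> dict:
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"milestone": 0, "notes": []}  # fresh start
```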
Harness Architecture Patterns
Single-Loop Harness
User Input → Model → Tool Call → Execute → Model → Response
               ↑                             │
               └─────────────────────────────┘
Simple, synchronous execution. Good for short tasks.
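A sketch of this loop, with `call_model` and `run_tool` as stubs standing in for a provider SDK and the tool layer:

```python
# Minimal single-loop harness: feed the model, run any requested tool,
# loop until the model produces a final answer or a step limit is hit.
def call_model(context: list) -> dict:
    # Stub: a real harness would call the provider's API here.
    return {"content": "final answer", "tool_call": None}

def run_tool(name: str, args: dict) -> str:
    return f"result of {name}({args})"  # stub tool execution

def run_single_loop(user_input: str, max_steps: int = 10) -> str:
    context = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        reply = call_model(context)
        if reply["tool_call"] is None:
            return reply["content"]       # model answered: exit the loop
        name, args = reply["tool_call"]
        context.append({"role": "tool", "content": run_tool(name, args)})
    return "Step limit reached without a final answer."
```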
Multi-Agent Harness
Orchestrator Agent
├── Research Agent
├── Coding Agent
└── Review Agent
Specialized agents coordinated by a central harness. Good for complex workflows.
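A sketch of this delegation pattern, with each specialist reduced to a plain function for illustration; real harnesses would run a full agent loop per specialist:

```python
# An orchestrator that routes work through specialist agents in sequence.
def research_agent(task: str) -> str:
    return f"notes on {task}"

def coding_agent(task: str, notes: str) -> str:
    return f"code for {task}, informed by: {notes}"

def review_agent(code: str) -> str:
    return f"review of: {code}"

def orchestrate(task: str) -> str:
    notes = research_agent(task)   # gather background first
    code = coding_agent(task, notes)
    return review_agent(code)      # final quality gate
```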
Long-Running Harness
Session 1: Init Agent → [checkpoint] →
Session 2: Continue Agent → [checkpoint] →
Session 3: Continue Agent → Complete
Work spans multiple context windows with state persistence between sessions.
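A sketch of a session driver built on that idea; the file name and milestone list are assumptions made for illustration:

```python
# Resume from the last checkpoint, do one milestone's worth of work,
# and checkpoint again before the session ends.
import json
from pathlib import Path

STATE = Path("agent_state.json")
MILESTONES = ["plan", "implement", "test", "ship"]

def run_one_session() -> None:
    state = json.loads(STATE.read_text()) if STATE.exists() else {"milestone": 0}
    if state["milestone"] >= len(MILESTONES):
        return  # work already complete
    # ... run the agent on MILESTONES[state["milestone"]] here ...
    state["milestone"] += 1
    STATE.write_text(json.dumps(state))  # checkpoint for the next session
```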
Common Harness Failures
Anthropic's research identified several failure patterns that harnesses must handle:
| Failure Mode | Description | Harness Solution |
|---|---|---|
| One-shotting | Agent tries to complete everything in one session | Break into milestones, enforce checkpoints |
| Environmental degradation | Agent leaves broken state for next session | Require clean state verification before handoff |
| Premature completion | Agent declares done without verification | Mandate end-to-end testing before completion |
| Context exhaustion | Agent runs out of tokens mid-task | Implement compaction and summarization |
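As one concrete example from the table, a git-based coding harness might verify clean state before handoff with a check like this sketch:

```python
# Refuse to hand off a session while the working tree is dirty.
import subprocess

def verify_clean_state() -> bool:
    # `git status --porcelain` prints nothing when the tree is clean.
    out = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip() == ""
```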
Evaluation vs. Production Harnesses
Evaluation harnesses run test suites:
- Provide instructions and tools to agents
- Run tasks concurrently
- Record complete transcripts
- Grade outcomes automatically
Production harnesses run real work:
- Handle authentication and authorization
- Implement rate limiting and quotas (sketched below)
- Provide observability and monitoring
- Manage cost and resource allocation
The best teams test their production harnesses, not just their models.
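As one example from the production checklist, rate limiting is often implemented as a token bucket; a minimal sketch, with illustrative capacity and refill values:

```python
# Token-bucket rate limiter: requests spend tokens, which refill over time.
import time

class TokenBucket:
    def __init__(self, capacity: float = 10, refill_per_sec: float = 1):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend a token for this request
            return True
        return False          # over budget: caller should wait or queue
```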
Building vs. Buying Harnesses
Build Your Own
- Pros: Full control, optimized for your use case
- Cons: Significant engineering investment, edge cases everywhere
Use a Framework
Popular options:
- LangChain/LangGraph: Flexible, large ecosystem
- CrewAI: Multi-agent orchestration
- AutoGen: Microsoft's multi-agent framework
- Claude Agent SDK: Native Anthropic support
Managed Platforms
- Anthropic Workbench: Built-in harness for Claude agents
- OpenAI Assistants API: Managed state and tools
- Vertex AI Agents: Google's managed agent infrastructure
The Harness-Model Relationship
A critical insight from Anthropic's evaluation research: you can't evaluate a model separately from its harness. Agent performance emerges from the combination.
A great model in a poor harness will fail. A mediocre model in an excellent harness can succeed. When benchmarking agents, you're always measuring the harness-model system, not the model alone.
Related Reading
- AI Agents - What harnesses enable
- Tool Use - What harnesses orchestrate
- Long-running Agents - Where harnesses shine
