Agent Harness
/ˈeɪdʒənt ˈhɑːrnɪs/
What is an Agent Harness?
An agent harness (also called a scaffold) is the infrastructure layer that enables a language model to function as an autonomous agent. While the AI model provides reasoning and language understanding, the harness provides everything else: processing inputs, orchestrating tool calls, managing context, handling errors, and returning results.
Think of it this way: the model is the brain, but the harness is the nervous system, skeleton, and muscles that let the brain interact with the world.
Why Harnesses Matter
Models alone can't be agents. A language model by itself:
- Can't persist state between conversations
- Can't call external APIs directly
- Can't recover from errors gracefully
- Can't manage its own context window limits
The harness fills these gaps, transforming a stateless text predictor into a reliable worker.
Core Harness Responsibilities
1. Context Management
Long-running agents quickly exhaust their context windows. Harnesses implement strategies like the following (see the sketch after this list):
- Compaction: Summarizing older context to free up tokens
- Sliding windows: Dropping the oldest messages while keeping recent ones
- Selective memory: Storing important facts externally and retrieving when relevant
- Checkpointing: Saving state at key milestones for recovery
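A minimal sketch of compaction plus a sliding window, assuming a hypothetical `summarize` helper in place of a real model call and a crude word count instead of a real tokenizer:

```python
# Sketch of a context manager that compacts older messages when the
# (approximate) token budget is exceeded. Names here are illustrative,
# not a specific framework's API.
from dataclasses import dataclass, field

@dataclass
class ContextManager:
    max_tokens: int = 4000
    messages: list = field(default_factory=list)

    def _count(self, text: str) -> int:
        return len(text.split())  # rough stand-in for the model's tokenizer

    def _total(self) -> int:
        return sum(self._count(m["content"]) for m in self.messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if self._total() > self.max_tokens and len(self.messages) > 3:
            self._compact()

    def _compact(self) -> None:
        # Summarize everything except the two most recent messages.
        old, recent = self.messages[:-2], self.messages[-2:]
        summary = summarize(old)  # hypothetical model-backed summarizer
        self.messages = [{"role": "system", "content": summary}] + recent

def summarize(messages: list) -> str:
    # Placeholder: a real harness would ask the model for this summary.
    return f"[Summary of {len(messages)} earlier messages]"
```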
2. Tool Orchestration
When the model decides to use a tool, the harness (see the sketch below):
- Validates the tool call parameters
- Executes the actual API/function call
- Handles timeouts, retries, and errors
- Returns results to the model
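A sketch of that sequence, assuming a toy tool registry rather than any particular framework's API:

```python
# Validate, execute, and package a tool call. The registry and the
# result shape are assumptions made for this sketch.
import json

TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def execute_tool_call(name: str, raw_args: str) -> dict:
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        args = json.loads(raw_args)           # 1. validate parameters
    except json.JSONDecodeError as exc:
        return {"error": f"invalid arguments: {exc}"}
    try:
        result = TOOLS[name](**args)          # 2. execute the call
    except Exception as exc:                  # 3. catch tool failures
        return {"error": str(exc)}
    return {"result": result}                 # 4. return result to the model
```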
3. Error Recovery
Production agents encounter failures constantly:
- API rate limits
- Network timeouts
- Invalid tool responses
- Model hallucinations
Good harnesses detect these issues and apply recovery strategies: retrying, falling back, or gracefully degrading. A retry-with-backoff sketch follows.
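```python
# Sketch of retry with exponential backoff, one common recovery strategy.
# TransientError is a stand-in for rate-limit and timeout exceptions.
import random
import time

class TransientError(Exception):
    """Stands in for rate limits, network timeouts, and similar failures."""

def call_with_retries(fn, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Exponential backoff with jitter to avoid synchronized retries.
            time.sleep(2 ** attempt + random.random())
```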
4. State Persistence
For agents that work across multiple sessions (a checkpoint sketch follows the list):
- Maintaining progress files and logs
- Git integration for code agents
- Database state for business process agents
- Session handoff between agent instances
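A sketch of the simplest variant, file-based checkpointing; the file path and state shape are assumptions, and a production agent might use a database instead:

```python
# Save and restore agent state between sessions via a JSON file.
import json
from pathlib import Path

CHECKPOINT = Path("agent_state.json")

def save_checkpoint(state: dict) -> None:
    CHECKPOINT.write_text(json.dumps(state, indent=2))

def load_checkpoint() -> dict:
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"milestone": 0, "notes": []}  # fresh start
```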
Harness Architecture Patterns
Single-Loop Harness
User Input → Model → Tool Call → Execute → Model → Response
               ↑                             │
               └─────────────────────────────┘
Simple, synchronous execution. Good for short tasks.
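A sketch of this loop, with `call_model` and `run_tool` as stubs standing in for a provider SDK and the tool layer:

```python
# Minimal single-loop harness: feed the model, run any requested tool,
# loop until the model produces a final answer or a step limit is hit.
def call_model(context: list) -> dict:
    # Stub: a real harness would call the provider's API here.
    return {"content": "final answer", "tool_call": None}

def run_tool(name: str, args: dict) -> str:
    return f"result of {name}({args})"  # stub tool execution

def run_single_loop(user_input: str, max_steps: int = 10) -> str:
    context = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        reply = call_model(context)
        if reply["tool_call"] is None:
            return reply["content"]       # model answered: exit the loop
        name, args = reply["tool_call"]
        context.append({"role": "tool", "content": run_tool(name, args)})
    return "Step limit reached without a final answer."
```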
Multi-Agent Harness
Orchestrator Agent
├── Research Agent
├── Coding Agent
└── Review Agent
Specialized agents coordinated by a central harness. Good for complex workflows.
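A sketch of this delegation pattern, with each specialist reduced to a plain function for illustration; real harnesses would run a full agent loop per specialist:

```python
# An orchestrator that routes work through specialist agents in sequence.
def research_agent(task: str) -> str:
    return f"notes on {task}"

def coding_agent(task: str, notes: str) -> str:
    return f"code for {task}, informed by: {notes}"

def review_agent(code: str) -> str:
    return f"review of: {code}"

def orchestrate(task: str) -> str:
    notes = research_agent(task)   # gather background first
    code = coding_agent(task, notes)
    return review_agent(code)      # final quality gate
```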
Long-Running Harness
Session 1: Init Agent → [checkpoint] →
Session 2: Continue Agent → [checkpoint] →
Session 3: Continue Agent → Complete
Work spans multiple context windows with state persistence between sessions.
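A sketch of a session driver built on that idea; the file name and milestone list are assumptions made for illustration:

```python
# Resume from the last checkpoint, do one milestone's worth of work,
# and checkpoint again before the session ends.
import json
from pathlib import Path

STATE = Path("agent_state.json")
MILESTONES = ["plan", "implement", "test", "ship"]

def run_one_session() -> None:
    state = json.loads(STATE.read_text()) if STATE.exists() else {"milestone": 0}
    if state["milestone"] >= len(MILESTONES):
        return  # work already complete
    # ... run the agent on MILESTONES[state["milestone"]] here ...
    state["milestone"] += 1
    STATE.write_text(json.dumps(state))  # checkpoint for the next session
```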
Common Harness Failures
Anthropic's research identified several failure patterns that harnesses must handle:
| Failure Mode | Description | Harness Solution |
|---|---|---|
| One-shotting | Agent tries to complete everything in one session | Break into milestones, enforce checkpoints |
| Environmental degradation | Agent leaves broken state for next session | Require clean state verification before handoff |
| Premature completion | Agent declares done without verification | Mandate end-to-end testing before completion |
| Context exhaustion | Agent runs out of tokens mid-task | Implement compaction and summarization |
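As one concrete example from the table, a git-based coding harness might verify clean state before handoff with a check like this sketch:

```python
# Refuse to hand off a session while the working tree is dirty.
import subprocess

def verify_clean_state() -> bool:
    # `git status --porcelain` prints nothing when the tree is clean.
    out = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip() == ""
```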
Evaluation vs. Production Harnesses
Evaluation harnesses run test suites:
- Provide instructions and tools to agents
- Run tasks concurrently
- Record complete transcripts
- Grade outcomes automatically
Production harnesses run real work:
- Handle authentication and authorization
- Implement rate limiting and quotas (sketched below)
- Provide observability and monitoring
- Manage cost and resource allocation
The best teams test their production harnesses, not just their models.
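As one example from the production checklist, rate limiting is often implemented as a token bucket; a minimal sketch, with illustrative capacity and refill values:

```python
# Token-bucket rate limiter: requests spend tokens, which refill over time.
import time

class TokenBucket:
    def __init__(self, capacity: float = 10, refill_per_sec: float = 1):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend a token for this request
            return True
        return False          # over budget: caller should wait or queue
```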
Building vs. Buying Harnesses
Build Your Own
- Pros: Full control, optimized for your use case
- Cons: Significant engineering investment, edge cases everywhere
Use a Framework
Popular options:
- LangChain/LangGraph: Flexible, large ecosystem
- CrewAI: Multi-agent orchestration
- AutoGen: Microsoft's multi-agent framework
- Claude Agent SDK: Native Anthropic support
Managed Platforms
- Anthropic Workbench: Built-in harness for Claude agents
- OpenAI Assistants API: Managed state and tools
- Vertex AI Agents: Google's managed agent infrastructure
The Harness-Model Relationship
A critical insight from Anthropic's evaluation research: you can't evaluate a model separately from its harness. Agent performance emerges from the combination.
A great model in a poor harness will fail. A mediocre model in an excellent harness can succeed. When benchmarking agents, you're always measuring the harness-model system, not the model alone.
Related Reading
- AI Agents - What harnesses enable
- Tool Use - What harnesses orchestrate
- Long-running Agents - Where harnesses shine
