Agent Evaluation
/ˈeɪdʒənt ɪˌvæljuˈeɪʃən/
What is Agent Evaluation?
Agent evaluation (or "evals") is the systematic process of testing AI agents through automated trials that measure task completion, reliability, and behavior quality. Unlike traditional LLM evaluation (which tests single responses), agent evals must handle:
- Multi-turn interactions: Agents work across many steps
- Tool usage: Agents call external systems
- State modification: Agents change the world around them
- Compounding errors: Small mistakes cascade into large failures
Why Agent Evals Are Different
Traditional LLM evaluation:
Input → Model → Output → Grade
Agent evaluation:
Task → [Model → Tool → Model → Tool → ...] → Outcome → Grade
The complexity explodes. You're not just checking if an answer is correct—you're verifying that a sequence of actions produced the right result in the right way.
Core Evaluation Concepts
Task
A single test case with defined inputs and success criteria. Example: "Resolve a customer refund request for order #12345."
Trial
One attempt at completing a task. Since model outputs vary, you run multiple trials per task to get statistical confidence.
Transcript (Trace)
The complete record of a trial: every output, tool call, intermediate result, and reasoning step. Essential for debugging failures.
Outcome
The final state of the world after the trial. Did the customer actually get refunded? Is the ticket closed? Outcomes matter more than what the agent said it did.
Grader
Logic that scores agent performance. Can be code-based (deterministic checks), model-based (LLM judges), or human (expert review).
Key Metrics
pass@k
The probability that at least one of k trials succeeds. Used when you only need the agent to succeed once (like finding a solution).
Example: pass@3 = 87% means running the agent 3 times gives you an 87% chance of at least one success.
pass^k
The probability that all k trials succeed. Used when you need consistent reliability (like customer-facing agents).
Example: pass^3 = 64% means the agent succeeds all 3 times in 64% of scenarios.
The gap between pass@k and pass^k reveals reliability issues. An agent with pass@3=90% but pass^3=50% is creative but inconsistent.
Types of Agent Evals
Capability Evals
"What can this agent do?"
- Start with low pass rates
- Test difficult, novel tasks
- Drive improvement by finding weaknesses
Regression Evals
"Does this agent still work?"
- Should maintain ~100% pass rates
- Catch backsliding from code changes
- Run automatically on every change
Safety Evals
"Will this agent behave appropriately?"
- Test refusal of harmful requests
- Verify proper escalation
- Check for prompt injection resistance
Grader Types
Code-Based Graders
def grade_refund(outcome):
return outcome.refund_issued and outcome.amount == expected
- Fast, cheap, deterministic
- Brittle to valid variations
- Good for objective outcomes
Model-Based Graders
Is this customer service response helpful,
accurate, and appropriately empathetic?
Score 1-5 with reasoning.
- Flexible, nuanced assessment
- Non-deterministic (run multiple times)
- Good for subjective quality
Human Graders
- Gold standard for quality
- Expensive and slow
- Essential for calibrating other graders
Domain-Specific Evaluation
Coding Agents
- Unit tests verify code execution
- Static analysis checks code quality
- Benchmarks like SWE-bench test real bug fixes
Customer Service Agents
- Task completion (was issue resolved?)
- Interaction quality (was customer satisfied?)
- Policy adherence (were rules followed?)
Research Agents
- Accuracy of synthesized information
- Source quality and citation
- Completeness of coverage
Common Evaluation Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Testing happy path only | Agents fail on edge cases | Include adversarial and boundary tests |
| Trusting agent's claims | Agents say "done" when they're not | Verify outcomes independently |
| Single-trial evaluation | High variance masks true performance | Run multiple trials per task |
| Eval saturation | All tasks pass, no signal | Continuously add harder tasks |
| Grader bugs | Evals test the wrong thing | Review transcripts manually |
Building an Eval Suite
- Start with real failures: Turn production issues into test cases
- Balance positive and negative: Test what agents should and shouldn't do
- Include reference solutions: Verify tasks are actually solvable
- Review transcripts regularly: Catch grader bugs and eval issues
- Expand continuously: Add tests as you discover new failure modes
Evaluation Infrastructure
The evaluation harness manages the end-to-end process:
- Provides tasks and tools to agents
- Runs trials (often in parallel)
- Records complete transcripts
- Executes graders
- Aggregates results
Popular frameworks: Harbor, Promptfoo, Braintrust, LangSmith, Langfuse.
The Evaluation Mindset
Agent evaluation isn't a one-time checkpoint—it's a continuous practice:
"Evals are the unit tests of AI development."
Just as you wouldn't ship code without tests, you shouldn't deploy agents without evals. The best agent teams have more evaluation code than agent code.
Related Reading
- AI Agents - What we're evaluating
- Agent Harness - Infrastructure that runs evals
- Agent Transcript - The data evals analyze
Mentioned In

Anthropic Engineering at 00:00:00
"Evaluations for agents differ from traditional LLM evals because agents use tools across turns, modify state, and compound errors—requiring multi-turn assessment approaches."

Aishwarya Ranti at 00:42:00
"Evals suffer from semantic diffusion—the term means different things to different people. Data labeling companies call annotations 'evals,' PMs call acceptance criteria 'evals,' benchmark comparisons get called 'evals.' Neither evals nor production monitoring alone is sufficient."