technical

Agent Evaluation

Pronunciation

/ˈeɪdʒənt ɪˌvæljuˈeɪʃən/

Also known as:agent evalsagentic evaluationagent testingagent benchmarking

What is Agent Evaluation?

Agent evaluation (or "evals") is the systematic process of testing AI agents through automated trials that measure task completion, reliability, and behavior quality. Unlike traditional LLM evaluation (which tests single responses), agent evals must handle:

Multi-turn interactions: Agents work across many steps
Tool usage: Agents call external systems
State modification: Agents change the world around them
Compounding errors: Small mistakes cascade into large failures

Why Agent Evals Are Different

Traditional LLM evaluation:

Input → Model → Output → Grade

Agent evaluation:

Task → [Model → Tool → Model → Tool → ...] → Outcome → Grade

The complexity explodes. You're not just checking if an answer is correct—you're verifying that a sequence of actions produced the right result in the right way.

Core Evaluation Concepts

Task

A single test case with defined inputs and success criteria. Example: "Resolve a customer refund request for order #12345."

Trial

One attempt at completing a task. Since model outputs vary, you run multiple trials per task to get statistical confidence.

Transcript (Trace)

The complete record of a trial: every output, tool call, intermediate result, and reasoning step. Essential for debugging failures.

Outcome

The final state of the world after the trial. Did the customer actually get refunded? Is the ticket closed? Outcomes matter more than what the agent said it did.

Grader

Logic that scores agent performance. Can be code-based (deterministic checks), model-based (LLM judges), or human (expert review).

Key Metrics

pass@k

The probability that at least one of k trials succeeds. Used when you only need the agent to succeed once (like finding a solution).

Example: pass@3 = 87% means running the agent 3 times gives you an 87% chance of at least one success.

pass^k

The probability that all k trials succeed. Used when you need consistent reliability (like customer-facing agents).

Example: pass^3 = 64% means the agent succeeds all 3 times in 64% of scenarios.

The gap between pass@k and pass^k reveals reliability issues. An agent with pass@3=90% but pass^3=50% is creative but inconsistent.

Types of Agent Evals

Capability Evals

"What can this agent do?"

Start with low pass rates
Test difficult, novel tasks
Drive improvement by finding weaknesses

Regression Evals

"Does this agent still work?"

Should maintain ~100% pass rates
Catch backsliding from code changes
Run automatically on every change

Safety Evals

"Will this agent behave appropriately?"

Test refusal of harmful requests
Verify proper escalation
Check for prompt injection resistance

Grader Types

Code-Based Graders

def grade_refund(outcome):
    return outcome.refund_issued and outcome.amount == expected

Fast, cheap, deterministic
Brittle to valid variations
Good for objective outcomes

Model-Based Graders

Is this customer service response helpful,
accurate, and appropriately empathetic?
Score 1-5 with reasoning.

Flexible, nuanced assessment
Non-deterministic (run multiple times)
Good for subjective quality

Human Graders

Gold standard for quality
Expensive and slow
Essential for calibrating other graders

Domain-Specific Evaluation

Coding Agents

Unit tests verify code execution
Static analysis checks code quality
Benchmarks like SWE-bench test real bug fixes

Customer Service Agents

Task completion (was issue resolved?)
Interaction quality (was customer satisfied?)
Policy adherence (were rules followed?)

Research Agents

Accuracy of synthesized information
Source quality and citation
Completeness of coverage

Common Evaluation Mistakes

Mistake	Problem	Solution
Testing happy path only	Agents fail on edge cases	Include adversarial and boundary tests
Trusting agent's claims	Agents say "done" when they're not	Verify outcomes independently
Single-trial evaluation	High variance masks true performance	Run multiple trials per task
Eval saturation	All tasks pass, no signal	Continuously add harder tasks
Grader bugs	Evals test the wrong thing	Review transcripts manually

Building an Eval Suite

Start with real failures: Turn production issues into test cases
Balance positive and negative: Test what agents should and shouldn't do
Include reference solutions: Verify tasks are actually solvable
Review transcripts regularly: Catch grader bugs and eval issues
Expand continuously: Add tests as you discover new failure modes

Evaluation Infrastructure

The evaluation harness manages the end-to-end process:

Provides tasks and tools to agents
Runs trials (often in parallel)
Records complete transcripts
Executes graders
Aggregates results

Popular frameworks: Harbor, Promptfoo, Braintrust, LangSmith, Langfuse.

The Evaluation Mindset

Agent evaluation isn't a one-time checkpoint—it's a continuous practice:

"Evals are the unit tests of AI development."

Just as you wouldn't ship code without tests, you shouldn't deploy agents without evals. The best agent teams have more evaluation code than agent code.

AI Agents - What we're evaluating
Agent Harness - Infrastructure that runs evals
Agent Transcript - The data evals analyze

Mentioned In

Anthropic Engineering at 00:00:00

"Evaluations for agents differ from traditional LLM evals because agents use tools across turns, modify state, and compound errors—requiring multi-turn assessment approaches."

Aishwarya Ranti at 00:42:00

"Evals suffer from semantic diffusion—the term means different things to different people. Data labeling companies call annotations 'evals,' PMs call acceptance criteria 'evals,' benchmark comparisons get called 'evals.' Neither evals nor production monitoring alone is sufficient."