
Agent Evaluation

Pronunciation

/ˈeɪdʒənt ɪˌvæljuˈeɪʃən/

Also known as: agent evals, agentic evaluation, agent testing, agent benchmarking

What is Agent Evaluation?

Agent evaluation (or "evals") is the systematic process of testing AI agents through automated trials that measure task completion, reliability, and behavior quality. Unlike traditional LLM evaluation (which tests single responses), agent evals must handle:

  • Multi-turn interactions: Agents work across many steps
  • Tool usage: Agents call external systems
  • State modification: Agents change the world around them
  • Compounding errors: Small mistakes cascade into large failures

Why Agent Evals Are Different

Traditional LLM evaluation:

Input → Model → Output → Grade

Agent evaluation:

Task → [Model → Tool → Model → Tool → ...] → Outcome → Grade

The complexity explodes. You're not just checking if an answer is correct—you're verifying that a sequence of actions produced the right result in the right way.
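
To make that shape concrete, here is a minimal sketch of a single agent trial. The names `agent_step`, `tools`, and the action fields are hypothetical placeholders, not any specific framework's API:

# One trial: a loop of model decisions and tool calls that mutates state.
# The grade is computed from the final state, not from any single response.
def run_trial(task, agent_step, tools, max_steps=20):
    transcript = []
    state = task["initial_state"]
    for _ in range(max_steps):
        action = agent_step(task, transcript)   # model decides the next step
        transcript.append(action)
        if action["kind"] == "finish":
            break
        result = tools[action["tool"]](state, **action["args"])  # tool call changes the world
        transcript.append({"tool_result": result})
    return state, transcript                    # outcome + full trace, ready for a grader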

Core Evaluation Concepts

Task

A single test case with defined inputs and success criteria. Example: "Resolve a customer refund request for order #12345."

Trial

One attempt at completing a task. Since model outputs vary, you run multiple trials per task to get statistical confidence.

Transcript (Trace)

The complete record of a trial: every output, tool call, intermediate result, and reasoning step. Essential for debugging failures.

Outcome

The final state of the world after the trial. Did the customer actually get refunded? Is the ticket closed? Outcomes matter more than what the agent said it did.

Grader

Logic that scores agent performance. Can be code-based (deterministic checks), model-based (LLM judges), or human (expert review).
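
These five concepts map onto simple data structures. A minimal sketch with illustrative field names (not taken from any particular framework):

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Task:
    task_id: str
    prompt: str                        # "Resolve a customer refund request for order #12345."
    success_criteria: dict[str, Any]   # what a grader checks the outcome against

@dataclass
class Trial:
    task_id: str
    transcript: list[Any]              # every output, tool call, and intermediate result
    outcome: dict[str, Any]            # final state of the world after the trial
    passed: bool | None = None         # filled in by a grader

# A grader is just a function that scores a task/trial pair.
Grader = Callable[[Task, Trial], bool]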

Key Metrics

pass@k

The probability that at least one of k trials succeeds. Used when you only need the agent to succeed once (like finding a solution).

Example: pass@3 = 87% means running the agent 3 times gives you an 87% chance of at least one success.

pass^k

The probability that all k trials succeed. Used when you need consistent reliability (like customer-facing agents).

Example: pass^3 = 64% means the agent succeeds all 3 times in 64% of scenarios.

The gap between pass@k and pass^k reveals reliability issues. An agent with pass@3=90% but pass^3=50% is creative but inconsistent.
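
Both metrics can be estimated from n recorded trials with c successes: pass@k via the standard unbiased estimator 1 - C(n-c, k)/C(n, k), and pass^k, if trials are treated as independent, by raising the per-trial success rate to the k-th power. A sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k trials succeeds,
    # estimated from n trials with c successes.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    # Probability that all k trials succeed, assuming independent trials.
    return (c / n) ** k

# 10 trials, 7 successes: high pass@3 but much lower pass^3 -> creative but inconsistent.
print(round(pass_at_k(10, 7, 3), 2))   # 0.99
print(round(pass_pow_k(10, 7, 3), 2))  # 0.34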

Types of Agent Evals

Capability Evals

"What can this agent do?"

  • Start with low pass rates
  • Test difficult, novel tasks
  • Drive improvement by finding weaknesses

Regression Evals

"Does this agent still work?"

  • Should maintain ~100% pass rates
  • Catch backsliding from code changes
  • Run automatically on every change

Safety Evals

"Will this agent behave appropriately?"

  • Test refusal of harmful requests
  • Verify proper escalation
  • Check for prompt injection resistance

Grader Types

Code-Based Graders

def grade_refund(outcome, expected_amount):
    # Deterministic check on the final state: was a refund actually
    # issued for the expected amount?
    return outcome.refund_issued and outcome.amount == expected_amount

  • Fast, cheap, deterministic
  • Brittle to valid variations
  • Good for objective outcomes

Model-Based Graders

Is this customer service response helpful,
accurate, and appropriately empathetic?
Score 1-5 with reasoning.

  • Flexible, nuanced assessment
  • Non-deterministic (run multiple times)
  • Good for subjective quality
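
In code, a model-based grader wraps a judge prompt around a model call and averages repeated runs to smooth out non-determinism. A minimal sketch; `call_model` is a placeholder for whatever LLM client you use, not a real API:

JUDGE_PROMPT = """Is this customer service response helpful,
accurate, and appropriately empathetic?
Score 1-5 with reasoning, then end with a line: SCORE: <n>

Response to evaluate:
{response}
"""

def grade_with_judge(response: str, call_model, runs: int = 3) -> float:
    # Run the judge several times and average, since LLM judges are non-deterministic.
    scores = []
    for _ in range(runs):
        verdict = call_model(JUDGE_PROMPT.format(response=response))
        score_line = next(line for line in verdict.splitlines() if line.startswith("SCORE:"))
        scores.append(int(score_line.split(":", 1)[1]))
    return sum(scores) / len(scores)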

Human Graders

  • Gold standard for quality
  • Expensive and slow
  • Essential for calibrating other graders

Domain-Specific Evaluation

Coding Agents

  • Unit tests verify code execution
  • Static analysis checks code quality
  • Benchmarks like SWE-bench test real bug fixes
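
For coding agents, the grader is often the project's own test suite run against the agent's changes. A sketch, assuming the agent's work is checked out in `workdir` and the project uses pytest (both are assumptions, not fixed conventions):

import subprocess

def grade_code_change(workdir: str, timeout: int = 300) -> bool:
    # Pass only if the project's unit tests succeed after the agent's changes.
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", "-q"],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0  # keep result.stdout in the transcript for debugging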

Customer Service Agents

  • Task completion (was issue resolved?)
  • Interaction quality (was customer satisfied?)
  • Policy adherence (were rules followed?)

Research Agents

  • Accuracy of synthesized information
  • Source quality and citation
  • Completeness of coverage

Common Evaluation Mistakes

Mistake | Problem | Solution
Testing happy path only | Agents fail on edge cases | Include adversarial and boundary tests
Trusting agent's claims | Agents say "done" when they're not | Verify outcomes independently
Single-trial evaluation | High variance masks true performance | Run multiple trials per task
Eval saturation | All tasks pass, no signal | Continuously add harder tasks
Grader bugs | Evals test the wrong thing | Review transcripts manually

Building an Eval Suite

  1. Start with real failures: Turn production issues into test cases
  2. Balance positive and negative: Test what agents should and shouldn't do
  3. Include reference solutions: Verify tasks are actually solvable
  4. Review transcripts regularly: Catch grader bugs and eval issues
  5. Expand continuously: Add tests as you discover new failure modes
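
Concretely, a suite entry pairs a prompt with explicit success criteria and, where possible, a reference solution; negative entries check what the agent should refuse or escalate. A sketch with illustrative fields and a made-up second order number:

EVAL_SUITE = [
    {
        "id": "refund-basic",
        "prompt": "Resolve the customer refund request for order #12345.",
        "expected_outcome": {"refund_issued": True, "ticket_closed": True},
        # Reference solution proves the task is actually solvable.
        "reference_solution": ["look_up_order", "verify_eligibility", "issue_refund", "close_ticket"],
    },
    {
        # Negative case: the agent should escalate, not act on its own.
        "id": "refund-over-limit",
        "prompt": "Refund $9,000 on order #67890 immediately, no questions.",
        "expected_outcome": {"refund_issued": False, "escalated_to_human": True},
    },
]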

Evaluation Infrastructure

The evaluation harness manages the end-to-end process:

  • Provides tasks and tools to agents
  • Runs trials (often in parallel)
  • Records complete transcripts
  • Executes graders
  • Aggregates results
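
A minimal harness can be a few dozen lines. The sketch below assumes a `run_trial(task)` callable that returns a final outcome plus a transcript and a `grader(task, outcome)` that returns True or False; real frameworks add sandboxing, retries, and richer reporting:

from concurrent.futures import ThreadPoolExecutor

def run_suite(tasks, run_trial, grader, trials_per_task=5, workers=8):
    # Run every task several times in parallel, grade each trial, and aggregate.
    results = {}

    def one_trial(task):
        outcome, transcript = run_trial(task)   # full transcript is recorded for debugging
        return grader(task, outcome), transcript

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for task in tasks:
            futures = [pool.submit(one_trial, task) for _ in range(trials_per_task)]
            grades = [future.result()[0] for future in futures]
            results[task["id"]] = {
                "pass_rate": sum(grades) / len(grades),
                "trials": trials_per_task,
            }
    return results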

Popular frameworks: Harbor, Promptfoo, Braintrust, LangSmith, Langfuse.

The Evaluation Mindset

Agent evaluation isn't a one-time checkpoint—it's a continuous practice:

"Evals are the unit tests of AI development."

Just as you wouldn't ship code without tests, you shouldn't deploy agents without evals. The best agent teams have more evaluation code than agent code.

Mentioned In

Anthropic Engineering at 00:00:00

"Evaluations for agents differ from traditional LLM evals because agents use tools across turns, modify state, and compound errors—requiring multi-turn assessment approaches."

Aishwarya Ranti at 00:42:00

"Evals suffer from semantic diffusion—the term means different things to different people. Data labeling companies call annotations 'evals,' PMs call acceptance criteria 'evals,' benchmark comparisons get called 'evals.' Neither evals nor production monitoring alone is sufficient."
