Why LLMs Cannot Evaluate Their Own Output

This is Hamel Husain and Shreya Shankar - teachers of the #1 eval course on Maven, who've trained 2,000+ PMs and engineers, including teams at OpenAI and Anthropic. Their process is surprisingly manual at the start, and that's the point.

"The top misconception is: can't the AI just eval it?" It doesn't work. Hamel showed a trace where an AI scheduled a virtual tour that didn't exist. Asked to review that trace, ChatGPT would say "looks good," because it lacks the context to know the feature doesn't exist. The domain expert catches it in seconds. LLMs miss product smell.

The process: open coding with a benevolent dictator. Look at traces (logs of LLM interactions). Write quick notes on what's wrong - just the first, most upstream error you see. Don't try to find everything. Don't use committees. Appoint one person whose taste you trust (the domain expert). Keep it informal: "jank" is fine as a note. Review at least 100 traces, and keep going until you hit "theoretical saturation" - the point where new traces stop teaching you anything new.
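
One way to make "theoretical saturation" concrete is to track whether recent traces are still surfacing new kinds of failure. The sketch below is only an illustration, not the authors' tooling: the `Note` record, the short `label` tag the reviewer attaches to each note, and the 20-trace window are all assumptions. Only the 100-trace minimum comes from the process described above.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """One open-coding note (hypothetical record shape)."""
    trace_id: str
    note: str   # free-form and informal, e.g. "jank"
    label: str  # short failure tag chosen by the reviewer; "" if the trace looks fine


def reached_saturation(notes, window=20, min_reviewed=100):
    """Return True once the last `window` notes introduced no new failure label.

    `window` is an assumed heuristic; `min_reviewed` mirrors the
    "at least 100 traces" guideline.
    """
    if len(notes) < min_reviewed or len(notes) <= window:
        return False
    earlier = {n.label for n in notes[:-window] if n.label}
    recent = {n.label for n in notes[-window:] if n.label}
    # Saturated when every recent label was already seen earlier.
    return recent <= earlier
```

The check is deliberately crude. It only says that the reviewer has stopped inventing new labels, which is a rough proxy for "you stopped learning new things."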

Error analysis precedes test writing. This is different from software engineering, where you jump straight to unit tests. With LLMs, the surface area is too large and the behavior too stochastic to know up front what to test. You need data analysis first to understand what to even test. Only after open coding do you codify patterns into automated evals.
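
The "data analysis first" step can be as simple as counting. A minimal sketch, assuming each reviewed trace ended up with one short failure label (an empty string meaning no problem was found); the label names in the usage example are hypothetical:

```python
from collections import Counter


def rank_failure_modes(labels):
    """Return (label, count) pairs, most frequent first.

    Empty labels mean the reviewer found nothing wrong and are skipped.
    """
    return Counter(label for label in labels if label).most_common()


labels = ["no_handoff", "no_handoff", "hallucinated_feature", "",
          "no_handoff", "hallucinated_feature", "tone"]
for label, count in rank_failure_modes(labels):
    print(f"{label}: {count}")
```

The failure modes at the top of this ranking are the natural candidates to turn into automated evals first.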

The real estate agent example is perfect. User asks about availability. AI says "we don't have that, have a nice day." Technically correct. Product-wise? Terrible. A lead management tool should hand off to a human, not close the conversation. That's the kind of thing only a product person catches.
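
Once a product person has named a failure like this one, it can be codified as a binary check. The sketch below assumes each trace is a dict carrying two hypothetical boolean fields, `declined_request` and `handed_off_to_human`; a real system would have to log or annotate these itself.

```python
def passes_handoff_check(trace: dict) -> bool:
    """Binary eval: fail only when the assistant declined and did not hand off.

    Both field names are hypothetical; missing fields are treated as False.
    """
    declined = trace.get("declined_request", False)
    handed_off = trace.get("handed_off_to_human", False)
    return (not declined) or handed_off
```

The check encodes the product rule, not technical correctness: "we don't have that, have a nice day" is an accurate answer and still fails.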

Don't make evals expensive. Binary scores only (pass/fail). One domain expert, not a committee. Sample your data, don't review everything. The goal isn't perfection - it's actionable improvement. If you make the process expensive, you won't do it.
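
Sampling and binary scoring need very little machinery. A rough sketch, assuming traces sit in a list and the domain expert's verdicts are plain booleans; the fixed seed is an assumption so the same sample can be revisited later:

```python
import random


def sample_traces(traces, n, seed=0):
    """Return up to `n` traces chosen uniformly at random, without replacement."""
    traces = list(traces)
    if n >= len(traces):
        return traces
    return random.Random(seed).sample(traces, n)


def pass_rate(verdicts):
    """Fraction of reviewed traces marked pass (True) by the domain expert."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts) / len(verdicts)
```

With pass/fail verdicts the summary is a single fraction, which is part of what keeps the process cheap enough to repeat.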

10 Rules for Building Reliable AI Evaluations

LLMs can't do error analysis - They lack context; say "looks good" on obvious product failures
Open coding - Write quick notes on first error; don't find everything; be informal
Benevolent dictator - One domain expert whose taste you trust; not committees
100 traces minimum - Until theoretical saturation; you'll get addicted after 20
Theoretical saturation - Stop when you stop learning new things
Binary scores only - Pass/fail; don't do 1-5 scales; makes everything tractable
Error analysis → tests - Different from software eng; understand before codifying
Product person required - Engineers miss product smell; domain expertise critical
Sample, don't review all - Makes the process sustainable
"Jank" is valid - Keep notes informal; specificity matters more than polish

What This Means for AI Product Quality

AI eval isn't automated testing - it's data analysis requiring human judgment. The companies shipping reliable AI products aren't using sophisticated frameworks; they're putting domain experts in front of traces and letting them develop taste. No shortcut exists.
