AI Eval Guide: Why You Need 100 Manual Reviews First
Hamel Husain and Shreya Shankar explain why LLMs can't evaluate themselves. The open coding process, theoretical saturation, and why one domain expert beats committees.
Why LLMs Cannot Evaluate Their Own Output
Hamel Husain and Shreya Shankar teach the #1 eval course on Maven and have trained 2,000+ PMs and engineers, including teams at OpenAI and Anthropic. Their process is surprisingly manual at the start, and that's the point.
"The top misconception is: can't the AI just eval it?" It doesn't work. When Hamel showed a trace where an AI scheduled a virtual tour that didn't exist, ChatGPT would say "looks good" because it lacks the context to know that feature doesn't exist. The domain expert catches it in seconds. LLMs miss product smell.
The process: open coding with a benevolent dictator. Look at traces (logs of LLM interactions). Write quick notes on what's wrong - just the first/most upstream error you see. Don't try to find everything. Don't use committees. Appoint one person whose taste you trust (the domain expert). Keep it informal: "jank" is fine as a note. Do at least 100 traces until you hit "theoretical saturation" - when you stop learning new things.
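A minimal sketch of what that review loop can look like in practice, assuming traces are stored as JSON files with a messages list; the paths, schema, and the 20-traces-without-anything-new saturation heuristic are illustrative, not from the guide:

```python
import json
import random
from pathlib import Path

# Load traces (logs of LLM interactions). The path and schema are assumptions.
traces = [json.loads(p.read_text()) for p in Path("traces").glob("*.json")]
random.shuffle(traces)  # review in random order, not ingestion order

notes = []                # one informal note per reviewed trace
failure_modes = set()     # distinct failure modes noticed so far
last_new_mode_at = 0      # index of the last trace that taught us something new

for i, trace in enumerate(traces, start=1):
    print(f"\n--- Trace {i} ---")
    for turn in trace["messages"]:  # assumed schema: list of {"role", "content"}
        print(f"{turn['role']}: {turn['content']}")

    # One domain expert writes a quick, informal note about the first /
    # most upstream error they see. Blank means the trace looks fine.
    note = input("First error you see (blank = pass): ").strip()
    notes.append({"trace_id": trace.get("id", i), "note": note})

    if note and note.lower() not in failure_modes:
        failure_modes.add(note.lower())
        last_new_mode_at = i

    # Crude theoretical-saturation check: at least 100 traces reviewed,
    # and nothing new learned in the last 20.
    if i >= 100 and i - last_new_mode_at >= 20:
        print("No new failure modes lately - likely saturated.")
        break

Path("open_codes.json").write_text(json.dumps(notes, indent=2))
```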
Error analysis precedes test writing. This is different from software engineering where you jump to unit tests. With LLMs, surface area is too large and behavior too stochastic. You need data analysis first to understand what to even test. Only after open coding do you codify patterns into automated evals.
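Once a failure mode keeps recurring in your notes, it can be codified into an automated check. A hedged sketch of one such binary eval for the phantom-feature failure described above (an AI promising a virtual tour that doesn't exist); the features.json file, field names, and function names are hypothetical:

```python
import json
from pathlib import Path

# Hypothetical list of features the product actually supports,
# e.g. ["schedule_in_person_tour", "handoff_to_human_agent"].
SUPPORTED_FEATURES = set(json.loads(Path("features.json").read_text()))

def eval_no_phantom_features(trace: dict) -> bool:
    """Binary eval codified from open coding: fail any trace where the
    assistant invokes or promises a feature the product doesn't have."""
    for turn in trace["messages"]:                # assumed schema
        if turn["role"] != "assistant":
            continue
        for action in turn.get("actions", []):    # assumed field
            if action["name"] not in SUPPORTED_FEATURES:
                return False  # fail: promised something we can't deliver
    return True  # pass

def run_evals(traces: list[dict]) -> float:
    """Pass/fail only - no 1-5 scales. Returns the pass rate."""
    results = [eval_no_phantom_features(t) for t in traces]
    return sum(results) / len(results)
```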
The real estate agent example is perfect. User asks about availability. AI says "we don't have that, have a nice day." Technically correct. Product-wise? Terrible. A lead management tool should hand off to a human, not close the conversation. That's the kind of thing only a product person catches.
Don't make evals expensive. Binary scores only (pass/fail). One domain expert, not a committee. Sample your data, don't review everything. The goal isn't perfection - it's actionable improvement. If you make the process expensive, you won't do it.
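A sketch of keeping the process cheap: sample a fixed batch rather than reviewing everything, then tally which informal codes recur so frequency tells you what to codify first. It reuses the hypothetical open_codes.json file from the earlier sketch:

```python
import json
import random
from collections import Counter
from pathlib import Path

# Sample an affordable batch instead of reviewing every production trace.
all_traces = [json.loads(p.read_text()) for p in Path("traces").glob("*.json")]
batch = random.sample(all_traces, k=min(100, len(all_traces)))
Path("review_batch.json").write_text(json.dumps(batch, indent=2))

# After review, tally the informal codes; the most frequent failure modes
# are the ones worth turning into automated binary evals first.
notes = json.loads(Path("open_codes.json").read_text())
counts = Counter(n["note"].lower() for n in notes if n["note"])
for failure_mode, count in counts.most_common(10):
    print(f"{count:3d}  {failure_mode}")
```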
10 Rules for Building Reliable AI Evaluations
- LLMs can't do error analysis - They lack context; say "looks good" on obvious product failures
- Open coding - Write quick notes on first error; don't find everything; be informal
- Benevolent dictator - One domain expert whose taste you trust; not committees
- 100 traces minimum - Until theoretical saturation; you'll get addicted after 20
- Theoretical saturation - Stop when you stop learning new things
- Binary scores only - Pass/fail; don't do 1-5 scales; makes everything tractable
- Error analysis → tests - Different from software eng; understand before codifying
- Product person required - Engineers miss product smell; domain expertise critical
- Sample, don't review all - Makes the process sustainable
- "Jank" is valid - Keep notes informal; specificity matters more than polish
What This Means for AI Product Quality
AI eval isn't automated testing - it's data analysis requiring human judgment. The companies shipping reliable AI products aren't using sophisticated frameworks; they're putting domain experts in front of traces and letting them develop taste. No shortcut exists.


