
The Generalization Gap

Why AI Excels on Benchmarks but Struggles in the Real World


The Gap

AI models consistently score higher on standardized benchmarks than their real-world performance suggests. A model that passes the bar exam may fail to follow simple multi-step instructions. One that achieves near-perfect scores on math competitions may stumble on basic arithmetic in context. This disconnect between measured and actual capability is the generalization gap.
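One way to make the gap concrete is to evaluate the same model on a standardized benchmark and on a matched set of real-world tasks, then compare scores per domain. The sketch below is illustrative only: the score values, domain names, and the `generalization_gap` helper are assumptions, not output from any specific evaluation framework.

```python
# Illustrative sketch: quantifying a per-domain generalization gap.
# All scores and domain names below are hypothetical.

benchmark_scores = {"math": 0.94, "law": 0.88, "coding": 0.91}    # standardized benchmark accuracy
deployment_scores = {"math": 0.71, "law": 0.62, "coding": 0.78}   # accuracy on matched real-world tasks

def generalization_gap(benchmark: dict, deployment: dict) -> dict:
    """Return benchmark-minus-deployment accuracy for each shared domain."""
    return {d: benchmark[d] - deployment[d] for d in benchmark if d in deployment}

for domain, gap in generalization_gap(benchmark_scores, deployment_scores).items():
    print(f"{domain}: gap of {gap:.2f} between benchmark and deployment accuracy")
```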

Why It Matters

Benchmarks Overstate Capability

Benchmarks test narrow, well-defined tasks. Real-world use demands flexible reasoning across ambiguous, open-ended situations. High benchmark scores create false confidence in deployment readiness.

The Jagged Intelligence Problem

Demis Hassabis frames this as “jagged intelligence” — models that are brilliant in some domains and brittle in others. There is no smooth capability curve; instead, performance is unpredictably uneven.

Scaling Alone Does Not Close It

Larger models improve benchmark scores, but the generalization gap persists. Ilya Sutskever argues that this signals a need for fundamentally new approaches, not just more compute.

What Drives the Gap

  • Data contamination: Benchmark questions leak into training data, inflating scores (see the overlap-check sketch after this list)
  • Narrow evaluation: Benchmarks test isolated skills, not integrated reasoning
  • Distribution mismatch: Training data distributions differ from deployment contexts
  • Memorization vs. understanding: Models may pattern-match rather than reason
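The first driver, data contamination, can be partially screened for with simple overlap checks between benchmark items and training documents. The sketch below uses word n-gram overlap; the n-gram length, the 0.5 threshold, and the helper names are illustrative assumptions rather than a standard contamination-detection tool.

```python
# Illustrative sketch: flagging possible benchmark contamination via n-gram overlap.
# The n-gram length (8) and the 0.5 overlap threshold are hypothetical choices.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items: list[str], training_docs: list[str],
                       n: int = 8, threshold: float = 0.5) -> float:
    """Fraction of benchmark items whose n-grams substantially appear in the training data."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)

    flagged = 0
    for item in benchmark_items:
        item_grams = ngrams(item, n)
        if item_grams and len(item_grams & train_grams) / len(item_grams) >= threshold:
            flagged += 1
    return flagged / len(benchmark_items) if benchmark_items else 0.0
```

In practice, contamination checks like this only catch near-verbatim leakage; paraphrased or translated benchmark items require more elaborate matching.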

Expert Mentions


Ilya Sutskever

Models can ace benchmarks while failing at tasks any human would find trivial. The gap between measured performance and real-world capability is the central unsolved problem.