Generalization
/ˌdʒenərəlaɪˈzeɪʃən/
What is Generalization?
Generalization is a model's ability to perform well on new, previously unseen data rather than just memorizing the training examples. It's arguably the most important property of any machine learning system—a model that only works on data it's seen before has limited practical value.
The fundamental question: Does the model learn underlying patterns and principles, or does it just memorize specific examples?
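A quick way to make this distinction measurable is to compare training accuracy against accuracy on held-out data. Below is a minimal sketch, assuming scikit-learn is available; the synthetic dataset and decision-tree model are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained decision tree can fit the training set almost perfectly...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # typically ~1.0
test_acc = model.score(X_test, y_test)     # noticeably lower

# A large train-test gap is the classic signature of memorization.
print(f"train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
```

A near-perfect training score paired with a much lower held-out score means the model memorized specifics rather than learning patterns that transfer.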
Why Generalization Matters
Training vs. Reality: Models are trained on fixed datasets, but deployed in dynamic, unpredictable environments. Good generalization bridges this gap.
Novel Situations: Real-world use cases involve combinations and contexts the model never saw during training.
True Understanding: A model that generalizes well likely understands deeper patterns rather than surface correlations.
Types of Generalization
In-distribution: Performing well on new examples similar to training data. Most benchmarks test this.
Out-of-distribution (OOD): Handling examples that differ significantly from training data. Much harder.
Zero-shot: Performing tasks never explicitly trained for.
Few-shot: Learning new tasks from just a few examples (see the prompting sketch after this list).
Transfer: Applying knowledge from one domain to another.
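Zero-shot and few-shot generalization in LLMs are typically probed through prompting alone. In the sketch below, `complete` is a hypothetical stand-in for whatever completion API you use:

```python
def complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion call."""
    raise NotImplementedError  # swap in a real model API here

# Zero-shot: the task is described but never demonstrated.
zero_shot = "Classify the sentiment of: 'The plot dragged badly.'\nSentiment:"

# Few-shot: a handful of in-context examples define the task;
# the model must generalize the pattern to a new input.
few_shot = (
    "Review: 'Loved every minute.' -> positive\n"
    "Review: 'A total waste of time.' -> negative\n"
    "Review: 'The plot dragged badly.' -> "
)

# complete(zero_shot); complete(few_shot)
```

Nothing task-specific was trained here; any correct answer comes from patterns the model carries over from pretraining.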
The Generalization Problem in LLMs
Large language models exhibit a puzzling pattern. They can:
- Score above human average on the bar exam
- Write sophisticated code
- Explain complex scientific concepts
Yet they also:
- Fail simple logic puzzles
- Make basic arithmetic errors
- Miss obvious contradictions
This inconsistency, which Andrej Karpathy dubbed "jagged intelligence," reveals that current models don't generalize uniformly across domains.
Memorization vs. Understanding
A persistent debate: Do LLMs truly generalize, or do they pattern-match against memorized training data?
Evidence for generalization:
- Novel creative combinations
- Solving problems not in training data
- Cross-domain transfer
Evidence for memorization:
- Performance degrades with novel phrasings
- Struggles with truly novel scenarios
- Benchmark contamination concerns
The truth is likely somewhere in between—models generalize some patterns while memorizing others.
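One way researchers probe the memorization side directly is to check for benchmark contamination: verbatim overlap between test items and the training corpus. A deliberately simplified sketch of the n-gram overlap idea (real checks use normalization, longer n-grams, and deduplicated corpora):

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams, lowercased for rough matching."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(test_item: str, training_docs: list[str], n: int = 8) -> bool:
    """Flag a test item if any n-gram also appears verbatim in training text."""
    test_grams = ngrams(test_item, n)
    return any(test_grams & ngrams(doc, n) for doc in training_docs)
```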
Testing Generalization
Held-out test sets: Data withheld from training to evaluate performance.
Adversarial examples: Inputs designed to fool models, testing robustness.
Distribution shifts: Testing on data from different sources or time periods.
Novel task types: Evaluating on task categories not present in training.
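A distribution-shift test fits in a few lines: train on one input distribution, then compare accuracy on fresh in-distribution samples against samples drawn after a covariate shift. The data and shift below are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, offset=0.0):
    """Two Gaussian classes; `offset` shifts every input (a covariate shift)."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 5)) + y[:, None] * 2.0 + offset
    return X, y

X_train, y_train = make_data(2000)          # training distribution
X_iid, y_iid = make_data(500)               # new samples, same distribution
X_ood, y_ood = make_data(500, offset=3.0)   # shifted distribution

model = LogisticRegression().fit(X_train, y_train)
print("in-distribution:", model.score(X_iid, y_iid))  # stays high
print("shifted (OOD):  ", model.score(X_ood, y_ood))  # typically collapses
```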
Why It's Hard
The curse of dimensionality: As input complexity grows, the space of possible inputs explodes exponentially. A 20-token sequence over a 50,000-word vocabulary already has 50,000^20 possible values, so any training set samples the input space only vanishingly sparsely.
Spurious correlations: Models can learn shortcuts that work on training data but fail in general (see the sketch after this list).
Data bias: Training data may not represent the full distribution of real-world scenarios.
Evaluation challenges: Hard to know if a model truly generalizes or just saw similar examples during training.
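The spurious-correlations entry above deserves a concrete example. In this toy sketch, a "shortcut" feature perfectly predicts the label during training but is decorrelated at test time, so a model that leans on it collapses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shortcut_matches_label):
    """One genuine-but-weak feature, plus a shortcut feature that is
    perfectly correlated with the label only when the flag is True."""
    y = rng.integers(0, 2, size=n)
    real = y + rng.normal(scale=2.0, size=n)       # weak, genuine signal
    if shortcut_matches_label:
        shortcut = y.astype(float)                 # spuriously perfect
    else:
        shortcut = rng.integers(0, 2, size=n).astype(float)  # correlation broken
    return np.column_stack([real, shortcut]), y

X_train, y_train = make_data(2000, shortcut_matches_label=True)
X_test, y_test = make_data(500, shortcut_matches_label=False)

model = LogisticRegression().fit(X_train, y_train)
print("train:", model.score(X_train, y_train))  # near-perfect via the shortcut
print("test: ", model.score(X_test, y_test))    # near chance once it breaks
```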
The Path Forward
Improving generalization likely requires:
- Better architectures: World models, causal reasoning
- Richer training: Multi-modal, embodied learning
- Curriculum learning: Progressive exposure to harder examples
- Uncertainty quantification: Knowing when the model is out of its depth
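Of these, uncertainty quantification is the easiest to illustrate. One crude proxy is the entropy of the model's predicted class distribution, paired with an abstain threshold; note that softmax entropy is a weak signal for truly out-of-distribution inputs, and the threshold below is an arbitrary assumption:

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> float:
    """Shannon entropy (nats) of a predicted class distribution.
    Higher entropy = probability mass spread out = lower confidence."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def flag_if_unsure(probs: np.ndarray, threshold: float = 0.5) -> bool:
    """Abstain (defer to a fallback or a human) when entropy is high."""
    return predictive_entropy(probs) > threshold

print(flag_if_unsure(np.array([0.97, 0.02, 0.01])))  # False: confident
print(flag_if_unsure(np.array([0.40, 0.35, 0.25])))  # True: out of its depth
```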
Related Reading
- Jagged Intelligence - The inconsistent generalization profile of current AI
- World Models - One path to better generalization