Generalization
/ˌdʒenərəlaɪˈzeɪʃən/
What is Generalization?
Generalization is a model's ability to perform well on new, previously unseen data rather than just memorizing the training examples. It's arguably the most important property of any machine learning system—a model that only works on data it's seen before has limited practical value.
The fundamental question: Does the model learn underlying patterns and principles, or does it just memorize specific examples?
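A quick way to make this distinction measurable is to compare training accuracy against accuracy on held-out data. Below is a minimal sketch, assuming scikit-learn is available; the synthetic dataset and decision-tree model are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained decision tree can fit the training set almost perfectly...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # typically ~1.0
test_acc = model.score(X_test, y_test)     # noticeably lower

# A large train-test gap is the classic signature of memorization.
print(f"train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
```

A near-perfect training score paired with a much lower held-out score means the model memorized specifics rather than learning patterns that transfer.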
Why Generalization Matters
Training vs. Reality: Models are trained on fixed datasets, but deployed in dynamic, unpredictable environments. Good generalization bridges this gap.
Novel Situations: Real-world use cases involve combinations and contexts the model never saw during training.
True Understanding: A model that generalizes well likely understands deeper patterns rather than surface correlations.
Types of Generalization
In-distribution: Performing well on new examples similar to training data. Most benchmarks test this.
Out-of-distribution (OOD): Handling examples that differ significantly from training data. Much harder.
Zero-shot: Performing tasks never explicitly trained for.
Few-shot: Learning new tasks from just a few examples (see the prompting sketch after this list).
Transfer: Applying knowledge from one domain to another.
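Zero-shot and few-shot generalization in LLMs are typically probed through prompting alone. In the sketch below, `complete` is a hypothetical stand-in for whatever completion API you use:

```python
def complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion call."""
    raise NotImplementedError  # swap in a real model API here

# Zero-shot: the task is described but never demonstrated.
zero_shot = "Classify the sentiment of: 'The plot dragged badly.'\nSentiment:"

# Few-shot: a handful of in-context examples define the task;
# the model must generalize the pattern to a new input.
few_shot = (
    "Review: 'Loved every minute.' -> positive\n"
    "Review: 'A total waste of time.' -> negative\n"
    "Review: 'The plot dragged badly.' -> "
)

# complete(zero_shot); complete(few_shot)
```

Nothing task-specific was trained here; any correct answer comes from patterns the model carries over from pretraining.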
The Generalization Problem in LLMs
Large language models exhibit a puzzling pattern. They can:
- Score above human average on the bar exam
- Write sophisticated code
- Explain complex scientific concepts
Yet they also:
- Fail simple logic puzzles
- Make basic arithmetic errors
- Miss obvious contradictions
This inconsistency, which Andrej Karpathy dubbed "jagged intelligence," reveals that current models don't generalize uniformly across domains.
Memorization vs. Understanding
A persistent debate: Do LLMs truly generalize, or do they pattern-match against memorized training data?
Evidence for generalization:
- Novel creative combinations
- Solving problems not in training data
- Cross-domain transfer
Evidence for memorization:
- Performance degrades with novel phrasings
- Struggles with truly novel scenarios
- Benchmark contamination concerns
The truth is likely somewhere in between—models generalize some patterns while memorizing others.
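One way researchers probe the memorization side directly is to check for benchmark contamination: verbatim overlap between test items and the training corpus. A deliberately simplified sketch of the n-gram overlap idea (real checks use normalization, longer n-grams, and deduplicated corpora):

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams, lowercased for rough matching."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(test_item: str, training_docs: list[str], n: int = 8) -> bool:
    """Flag a test item if any n-gram also appears verbatim in training text."""
    test_grams = ngrams(test_item, n)
    return any(test_grams & ngrams(doc, n) for doc in training_docs)
```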
Testing Generalization
Held-out test sets: Data withheld from training to evaluate performance.
Adversarial examples: Inputs designed to fool models, testing robustness.
Distribution shifts: Testing on data from different sources or time periods.
Novel task types: Evaluating on task categories not present in training.
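A distribution-shift test fits in a few lines: train on one input distribution, then compare accuracy on fresh in-distribution samples against samples drawn after a covariate shift. The data and shift below are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, offset=0.0):
    """Two Gaussian classes; `offset` shifts every input (a covariate shift)."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 5)) + y[:, None] * 2.0 + offset
    return X, y

X_train, y_train = make_data(2000)          # training distribution
X_iid, y_iid = make_data(500)               # new samples, same distribution
X_ood, y_ood = make_data(500, offset=3.0)   # shifted distribution

model = LogisticRegression().fit(X_train, y_train)
print("in-distribution:", model.score(X_iid, y_iid))  # stays high
print("shifted (OOD):  ", model.score(X_ood, y_ood))  # typically collapses
```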
Why It's Hard
The curse of dimensionality: As input complexity grows, the space of possible inputs explodes exponentially. A 20-token sequence over a 50,000-word vocabulary already has 50,000^20 possible values, so any training set samples the input space only vanishingly sparsely.
Spurious correlations: Models can learn shortcuts that work on training data but fail in general (see the sketch after this list).
Data bias: Training data may not represent the full distribution of real-world scenarios.
Evaluation challenges: Hard to know if a model truly generalizes or just saw similar examples during training.
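The spurious-correlations entry above deserves a concrete example. In this toy sketch, a "shortcut" feature perfectly predicts the label during training but is decorrelated at test time, so a model that leans on it collapses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shortcut_matches_label):
    """One genuine-but-weak feature, plus a shortcut feature that is
    perfectly correlated with the label only when the flag is True."""
    y = rng.integers(0, 2, size=n)
    real = y + rng.normal(scale=2.0, size=n)       # weak, genuine signal
    if shortcut_matches_label:
        shortcut = y.astype(float)                 # spuriously perfect
    else:
        shortcut = rng.integers(0, 2, size=n).astype(float)  # correlation broken
    return np.column_stack([real, shortcut]), y

X_train, y_train = make_data(2000, shortcut_matches_label=True)
X_test, y_test = make_data(500, shortcut_matches_label=False)

model = LogisticRegression().fit(X_train, y_train)
print("train:", model.score(X_train, y_train))  # near-perfect via the shortcut
print("test: ", model.score(X_test, y_test))    # near chance once it breaks
```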
The Path Forward
Improving generalization likely requires:
- Better architectures: World models, causal reasoning
- Richer training: Multi-modal, embodied learning
- Curriculum learning: Progressive exposure to harder examples
- Uncertainty quantification: Knowing when the model is out of its depth
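Of these, uncertainty quantification is the easiest to illustrate. One crude proxy is the entropy of the model's predicted class distribution, paired with an abstain threshold; note that softmax entropy is a weak signal for truly out-of-distribution inputs, and the threshold below is an arbitrary assumption:

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> float:
    """Shannon entropy (nats) of a predicted class distribution.
    Higher entropy = probability mass spread out = lower confidence."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def flag_if_unsure(probs: np.ndarray, threshold: float = 0.5) -> bool:
    """Abstain (defer to a fallback or a human) when entropy is high."""
    return predictive_entropy(probs) > threshold

print(flag_if_unsure(np.array([0.97, 0.02, 0.01])))  # False: confident
print(flag_if_unsure(np.array([0.40, 0.35, 0.25])))  # True: out of its depth
```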
Related Reading
- Jagged Intelligence - The inconsistent generalization profile of current AI
- World Models - One path to better generalization