Chinchilla
/tʃɪnˈtʃɪlə/
What is Chinchilla?
Chinchilla refers to both a specific language model and, more importantly, the influential scaling laws paper from DeepMind published in March 2022. The paper "Training Compute-Optimal Large Language Models" fundamentally changed how the AI industry thinks about training large language models.
The Key Discovery
DeepMind asked: Given a fixed compute budget, how should you balance model size versus training data?
By training over 400 models (70M to 16B parameters, on 5B to 500B tokens), they discovered:
For compute-optimal training, model size and number of training tokens should be scaled equally. For every doubling of model size, the number of training tokens should also double.
The shocking implication: Most existing LLMs were significantly undertrained. The industry had been making models bigger while keeping training data relatively constant—a suboptimal approach.
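A minimal sketch of what "scale them equally" means in practice, assuming the widely used C ≈ 6·N·D FLOP approximation and the roughly 20-tokens-per-parameter ratio often read off the paper's fits (the exact coefficient varies between the paper's three estimation approaches); the function name is illustrative:

```python
import math

def chinchilla_optimal_allocation(compute_budget_flops, tokens_per_param=20.0):
    """Split a fixed training-compute budget between model size and data.

    Assumes two common approximations associated with the Chinchilla results:
      * training FLOPs: C ~= 6 * N * D   (N = parameters, D = training tokens)
      * compute-optimal ratio: D / N ~= 20 tokens per parameter
    Solving C = 6 * N * (tokens_per_param * N) for N gives the formulas below.
    """
    n_params = math.sqrt(compute_budget_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of ~5.8e23 FLOPs (roughly Chinchilla's) comes out near 70B params / 1.4T tokens.
params, tokens = chinchilla_optimal_allocation(5.8e23)
print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e12:.1f}T tokens")

# Doubling model size while doubling tokens (the "scale equally" rule)
# quadruples the compute: 6 * (2N) * (2D) = 4 * (6 * N * D).
```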
Chinchilla vs. Gopher
DeepMind tested their hypothesis by training Chinchilla:
| Model | Parameters | Training Tokens | Compute |
|---|---|---|---|
| Gopher | 280B | 300B | Same |
| Chinchilla | 70B | 1.4T | Same |
Despite being 4x smaller, Chinchilla outperformed Gopher on nearly every benchmark because it was trained on 4x more data.
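As a sanity check, the same 6·N·D approximation puts both training runs in the same compute ballpark (the numbers below are rough estimates, not the paper's exact FLOP counts):

```python
def approx_train_flops(n_params, n_tokens):
    """Rough training cost via the common approximation C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

gopher_flops = approx_train_flops(280e9, 300e9)       # ~5.0e23 FLOPs
chinchilla_flops = approx_train_flops(70e9, 1.4e12)   # ~5.9e23 FLOPs

print(f"Gopher:     {gopher_flops:.2e} FLOPs")
print(f"Chinchilla: {chinchilla_flops:.2e} FLOPs")
# Both land around 5e23 FLOPs: the same budget to within the precision
# of this back-of-the-envelope estimate.
```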
Performance Results
Chinchilla uniformly outperformed much larger models:
- Gopher (280B parameters)
- GPT-3 (175B parameters)
- Jurassic-1 (178B parameters)
- Megatron-Turing NLG (530B parameters)
On MMLU, Chinchilla achieved 67.5% average accuracy, more than 7 percentage points above Gopher.
Why It Mattered
For training: Labs realized that compute-optimal training calls for roughly 20 tokens per parameter, which works out to about 11x more data than GPT-3-era models actually used (see the rough calculation below).
For inference: Smaller, better-trained models are cheaper to run. Chinchilla's 4x smaller parameter count translates to roughly 4x less compute per generated token.
For the industry: Shifted focus from "make models bigger" to "train models longer on more data."
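A rough reconstruction of where the ~11x figure comes from, again assuming the ~20 tokens-per-parameter heuristic; the 175B and 300B figures are the publicly reported GPT-3 numbers, not values from the Chinchilla paper itself:

```python
# Where the "~11x more data" figure comes from, assuming the
# ~20 tokens-per-parameter heuristic derived from the Chinchilla fits.
gpt3_params = 175e9   # GPT-3's parameter count
gpt3_tokens = 300e9   # tokens GPT-3 was reportedly trained on

optimal_tokens = 20 * gpt3_params            # ~3.5e12 tokens
shortfall = optimal_tokens / gpt3_tokens     # ~11.7x

print(f"Chinchilla-optimal data for 175B parameters: ~{optimal_tokens / 1e12:.1f}T tokens")
print(f"That is ~{shortfall:.1f}x the data GPT-3 was actually trained on")
```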
The Chinchilla Tax
Post-Chinchilla, models that aren't compute-optimal are said to be paying the "Chinchilla tax"—wasting compute on extra parameters instead of additional training.
Limitations and Updates
The Chinchilla scaling laws assume:
- Fixed compute budget
- Single-epoch training (each token seen once)
- Ample training data of consistent quality (only training compute is counted, not inference)
Later research has refined these findings:
- When inference costs are accounted for, it is often better to train a model smaller than Chinchilla-optimal on even more tokens, since every deployed query pays for model size
- Multi-epoch training on high-quality data can outperform single-epoch on lower-quality data
- Data quality matters as much as quantity
Legacy
Chinchilla fundamentally changed LLM training practices: data budgets grew dramatically, and models like LLaMA cite its findings directly while deliberately training even further past the compute-optimal point to reduce inference cost. The paper remains one of the most cited and influential works in modern AI research.
Related Reading
- Scaling Laws - The broader research area
- Pre-training - Where Chinchilla insights apply