Chinchilla

Pronunciation

/tʃɪnˈtʃɪlə/

Also known as: Chinchilla scaling laws, compute-optimal training, Chinchilla optimal

What is Chinchilla?

Chinchilla refers to both a specific language model and, more importantly, the influential scaling laws paper from DeepMind published in March 2022. The paper "Training Compute-Optimal Large Language Models" fundamentally changed how the AI industry thinks about training large language models.

The Key Discovery

DeepMind asked: Given a fixed compute budget, how should you balance model size versus training data?

By training over 400 models (70M to 16B parameters, on 5B to 500B tokens), they discovered:

For compute-optimal training, model size and number of training tokens should be scaled equally. For every doubling of model size, the number of training tokens should also double.
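The equal-scaling rule can be sketched numerically. The snippet below is a minimal illustration, not the paper's fitting procedure. It assumes the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D) and the widely quoted rule of thumb of roughly 20 training tokens per parameter. Both constants and the function name are assumptions introduced here for illustration.

```python
import math

# Assumption: training compute C ~ 6 * N * D (FLOPs), a standard approximation.
FLOPS_PER_PARAM_TOKEN = 6
# Assumption: rule-of-thumb ratio of ~20 tokens per parameter, not an exact constant.
TOKENS_PER_PARAM = 20


def compute_optimal_allocation(compute_flops: float) -> tuple:
    """Return (params, tokens) with 6 * N * D == compute and D == 20 * N."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Each budget is 4x the previous one, so both N and D should double each step.
for budget in (1e21, 4e21, 1.6e22):
    n, d = compute_optimal_allocation(budget)
    print(f"C={budget:.1e} FLOPs -> N={n:.2e} params, D={d:.2e} tokens")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is the "scale equally" rule in practice.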

The shocking implication: Most existing LLMs were significantly undertrained. The industry had been making models bigger while keeping training data relatively constant—a suboptimal approach.

Chinchilla vs. Gopher

DeepMind tested their hypothesis by training Chinchilla:

Model	Parameters	Training Tokens	Compute
Gopher	280B	300B	Same
Chinchilla	70B	1.4T	Same

Despite being 4x smaller, Chinchilla outperformed Gopher on nearly every benchmark because it was trained on more than 4x as much data for the same compute budget.
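As a rough sanity check on the "same compute" claim, the two training runs can be compared with the C ≈ 6·N·D approximation. This is a back-of-the-envelope sketch: the formula is an assumption rather than DeepMind's exact accounting, and the token counts are those reported in the paper (300B for Gopher, 1.4T for Chinchilla).

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute, assuming C ~ 6 * N * D FLOPs."""
    return 6 * params * tokens


gopher = training_flops(280e9, 300e9)       # 280B params, 300B tokens
chinchilla = training_flops(70e9, 1.4e12)   # 70B params, 1.4T tokens

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"Compute ratio (Chinchilla / Gopher): {chinchilla / gopher:.2f}")
print(f"Size ratio (Gopher / Chinchilla):    {280e9 / 70e9:.1f}")
print(f"Token ratio (Chinchilla / Gopher):   {1.4e12 / 300e9:.2f}")
```

Under this approximation the two budgets come out at about 5.0e23 and 5.9e23 FLOPs, within roughly 20% of each other, while the parameter and token counts differ by more than 4x in opposite directions.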

Performance Results

Chinchilla uniformly outperformed much larger models:

Gopher (280B parameters)
GPT-3 (175B parameters)
Jurassic-1 (178B parameters)
Megatron-Turing NLG (530B parameters)

On MMLU, Chinchilla achieved 67.5% average accuracy, more than 7 percentage points higher than Gopher.

Why It Mattered

For training: Labs realized that compute-optimal training calls for roughly 20 tokens per parameter, about 11x the ratio GPT-3 was trained at (300B tokens for 175B parameters).

For inference: Smaller, better-trained models are cheaper to run. Chinchilla's 4x smaller size means roughly 4x less compute per generated token.

For the industry: Shifted focus from "make models bigger" to "train models longer on more data."
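The 11x figure in the training point above follows from comparing tokens per parameter. The sketch below uses the publicly reported sizes and token counts for each model. The GPT-3 figure of 300B training tokens comes from its own paper and is not stated elsewhere in this entry.

```python
# (parameters, training tokens) as publicly reported for each model.
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

# Tokens seen per parameter during training.
tokens_per_param = {name: tokens / params for name, (params, tokens) in models.items()}

for name, ratio in tokens_per_param.items():
    print(f"{name:<11} {ratio:5.1f} tokens per parameter")

# How much more data per parameter Chinchilla saw than GPT-3.
data_multiplier = tokens_per_param["Chinchilla"] / tokens_per_param["GPT-3"]
print(f"Chinchilla vs GPT-3: {data_multiplier:.1f}x more tokens per parameter")
```

GPT-3 works out to under 2 tokens per parameter and Chinchilla to about 20, a gap of between 11x and 12x.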

The Chinchilla Tax

Post-Chinchilla, models that aren't compute-optimal are said to be paying the "Chinchilla tax"—wasting compute on extra parameters instead of additional training.

Limitations and Updates

The Chinchilla scaling laws assume:

Fixed compute budget
Single-epoch training (each token seen once)
Optimal balance between model size and data

Later research has refined these findings:

Inference-optimal models may benefit from being smaller than Chinchilla-optimal and trained on more tokens (since total inference cost grows with deployment volume)
Multi-epoch training on high-quality data can outperform single-epoch on lower-quality data
Data quality matters as much as quantity
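The inference point can be made concrete with a rough lifetime-compute estimate. This sketch assumes about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per token processed at inference. Both are common approximations rather than figures from the paper, and the function names are hypothetical.

```python
def training_flops(params: float, train_tokens: float) -> float:
    """Approximate training compute, assuming ~6 FLOPs per parameter per token."""
    return 6 * params * train_tokens


def inference_flops(params: float, served_tokens: float) -> float:
    """Approximate inference compute, assuming ~2 FLOPs per parameter per token."""
    return 2 * params * served_tokens


def break_even_served_tokens(train_tokens: float) -> float:
    """Served tokens at which inference compute equals training compute.

    Setting 2 * N * T == 6 * N * D gives T == 3 * D, independent of model size.
    """
    return 3 * train_tokens


params, train_tokens = 70e9, 1.4e12  # Chinchilla-sized model
train = training_flops(params, train_tokens)

for served in (1e11, 1e12, 1e13, 1e14):
    infer = inference_flops(params, served)
    share = infer / (train + infer)
    print(f"served={served:.0e} tokens -> inference is {share:.0%} of lifetime compute")

print(f"Break-even at {break_even_served_tokens(train_tokens):.1e} served tokens")
```

Once a model serves more than about three times its training token count, inference dominates lifetime compute under these assumptions, which is why heavily deployed models are often trained past the compute-optimal point.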

Legacy

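Chinchilla fundamentally changed LLM training practices. Models like LLaMA built directly on its findings, training smaller models on far more tokens, in LLaMA's case going beyond the Chinchilla-optimal ratio to reduce inference cost. The paper remains one of the most cited and influential works in modern AI research.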

Related Terms

Scaling Laws - The broader research area
Pre-training - Where Chinchilla insights apply
Gopher - DeepMind's 280B-parameter model that Chinchilla was compared against