Gopher

/ˈɡoʊfər/

Also known as: DeepMind Gopher

research intermediate

What is Gopher?

Gopher is a 280-billion parameter language model developed by DeepMind and published in December 2021. At the time of its release, it was among the largest language models ever trained and demonstrated strong performance across a wide range of language tasks. However, Gopher’s most lasting contribution to AI history came not from its own capabilities but from serving as the control case that proved a counterintuitive insight about model scaling.

Architecture and Training

Gopher used a standard Transformer architecture with 280 billion parameters, trained on approximately 300 billion tokens from MassiveText, a curated dataset of web pages, books, news articles, and code. It was designed to push the frontier of what large-scale language models could achieve through sheer parameter count.

At launch, Gopher outperformed existing models on 100 out of 124 benchmark tasks, with particular strength in knowledge-intensive domains like reading comprehension and fact checking. It represented the state of the art for its time.

Why Gopher Matters: The Chinchilla Lesson

Gopher’s enduring significance is as the “before” in one of AI’s most important experiments. In March 2022, DeepMind published the Chinchilla scaling laws paper, which used Gopher as a direct comparison.

The experiment was straightforward: given the same compute budget, what happens if you train a much smaller model on much more data?

The answer was decisive. Chinchilla, with only 70 billion parameters (4x fewer than Gopher) but trained on roughly 1.4 trillion tokens (more than 4x the data), outperformed Gopher on nearly every benchmark. On MMLU, Chinchilla scored 67.5% versus Gopher’s 60%.

This proved that Gopher — and by extension, most large models of that era — was significantly undertrained. The industry had been scaling parameters when it should have been scaling training data proportionally.
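The comparison above works because the two models were matched in compute, not size. A common back-of-the-envelope rule (used in the scaling-laws literature) estimates training compute as C ≈ 6 × N × D, where N is parameter count and D is token count. The sketch below applies that approximation to the published figures (280B parameters / 300B tokens for Gopher; 70B parameters / ~1.4T tokens for Chinchilla, per the Chinchilla paper) to show the budgets are comparable while the tokens-per-parameter ratios differ sharply. The 6ND rule is an approximation, not an exact FLOP count.

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute via the standard C ~= 6 * N * D rule."""
    return 6 * params * tokens

# Published model configurations (N = parameters, D = training tokens).
gopher_n, gopher_d = 280e9, 300e9
chinchilla_n, chinchilla_d = 70e9, 1.4e12

gopher_c = train_flops(gopher_n, gopher_d)          # ~5.0e23 FLOPs
chinchilla_c = train_flops(chinchilla_n, chinchilla_d)  # ~5.9e23 FLOPs

# Roughly the same compute budget...
print(f"Gopher:     {gopher_c:.2e} FLOPs")
print(f"Chinchilla: {chinchilla_c:.2e} FLOPs")

# ...but very different data-to-parameter ratios. Gopher saw about
# 1 token per parameter; Chinchilla saw about 20, which is the ratio
# the Chinchilla results popularized as compute-optimal.
print(f"Gopher tokens/param:     {gopher_d / gopher_n:.1f}")
print(f"Chinchilla tokens/param: {chinchilla_d / chinchilla_n:.0f}")
```

Under this approximation the two runs cost within ~20% of each other, which is what makes Chinchilla a clean controlled comparison rather than simply a better-funded model.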

Legacy

Gopher helped establish two things: first, that large language models could achieve broad competence across tasks, validating the generalist approach. Second, and more importantly, that the path to better models was not simply “make them bigger” but “train them optimally.” This insight, formalized as the Chinchilla scaling laws, reshaped how every major AI lab approaches model training.

  • Chinchilla - The model that proved Gopher was undertrained
  • Scaling Laws - The research area reshaped by the Gopher and Chinchilla results
  • Pre-training - The training phase where these insights apply