AI Engineer·November 24, 2025

Jeff Dean: A 15-Year Whirlwind Tour of How Modern AI Models Came to Be

Google DeepMind's Chief Scientist walks through every major ML breakthrough from 2009 to today - from the cat paper to transformers to sparse models.

The Inside Story of Deep Learning's Rise at Google

This is Jeff Dean - employee #30 at Google, creator of MapReduce and BigTable, founder of Google Brain, now Chief Scientist at DeepMind - giving the definitive history of how modern AI models came to be. It's essentially the inside story of deep learning's rise from someone who was there for all of it.

The humility of getting scale wrong. In 1990, Dean was so excited about neural networks that he did his senior thesis on parallel training using a 32-processor hypercube machine. "I was completely wrong. You needed like a million times as much processing power to make really good neural nets, not 32 times." The instinct that scale was the key would prove correct - his estimate of how much scale was simply off by orders of magnitude.

The Google Brain origin story is delightfully casual. In 2011, Dean bumped into Andrew Ng in a Google micro kitchen. Ng mentioned his Stanford students were getting good results with neural nets on speech. Dean's response: "Oh, that's cool. We should train really big neural networks." That conversation became Google Brain and the DistBelief system (named "in part because people didn't believe it was going to work").

The back-of-envelope calculation that launched TPUs. Dean realized that if Google rolled out its new high-quality speech recognition model and 100 million people talked to their phones for 3 minutes daily, they would need to double Google's entire data center capacity. Specialized hardware wasn't optional - it was existential. TPU v1 delivered 15-30x speedup over CPUs/GPUs and 30-80x energy efficiency. The paper is now the most cited in ISCA's 50-year history.
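
To get a feel for that back-of-envelope math, here is a toy version in Python. The 100 million users and 3 minutes per day come from the talk; the real-time factor (how many CPU-core-seconds it takes to recognize one second of audio) is an illustrative assumption, not Dean's actual figure.

```python
# Toy back-of-envelope capacity estimate for server-side speech recognition.
# Only `users` and `minutes_per_user` come from the talk; everything else is assumed.
users = 100_000_000            # people talking to their phones (from the talk)
minutes_per_user = 3           # minutes of speech per user per day (from the talk)
realtime_factor = 1.0          # assumed CPU-core-seconds to recognize 1 second of audio

audio_seconds_per_day = users * minutes_per_user * 60
core_seconds_per_day = audio_seconds_per_day * realtime_factor

seconds_per_day = 24 * 60 * 60
cores_needed = core_seconds_per_day / seconds_per_day

print(f"{audio_seconds_per_day:.1e} seconds of audio per day")
print(f"~{cores_needed:,.0f} CPU cores busy around the clock, before any headroom")
```

Even at an optimistic 1x real-time factor this works out to hundreds of thousands of cores running flat out; a heavier neural model multiplies that into the millions, which is why specialized hardware became the answer.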

Every major breakthrough gets one slide. Word2vec and the discovery that vector directions are meaningful (king - man + woman = queen). Sequence-to-sequence models for translation. Transformers showing 10-100x compute efficiency over LSTMs. Self-supervised learning on text producing "almost infinite training examples." Vision Transformers achieving state-of-the-art results with 4-20x less compute. Sparse models activating only 1-5% of parameters per prediction. Chain-of-thought prompting. Distillation. RLHF.
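
The word2vec arithmetic is easy to try for yourself. The sketch below uses hand-made 3-dimensional vectors rather than real word2vec embeddings, but the mechanics are the same: subtract, add, then look up the nearest neighbor by cosine similarity.

```python
# Toy illustration of "vector directions are meaningful" (word2vec-style analogies).
# The vectors are hand-made for the example, not trained embeddings.
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query, exclude=()):
    # Highest-cosine-similarity word, skipping the words used to build the query.
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cosine(vectors[w], query))

# king - man + woman should land nearest to queen
query = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))  # -> queen
```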

The progress framing is sobering. "Three years ago we were really excited that we'd gotten 15% correct on eighth grade math problems." That GSM8K benchmark - middle school word problems like "Sean has five toys and for Christmas he got two more" - is now essentially solved.

12 Key Breakthroughs That Created Modern AI

  • Google Brain started in a micro kitchen - Dean met Andrew Ng, decided to "train really big neural networks"
  • DistBelief: "mathematically wrong but it worked" - Asynchronous training with 200 model replicas updating shared parameters
  • Cat paper (2012) - 10M YouTube frames, unsupervised learning, neurons learned "cat" concept without labels
  • Word2vec directions are semantic - King - man + woman = queen; past/future tense directions
  • TPU imperative - Rolling out better speech recognition would have doubled Google's data centers
  • TPUv1 - 15-30x faster, 30-80x more energy efficient than CPUs/GPUs
  • Transformers (2017) - 10-100x less compute than LSTMs for same accuracy; attention over recurrence
  • Sparse models - Only 1-5% of parameters activated per prediction; Gemini uses this
  • Chain of thought - Model spends more computation per answer by "showing its work" in intermediate steps
  • Distillation - 3% of training data with soft targets matches 100% of data with hard labels (see the sketch after this list)
  • Pathways - Single Python process can address 10,000 TPU devices across metro areas
  • GSM8K progress - 15% accuracy 3 years ago on 8th grade math; now essentially solved
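
To make the distillation bullet concrete, here is a minimal sketch of the soft-target loss in the spirit of Hinton, Vinyals, and Dean's 2015 distillation paper. The temperature and mixing weight are illustrative choices, not values from the talk, and the usual T-squared scaling of the soft term is noted but omitted for brevity.

```python
# Minimal distillation-loss sketch: the student matches the teacher's softened
# probabilities ("soft targets") as well as the one-hot "hard" label.
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.9):
    """Weighted sum of soft-target and hard-label cross-entropies.
    (The soft term is often scaled by T**2 to balance gradients; omitted here.)"""
    soft_teacher = softmax(teacher_logits, T)
    soft_student = softmax(student_logits, T)
    soft_ce = -np.sum(soft_teacher * np.log(soft_student))     # match the teacher
    hard_ce = -np.log(softmax(student_logits)[hard_label])     # match the true label
    return alpha * soft_ce + (1 - alpha) * hard_ce

teacher = np.array([8.0, 2.0, 1.0])    # confident, but not one-hot: "dark knowledge"
student = np.array([3.0, 1.5, 0.5])
print(distillation_loss(student, teacher, hard_label=0))
```

The soft targets carry the teacher's relative confidences across all classes, which is why a small fraction of the data can train a student almost as well as the full hard-labeled set.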

What 15 Years of Compounding AI Progress Teaches Us

Fifteen years of compounding breakthroughs - from the cat paper to transformers to sparse models - created modern AI. Each step seemed incremental; together they're transformative. The person who built MapReduce now runs systems that solve problems thought impossible three years ago.
