Jeff Dean: The Napkin Sketch That Launched TPUs
Google's Chief Scientist explains why rolling out better speech recognition would have doubled data centers, plus the case for moonshot research labs.
How Jeff Dean Sees Hardware-Software Co-Design
This is Jeff Dean at NeurIPS 2024, fresh off announcing TPU v7 (Ironwood), and it's a different side of him - less technical lecture, more strategic reflection on how AI innovation actually happens and why it needs institutional support.
The napkin sketch that changed hardware forever. In 2013, Dean ran a back-of-envelope calculation: if Google rolled out its improved speech recognition model to 100 million users speaking for a few minutes daily, it would need to double its entire data center capacity - for a single feature improvement. "The compute requirements got quite scary." That thought experiment launched the TPU program. By 2015, TPUv1 was running in data centers - 30-70x more energy efficient than CPUs/GPUs and 15-30x faster, all before the transformer existed.
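The arithmetic is easy to reconstruct in rough form. Here's a toy version of the napkin math in Python - every constant is an illustrative assumption, since the talk only states the conclusion:

```python
# Toy reconstruction of the napkin math. Every constant is an
# illustrative assumption; the talk only states the conclusion.
USERS = 100e6                  # users adopting the new model
SECONDS_PER_USER = 3 * 60      # "a few minutes" of speech per day
FLOPS_PER_AUDIO_SEC = 1e11     # assumed inference cost per second of audio
SERVER_FLOPS = 1e10            # assumed sustained FLOP/s of one CPU server

daily_flops = USERS * SECONDS_PER_USER * FLOPS_PER_AUDIO_SEC
sustained_flops = daily_flops / 86_400      # averaged over a day
extra_servers = sustained_flops / SERVER_FLOPS
print(f"~{extra_servers:,.0f} extra servers needed")  # ~2.1 million here
```

With these made-up constants the answer lands in the millions of servers; the point of the exercise is that any plausible constants land in the same scary regime.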
Hardware/software co-design means forecasting the entire ML field. Every TPU generation requires predicting where ML computation will be 2.5-6 years in the future. "It's not a very easy thing." The strategy: add small hardware features that might matter. If they pay off, you're ready; if not, you've lost a small piece of chip area. The transformer architecture was born at Google on a "pretty similar timeline" to TPUs - serendipity in co-design.
The Pathways abstraction is underappreciated. A single Python process can address 20,000 TPU devices across multiple pods, multiple buildings, even multiple metro areas. Pathways automatically picks the right network layer - high-speed interconnect within a pod, the data center network across pods, long-haul links across cities. All Gemini training runs on JAX → Pathways → XLA → TPUs.
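Pathways itself is Google-internal, but the single-controller pattern is visible in open-source JAX: one Python process enumerates every attached accelerator, shards arrays across them, and lets the runtime handle data movement. A minimal sketch - the mesh shape and toy computation are assumptions for illustration, with local devices standing in for a real Pathways deployment:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One controller process sees every attached accelerator. Under
# Pathways this list could span pods and buildings; run locally it
# is just your CPU/GPU devices, so the sketch works anywhere.
devices = np.array(jax.devices())
mesh = Mesh(devices.reshape(-1, 1), axis_names=("data", "model"))

# Shard a toy activation batch along the "data" axis of the mesh.
x = jax.device_put(
    jnp.ones((8 * len(devices), 512)),
    NamedSharding(mesh, P("data", None)),
)

# jit lowers through XLA; the runtime routes any cross-device
# traffic over whatever interconnect links the shards.
y = jax.jit(lambda a: jnp.tanh(a @ a.T))(x)
print(y.sharding)
```

The design point is that the program above never names a network: the same code runs on one chip or twenty thousand, and the system below decides which link carries each transfer.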
Academic research funding is Dean's passion project. "The whole deep learning revolution built on academic research from 30-40 years ago." Neural networks and backpropagation came from academia. Google itself was built on TCP/IP, RISC processors, and the Stanford Digital Library Project (which funded PageRank). Dean advocates for the Lo Institute model: 3-5 year moonshot grants with 3-5 PIs and 30-50 PhD students targeting specific societal impacts.
Healthcare AI moonshot: learn from every past decision to inform every future one. Dean's aspirational goal: use every past healthcare decision to help every clinician and every person make better ones. It's "super hard" because of privacy constraints, regulatory fragmentation, and inconsistent data formats, and it requires federated learning and privacy-preserving ML because "you're not going to be able to move healthcare data from where it sits."
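Dean doesn't sketch the mechanics, but federated averaging (FedAvg) is the canonical pattern for learning without moving the data: each site trains locally and only model weights travel. A minimal sketch with a hypothetical least-squares model and synthetic per-site data standing in for hospital records:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, steps=10):
    """Each site runs a few gradient steps on its own records."""
    w = weights.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
# Three "hospitals", each with synthetic local data that never moves.
sites = [(rng.normal(size=(50, 4)), rng.normal(size=50)) for _ in range(3)]

w = np.zeros(4)
for _ in range(20):
    # Only model weights cross the wire; raw records stay on-site.
    updates = [local_update(w, X, y) for X, y in sites]
    sizes = np.array([len(y) for _, y in sites])
    w = np.average(updates, axis=0, weights=sizes)  # size-weighted FedAvg
print(w)
```

Real deployments layer privacy-preserving machinery (secure aggregation, differential privacy) on top of this loop; the sketch shows only the structural idea that computation moves to the data.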
10 Insights From Jeff Dean on TPUs and AI Research
- TPU v7 (Ironwood) - 9,216 chips per pod, FP4 precision support, 3,600x peak performance vs TPUv2
- The napkin sketch - Rolling out better speech recognition would have doubled Google's data centers; TPUs were existential
- TPUv1 (2015) - 30-70x more energy efficient, 15-30x faster than CPUs/GPUs; pre-transformer era
- Hardware forecasting - Every TPU generation requires predicting ML needs 2.5-6 years ahead
- Pathways - Single Python process addresses 20,000 devices across metros; all Gemini training uses this
- Publishing continuum - Not binary publish/don't; Pixel features ship first, SIGGRAPH papers follow
- Google internal research conference - 6,000 attendees; "might feel a year ahead" of NeurIPS
- 3-5 year moonshots - Dean's preferred time horizon: "not so distant it won't have impact, not so short you can't be ambitious"
- Titans paper - Hybrid transformer + recurrence; "interesting idea to explore" but not in Gemini yet
- Healthcare moonshot - Learn from every past decision; requires federated learning, can't move healthcare data
What This Means for AI Infrastructure and Research
TPUs exist because a napkin calculation showed that shipping better speech recognition would have doubled Google's data centers. Hardware/software co-design means predicting ML needs 2.5-6 years ahead. Today a single Python process can address 20,000 devices across multiple cities. That's the infrastructure enabling frontier models.