Andrej Karpathy·November 22, 2023

Andrej Karpathy: The Busy Person's Intro to LLMs

The foundational talk on what LLMs actually are: two files, lossy compression of the internet, and why we don't fully understand how 100B parameters collaborate.

What Karpathy Wants Everyone to Understand About LLMs

This is the definitive introduction to large language models - Karpathy re-recorded his viral 30-minute talk for YouTube after the original wasn't captured. If you understand this talk, you understand the fundamentals.

"A large language model is just two files." The parameters file (140GB for Llama 2 70B - 70 billion parameters × 2 bytes each as float16) and a run file (~500 lines of C with no dependencies). Take these two files, compile, and you can talk to the model offline on a MacBook. That's the entire package.

Training is compression. Take 10TB of internet text, 6,000 GPUs for 12 days (~$2M), and compress it into 140GB of parameters. That's roughly 100x compression - but it's lossy compression. The model has a "gestalt" of the training data, not an identical copy. "This is kind of like a zip file of the internet."
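
The same arithmetic applied to the compression ratio. Note the literal ratio works out closer to 70x; "roughly 100x" is the talk's round, order-of-magnitude figure:

```python
# Rough compression ratio: ~10TB of internet text -> 140GB of parameters.
training_text_bytes = 10e12   # ~10 TB of text
parameters_bytes = 140e9      # 140 GB parameters file

ratio = training_text_bytes / parameters_bytes
print(f"~{ratio:.0f}x")       # -> ~71x; "roughly 100x" as a round figure
```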

The reversal curse shows how weird this knowledge is. GPT-4 knows Tom Cruise's mother is Mary Lee Pfeiffer. But ask "Who is Mary Lee Pfeiffer's son?" and it doesn't know. "This knowledge is weird and kind of one-dimensional. You have to ask from a certain direction."
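
A loose analogy (ours, not Karpathy's, and not how transformers actually store facts): knowledge learned as forward associations behaves like a lookup table with no reverse index.

```python
# Analogy only: facts learned in one direction act like a one-way dictionary.
mother_of = {"Tom Cruise": "Mary Lee Pfeiffer"}

print(mother_of.get("Tom Cruise"))       # forward query works: "Mary Lee Pfeiffer"
print(mother_of.get("Mary Lee Pfeiffer", "unknown"))  # reverse query fails: "unknown"

# Answering the reverse question needs an index the model never built:
son_of = {v: k for k, v in mother_of.items()}
print(son_of.get("Mary Lee Pfeiffer"))   # works here, but the LLM has no such inverse
```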

"LLMs are mostly inscrutable artifacts." We know the exact architecture, every mathematical operation. But we don't know what the 100 billion parameters are doing. "We can measure that it's getting better at next word prediction, but we don't know how these parameters collaborate to perform that." Unlike a car where we understand all the parts.

Pre-training vs fine-tuning. Pre-training: massive quantity, lower quality internet data, builds knowledge. Fine-tuning: smaller quantity (~100K examples), very high quality Q&A pairs, gives the model its assistant "format." Pre-training is expensive (months, millions of dollars, once per year). Fine-tuning is cheap (daily iterations possible).
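
A hypothetical sketch of the two data regimes (the field names are illustrative, not any real dataset schema):

```python
# Pre-training data: massive quantity, variable quality, raw text.
pretraining_doc = "raw internet text... terabytes of it, scraped at scale"

# Fine-tuning data: one of ~100K high-quality Q&A pairs written by human
# labelers. These teach the assistant *format*; the knowledge itself
# was already absorbed during pre-training.
finetuning_example = {
    "prompt": "Can you explain what a parameters file is?",
    "response": "An ideal, helpful assistant answer written by a labeler...",
}
```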

RLHF uses comparisons because comparing is easier than generating. Writing a haiku is hard. Picking the best haiku from several options is easier. Stage 3 fine-tuning exploits this with reinforcement learning from human feedback.
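
The talk doesn't show the math, but the standard objective behind this idea (as in InstructGPT-style reward modeling) is a pairwise comparison loss. A minimal sketch:

```python
import math

# Pairwise preference loss for an RLHF reward model: the human labeler only
# *compares* two completions; the loss pushes the chosen one's reward above
# the rejected one's (Bradley-Terry style).
def preference_loss(r_chosen: float, r_rejected: float) -> float:
    return -math.log(1 / (1 + math.exp(-(r_chosen - r_rejected))))

print(preference_loss(1.5, 0.3))  # ~0.26: reward model agrees with the human
print(preference_loss(0.3, 1.5))  # ~1.46: reward model disagrees, large loss
```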

Scaling laws are the key insight. Performance is a "remarkably smooth, well-behaved, predictable function of only two variables: N (parameters) and D (training data)." No signs of topping out. "Algorithmic progress is not necessary - we can get more powerful models for free by training bigger models longer."
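
The talk shows the curves rather than a formula; one common way to write such a law is the Chinchilla fit (Hoffmann et al., 2022), sketched here with that paper's published constants:

```python
# Chinchilla-style scaling law: loss is a smooth, predictable function of
# parameter count N and training tokens D. Constants are Hoffmann et al.'s
# fitted values; the talk itself only shows the empirical curves.
def loss(N: float, D: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / N**alpha + B / D**beta

# Bigger model + more data -> lower predicted loss, no algorithmic change needed.
print(loss(70e9, 2e12))    # roughly Llama-2-70B scale: ~1.92
print(loss(700e9, 20e12))  # 10x both N and D: loss keeps falling
```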

11 Insights From Karpathy on How LLMs Work

  • Two files - Parameters (140GB for 70B model) + run.c (~500 lines)
  • 100x lossy compression - 10TB internet → 140GB parameters
  • Next word prediction - Fundamental task; forces learning about the world
  • Reversal curse - Knowledge is one-dimensional; direction matters
  • "Mostly inscrutable" - We know architecture but not what parameters do
  • Pre-training = knowledge - Expensive, months, internet-scale data
  • Fine-tuning = alignment - Cheap, daily possible, 100K quality examples
  • RLHF - Comparing is easier than generating; stage 3 optimization
  • Scaling laws - Performance predictable from parameters × data; no plateau
  • Open vs closed - Closed models (GPT-4, Claude) perform better today; open ones (Llama) are catching up
  • "Hallucination" - Model doesn't know what it memorized vs generated

What This Means for Understanding AI Systems

An LLM is a 100x compressed version of human knowledge that fits on a laptop. We built it, we can run it, but we don't actually understand how 100 billion parameters collaborate to produce intelligence. We're in the strange position of having created something powerful before fully understanding it.
