Stanford CME295: Transformers and LLMs Introduction
Twin brothers from Netflix teach the foundations of NLP, tokenization, and word embeddings. The ideal starting point for understanding LLMs from first principles.
Why This Course Is the Ideal LLM Starting Point
This is the ideal starting point for anyone wanting to understand LLMs from first principles. Afshine and Shervine - twin brothers who have worked on LLMs at Uber, Google, and now Netflix - break down the fundamentals without assuming prior deep learning expertise.
What makes this lecture valuable:
The instructors have been teaching this material as workshops since 2020, iterating through the ChatGPT explosion and beyond. They bring both academic rigor and industry experience from actually shipping LLM products.
The three buckets of NLP they introduce create a clean mental model:
- Classification - Sentiment, intent detection, language identification
- Multi-classification - One label per token: named entity recognition (NER), part-of-speech tagging
- Generation - Translation, Q&A, summarization (where all the action is today)
The tokenization trade-offs are particularly well explained (see the sketch after this list):
- Word-level is simple but creates OOV (out-of-vocabulary) problems
- Subword leverages word roots but increases sequence length
- Character-level handles misspellings but makes sequences very long and representations meaningless
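To make the trade-off concrete, here is a minimal Python sketch (my own toy example, not code from the course; the vocabularies and greedy merge rule are invented for illustration) that tokenizes the same sentence at the three granularities:

```python
# Toy comparison of word-, subword-, and character-level tokenization.
# Vocabularies below are hypothetical and only illustrate the trade-offs.
sentence = "transformers tokenize subwords"

# Word-level: short sequences, but any word missing from the vocabulary is OOV.
word_vocab = {"transformers", "subwords"}
word_tokens = [w if w in word_vocab else "<UNK>" for w in sentence.split()]
# ['transformers', '<UNK>', 'subwords']  -> 3 tokens, one of them unusable

# Subword-level: unknown words split into known pieces (rough BPE-style idea).
subword_vocab = ["transform", "token", "word", "sub", "ize", "ers", "s"]

def greedy_subword(word, vocab):
    """Greedily take the longest vocabulary piece matching at each position."""
    pieces, i = [], 0
    while i < len(word):
        piece = next((p for p in sorted(vocab, key=len, reverse=True)
                      if word.startswith(p, i)), word[i])
        pieces.append(piece)
        i += len(piece)
    return pieces

subword_tokens = [p for w in sentence.split() for p in greedy_subword(w, subword_vocab)]
# ['transform', 'ers', 'token', 'ize', 'sub', 'word', 's']  -> 7 tokens, no OOV

# Character-level: never OOV, but the sequence is much longer and each token
# carries almost no meaning on its own.
char_tokens = list(sentence)

print(len(word_tokens), len(subword_tokens), len(char_tokens))  # 3, 7, 30
```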
The key insight on embeddings: One-hot encoding makes all tokens orthogonal (equally dissimilar), which is useless. We need learned representations where semantically similar tokens have high cosine similarity. This is the foundation that enables everything from Word2Vec to modern transformers.
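A quick way to see this numerically (my own numpy sketch; the "learned" vectors are made up for illustration, not outputs from any actual model):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["cat", "dog", "car"]

# One-hot: identity rows, so every pair of distinct tokens is orthogonal.
one_hot = np.eye(len(vocab))
print(cosine(one_hot[0], one_hot[1]))   # cat vs dog -> 0.0 (equally dissimilar)
print(cosine(one_hot[0], one_hot[2]))   # cat vs car -> 0.0

# Hypothetical learned 4-d embeddings (values invented for illustration):
emb = {
    "cat": np.array([0.9, 0.8, 0.1, 0.0]),
    "dog": np.array([0.8, 0.9, 0.2, 0.1]),
    "car": np.array([0.1, 0.0, 0.9, 0.8]),
}
print(cosine(emb["cat"], emb["dog"]))   # high (~0.99): semantically related
print(cosine(emb["cat"], emb["car"]))   # low  (~0.12): unrelated
```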
5 Fundamentals From Stanford's Transformers Course
- Two-unit Stanford course: 50% midterm, 50% final, no homework - purely conceptual
- Proxy tasks matter: Word2Vec's skip-gram and CBOW tasks aren't the goal - the learned embeddings are (see the sketch after this list)
- Vocabulary size: ~10K-50K for single language, 100K+ for multilingual/code models
- Sequence length is compute: Longer sequences from character/subword tokenization directly impact model speed
- Quality > quantity: Having the right representation matters more than having more data
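To illustrate the proxy-task point, here is a small sketch (the toy corpus and window size are my own assumptions) of how skip-gram builds its training pairs. Predicting context words from a center word is only a pretext; the embedding matrix learned along the way is the actual product.

```python
# Skip-gram proxy task: predict context words from a center word.
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2  # hypothetical context window size

pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((center, corpus[j]))

print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'),
#  ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
# CBOW inverts this: predict the center word from its surrounding context.
```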
What This Means for Learning Transformer Architectures
The takeaway: understanding LLMs from first principles starts with the tokenization trade-offs and the move from one-hot encoding (where every token is equally dissimilar) to learned embeddings where semantically similar tokens have high cosine similarity. Tokenization fixes the vocabulary and sequence length the model must handle, and learned representations are the foundation that carries through from Word2Vec to modern transformers.


