Stanford CME295: Transformers and LLMs Introduction
Twin brothers from Netflix teach the foundations of NLP, tokenization, and word embeddings. The ideal starting point for understanding LLMs from first principles.
Why This Course Is the Ideal LLM Starting Point
This is the ideal starting point for anyone wanting to understand LLMs from first principles. Afshine and Shervine - twin brothers who have worked on LLMs at Uber, Google, and now Netflix - break down the fundamentals without assuming prior deep learning expertise.
What makes this lecture valuable:
The instructors have been teaching this material as workshops since 2020, iterating through the ChatGPT explosion and beyond. They bring both academic rigor and industry experience from actually shipping LLM products.
The three buckets of NLP they introduce create a clean mental model:
- Classification - Sentiment, intent detection, language identification
- Multi-classification - One label per token: named entity recognition (NER), part-of-speech tagging
- Generation - Translation, Q&A, summarization (where all the action is today)
The tokenization trade-offs are particularly well explained (see the sketch after this list):
- Word-level is simple but creates OOV (out-of-vocabulary) problems
- Subword leverages word roots but increases sequence length
- Character-level handles misspellings but makes sequences very long and representations meaningless
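To make the trade-off concrete, here is a minimal Python sketch (my own toy example, not code from the course; the vocabularies and greedy merge rule are invented for illustration) that tokenizes the same sentence at the three granularities:

```python
# Toy comparison of word-, subword-, and character-level tokenization.
# Vocabularies below are hypothetical and only illustrate the trade-offs.
sentence = "transformers tokenize subwords"

# Word-level: short sequences, but any word missing from the vocabulary is OOV.
word_vocab = {"transformers", "subwords"}
word_tokens = [w if w in word_vocab else "<UNK>" for w in sentence.split()]
# ['transformers', '<UNK>', 'subwords']  -> 3 tokens, one of them unusable

# Subword-level: unknown words split into known pieces (rough BPE-style idea).
subword_vocab = ["transform", "token", "word", "sub", "ize", "ers", "s"]

def greedy_subword(word, vocab):
    """Greedily take the longest vocabulary piece matching at each position."""
    pieces, i = [], 0
    while i < len(word):
        piece = next((p for p in sorted(vocab, key=len, reverse=True)
                      if word.startswith(p, i)), word[i])
        pieces.append(piece)
        i += len(piece)
    return pieces

subword_tokens = [p for w in sentence.split() for p in greedy_subword(w, subword_vocab)]
# ['transform', 'ers', 'token', 'ize', 'sub', 'word', 's']  -> 7 tokens, no OOV

# Character-level: never OOV, but the sequence is much longer and each token
# carries almost no meaning on its own.
char_tokens = list(sentence)

print(len(word_tokens), len(subword_tokens), len(char_tokens))  # 3, 7, 30
```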
The key insight on embeddings: One-hot encoding makes all tokens orthogonal (equally dissimilar), which is useless. We need learned representations where semantically similar tokens have high cosine similarity. This is the foundation that enables everything from Word2Vec to modern transformers.
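A quick way to see this numerically (my own numpy sketch; the "learned" vectors are made up for illustration, not outputs from any actual model):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["cat", "dog", "car"]

# One-hot: identity rows, so every pair of distinct tokens is orthogonal.
one_hot = np.eye(len(vocab))
print(cosine(one_hot[0], one_hot[1]))   # cat vs dog -> 0.0 (equally dissimilar)
print(cosine(one_hot[0], one_hot[2]))   # cat vs car -> 0.0

# Hypothetical learned 4-d embeddings (values invented for illustration):
emb = {
    "cat": np.array([0.9, 0.8, 0.1, 0.0]),
    "dog": np.array([0.8, 0.9, 0.2, 0.1]),
    "car": np.array([0.1, 0.0, 0.9, 0.8]),
}
print(cosine(emb["cat"], emb["dog"]))   # high (~0.99): semantically related
print(cosine(emb["cat"], emb["car"]))   # low  (~0.12): unrelated
```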
5 Fundamentals From Stanford's Transformers Course
- Two-unit Stanford course: 50% midterm, 50% final, no homework - purely conceptual
- Proxy tasks matter: Word2Vec's skip-gram and CBOW tasks aren't the goal - the learned embeddings are (see the sketch after this list)
- Vocabulary size: ~10K-50K for single language, 100K+ for multilingual/code models
- Sequence length is compute: Longer sequences from character/subword tokenization directly impact model speed
- Quality > quantity: Having the right representation matters more than having more data
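To illustrate the proxy-task point, here is a small sketch (the toy corpus and window size are my own assumptions) of how skip-gram builds its training pairs. Predicting context words from a center word is only a pretext; the embedding matrix learned along the way is the actual product.

```python
# Skip-gram proxy task: predict context words from a center word.
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2  # hypothetical context window size

pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((center, corpus[j]))

print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'),
#  ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
# CBOW inverts this: predict the center word from its surrounding context.
```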
What This Means for Learning Transformer Architectures
The takeaway: understanding LLMs from first principles starts with the tokenization trade-offs and the move from one-hot encoding (where every token is equally dissimilar) to learned embeddings where semantically similar tokens have high cosine similarity. Tokenization fixes the vocabulary and sequence length the model must handle, and learned representations are the foundation that carries through from Word2Vec to modern transformers.


