Stanford Online · October 17, 2025

Stanford CME295: Transformers and LLMs Introduction

Twin brothers from Netflix teach the foundations of NLP, tokenization, and word embeddings. The ideal starting point for understanding LLMs from first principles.

Why This Course Is the Ideal LLM Starting Point

This is the ideal starting point for anyone wanting to understand LLMs from first principles. Afshine and Shervine - twin brothers who've worked at Uber, Google, and now Netflix on LLMs - break down the fundamentals without assuming prior deep learning expertise.

What makes this lecture valuable:

The instructors have been teaching this material as workshops since 2020, iterating through the ChatGPT explosion and beyond. They bring both academic rigor and industry experience from actually shipping LLM products.

The three buckets of NLP they introduce create a clean mental model:

  1. Classification - Sentiment, intent detection, language identification
  2. Token-level classification - Named entity recognition (NER), part-of-speech tagging (one label per token)
  3. Generation - Translation, Q&A, summarization (where all the action is today)

The tokenization trade-offs are particularly well explained (see the sketch after this list):

  • Word-level is simple but creates OOV (out-of-vocabulary) problems
  • Subword leverages word roots but increases sequence length
  • Character-level handles misspellings but makes sequences very long, and individual characters carry little semantic meaning on their own
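
To make the trade-offs concrete, here is a toy Python sketch. The subword splits are hypothetical (a real BPE or WordPiece vocabulary is learned from data), and the tiny word vocabulary is made up; the point is how token count and OOV handling differ across granularities, not the exact splits.

```python
# Toy comparison of the three tokenization granularities discussed above.
sentence = "transformers generalize remarkably"

# Word-level: simple, but any word missing from the vocabulary becomes <UNK> (OOV).
word_vocab = {"transformers", "generalize"}  # hypothetical tiny vocabulary
word_tokens = [w if w in word_vocab else "<UNK>" for w in sentence.split()]

# Subword-level: unseen words decompose into known pieces, at the cost of more tokens.
subword_tokens = ["transform", "##ers", "general", "##ize", "remark", "##ably"]  # illustrative splits

# Character-level: no OOV problem, but sequences get very long.
char_tokens = list(sentence.replace(" ", "_"))

for name, toks in [("word", word_tokens),
                   ("subword", subword_tokens),
                   ("char", char_tokens)]:
    print(f"{name:>7}: {len(toks):2d} tokens -> {toks}")
```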

The key insight on embeddings: One-hot encoding makes all tokens orthogonal (equally dissimilar), which is useless. We need learned representations where semantically similar tokens have high cosine similarity. This is the foundation that enables everything from Word2Vec to modern transformers.
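
A minimal numpy sketch of that insight, using made-up 2-D "learned" vectors rather than weights from any real model: every pair of distinct one-hot vectors has cosine similarity 0, while dense embeddings can place semantically related tokens close together.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["cat", "dog", "car"]

# One-hot: every token is orthogonal to (equally dissimilar from) every other token.
one_hot = np.eye(len(vocab))
print(cosine(one_hot[0], one_hot[1]))  # cat vs dog -> 0.0
print(cosine(one_hot[0], one_hot[2]))  # cat vs car -> 0.0

# Hypothetical learned embeddings: semantic neighbors end up with high similarity.
learned = {"cat": np.array([0.90, 0.10]),
           "dog": np.array([0.85, 0.20]),
           "car": np.array([0.10, 0.95])}
print(cosine(learned["cat"], learned["dog"]))  # high (~0.99)
print(cosine(learned["cat"], learned["car"]))  # low  (~0.21)
```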

5 Fundamentals From Stanford's Transformers Course

  • Two-unit Stanford course: 50% midterm, 50% final, no homework - purely conceptual
  • Proxy tasks matter: Word2Vec's skip-gram and CBOW tasks aren't the goal - the learned embeddings are (see the skip-gram sketch after this list)
  • Vocabulary size: ~10K-50K for single language, 100K+ for multilingual/code models
  • Sequence length is compute: Longer sequences from character/subword tokenization directly impact model speed
  • Quality > quantity: Having the right representation matters more than having more data
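
As a rough illustration of the proxy-task idea, the sketch below builds skip-gram (center, context) pairs from a toy corpus. The pairs are throwaway supervision for training a classifier; the embeddings that classifier learns along the way are the actual product. The corpus and window size are arbitrary choices for illustration.

```python
# Generate skip-gram training pairs: predict each context word from the center word.
corpus = "the cat sat on the mat".split()
window = 2  # number of neighbors on each side treated as context

pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((center, corpus[j]))

print(pairs[:6])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat'), ('cat', 'on'), ('sat', 'the')]
```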

What This Means for Learning Transformer Architectures

Understanding LLMs from first principles starts with these basics: tokenization trade-offs determine sequence length and vocabulary coverage, and one-hot encoding's equal dissimilarity between tokens is exactly why we need learned representations in which semantically similar tokens have high cosine similarity. That foundation carries through everything from Word2Vec to modern transformers.
