Pre-training
/priː ˈtreɪnɪŋ/
What is Pre-training?
Pre-training is the first phase of training a large language model, where the model learns general language understanding from massive amounts of text. Think of it as reading billions of books, articles, and websites to learn grammar, facts, and patterns in language.
During pre-training, the model processes trillions of tokens, repeatedly predicting the next token in a sequence. This self-supervised approach (learning from the structure of the data itself rather than from human labels) is what enables LLMs to develop broad capabilities.
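To make the self-supervision concrete, here is a minimal, dependency-free sketch: every position in a token sequence yields an (input, target) training pair, where the target is simply the next token. Whitespace splitting stands in for a real subword tokenizer here.

```python
# Toy illustration of the self-supervised next-token objective.
# Real pipelines operate on subword token IDs, not whitespace-split words.
text = "the cat sat on the mat"
tokens = text.split()  # stand-in for a real tokenizer

# Every prefix becomes an (input, target) pair; the "labels" come from
# the text itself, so no human annotation is required.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(f"{' '.join(context):<20} -> {target}")
```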
The Pre-training Pipeline
1. Data Collection: Gathering diverse text from books, articles, websites, code repositories, and other sources. Hugging Face's FineWeb dataset, for example, contains 15 trillion tokens (44 TB) drawn from 96 CommonCrawl snapshots.
2. Data Cleaning: Removing duplicates, non-textual elements, formatting issues, and low-quality content. Data quality dramatically affects model quality; a minimal deduplication sketch appears after this list.
3. Tokenization: Converting text into numerical tokens the model can process. Text is broken into subwords or characters, and each piece is mapped to a unique integer ID (see the tokenizer example after this list).
4. Training: The core task is next-token prediction. The model sees "The cat sat on the" and learns to predict "mat" (or a similarly plausible continuation). Repeated billions of times, this builds deep language understanding; see the training-step sketch below.
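The cleaning step often starts with exact deduplication by content hash. A minimal Python sketch (the `normalize` and `deduplicate` helpers are illustrative, not from any particular pipeline; production pipelines like FineWeb's also apply fuzzy methods such as MinHash deduplication):

```python
import hashlib

def normalize(doc: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(doc.lower().split())

def deduplicate(docs):
    """Exact deduplication: keep the first document seen for each content hash."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "The cat sat on the mat.",
    "the cat  sat on the mat.",  # near-identical copy, dropped below
    "A different document.",
]
print(deduplicate(corpus))
```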
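For tokenization, a short sketch using the open-source `tiktoken` library (assuming it is installed; `cl100k_base` is one real BPE encoding, used here only as an example):

```python
import tiktoken

# Load a real BPE tokenizer; any installed encoding would illustrate the point.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Pre-training is self-supervised.")
print(ids)  # a list of integer token IDs

# Each ID maps back to a subword piece of the original text.
print([enc.decode_single_token_bytes(i) for i in ids])
```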
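And for the training step itself, a deliberately tiny PyTorch sketch: an embedding plus a linear head stands in for a full transformer, but the loss is the same next-token cross-entropy used in real pre-training. All names and sizes here are illustrative, not any lab's actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy vocabulary and "corpus"; real runs use ~50k-200k subword tokens
# and trillions of tokens of text, but the objective is identical.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
token_ids = torch.tensor([[0, 1, 2, 3, 0, 4]])  # "the cat sat on the mat"

# Targets are the inputs shifted one position: at each step, predict the next token.
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]

class TinyLM(nn.Module):
    """Embedding + linear head; a stand-in for a transformer decoder."""
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(ids))  # (batch, seq, vocab) logits

model = TinyLM(len(vocab))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# One optimization step: cross-entropy between predicted logits and the
# actual next tokens. Pre-training is this step, repeated at massive scale.
optimizer.zero_grad()
logits = model(inputs)
loss = F.cross_entropy(logits.reshape(-1, len(vocab)), targets.reshape(-1))
loss.backward()
optimizer.step()
print(f"next-token loss: {loss.item():.3f}")
```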
Resources Required
Pre-training is extraordinarily resource-intensive:
- Time: Weeks to months of continuous training
- Compute: Thousands of GPUs running in parallel
- Data: Trillions of tokens
- Cost: Millions of dollars for frontier models
This is why most organizations fine-tune existing models rather than pre-train from scratch.
Pre-training vs. Fine-tuning
| Aspect | Pre-training | Fine-tuning |
|---|---|---|
| Goal | General language understanding | Specific task or behavior |
| Data | Trillions of tokens, diverse | Thousands to millions of examples, targeted |
| Time | Weeks to months | Hours to days |
| Cost | Millions of dollars | Hundreds to thousands of dollars |
| Who does it | Foundation model labs | Anyone with a use case |
The Two-Phase Paradigm
Modern LLM development is typically described in two phases:
- Pre-training: Builds general-purpose language capabilities
- Post-training: Refines and aligns these capabilities (includes fine-tuning, RLHF, DPO)
As Andrej Karpathy describes it, pre-training is "a crappy form of evolution"—selecting for models that predict internet text well. Post-training then shapes this raw capability into something useful and safe.
2025 Developments
Reinforcement Pre-Training (RPT): Microsoft researchers reframed next-token prediction as a sequential decision-making problem, potentially improving how models learn during pre-training.
Data scarcity: High-quality text data is becoming scarce. Labs are exploring synthetic data, multimodal data, and more efficient training methods.
Scaling limits: Pure scaling of pre-training is showing diminishing returns, shifting focus to post-training innovations.
Related Reading
- Scaling Laws - The relationship between pre-training compute and performance
- Andrej Karpathy - Calls pre-training "crappy evolution"
- John Schulman - Pioneer in post-training techniques
