Pre-training

What is Pre-training?

Pre-training is the first phase of training a large language model, where the model learns general language understanding from massive amounts of text. Think of it as reading billions of books, articles, and websites to learn grammar, facts, and patterns in language.

During pre-training, the model processes billions to trillions of tokens and repeatedly predicts the next token in a sequence. This self-supervised approach, which learns from the structure of the data itself rather than from human labels, is what enables LLMs to develop broad capabilities.
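To make "learning from the structure of the data itself" concrete, here is a minimal sketch of how a single token sequence yields its own training labels. The function names and the toy token IDs are illustrative assumptions, not part of any particular library.

```python
import math

def next_token_pairs(token_ids):
    """Return (context, target) pairs: each prefix is paired with the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def cross_entropy(probs, target):
    """Negative log of the probability the model assigned to the correct next token."""
    return -math.log(probs[target])

# Hypothetical token IDs standing in for a short piece of text.
tokens = [5, 2, 9, 4]
pairs = next_token_pairs(tokens)
# pairs == [([5], 2), ([5, 2], 9), ([5, 2, 9], 4)]

# A model that gives the correct token probability 0.5 incurs a loss of log(2).
loss = cross_entropy({2: 0.5, 9: 0.5}, target=2)
```

No human annotation is involved: the target for every position is simply the next token already present in the text.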

The Pre-training Pipeline

1. Data Collection: Gathering diverse text from books, articles, websites, code repositories, and other sources. Hugging Face's FineWeb dataset, for example, contains 15 trillion tokens (44TB) from 96 CommonCrawl snapshots.

2. Data Cleaning: Removing duplicates, non-textual elements, formatting issues, and low-quality content. Data quality dramatically affects model quality.

3. Tokenization: Converting text into numerical tokens the model can process. Text is broken into subwords or characters and mapped to unique numbers.

4. Training: The core task is to predict the next token in a sequence. The model sees "The cat sat on the" and learns to predict "mat" (or similar). Repeated billions of times, this builds deep language understanding.
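The four steps above can be sketched end to end on a toy corpus. This is only an illustration under loud assumptions: whitespace splitting stands in for a real subword tokenizer, exact-hash deduplication stands in for production cleaning, and a bigram count table stands in for a neural network trained by gradient descent. All function names are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

def clean(docs):
    """Drop empty documents and exact duplicates after case/whitespace normalization."""
    seen, kept = set(), []
    for doc in docs:
        norm = " ".join(doc.lower().split())
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if norm and digest not in seen:
            seen.add(digest)
            kept.append(norm)
    return kept

def build_vocab(docs):
    """Map each distinct whitespace-separated word to an integer ID (toy tokenizer)."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(doc, vocab):
    """Convert a cleaned document into a list of token IDs."""
    return [vocab[word] for word in doc.split()]

def train_bigram(sequences):
    """Count which token follows which: a stand-in for real next-token training."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, token_id):
    """Return the token ID seen most often after token_id."""
    return counts[token_id].most_common(1)[0][0]

# 1. Collection (a hypothetical four-document corpus)
raw_docs = ["The cat sat on the mat", "the cat  sat on the mat", "", "The dog sat on the mat"]
# 2. Cleaning
docs = clean(raw_docs)
# 3. Tokenization
vocab = build_vocab(docs)
# 4. Training and prediction
model = train_bigram([encode(d, vocab) for d in docs])
id_to_word = {i: w for w, i in vocab.items()}
prediction = id_to_word[predict_next(model, vocab["the"])]  # "mat"
```

A bigram table conditions only on the single previous token, whereas an LLM conditions on the whole preceding context; the sketch is meant to show the shape of the pipeline, not its scale.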

Resources Required

Pre-training is extraordinarily resource-intensive:

Time: Weeks to months of continuous training
Compute: Thousands of GPUs running in parallel
Data: Trillions of tokens
Cost: Millions of dollars for frontier models

This is why most organizations fine-tune existing models rather than pre-train from scratch.
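A back-of-the-envelope calculation shows where the "weeks to months" figure comes from. The sketch below uses the common rule of thumb that training a dense transformer costs roughly 6 floating-point operations per parameter per token. Every number in it (model size, token count, GPU count, per-GPU throughput, utilization) is a hypothetical assumption chosen for illustration, not a figure from any real training run.

```python
def training_days(n_params, n_tokens, n_gpus, peak_flops_per_gpu, utilization):
    """Estimate wall-clock training time in days from the ~6 * N * D FLOPs rule of thumb."""
    total_flops = 6 * n_params * n_tokens
    sustained_flops_per_second = n_gpus * peak_flops_per_gpu * utilization
    seconds = total_flops / sustained_flops_per_second
    return seconds / 86_400  # seconds per day

days = training_days(
    n_params=70e9,             # assumed: 70 billion parameters
    n_tokens=2e12,             # assumed: 2 trillion training tokens
    n_gpus=2048,               # assumed cluster size
    peak_flops_per_gpu=3e14,   # assumed peak throughput per GPU, in FLOP/s
    utilization=0.4,           # assumed fraction of peak actually sustained
)
# Under these assumptions the run takes roughly 40 days.
```

The estimate ignores restarts, evaluation, and failed experiments, all of which add to the real cost.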

Pre-training vs. Fine-tuning

Aspect	Pre-training	Fine-tuning
Goal	General language understanding	Specific task or behavior
Data	Trillions of tokens, diverse	Thousands to millions of examples, targeted
Time	Weeks to months	Hours to days
Cost	Millions of dollars	Hundreds to thousands of dollars
Who does it	Foundation model labs	Anyone with a use case

The Two-Phase Paradigm

Modern LLM development is often described in two phases:

Pre-training: Builds general-purpose language capabilities
Post-training: Refines and aligns these capabilities (includes fine-tuning, reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO))

As Andrej Karpathy describes it, pre-training is "a crappy form of evolution"—selecting for models that predict internet text well. Post-training then shapes this raw capability into something useful and safe.

2025 Developments

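Reinforcement Pre-Training (RPT): Microsoft researchers reframed next-token prediction as a reasoning task trained with reinforcement learning, in which the model is rewarded for correctly predicting the next token. The aim is to bring reinforcement learning into the pre-training stage rather than reserving it for post-training.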

Data scarcity: High-quality text data is becoming scarce. Labs are exploring synthetic data, multimodal data, and more efficient training methods.

Scaling limits: Pure scaling of pre-training is widely reported to be showing diminishing returns, shifting focus toward post-training innovations.
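One way to see why returns diminish is the parametric loss form used in the scaling-law literature (this is the form fitted in the Chinchilla work by Hoffmann et al.; the constants are empirical and are deliberately left symbolic here). N is the parameter count, D is the number of training tokens, and E is an irreducible loss floor.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad A, B, \alpha, \beta > 0

% Doubling the data shrinks only the data term, and only by a fixed factor:
\frac{B}{(2D)^{\beta}} = 2^{-\beta} \cdot \frac{B}{D^{\beta}}
```

Each doubling of data or parameters removes a constant fraction of a term that is already shrinking, so the absolute improvement gets smaller every time while the cost doubles, and the loss can never fall below E.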

Related Terms

Scaling Laws - The relationship between pre-training compute and performance
Andrej Karpathy - Calls pre-training "crappy evolution"
John Schulman - Pioneer in post-training techniques
