Self-Supervised Learning
Also known as: SSL, self-supervision, pretext learning
What is Self-Supervised Learning?
Self-supervised learning is a training paradigm where a model learns representations from unlabeled data by solving pretext tasks derived from the data itself. Instead of requiring humans to label every example, the system generates its own supervisory signal. For language models, this typically means predicting the next token in a sequence; for vision models, it might mean reconstructing masked portions of an image. Self-supervised learning is the engine behind the pre-training phase of virtually every modern foundation model.
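The core idea, that the data supplies its own labels, can be made concrete with a minimal sketch. This is an illustrative toy, not any particular framework's API: for next-token prediction, each position's "label" is just the token that follows it in the raw sequence.

```python
def next_token_pairs(tokens):
    """Turn a raw token sequence into (input, target) training pairs.

    No human annotation is needed: the target at each position is
    simply the token that follows it in the data itself.
    """
    inputs = tokens[:-1]
    targets = tokens[1:]
    return list(zip(inputs, targets))

# An unlabeled sentence, already split into tokens.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(sentence)
# Each pair is (context token, token to predict), e.g. ("the", "cat").
```

A real model conditions on the full prefix rather than a single token, but the supervisory signal is derived the same way: by shifting the sequence one position.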
Why Self-Supervised Learning Changed Everything
Before self-supervised learning became dominant, AI progress was bottlenecked by the availability of labeled data: annotating millions of images or text samples is prohibitively expensive. Self-supervised learning removed this constraint by letting models learn from the vast amounts of unlabeled data on the internet. This is what made scaling laws practical: once labels are no longer required, you can train on trillions of tokens of text or billions of images, and performance improves predictably with scale.
How It Works in Practice
The most common self-supervised objective for language models is next-token prediction (autoregressive modeling): given a sequence of tokens, the model learns to predict the one that comes next. BERT introduced masked language modeling, where random tokens are hidden and the model predicts them from context. In computer vision, masked autoencoders (MAE) hide patches of an image and train the model to reconstruct them, while methods like DINO learn by matching representations across augmented views of the same image. In all cases, the data provides its own labels, and the model learns rich internal representations as a byproduct of solving these tasks.
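The masking objectives above can be sketched in a few lines. This is a simplified, BERT-style illustration (the 15% masking rate and `[MASK]` token follow the original BERT recipe, but everything else here is a toy, not a library API):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Corrupt a token sequence for masked language modeling.

    Randomly hides a fraction of tokens; the model's task is to
    predict the originals at the masked positions. Returns the
    corrupted input and {position: original_token} as the targets.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok  # the label comes from the data itself
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = ["self", "supervised", "learning", "needs", "no", "labels"]
corrupted, targets = mask_tokens(tokens, mask_prob=0.3)
```

MAE applies the same principle to image patches instead of tokens, typically masking a much larger fraction (around 75%) and reconstructing pixel values rather than discrete vocabulary items.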
The Foundation of Foundation Models
Self-supervised learning is the reason foundation models exist. The ability to learn general-purpose representations from raw data, then fine-tune them for specific tasks with minimal labeled examples, represents one of the most important paradigm shifts in AI. It explains why a model trained to predict the next word can also write code, translate languages, and reason about math: the pretext task forces the model to develop a deep understanding of structure and meaning.
Related Reading
- Pre-training - The phase where self-supervised learning is applied
- Scaling Laws - Why self-supervised learning scales so well
- Deep Learning - The architectural foundation for self-supervised methods