JEPA

Pronunciation

/ˈdʒepə/

Also known as: Joint Embedding Predictive Architecture, I-JEPA, V-JEPA

What is JEPA?

Joint Embedding Predictive Architecture (JEPA) is Yann LeCun's proposed framework for building more human-like AI systems. First outlined in his 2022 paper "A Path Towards Autonomous Machine Intelligence," JEPA represents an alternative to the autoregressive approach used by LLMs.

The key insight: Predict abstract representations, not raw pixels or tokens. This allows the system to ignore irrelevant details while focusing on semantic understanding.

How JEPA Works

Traditional generative models (like GPT) predict the next token or pixel directly. JEPA takes a different approach:

  1. Encode parts of an input into abstract representations (embeddings)
  2. Predict the embedding of one part from another part
  3. Learn by comparing predicted embeddings to actual embeddings

This happens in "embedding space" rather than "pixel/token space"—a crucial distinction that eliminates the need to model irrelevant details.
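
As a concrete illustration, here is a minimal, hypothetical sketch of a JEPA-style training step in PyTorch. The encoders, predictor, and hyperparameters below are placeholders rather than Meta's implementation; the point is simply that the loss is computed between predicted and actual embeddings, not between raw pixels or tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 128
# Toy MLP modules; real JEPA models use Vision Transformers here.
context_encoder = nn.Sequential(nn.Linear(784, dim), nn.ReLU(), nn.Linear(dim, dim))
target_encoder = nn.Sequential(nn.Linear(784, dim), nn.ReLU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
target_encoder.load_state_dict(context_encoder.state_dict())  # start identical

opt = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-3
)

def jepa_step(context_view, target_view, momentum=0.99):
    # 1. Encode both parts of the input into abstract representations.
    s_context = context_encoder(context_view)
    with torch.no_grad():                        # no gradients flow into the target branch
        s_target = target_encoder(target_view)
    # 2. Predict the target embedding from the context embedding.
    s_pred = predictor(s_context)
    # 3. Compare predicted vs. actual embeddings: the loss lives in embedding space.
    loss = F.smooth_l1_loss(s_pred, s_target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # The target encoder tracks the context encoder via an exponential moving average,
    # a common way to avoid representational collapse in joint-embedding methods.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(momentum).add_(p_c, alpha=1 - momentum)
    return loss.item()

# Toy usage: treat a noisy copy of a sample as the "context" view.
x = torch.randn(32, 784)
print(jepa_step(x + 0.1 * torch.randn_like(x), x))
```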

Why Not Generative Models?

LeCun argues that autoregressive generative models (LLMs, diffusion models) have fundamental limitations:

  • Computational waste: Predicting every pixel/token, even irrelevant ones
  • Uncertainty handling: Struggle with multiple valid futures
  • Brittleness: Sensitive to exact input formulations

JEPA handles uncertainty in embedding space: the predictor can be conditioned on a latent variable, and different values of that variable correspond to different plausible outcomes, so the model is not forced to commit to a single future.
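
A toy illustration of that idea, assuming a latent-variable predictor along the lines LeCun sketches (the module names and sizes here are made up): sampling different values of the latent yields different predicted embeddings, i.e. different plausible futures.

```python
import torch
import torch.nn as nn

emb_dim, z_dim = 64, 8
predictor = nn.Linear(emb_dim + z_dim, emb_dim)   # prediction depends on context AND latent z

context = torch.randn(1, emb_dim)                 # embedding of what was observed
# Different samples of z parameterize different valid outcomes in embedding space.
futures = [
    predictor(torch.cat([context, torch.randn(1, z_dim)], dim=-1)) for _ in range(3)
]
print([f.shape for f in futures])
```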

I-JEPA (Images)

Meta's Image-based JEPA learns by:

  • Taking an image and masking parts of it
  • Predicting the embedding of masked regions from visible regions
  • Comparing predicted vs. actual embeddings
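
A hypothetical sketch of that objective on a batch of image patches (toy linear encoders stand in for the Vision Transformers Meta actually uses; every name and size here is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

patch_dim, emb_dim, n_patches = 16 * 16 * 3, 64, 196   # 14x14 grid of 16x16 RGB patches

context_encoder = nn.Linear(patch_dim, emb_dim)
target_encoder = nn.Linear(patch_dim, emb_dim)
predictor = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.GELU(), nn.Linear(emb_dim, emb_dim))
pos_embed = nn.Embedding(n_patches, emb_dim)            # tells the predictor WHICH region to predict
target_encoder.load_state_dict(context_encoder.state_dict())

def i_jepa_loss(patches, mask_ratio=0.5):
    """patches: (batch, n_patches, patch_dim) -> scalar loss in embedding space."""
    n_masked = int(mask_ratio * n_patches)
    perm = torch.randperm(n_patches)
    masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]

    # Encode only the visible regions; mean-pool them into one context vector for
    # simplicity (the real model attends over the visible patch tokens instead).
    context = context_encoder(patches[:, visible_idx]).mean(dim=1)        # (B, emb_dim)

    # Predict an embedding for each masked region, conditioned on its position.
    queries = context.unsqueeze(1) + pos_embed(masked_idx).unsqueeze(0)   # (B, n_masked, emb_dim)
    pred = predictor(queries)

    # Targets: embeddings of the masked regions from the target encoder, no gradient.
    with torch.no_grad():
        target = target_encoder(patches[:, masked_idx])                   # (B, n_masked, emb_dim)

    return F.smooth_l1_loss(pred, target)

print(i_jepa_loss(torch.randn(8, n_patches, patch_dim)).item())
```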

Results: a 632M-parameter model trained on 16 A100 GPUs in under 72 hours achieved state-of-the-art low-shot classification on ImageNet with only 12 labeled examples per class, while comparable methods required 2-10x more compute to reach worse results.

V-JEPA (Video)

V-JEPA extends the architecture to video:

"V-JEPA is a step toward a more grounded understanding of the world so machines can achieve more generalized reasoning and planning." — Yann LeCun

V-JEPA 2 has been successfully applied to robotics planning, demonstrating how JEPA can serve as a world model for real-world decision making.

Key Advantages

Aspect             | Generative Models     | JEPA
-------------------|-----------------------|------------------------
Prediction target  | Raw pixels/tokens     | Abstract embeddings
Irrelevant details | Must model everything | Can ignore noise
Uncertainty        | Single output         | Multiple valid outcomes
Efficiency         | High compute          | More efficient
Semantic focus     | Surface patterns      | Deeper meaning

JEPA vs. Transformers

JEPA is not an alternative to transformers—many JEPA implementations use transformer modules. It's an alternative to autoregressive generation as a learning paradigm, regardless of the underlying architecture.

The Vision

LeCun positions JEPA as the core of his vision for achieving human-level reasoning:

  1. World model: JEPA learns how the world works
  2. Planning: Use the world model to simulate action consequences
  3. Reasoning: Navigate complex decision spaces
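
As a rough sketch of what steps 1-3 could look like in code, here is a generic model-predictive-control loop over a learned embedding space. This is not Meta's V-JEPA 2 planner; every module and size is a placeholder, and the "world model" is just a stand-in predictor. Candidate action sequences are rolled forward in latent space and scored by distance to a goal embedding.

```python
import torch
import torch.nn as nn

emb_dim, act_dim, horizon, n_candidates = 64, 4, 5, 256

encoder = nn.Linear(128, emb_dim)                  # observation -> embedding (placeholder)
predictor = nn.Linear(emb_dim + act_dim, emb_dim)  # (state embedding, action) -> next embedding

def plan(obs, goal_obs):
    """Pick the first action of the best-scoring random action sequence (MPC-style)."""
    with torch.no_grad():
        s = encoder(obs).expand(n_candidates, emb_dim)         # current state, one copy per candidate
        goal = encoder(goal_obs)                                # where we want to end up
        actions = torch.randn(n_candidates, horizon, act_dim)  # candidate action sequences

        for t in range(horizon):                               # simulate consequences in latent space
            s = predictor(torch.cat([s, actions[:, t]], dim=-1))

        cost = (s - goal).pow(2).sum(dim=-1)                   # distance to the goal embedding
        best = cost.argmin()
    return actions[best, 0]                                    # execute only the first action, then replan

print(plan(torch.randn(1, 128), torch.randn(1, 128)))
```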

This contrasts with the "scale up LLMs" approach dominant in the industry.
