JEPA (Joint-Embedding Predictive Architecture) is Yann LeCun’s proposed backbone for world models. Instead of predicting raw pixels or tokens, a JEPA predicts a representation of a future observation from a representation of a past one. Operating in a learned latent space lets the predictor ignore unpredictable surface detail (pixel noise, lighting jitter, paraphrase variation) and focus on the structural, semantically meaningful parts of the signal that actually support planning. This page walks the family tree in the order the ideas were proposed.

JEPA (theory, 2022)

The framing paper. LeCun argues that generative prediction in observation space is the wrong objective for world-model learning — the model wastes capacity on irrelevant pixel-level detail — and proposes that the entire architecture (perception, prediction, planning, short- and long-term memory) should operate on learned representations. Every subsequent JEPA instantiates some part of this blueprint.

I-JEPA (images)

The first concrete JEPA. Given a single image, I-JEPA masks out target blocks and trains a predictor — operating on encoded patches rather than pixels — to predict the representations of the masked blocks from a context of unmasked blocks. It matches masked-autoencoder quality on ImageNet without any pixel-level reconstruction loss, validating the “predict in latent space” thesis for static images.
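The core loop can be sketched in a few lines. Everything below is an illustrative stand-in (a frozen random projection for the target encoder, a single linear map for the predictor, a hand-picked contiguous mask); the point is only that the training signal is an MSE between predicted and target representations, with no pixel term anywhere.

```python
import numpy as np

# Toy I-JEPA-style objective. All modules are illustrative stand-ins:
# the real encoders and predictor are vision transformers.
rng = np.random.default_rng(0)
n_patches, d_patch, d_emb = 16, 32, 8
patches = rng.normal(size=(n_patches, d_patch))   # fake patch features

# Target encoder: in the paper, an EMA copy of the context encoder.
# Here a frozen random projection, so no gradient reaches the targets.
enc = rng.normal(size=(d_patch, d_emb)) / np.sqrt(d_patch)
mask = np.zeros(n_patches, dtype=bool)
mask[4:8] = True                                  # one contiguous target block
ctx = patches[~mask] @ enc                        # context representations
tgt = patches[mask] @ enc                         # target representations (fixed)

# Predictor: one linear map from the mean-pooled context (the real predictor
# is position-conditioned). Trained by gradient descent on a latent-space MSE.
W = np.zeros((d_emb, d_emb))
pooled = ctx.mean(axis=0)

def latent_mse(W):
    return np.mean((pooled @ W - tgt) ** 2)

loss_before = latent_mse(W)
for _ in range(300):
    resid = pooled @ W - tgt.mean(axis=0)
    W -= 0.1 * np.outer(pooled, resid)            # descend the latent MSE
loss_after = latent_mse(W)
```

In the real model both encoders and the predictor are transformers, the mask is sampled per image, and the target encoder tracks the context encoder by exponential moving average rather than being frozen.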

MC-JEPA (motion + content)

Extends I-JEPA beyond a single still image by jointly learning content (what is in the scene) and motion (how it flows) from video pairs. A shared encoder produces representations that are simultaneously useful for semantic tasks (classification, segmentation) and for optical-flow estimation, showing that the JEPA objective can absorb temporal structure without giving up static-image quality.

V-JEPA (video / dynamics)

Scales the predict-in-latent-space idea to full video. A V-JEPA encoder watches many seconds of video and a predictor reconstructs the representations of masked space-time regions. The resulting features transfer strongly to downstream video-understanding tasks despite never seeing a generative pixel loss. V-JEPA 2 (2025) scales this to billions of parameters and becomes the visual backbone of several later VLA and VL-JEPA systems.
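One detail worth making concrete is the shape of the masks. Hiding a spatial block in a single frame is too easy, since the predictor could copy the answer from a neighbouring frame; extending the mask through time prevents that shortcut. A toy sketch of such a space-time ("tube") mask, with grid sizes invented for the demo:

```python
import numpy as np

# Sketch of a space-time mask over a patch grid: the same spatial block is
# hidden in every frame, so a masked region cannot be recovered by copying
# from an adjacent frame. Grid sizes and block size are invented.
rng = np.random.default_rng(0)
T, H, W = 8, 4, 4                      # frames x patch rows x patch cols
mask = np.zeros((T, H, W), dtype=bool)
y, x = rng.integers(0, H - 1), rng.integers(0, W - 1)
mask[:, y:y + 2, x:x + 2] = True       # one 2x2 block, masked in all frames

# The V-JEPA-style loss is then an MSE between predicted and target
# representations at exactly these masked positions, as in I-JEPA.
masked_frac = mask.mean()
```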

VL-JEPA (vision + language)

Replaces the standard autoregressive token-generation objective of vision-language models with a JEPA-style objective: given a video and a textual query, predict the embedding of the answer rather than decoding text left to right. A lightweight decoder is invoked only when explicit text output is needed, which cuts decoding cost by roughly 2.85× while matching or beating autoregressive VLM baselines with half the trainable parameters.
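The practical consequence of predicting an answer embedding is that, when the set of plausible answers is known, "decoding" collapses to nearest-neighbour retrieval in embedding space; the text decoder is only needed for open-ended output. A toy illustration (the answers and vectors are random stand-ins, not model outputs):

```python
import numpy as np

# Toy embedding-prediction "decoding": instead of generating tokens, score
# candidate answers by cosine similarity to the predicted embedding.
rng = np.random.default_rng(0)
answers = ["a cat", "a dog", "a traffic light"]
answer_emb = rng.normal(size=(3, 16))
answer_emb /= np.linalg.norm(answer_emb, axis=1, keepdims=True)

# Pretend the predictor produced an embedding close to "a dog".
z_pred = answer_emb[1] + 0.05 * rng.normal(size=16)
z_pred /= np.linalg.norm(z_pred)
scores = answer_emb @ z_pred           # cosine similarities
best = answers[int(np.argmax(scores))]
```

In the full system the lightweight decoder maps the predicted embedding to text only when free-form output is actually required, which is where the decoding-cost saving comes from.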

H-JEPA (hierarchical world models)

Not a single paper but the long-horizon-planning vision from LeCun’s 2022 position paper: stack JEPAs so that the lower levels predict fine-grained, short-horizon representations (seconds of video, millisecond-scale dynamics) and the higher levels predict coarse-grained, long-horizon representations (minutes-to-hours plans, abstract subgoals). An agent can plan at the top of the stack without having to simulate every pixel of every frame. MC-JEPA’s pyramidal flow predictor is an early concrete step in this direction.
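The computational payoff of the hierarchy is easy to see with a toy latent dynamics model: if a higher level predicts k low-level ticks in one jump, a planner searching over high-level states expands k-fold fewer nodes per unit of horizon. A sketch with made-up linear dynamics:

```python
import numpy as np

# Two-level latent rollout (made-up linear dynamics, purely illustrative).
# Low level: one tick per step. High level: k ticks in a single jump.
rng = np.random.default_rng(0)
d, k = 4, 8
A_low = 0.9 * np.eye(d)                        # one-tick latent transition
A_high = np.linalg.matrix_power(A_low, k)      # the ideal k-tick jump model
z0 = rng.normal(size=d)

z_fine = z0.copy()
for _ in range(k):                             # k low-level steps...
    z_fine = A_low @ z_fine
z_coarse = A_high @ z0                         # ...equal one high-level step
```

Here the jump model is computed exactly; in an H-JEPA it would be learned, and only approximately consistent with the level below — the price paid for planning over far fewer states.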

LeJEPA (theoretical foundation)

The theory paper that shows why JEPAs work and how to make them work without the collection of empirical tricks (stop-gradient, EMA teacher, schedulers, large-batch tuning) that earlier variants relied on. LeJEPA proves that the optimal embedding distribution for a JEPA is an isotropic Gaussian and introduces a simple regularizer — Sketched Isotropic Gaussian Regularization (SIGReg) — that drives training toward it. The result is a single-hyperparameter, linear-cost objective in ~50 lines of code that matches or beats heuristic-heavy baselines across 10+ datasets and 60+ architectures.
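To make the isotropic-Gaussian target concrete, here is a crude moment-matching stand-in for the SIGReg idea (not the paper's actual statistic): project the batch of embeddings onto random 1-D directions and penalize how far each projection's mean and variance sit from those of a standard Gaussian. An isotropic Gaussian batch scores near zero; a collapsed batch does not.

```python
import numpy as np

# Crude stand-in for SIGReg (the paper uses a different, principled 1-D
# test statistic): random projections of an isotropic Gaussian should have
# mean 0 and variance 1 in every direction.
rng = np.random.default_rng(0)

def sigreg_like(Z, n_dirs=64):
    d = Z.shape[1]
    U = rng.normal(size=(d, n_dirs))
    U /= np.linalg.norm(U, axis=0, keepdims=True)   # random unit directions
    P = Z @ U                                       # (N, n_dirs) projections
    return float(np.mean(P.mean(0) ** 2 + (P.var(0) - 1.0) ** 2))

Z_iso = rng.normal(size=(2048, 16))                 # healthy embeddings
Z_col = np.ones((2048, 16)) * rng.normal(size=(2048, 1))  # collapsed: rank 1
penalty_iso, penalty_col = sigreg_like(Z_iso), sigreg_like(Z_col)
```

The penalty is linear in batch size and needs essentially one knob (the number of projection directions), which is the flavour of simplicity LeJEPA argues for, though the real SIGReg statistic and its guarantees come from the paper, not this sketch.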

Where this fits

The JEPA family is the architectural answer to the design problem raised in World Models: if you want an internal simulator that lets an agent plan at human-relevant timescales, what should it predict? LeCun’s answer is representations, not pixels or tokens — and the papers above trace how that single idea scales from a static image (I-JEPA) to multi-modal, hierarchical, theoretically grounded systems (VL-JEPA, H-JEPA, LeJEPA).