Self-supervised video representations: from TCN to TCC with aegean-ai/tcc
This notebook is designed to read like a tutorial and a course assignment at the same time.
It focuses on one question:
Can a self-supervised video representation learn the latent phase of a task purely from temporal structure, without robot control, action labels, or transcription?
The answer developed across the two Google Research papers is:
- TCN learns by enforcing time-based contrastive alignment under strong synchronization assumptions.
- TCC generalizes this idea by enforcing temporal cycle-consistency, which is more robust to variations in execution speed and alignment.
In this notebook, you will:
- study the conceptual evolution from TCN to TCC
- train the PyTorch rewrite in aegean-ai/tcc
- extract frame embeddings
- visualize trajectories with PCA, t-SNE, and UMAP
- segment action sequences using representation geometry
Papers
- Sermanet et al., Time-Contrastive Networks, 2018. https://arxiv.org/abs/1704.06888
- Dwibedi et al., Temporal Cycle-Consistency Learning, 2019. https://arxiv.org/abs/1904.07846
Repository used in this notebook
https://github.com/aegean-ai/tcc
This notebook assumes the main branch and the current PyTorch package layout under src/tcc/.
What you should learn
By the end, you should be able to explain why TCN and TCC are related but not identical:
- TCN: metric alignment with synchronized positives
- TCC: structural temporal alignment via cycles
- TCN says: “frames at the same time index should be close.”
- TCC says: “if I map from one sequence to another and back, I should return to the same temporal phase.”
1. Theory recap: from TCN to TCC
1.1 TCN: contrastive temporal alignment
TCN learns an embedding so that synchronized frames from different views become neighbors in feature space. A canonical triplet-style loss is:

$$\mathcal{L}_{\text{TCN}} = \sum_i \max\left(0,\; \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 \;-\; \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 \;+\; \alpha\right)$$

Interpretation:
- anchor: frame $x_i^a$ from one view at time $i$
- positive: synchronized frame $x_i^p$ from another view
- negative: mismatched-time frame $x_i^n$

A minimal PyTorch sketch of this hinge loss follows below.
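To make the loss concrete, here is a minimal PyTorch sketch of the hinge above. It is illustrative only (the function name and the margin value are ours), not the repo's implementation.

```python
import torch.nn.functional as F

def tcn_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge form of the TCN triplet loss (illustrative sketch).

    anchor, positive, negative: (B, D) embeddings. The positive is the
    time-synchronized frame from another view; the negative is a frame
    from a mismatched time.
    """
    d_pos = (anchor - positive).pow(2).sum(dim=1)  # squared distance to positive
    d_neg = (anchor - negative).pow(2).sum(dim=1)  # squared distance to negative
    # Push the positive closer than the negative by at least `margin`.
    return F.relu(d_pos - d_neg + margin).mean()
```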
1.2 Why TCN is not enough
Suppose two people perform the same pouring task:
- one moves slowly
- one moves quickly
- one pauses before tilting
- one starts tilting earlier
Frames at the same clock time then correspond to different task phases, so TCN's synchronized-positive assumption pairs frames that should not be neighbors.
1.3 TCC: align temporal structure, not raw clock time
TCC keeps the idea that embeddings should reflect task progression, but replaces hard synchronized matching with cycle consistency.

Conceptual intuition. Let $u_i$ denote the embedding of frame $i$ in sequence $U$ and $v_j$ the embedding of frame $j$ in sequence $V$. Map $u_i$ to its most corresponding frame in $V$:

$$j^* = \arg\min_j \lVert u_i - v_j \rVert_2$$

Then map back from sequence $V$ to sequence $U$:

$$k^* = \arg\min_k \lVert v_{j^*} - u_k \rVert_2$$

TCC encourages $k^* = i$.

Differentiable training loss. The hard $\arg\min$ above is not differentiable, so the actual TCC loss replaces it with a soft nearest-neighbor formulation. For frame $i$ in sequence $U$, define a soft correspondence distribution over frames $j$ in sequence $V$:

$$\alpha_{ij} = \frac{\exp\left(-\lVert u_i - v_j \rVert_2^2 / \tau\right)}{\sum_{j'} \exp\left(-\lVert u_i - v_{j'} \rVert_2^2 / \tau\right)}, \qquad \tilde{v}_i = \sum_j \alpha_{ij}\, v_j$$

where $\tau$ is a temperature parameter. The cycle-back distribution over frames of $U$ is computed analogously from the soft nearest neighbor $\tilde{v}_i$, and the loss is the cross-entropy between this back-mapped distribution and a target concentrated at the original index $i$. This makes the entire cycle differentiable and trainable with standard gradient descent.
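The following PyTorch sketch mirrors the soft formulation above for two embedded sequences u (N×D) and v (M×D). It is a minimal illustration of the cycle-back classification idea, not the repo's exact loss (which may add normalization, a regression variant, or per-frame weighting).

```python
import torch
import torch.nn.functional as F

def cycle_back_loss(u, v, tau=0.1):
    """Soft temporal cycle-consistency loss (illustrative sketch).

    u: (N, D) embeddings of sequence U; v: (M, D) embeddings of sequence V.
    """
    # Soft correspondence of each u_i over the frames of V.
    alpha = F.softmax(-torch.cdist(u, v).pow(2) / tau, dim=1)  # (N, M)
    v_soft = alpha @ v                                          # (N, D) soft nearest neighbors

    # Cycle back: logits of each soft neighbor over the frames of U.
    logits = -torch.cdist(v_soft, u).pow(2) / tau               # (N, N)

    # Each frame should land back on its own index.
    target = torch.arange(u.shape[0], device=u.device)
    return F.cross_entropy(logits, target)
```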
Conceptually:
- TCN aligns absolute timestamps
- TCC aligns latent phase structure
1.4 What the embedding should look like on pouring
If TCC works, then the learned trajectory in embedding space should behave like a latent phase variable:
- early reach frames cluster near other early reach frames
- grasp transitions appear near one another
- tilt and pour form coherent regions
- embeddings from different videos should trace similar temporal paths
2. How this notebook and the aegean-ai/tcc repo work together
This notebook is not a standalone script. It is a guided analysis layer that drives the aegean-ai/tcc PyTorch package. The repo provides the training loop, model definitions, dataset utilities, and evaluation code. The notebook provides the experimental protocol: configuring runs, extracting embeddings, and visualizing results.
Two supported environments
| | Dev container (recommended) | Google Colab |
|---|---|---|
| GPU | Local NVIDIA GPU via Docker | Colab T4/A100 runtime |
| Package manager | uv (pre-installed in container) | pip (Colab default) |
| Setup effort | make start — one command | Clone + pip install in notebook cells |
| Persistence | Full local disk | Session-scoped (data lost on disconnect) |
| Best for | Full sweep, large runs | Quick experiments, no local GPU |
Workflow overview
What you modify vs. what you use as-is
| Layer | You modify | You use as-is |
|---|---|---|
| Notebook | Embedding dimension, iteration count, analysis parameters (K, projection method) | Visualization and segmentation code |
| Repo config | model.conv_embedder.embedding_size, train.max_iters, logdir | Everything else in configs/default.yaml |
| Repo code | Nothing — treat as a library | train.py, evaluate.py, datasets.py, models.py |
3. Environment setup
Choose one of the two paths below. Both result in a working import tcc with GPU access.
Path A: Dev container (recommended for full assignment)
The repo ships a complete Docker-based development environment with GPU support, uv, and VS Code integration.
Prerequisites: Docker with NVIDIA Container Toolkit, VS Code with Dev Containers extension.
Steps:
- Clone the repo locally:
- Copy the environment file:
- Open in VS Code → “Reopen in Container” (or run docker compose up -d manually).
- Inside the container, run make start. This creates a .venv with uv, installs the package in editable mode, and registers a Jupyter kernel.
- Open this notebook in VS Code or JupyterLab (port 8888) and select the “Python 3 (tcc)” kernel.
Container details:
- Base image: pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
- Package manager: uv (not pip) — the Makefile handles all uv calls
- Python: whatever 3.11+ is in the container (typically from conda)
- Workspace: /workspaces/tcc
- TensorBoard: port 6006
Path B: Google Colab (quick start, no local GPU needed)
Use this path if you do not have a local GPU or want a fast start. Colab sessions are ephemeral — save checkpoints to Google Drive to avoid losing training results.

Steps:
- In a Colab notebook, enable GPU: Runtime → Change runtime type → T4 GPU.
- Run the clone and install cells below (Sections 3.1–3.2).
- Colab uses pip — the %pip install commands handle everything.

Notes:
- Session timeout erases all local files. Mount Google Drive for persistence; a mount sketch follows after this list.
- Colab’s default Python may differ from 3.11 — the package should still install but is only tested on 3.11–3.12.
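If you want persistence on Colab, a minimal mount cell looks like this; the checkpoint directory below is an assumption, point your logdir wherever you like.

```python
# Colab only: mount Google Drive so checkpoints survive session resets.
from google.colab import drive

drive.mount('/content/drive')

# Hypothetical location for run logs/checkpoints on Drive.
CHECKPOINT_DIR = '/content/drive/MyDrive/tcc_runs'
```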
Python version requirement
The repo requires Python ≥3.11, <3.13 (see pyproject.toml). The dev container satisfies this automatically. On Colab, check with !python --version.
3.1 Clone the repository (Colab / Path B only)
If you are using the dev container (Path A), skip this — the repo is already your workspace at /workspaces/tcc.
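A minimal Colab clone cell, assuming you want the repo in the current working directory:

```python
# Colab only: clone the repo and move into it.
!git clone https://github.com/aegean-ai/tcc.git
%cd tcc
```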
3.2 Install the package (Colab / Path B only)
If you are using the dev container (Path A), skip this — make start already installed the package. Run make install-notebooks if you need matplotlib/umap-learn.
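A minimal Colab install cell. The editable install of the repo is the important part; the extra packages (matplotlib, umap-learn, scikit-learn) are assumptions based on the plotting and clustering used later, so follow the repo's own instructions if they differ.

```python
# Colab only: install the tcc package plus plotting/clustering extras.
%pip install -e .
%pip install matplotlib umap-learn scikit-learn
```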
3.3 Quick repository inspection
Verify the repo structure. In the dev container the repo root is /workspaces/tcc; on Colab it is the cloned tcc/ directory.
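A quick sanity-check cell, assuming the notebook's working directory is the repo root (adjust REPO_ROOT otherwise):

```python
from pathlib import Path

# Repo root: /workspaces/tcc in the dev container, the cloned tcc/ dir on Colab.
REPO_ROOT = Path.cwd()

# Package modules referenced in this notebook.
for p in sorted((REPO_ROOT / "src" / "tcc").glob("*.py")):
    print(p.relative_to(REPO_ROOT))

# Default config referenced in Section 2.
print("default config present:", (REPO_ROOT / "configs" / "default.yaml").exists())
```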
4. Data: the pouring dataset
The multiview pouring dataset is hosted on HuggingFace at sermanet/multiview-pouring. It contains TFRecord files with multi-view video sequences of pouring tasks.
Download from HuggingFace
Use huggingface_hub to download the dataset files. The code cell below clones the dataset repository into data/pouring/. This is the recommended approach — it downloads all TFRecord files and the recombination script needed for one split file.
Expected directory layout
After download and conversion, the dataset root must have this structure: the create_dataset function expects this layout — it discovers videos by listing subdirectories under train/ or val/, then loads frames in filename-sorted order. An illustrative tree is shown below.
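The directory and file names in this tree are hypothetical; only the train/ and val/ subdirectory convention and the filename-sorted frames come from the description above.

```text
data/pouring/
├── train/
│   ├── pour_001/        # one subdirectory per video (names illustrative)
│   │   ├── 00000.jpg    # frames read in filename-sorted order
│   │   ├── 00001.jpg
│   │   └── ...
│   └── pour_002/
└── val/
    └── pour_101/
```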
4.1 Download from HuggingFace
The dataset is hosted at sermanet/multiview-pouring and contains TFRecord files organized into train/, val/, and test/ splits.
Use huggingface_hub.snapshot_download to download the full dataset. This downloads all files (TFRecords, recombination scripts, README) into a local cache and returns the path. We then symlink or copy into our expected data/pouring/ directory.
Note: One test file (whiteorange_to_clear1_real) was split into two parts due to upload size limits. After downloading, run the provided shell script to recombine it. This only affects the test split — training and validation are ready to use immediately.
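A minimal download cell using huggingface_hub.snapshot_download; the local_dir staging path is an assumption you can change.

```python
from huggingface_hub import snapshot_download

# Download the full dataset repo (TFRecords, recombination script, README).
dataset_path = snapshot_download(
    repo_id="sermanet/multiview-pouring",
    repo_type="dataset",
    local_dir="data/pouring_raw",  # hypothetical staging directory
)
print("dataset downloaded to:", dataset_path)
```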
4.2 Expected semantic phases
We will reason about pouring in terms of latent phases such as:
- reach
- grasp
- lift / position
- tilt
- pour
- retract / return
These phase names are used only for qualitative interpretation of the learned representation.
5. Configuration and training
The current aegean-ai/tcc package provides:
- a typed configuration object
- an alignment algorithm corresponding to TCC
- a PyTorch training loop

The configuration controls:
- training algorithm
- dataset name
- image size
- batch size
- embedding size
- checkpoint/logging schedule
5.1 Utility: robust config editing
Research repositories evolve. Rather than assuming one exact config layout, we use helper functions that can set values safely if the corresponding fields exist. This makes the notebook more resilient to small refactors of the dataclass hierarchy. A minimal helper sketch follows below.
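A minimal sketch of such a helper; the name set_if_exists and the dotted-path convention are ours, an assumption about how we address nested config fields in this notebook.

```python
def set_if_exists(cfg, path, value):
    """Set cfg.<path> = value only if every attribute along the dotted path exists."""
    obj = cfg
    *parents, leaf = path.split(".")
    for name in parents:
        if not hasattr(obj, name):
            print(f"skipping {path!r}: no attribute {name!r}")
            return False
        obj = getattr(obj, name)
    if not hasattr(obj, leaf):
        print(f"skipping {path!r}: no attribute {leaf!r}")
        return False
    setattr(obj, leaf, value)
    return True
```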
5.2 Choose experiment settings
The assignment requires an embedding-dimension sweep:
- 32
- 64
- 128
5.3 Build a training config for one run
The training code in src/tcc/train.py expects a TCCConfig, and the default config already uses:
- datasets: [pouring]
- training_algo: alignment

For each run we override:
- embedding size
- log directory
- dataset root
- optionally train.max_iters for a shorter tutorial run

A config-building sketch follows below.
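A sketch of building one run's config with the helper above. The TCCConfig import path and the dataset-root field name are assumptions (check src/tcc/ for the real dataclass layout); the other field names come from the table in Section 2.

```python
from tcc.config import TCCConfig  # hypothetical import path, verify in the repo

EMBEDDING_SIZE = 128  # one value from the 32/64/128 sweep

cfg = TCCConfig()
set_if_exists(cfg, "model.conv_embedder.embedding_size", EMBEDDING_SIZE)
set_if_exists(cfg, "train.max_iters", 2000)                       # shorter tutorial run
set_if_exists(cfg, "logdir", f"runs/pouring_emb{EMBEDDING_SIZE}")
set_if_exists(cfg, "data.root", "data/pouring")                   # hypothetical field name
```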
6. Training
The training loop in the repo is exposed through tcc.train.train(cfg).
The logic is:
- instantiate the algorithm corresponding to cfg.training_algo
- build the dataset loader
- optimize the alignment loss
- save checkpoints in cfg.logdir
6.1 Full assignment runs
Run three experiments, one per embedding size in the sweep (32, 64, 128); a loop sketch follows below.
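A loop sketch for the three runs, reusing the assumptions from Section 5.3 (hypothetical TCCConfig import path and field names); tcc.train.train(cfg) is the entry point described in Section 6.

```python
from tcc.train import train

for emb in (32, 64, 128):
    cfg = TCCConfig()  # from the Section 5.3 sketch (hypothetical import path)
    set_if_exists(cfg, "model.conv_embedder.embedding_size", emb)
    set_if_exists(cfg, "logdir", f"runs/pouring_emb{emb}")
    train(cfg)
```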
7. Loading checkpoints and extracting embeddings
The repo provides the pieces we need:
- get_algo(...) to instantiate the TCC algorithm
- checkpoint loading utilities from tcc.train
- embedding extraction utilities from tcc.evaluate
7.1 Build the evaluation dataloader
The repo training code internally converts the top-level config into a DataConfig. We reuse the same helper if available; otherwise we build the DataConfig manually.
8. Representation diagnostics
Now we test the main scientific claim: do embeddings organize frames by task phase? We use two projection methods and two diagnostic approaches (a projection sketch follows after this list):
- PCA — linear projection preserving global variance; fast and deterministic
- UMAP — nonlinear projection revealing manifold structure; better for fine-grained phase separation
- single-video trajectory plots colored by time
- cross-video overlays in a shared projection space (joint fit, so coordinates are comparable)
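A projection sketch for a single video, assuming emb is a (T, D) NumPy array of frame embeddings extracted in Section 7 and that umap-learn is installed. For cross-video overlays, fit one projector on the concatenated embeddings of all videos so coordinates stay comparable.

```python
import numpy as np
import matplotlib.pyplot as plt
import umap
from sklearn.decomposition import PCA

t = np.arange(len(emb))  # frame index, used as the color scale

projections = {
    "PCA": PCA(n_components=2).fit_transform(emb),
    "UMAP": umap.UMAP(n_components=2).fit_transform(emb),
}

for name, xy in projections.items():
    plt.figure()
    plt.scatter(xy[:, 0], xy[:, 1], c=t, cmap="viridis", s=8)
    plt.colorbar(label="frame index")
    plt.title(f"{name} projection of one video's embedding trajectory")
plt.show()
```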
9. Temporal segmentation from embedding geometry
This section operationalizes the claim that the embedding has learned latent phase. We use two complementary segmentation strategies.
9.1 Change-point detection
If the representation changes rapidly at phase transitions, then the frame-to-frame distance $d_t = \lVert z_{t+1} - z_t \rVert_2$ between consecutive frame embeddings $z_t$ should spike near boundaries. This is a boundary-detection approach — it finds where phase transitions occur without assigning cluster labels.
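A minimal boundary-detection sketch, assuming emb is the (T, D) embedding array of one video; the spike threshold is an arbitrary illustrative choice.

```python
import numpy as np

# d_t = ||z_{t+1} - z_t||: frame-to-frame embedding distance.
d = np.linalg.norm(np.diff(emb, axis=0), axis=1)

# Frames where d_t exceeds mean + 1 std are candidate phase boundaries.
boundaries = np.where(d > d.mean() + d.std())[0]
print("candidate phase boundaries at frames:", boundaries)
```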
9.2 KMeans clustering in the native embedding space
If the embedding clusters by phase, KMeans should recover coarse phase labels. We use K = 6 to match the six expected pouring phases (reach, grasp, lift, tilt, pour, retract). Experiment with different K values to test sensitivity.
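A minimal clustering sketch in the native embedding space, again assuming emb is the (T, D) array from Section 7.

```python
from sklearn.cluster import KMeans

K = 6  # six expected pouring phases; vary this to test sensitivity
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(emb)
print(labels)  # coarse per-frame phase labels
```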
10. Embedding dimension sweep
The assignment asks you to compare the 32-, 64-, and 128-dimensional runs in terms of:
- compression
- expressiveness
- ease of clustering
- risk of overfitting appearance rather than phase
11. Write-up questions
Q1. TCN vs TCC
Explain, in your own words, the evolution from TCN to TCC. Include the role of the soft nearest-neighbor formulation in making cycle consistency differentiable.
Q2. Does the learned representation encode phase?
Use your PCA and UMAP plots to justify a claim. Compare single-video trajectories with cross-video overlays.
Q3. How well does segmentation recover phase structure?
Compare change-point detection and KMeans clustering. Do the detected boundaries align with qualitative phase transitions? Does varying K change the story?
Q4. What failure modes remain?
Examples:
- appearance variation dominating phase
- pauses causing over-segmentation
- self-similar frames across non-adjacent stages
- collapse of distinct phases into one cluster
12. Final checklist
Before submitting, verify that you have:
- trained at least one real TCC run on pouring
- extracted embeddings from a saved checkpoint
- produced PCA and UMAP trajectory plots for multiple videos
- produced a cross-video overlay using joint projection
- run both change-point detection and KMeans segmentation
- compared results across the embedding-dimension sweep (32, 64, 128)
- written answers to all four questions (Q1–Q4) in inline markdown cells, with all supporting figures embedded in the notebook

