AI Lakehouse

In this project you build a small but complete data lakehouse from scratch and use it the way a data/ML team would. You will run an S3-compatible object store locally (RustFS), put a SQL-catalog table format on top of it (DuckLake), organize data into raw, silver, and gold layers, and move datasets back and forth between local Docker storage and the Hugging Face Hub. Budget three weeks. The point is not to produce a large system. It is to understand why a lakehouse separates a SQL catalog from data files in object storage, and to feel what versioned table operations (snapshots, time travel, rollback, schema evolution) actually do.

What you will be graded on

This project tests four things directly:

DuckLake design principles. You can explain and defend the separation of a SQL metadata catalog from Parquet data files, snapshots and immutability, and time travel.
Lakehouse construction and initial population on RustFS. You stand up DuckLake over a RustFS S3 layer and land real data in the raw layer.
Version control and layered operations. You run raw to silver to gold transformations and exercise snapshots, time travel, rollback, and schema evolution on several datasets.
Local-to-Hugging-Face interaction. You ingest Hugging Face datasets through local Docker storage into the lakehouse, and push a curated gold table back to the Hub.

Architecture

Editable diagram source: images/lakehouse.c4.yaml

Read the diagram as three separated planes plus external sources and a sink.

Ingestion lands raw data in RustFS: bytes and files are written straight into the object store’s raw area from whatever source you have (a dataset hub such as Hugging Face or Kaggle, a streaming feed, or a batch drop), never through the engine. In this project your source is the Hugging Face Hub, but the lakehouse treats all sources the same.

DuckDB is the query and transform engine: it reads raw from RustFS through the DuckLake catalog and runs the SQL that builds silver and then gold. A curated gold table is then published back to the Hub (push_to_hub).

In the middle, DuckLake is the table format and catalog (what tables exist, their schemas, and every snapshot), and RustFS is plain object storage holding the Parquet bytes. The engine reaches storage through the catalog: DuckDB → ATTACH (SQL) → DuckLake → data files → RustFS.

That separation is the whole idea, and it mirrors how larger platforms are built (compute, catalog, and storage as independent layers): cheap immutable data files in object storage, transactional bookkeeping in a database that any number of clients can read, and a query engine that is the only thing that has to understand both.

The stack you must understand

DuckLake keeps its catalog in a SQL database (here, a DuckDB file) and stores table data as Parquet in object storage. Every change (insert, update, delete, schema change) produces a new immutable snapshot; updates are modeled as delete-plus-insert. You read old states with time travel, for example ... AT (VERSION => 3). There are no primary keys, foreign keys, or UNIQUE/CHECK constraints, so data quality is your job in silver and gold. See the DuckLake docs.
RustFS is an Apache-2.0, Rust, S3-compatible object store (a MinIO alternative) that you run in Docker. DuckLake talks to it through DuckDB’s httpfs extension and an S3 secret with a custom endpoint.
Medallion layers: raw is data exactly as ingested; silver is cleaned, typed, and deduplicated; gold is the curated, ML-ready product (features and labels, aggregates).
Hugging Face datasets are your sources and one of your sinks. Tabular and text columns land directly as Parquet; image and video columns are extracted to object storage with only their URIs kept in the tables (see the section on images, video, and sensor data below).
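
To make the snapshot vocabulary concrete, here is a toy in-memory model. It is not how DuckLake is implemented: ToyCatalog, commit, and files_at are hypothetical names, and the only assumption it encodes is the one stated above, that every change records a new immutable snapshot and an update is a delete plus an insert.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class ToyCatalog:
    """Toy stand-in for a snapshot catalog (illustration only, not DuckLake)."""

    # snapshots[v] is the frozen set of data-file names visible at version v
    snapshots: List[FrozenSet[str]] = field(default_factory=lambda: [frozenset()])

    def commit(self, add=(), remove=()) -> int:
        """Record a new snapshot; earlier snapshots are never modified."""
        current = self.snapshots[-1]
        self.snapshots.append((current - frozenset(remove)) | frozenset(add))
        return len(self.snapshots) - 1  # the new version number

    def files_at(self, version: int) -> FrozenSet[str]:
        """Time travel: the data files a reader would scan at that version."""
        return self.snapshots[version]


catalog = ToyCatalog()
v1 = catalog.commit(add=["a.parquet"])                        # insert
v2 = catalog.commit(add=["b.parquet"], remove=["a.parquet"])  # update = delete + insert
print(v1, sorted(catalog.files_at(v1)))  # the old version still sees a.parquet
print(v2, sorted(catalog.files_at(v2)))
```

Nothing is overwritten when v2 is committed, which is why reading version 1 later still works, and also why old data files keep costing storage until snapshots are expired.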

Starter scaffold

Fork this layout and fill it in.

lakehouse-project/
  docker-compose.yml
  .env                 # HF_TOKEN=...  (do not commit real tokens)
  local-store/         # local Docker host storage (staging area)
  sql/
    00_attach.sql       # extensions + S3 secret + ATTACH DuckLake
    10_raw.sql          # land datasets in the raw layer
    20_silver.sql       # raw -> silver transforms
    30_gold.sql         # silver -> gold feature/label tables
  notebooks/            # or scripts/: ingestion, transforms, HF round-trip
  rebuild.sh            # rebuild the whole lakehouse from scratch
  README.md

`docker-compose.yml`

services:
  rustfs:
    image: rustfs/rustfs:latest
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    volumes:
      - ./rustfs-data:/data   # host dir must be owned by UID 10001 (RustFS runs as non-root)
      - ./rustfs-logs:/logs
    # Default credentials are rustfsadmin / rustfsadmin. Change them for anything real,
    # and confirm the credential env-var names against the RustFS docs for your image tag.

  lab:
    image: python:3.12-slim
    working_dir: /workspace
    depends_on: [rustfs]
    volumes:
      - ./:/workspace
      - ./local-store:/data/local   # the "local storage" half of objective 4
    environment:
      AWS_ACCESS_KEY_ID: rustfsadmin
      AWS_SECRET_ACCESS_KEY: rustfsadmin
      S3_ENDPOINT: "http://rustfs:9000"
      HF_TOKEN: ${HF_TOKEN:-}
    command: sleep infinity   # exec in and run: pip install duckdb datasets huggingface_hub boto3

Create the rustfs-data directory and give it to UID 10001 before the first docker compose up (mkdir -p rustfs-data && sudo chown 10001 rustfs-data). The rustfs-logs directory is mounted the same way, so it likely needs the same ownership. Then create a bucket named lakehouse from the console at http://localhost:9001 or with boto3.
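
If you take the boto3 route, a minimal sketch could look like the following. ensure_bucket and make_client are hypothetical helper names; the environment variables are the ones set for the lab service in docker-compose.yml, and the region value is an assumed placeholder.

```python
import os


def ensure_bucket(s3, name: str) -> bool:
    """Create the bucket if it is missing; return True when it was created."""
    existing = {b["Name"] for b in s3.list_buckets().get("Buckets", [])}
    if name in existing:
        return False
    s3.create_bucket(Bucket=name)
    return True


def make_client():
    """Build a boto3 S3 client for RustFS from the compose environment variables."""
    import boto3  # imported lazily so ensure_bucket can be exercised without boto3

    return boto3.client(
        "s3",
        endpoint_url=os.environ.get("S3_ENDPOINT", "http://localhost:9000"),
        aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID", "rustfsadmin"),
        aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY", "rustfsadmin"),
        region_name="us-east-1",  # assumption: placeholder region for a local store
    )
```

From inside the lab container, calling ensure_bucket(make_client(), "lakehouse") is safe to repeat, which makes it a reasonable first step for rebuild.sh.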

`sql/00_attach.sql`

INSTALL ducklake; LOAD ducklake;
INSTALL httpfs;   LOAD httpfs;

CREATE OR REPLACE SECRET rustfs (
    TYPE s3,
    KEY_ID 'rustfsadmin',
    SECRET 'rustfsadmin',
    ENDPOINT 'rustfs:9000',   -- use 'localhost:9000' if you connect from the host
    URL_STYLE 'path',
    USE_SSL false
);

-- catalog (metadata) in a DuckDB file; data files on the RustFS bucket
ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://lakehouse/');
USE lake;
CREATE SCHEMA IF NOT EXISTS raw;
CREATE SCHEMA IF NOT EXISTS silver;
CREATE SCHEMA IF NOT EXISTS gold;
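
One detail is easy to trip over: the compose file passes S3_ENDPOINT with a scheme (http://rustfs:9000), while the secret above takes the bare host:port in ENDPOINT and the scheme as USE_SSL. If you would rather generate the secret from the environment than hard-code it, a sketch under that assumption follows; secret_sql is a hypothetical helper, not part of DuckDB.

```python
import os
from urllib.parse import urlparse


def secret_sql(endpoint_url: str, key_id: str, secret: str, name: str = "rustfs") -> str:
    """Render a CREATE SECRET statement from a URL such as http://rustfs:9000."""
    parsed = urlparse(endpoint_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"expected http(s)://host:port, got {endpoint_url!r}")
    use_ssl = "true" if parsed.scheme == "https" else "false"

    def quote(value: str) -> str:
        # SQL string literal: double any embedded single quote
        return "'" + value.replace("'", "''") + "'"

    return (
        f"CREATE OR REPLACE SECRET {name} (\n"
        f"    TYPE s3,\n"
        f"    KEY_ID {quote(key_id)},\n"
        f"    SECRET {quote(secret)},\n"
        f"    ENDPOINT {quote(parsed.netloc)},\n"
        f"    URL_STYLE 'path',\n"
        f"    USE_SSL {use_ssl}\n"
        f");"
    )


# Defaults mirror the values used in docker-compose.yml and 00_attach.sql.
print(secret_sql(
    os.environ.get("S3_ENDPOINT", "http://localhost:9000"),
    os.environ.get("AWS_ACCESS_KEY_ID", "rustfsadmin"),
    os.environ.get("AWS_SECRET_ACCESS_KEY", "rustfsadmin"),
))
```

You would run the returned statement with con.execute(...) before the ATTACH, in place of the hard-coded CREATE SECRET.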

Land your first dataset and take a snapshot

import datasets, duckdb

# 1. pull a Parquet-based HF dataset through local Docker storage
ds = datasets.load_dataset("OWNER/PARQUET_DATASET", split="train")
ds.to_parquet("/data/local/raw_tmp.parquet")

# 2. attach the lakehouse and land the data in the raw layer
con = duckdb.connect()
con.execute(open("sql/00_attach.sql").read())
con.execute("""
    CREATE TABLE raw.my_dataset AS
    SELECT * FROM read_parquet('/data/local/raw_tmp.parquet');
""")

# 3. confirm a snapshot was recorded and the bytes live in RustFS
print(con.sql("FROM ducklake_snapshots('lake')"))

Confirm the exact snapshot/time-travel function names and the S3 secret options against the DuckLake and DuckDB S3 API docs for the version you install, since DuckLake is young and evolving.

Working with images, video, and sensor data

The lessons above use tidy tables, but your two datasets are not tabular: COCO is images and VisDrone is video. The rule that makes a lakehouse work for them is simple: keep the heavy bytes in object storage and put only references and metadata in the lakehouse tables. DuckDB queries the metadata, and your data loader fetches the bytes those rows point to. DuckDB selects; the loader materializes. Concretely, when you ingest a Hugging Face dataset, image/video/audio columns arrive as a struct<bytes, path>. During raw ingestion you:

write each blob to its own object, for example s3://lakehouse/assets/<dataset>/<column>/<file>, and
replace the column with a plain string URI that points at it (or keep the upstream URL).

So a raw or silver table holds URIs, labels, boxes, captions, and timestamps, never pixels. That is exactly what lets DuckDB be the query engine for a vision dataset.
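
The two ingestion steps above can be sketched as follows. This is one possible shape, not a required one: asset_key and externalize are hypothetical names, the media value is assumed to be a dict with "bytes" and "path" keys, and naming each object by the SHA-256 of its content is an assumption made here so that re-ingesting the same blob lands on the same key. The s3 argument is expected to be a boto3 S3 client.

```python
import hashlib
import os
from typing import Optional


def asset_key(dataset: str, column: str, blob: bytes, path: Optional[str] = None) -> str:
    """Object key under assets/<dataset>/<column>/, named by content hash."""
    ext = os.path.splitext(path or "")[1].lower() or ".bin"
    digest = hashlib.sha256(blob).hexdigest()
    return f"assets/{dataset}/{column}/{digest}{ext}"


def externalize(s3, bucket: str, dataset: str, column: str, row: dict) -> dict:
    """Upload row[column]'s bytes and return a copy of the row holding only a URI."""
    media = row[column]  # assumed shape: {"bytes": b"...", "path": "name.jpg" or None}
    key = asset_key(dataset, column, media["bytes"], media.get("path"))
    s3.put_object(Bucket=bucket, Key=key, Body=media["bytes"])
    out = {k: v for k, v in row.items() if k != column}
    out[f"{column}_uri"] = f"s3://{bucket}/{key}"
    return out
```

Rows returned by externalize contain only scalars and strings, so they can be written to Parquet and landed in the raw layer like any tabular dataset.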

COCO (computer vision: images)

Land the images as objects and the annotations as Parquet tables (for example coco_annotations with image_uri, category, bbox, caption). DuckDB then queries pure metadata, for example to find crowded scenes:

SELECT image_uri, COUNT(*) AS n_people
FROM silver.coco_annotations
WHERE category = 'person'
GROUP BY image_uri
HAVING COUNT(*) >= 5
ORDER BY n_people DESC;

Your gold layer is a training table (image URI plus label/caption plus split); the loader fetches each image_uri only for the rows a query selected.
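
The loader side of "DuckDB selects; the loader materializes" can be as small as the sketch below. split_s3_uri and iter_images are hypothetical names; rows are assumed to be dicts with an image_uri key (for example built from a query result), and s3 is assumed to be a boto3 S3 client.

```python
from typing import Dict, Iterable, Iterator, Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split s3://bucket/key into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URI needs both a bucket and a key: {uri!r}")
    return bucket, key


def iter_images(s3, rows: Iterable[Dict]) -> Iterator[Tuple[Dict, bytes]]:
    """Yield (row, image_bytes), fetching only the objects the rows point to."""
    for row in rows:
        bucket, key = split_s3_uri(row["image_uri"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        yield row, body.read()
```

Because iter_images is a generator, an image is downloaded only when the training loop asks for the next item, so a query that selects 200 rows costs 200 object reads and nothing more.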

VisDrone (video: query a fragment, not the whole clip)

Video is the case that forces the idea. Storing whole clips and scanning them per query is hopeless, so you store each clip as fragments (short byte ranges / fMP4 chunks) in S3 and build a fragment index table in the lakehouse: one row per fragment with clip_uri, fragment_id, start_frame, end_frame, start_time, end_time, and per-fragment detection statistics (object counts, classes). The per-frame VisDrone detection annotations (bounding boxes per frame) join to it. Now “query a video fragment” is just SQL over the index, and you read only the matching fragments from S3:

-- pick the busiest fragments to sample for a detection model
SELECT clip_uri, fragment_id, start_frame, end_frame, n_objects
FROM silver.visdrone_fragments
WHERE n_objects > 20
ORDER BY n_objects DESC
LIMIT 100;

The loader takes those (clip_uri, fragment_id) rows, fetches just those byte ranges from RustFS, and decodes only those frames, instead of every clip. The DuckLake catalog versions the fragment index like any other table, so you can time-travel it too. This is the mechanism real multimodal data planes use; you are building a small version of it.
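
As a sketch of how the fragment index rows might be derived from per-frame annotations: build_fragment_index, frames_per_fragment, fps, and class_counts are hypothetical names and parameters, fixed-length fragments are an assumption, and n_objects here counts detections (boxes summed over the fragment's frames), not distinct tracked objects.

```python
from collections import Counter
from typing import Dict, List


def build_fragment_index(
    clip_uri: str,
    frame_objects: List[List[str]],
    frames_per_fragment: int = 30,
    fps: float = 30.0,
) -> List[Dict]:
    """One row per fragment; frame_objects[i] lists the classes detected in frame i."""
    if frames_per_fragment <= 0 or fps <= 0:
        raise ValueError("frames_per_fragment and fps must be positive")
    rows = []
    starts = range(0, len(frame_objects), frames_per_fragment)
    for fragment_id, start in enumerate(starts):
        end = min(start + frames_per_fragment, len(frame_objects)) - 1  # inclusive
        classes = Counter(c for frame in frame_objects[start:end + 1] for c in frame)
        rows.append({
            "clip_uri": clip_uri,
            "fragment_id": fragment_id,
            "start_frame": start,
            "end_frame": end,
            "start_time": start / fps,        # seconds from the start of the clip
            "end_time": (end + 1) / fps,
            "n_objects": sum(classes.values()),
            "class_counts": dict(classes),
        })
    return rows


frames = [["car"], ["car", "pedestrian"], [], ["car"], ["bus"]]
index = build_fragment_index("s3://lakehouse/assets/visdrone/clip0", frames,
                             frames_per_fragment=2, fps=2.0)
for row in index:
    print(row["fragment_id"], row["start_frame"], row["end_frame"], row["n_objects"])
```

Written to Parquet and landed as silver.visdrone_fragments, rows of this shape are what the query above filters on.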

The importance of lakehouses in the AI industry

Version-controlled, layered curation of multimodal data is exactly what production pipelines do, only at a far larger scale. NVIDIA’s NeMo Curator powers the video curation pipeline behind the Cosmos world foundation models, trained on curated video well beyond laptop scale:

“Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers.” Cosmos World Foundation Model Platform for Physical AI, NVIDIA, 2025

Your raw, silver, and gold layers, the VisDrone fragment index, and the metadata-plus-URI tables are the same ideas a production curation pipeline relies on; what changes in production is the scale, the orchestration, and the compute, not the design.

Three-week plan

Week 1 - stand up the stack and populate raw

Bring up RustFS and a DuckDB environment with docker-compose; create the lakehouse bucket.
ATTACH DuckLake with an S3 DATA_PATH. Verify the split: the catalog file holds the metadata, and the Parquet objects appear in the RustFS bucket.
Ingest COCO (images) and VisDrone (video) into the raw layer: write the image/video blobs as objects in the RustFS raw area, and land COCO annotations and the VisDrone fragment index as Parquet tables of URIs and metadata (not pixels).
Checkpoint: list snapshots with ducklake_snapshots, and show (console or boto3) the raw objects in RustFS and the matching URI rows in the catalog.
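
For the boto3 half of the checkpoint, a minimal sketch: list_keys and count_by_suffix are hypothetical helpers, and s3 is assumed to be a boto3 S3 client pointed at RustFS.

```python
import os
from collections import Counter
from typing import List


def list_keys(s3, bucket: str, prefix: str = "") -> List[str]:
    """Return every object key under prefix, following pagination."""
    keys: List[str] = []
    token = None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix}
        if token:
            kwargs["ContinuationToken"] = token
        page = s3.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        token = page["NextContinuationToken"]


def count_by_suffix(keys: List[str]) -> Counter:
    """Count objects per file extension, e.g. '.parquet' versus '.jpg'."""
    return Counter(os.path.splitext(key)[1] for key in keys)
```

Printing count_by_suffix(list_keys(s3, "lakehouse")) after an ingest gives a quick view of the split: Parquet data files written by DuckLake next to the media objects you uploaded yourself, with the catalog living elsewhere in metadata.ducklake.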

Week 2 - layered transforms and version control

raw to silver: fix types, handle missing values, deduplicate, and perform at least one schema evolution (add or rename a column) so you can watch a new snapshot appear.
silver to gold: build ML-ready tables for both datasets (a COCO image-URI + label/caption + split table, and a VisDrone training table), and demonstrate a video-fragment query: select specific VisDrone fragments by their per-fragment statistics and confirm you read only those fragments from RustFS, not whole clips. Run the COCO metadata query (for example crowded scenes) too.
Exercise version control on a key dataset: take a sequence of snapshots, run a time-travel query (AT (VERSION => n) and by timestamp), compare two snapshots, and roll back a deliberately bad transform.
Checkpoint: a short demo of time travel and rollback, plus notes on what each snapshot changed.
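
For the time-travel and comparison steps, small helpers that build the SQL keep the demo readable. This is a sketch: at_clause and rows_added_sql are hypothetical names, the AT (VERSION => n) form is the one quoted earlier, and both the TIMESTAMP literal form and the use of AT inside an EXCEPT query are assumptions to confirm against the DuckLake docs for your version. The table name is interpolated as-is, so pass only trusted identifiers.

```python
from datetime import datetime
from typing import Optional


def at_clause(version: Optional[int] = None, timestamp: Optional[datetime] = None) -> str:
    """Build a time-travel clause from exactly one of version or timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"AT (VERSION => {int(version)})"
    # assumption: a TIMESTAMP literal is accepted here; check your DuckLake version
    return f"AT (TIMESTAMP => TIMESTAMP '{timestamp.isoformat(sep=' ')}')"


def rows_added_sql(table: str, old: int, new: int) -> str:
    """Rows visible at version `new` but not at version `old` (set difference)."""
    return (
        f"SELECT * FROM {table} {at_clause(version=new)}\n"
        f"EXCEPT\n"
        f"SELECT * FROM {table} {at_clause(version=old)}"
    )


print(rows_added_sql("silver.coco_annotations", old=2, new=3))
```

Swapping old and new gives the rows that were removed. A set difference like this only works when both versions have the same columns, so compare snapshots on either side of a schema change column by column instead.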

Week 3 - Hugging Face round-trip and report

Ingest additional HF data incrementally into raw (a new snapshot), demonstrating the local-storage-to-lakehouse flow.
Push a gold table back to the Hub as a dataset (Dataset.from_parquet(...).push_to_hub(...) or huggingface_hub), closing the loop between your lakehouse and Hugging Face.
Make it reproducible: rebuild.sh recreates the whole lakehouse from an empty bucket.
Write the report (below) and prepare a short live demo.

Datasets

This project is fixed to two datasets so you exercise both the image path and the video path:

COCO (computer vision: images): object detection, captions, and segmentation. Available on the Hugging Face Hub (for example HuggingFaceM4/COCO); use a split or subset small enough to iterate on a laptop (for example val2017).
VisDrone (video): drone-view object detection in videos, the VisDrone-Dataset Task 2 (object detection in videos) with its VID sequences and per-frame bounding-box annotations. Use a few sequences. This is the dataset you use to demonstrate querying a video fragment instead of a whole clip.

In addition to these two sources, you produce one gold table that you publish back to the Hub. Land the heavy media as objects in RustFS and keep annotations, the fragment index, and URIs in DuckLake tables, as described in “Working with images, video, and sensor data”.

Deliverables

A Git repository: docker-compose.yml, the sql/ and notebooks/ (or scripts/), rebuild.sh, and a README.md with exact run instructions.
A populated lakehouse with raw, silver, and gold layers for COCO and VisDrone: image/video bytes as objects in RustFS, with URIs, annotations, and the VisDrone fragment index in the DuckLake catalog.
A versioning demonstration: snapshots, a time-travel query, a snapshot comparison, and a rollback.
A working video-fragment query over the VisDrone fragment index that materializes only the selected fragments from RustFS, plus the COCO metadata query.
The Hugging Face round-trip: ingestion from the Hub and a gold dataset published back to it (include the dataset URL).
A 2 to 3 page report answering the design-principle questions.

Report: design-principle questions

DuckLake keeps metadata in a SQL catalog and data in Parquet on object storage. What does this separation buy you compared with a single self-contained file, and what are the consistency implications when several clients read at once?
Updates are modeled as delete-plus-insert and every change records an immutable snapshot. Explain, in terms of snapshots and data files, how time travel and rollback work, and what keeping all snapshots costs over time.
DuckLake has no primary keys or constraints. How did you guarantee uniqueness and quality in silver and gold without them?
Trace a single INSERT into a raw table all the way to bytes: catalog entry, new snapshot, Parquet file, S3 object in RustFS. Where does each piece of state actually live?
Why put the catalog in a SQL database at all, rather than in files alongside the data (as file-only table formats do)? What does that choice make easy, and what does it make harder?
Your image and video bytes never enter a DuckLake table. Explain why, what the tables hold instead, and how the VisDrone fragment index lets DuckDB answer “give me the busy fragments” without scanning whole videos.

Assessment

Weight	Criterion (maps to a project objective)
25%	DuckLake design principles: report answers and correct, deliberate use of the catalog, snapshots, and separation
25%	Lakehouse construction and initial raw population on the RustFS S3 layer
30%	Version control and layered operations: working raw/silver/gold transforms, snapshots, time travel, rollback, schema evolution, and the video-fragment query
20%	Local-to-Hugging-Face interaction: ingestion through local storage and a gold dataset published back to the Hub

References

DuckLake: documentation, the manifesto, and the v1.0 release announcement
DuckLake book: DuckLake: The Definitive Guide, “CRUD operations” (O’Reilly)
DuckDB S3 access: httpfs S3 API
RustFS: GitHub and rustfs.com
Hugging Face datasets: docs
Production multimodal curation: Cosmos World Foundation Model Platform for Physical AI (NVIDIA), whose video curation pipeline is built with NVIDIA NeMo Curator
