What you will be graded on
This project tests four things directly:
- DuckLake design principles. You can explain and defend the separation of a SQL metadata catalog from Parquet data files, snapshots and immutability, and time travel.
- Lakehouse construction and initial population on RustFS. You stand up DuckLake over a RustFS S3 layer and land real data in the raw layer.
- Version control and layered operations. You run raw to silver to gold transformations and exercise snapshots, time travel, rollback, and schema evolution on several datasets.
- Local-to-Hugging-Face interaction. You ingest Hugging Face datasets through local Docker storage into the lakehouse, and push a curated gold table back to the Hub.
Architecture
images/lakehouse.c4.yaml
Read the diagram as three separated planes plus external sources and a sink. Ingestion lands raw data in RustFS: bytes and files are written straight into the object store’s raw area from whatever source you have (a dataset hub such as Hugging Face or Kaggle, a streaming feed, or a batch drop), never through the engine. In this project your source is the Hugging Face Hub, but the lakehouse treats all sources the same way. DuckDB is the query and transform engine: it reads raw data from RustFS through the DuckLake catalog and runs the SQL that builds silver and then gold. A curated gold table is then published back to the Hub (`push_to_hub`). In the middle, DuckLake is the table format and catalog (what tables exist, their schemas, and every snapshot), and RustFS is plain object storage holding the Parquet bytes. The engine reaches storage through the catalog: DuckDB → ATTACH (SQL) → DuckLake → data files → RustFS.
That separation is the whole idea, and it mirrors how larger platforms are built (compute, catalog, and storage as independent layers): cheap immutable data files in object storage, transactional bookkeeping in a database that any number of clients can read, and a query engine that is the only thing that has to understand both.
The stack you must understand
- DuckLake keeps its catalog in a SQL database (here, a DuckDB file) and stores table data as Parquet in object storage. Every change (insert, update, delete, schema change) produces a new immutable snapshot; updates are modeled as delete-plus-insert. You read old states with time travel, for example `... AT (VERSION => 3)`. There are no primary keys, foreign keys, or `UNIQUE`/`CHECK` constraints, so data quality is your job in silver and gold. See the DuckLake docs.
- RustFS is an Apache-2.0, Rust, S3-compatible object store (a MinIO alternative) that you run in Docker. DuckLake talks to it through DuckDB’s `httpfs` extension and an S3 secret with a custom endpoint.
- Medallion layers: `raw` is data exactly as ingested; `silver` is cleaned, typed, and deduplicated; `gold` is the curated, ML-ready product (features and labels, aggregates).
- Hugging Face datasets are your sources and one of your sinks. Tabular and text columns land directly as Parquet; image and video columns are extracted to object storage with only their URIs kept in the tables (see the section on images, video, and sensor data below).
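To make the snapshot bullet concrete, here is a toy model of an append-only snapshot log. It is a sketch only: `Snapshot`, `ToyCatalog`, and the `part-000N.parquet` names are invented for illustration, and the model tracks whole files, which is a simplification of what the real catalog records. It shows why time travel is cheap: reading version `n` just means replaying the log up to `n`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable log entry: the data files a change added or removed."""
    version: int
    added: frozenset
    removed: frozenset


@dataclass
class ToyCatalog:
    """Toy stand-in for a snapshot log (not DuckLake's real catalog schema)."""
    snapshots: list = field(default_factory=list)

    def commit(self, added=(), removed=()):
        # Every change appends a new snapshot; earlier ones are never edited.
        snap = Snapshot(len(self.snapshots) + 1, frozenset(added), frozenset(removed))
        self.snapshots.append(snap)
        return snap.version

    def files_at(self, version):
        # Time travel: replay the log up to and including `version`.
        live = set()
        for snap in self.snapshots[:version]:
            live -= snap.removed
            live |= snap.added
        return live


catalog = ToyCatalog()
v1 = catalog.commit(added={"part-0001.parquet"})   # insert
v2 = catalog.commit(added={"part-0002.parquet"})   # insert
# An update is a delete plus an insert: retire one file, add its rewrite.
v3 = catalog.commit(added={"part-0003.parquet"}, removed={"part-0001.parquet"})

print(sorted(catalog.files_at(v2)))  # ['part-0001.parquet', 'part-0002.parquet']
print(sorted(catalog.files_at(v3)))  # ['part-0002.parquet', 'part-0003.parquet']
```

Note that `part-0001.parquet` is still needed to answer a query at version 2, which is why keeping every snapshot has a storage cost.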
Starter scaffold
Fork this layout and fill it in.

docker-compose.yml
Create the `rustfs-data` directory and give it to UID 10001 before the first `up` (`mkdir -p rustfs-data && sudo chown 10001 rustfs-data`), then create a bucket named `lakehouse` from the console at http://localhost:9001 or with `boto3`.
sql/00_attach.sql
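As a rough guide to what an attach script needs, the sketch below renders one from Python so the moving parts are visible as parameters. Everything passed in is an assumption for illustration: the S3 endpoint `localhost:9000`, the placeholder credentials, the catalog file name `metadata.ducklake`, the secret name `rustfs_secret`, and the alias `lake`. Substitute the values from your own `docker-compose.yml`.

```python
def render_attach_sql(
    endpoint="localhost:9000",         # assumed RustFS S3 API address
    key_id="<access-key>",             # placeholder, not a real credential
    secret="<secret-key>",             # placeholder, not a real credential
    catalog_file="metadata.ducklake",  # hypothetical catalog file name
    bucket="lakehouse",
    alias="lake",                      # hypothetical catalog alias
):
    """Return SQL text that creates an S3 secret and attaches a DuckLake catalog."""
    return f"""\
INSTALL ducklake;
INSTALL httpfs;

-- S3 secret pointing DuckDB at the local RustFS endpoint instead of AWS.
CREATE OR REPLACE SECRET rustfs_secret (
    TYPE s3,
    KEY_ID '{key_id}',
    SECRET '{secret}',
    ENDPOINT '{endpoint}',
    URL_STYLE 'path',
    USE_SSL false
);

-- The catalog lives in a local DuckDB file; table data goes to the bucket.
ATTACH 'ducklake:{catalog_file}' AS {alias} (DATA_PATH 's3://{bucket}/');
USE {alias};
"""


print(render_attach_sql())
```

Path-style URLs and disabled TLS are typical for a local S3-compatible store; whether your RustFS container needs them depends on how you configured it.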
Land your first dataset and take a snapshot
Working with images, video, and sensor data
The lessons above use tidy tables, but your two datasets are not tabular: COCO is images and VisDrone is video. The rule that makes a lakehouse work for them is simple: keep the heavy bytes in object storage and put only references and metadata in the lakehouse tables. DuckDB queries the metadata, and your data loader fetches the bytes those rows point to. DuckDB selects; the loader materializes.

Concretely, when you ingest a Hugging Face dataset, image/video/audio columns arrive as a `struct<bytes, path>`. During raw ingestion you:
- write each blob to its own object, for example `s3://lakehouse/assets/<dataset>/<column>/<file>`, and
- replace the column with a plain `string` URI that points at it (or keep the upstream URL).
The resulting `raw` or `silver` table holds URIs, labels, boxes, captions, and timestamps, never pixels. That is exactly what lets DuckDB be the query engine for a vision dataset.
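The two ingestion steps above can be sketched as a single function. This is a minimal illustration, not a prescribed implementation: `externalize_blob` and `put_object` are hypothetical names, the sample row is made up, and the sketch assumes both `bytes` and `path` are populated in the incoming struct. An in-memory dict stands in for RustFS so the example runs anywhere; in the project the uploader could wrap an S3 client pointed at your RustFS endpoint.

```python
import posixpath


def externalize_blob(row, column, dataset, put_object, bucket="lakehouse"):
    """Write row[column]'s bytes to object storage and return a row holding its URI.

    `put_object(bucket, key, data)` is a hypothetical uploader supplied by the caller.
    """
    blob = row[column]  # expected shape: {"bytes": b"...", "path": "name.jpg"}
    filename = posixpath.basename(blob["path"])
    key = f"assets/{dataset}/{column}/{filename}"
    put_object(bucket, key, blob["bytes"])

    out = dict(row)  # shallow copy so the input row is left untouched
    out[column] = f"s3://{bucket}/{key}"
    return out


# In-memory stand-in for the object store.
store = {}


def fake_put(bucket, key, data):
    store[(bucket, key)] = data


# Made-up sample row in the struct<bytes, path> shape.
row = {
    "image": {"bytes": b"\xff\xd8...", "path": "val2017/000000000139.jpg"},
    "caption": "a room with chairs",
}
raw_row = externalize_blob(row, "image", "coco", fake_put)
print(raw_row["image"])  # s3://lakehouse/assets/coco/image/000000000139.jpg
```

Only `raw_row` goes into the lakehouse table; the bytes live under the key recorded in `store`.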
COCO (computer vision: images)
Land the images as objects and the annotations as Parquet tables (for example `coco_annotations` with `image_uri`, `category`, `bbox`, `caption`). DuckDB then queries pure metadata, for example to find crowded scenes.
Your data loader then fetches `image_uri` only for the rows a query selected.
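A crowded-scenes query of the kind described above might look like the following. The sketch makes several assumptions: "crowded" is defined as at least `MIN_OBJECTS` annotations per image, the sample rows are invented, and `sqlite3` is used purely as a stand-in engine so the example runs with the standard library alone. In the project the same `SELECT` would run in DuckDB against the lakehouse table.

```python
import sqlite3

# Assumed threshold for "crowded"; tune it for your data.
MIN_OBJECTS = 3

CROWDED_SQL = """
SELECT image_uri, COUNT(*) AS n_objects
FROM coco_annotations
GROUP BY image_uri
HAVING COUNT(*) >= ?
ORDER BY n_objects DESC, image_uri
"""

# sqlite3 is only a stand-in engine for this sketch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE coco_annotations (image_uri TEXT, category TEXT)")
con.executemany(
    "INSERT INTO coco_annotations VALUES (?, ?)",
    [
        ("s3://lakehouse/assets/coco/image/a.jpg", "person"),
        ("s3://lakehouse/assets/coco/image/a.jpg", "person"),
        ("s3://lakehouse/assets/coco/image/a.jpg", "dog"),
        ("s3://lakehouse/assets/coco/image/b.jpg", "cat"),
    ],
)

crowded = con.execute(CROWDED_SQL, (MIN_OBJECTS,)).fetchall()
# Only these URIs are handed to the data loader; b.jpg is never fetched.
uris_to_fetch = [uri for uri, _ in crowded]
print(crowded)
```

The query touches one row per annotation and no pixels, which is the point of keeping only URIs in the table.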
VisDrone (video: query a fragment, not the whole clip)
Video is the case that forces the idea. Storing whole clips and scanning them per query is hopeless, so you store each clip as fragments (short byte ranges / fMP4 chunks) in S3 and build a fragment index table in the lakehouse: one row per fragment with `clip_uri`, `fragment_id`, `start_frame`, `end_frame`, `start_time`, `end_time`, and per-fragment track statistics (object counts, classes). The per-frame VisDrone tracking annotations join to it.
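One way to derive such an index from per-frame tracking rows is sketched below. The fixed fragment length of 30 frames, the 30 fps frame rate, 0-based frame numbers, the `n_objects` and `classes` column names, and the sample clip URI are all assumptions for illustration; adapt them to how you actually cut your clips.

```python
from collections import defaultdict


def build_fragment_index(annotations, frames_per_fragment=30, fps=30.0):
    """Group per-frame tracking rows into one index row per (clip, fragment).

    `annotations` is an iterable of dicts with clip_uri, frame (0-based),
    track_id, and category. Fragment length and fps are assumed values.
    """
    tracks = defaultdict(set)   # (clip_uri, fragment_id) -> distinct track ids
    classes = defaultdict(set)  # (clip_uri, fragment_id) -> distinct categories
    for ann in annotations:
        key = (ann["clip_uri"], ann["frame"] // frames_per_fragment)
        tracks[key].add(ann["track_id"])
        classes[key].add(ann["category"])

    index = []
    for clip_uri, fragment_id in sorted(tracks):
        start_frame = fragment_id * frames_per_fragment
        end_frame = start_frame + frames_per_fragment - 1
        index.append({
            "clip_uri": clip_uri,
            "fragment_id": fragment_id,
            "start_frame": start_frame,
            "end_frame": end_frame,
            "start_time": start_frame / fps,
            "end_time": (end_frame + 1) / fps,
            "n_objects": len(tracks[(clip_uri, fragment_id)]),
            "classes": sorted(classes[(clip_uri, fragment_id)]),
        })
    return index


clip = "s3://lakehouse/assets/visdrone/video/seq_0001.mp4"  # hypothetical URI
annotations = [
    {"clip_uri": clip, "frame": 0, "track_id": 1, "category": "car"},
    {"clip_uri": clip, "frame": 5, "track_id": 2, "category": "pedestrian"},
    {"clip_uri": clip, "frame": 31, "track_id": 1, "category": "car"},
]
fragment_index = build_fragment_index(annotations)
for row in fragment_index:
    print(row["fragment_id"], row["start_frame"], row["end_frame"], row["n_objects"])
```

This version emits rows only for fragments that contain at least one annotation; if you want empty fragments indexed too, you would need the clip's total frame count.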
Now “query a video fragment” is just SQL over the index, and you read only the matching fragments from S3. The loader takes the returned `(clip_uri, fragment_id)` rows, fetches just those byte ranges from RustFS, and decodes only those frames, instead of every clip. The DuckLake catalog versions the fragment index like any other table, so you can time-travel it too. This is the mechanism real multimodal data planes use; you are building a small version of it.
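The fragment selection described above can be sketched as a plain filter over the index. The table name `visdrone_fragments`, the `n_objects` column, the threshold, and the sample rows are hypothetical, and `sqlite3` again stands in for DuckDB so the example needs only the standard library.

```python
import sqlite3

BUSY_FRAGMENTS_SQL = """
SELECT clip_uri, fragment_id, start_time, end_time
FROM visdrone_fragments
WHERE n_objects >= ?
ORDER BY clip_uri, fragment_id
"""

# sqlite3 stands in for DuckDB in this sketch.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE visdrone_fragments ("
    "clip_uri TEXT, fragment_id INTEGER, start_time REAL, end_time REAL, "
    "n_objects INTEGER)"
)
clip = "s3://lakehouse/assets/visdrone/video/seq_0001.mp4"  # hypothetical URI
con.executemany(
    "INSERT INTO visdrone_fragments VALUES (?, ?, ?, ?, ?)",
    [(clip, 0, 0.0, 1.0, 4), (clip, 1, 1.0, 2.0, 27), (clip, 2, 2.0, 3.0, 31)],
)


def fragments_to_fetch(min_objects):
    """Return the (clip_uri, fragment_id) pairs the loader should materialize."""
    rows = con.execute(BUSY_FRAGMENTS_SQL, (min_objects,)).fetchall()
    return [(clip_uri, fragment_id) for clip_uri, fragment_id, _, _ in rows]


busy = fragments_to_fetch(min_objects=25)
print(busy)
```

Fragment 0 never appears in `busy`, so its bytes are never requested from the object store.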
From this assignment to production
Version-controlled, layered curation of multimodal data is exactly what production pipelines do, only at a far larger scale. NVIDIA’s NeMo Curator powers the video curation pipeline behind the Cosmos world foundation models, trained on curated video well beyond laptop scale:

“Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers.”
Cosmos World Foundation Model Platform for Physical AI, NVIDIA, 2025

Your raw, silver, and gold layers, the VisDrone fragment index, and the metadata-plus-URI tables are the same ideas a production curation pipeline relies on; what changes in production is the scale, the orchestration, and the compute, not the design.
Three-week plan
Week 1 - stand up the stack and populate raw
- Bring up RustFS and a DuckDB environment with `docker-compose`; create the `lakehouse` bucket.
- `ATTACH` DuckLake with an S3 `DATA_PATH`. Verify the split: the catalog file holds the metadata, and the Parquet objects appear in the RustFS bucket.
- Ingest COCO (images) and VisDrone (video) into the `raw` layer: write the image/video blobs as objects in the RustFS raw area, and land COCO annotations and the VisDrone fragment index as Parquet tables of URIs and metadata (not pixels).
- Checkpoint: list snapshots with `ducklake_snapshots`, and show (console or `boto3`) the raw objects in RustFS and the matching URI rows in the catalog.
Week 2 - layered transforms and version control
- `raw` to `silver`: fix types, handle missing values, deduplicate, and perform at least one schema evolution (add or rename a column) so you can watch a new snapshot appear.
- `silver` to `gold`: build ML-ready tables for both datasets (a COCO image-URI + label/caption + split table, and a VisDrone training table), and demonstrate a video-fragment query: select specific VisDrone fragments by their per-fragment statistics and confirm you read only those fragments from RustFS, not whole clips. Run the COCO metadata query (for example crowded scenes) too.
- Exercise version control on a key dataset: take a sequence of snapshots, run a time-travel query (`AT (VERSION => n)` and by timestamp), compare two snapshots, and roll back a deliberately bad transform.
- Checkpoint: a short demo of time travel and rollback, plus notes on what each snapshot changed.
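Because the catalog enforces no uniqueness, the raw-to-silver step has to do it explicitly. The sketch below shows one possible shape for that logic in plain Python: `to_silver`, `assert_unique`, the `annotation_id` column, and the sample rows are all hypothetical, and in the project you would more likely express the same steps in SQL. The order matters: types are fixed before the deduplication key is built, so `"1"` and `1` collapse to one row.

```python
def to_silver(raw_rows, key=("image_uri", "annotation_id")):
    """Drop unlabeled rows, enforce integer ids, and keep the first row per key."""
    seen = set()
    silver = []
    for row in raw_rows:
        if row.get("category") is None:  # handle missing values: drop the row
            continue
        clean = dict(row)
        clean["annotation_id"] = int(row["annotation_id"])  # fix types first
        k = tuple(clean[c] for c in key)
        if k in seen:  # duplicate key after typing: skip it
            continue
        seen.add(k)
        silver.append(clean)
    return silver


def assert_unique(rows, key):
    """Stand-in for a UNIQUE constraint: fail loudly if any key repeats."""
    keys = [tuple(r[c] for c in key) for r in rows]
    assert len(keys) == len(set(keys)), "duplicate keys in table"


# Made-up raw rows: one duplicate (after typing) and one missing label.
raw_rows = [
    {"image_uri": "a.jpg", "annotation_id": "1", "category": "person"},
    {"image_uri": "a.jpg", "annotation_id": 1, "category": "person"},
    {"image_uri": "a.jpg", "annotation_id": "2", "category": None},
    {"image_uri": "b.jpg", "annotation_id": "3", "category": "dog"},
]
silver_rows = to_silver(raw_rows)
assert_unique(silver_rows, ("image_uri", "annotation_id"))
print(len(silver_rows))  # 2
```

Running a check like `assert_unique` after every transform gives you a guard that fails at build time rather than in a downstream training run.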
Week 3 - Hugging Face round-trip and report
- Ingest additional HF data incrementally into `raw` (a new snapshot), demonstrating the local-storage-to-lakehouse flow.
- Push a gold table back to the Hub as a dataset (`Dataset.from_parquet(...).push_to_hub(...)` or `huggingface_hub`), closing the loop between your lakehouse and Hugging Face.
- Make it reproducible: `rebuild.sh` recreates the whole lakehouse from an empty bucket.
- Write the report (below) and prepare a short live demo.
Datasets
This project is fixed to two datasets so you exercise both the image path and the video path:
- COCO (computer vision: images): object detection, captions, and segmentation. Available on the Hugging Face Hub (for example `HuggingFaceM4/COCO`); use a split or subset small enough to iterate on a laptop (for example `val2017`).
- VisDrone (video): drone-view multi-object tracking and video detection (the `MOT` and `VID` splits). Use a few sequences. This is the dataset you use to demonstrate querying a video fragment instead of a whole clip.
Deliverables
- A Git repository: `docker-compose.yml`, the `sql/` and `notebooks/` (or `scripts/`) directories, `rebuild.sh`, and a `README.md` with exact run instructions.
- A populated lakehouse with `raw`, `silver`, and `gold` layers for COCO and VisDrone: image/video bytes as objects in RustFS, with URIs, annotations, and the VisDrone fragment index in the DuckLake catalog.
- A versioning demonstration: snapshots, a time-travel query, a snapshot comparison, and a rollback.
- A working video-fragment query over the VisDrone fragment index that materializes only the selected fragments from RustFS, plus the COCO metadata query.
- The Hugging Face round-trip: ingestion from the Hub and a gold dataset published back to it (include the dataset URL).
- A 2 to 3 page report answering the design-principle questions.
Report: design-principle questions
- DuckLake keeps metadata in a SQL catalog and data in Parquet on object storage. What does this separation buy you compared with a single self-contained file, and what are the consistency implications when several clients read at once?
- Updates are modeled as delete-plus-insert and every change records an immutable snapshot. Explain, in terms of snapshots and data files, how time travel and rollback work, and what keeping all snapshots costs over time.
- DuckLake has no primary keys or constraints. How did you guarantee uniqueness and quality in `silver` and `gold` without them?
- Trace a single `INSERT` into a raw table all the way to bytes: catalog entry, new snapshot, Parquet file, S3 object in RustFS. Where does each piece of state actually live?
- Why put the catalog in a SQL database at all, rather than in files alongside the data (as file-only table formats do)? What does that choice make easy, and what does it make harder?
- Your image and video bytes never enter a DuckLake table. Explain why, what the tables hold instead, and how the VisDrone fragment index lets DuckDB answer “give me the busy fragments” without scanning whole videos.
Assessment
| Weight | Criterion (maps to a project objective) |
|---|---|
| 25% | DuckLake design principles: report answers and correct, deliberate use of the catalog, snapshots, and separation |
| 25% | Lakehouse construction and initial raw population on the RustFS S3 layer |
| 30% | Version control and layered operations: working raw/silver/gold transforms, snapshots, time travel, rollback, schema evolution, and the video-fragment query |
| 20% | Local-to-Hugging-Face interaction: ingestion through local storage and a gold dataset published back to the Hub |
References
- DuckLake: documentation, the manifesto, and the v1.0 release announcement
- DuckDB S3 access: httpfs S3 API
- RustFS: GitHub and rustfs.com
- Hugging Face datasets: docs
- Production multimodal curation: Cosmos World Foundation Model Platform for Physical AI (NVIDIA), whose video curation pipeline is built with NVIDIA NeMo Curator

