Repository: This project has its own public GitHub repo at github.com/aegean-ai/foreign-whispers. Clone it, file issues, and submit pull requests there.
What you are building
An open-source video dubbing pipeline that takes a YouTube video in English and produces a dubbed version in Spanish (or another target language), using only local GPU resources — no paid APIs. The full pipeline:
You are going to build the same thing from open-source components. No API keys to a proprietary service. No per-minute billing. The entire pipeline runs on your own GPU server.
You will demonstrate your pipeline using the Foreign Whispers Dubbing Studio — a Next.js frontend at http://localhost:8501.

foreign_whispers Python library — it contains the alignment logic, evaluation metrics, and helper functions that the tasks ask you to implement or extend.
Architecture
| Layer | What it is | Where it runs |
|---|---|---|
| GPU services | Whisper STT (port 8000), Chatterbox TTS (port 8020) | Dedicated GPU containers |
| API | FastAPI orchestrator (port 8080) — proxies to GPU services | CPU container |
foreign_whispers library | Alignment logic, metrics, evaluation | Pure Python — no GPU needed |
| Frontend | Next.js Dubbing Studio (port 8501) | Node container |
Per-stage deep-dives
Each pipeline stage has its own integration notebook with detailed analysis and tasks.End-to-end pipeline — from YouTube URL to dubbed video
End-to-end pipeline — from YouTube URL to dubbed video
Orchestrates the full pipeline (P1-P5) via the
FWClient SDK. Each step calls the FastAPI backend, which delegates GPU work to the STT and TTS containers. Results are cached on disk — re-running skips completed steps.Open the end-to-end pipeline notebookDownload — YouTube video and caption fetching
Download — YouTube video and caption fetching
Downloads the source video and closed captions via
yt-dlp through the FastAPI backend. Produces MP4 files and caption JSON in pipeline_data/api/.Open the download integration notebookTranscription — Whisper STT vs YouTube captions
Transcription — Whisper STT vs YouTube captions
Compares YouTube captions (fast, no GPU) against Whisper STT (slower, more accurate timestamps). Examines segment duration distributions and JSON structure.Open the transcription integration notebook
Diarization — speaker identification with pyannote
Diarization — speaker identification with pyannote
Wire pyannote speaker diarization into the pipeline so multi-speaker videos produce per-speaker labeled segments. Includes 5 tasks: merge function, API endpoint, transcription merge, frontend integration, and per-speaker TTS voice selection.Open the diarization integration notebook
Translation — argostranslate and duration-aware re-ranking
Translation — argostranslate and duration-aware re-ranking
Translates transcription segments from English to a target language using argostranslate. Analyzes translation length expansion and its impact on TTS timing budgets.Open the translation integration notebook
Alignment — temporal alignment metrics, policies, and global optimization
Alignment — temporal alignment metrics, policies, and global optimization
The hard problem: a 3-second English phrase might take 5 seconds in Spanish. Covers segment stretch metrics, fallback policy (accept / mild stretch / gap shift / request shorter / fail), and global timeline optimization.Open the alignment integration notebook
TTS — Chatterbox text-to-speech and voice cloning
TTS — Chatterbox text-to-speech and voice cloning
Synthesizes target-language speech using Chatterbox TTS. Compares baseline (no alignment) vs aligned (time-stretched) modes. Includes tasks for voice resolution and per-speaker voice assignment.Open the TTS integration notebook
Stitch — final video assembly with ffmpeg
Stitch — final video assembly with ffmpeg
Combines the original video with dubbed TTS audio and rolling two-line VTT captions. Uses audio-only remux (no video re-encoding) to preserve original video quality.Open the stitch integration notebook
Phase 1: Environment setup and end-to-end run
Step 1 — Clone the repository
Step 2 — Configure environment variables
Create a.env file at the project root:
Step 3 — Start the Docker stack
The pipeline runs as four containers: an API orchestrator (CPU), a Whisper STT server (GPU), a Chatterbox TTS server (GPU), and a Next.js frontend.- API:
curl http://localhost:8080/healthz - Frontend: open
http://localhost:8501in your browser
Step 4 — Install the local Python library
foreign_whispers Python package (alignment logic, metrics, evaluation) locally. No GPU needed for this package.
Step 5 — (Optional) Set up Logfire observability
Step 6 — Run the end-to-end pipeline notebook
Opennotebooks/pipeline_end_to_end/pipeline_end_to_end.ipynb and run all cells. This executes the five pipeline stages (P1-P5) on a sample YouTube video and produces a dubbed output. Watch the Dubbing Studio frontend at http://localhost:8501 to see the result.
Understand what each stage produces:
| Stage | What it does | Output location |
|---|---|---|
| P1 — Download | Fetches video + YouTube captions via yt-dlp | pipeline_data/api/videos/, youtube_captions/ |
| P2 — Transcribe | Runs Whisper STT on the audio track | pipeline_data/api/transcriptions/whisper/ |
| P3 — Translate | Translates EN to ES via argostranslate | pipeline_data/api/translations/argos/ |
| P4 — TTS | Synthesizes Spanish speech via Chatterbox | pipeline_data/api/tts_audio/chatterbox/ |
| P5 — Stitch | Remuxes video with dubbed audio + VTT captions | pipeline_data/api/dubbed_videos/, dubbed_captions/ |
pipeline_data/api/. Re-running skips completed steps.
Phase 2: Integration notebooks
Work through these notebooks in order. Each one deep-dives into a pipeline stage and contains tasks markedYOUR CODE HERE.
Notebook 1: Download integration — how yt-dlp fetches video files and closed captions through the FastAPI backend
Notebook: notebooks/download_integration/download_integration.ipynb
You will inspect downloaded artifacts and visualize the caption timeline.
No coding tasks — this is an exploration notebook. Make sure you understand the data format (segment dicts with start, end, text fields) before moving on.
Notebook 2: Transcription integration — YouTube captions vs Whisper STT
Notebook:notebooks/transcription_integration/transcription_integration.ipynb
You will compare segment duration distributions between YouTube captions (fast, no GPU) and Whisper STT (slower, more accurate timestamps).
No coding tasks — but pay attention to the segment JSON structure. Every downstream stage consumes this format.
Notebook 3: Translation integration — unconstrained translation and its timing consequences
Notebook:notebooks/translation_integration/translation_integration.ipynb
argostranslate produces unconstrained translations. Some languages expand text by ~10-30% versus English, which causes timing problems for TTS.
Your task — Duration-aware re-ranking:
- File to modify:
foreign_whispers/reranking.py - Function:
get_shorter_translations()— currently a stub returning an empty list - Goal: Generate shorter Spanish translation candidates that fit within a TTS duration budget (~15 chars/second for Spanish)
- Approaches to consider: rule-based truncation, multi-backend translation (run the same text through different engines like argostranslate, MarianMT, and a local LLM, then pick the shortest candidate that preserves meaning), LLM candidate generation, or hybrid
Notebook 4: Diarization integration
Notebook:notebooks/diarization_integration/diarization_integration.ipynb
This is the largest notebook with 5 tasks. Work through them sequentially.
Task 1 — assign_speakers merge function (pure Python, no GPU)
- File to modify:
foreign_whispers/diarization.py - Write a function that assigns speaker labels to transcription segments using temporal overlap with diarization output
- Tests are provided — run them first (TDD), implement, re-run until all 4 pass
- Files to create:
api/src/schemas/diarize.py,api/src/routers/diarize.py - Files to modify:
api/src/main.py,api/src/core/config.py - Create
POST /api/diarize/\{video_id\}that extracts audio, runs pyannote, caches results
- File to modify:
api/src/routers/diarize.py - After diarization, update the transcription JSON so each segment has a
speakerfield
- Files to modify:
frontend/src/lib/api.ts,frontend/src/lib/types.ts,frontend/src/hooks/use-pipeline.ts,frontend/src/components/pipeline-table.tsx,frontend/src/components/pipeline-status-bar.tsx - Add the diarize stage to the Next.js frontend between transcribe and translate
- Files to modify:
api/src/routers/tts.py,api/src/services/tts_service.py - When speaker labels exist, use different Chatterbox reference voices per speaker
Notebook 5: Alignment integration
Notebook:notebooks/alignment_integration/alignment_integration.ipynb
This is the most analytically demanding notebook with 4 tasks.
Task 1 — Improve TTS duration prediction
- File to modify:
foreign_whispers/alignment.py— the_estimate_durationhelper - Replace the crude ~15 chars/s heuristic with a better predictor (syllable-based, regression model trained on ground-truth TTS durations)
- File to modify:
foreign_whispers/reranking.py - For segments tagged
REQUEST_SHORTER, generate shorter candidates that fit the timing budget
- File to modify:
foreign_whispers/alignment.py - Implement
global_align_dp()using DP, ILP, or beam search to beat the greedy left-to-right scheduler - Compare total drift, severe stretch count, and overlap count
- File to modify:
foreign_whispers/evaluation.py - Design a multi-dimensional evaluation: timing accuracy, intelligibility (STT round-trip), semantic fidelity (embedding similarity), naturalness (speaking rate variance)
Notebook 6: TTS integration — baseline vs aligned modes and voice cloning
Notebook:notebooks/tts_integration/tts_integration.ipynb
Task 1 — Understand the Chatterbox client (read-only)
- Study how
tts.pyalready supportsspeaker_wavkwargs
- File to create:
foreign_whispers/voice_resolution.py - Implement
resolve_speaker_wav()with fallback chain: speaker-specific, then language default, then global default - 5 tests provided — run first, implement, re-run
speaker_wav to the TTS API
- Files to modify:
api/src/core/config.py,api/src/routers/tts.py,api/src/services/tts_service.py - Expose speaker selection as a query parameter
- File to modify:
api/src/routers/tts.py - When diarized segments exist, build a speaker-to-voice mapping and switch voices per segment
Notebook 7: Stitch integration — final assembly with ffmpeg audio remux and VTT captions
Notebook:notebooks/stitch_integration/stitch_integration.ipynb
The video track is copied as-is (no re-encoding); only the audio stream is replaced with TTS output. Rolling two-line VTT captions are generated alongside.
No coding tasks — verify your dubbed output plays correctly with captions.

