Foreign Whispers — End-to-End Pipeline
Repository: This project has its own public GitHub repo at github.com/aegean-ai/foreign-whispers. Clone it, file issues, and submit pull requests there.
Commercial dubbing services like ElevenLabs can take a video, transcribe it, translate it, clone the speaker’s voice, and return a dubbed video in the target language. Watch their demo below:
You are going to build the same thing from open-source components. No API keys to a proprietary service. No per-minute billing. The entire pipeline runs on your own GPU server.
Where this matters:
- Media localization — dub documentaries, lectures, or interviews into multiple languages at scale
- Accessibility — make video content available to non-English-speaking audiences without manual voiceover
- Research — experiment with duration-aware translation, prosody alignment, and speaker-aware TTS in a controllable pipeline
- Education — learn how production ML systems compose ASR, MT, TTS, and audio engineering into a single product
This notebook drives the pipeline through the FWClient SDK. Each step calls the FastAPI backend, which delegates GPU work to the STT and TTS containers.

Architecture
| Layer | What it is | Where it runs |
|---|---|---|
| GPU services | Whisper STT (port 8000), Chatterbox TTS (port 8020) | Dedicated GPU containers |
| API | FastAPI orchestrator (port 8080) — proxies to GPU services | CPU container |
| foreign_whispers library | Alignment logic, metrics, evaluation | Pure Python — no GPU needed |
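The port layout in the table above can be captured in a small lookup, which is handy when pointing a client or a health check at the right service. This is a minimal sketch: the ports come from the table, while the `localhost` host, the `http` scheme, and the names `SERVICE_PORTS` and `base_url` are assumptions made for illustration and are not part of the project's SDK.

```python
# Ports are taken from the architecture table; everything else is illustrative.
SERVICE_PORTS = {
    "stt": 8000,  # Whisper STT (GPU container)
    "tts": 8020,  # Chatterbox TTS (GPU container)
    "api": 8080,  # FastAPI orchestrator (CPU container)
}


def base_url(service: str, host: str = "localhost") -> str:
    """Return the base URL for a named service (hypothetical helper)."""
    if service not in SERVICE_PORTS:
        known = ", ".join(sorted(SERVICE_PORTS))
        raise ValueError(f"unknown service {service!r}; expected one of: {known}")
    return f"http://{host}:{SERVICE_PORTS[service]}"
```

Because the orchestrator proxies to the GPU services, a client would normally only need `base_url("api")`; the other two entries are mainly useful for debugging a container directly.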
Production tools used
- FastAPI. The backend is a layered FastAPI application with Pydantic schemas, dependency injection, and async request handling.
- Logfire. Every pipeline step emits structured traces to Pydantic’s observability platform for debugging timing issues and comparing experiment runs.
- Docker Compose. The full stack runs as four coordinated containers with GPU passthrough.
Per-stage integration notebooks
For deep-dives into individual stages, see:
| Notebook | Stage |
|---|---|
| download_integration/ | YouTube download + caption fetching |
| transcription_integration/ | Whisper vs YouTube captions |
| diarization_integration/ | Speaker diarization (student assignment) |
| translation_integration/ | argostranslate + duration-aware re-ranking |
| alignment_integration/ | Temporal alignment: metrics, policies, global optimizer |
| tts_integration/ | Chatterbox TTS + voice cloning |
| stitch_integration/ | Final assembly + captions |
Requirements
Logfire observability (recommended)
Every API call in this notebook emits a Logfire trace span. To see pipeline execution in the Logfire dashboard, authenticate once:
Setup — SDK Client and Logfire Tracing
Pipeline Execution
Each step calls the API via the FWClient SDK. All GPU work happens in the STT/TTS containers. Results are cached on disk, so re-running skips already-completed steps.
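The skip-if-cached behaviour described above can be illustrated with a small helper. This is a sketch of the general pattern only, not the project's actual caching code: the function name `run_cached`, the choice of JSON as the cache format, and the temporary-file write are all assumptions.

```python
import json
from pathlib import Path
from typing import Any, Callable


def run_cached(output_path: Path, compute: Callable[[], Any]) -> Any:
    """Return the cached JSON result if it exists, else compute and store it.

    Hypothetical helper: the real pipeline may key and store its cache differently.
    """
    if output_path.exists():
        # A previous run already produced this output, so skip the work.
        return json.loads(output_path.read_text(encoding="utf-8"))

    result = compute()
    output_path.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temporary file first, then rename, so an interrupted run
    # never leaves a half-written file that a later run would treat as done.
    tmp_path = output_path.with_name(output_path.name + ".tmp")
    tmp_path.write_text(json.dumps(result), encoding="utf-8")
    tmp_path.replace(output_path)
    return result
```

With this pattern, deleting a single output file is enough to force that one step to run again while the remaining steps stay cached.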
P1 — Download
P2 — Transcribe
P3 — Translate
P4 — TTS
P5 — Stitch
Summary
| Step | Tool | Output |
|---|---|---|
| P1 — Download | yt-dlp via API | videos/*.mp4, youtube_captions/*.txt |
| P2 — Transcribe | Whisper STT (GPU) | transcriptions/whisper/*.json |
| P3 — Translate | argostranslate | translations/argos/*.json |
| P4 — TTS | Chatterbox (GPU) | tts_audio/chatterbox/\{config\}/*.wav |
| P5 — Stitch | ffmpeg | dubbed_videos/\{config\}/*.mp4, dubbed_captions/*.vtt |
All outputs are written under pipeline_data/api/. Re-running skips completed steps.
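To see which steps have already produced output, the paths in the summary table can be turned into glob patterns and counted. This is a rough sketch: the names `STEP_OUTPUTS` and `count_outputs` are hypothetical, `*` stands in for the `{config}` directory, and it assumes the table's paths are relative to `pipeline_data/api/`.

```python
from pathlib import Path

# Glob patterns copied from the summary table; "*" replaces {config}.
STEP_OUTPUTS = {
    "P1 Download": ["videos/*.mp4", "youtube_captions/*.txt"],
    "P2 Transcribe": ["transcriptions/whisper/*.json"],
    "P3 Translate": ["translations/argos/*.json"],
    "P4 TTS": ["tts_audio/chatterbox/*/*.wav"],
    "P5 Stitch": ["dubbed_videos/*/*.mp4", "dubbed_captions/*.vtt"],
}


def count_outputs(root: Path) -> dict[str, int]:
    """Count the files under root that match each step's output patterns."""
    return {
        step: sum(len(list(root.glob(pattern))) for pattern in patterns)
        for step, patterns in STEP_OUTPUTS.items()
    }
```

Calling `count_outputs(Path("pipeline_data/api"))` returns one count per step; a step that reports zero files has not yet run for any video.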
For alignment analysis, evaluation, and per-stage deep-dives, see the integration notebooks listed in the intro.

