Foreign Whispers — End-to-End Pipeline

Repository: This project has its own public GitHub repo at github.com/aegean-ai/foreign-whispers. Clone it, file issues, and submit pull requests there.

Commercial dubbing services like ElevenLabs can take a video, transcribe it, translate it, clone the speaker’s voice, and return a dubbed video in the target language.

You are going to build the same thing from open-source components. No API keys for a proprietary service. No per-minute billing. The entire pipeline runs on your own GPU server. This matters for:

Media localization — dub documentaries, lectures, or interviews into multiple languages at scale
Accessibility — make video content available to non-English-speaking audiences without manual voiceover
Research — experiment with duration-aware translation, prosody alignment, and speaker-aware TTS in a controllable pipeline
Education — learn how production ML systems compose ASR, MT, TTS, and audio engineering into a single product

This notebook orchestrates the full pipeline from YouTube URL to dubbed output video via the FWClient SDK. Each step calls the FastAPI backend, which delegates GPU work to the STT and TTS containers.

YouTube URL → Download → Transcribe → Translate → TTS (+ alignment) → Stitch → Dubbed Video
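
The flow above can be sketched as one function that chains the five stages. This is a minimal sketch, assuming a client that exposes the five methods called later in this notebook (`download`, `transcribe`, `translate`, `tts`, `stitch`) and that `download` returns a dict with a `video_id` key. The `PipelineClient` protocol and the `run_pipeline` helper are hypothetical names used for illustration, not part of the SDK.

```python
from typing import Any, Protocol


class PipelineClient(Protocol):
    """Structural type for the five calls this notebook makes on the client."""

    def download(self, url: str) -> dict[str, Any]: ...
    def transcribe(self, video_id: str) -> dict[str, Any]: ...
    def translate(self, video_id: str) -> dict[str, Any]: ...
    def tts(self, video_id: str, alignment: bool = True) -> dict[str, Any]: ...
    def stitch(self, video_id: str) -> dict[str, Any]: ...


def run_pipeline(client: PipelineClient, url: str) -> dict[str, dict[str, Any]]:
    """Run the five stages in order and return each stage's response by name.

    Hypothetical helper: it only chains the calls shown in this notebook.
    """
    results: dict[str, dict[str, Any]] = {}
    results["download"] = client.download(url)
    # Every later stage is keyed by the ID returned from the download step.
    video_id = results["download"]["video_id"]
    results["transcribe"] = client.transcribe(video_id)
    results["translate"] = client.translate(video_id)
    results["tts"] = client.tts(video_id, alignment=True)
    results["stitch"] = client.stitch(video_id)
    return results
```

Running the stages one cell at a time, as the rest of this notebook does, is easier to debug. A single function like this is mainly useful for batch runs over several URLs.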

You will demonstrate your pipeline using the Foreign Whispers Dubbing Studio — a Next.js frontend at http://localhost:8501.

Architecture

Layer	What it is	Where it runs
GPU services	Whisper STT (port 8000), Chatterbox TTS (port 8020)	Dedicated GPU containers
API	FastAPI orchestrator (port 8080) — proxies to GPU services	CPU container
`foreign_whispers` library	Alignment logic, metrics, evaluation	Pure Python — no GPU needed

┌────────────────────┐
│  API (CPU :8080)    │  orchestrates the pipeline
└──┬─────────┬───────┘
   │ HTTP    │ HTTP
   ▼         ▼
┌────────┐ ┌────────┐
│ STT    │ │ TTS    │   GPU containers
│ :8000  │ │ :8020  │
└────────┘ └────────┘
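
Before running the pipeline it can help to confirm that each service in the diagram is listening. The following is a minimal sketch using only the standard library. The port numbers come from the table above; the assumption that all three ports are published on `localhost` is mine and depends on how the Compose file maps them. `port_open` and `check_services` are hypothetical helper names.

```python
import socket

# Ports taken from the architecture table. Publishing them on localhost
# is an assumption about the Compose configuration.
SERVICES = {"api": 8080, "stt": 8000, "tts": 8020}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_services(host: str = "localhost") -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}
```

An open port only shows that something is listening. It does not prove that the model inside the container has finished loading.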

Production tools used

FastAPI. The backend is a layered FastAPI application with Pydantic schemas, dependency injection, and async request handling.
Logfire. Every pipeline step emits structured traces to Pydantic’s observability platform for debugging timing issues and comparing experiment runs.
Docker Compose. The full stack runs as four coordinated containers with GPU passthrough.

Per-stage integration notebooks

For deep-dives into individual stages, see:

Notebook	Stage
`download_integration/`	YouTube download + caption fetching
`transcription_integration/`	Whisper vs YouTube captions
`diarization_integration/`	Speaker diarization (student assignment)
`translation_integration/`	argostranslate + duration-aware re-ranking
`alignment_integration/`	Temporal alignment: metrics, policies, global optimizer
`tts_integration/`	Chatterbox TTS + voice cloning
`stitch_integration/`	Final assembly + captions

Requirements

docker compose --profile nvidia up -d   # start GPU services + API
uv sync                                 # install library deps locally
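
Containers started with `up -d` may need some time before they accept requests. The following is a minimal sketch of a retry loop, assuming a zero-argument probe that raises while the service is down (for example the `fw.healthz` method used in the setup cell below). `wait_until_ready` and its default values are hypothetical, not part of the SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until_ready(
    probe: Callable[[], T],
    attempts: int = 10,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call probe until it stops raising, then return its result.

    Re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return probe()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the real error
            sleep(delay)  # wait before trying again
    raise AssertionError("unreachable")
```

With the client from the setup cell, a call would look like `wait_until_ready(fw.healthz)`. The `sleep` parameter exists so the loop can be tested without real waiting.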

Logfire observability (recommended)

Every API call in this notebook emits a Logfire trace span. To see pipeline execution in the Logfire dashboard, authenticate once:

uv run logfire auth                     # opens browser for one-time login

After authenticating, re-run the notebook — each pipeline stage (P1–P5) will appear as a span in the Logfire dashboard with timing, video_id, and error details. If Logfire is not configured, the notebook still runs normally using a no-op shim — no traces are emitted but nothing breaks.

Setup — SDK Client and Logfire Tracing

import sys
from pathlib import Path

# Assumes the notebook runs from a directory two levels below the repo root.
PROJECT_ROOT = Path.cwd().parent.parent
sys.path.insert(0, str(PROJECT_ROOT))
print(f"Project root: {PROJECT_ROOT}")

IMAGES_DIR = Path("images")
IMAGES_DIR.mkdir(exist_ok=True)

# Load .env (LOGFIRE_TOKEN, HF_TOKEN, etc.)
from dotenv import load_dotenv
load_dotenv(PROJECT_ROOT / ".env")

from foreign_whispers import FWClient

API_URL = "http://localhost:8080"
fw = FWClient(API_URL)

# Verify API is reachable
print(fw.healthz())
print(f"SDK client ready: {fw!r}")

# Optional: Logfire tracing (no-op shim if unavailable)
try:
    import logfire
    logfire.configure(service_name="foreign-whispers-notebook")
    LOGFIRE_ENABLED = True
    print("Logfire tracing enabled.")
except Exception:
    # Logfire not installed or not authenticated — use no-op shim
    class _NoopSpan:
        def __enter__(self): return self
        def __exit__(self, *a): pass

    class _noop:
        @staticmethod
        def span(name, **kw): return _NoopSpan()
        @staticmethod
        def info(*a, **kw): pass

    logfire = _noop()
    LOGFIRE_ENABLED = False
    print("Logfire not configured — using no-op shim. Run `logfire auth` to enable.")

Pipeline Execution

Each step calls the API via the FWClient SDK. All GPU work happens in the STT/TTS containers. Results are cached on disk — re-running skips already-completed steps.
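
The skip-if-cached behavior can be illustrated with a small file-based pattern. This is a minimal sketch of the general idea, not the backend's actual implementation: the `cached_step` helper is hypothetical, and the `skipped` flag it returns only mirrors the `skipped` key that the transcribe response is read for below.

```python
import json
from pathlib import Path
from typing import Any, Callable


def cached_step(path: Path, compute: Callable[[], dict[str, Any]]) -> dict[str, Any]:
    """Return the JSON stored at path if present, otherwise compute and store it.

    The returned dict carries a "skipped" flag so callers can tell whether
    the work was actually redone.
    """
    if path.exists():
        cached = json.loads(path.read_text(encoding="utf-8"))
        return {**cached, "skipped": True}
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return {**result, "skipped": False}
```

A cache keyed only on the output path never notices changed inputs or settings, so deleting the file is the way to force a recompute in this sketch.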

P1 — Download

VIDEO_URL = "https://www.youtube.com/watch?v=GYQ5yGV_-Oc"

with logfire.span("P1.download"):
    dl = fw.download(VIDEO_URL)

video_id = dl["video_id"]
print(f"Video ID : {video_id}")
print(f"Title    : {dl['title']}")
print(f"Captions : {len(dl['caption_segments'])} segments")
for seg in dl["caption_segments"][:5]:
    print(f"  {seg}")

P2 — Transcribe

with logfire.span("P2.transcribe", video_id=video_id):
    transcript = fw.transcribe(video_id)

print(f"Language : {transcript['language']}")
print(f"Segments : {len(transcript['segments'])}")
print(f"Skipped  : {transcript.get('skipped', False)}")
print("\nFirst 3 segments:")
for seg in transcript["segments"][:3]:
    dur = seg["end"] - seg["start"]
    print(f"  [{seg['start']:.1f}s – {seg['end']:.1f}s ({dur:.1f}s)]  {seg['text'].strip()}")
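
The segment timings determine how much room the dubbed speech will have later. The following is a minimal sketch that summarizes them, assuming each segment is a dict with `start` and `end` in seconds, as in the loop above. `speech_stats` is a hypothetical helper, not part of the `foreign_whispers` library.

```python
def speech_stats(segments: list[dict]) -> dict[str, float]:
    """Summarize segments that carry "start" and "end" times in seconds."""
    if not segments:
        return {"count": 0, "speech_s": 0.0, "span_s": 0.0, "longest_s": 0.0}
    durations = [seg["end"] - seg["start"] for seg in segments]
    # Span runs from the earliest start to the latest end, pauses included.
    span = max(seg["end"] for seg in segments) - min(seg["start"] for seg in segments)
    return {
        "count": len(segments),
        "speech_s": sum(durations),
        "span_s": span,
        "longest_s": max(durations),
    }
```

Calling `speech_stats(transcript["segments"])` gives a quick view of how much of the video is speech and how long the longest segment is.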

P3 — Translate

with logfire.span("P3.translate", video_id=video_id):
    translation = fw.translate(video_id)

print(f"Target language: {translation['target_language']}")
print(f"Segments:        {len(translation['segments'])}")
print("\nFirst segment comparison:")
en_seg = transcript["segments"][0]
es_seg = translation["segments"][0]
print(f"  EN: {en_seg['text']}")
print(f"  ES: {es_seg['text']}")
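
Translated text is often longer than its source, which leaves less slack when the speech has to fit the original timing. The following is a minimal sketch that flags such segments, assuming source and translated segments line up by index (as the comparison above does) and each has a `text` key. `flag_expansions` and its 1.2 threshold are hypothetical, and character count is only a rough stand-in for spoken duration, not the metric the library uses.

```python
def flag_expansions(
    source: list[dict], target: list[dict], threshold: float = 1.2
) -> list[tuple[int, float]]:
    """Return (index, ratio) for segments whose translation is much longer.

    ratio = len(target text) / len(source text), measured after stripping
    surrounding whitespace.
    """
    flagged: list[tuple[int, float]] = []
    for i, (src, tgt) in enumerate(zip(source, target)):
        src_len = len(src["text"].strip())
        tgt_len = len(tgt["text"].strip())
        if src_len == 0:
            continue  # nothing to compare against
        ratio = tgt_len / src_len
        if ratio > threshold:
            flagged.append((i, ratio))
    return flagged
```

Calling `flag_expansions(transcript["segments"], translation["segments"])` lists the segments most likely to need attention in the alignment stage.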

P4 — TTS

with logfire.span("P4.tts", video_id=video_id):
    tts_result = fw.tts(video_id, alignment=True)

print(f"Audio path: {tts_result['audio_path']}")
print(f"Config:     {tts_result.get('config', 'N/A')}")

P5 — Stitch

with logfire.span("P5.stitch", video_id=video_id):
    stitch_result = fw.stitch(video_id)

print(f"Video path: {stitch_result['video_path']}")
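
Stitching runs on the backend with ffmpeg. To illustrate the core operation, the following is a minimal sketch that builds an ffmpeg command replacing a video's audio track with a dubbed one. It is not the backend's actual command: `build_mux_command` is a hypothetical helper, and the AAC codec choice is an assumption.

```python
from pathlib import Path


def build_mux_command(video: Path, audio: Path, output: Path) -> list[str]:
    """Build an ffmpeg argument list that swaps in a new audio track.

    The video stream is copied without re-encoding; the audio is encoded to AAC.
    """
    return [
        "ffmpeg",
        "-y",                # overwrite the output file if it exists
        "-i", str(video),    # input 0: original video
        "-i", str(audio),    # input 1: dubbed audio
        "-map", "0:v:0",     # first video stream from input 0
        "-map", "1:a:0",     # first audio stream from input 1
        "-c:v", "copy",      # copy video without re-encoding
        "-c:a", "aac",       # encode audio as AAC for MP4 compatibility
        "-shortest",         # stop at the end of the shorter stream
        str(output),
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. Building the arguments as a list avoids shell quoting problems with file names that contain spaces.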

Summary

Step	Tool	Output
P1 — Download	`yt-dlp` via API	`videos/*.mp4`, `youtube_captions/*.txt`
P2 — Transcribe	Whisper STT (GPU)	`transcriptions/whisper/*.json`
P3 — Translate	`argostranslate`	`translations/argos/*.json`
P4 — TTS	Chatterbox (GPU)	`tts_audio/chatterbox/{config}/*.wav`
P5 — Stitch	`ffmpeg`	`dubbed_videos/{config}/*.mp4`, `dubbed_captions/*.vtt`

All artifacts are cached in pipeline_data/api/, so re-running skips completed steps. For alignment analysis, evaluation, and per-stage deep-dives, see the notebooks listed under Per-stage integration notebooks.

# Show pipeline artifacts under pipeline_data/api/, grouped by stage
data_dir = PROJECT_ROOT / "pipeline_data" / "api"
artifacts = {
    "Source video": list((data_dir / "videos").glob("*.mp4")),
    "YouTube captions": list((data_dir / "youtube_captions").glob("*.txt")),
    "Transcription": list((data_dir / "transcriptions" / "whisper").glob("*.json")),
    "Translation": list((data_dir / "translations" / "argos").glob("*.json")),
    "TTS audio": list((data_dir / "tts_audio").rglob("*.wav")),
    "Dubbed video": list((data_dir / "dubbed_videos").rglob("*.mp4")),
    "Dubbed captions": list((data_dir / "dubbed_captions").glob("*.vtt")),
}

print("=== Pipeline Artifacts ===")
for label, files in artifacts.items():
    if files:
        for f in files:
            size_mb = f.stat().st_size / (1024 * 1024)
            print(f"  {label:<20} {f.name:<60} {size_mb:.1f} MB")
    else:
        print(f"  {label:<20} (not yet produced)")

# Show the first few lines of the dubbed captions (rolling two-line format)
vtt_files = list((data_dir / "dubbed_captions").glob("*.vtt"))
if vtt_files:
    print(f"\n=== Dubbed Captions Preview ({vtt_files[0].name}) ===")
    # Read as UTF-8 explicitly so accented characters survive on any platform
    for line in vtt_files[0].read_text(encoding="utf-8").splitlines()[:20]:
        print(f"  {line}")
