
Foreign Whispers — End-to-End Pipeline

Repository: This project has its own public GitHub repo at github.com/aegean-ai/foreign-whispers. Clone it, file issues, and submit pull requests there.
Commercial dubbing services like ElevenLabs can take a video, transcribe it, translate it, clone the speaker’s voice, and return a dubbed video in the target language.
You are going to build the same thing from open-source components. No API keys to a proprietary service. No per-minute billing. The entire pipeline runs on your own GPU server. Where this matters:
  • Media localization — dub documentaries, lectures, or interviews into multiple languages at scale
  • Accessibility — make video content available to non-English-speaking audiences without manual voiceover
  • Research — experiment with duration-aware translation, prosody alignment, and speaker-aware TTS in a controllable pipeline
  • Education — learn how production ML systems compose ASR, MT, TTS, and audio engineering into a single product
This notebook orchestrates the full pipeline from YouTube URL to dubbed output video via the FWClient SDK. Each step calls the FastAPI backend, which delegates GPU work to the STT and TTS containers.
YouTube URL → Download → Transcribe → Translate → TTS (+ alignment) → Stitch → Dubbed Video
You will demonstrate your pipeline using the Foreign Whispers Dubbing Studio, a Next.js frontend at http://localhost:8501.

Architecture

| Layer | What it is | Where it runs |
| --- | --- | --- |
| GPU services | Whisper STT (port 8000), Chatterbox TTS (port 8020) | Dedicated GPU containers |
| API | FastAPI orchestrator (port 8080), proxies to GPU services | CPU container |
| foreign_whispers library | Alignment logic, metrics, evaluation | Pure Python, no GPU needed |

┌────────────────────┐
│  API (CPU :8080)   │  orchestrates the pipeline
└──┬─────────┬───────┘
   │ HTTP    │ HTTP
   ▼         ▼
┌────────┐ ┌────────┐
│ STT    │ │ TTS    │   GPU containers
│ :8000  │ │ :8020  │
└────────┘ └────────┘
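
Before running the pipeline it can help to confirm that the three ports in the diagram are accepting connections. The following is a minimal sketch that only probes TCP ports; it assumes the services are published on `localhost`, and `SERVICES`, `port_open`, and `check_services` are hypothetical names rather than part of the foreign_whispers library. An open port does not prove the service behind it is healthy.

```python
import socket

# Ports come from the architecture table; the host is an assumption.
SERVICES = {"api": 8080, "stt": 8000, "tts": 8020}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_services(host: str = "localhost") -> dict[str, bool]:
    """Probe every service port and report which ones accept connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}
```

Calling `check_services()` returns a dictionary such as `{"api": True, "stt": False, "tts": True}`, which makes it easy to see which container failed to start.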

Production tools used

  • FastAPI. The backend is a layered FastAPI application with Pydantic schemas, dependency injection, and async request handling.
  • Logfire. Every pipeline step emits structured traces to Pydantic’s observability platform for debugging timing issues and comparing experiment runs.
  • Docker Compose. The full stack runs as four coordinated containers with GPU passthrough.

Per-stage integration notebooks

For deep-dives into individual stages, see:
| Notebook | Stage |
| --- | --- |
| download_integration/ | YouTube download + caption fetching |
| transcription_integration/ | Whisper vs YouTube captions |
| diarization_integration/ | Speaker diarization (student assignment) |
| translation_integration/ | argostranslate + duration-aware re-ranking |
| alignment_integration/ | Temporal alignment: metrics, policies, global optimizer |
| tts_integration/ | Chatterbox TTS + voice cloning |
| stitch_integration/ | Final assembly + captions |

Requirements

docker compose --profile nvidia up -d   # start GPU services + API
uv sync                                 # install library deps locally
Every API call in this notebook emits a Logfire trace span. To see pipeline execution in the Logfire dashboard, authenticate once:
uv run logfire auth                     # opens browser for one-time login
After authenticating, re-run the notebook — each pipeline stage (P1–P5) will appear as a span in the Logfire dashboard with timing, video_id, and error details. If Logfire is not configured, the notebook still runs normally using a no-op shim — no traces are emitted but nothing breaks.

Setup — SDK Client and Logfire Tracing

import sys
import os
from pathlib import Path

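# Resolves to the directory two levels above the current working directory,
# so run this notebook from its own folder inside the repository.
PROJECT_ROOT = Path.cwd().parent.parent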
sys.path.insert(0, str(PROJECT_ROOT))
print(f"Project root: {PROJECT_ROOT}")

IMAGES_DIR = Path("images")
IMAGES_DIR.mkdir(exist_ok=True)

# Load .env (LOGFIRE_TOKEN, HF_TOKEN, etc.)
from dotenv import load_dotenv
load_dotenv(PROJECT_ROOT / ".env")

from foreign_whispers import FWClient

API_URL = "http://localhost:8080"
fw = FWClient(API_URL)

# Verify API is reachable
print(fw.healthz())
print(f"SDK client ready: {fw!r}")

# Optional: Logfire tracing (no-op shim if unavailable)
try:
    import logfire
    logfire.configure(service_name="foreign-whispers-notebook")
    LOGFIRE_ENABLED = True
    print("Logfire tracing enabled.")
except Exception:
    # Logfire not installed or not authenticated — use no-op shim
    class _NoopSpan:
        def __enter__(self): return self
        def __exit__(self, *a): pass

    class _noop:
        @staticmethod
        def span(name, **kw): return _NoopSpan()
        @staticmethod
        def info(*a, **kw): pass

    logfire = _noop()
    LOGFIRE_ENABLED = False
    print("Logfire not configured — using no-op shim. Run `logfire auth` to enable.")

Pipeline Execution

Each step calls the API via the FWClient SDK. All GPU work happens in the STT/TTS containers. Results are cached on disk — re-running skips already-completed steps.
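
The skip-on-rerun behaviour described above can be pictured with a small sketch. It is an illustration only: `run_cached` and the artifact path are hypothetical, JSON is assumed as the artifact format, and the real API may decide what counts as a completed step differently.

```python
import json
from pathlib import Path
from typing import Any, Callable


def run_cached(artifact: Path, compute: Callable[[], Any]) -> tuple[Any, bool]:
    """Return (result, skipped); reuse the JSON artifact if it already exists."""
    if artifact.exists():
        return json.loads(artifact.read_text(encoding="utf-8")), True

    result = compute()
    artifact.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, then rename, so an interrupted run
    # never leaves a partial artifact that would be mistaken for a finished step.
    tmp = artifact.with_suffix(artifact.suffix + ".tmp")
    tmp.write_text(json.dumps(result), encoding="utf-8")
    tmp.replace(artifact)
    return result, False
```

The second element of the returned tuple plays the same role as the `skipped` field printed in the transcription step: it tells the caller whether work was actually done.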

P1 — Download

VIDEO_URL = "https://www.youtube.com/watch?v=GYQ5yGV_-Oc"

with logfire.span("P1.download"):
    dl = fw.download(VIDEO_URL)

video_id = dl["video_id"]
print(f"Video ID : {video_id}")
print(f"Title    : {dl['title']}")
print(f"Captions : {len(dl['caption_segments'])} segments")
for seg in dl["caption_segments"][:5]:
    print(f"  {seg}")

P2 — Transcribe

with logfire.span("P2.transcribe", video_id=video_id):
    transcript = fw.transcribe(video_id)

print(f"Language : {transcript['language']}")
print(f"Segments : {len(transcript['segments'])}")
print(f"Skipped  : {transcript.get('skipped', False)}")
print("\nFirst 3 segments:")
for seg in transcript["segments"][:3]:
    dur = seg["end"] - seg["start"]
    print(f"  [{seg['start']:.1f}s – {seg['end']:.1f}s ({dur:.1f}s)]  {seg['text'].strip()}")
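
Segment dictionaries with `start`, `end`, and `text` keys, as printed above, are enough to compute quick timing statistics before moving on to translation. The sketch below assumes that schema and uses made-up demo data; `segment_stats` is a hypothetical helper, not an SDK function.

```python
def segment_stats(segments: list[dict]) -> dict[str, float]:
    """Summarize segment timing: count, total speech, mean length, longest gap."""
    durations = [seg["end"] - seg["start"] for seg in segments]
    # Gap between the end of one segment and the start of the next;
    # overlapping segments count as a gap of zero.
    gaps = [
        max(0.0, nxt["start"] - cur["end"])
        for cur, nxt in zip(segments, segments[1:])
    ]
    total = sum(durations)
    return {
        "count": len(segments),
        "speech_s": total,
        "mean_s": total / len(segments) if segments else 0.0,
        "max_gap_s": max(gaps, default=0.0),
    }


# Made-up segments in the same shape as transcript["segments"].
demo = [
    {"start": 0.0, "end": 2.5, "text": " Hello."},
    {"start": 3.0, "end": 5.0, "text": " Welcome back."},
]
print(segment_stats(demo))
```

With a real run, `segment_stats(transcript["segments"])` gives a rough sense of how much speech the TTS step will have to cover and how much silence is available between segments.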

P3 — Translate

with logfire.span("P3.translate", video_id=video_id):
    translation = fw.translate(video_id)

print(f"Target language: {translation['target_language']}")
print(f"Segments:        {len(translation['segments'])}")
print("\nFirst segment comparison:")
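src_seg = transcript["segments"][0]
tgt_seg = translation["segments"][0]
# Label each line with the language codes the API returned instead of hard-coding EN/ES.
print(f"  {transcript['language'].upper()}: {src_seg['text'].strip()}")
print(f"  {translation['target_language'].upper()}: {tgt_seg['text'].strip()}")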

P4 — TTS

with logfire.span("P4.tts", video_id=video_id):
    tts_result = fw.tts(video_id, alignment=True)

print(f"Audio path: {tts_result['audio_path']}")
print(f"Config:     {tts_result.get('config', 'N/A')}")
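
The notebook does not show what `alignment=True` does internally. One common way to make a synthesized clip fit its source segment is to time-stretch it by a clamped playback rate, and the sketch below illustrates that idea only. `stretch_factor` and its default bounds are assumptions chosen for illustration, not the policy used by the foreign_whispers library.

```python
def stretch_factor(
    window_s: float,
    clip_s: float,
    min_rate: float = 0.8,
    max_rate: float = 1.25,
) -> float:
    """Playback-rate multiplier that fits clip_s into window_s, clamped to a range."""
    if window_s <= 0 or clip_s <= 0:
        raise ValueError("durations must be positive")
    # A rate above 1 means the clip is longer than its window and must be sped up.
    rate = clip_s / window_s
    # Clamp so speech is never sped up or slowed down beyond the chosen bounds.
    return min(max_rate, max(min_rate, rate))
```

When the clamp is hit, the clip still does not fit its window exactly, so a fuller policy would also need to borrow time from neighbouring silence or shorten the translation; see alignment_integration/ for the project's actual treatment.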

P5 — Stitch

with logfire.span("P5.stitch", video_id=video_id):
    stitch_result = fw.stitch(video_id)

print(f"Video path: {stitch_result['video_path']}")

Summary

| Step | Tool | Output |
| --- | --- | --- |
| P1 — Download | yt-dlp via API | videos/*.mp4, youtube_captions/*.txt |
| P2 — Transcribe | Whisper STT (GPU) | transcriptions/whisper/*.json |
| P3 — Translate | argostranslate | translations/argos/*.json |
| P4 — TTS | Chatterbox (GPU) | tts_audio/chatterbox/{config}/*.wav |
| P5 — Stitch | ffmpeg | dubbed_videos/{config}/*.mp4, dubbed_captions/*.vtt |

All artifacts are cached in pipeline_data/api/. Re-running skips completed steps. For alignment analysis, evaluation, and per-stage deep-dives, see the integration notebooks listed in the intro.

# Show pipeline artifacts
data_dir = PROJECT_ROOT / "pipeline_data" / "api"
artifacts = {
    "Source video": list((data_dir / "videos").glob("*.mp4")),
    "YouTube captions": list((data_dir / "youtube_captions").glob("*.txt")),
    "Transcription": list((data_dir / "transcriptions" / "whisper").glob("*.json")),
    "Translation": list((data_dir / "translations" / "argos").glob("*.json")),
    "TTS audio": list((data_dir / "tts_audio").rglob("*.wav")),
    "Dubbed video": list((data_dir / "dubbed_videos").rglob("*.mp4")),
    "Dubbed captions": list((data_dir / "dubbed_captions").glob("*.vtt")),
}

print("=== Pipeline Artifacts ===")
for label, files in artifacts.items():
    if files:
        for f in files:
            size_mb = f.stat().st_size / (1024 * 1024)
            print(f"  {label:<20} {f.name:<60} {size_mb:.1f} MB")
    else:
        print(f"  {label:<20} (not yet produced)")

# Show first few lines of the dubbed captions (rolling two-line format)
vtt_files = list((data_dir / "dubbed_captions").glob("*.vtt"))
if vtt_files:
    print(f"\n=== Dubbed Captions Preview ({vtt_files[0].name}) ===")
    for line in vtt_files[0].read_text(encoding="utf-8").splitlines()[:20]:
        print(f"  {line}")
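
The `.vtt` files previewed above use WebVTT cue timing lines of the form `HH:MM:SS.mmm --> HH:MM:SS.mmm`. As a minimal sketch of how segment times in seconds map onto that format, the helpers below render a single cue. `vtt_timestamp` and `vtt_cue` are hypothetical names, and the sketch does not reproduce the rolling two-line layout the pipeline emits.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms_total = round(seconds * 1000)
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def vtt_cue(start: float, end: float, text: str) -> str:
    """Render one cue: a timing line followed by the caption text."""
    return f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}\n{text.strip()}\n"


# Made-up example cue.
print(vtt_cue(0.0, 2.5, " Hola a todos. "))
```

A complete file would start with a `WEBVTT` header line and separate cues with blank lines.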