
Introduction

This document outlines a distributed, actor-based inference system designed to process large-scale geospatial machine learning workloads from raw imagery to vectorized outputs. The system combines Ray for parallel and stateful pipeline orchestration with NVIDIA’s Triton Inference Server for efficient GPU-accelerated model serving, together enabling high throughput with minimal idle time across all stages.

About Ray

Ray is an open-source framework for building and running distributed applications at scale. It provides a unified runtime for tasks (stateless units of work) and actors (stateful, long-lived processes); a minimal sketch of both follows the list. Ray was chosen because it:
  • Supports persistent actors that maintain state across calls
  • Offers a simple API for asynchronous and parallel execution
  • Can scale from a single machine to a multi-node cluster
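
As a quick illustration of the task/actor split, here is a minimal, self-contained Ray sketch (the names are illustrative and not part of this pipeline):

import ray

ray.init()

@ray.remote
def double(x):
    # Stateless task: can run on any worker in the cluster
    return x * 2

@ray.remote
class Counter:
    # Stateful actor: a long-lived process that keeps state across calls
    def __init__(self):
        self.n = 0

    def incr(self):
        self.n += 1
        return self.n

counter = Counter.remote()
print(ray.get([double.remote(i) for i in range(4)]))  # [0, 2, 4, 6]
print(ray.get(counter.incr.remote()))                 # 1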

About Triton Inference Server

NVIDIA Triton Inference Server enables deployment of trained AI models from multiple frameworks. Triton was chosen because it:
  • Provides optimized GPU utilization through dynamic batching
  • Supports multiple model backends
  • Integrates with both local and cloud deployments

High-Level Ray Actor Overview

Actor(s)                 | Primary Function                      | Key Inputs         | Key Outputs
-------------------------|---------------------------------------|--------------------|------------------------------
ControllerActor          | Orchestrates all stages               | Job list, config   | Run metadata, progress table
TileLoaderActor          | Ingests tiles, extracts chips         | GeoTIFF tiles      | Chips + metadata
InputQueueActor          | Buffers chips between stages          | Chip records       | Chip records for batching
InferenceDispatcherActor | Batches chips, runs inference         | Chips from queue   | Chip predictions
AggregatorActor          | Buffers predictions per tile          | Chip predictions   | Complete per-tile sets
PostProcessingActor      | Stitches masks, extracts centerlines  | Complete tile sets | GeoJSON centerlines
CenterlineWorker         | Converts polygons to centerlines      | Polygon batches    | Vectorized centerlines

Pipeline Ingress

Building Tile Jobs

The build_tile_jobs(...) function creates a standardized list of per-tile job dicts:
{
    "tif_path": "s3://bucket/path/to.tif",
    "tile_id": "H6B10",
    "job_id": "njogis-2020",
    "requested_chip_size": 256,
    "requested_chip_overlap": 32,
    "use_pseudo_color_nir": True,
    "target_format": "NIR-GB",
    "target_model_input_size": [3, 256, 256]
}
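
For orientation, here is a simplified sketch of what such a builder could look like. It is not the project's actual implementation: the real build_tile_jobs(...) also sets the pseudo-color and model-input options shown above, and may derive tile IDs differently.

import os

def build_tile_jobs(tif_paths, job_id, chip_size=256, chip_overlap=32):
    # Illustrative sketch only: one standardized job dict per input tile
    jobs = []
    for path in tif_paths:
        # Assumption for this sketch: the tile ID can be derived from the file name
        tile_id = os.path.splitext(os.path.basename(path))[0]
        jobs.append({
            "tif_path": path,
            "tile_id": tile_id,
            "job_id": job_id,
            "requested_chip_size": chip_size,
            "requested_chip_overlap": chip_overlap,
        })
    return jobs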

Running the Pipeline

import os

# Enumerate the source GeoTIFF tiles stored in S3
store = s3_store()
files = list_s3_files(store, prefix="imagery/njogis-tiles/2020/cog")
tif_keys = sorted(files["key"])
s3_tif_paths = [os.path.join("s3://njtpa/", k) for k in tif_keys]

# One job dict per tile
tile_jobs = build_tile_jobs(tif_paths=s3_tif_paths, job_id="njogis-tiles_2020")

# Launch the pipeline
main(
    tile_jobs=tile_jobs,
    run_id="njogis-tiles_2020_cog_full_run",
    endpoint="triton-inference-server:8001",
    model_name="batched_semseg_model",
    model_version="1",
    num_tileloaders=3,
    num_postprocessors=3,
    storage_mode="s3",
    store=store,
)

Pipeline Stages

Lifecycle of a Single Tile

  1. Ingestion: TileLoaderActor reads the tile, extracts chips, and sends them to the queue
  2. Queuing: InputQueueActor buffers chips for downstream consumption
  3. Inference: InferenceDispatcherActor batches chips and runs model inference
  4. Aggregation: AggregatorActor groups predictions until the tile is complete
  5. Post-Processing: PostProcessingActor stitches the mask and extracts centerlines

Stage 0 — ControllerActor

  • Startup & Wiring: Launches all workers, connects handoffs
  • Progress Tracking: Maintains progress table indexed by (job_id, tile_id); a sketch follows this list
  • Health & Logging: Polls actors, logs status summaries
  • Completion Criteria: Declares complete when all tiles processed
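
The progress table's exact schema is not shown in this document; one plausible shape, keyed by (job_id, tile_id) as described above, might be the following (field names are hypothetical):

# Hypothetical schema; actual field names may differ
progress = {}
progress[("njogis-2020", "H6B10")] = {
    "chips_total": 120,       # chips extracted from the tile
    "chips_inferred": 120,    # chips with predictions returned
    "status": "postprocessed",
}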

Stage 1 — TileLoaderActor

  • Reads .tif tiles from local or S3 storage
  • Extracts geospatial metadata (CRS, transform, dimensions)
  • Splits tiles into chips with overlap (see the sketch after this list)
  • Assigns composite keys for traceability
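
As an illustration of overlap-aware chipping, a minimal offset generator might look like this. It is a hypothetical helper that assumes tiles are at least one chip wide and tall; the actual actor also attaches geospatial metadata to each chip.

def chip_offsets(width, height, chip_size=256, overlap=32):
    # Yield top-left (x, y) offsets so adjacent chips overlap by `overlap` pixels
    stride = chip_size - overlap
    xs = list(range(0, max(width - chip_size, 0) + 1, stride))
    ys = list(range(0, max(height - chip_size, 0) + 1, stride))
    # Add a final offset so the right and bottom edges are fully covered
    if xs[-1] + chip_size < width:
        xs.append(width - chip_size)
    if ys[-1] + chip_size < height:
        ys.append(height - chip_size)
    for y in ys:
        for x in xs:
            yield x, y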

Stage 2 — InferenceDispatcherActor

  • Accumulates chips into batches (default size: 200)
  • Normalizes inputs (mean/std from model config)
  • Sends mini-batches to Triton via gRPC (a client sketch follows this list)
  • Applies softmax and confidence thresholding
  • Handles backpressure with exponential backoff
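
To make the gRPC handoff concrete, a single-chip request against the model configured later in this document might look like the following. This is illustrative only: the dispatcher sends batches of up to 200 chips and adds its own retry logic.

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="triton-inference-server:8001")

# One normalized 3x256x256 chip; the real dispatcher sends larger batches
batch = np.zeros((1, 3, 256, 256), dtype=np.float32)

inp = grpcclient.InferInput("image", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)
out = grpcclient.InferRequestedOutput("sem_seg")

result = client.infer(
    model_name="batched_semseg_model",
    model_version="1",
    inputs=[inp],
    outputs=[out],
)
logits = result.as_numpy("sem_seg")  # (1, 2, 256, 256) class logits

# Softmax over the class axis, then (optionally) threshold by confidence
e = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)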

Stage 3 — PostProcessingActor

  • Reconstructs full-size prediction mask from chips (sketched below)
  • Applies morphological operations
  • Converts polygons to centerlines via CenterlineWorker pool
  • Writes GeoJSON output to local/S3
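
A simplified version of the stitching step, assuming per-chip 2-D prediction arrays and known chip offsets (the actual actor also handles georeferencing and the morphological cleanup listed above):

import numpy as np

def stitch_mask(chip_preds, offsets, tile_height, tile_width):
    # Place each chip prediction back into a full-tile mask,
    # taking the maximum where overlapping chips disagree
    mask = np.zeros((tile_height, tile_width), dtype=np.float32)
    for pred, (x, y) in zip(chip_preds, offsets):
        h, w = pred.shape
        mask[y:y + h, x:x + w] = np.maximum(mask[y:y + h, x:x + w], pred)
    return mask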

Triton Inference Server Configuration

Model Directory Structure

models/
└── batched_semseg_model/
    ├── config.pbtxt
    └── 1/
        └── model.onnx

Model Configuration (config.pbtxt)

name: "batched_semseg_model"
platform: "onnxruntime_onnx"
max_batch_size: 200

instance_group [
  {
    kind: KIND_GPU
    count: 1
    gpus: [0]
  }
]

dynamic_batching {
  preferred_batch_size: [16, 32, 64, 128, 150, 200]
  max_queue_delay_microseconds: 100000
  preserve_ordering: true
}

input [
  {
    name: "image"
    data_type: TYPE_FP32
    dims: [3, -1, -1]
  }
]

output [
  {
    name: "sem_seg"
    data_type: TYPE_FP32
    dims: [2, -1, -1]
  }
]
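
Because max_batch_size is greater than zero, the dims entries above exclude the batch dimension, and -1 marks the height and width as variable. A quick, illustrative way to confirm the server has loaded the model as configured:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="triton-inference-server:8001")
assert client.is_server_ready()
assert client.is_model_ready("batched_semseg_model", "1")
print(client.get_model_config("batched_semseg_model", "1"))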

Chapter Summary

The inference pipeline transforms large-scale geospatial imagery into usable vector data through a fully automated, parallel workflow:
  1. Starting from GeoTIFF tiles
  2. Applying configurable preprocessing
  3. Performing semantic segmentation via Triton
  4. Reassembling predictions at tile scale
  5. Converting to vectorized sidewalk centerlines
The Ray-based architecture provides:
  • Scalable concurrency across multiple tiles and jobs
  • Robust fault handling with per-tile tracking
  • Flexible deployment for local or cloud environments
  • Minimal idle time through asynchronous handoffs
  • Clear observability via the central controller
Each run produces:
  • Vectorized per-tile GeoJSON centerlines
  • Progress and status tracking tables
  • Complete run metadata and execution logs
