
Introduction

The training dataset ultimately used for model development was stored in a post-cleaning, parquet-based format designed for efficient streaming. At this stage, the corpus contained:
  • Total chips: 90,073
  • Image dimensions: 256×256 pixels (fixed)
  • File format: JPEG-encoded images and masks stored as byte arrays
  • Schema fields: file_name (str), image (jpg bytes), sem_seg (jpg bytes), height (int), width (int)
This format struck a balance between compactness and utility: it dropped unused geospatial metadata while retaining all information essential for semantic segmentation training.
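
As an illustration of this layout, the sketch below reads one record from a post-cleaning shard and decodes the JPEG byte fields. The shard path is illustrative, and the snippet assumes pyarrow and Pillow are available.

```python
import io

import pyarrow.parquet as pq
from PIL import Image

# Read one post-cleaning shard (path is illustrative) and decode its first record.
table = pq.read_table("shards/train-00000.parquet")   # columns match the schema above
record = table.slice(0, 1).to_pylist()[0]

image = Image.open(io.BytesIO(record["image"]))       # 256x256 RGB chip
mask = Image.open(io.BytesIO(record["sem_seg"]))      # 0=background, 1=sidewalk, 255=ignore

print(record["file_name"], record["height"], record["width"], image.size)
```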

Dataset Structure and Example

Each example goes through several representational stages:
  1. Pre-cleaning record – the dataset’s original form, including geospatial fields
  2. Post-cleaning record – a simplified schema containing only essential fields
  3. Preprocessed record – the online representation created at training time

Pre-Cleaning Schema (Original Dataset)

| Field | Type | Mapped Field | Description |
| --- | --- | --- | --- |
| filename | str | file_name | Unique identifier for the chip |
| tfw | txt | ignored | Affine transformation matrix |
| tif | jpg | image | Base chip image |
| label_tif | jpg | sem_seg | Sidewalk label mask |
| label_tfw | txt | ignored | Affine transformation matrix for label |

Post-Cleaning Schema

| Field | Type | Description |
| --- | --- | --- |
| file_name | str | Unique identifier for the chip |
| image | jpg bytes | Base chip image, JPEG-encoded |
| sem_seg | jpg bytes | Ground truth mask (0=background, 1=sidewalk, 255=ignore) |
| height | int | Image height (256) |
| width | int | Image width (256) |

Model Input Schema

| Field | Type | Description |
| --- | --- | --- |
| file_name | str | Carries through unchanged |
| image | Tensor[3, 256, 256] (float32) | Decoded and normalized RGB image |
| sem_seg | Tensor[256, 256] (uint8) | Decoded ground truth label mask |
| height | int | Carries through unchanged |
| width | int | Carries through unchanged |
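
A minimal sketch of the decode step that maps a post-cleaning record to this model input schema is shown below. The helper name and the normalization (simple scaling to [0, 1]) are assumptions; the source only states that the image is decoded and normalized.

```python
import io

import numpy as np
import torch
from PIL import Image


def to_model_input(record: dict) -> dict:
    """Decode a post-cleaning record into the model input schema above."""
    image = Image.open(io.BytesIO(record["image"])).convert("RGB")
    mask = Image.open(io.BytesIO(record["sem_seg"]))

    # HWC uint8 -> CHW float32; scaling to [0, 1] stands in for the unspecified normalization
    image_t = torch.from_numpy(np.array(image)).permute(2, 0, 1).float() / 255.0
    # Label mask stays uint8: 0=background, 1=sidewalk, 255=ignore
    sem_seg_t = torch.from_numpy(np.array(mask)).to(torch.uint8)

    return {
        "file_name": record["file_name"],
        "image": image_t,        # Tensor[3, 256, 256], float32
        "sem_seg": sem_seg_t,    # Tensor[256, 256], uint8
        "height": record["height"],
        "width": record["width"],
    }
```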

Systemic Labeling Errors and Cleaning

Issue 1: Cropped Rows of Pixels

The first issue was a pervasive labeling artifact: the top eight rows of many ground truth masks were corrupted or misaligned. Our solution was to crop the first eight rows from both images and masks, then resize them back to 256×256.
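
A minimal sketch of this fix, assuming Pillow and bilinear/nearest resampling for image and mask respectively (the interpolation choices are not specified in the source):

```python
from PIL import Image

CROP_ROWS = 8  # corrupted rows at the top of many masks


def crop_and_resize(image: Image.Image, mask: Image.Image) -> tuple[Image.Image, Image.Image]:
    """Drop the top CROP_ROWS rows from both chip and mask, then resize back to 256x256."""
    w, h = image.size
    box = (0, CROP_ROWS, w, h)                           # left, upper, right, lower
    image = image.crop(box).resize((256, 256), Image.BILINEAR)
    # Nearest-neighbour resampling keeps the label values (0, 1, 255) intact.
    mask = mask.crop(box).resize((256, 256), Image.NEAREST)
    return image, mask
```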

Issue 2: Low-Quality, Rectangular Masks

The annotations were rendered as discontinuous rectangular blocks instead of smooth, continuous polygons. We developed an end-to-end mask rebuffering algorithm involving:
  • Skeletonizing raw masks
  • Extracting centerlines
  • Smoothing and merging
  • Simplifying into straight-line segments
  • Rebuffering into continuous, tube-shaped polygons
This correction not only improved geometric quality but also increased the positive pixel share by roughly 1.5 percentage points.
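
The sketch below is a deliberately simplified, raster-only stand-in for the rebuffering step: it skeletonizes the blocky annotation and re-expands the centerline into a fixed-width tube with scikit-image. The production algorithm additionally extracts, smooths, merges, and simplifies vector centerlines before buffering, and the half-width value here is an assumption.

```python
import numpy as np
from skimage.morphology import binary_dilation, disk, skeletonize


def rebuffer_mask(mask: np.ndarray, half_width: int = 3) -> np.ndarray:
    """Collapse blocky sidewalk annotations to a skeleton, then re-expand into a tube."""
    positive = mask == 1                       # sidewalk pixels only; 255 stays ignored
    centerline = skeletonize(positive)         # 1-pixel-wide centerline
    tube = binary_dilation(centerline, disk(half_width))

    rebuffered = mask.copy()
    rebuffered[positive] = 0                   # drop the old blocky annotation
    rebuffered[tube & (mask != 255)] = 1       # write back the tube-shaped mask
    return rebuffered
```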

Residual Annotation Quality Issues

Even after corrections, the dataset still exhibited:
  • Under-annotation: Large stretches of visible sidewalk were completely unmarked
  • Missed hard cases: Sidewalks partially obscured by trees or shadows were often omitted

Split Reconstruction and Streaming Strategy

Addressing Severe Class Imbalance

Out of 199,999 original images, only 90,073 contained any positive sidewalk annotations; background-only chips were filtered out. Sidewalk pixels accounted for roughly 2% of the corpus before rebuffering and roughly 3.5% after rebuffering.
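
As a small illustration of the filtering and of how the imbalance is measured (helper names are hypothetical):

```python
import numpy as np


def has_sidewalk(mask: np.ndarray) -> bool:
    """Keep a chip only if its ground truth mask contains at least one sidewalk pixel."""
    return bool((mask == 1).any())


def positive_share(masks: list[np.ndarray]) -> float:
    """Fraction of valid (non-ignore) pixels labeled as sidewalk across a set of masks."""
    positive = sum(int((m == 1).sum()) for m in masks)
    valid = sum(int((m != 255).sum()) for m in masks)
    return positive / valid if valid else 0.0
```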

Spectral Clustering and Stratified Split Formation

We developed a clustering approach that groups tiles according to visual and structural characteristics.

Feature construction:
  • Vegetation index (red/green ratio)
  • Red dominance
  • Color variability (std combined)
  • Brightness contrast
  • Overall texture
Clustering approach: Spectral clustering produced compact, well-distributed clusters. (A sketch of this procedure follows the split counts below.)

Final splits:
  • Total chips: 90,073 (100%)
  • Train chips: 72,026 (79.96%)
  • Validation chips: 9,135 (10.14%)
  • Test chips: 8,912 (9.89%)
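
The sketch below illustrates the procedure with scikit-learn. The feature formulas, the cluster count, and the use of train_test_split for the stratified 80/10/10 assignment are assumptions; the source specifies only the feature categories and the use of spectral clustering.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.model_selection import train_test_split


def chip_features(image: np.ndarray) -> np.ndarray:
    """Per-chip feature vector approximating the descriptors above (formulas illustrative)."""
    img = image.astype(np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return np.array([
        (r / (g + 1e-6)).mean(),                            # vegetation index (red/green ratio)
        (r - np.maximum(g, b)).mean(),                      # red dominance
        img.std(axis=(0, 1)).mean(),                        # color variability (combined std)
        img.mean(axis=2).std(),                             # brightness contrast
        np.abs(np.diff(img.mean(axis=2), axis=0)).mean(),   # crude texture proxy
    ])


def stratified_splits(features: np.ndarray, ids: np.ndarray, n_clusters: int = 10, seed: int = 0):
    """Cluster chips, then draw roughly 80/10/10 splits stratified by cluster label."""
    clusters = SpectralClustering(n_clusters=n_clusters, affinity="nearest_neighbors",
                                  random_state=seed).fit_predict(features)
    train, rest, _, rest_clusters = train_test_split(ids, clusters, test_size=0.2,
                                                     stratify=clusters, random_state=seed)
    val, test = train_test_split(rest, test_size=0.5, stratify=rest_clusters, random_state=seed)
    return train, val, test
```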

Shard Composition and Two-Stage Randomness

Our data pipeline applies randomness at two distinct stages (a sketch of both stages follows the list below):
  1. Global re-distribution at split construction (one-time, offline)
    • Cluster chips at tile-group level
    • Stratified split assignment
    • Full random shuffle within each split
    • Materialize into parquet shards
  2. Epoch-wise reshuffle during streaming (online, every epoch)
    • Epoch reset and reseeding
    • Buffer-based shuffling during streaming
    • Batch construction
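
A minimal sketch of both stages in plain Python. The shard size, buffer size, and seeding scheme are assumptions; the point is that stage 1 runs once offline while stage 2 reshuffles the stream with a fresh seed every epoch.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shuffle_into_shards(records: list[T], shard_size: int, seed: int = 0) -> list[list[T]]:
    """Stage 1 (offline): one-time global shuffle of a split, then materialize fixed shards."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + shard_size] for i in range(0, len(shuffled), shard_size)]


def buffered_shuffle(stream: Iterable[T], buffer_size: int, epoch: int, base_seed: int = 0) -> Iterator[T]:
    """Stage 2 (online): buffer-based shuffle over the shard stream, reseeded each epoch."""
    rng = random.Random(base_seed + epoch)    # epoch reset and reseeding
    buffer: list[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    rng.shuffle(buffer)                       # drain the remainder at epoch end
    yield from buffer
```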

Data Augmentation and Enrichment

For this stage of experimentation, we applied horizontal flipping as our sole augmentation.
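
A minimal sketch of this augmentation, flipping image and mask together so the labels stay aligned with the pixels they describe (the 0.5 probability is an assumption):

```python
import random

import torch


def random_hflip(image: torch.Tensor, sem_seg: torch.Tensor, p: float = 0.5):
    """Horizontally flip chip and mask together so labels stay aligned."""
    if random.random() < p:
        image = torch.flip(image, dims=[-1])      # flip the width axis of [3, 256, 256]
        sem_seg = torch.flip(sem_seg, dims=[-1])  # flip the width axis of [256, 256]
    return image, sem_seg
```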

Planned Augmentation Strategies

  • Brightness Normalization at Inference
  • Brightness Augmentation During Training
  • Geometric transformations (crops, small rotations, scaling)
  • Occlusion simulation
  • Photometric jitter

Pipeline Modernization and Integration

We developed:
  • A custom dataset class for streaming and decoding parquet shards
  • In-memory read/write support
  • Integration layers for Hugging Face’s datasets API and Ray’s Ray Data API
This modernization was essential for scaling and provides a baseline for extensibility and platform migration.
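
A stripped-down sketch of such a streaming dataset class, assuming PyTorch, pyarrow, and Pillow. The class name and constructor arguments are illustrative; the in-house implementation additionally covers in-memory shards and the Hugging Face and Ray Data adapters.

```python
import io
from typing import Iterator

import pyarrow.parquet as pq
from PIL import Image
from torch.utils.data import IterableDataset


class ParquetChipDataset(IterableDataset):
    """Stream post-cleaning parquet shards and decode records on the fly."""

    def __init__(self, shard_paths: list[str], batch_rows: int = 256):
        self.shard_paths = shard_paths
        self.batch_rows = batch_rows

    def __iter__(self) -> Iterator[dict]:
        for path in self.shard_paths:
            parquet_file = pq.ParquetFile(path)
            for batch in parquet_file.iter_batches(batch_size=self.batch_rows):
                for record in batch.to_pylist():
                    record["image"] = Image.open(io.BytesIO(record["image"])).convert("RGB")
                    record["sem_seg"] = Image.open(io.BytesIO(record["sem_seg"]))
                    yield record
```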

Chapter Summary

The original dataset required a comprehensive review and restructuring:
  • Corrected systemic errors by cropping corrupted rows and rebuffering masks
  • Addressed severe class imbalance by filtering background-only chips
  • Applied spectral clustering and stratified split formation
  • Designed a two-stage randomness strategy
  • Introduced horizontal flipping as lightweight augmentation
  • Undertook significant pipeline modernization effort
These steps transformed both the dataset and its utilization into a resource that is cleaner, better balanced, and fully aligned with modern ML infrastructure.