
Introduction

The training dataset ultimately used for model development was stored in a post-cleaning, parquet-based format designed for efficient streaming. At this stage, the corpus contained:
  • Total chips: 90,073
  • Image dimensions: 256×256 pixels (fixed)
  • File format: JPEG-encoded images and masks stored as byte arrays
  • Schema fields: file_name (str), image (jpg bytes), sem_seg (jpg bytes), height (int), width (int)
This format struck a balance between compactness and utility: it dropped unused geospatial metadata while retaining all information essential for semantic segmentation training.
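
As an illustration of this layout, the sketch below reads one record from a post-cleaning shard and decodes the JPEG byte fields. The shard path is illustrative, and the snippet assumes pyarrow and Pillow are available.

```python
import io

import pyarrow.parquet as pq
from PIL import Image

# Read one post-cleaning shard (path is illustrative) and decode its first record.
table = pq.read_table("shards/train-00000.parquet")   # columns match the schema above
record = table.slice(0, 1).to_pylist()[0]

image = Image.open(io.BytesIO(record["image"]))       # 256x256 RGB chip
mask = Image.open(io.BytesIO(record["sem_seg"]))      # 0=background, 1=sidewalk, 255=ignore

print(record["file_name"], record["height"], record["width"], image.size)
```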

Dataset Structure and Example

Each example goes through several representational stages:
  1. Pre-cleaning record – the dataset’s original form, including geospatial fields
  2. Post-cleaning record – a simplified schema containing only essential fields
  3. Preprocessed record – the online representation created at training time

Pre-Cleaning Schema (Original Dataset)

| Field | Type | Mapped Field | Description |
| --- | --- | --- | --- |
| filename | str | file_name | Unique identifier for the chip |
| tfw | txt | ignored | Affine transformation matrix |
| tif | jpg | image | Base chip image |
| label_tif | jpg | sem_seg | Sidewalk label mask |
| label_tfw | txt | ignored | Affine transformation matrix for label |

Post-Cleaning Schema

| Field | Type | Description |
| --- | --- | --- |
| file_name | str | Unique identifier for the chip |
| image | jpg bytes | Base chip image, JPEG-encoded |
| sem_seg | jpg bytes | Ground truth mask (0=background, 1=sidewalk, 255=ignore) |
| height | int | Image height (256) |
| width | int | Image width (256) |

Model Input Schema

| Field | Type | Description |
| --- | --- | --- |
| file_name | str | Carries through unchanged |
| image | Tensor[3, 256, 256] (float32) | Decoded and normalized RGB image |
| sem_seg | Tensor[256, 256] (uint8) | Decoded ground truth label mask |
| height | int | Carries through unchanged |
| width | int | Carries through unchanged |
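
A minimal sketch of the decode step that maps a post-cleaning record to this model input schema is shown below. The helper name and the normalization (simple scaling to [0, 1]) are assumptions; the source only states that the image is decoded and normalized.

```python
import io

import numpy as np
import torch
from PIL import Image


def to_model_input(record: dict) -> dict:
    """Decode a post-cleaning record into the model input schema above."""
    image = Image.open(io.BytesIO(record["image"])).convert("RGB")
    mask = Image.open(io.BytesIO(record["sem_seg"]))

    # HWC uint8 -> CHW float32; scaling to [0, 1] stands in for the unspecified normalization
    image_t = torch.from_numpy(np.array(image)).permute(2, 0, 1).float() / 255.0
    # Label mask stays uint8: 0=background, 1=sidewalk, 255=ignore
    sem_seg_t = torch.from_numpy(np.array(mask)).to(torch.uint8)

    return {
        "file_name": record["file_name"],
        "image": image_t,        # Tensor[3, 256, 256], float32
        "sem_seg": sem_seg_t,    # Tensor[256, 256], uint8
        "height": record["height"],
        "width": record["width"],
    }
```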

Systemic Labeling Errors and Cleaning

Issue 1: Cropped Rows of Pixels

The first issue was a pervasive labeling artifact: the top eight rows of many ground truth masks were corrupted or misaligned. Our solution was to crop the first eight rows from both images and masks, then resize them back to 256×256.
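
A minimal sketch of this fix, assuming Pillow and bilinear/nearest resampling for image and mask respectively (the interpolation choices are not specified in the source):

```python
from PIL import Image

CROP_ROWS = 8  # corrupted rows at the top of many masks


def crop_and_resize(image: Image.Image, mask: Image.Image) -> tuple[Image.Image, Image.Image]:
    """Drop the top CROP_ROWS rows from both chip and mask, then resize back to 256x256."""
    w, h = image.size
    box = (0, CROP_ROWS, w, h)                           # left, upper, right, lower
    image = image.crop(box).resize((256, 256), Image.BILINEAR)
    # Nearest-neighbour resampling keeps the label values (0, 1, 255) intact.
    mask = mask.crop(box).resize((256, 256), Image.NEAREST)
    return image, mask
```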

Issue 2: Low-Quality, Rectangular Masks

The annotations were rendered as discontinuous rectangular blocks instead of smooth, continuous polygons. We developed an end-to-end mask rebuffering algorithm involving:
  • Skeletonizing raw masks
  • Extracting centerlines
  • Smoothing and merging
  • Simplifying into straight-line segments
  • Rebuffering into continuous, tube-shaped polygons
This correction not only improved geometric quality but also increased the positive pixel share by roughly 1.5 percentage points.
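
The sketch below is a deliberately simplified, raster-only stand-in for the rebuffering step: it skeletonizes the blocky annotation and re-expands the centerline into a fixed-width tube with scikit-image. The production algorithm additionally extracts, smooths, merges, and simplifies vector centerlines before buffering, and the half-width value here is an assumption.

```python
import numpy as np
from skimage.morphology import binary_dilation, disk, skeletonize


def rebuffer_mask(mask: np.ndarray, half_width: int = 3) -> np.ndarray:
    """Collapse blocky sidewalk annotations to a skeleton, then re-expand into a tube."""
    positive = mask == 1                       # sidewalk pixels only; 255 stays ignored
    centerline = skeletonize(positive)         # 1-pixel-wide centerline
    tube = binary_dilation(centerline, disk(half_width))

    rebuffered = mask.copy()
    rebuffered[positive] = 0                   # drop the old blocky annotation
    rebuffered[tube & (mask != 255)] = 1       # write back the tube-shaped mask
    return rebuffered
```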

Residual Annotation Quality Issues

Even after corrections, the dataset still exhibited:
  • Under-annotation: Large stretches of visible sidewalk were completely unmarked
  • Missed hard cases: Sidewalks partially obscured by trees or shadows were often omitted

Split Reconstruction and Streaming Strategy

Addressing Severe Class Imbalance

Out of 199,999 original images, only 90,073 contained any positive sidewalk annotations; background-only chips were filtered out. Sidewalk pixels accounted for roughly 2% of the corpus before rebuffering and roughly 3.5% after rebuffering.
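
As a small illustration of the filtering and of how the imbalance is measured (helper names are hypothetical):

```python
import numpy as np


def has_sidewalk(mask: np.ndarray) -> bool:
    """Keep a chip only if its ground truth mask contains at least one sidewalk pixel."""
    return bool((mask == 1).any())


def positive_share(masks: list[np.ndarray]) -> float:
    """Fraction of valid (non-ignore) pixels labeled as sidewalk across a set of masks."""
    positive = sum(int((m == 1).sum()) for m in masks)
    valid = sum(int((m != 255).sum()) for m in masks)
    return positive / valid if valid else 0.0
```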

Spectral Clustering and Stratified Split Formation

We developed a clustering approach that groups tiles according to visual and structural characteristics.

Feature construction:
  • Vegetation index (red/green ratio)
  • Red dominance
  • Color variability (std combined)
  • Brightness contrast
  • Overall texture
Clustering approach: Spectral clustering produced compact, well-distributed clusters. (A sketch of this procedure follows the split counts below.)

Final splits:
  • Total chips: 90,073 (100%)
  • Train chips: 72,026 (79.96%)
  • Validation chips: 9,135 (10.14%)
  • Test chips: 8,912 (9.89%)
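
The sketch below illustrates the procedure with scikit-learn. The feature formulas, the cluster count, and the use of train_test_split for the stratified 80/10/10 assignment are assumptions; the source specifies only the feature categories and the use of spectral clustering.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.model_selection import train_test_split


def chip_features(image: np.ndarray) -> np.ndarray:
    """Per-chip feature vector approximating the descriptors above (formulas illustrative)."""
    img = image.astype(np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return np.array([
        (r / (g + 1e-6)).mean(),                            # vegetation index (red/green ratio)
        (r - np.maximum(g, b)).mean(),                      # red dominance
        img.std(axis=(0, 1)).mean(),                        # color variability (combined std)
        img.mean(axis=2).std(),                             # brightness contrast
        np.abs(np.diff(img.mean(axis=2), axis=0)).mean(),   # crude texture proxy
    ])


def stratified_splits(features: np.ndarray, ids: np.ndarray, n_clusters: int = 10, seed: int = 0):
    """Cluster chips, then draw roughly 80/10/10 splits stratified by cluster label."""
    clusters = SpectralClustering(n_clusters=n_clusters, affinity="nearest_neighbors",
                                  random_state=seed).fit_predict(features)
    train, rest, _, rest_clusters = train_test_split(ids, clusters, test_size=0.2,
                                                     stratify=clusters, random_state=seed)
    val, test = train_test_split(rest, test_size=0.5, stratify=rest_clusters, random_state=seed)
    return train, val, test
```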

Shard Composition and Two-Stage Randomness

Our data pipeline applies randomness at two distinct stages (a sketch of both stages follows the list below):
  1. Global re-distribution at split construction (one-time, offline)
    • Cluster chips at tile-group level
    • Stratified split assignment
    • Full random shuffle within each split
    • Materialize into parquet shards
  2. Epoch-wise reshuffle during streaming (online, every epoch)
    • Epoch reset and reseeding
    • Buffer-based shuffling during streaming
    • Batch construction
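
A minimal sketch of both stages in plain Python. The shard size, buffer size, and seeding scheme are assumptions; the point is that stage 1 runs once offline while stage 2 reshuffles the stream with a fresh seed every epoch.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shuffle_into_shards(records: list[T], shard_size: int, seed: int = 0) -> list[list[T]]:
    """Stage 1 (offline): one-time global shuffle of a split, then materialize fixed shards."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + shard_size] for i in range(0, len(shuffled), shard_size)]


def buffered_shuffle(stream: Iterable[T], buffer_size: int, epoch: int, base_seed: int = 0) -> Iterator[T]:
    """Stage 2 (online): buffer-based shuffle over the shard stream, reseeded each epoch."""
    rng = random.Random(base_seed + epoch)    # epoch reset and reseeding
    buffer: list[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    rng.shuffle(buffer)                       # drain the remainder at epoch end
    yield from buffer
```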

Data Augmentation and Enrichment

For this stage of experimentation, we applied horizontal flipping as our sole augmentation.
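
A minimal sketch of this augmentation, flipping image and mask together so the labels stay aligned with the pixels they describe (the 0.5 probability is an assumption):

```python
import random

import torch


def random_hflip(image: torch.Tensor, sem_seg: torch.Tensor, p: float = 0.5):
    """Horizontally flip chip and mask together so labels stay aligned."""
    if random.random() < p:
        image = torch.flip(image, dims=[-1])      # flip the width axis of [3, 256, 256]
        sem_seg = torch.flip(sem_seg, dims=[-1])  # flip the width axis of [256, 256]
    return image, sem_seg
```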

Planned Augmentation Strategies

  • Brightness Normalization at Inference
  • Brightness Augmentation During Training
  • Geometric transformations (crops, small rotations, scaling)
  • Occlusion simulation
  • Photometric jitter

Pipeline Modernization and Integration

We developed:
  • A custom dataset class for streaming and decoding parquet shards
  • In-memory read/write support
  • Integration layers for Hugging Face’s datasets API and Ray’s Ray Data API
This modernization was essential for scaling and provides a baseline for extensibility and platform migration.
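
A stripped-down sketch of such a streaming dataset class, assuming PyTorch, pyarrow, and Pillow. The class name and constructor arguments are illustrative; the in-house implementation additionally covers in-memory shards and the Hugging Face and Ray Data adapters.

```python
import io
from typing import Iterator

import pyarrow.parquet as pq
from PIL import Image
from torch.utils.data import IterableDataset


class ParquetChipDataset(IterableDataset):
    """Stream post-cleaning parquet shards and decode records on the fly."""

    def __init__(self, shard_paths: list[str], batch_rows: int = 256):
        self.shard_paths = shard_paths
        self.batch_rows = batch_rows

    def __iter__(self) -> Iterator[dict]:
        for path in self.shard_paths:
            parquet_file = pq.ParquetFile(path)
            for batch in parquet_file.iter_batches(batch_size=self.batch_rows):
                for record in batch.to_pylist():
                    record["image"] = Image.open(io.BytesIO(record["image"])).convert("RGB")
                    record["sem_seg"] = Image.open(io.BytesIO(record["sem_seg"]))
                    yield record
```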

Chapter Summary

The original dataset required a comprehensive review and restructuring:
  • Corrected systemic errors by cropping corrupted rows and rebuffering masks
  • Addressed severe class imbalance by filtering background-only chips
  • Applied spectral clustering and stratified split formation
  • Designed a two-stage randomness strategy
  • Introduced horizontal flipping as lightweight augmentation
  • Undertook significant pipeline modernization effort
These steps transformed both the dataset and its utilization into a resource that is cleaner, better balanced, and fully aligned with modern ML infrastructure.