Introduction
The training dataset ultimately used for model development was stored in a post-cleaning, parquet-based format designed for efficient streaming (a brief inspection sketch follows the list). At this stage, the corpus contained:
- Total chips: 90,073
- Image dimensions: 256×256 pixels (fixed)
- File format: JPEG-encoded images and masks stored as byte arrays
- Schema fields:
file_name (str), image (jpg bytes), sem_seg (jpg bytes), height (int), width (int)
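As a quick orientation, the sketch below reads a single post-cleaning record from a parquet shard with pyarrow and decodes the JPEG byte columns; the shard filename is a placeholder, not a file shipped with the project.

```python
import io
import pyarrow.parquet as pq
from PIL import Image

# Illustrative inspection of one post-cleaning record.
# "train-00000.parquet" is a placeholder shard name.
table = pq.read_table("train-00000.parquet",
                      columns=["file_name", "image", "sem_seg", "height", "width"])
row = table.slice(0, 1).to_pylist()[0]

image = Image.open(io.BytesIO(row["image"]))    # 256x256 RGB chip
mask = Image.open(io.BytesIO(row["sem_seg"]))   # 0=background, 1=sidewalk, 255=ignore
print(row["file_name"], image.size, mask.size, row["height"], row["width"])
```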
Dataset Structure and Example
Each example goes through several representational stages:
- Pre-cleaning record – the dataset’s original form, including geospatial fields
- Post-cleaning record – a simplified schema containing only essential fields
- Preprocessed record – the online representation created at training time
Pre-Cleaning Schema (Original Dataset)
| Field | Type | Mapped Field | Description |
|---|---|---|---|
| filename | str | → file_name | Unique identifier for the chip |
| tfw | txt | ignored | Affine transformation matrix |
| tif | jpg | → image | Base chip image |
| label_tif | jpg | → sem_seg | Sidewalk label mask |
| label_tfw | txt | ignored | Affine transformation matrix for label |
Post-Cleaning Schema
| Field | Type | Description |
|---|---|---|
| file_name | str | Unique identifier for the chip |
| image | jpg bytes | Base chip image, JPEG-encoded |
| sem_seg | jpg bytes | Ground truth mask (0=background, 1=sidewalk, 255=ignore) |
| height | int | Image height (256) |
| width | int | Image width (256) |
Model Input Schema
| Field | Type | Description |
|---|---|---|
| file_name | str | Carries through unchanged |
| image | Tensor[3, 256, 256] (float32) | Decoded and normalized RGB image |
| sem_seg | Tensor[256, 256] (uint8) | Decoded ground truth label mask |
| height | int | Carries through unchanged |
| width | int | Carries through unchanged |
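A minimal sketch of the online preprocessing that produces this schema, assuming plain PIL/NumPy/PyTorch decoding; the `preprocess` function name and the simple divide-by-255 normalization are illustrative, not the exact transform used in training.

```python
import io
import numpy as np
import torch
from PIL import Image

def preprocess(record: dict) -> dict:
    """Decode JPEG bytes into the tensors described in the model input schema."""
    img = np.array(Image.open(io.BytesIO(record["image"])), dtype=np.float32) / 255.0
    image = torch.from_numpy(img).permute(2, 0, 1).contiguous()   # [3, 256, 256] float32
    mask = np.array(Image.open(io.BytesIO(record["sem_seg"])), dtype=np.uint8)
    sem_seg = torch.from_numpy(mask)                              # [256, 256] uint8
    return {"file_name": record["file_name"], "image": image, "sem_seg": sem_seg,
            "height": record["height"], "width": record["width"]}
```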
Systemic Labeling Errors and Cleaning
Issue 1: Cropped Rows of Pixels
The first issue was a pervasive labeling artifact: the top eight rows of many ground truth masks were corrupted or misaligned. Our solution was to crop the first eight rows from both images and masks, then resize them back to 256×256.
Issue 2: Low-Quality, Rectangular Masks
The annotations were rendered as discontinuous rectangular blocks instead of smooth, continuous polygons. We developed an end-to-end mask rebuffering algorithm (sketched after the list) involving:
- Skeletonizing raw masks
- Extracting centerlines
- Smoothing and merging
- Simplifying into straight-line segments
- Rebuffering into continuous, tube-shaped polygons
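The sketch below illustrates both fixes under simplifying assumptions: the Issue 1 crop-and-resize with OpenCV, and the Issue 2 rebuffering collapsed into skeletonization plus a fixed-radius dilation with scikit-image. The real pipeline's centerline smoothing and line-segment simplification steps, and the buffer radius, are not reproduced here.

```python
import cv2
import numpy as np
from skimage.morphology import skeletonize, binary_dilation, disk

def crop_top_rows(image: np.ndarray, mask: np.ndarray, n_rows: int = 8):
    """Issue 1: drop the corrupted top rows, then resize back to 256x256."""
    image = cv2.resize(image[n_rows:], (256, 256), interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask[n_rows:], (256, 256), interpolation=cv2.INTER_NEAREST)
    return image, mask

def rebuffer_mask(raw_mask: np.ndarray, radius: int = 4) -> np.ndarray:
    """Issue 2 (simplified): thin blocky annotations to centerlines, then
    re-expand them into a continuous, tube-shaped region."""
    sidewalk = raw_mask == 1
    skeleton = skeletonize(sidewalk)                 # 1-pixel-wide centerlines
    tube = binary_dilation(skeleton, disk(radius))   # rebuffer to a fixed-width tube
    out = np.where(tube, 1, 0).astype(np.uint8)
    out[raw_mask == 255] = 255                       # preserve the ignore label
    return out
```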
Residual Annotation Quality Issues
Even after corrections, the dataset still exhibited:
- Under-annotation: Large stretches of visible sidewalk were completely unmarked
- Missed hard cases: Sidewalks partially obscured by trees or shadows were often omitted
Split Reconstruction and Streaming Strategy
Addressing Severe Class Imbalance
Out of 199,999 original images, 90,073 contained positive sidewalk labels. The concentration of sidewalk pixels amounted to ~2% before rebuffering and ~3.5% after rebuffering.
Spectral Clustering and Stratified Split Formation
We developed a clustering approach grouping tiles according to visual and structural characteristics.
Feature construction (sketched after the split sizes below):
- Vegetation index (red/green ratio)
- Red dominance
- Color variability (std combined)
- Brightness contrast
- Overall texture
Resulting split sizes:
- Total chips: 90,073 (100%)
- Train chips: 72,026 (79.96%)
- Validation chips: 9,135 (10.14%)
- Test chips: 8,912 (9.89%)
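A minimal sketch of the feature construction and clustering, assuming scikit-learn's SpectralClustering; the exact feature formulas, cluster count, and affinity used in the project are not specified here, so the values below are illustrative, and `chip_images` is a placeholder for the decoded chips.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def tile_features(img: np.ndarray) -> np.ndarray:
    """One feature vector per chip (img is an HxWx3 uint8 array); definitions are illustrative."""
    img = img.astype(np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    gray = img.mean(axis=-1)
    vegetation = r.mean() / (g.mean() + 1e-6)                # vegetation index (red/green ratio)
    red_dominance = r.mean() - 0.5 * (g.mean() + b.mean())   # red dominance
    color_var = img.std()                                    # color variability (combined std)
    brightness_contrast = gray.std()                         # brightness contrast
    texture = np.abs(np.diff(gray, axis=0)).mean()           # crude overall texture proxy
    return np.array([vegetation, red_dominance, color_var, brightness_contrast, texture])

# Cluster chips into visually similar groups; split assignment is then stratified by cluster.
features = np.stack([tile_features(img) for img in chip_images])
clusters = SpectralClustering(n_clusters=20, affinity="nearest_neighbors",
                              random_state=0).fit_predict(features)
```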
Shard Composition and Two-Stage Randomness
Our data pipeline applies randomness at two distinct stages (see the shuffle sketch after this list):
- Global re-distribution at split construction (one-time, offline)
  - Cluster chips at tile-group level
  - Stratified split assignment
  - Full random shuffle within each split
  - Materialize into parquet shards
- Epoch-wise reshuffle during streaming (online, every epoch)
  - Epoch reset and reseeding
  - Buffer-based shuffling during streaming
  - Batch construction
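A minimal sketch of the online stage, assuming a generator-based shuffle buffer; the buffer size and seeding scheme are illustrative rather than the project's exact settings.

```python
import random
from typing import Iterable, Iterator

def shuffled_stream(records: Iterable, epoch: int, seed: int = 0,
                    buffer_size: int = 2048) -> Iterator:
    """Epoch-wise reshuffle while streaming: reseed per epoch, hold a bounded
    buffer of records, and emit them in a locally randomized order."""
    rng = random.Random(seed + epoch)        # epoch reset and reseeding
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))   # emit a random buffered record
    rng.shuffle(buffer)
    yield from buffer                        # drain what is left at epoch end
```

Batches are then built from this stream in order, so shuffling cost stays bounded by the buffer size rather than by the full shard.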
Data Augmentation and Enrichment
For this stage of experimentation, we applied horizontal flipping as our sole augmentation (a minimal sketch follows the planned-strategy list below).
Planned Augmentation Strategies
- Brightness Normalization at Inference
- Brightness Augmentation During Training
- Geometric transformations (crops, small rotations, scaling)
- Occlusion simulation
- Photometric jitter
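A minimal sketch of the current augmentation (a joint horizontal flip of image and mask), with a commented-out brightness jitter only to illustrate the kind of photometric augmentation planned above; the parameter values are placeholders.

```python
import random
import torch

def augment(image: torch.Tensor, sem_seg: torch.Tensor, p_flip: float = 0.5):
    """Horizontal flip applied jointly so the mask stays aligned with the image."""
    if random.random() < p_flip:
        image = torch.flip(image, dims=[-1])      # flip width axis of [3, H, W]
        sem_seg = torch.flip(sem_seg, dims=[-1])  # flip width axis of [H, W]
    # Planned, illustrative only: photometric brightness jitter on the image.
    # image = (image * random.uniform(0.8, 1.2)).clamp(0.0, 1.0)
    return image, sem_seg
```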
Pipeline Modernization and Integration
We developed:
- A custom dataset class for streaming and decoding parquet shards (sketched below)
- In-memory read/write support
- Integration layers for Hugging Face's datasets API and Ray's Ray Data API
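A minimal sketch of what such a streaming dataset can look like, assuming pyarrow and a PyTorch IterableDataset; the class name and decoding details are illustrative, not the project's actual implementation.

```python
import io
import numpy as np
import pyarrow.parquet as pq
from PIL import Image
from torch.utils.data import IterableDataset

class ParquetChipDataset(IterableDataset):
    """Stream parquet shards batch by batch and decode JPEG byte columns lazily."""
    def __init__(self, shard_paths):
        self.shard_paths = shard_paths

    def __iter__(self):
        for path in self.shard_paths:
            parquet_file = pq.ParquetFile(path)
            for batch in parquet_file.iter_batches(columns=["file_name", "image", "sem_seg"]):
                for row in batch.to_pylist():
                    yield {
                        "file_name": row["file_name"],
                        "image": np.array(Image.open(io.BytesIO(row["image"]))),
                        "sem_seg": np.array(Image.open(io.BytesIO(row["sem_seg"]))),
                    }
```

The same shards can also be read through Hugging Face's `datasets` (e.g., `load_dataset("parquet", data_files=...)`) or Ray Data (`ray.data.read_parquet(...)`), which is the kind of call the integration layers wrap.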
Chapter Summary
The original dataset required a comprehensive review and restructuring:
- Corrected systemic errors by cropping corrupted rows and rebuffering masks
- Addressed severe class imbalance by filtering background-only chips
- Applied spectral clustering and stratified split formation
- Designed a two-stage randomness strategy
- Introduced horizontal flipping as lightweight augmentation
- Undertook significant pipeline modernization effort

