
Task Overview
A set of primitive geometric shapes — blocks of various sizes — is laid on a table in front of the robotic arm. The robot receives spoken instructions to pick up the blocks and shelve them on a bookshelf that already contains other objects. The agent must:
- Perceive the scene — identify block sizes, bookshelf layout, and existing objects (a minimal scene-model sketch follows this list)
- Reason about placement — match each block to an appropriate shelf location given available space
- Plan collision-free trajectories — avoid hitting objects already on the bookshelf during insertion
- Optimize total shelving utility — maximize the number of blocks placed while respecting spatial constraints
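To make the perceived state concrete, the scene can be modeled with a few plain data containers, as in the sketch below. This is only an illustration: the class and field names (Block, ShelfSlot, Scene, extent, occupied_by) are assumptions, not the project's actual data structures.

```python
from dataclasses import dataclass, field

# Illustrative scene model (names and fields are assumptions, not the project's API).
# The perception step would populate these from RGB-D detections.

@dataclass
class Block:
    name: str
    size: tuple[float, float, float]   # (width, depth, height) in metres
    pose: tuple[float, float, float]   # table-frame position of the block centre

@dataclass
class ShelfSlot:
    extent: tuple[float, float, float]               # free space available in the slot
    occupied_by: list[str] = field(default_factory=list)  # objects already placed there

@dataclass
class Scene:
    blocks: list[Block]
    shelves: list[ShelfSlot]
```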
Audio Instruction
Natural language commands via speech, eliminating the need for programming or teleoperation.
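For illustration, a minimal capture-and-transcribe step could look like the sketch below. The use of the open-source whisper package and the audio file name are assumptions; the project does not state which speech stack it uses.

```python
# Minimal transcription sketch; the `whisper` package and the file name are assumptions.
import whisper

model = whisper.load_model("base")                       # small general-purpose speech model
result = model.transcribe("shelve_the_blue_block.wav")   # hypothetical recorded instruction
instruction_text = result["text"]                        # handed to the language encoder
```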
Spatial Reasoning
The agent reasons about object sizes, shelf dimensions, and occupied space to find valid placements.
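A minimal version of that fit test reduces to an axis-aligned size comparison with a clearance margin, as sketched below; the function name, the fixed margin, and the box-only geometry are simplifying assumptions.

```python
# Minimal axis-aligned fit test (a sketch; the real system reasons over full 6-DOF poses).
CLEARANCE = 0.01  # metres of margin assumed on every side

def fits(block_size, slot_extent, clearance=CLEARANCE):
    """True if the block, padded by a clearance margin, fits inside the slot's free extent."""
    return all(b + 2 * clearance <= s for b, s in zip(block_size, slot_extent))

# Example: a 6 cm cube in an 8 cm-wide, 10 cm-deep, 12 cm-tall gap.
print(fits((0.06, 0.06, 0.06), (0.08, 0.10, 0.12)))  # True
```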
Collision Avoidance
Trajectory planning accounts for existing objects on the bookshelf to prevent contact during placement.
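One simple way to express that requirement is a minimum-clearance check over discretised waypoints, as in the sketch below; representing both the trajectory and the shelved objects as point sets, and the 3 cm margin, are assumptions made for brevity.

```python
import numpy as np

# Sketch of a clearance check over a discretised trajectory. The point-set
# representation and the margin value are illustrative assumptions.

def min_clearance(waypoints, obstacle_points):
    """Smallest distance between any trajectory waypoint and any obstacle point."""
    w = np.asarray(waypoints)[:, None, :]        # (W, 1, 3)
    o = np.asarray(obstacle_points)[None, :, :]  # (1, O, 3)
    return float(np.linalg.norm(w - o, axis=-1).min())

def is_collision_free(waypoints, obstacle_points, margin=0.03):
    """Reject a candidate trajectory if it passes closer than `margin` to existing objects."""
    return min_clearance(waypoints, obstacle_points) >= margin
```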
Utility Maximization
The agent optimizes total shelving utility rather than following a fixed placement order.
Vision-Language-Action Model
The system is built on a VLA architecture that unifies perception, language understanding, and action generation in a single model. Unlike traditional pipelines that chain separate vision, planning, and control modules, a VLA model learns end-to-end mappings from visual observations and language instructions to robot actions.
Architecture
- Visual Encoder — Processes RGB-D camera input to build a scene representation including object identities, poses, and the spatial layout of the bookshelf.
- Language Encoder — Processes transcribed audio instructions into task-level goals (e.g., “place the large blue block on the top shelf”).
- Action Decoder — Generates a sequence of end-effector waypoints conditioned on the scene representation and language embedding, producing smooth pick-and-place trajectories.
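To make the three components concrete, the sketch below wires them into a single PyTorch module. Everything about it (layer sizes, a GRU language encoder over pre-embedded tokens, a 16-step, 7-D waypoint head) is an assumption for illustration, not the deployed network.

```python
import torch
import torch.nn as nn

# Schematic of the three components described above. All dimensions and layer
# choices are assumptions made for this sketch.

class VLAPolicy(nn.Module):
    def __init__(self, vis_dim=512, lang_dim=512, hidden=512, horizon=16, action_dim=7):
        super().__init__()
        # Visual encoder: maps an RGB-D observation to a scene embedding.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(4, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, vis_dim),
        )
        # Language encoder: maps an embedded instruction sequence to a goal embedding.
        self.language_encoder = nn.GRU(input_size=300, hidden_size=lang_dim, batch_first=True)
        # Action decoder: produces a short horizon of end-effector waypoints
        # (here 7-D: position, orientation, gripper) conditioned on both embeddings.
        self.action_decoder = nn.Sequential(
            nn.Linear(vis_dim + lang_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, rgbd, instruction_embeddings):
        scene = self.visual_encoder(rgbd)                          # (B, vis_dim)
        _, goal = self.language_encoder(instruction_embeddings)    # (1, B, lang_dim)
        fused = torch.cat([scene, goal.squeeze(0)], dim=-1)
        actions = self.action_decoder(fused)
        return actions.view(-1, self.horizon, self.action_dim)     # (B, horizon, 7)

# Example shapes: one 128x128 RGB-D frame and a 12-token embedded instruction.
policy = VLAPolicy()
waypoints = policy(torch.randn(1, 4, 128, 128), torch.randn(1, 12, 300))  # -> (1, 16, 7)
```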
Reasoning Loop
The agent operates in a closed-loop reasoning cycle (sketched below):
- Plan placement — Evaluate candidate shelf locations by scoring each for spatial fit, collision clearance, and utility contribution.
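A plausible shape for that cycle, assuming perceive and execute steps that the text implies but does not enumerate, is sketched below; all function names are placeholders.

```python
# One possible shape of the closed-loop cycle. The step names beyond "plan placement"
# are assumptions inferred from "closed-loop", not an enumeration from the project.

def shelve_blocks(robot, perceive, plan_placement, execute):
    """Repeat perceive -> plan -> execute until no block can be placed."""
    while True:
        scene = perceive(robot)          # refresh the scene after every placement
        plan = plan_placement(scene)     # best-scoring (block, slot, trajectory) candidate
        if plan is None:                 # nothing fits any more: stop
            return
        execute(robot, plan)             # pick-and-place, then loop and re-perceive
```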
Utility Optimization
The agent does not simply place blocks in the first available slot. It solves a constrained optimization problem at each step (a weighted-scoring sketch follows this list):
- Spatial fit — the block must physically fit in the candidate location with clearance margins
- Collision cost — candidate trajectories are penalized for proximity to existing objects
- Packing efficiency — placements that leave usable remaining space are preferred over those that fragment the shelf
- Instruction alignment — when the operator specifies a preference (e.g., “next to the red box”), the agent incorporates it as a soft constraint
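One way these terms could be combined is a weighted score with spatial fit as a hard constraint and the rest as soft terms, as sketched below. The weights and the candidate helper methods (fits_with_clearance, clearance_margin, trajectory_proximity_cost, remaining_usable_space, matches_preference) are hypothetical names introduced only for this sketch, not the project's actual cost terms.

```python
# Hedged sketch of the per-step objective described above; weights and helper
# methods are illustrative placeholders.
WEIGHTS = {"fit": 1.0, "collision": 2.0, "packing": 0.5, "instruction": 0.5}

def placement_score(candidate, scene, instruction_pref=None, w=WEIGHTS):
    """Higher is better; candidates that violate the hard spatial-fit constraint score -inf."""
    if not candidate.fits_with_clearance(scene):          # spatial fit is a hard constraint
        return float("-inf")
    score = w["fit"] * candidate.clearance_margin(scene)  # reward extra clearance
    score -= w["collision"] * candidate.trajectory_proximity_cost(scene)
    score += w["packing"] * candidate.remaining_usable_space(scene)
    if instruction_pref is not None:                      # operator preference as a soft term
        score += w["instruction"] * candidate.matches_preference(instruction_pref)
    return score

def best_placement(candidates, scene, instruction_pref=None):
    return max(candidates, key=lambda c: placement_score(c, scene, instruction_pref))
```

In such a formulation, the relative weights encode how strongly the system trades packing efficiency and collision margins against operator preferences.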
Technical Stack
- VLA Model — End-to-end vision-language-action architecture for instruction-conditioned manipulation
- Speech Interface — Audio capture and transcription for natural instruction input
- Perception — RGB-D sensing with object detection and 6-DOF pose estimation
- Motion Planning — Collision-aware trajectory generation for the robotic arm
- Simulation — Development and evaluation in simulated environments before physical deployment

