Robotic manipulation in cluttered environments requires more than reactive control — it demands spatial reasoning, constraint satisfaction, and utility optimization. This demonstrator showcases a Vision-Language-Action (VLA) model that accepts natural audio instructions and executes pick-and-place tasks with a robotic arm, reasoning about object geometry, available space, and collision avoidance in real time.

[Figure: Reasoning agent placing geometric blocks on a bookshelf while avoiding existing objects]

Task Overview

A set of primitive geometric shapes — blocks of various sizes — is laid on a table in front of the robotic arm. The robot receives spoken instructions to pick up the blocks and shelve them in a bookshelf that already contains other objects. The agent must:
  1. Perceive the scene — identify block sizes, bookshelf layout, and existing objects
  2. Reason about placement — match each block to an appropriate shelf location given available space
  3. Plan collision-free trajectories — avoid hitting objects already on the bookshelf during insertion
  4. Optimize total shelving utility — maximize the number of blocks placed while respecting spatial constraints

Audio Instruction

Natural language commands via speech, eliminating the need for programming or teleoperation.

Spatial Reasoning

The agent reasons about object sizes, shelf dimensions, and occupied space to find valid placements.
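
As a rough illustration of the state this reasoning operates over, the sketch below defines hypothetical Python data structures for blocks and shelves. The names (Block, Shelf, Scene) and their fields are assumptions chosen for exposition, not the demonstrator's actual scene representation.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A primitive block on the table, as perceived by the agent."""
    name: str
    size: tuple[float, float, float]       # (width, depth, height) in metres
    position: tuple[float, float, float]   # centre of the block on the table


@dataclass
class Shelf:
    """One level of the bookshelf, with the space already taken by existing objects."""
    clearance_height: float                # vertical space under the shelf above
    free_spans: list[tuple[float, float]] = field(default_factory=list)
    # free_spans: open intervals along the shelf width where a block could go


@dataclass
class Scene:
    """Everything placement reasoning needs: blocks to shelve and shelves to fill."""
    blocks: list[Block]
    shelves: list[Shelf]
```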

Collision Avoidance

Trajectory planning accounts for existing objects on the bookshelf to prevent contact during placement.

Utility Maximization

The agent optimizes total shelving utility rather than following a fixed placement order.

Vision-Language-Action Model

The system is built on a VLA architecture that unifies perception, language understanding, and action generation in a single model. Unlike traditional pipelines that chain separate vision, planning, and control modules, a VLA model learns end-to-end mappings from visual observations and language instructions to robot actions.

Architecture

  • Visual Encoder — Processes RGB-D camera input to build a scene representation including object identities, poses, and the spatial layout of the bookshelf.
  • Language Encoder — Processes transcribed audio instructions into task-level goals (e.g., “place the large blue block on the top shelf”).
  • Action Decoder — Generates a sequence of end-effector waypoints conditioned on the scene representation and language embedding, producing smooth pick-and-place trajectories.
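
The three-module split can be sketched roughly as below. This is a minimal, hypothetical PyTorch-style illustration of the information flow; the module names, feature dimensions, and the seven-value waypoint encoding (position, orientation, gripper) are assumptions for exposition, not the demonstrator's actual implementation.

```python
import torch
import torch.nn as nn


class VLAPolicy(nn.Module):
    """Minimal VLA sketch: scene features and an instruction embedding go in,
    a short sequence of end-effector waypoints comes out."""

    def __init__(self, obs_dim=512, text_dim=256, hidden=512, n_waypoints=8):
        super().__init__()
        # Visual encoder: stands in for an RGB-D backbone producing a scene feature.
        self.visual_encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Language encoder: stands in for a text encoder over the transcribed instruction.
        self.language_encoder = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Action decoder: maps the fused representation to waypoints,
        # each encoded as (x, y, z, roll, pitch, yaw, gripper).
        self.action_decoder = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_waypoints * 7),
        )
        self.n_waypoints = n_waypoints

    def forward(self, scene_features, instruction_features):
        v = self.visual_encoder(scene_features)
        l = self.language_encoder(instruction_features)
        fused = torch.cat([v, l], dim=-1)
        out = self.action_decoder(fused)
        return out.view(-1, self.n_waypoints, 7)   # (batch, waypoints, pose + gripper)


# Example: one scene observation and one instruction embedding produce 8 waypoints.
policy = VLAPolicy()
waypoints = policy(torch.randn(1, 512), torch.randn(1, 256))
print(waypoints.shape)   # torch.Size([1, 8, 7])
```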

Reasoning Loop

The agent operates in a closed-loop reasoning cycle:
  1. Observe — Capture the current scene with an RGB-D camera mounted on the workspace.
  2. Listen — Receive and transcribe the operator’s spoken instruction.
  3. Plan placement — Evaluate candidate shelf locations by scoring each for spatial fit, collision clearance, and utility contribution.
  4. Execute — Generate and execute a collision-free grasp-transport-place trajectory.
  5. Verify — Re-observe the scene to confirm successful placement and update the internal state for the next instruction.
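
Put together, the cycle can be sketched as a simple control loop. Every call below (capture_rgbd, transcribe_next_utterance, enumerate_placements, and so on) is a hypothetical placeholder standing in for the corresponding component, not the system's real interface.

```python
def shelving_loop(robot, camera, microphone, planner):
    """Hypothetical closed-loop reasoning cycle for instruction-driven shelving."""
    while True:
        # 1. Observe: capture the current scene as a structured representation.
        scene = camera.capture_rgbd()

        # 2. Listen: wait for the operator to speak, then transcribe the instruction.
        instruction = microphone.transcribe_next_utterance()
        if instruction is None:   # no further instructions
            break

        # 3. Plan placement: score every candidate shelf location and keep the best.
        candidates = planner.enumerate_placements(scene, instruction)
        best = max(candidates, key=lambda c: c.utility)

        # 4. Execute: generate and run a collision-free grasp-transport-place trajectory.
        trajectory = planner.plan_trajectory(scene, best)
        robot.execute(trajectory)

        # 5. Verify: re-observe and confirm the block ended up where it was planned.
        scene = camera.capture_rgbd()
        if not planner.placement_succeeded(scene, best):
            robot.retry_or_report(best)
```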

Utility Optimization

The agent does not simply place blocks in the first available slot. It solves a constrained optimization problem at each step:
  • Spatial fit — the block must physically fit in the candidate location with clearance margins
  • Collision cost — candidate trajectories are penalized for proximity to existing objects
  • Packing efficiency — placements that leave usable remaining space are preferred over those that fragment the shelf
  • Instruction alignment — when the operator specifies a preference (e.g., “next to the red box”), the agent incorporates it as a soft constraint
The result is a placement strategy that maximizes the total number of blocks shelved while respecting physical and operator-specified constraints.
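
A simplified per-candidate score could combine those four terms as a weighted sum behind a hard feasibility check. The weights, the clearance margin, and the box-fit test below are illustrative assumptions, not the demonstrator's tuned objective.

```python
def placement_score(block_size, slot_size, collision_dist, frag_penalty,
                    matches_preference, clearance=0.01,
                    w_collision=1.0, w_frag=0.5, w_pref=2.0):
    """Score one candidate placement; higher is better, -inf means infeasible.

    block_size, slot_size -- (width, depth, height) of the block and free slot, in metres
    collision_dist        -- minimum distance of the candidate trajectory to existing objects
    frag_penalty          -- how badly this placement fragments the remaining shelf space
    matches_preference    -- whether the slot satisfies the operator's stated preference
    """
    # Hard constraint: the block must fit with a clearance margin on every axis.
    if any(b + 2 * clearance > s for b, s in zip(block_size, slot_size)):
        return float("-inf")

    score = 0.0
    score += w_collision * collision_dist   # reward trajectories that stay clear of obstacles
    score -= w_frag * frag_penalty          # prefer placements that keep remaining space usable
    if matches_preference:
        score += w_pref                     # soft constraint from the spoken instruction
    return score


# Example: a 10 cm cube into a 12 cm slot, 5 cm of trajectory clearance,
# mild fragmentation, and the slot matches "next to the red box".
print(placement_score((0.10, 0.10, 0.10), (0.12, 0.12, 0.12),
                      collision_dist=0.05, frag_penalty=0.2, matches_preference=True))
```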

Technical Stack

  • VLA Model — End-to-end vision-language-action architecture for instruction-conditioned manipulation
  • Speech Interface — Audio capture and transcription for natural instruction input
  • Perception — RGB-D sensing with object detection and 6-DOF pose estimation
  • Motion Planning — Collision-aware trajectory generation for the robotic arm
  • Simulation — Development and evaluation in simulated environments before physical deployment