Robotic manipulation in cluttered environments requires more than reactive control — it demands spatial reasoning, constraint satisfaction, and utility optimization. This demonstrator showcases a Vision-Language-Action (VLA) model that accepts natural audio instructions and executes pick-and-place tasks with a robotic arm, reasoning about object geometry, available space, and collision avoidance in real time.

[Figure: Reasoning agent placing geometric blocks on a bookshelf while avoiding existing objects]

Task Overview

A set of primitive geometric shapes — blocks of various sizes — is laid on a table in front of the robotic arm. The robot receives spoken instructions to pick up the blocks and shelve them in a bookshelf that already contains other objects. The agent must:
  1. Perceive the scene — identify block sizes, bookshelf layout, and existing objects
  2. Reason about placement — match each block to an appropriate shelf location given available space
  3. Plan collision-free trajectories — avoid hitting objects already on the bookshelf during insertion
  4. Optimize total shelving utility — maximize the number of blocks placed while respecting spatial constraints

Audio Instruction

Natural language commands via speech, eliminating the need for programming or teleoperation.

Spatial Reasoning

The agent reasons about object sizes, shelf dimensions, and occupied space to find valid placements.
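
As a rough illustration of the state this reasoning operates over, the sketch below defines hypothetical Python data structures for blocks and shelves. The names (Block, Shelf, Scene) and their fields are assumptions chosen for exposition, not the demonstrator's actual scene representation.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A primitive block on the table, as perceived by the agent."""
    name: str
    size: tuple[float, float, float]       # (width, depth, height) in metres
    position: tuple[float, float, float]   # centre of the block on the table


@dataclass
class Shelf:
    """One level of the bookshelf, with the space already taken by existing objects."""
    clearance_height: float                # vertical space under the shelf above
    free_spans: list[tuple[float, float]] = field(default_factory=list)
    # free_spans: open intervals along the shelf width where a block could go


@dataclass
class Scene:
    """Everything placement reasoning needs: blocks to shelve and shelves to fill."""
    blocks: list[Block]
    shelves: list[Shelf]
```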

Collision Avoidance

Trajectory planning accounts for existing objects on the bookshelf to prevent contact during placement.

Utility Maximization

The agent optimizes total shelving utility rather than following a fixed placement order.

Vision-Language-Action Model

The system is built on a VLA architecture that unifies perception, language understanding, and action generation in a single model. Unlike traditional pipelines that chain separate vision, planning, and control modules, a VLA model learns end-to-end mappings from visual observations and language instructions to robot actions.

Architecture

  • Visual Encoder — Processes RGB-D camera input to build a scene representation including object identities, poses, and the spatial layout of the bookshelf.
  • Language Encoder — Processes transcribed audio instructions into task-level goals (e.g., “place the large blue block on the top shelf”).
  • Action Decoder — Generates a sequence of end-effector waypoints conditioned on the scene representation and language embedding, producing smooth pick-and-place trajectories.
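
The three-module split can be sketched roughly as below. This is a minimal, hypothetical PyTorch-style illustration of the information flow; the module names, feature dimensions, and the seven-value waypoint encoding (position, orientation, gripper) are assumptions for exposition, not the demonstrator's actual implementation.

```python
import torch
import torch.nn as nn


class VLAPolicy(nn.Module):
    """Minimal VLA sketch: scene features and an instruction embedding go in,
    a short sequence of end-effector waypoints comes out."""

    def __init__(self, obs_dim=512, text_dim=256, hidden=512, n_waypoints=8):
        super().__init__()
        # Visual encoder: stands in for an RGB-D backbone producing a scene feature.
        self.visual_encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Language encoder: stands in for a text encoder over the transcribed instruction.
        self.language_encoder = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Action decoder: maps the fused representation to waypoints,
        # each encoded as (x, y, z, roll, pitch, yaw, gripper).
        self.action_decoder = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_waypoints * 7),
        )
        self.n_waypoints = n_waypoints

    def forward(self, scene_features, instruction_features):
        v = self.visual_encoder(scene_features)
        l = self.language_encoder(instruction_features)
        fused = torch.cat([v, l], dim=-1)
        out = self.action_decoder(fused)
        return out.view(-1, self.n_waypoints, 7)   # (batch, waypoints, pose + gripper)


# Example: one scene observation and one instruction embedding produce 8 waypoints.
policy = VLAPolicy()
waypoints = policy(torch.randn(1, 512), torch.randn(1, 256))
print(waypoints.shape)   # torch.Size([1, 8, 7])
```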

Reasoning Loop

The agent operates in a closed-loop reasoning cycle:
  1. Observe — Capture the current scene with an RGB-D camera mounted on the workspace.
  2. Listen — Receive and transcribe the operator’s spoken instruction.
  3. Plan placement — Evaluate candidate shelf locations by scoring each for spatial fit, collision clearance, and utility contribution.
  4. Execute — Generate and execute a collision-free grasp-transport-place trajectory.
  5. Verify — Re-observe the scene to confirm successful placement and update the internal state for the next instruction.
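
Put together, the cycle can be sketched as a simple control loop. Every call below (capture_rgbd, transcribe_next_utterance, enumerate_placements, and so on) is a hypothetical placeholder standing in for the corresponding component, not the system's real interface.

```python
def shelving_loop(robot, camera, microphone, planner):
    """Hypothetical closed-loop reasoning cycle for instruction-driven shelving."""
    while True:
        # 1. Observe: capture the current scene as a structured representation.
        scene = camera.capture_rgbd()

        # 2. Listen: wait for the operator to speak, then transcribe the instruction.
        instruction = microphone.transcribe_next_utterance()
        if instruction is None:   # no further instructions
            break

        # 3. Plan placement: score every candidate shelf location and keep the best.
        candidates = planner.enumerate_placements(scene, instruction)
        best = max(candidates, key=lambda c: c.utility)

        # 4. Execute: generate and run a collision-free grasp-transport-place trajectory.
        trajectory = planner.plan_trajectory(scene, best)
        robot.execute(trajectory)

        # 5. Verify: re-observe and confirm the block ended up where it was planned.
        scene = camera.capture_rgbd()
        if not planner.placement_succeeded(scene, best):
            robot.retry_or_report(best)
```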

Utility Optimization

The agent does not simply place blocks in the first available slot. It solves a constrained optimization problem at each step:
  • Spatial fit — the block must physically fit in the candidate location with clearance margins
  • Collision cost — candidate trajectories are penalized for proximity to existing objects
  • Packing efficiency — placements that leave usable remaining space are preferred over those that fragment the shelf
  • Instruction alignment — when the operator specifies a preference (e.g., “next to the red box”), the agent incorporates it as a soft constraint
The result is a placement strategy that maximizes the total number of blocks shelved while respecting physical and operator-specified constraints.
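
A simplified per-candidate score could combine those four terms as a weighted sum behind a hard feasibility check. The weights, the clearance margin, and the box-fit test below are illustrative assumptions, not the demonstrator's tuned objective.

```python
def placement_score(block_size, slot_size, collision_dist, frag_penalty,
                    matches_preference, clearance=0.01,
                    w_collision=1.0, w_frag=0.5, w_pref=2.0):
    """Score one candidate placement; higher is better, -inf means infeasible.

    block_size, slot_size -- (width, depth, height) of the block and free slot, in metres
    collision_dist        -- minimum distance of the candidate trajectory to existing objects
    frag_penalty          -- how badly this placement fragments the remaining shelf space
    matches_preference    -- whether the slot satisfies the operator's stated preference
    """
    # Hard constraint: the block must fit with a clearance margin on every axis.
    if any(b + 2 * clearance > s for b, s in zip(block_size, slot_size)):
        return float("-inf")

    score = 0.0
    score += w_collision * collision_dist   # reward trajectories that stay clear of obstacles
    score -= w_frag * frag_penalty          # prefer placements that keep remaining space usable
    if matches_preference:
        score += w_pref                     # soft constraint from the spoken instruction
    return score


# Example: a 10 cm cube into a 12 cm slot, 5 cm of trajectory clearance,
# mild fragmentation, and the slot matches "next to the red box".
print(placement_score((0.10, 0.10, 0.10), (0.12, 0.12, 0.12),
                      collision_dist=0.05, frag_penalty=0.2, matches_preference=True))
```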

Technical Stack

  • VLA Model — End-to-end vision-language-action architecture for instruction-conditioned manipulation
  • Speech Interface — Audio capture and transcription for natural instruction input
  • Perception — RGB-D sensing with object detection and 6-DOF pose estimation
  • Motion Planning — Collision-aware trajectory generation for the robotic arm
  • Simulation — Development and evaluation in simulated environments before physical deployment