SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

SpatiO main figure — Reliability-aware Spatial Agents Orchestration pipeline

SpatiO introduces a heterogeneous multi-agent framework for spatial reasoning. The Orchestrator classifies the query, assigns specialist roles (Implicit Visual Reasoning, Explicit 3D Reconstruction, Scene-graph Construction), and updates per-agent confidence scores through Test-Time Orchestration. A Reasoning Agent synthesizes outputs into a final answer.

Abstract

Spatial Adaptability through Heterogeneous Multi-Agent Orchestration

Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases—such as 2D appearance cues, depth signals, and geometric constraints—whose reliability varies across contexts. This suggests that effective spatial reasoning requires spatial adaptability: the ability to flexibly coordinate different reasoning strategies depending on the input.

However, most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior, limiting their ability to adapt under distribution changes. Multi-agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but prior attempts in spatial reasoning primarily employ homogeneous agents, restricting the diversity of inductive biases they can leverage.

In this work, we introduce SpatiO, a heterogeneous multi-agent framework for spatial reasoning. SpatiO adaptively orchestrates specialists with diverse spatial inductive biases at test time, guided by continuously updated confidence scores that reflect each specialist's past reliability on similar queries.

Method

Three-Stage Reliability-Aware Orchestration

Stage 1

Adaptive Role Assignment

The Orchestrator classifies the spatial query into a category and selects the top-3 specialist agents for it, returning their assigned roles together with their current confidence scores.
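The selection step above can be sketched as a top-k lookup over a per-category confidence table. This is a minimal illustration, not the released implementation: the table layout, the category key `distance_depth`, the specialist keys, and the function name `select_specialists` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def select_specialists(
    confidence: Dict[str, Dict[str, float]],
    category: str,
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k specialists with the highest confidence for a query category."""
    scores = confidence[category]
    # Sort specialists by confidence, highest first, and keep the top k.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical confidence table: category -> specialist -> confidence score.
# The fourth entry is an invented extra agent so that top-3 selection is non-trivial.
confidence = {
    "distance_depth": {
        "implicit_visual": 0.55,
        "explicit_3d": 0.80,
        "scene_graph": 0.70,
        "extra_specialist": 0.30,
    }
}

selected = select_specialists(confidence, "distance_depth")
selected_names = [name for name, _ in selected]
```

In this sketch the query category is taken as given. In SpatiO the Orchestrator itself performs the classification.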

Stage 2

Role-conditioned Specialist Execution

Each specialist leverages its assigned inductive bias—depth maps, scene-graphs, or visual reasoning traces—to produce an independent answer.

Stage 3

Reliability-aware Final Reasoning

A Reasoning Agent synthesizes all specialist outputs, weighting each by its specialist's confidence score, to produce the final answer. The confidence scores are then updated for subsequent queries.
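As a rough intuition for reliability-aware synthesis, the sketch below replaces the Reasoning Agent with a plain confidence-weighted vote over multiple-choice answers. This is a deliberate simplification and an assumption of the example: the actual Reasoning Agent reasons over the specialists' outputs rather than summing scores, and the name `weighted_vote` and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_vote(outputs: List[Tuple[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Pick the answer whose supporting specialists have the largest total confidence."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, confidence in outputs:
        totals[answer] += confidence
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Hypothetical (answer, confidence) pairs from three specialists.
outputs = [("A", 0.55), ("B", 0.80), ("B", 0.70)]
final_answer, totals = weighted_vote(outputs)
```

Here the two specialists that agree on "B" outweigh the single specialist answering "A", mirroring the qualitative cases where a mistaken specialist is overridden.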

Figure 1. Full pipeline of SpatiO's Reliability-aware Spatial Agents Orchestration. Specialist confidence scores are updated via a Bayesian update followed by a Dual EMA at every test-time step.

Test-Time Orchestration confidence update loop

Figure 2. Test-Time Orchestration (TTO) confidence score update pipeline. At each query step t, specialist outputs are scored, rewards are computed and scaled, then a Bayesian update followed by Dual EMA produces the updated confidence score s^(t+1).
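The update loop in Figure 2 can be sketched as follows. The page does not give the exact formulas, so everything concrete here is an assumption: the Bayesian step is modeled as a Beta-Bernoulli posterior over a reward already scaled to [0, 1], and "Dual EMA" is interpreted as a blend of a fast and a slow exponential moving average of the posterior mean. The class `SpecialistState`, the function `update_confidence`, and the rates `fast_rate`, `slow_rate`, and `mix` are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class SpecialistState:
    alpha: float = 1.0  # Beta pseudo-count of successes (assumed uniform prior)
    beta: float = 1.0   # Beta pseudo-count of failures
    fast: float = 0.5   # fast-moving EMA of the posterior mean
    slow: float = 0.5   # slow-moving EMA of the posterior mean


def update_confidence(
    state: SpecialistState,
    reward: float,
    fast_rate: float = 0.5,
    slow_rate: float = 0.1,
    mix: float = 0.5,
) -> float:
    """Apply one test-time step and return the new confidence score."""
    if not 0.0 <= reward <= 1.0:
        raise ValueError("reward must already be scaled to [0, 1]")
    # Assumed Bayesian step: treat the scaled reward as a fractional success.
    state.alpha += reward
    state.beta += 1.0 - reward
    posterior_mean = state.alpha / (state.alpha + state.beta)
    # Assumed Dual EMA: track the posterior mean at two different speeds.
    state.fast += fast_rate * (posterior_mean - state.fast)
    state.slow += slow_rate * (posterior_mean - state.slow)
    # Blend the two averages into the confidence used at the next step.
    return mix * state.fast + (1.0 - mix) * state.slow


state = SpecialistState()
score_after_success = update_confidence(state, reward=1.0)
score_after_failure = update_confidence(state, reward=0.0)
```

Under these assumptions, the fast average lets a specialist's score react quickly to recent outcomes, while the slow average keeps one unlucky query from erasing its longer-term record.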

Results

Qualitative Analysis

SpatiO's multi-agent design enables each specialist to leverage complementary spatial signals. Below we show side-by-side outputs from the Head-agent (routing), three specialist agents, and the Reasoning Agent's final synthesis.

Qualitative result 1 — wine bottle vs white chair

Figure 3. SpatiO correctly resolves a distance & depth query. The heuristic specialist (Qwen3-4B) returns an incorrect answer based on 2D appearance, while the 3D reconstruction specialist and scene-graph specialist both provide correct reasoning. The Reliability-aware Reasoning Agent overrides the heuristic and outputs the correct answer: (B) far away from each other.

Qualitative result 2 — minibus and bicycle orientation

Figure 4. SpatiO correctly resolves an orientation query. The Head-agent assigns Heuristic and 2D Scene-graph specialists. Despite the heuristic agent's initial error, the 3D reconstruction specialist (SpatialReasoner) and scene-graph specialist (Sa2VA) both agree on (D) right, which the Reasoning Agent confirms as the final answer.

Benchmark Comparison

Method	MMSI-Bench	STVQA-7k	CV-Bench	3DSRBench	Avg.
LLaVA-4D	23.2	57.2	68.3	49.7	49.6
SpatialRGPT	17.3	67.1	61.0	39.8	46.3
Sa2VA	8.7	65.3	70.2	48.5	48.2
SpatialReasoner	22.1	63.4	77.4	54.3	54.3
Qwen-3.0-VL-4B	24.1	77.9	84.4	59.1	61.4
SpatiO (Ours)	43.6	88.2	86.9	72.4	72.8
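The Avg. column appears to be the unweighted mean of the four benchmark scores. The short check below recomputes it from the table values. The dictionary layout and variable names are illustrative, and the assumption that Avg. is a simple mean rounded to one decimal is inferred from the numbers rather than stated on the page.

```python
# Per-benchmark scores copied from the table:
# (MMSI-Bench, STVQA-7k, CV-Bench, 3DSRBench) and the reported Avg.
table = {
    "LLaVA-4D": ((23.2, 57.2, 68.3, 49.7), 49.6),
    "SpatialRGPT": ((17.3, 67.1, 61.0, 39.8), 46.3),
    "Sa2VA": ((8.7, 65.3, 70.2, 48.5), 48.2),
    "SpatialReasoner": ((22.1, 63.4, 77.4, 54.3), 54.3),
    "Qwen-3.0-VL-4B": ((24.1, 77.9, 84.4, 59.1), 61.4),
    "SpatiO (Ours)": ((43.6, 88.2, 86.9, 72.4), 72.8),
}

# Recompute each row's mean, rounded to one decimal like the table.
recomputed = {
    name: round(sum(scores) / len(scores), 1)
    for name, (scores, _reported) in table.items()
}

# Rows whose recomputed mean differs from the reported Avg. value.
mismatches = [
    name for name, (_scores, reported) in table.items()
    if recomputed[name] != reported
]
```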