SpatialClaw

Rethinking the Action Interface for Agentic Spatial Reasoning

Code is the right action interface for spatial reasoning agents.

Seokju Cho¹,², Ryo Hachiuma¹, Abhishek Badki¹, Hang Su¹, Byung-Kwan Lee¹, Chan Hee Song¹, Sifei Liu¹,
Subhashree Radhakrishnan¹, Seungryong Kim², Yu-Chiang Frank Wang¹, Min-Hung Chen¹

¹NVIDIA · ²KAIST AI

Work done during Seokju Cho's internship at NVIDIA.

Per-benchmark accuracy: SpatialClaw vs. prior spatial agents across 20 benchmarks (each axis normalized so SpatialClaw is the outer reference ring; baselines inside the green ring lose to SpatialClaw)

TL;DR. SpatialClaw is a training-free spatial reasoning framework that treats code as the action interface: a VLM-backed agent writes one Python cell per step into a persistent Jupyter kernel pre-loaded with perception primitives (SAM3 segmentation, Depth-Anything-3 reconstruction, geometry utilities) and scientific libraries (NumPy, SciPy, Matplotlib). Each cell can compose tool outputs, inspect intermediate evidence, and revise the analysis before committing an answer with ReturnAnswer(...). Across 20 spatial reasoning benchmarks, SpatialClaw reaches 59.9% average accuracy, outperforming the prior best spatial agent by +11.2 points — with the same system prompt, tool set, and hyperparameters across all benchmarks and six VLM backbones (Qwen3.5/3.6, Gemma4) from 26B to 397B parameters.

📄 Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by pairing VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface, which often offers less flexibility for freely composing operations or tailoring the analysis to each task. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives. A VLM-backed agent writes one executable cell per step, conditioned on all prior outputs, so it can flexibly compose and manipulate perception results and adapt its analysis to intermediate textual and visual observations as well as to the demands of each problem. Evaluated on 20 benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the prior best spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families and without any benchmark- or model-specific adaptation.

🔍 What this repo contains. This is the official implementation of the paper. It includes the full agent runtime (LangGraph workflow, persistent Jupyter kernel, AST safety check, planning/reflection loops), all 20 benchmark loaders, perception tool wrappers, a FastAPI-served GPU tool server, vLLM auto-discovery and load balancing, and the SLURM launch managers used to reproduce every experiment in the paper.

How It Works

For every sample, SpatialClaw runs a five-stage loop on top of a persistent Python kernel: a planner drafts a strategy, the main VLM writes one Python cell per step, the cell is AST-checked and executed in a stateful kernel, and stdout / new variables / show() images flow back as the next observation — repeating until the agent commits with ReturnAnswer(...).
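The loop above can be sketched in a few lines of plain Python. This is an illustration only, not the repo's runtime: a dict stands in for the Jupyter kernel, a fixed list of strings stands in for the cells the VLM would write, and the deny-lists (`BANNED_CALLS`, `BANNED_MODULES`) and helper names (`check_cell`, `run_cell`, `solve`) are hypothetical. The real AST safety check, planner, and `show()` image channel are not reproduced here.

```python
import ast
import contextlib
import io

# Hypothetical deny-lists; the rules of the real AST safety check are not shown here.
BANNED_CALLS = {"eval", "exec", "__import__", "open"}
BANNED_MODULES = {"os", "subprocess", "shutil"}


class AnswerCommitted(Exception):
    """Raised by the stand-in ReturnAnswer to end the loop."""

    def __init__(self, answer):
        super().__init__(answer)
        self.answer = answer


def return_answer(answer):
    raise AnswerCommitted(answer)


def check_cell(source):
    """Walk the cell's AST and return a list of violations (empty if clean)."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            names = []
        problems += [f"import of {n}" for n in names if n in BANNED_MODULES]
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            problems.append(f"call to {node.func.id}")
    return problems


def run_cell(source, namespace):
    """Check one cell, execute it in the shared namespace, return the observation."""
    problems = check_cell(source)
    if problems:
        return "REJECTED: " + ", ".join(problems)
    before = set(namespace)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, namespace)  # state persists because the dict is reused
    new_vars = sorted(k for k in set(namespace) - before if not k.startswith("__"))
    return f"stdout: {buffer.getvalue().strip()} | new variables: {new_vars}"


def solve(cells):
    """Run cells in order until one calls ReturnAnswer."""
    namespace = {"ReturnAnswer": return_answer}
    observations = []
    for source in cells:
        try:
            observations.append(run_cell(source, namespace))
        except AnswerCommitted as done:
            return done.answer, observations
    return None, observations


# Toy cells standing in for VLM output; the depth values are made up.
answer, observations = solve([
    "depths = {'chair': 2.0, 'table': 3.5}",
    "import os",                                   # rejected by the AST check
    "print(min(depths, key=depths.get))",          # reuses state from the first cell
    "ReturnAnswer(min(depths, key=depths.get))",   # commits the answer 'chair'
])
```

The point of the sketch is the shape of the interface: each cell sees every variable left behind by earlier cells, a rejected cell becomes an observation rather than a crash, and the loop ends only when the cell itself commits an answer.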

At runtime, three independent services — a vLLM backbone, a GPU perception-tool server (Reconstruct / SAM3), and the agent (Jupyter kernels) — coordinate through shared JSON registries and survive SLURM job-time limits via auto-restarting chain jobs. No SLURM cluster? Each service is also a plain Python entry point you can run on any GPU machine.
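One way to picture coordination through a shared JSON registry is the sketch below. It assumes a schema that is not taken from the repo: a single JSON file mapping a service name to a URL and a timestamp. The function names `register` and `discover`, the `updated_at` field, and the staleness window are all hypothetical. A restarted job would call `register` again with its new address, and clients would pick up the change the next time they call `discover`.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def register(registry_path, name, url):
    """Add or refresh one service entry, replacing the file atomically."""
    path = Path(registry_path)
    entries = json.loads(path.read_text()) if path.exists() else {}
    entries[name] = {"url": url, "updated_at": time.time()}
    # Write to a temp file in the same directory, then rename over the target,
    # so a reader never sees a half-written registry.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(entries, handle, indent=2)
    os.replace(tmp_name, path)


def discover(registry_path, max_age_s=300.0):
    """Return {name: url} for entries refreshed within the last max_age_s seconds."""
    path = Path(registry_path)
    if not path.exists():
        return {}
    now = time.time()
    entries = json.loads(path.read_text())
    return {
        name: entry["url"]
        for name, entry in entries.items()
        if now - entry["updated_at"] <= max_age_s
    }
```

The read-modify-write in `register` is not safe against two services writing at the same moment; a real deployment would need a file lock or one registry file per service.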

➡ Full details: docs/architecture.md

Quickstart

# 1. Clone with submodules and install (agent + vLLM envs, ~15–30 min)
git clone --recursive https://github.com/NVlabs/SpatialClaw.git
cd SpatialClaw
bash spatial_agent/scripts/setup.sh

# 2. Add API keys — or use self-hosted vLLM with no key
cp .env.example .env        # then edit

# 3. Run an experiment (single machine, no SLURM)
python -m spatial_agent.entrypoints.run \
    --dataset spatial_agent/config/dataset/erqa.json \
    --model   spatial_agent/config/model/gemini-3-pro.json \
    --concurrency 4

Pre-downloading model weights is mandatory before SLURM runs, and the vLLM/SLURM setup has extra steps — see Installation and Running experiments.

Documentation

Guide	Contents
📦 Installation	Prerequisites, conda / vLLM environments, third-party submodules, API keys & `.env`
🚀 Running experiments	SLURM setup, pre-downloading weights, launch managers & direct CLI, reproducing paper tables
📊 Monitoring & logs	Dashboards, SLURM logs, per-sample outputs, stopping services
⚙️ Configuration	Model / dataset JSON configs, env-var overrides, supported benchmarks
🧠 Architecture	Agentic loop, three-service runtime, directory structure
🛠️ Troubleshooting	Common errors and fixes

Supported Benchmarks

All 20 paper benchmarks ship as ready-to-run dataset configs under spatial_agent/config/dataset/:

Category	Benchmarks
Single-image spatial reasoning	ERQA, Omni3D, OmniSpatial, SPBench
Multi-view spatial reasoning	MindCube, MMSI, SPAR-Bench
General spatial reasoning	BLINK, SpatialTree, ViewSpatial
Video spatial & 4D reasoning	MMSI-Video, OSI-Bench, PAI-Bench, VSI-Bench-U, VSTI-Bench, DSI-Bench
General video understanding	CV-Bench, PerceptComp, Video-MME, Video-MME-v2

Details and per-benchmark loaders: docs/configuration.md.
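To run more than one benchmark, the Quickstart command can be generated once per dataset config. The helper below is a sketch rather than part of the repo. It assumes that every `*.json` file in `spatial_agent/config/dataset/` is a runnable dataset config and that all of them accept the same three flags shown in the Quickstart. `build_command` and `build_sweep` are hypothetical names. The script only prints the commands and does not execute them.

```python
import shlex
from pathlib import Path

DATASET_DIR = Path("spatial_agent/config/dataset")


def build_command(dataset_config, model_config, concurrency=4):
    """Build the Quickstart command line for one dataset config."""
    return [
        "python", "-m", "spatial_agent.entrypoints.run",
        "--dataset", str(dataset_config),
        "--model", str(model_config),
        "--concurrency", str(concurrency),
    ]


def build_sweep(model_config, dataset_dir=DATASET_DIR):
    """Return one command per *.json file in the dataset config directory."""
    configs = sorted(Path(dataset_dir).glob("*.json"))
    return [build_command(config, model_config) for config in configs]


if __name__ == "__main__":
    # Print the commands; run them with a shell or subprocess.run as needed.
    for command in build_sweep("spatial_agent/config/model/gemini-3-pro.json"):
        print(shlex.join(command))
```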

Citation

If you find SpatialClaw useful, please cite the paper:

@article{cho2026spatialclaw,
  title   = {SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning},
  author  = {Cho, Seokju and Hachiuma, Ryo and Badki, Abhishek and
             Su, Hang and Lee, Byung-Kwan and Song, Chan Hee and
             Liu, Sifei and Radhakrishnan, Subhashree and Kim, Seungryong and
             Wang, Yu-Chiang Frank and Chen, Min-Hung},
  journal = {arXiv preprint},
  year    = {2026}
}

Licenses

This work is made available under the NVIDIA Source Code License-NC. A copy of the license is in the LICENSE file.

This work will download and install additional third-party open source software projects. Review the license terms of these projects before use (see the corresponding tools/third_party/<repo>/LICENSE).

