Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding

Dec 27, 2025·

Khoa Vo

Taisei Hanyu

Yuki Ikebe

Trong Thang Pham

Nhat Chung

Minh Nhat Vu

Duy Nguyen Ho Minh

Anh Nguyen

Anthony Gunderman

Chase Rainwater

Ngan Le

· 1 min read

PDF Preprint Code Project

Publication

arXiv preprint (2025), under submission at IEEE Transactions on Robotics (T-RO)

Overview

Recent Vision-Language-Action models have made strong progress by post-training large Vision-Language Models for action prediction, but many still entangle perception and control in a single monolithic pipeline. In practice, that weakens language-conditioned grounding: policies may over-grasp when the target is absent, drift toward clutter, or overfit to background appearance. OBEYED-VLA addresses this by explicitly separating perceptual grounding from action reasoning before control.

TL;DR

Status: Under submission at IEEE Transactions on Robotics (T-RO).
Problem: Monolithic VLAs can erode language-conditioned grounding, leading to failure in clutter, absent-target cases, background shifts, and unseen-object manipulation.
Method: OBEYED-VLA disentangles perception from control by grounding multi-view inputs into task-conditioned object-centric and geometry-aware observations before VLA action prediction.
Key result: On a real-world UR10e tabletop setup, the method improves robustness over strong VLA baselines across distractor-heavy, absent-target, background-shift, and unseen-object regimes.

Qualitative Result: Cluttered Scenes with Distractor Objects

In this distractor-scene example, the grounded observations suppress irrelevant objects and preserve the task-relevant target, helping the executor stay aligned with the instruction instead of drifting toward visually salient clutter.

Project Page

For the full method, additional videos, and quantitative experiments, see the OBEYED-VLA project page.

Last updated on Dec 27, 2025