Towards generalist Vision Language Action models

![image](https://hackmd.io/_uploads/rkGvvg-3-x.png)

Towards generalist Vision Language Action models
===

## Students

Paul Misterka, 5327946, pmmisterka@tudelft.nl
Marcin Jarosz, 5445019, m.w.jarosz@student.tudelft.nl
Rafael Alani, 5701732, r.a.alani-1@student.tudelft.nl

## Table of contents

[TOC]

## Abstract

Vision Language Action (VLA) models are a class of foundational models for robotic learning. Combining perception, planning, and control in a single model, VLAs leverage recent advances in computer vision, deep learning, and natural language processing to generalize across a range of difficult robotics problems. However, as a very recent and often secretive development, VLAs are poorly understood; their performance is often difficult to reproduce or even evaluate.

In this blog we introduce BOard Manipulation Benchmark (BOMB), a new simulation benchmark for complex, long-horizon, small-object manipulation tasks. We finetune and evaluate SOTA VLAs on BOMB to measure their efficiency and generalization power. Additionally, we analyze VLA failure modes using intervention-based explainability and suggest directions for further research.

We share our code at github.com/squarerootminusone/dsait4125.

## Problem

Robotics is a natural extension of computer vision. Combining perception, planning, and control together, advancements in robotics promise to bring general-purpose robots that can fuel the next industrial revolution. Indeed, we already have large-scale robotic deployments, e.g. in highly automated Amazon warehouses or in Waymo's autonomous driving fleets.

![kuka-production_systems (1)](https://hackmd.io/_uploads/HJnRTvZ2-x.png)

However, current approaches in robotics fail to generalize well. Deployments are limited to precisely mapped, stable environments like modern factory floors.
Robots are trained to perform a limited range of tasks and rely on carefully crafted data modality fusions and collection mechanisms for reliability: a Waymo vehicle uses 13 cameras, 5 lidars, and 6 radars, processing over a terabyte of data every hour. In contrast, humans can learn to drive in hours, and to a large extent rely on stereo vision only. By comparison, current methods in robotics are inefficient.

The root cause is not a lack of engineering effort but a fundamental limitation of classical methods: they cannot generalize. Rule-based NLP systems hit a wall that no amount of hand-engineering could overcome, until transformers and large-scale pretraining changed the paradigm entirely. Classical robotics pipelines are hitting the same wall. The answer arriving in robotics is structurally identical: replace hand-crafted modules with end-to-end learned models, scale them up, and throw diverse data at them. Vision-Language-Action models are that answer.

### Vision Language Models (VLM)

Recent years have seen the rise of Vision-Language Models (VLMs), models like GPT-4V, LLaVA, and Gemini that can look at an image and talk about it. Under the hood, they chop images into visual tokens and process them alongside text, learning to bridge the gap between seeing and understanding.

However, standard VLMs are fundamentally passive. They can describe a scene, but they cannot alter it. A Vision-Language-Action (VLA) model aims to close the sensorimotor loop.
Formally, a VLA can be defined as a parameterized policy $\pi_\theta$ that maps a sequence of visual observations $o_{1:t}$ and a natural language instruction $l$ to an action $a_t$:

$$a_t \sim \pi_\theta(a_t | o_{1:t}, l)$$

where the action space $A$ typically comprises the end-effector pose (in our case a 3-Degree-of-Freedom translation vector) and the gripper state:

$$a_t = [x, y, z, g] \in \mathbb{R}^4$$

### OpenVLA: The Open-Source Standard

To make this concrete, consider OpenVLA, which illustrates how a VLA actually turns pixels and language into actions. It is built on a 7B-parameter Llama 2 backbone with a pretrained visual encoder (DINOv2 + SigLIP) that projects image patches into the language model's token space. The architecture is end-to-end: visual tokens and the language instruction are concatenated and passed through the transformer, with the final tokens decoded as discretized action bins (256 bins per action dimension), mapped back to continuous values at inference time.

![image](https://hackmd.io/_uploads/ByOpEeb2-g.png)
_Architecture diagram of OpenVLA_

Its training corpus is Open X-Embodiment, a large-scale cross-embodiment dataset aggregating over 1 million trajectories across 22 robot platforms. This breadth gives OpenVLA strong semantic generalization: it transfers to unseen objects and instructions with minimal fine-tuning. The architecture and training pipeline are fully open-sourced, making it the de facto baseline for VLA research, although other architectures perform better on existing benchmarks.

### So what's the issue?

![Screenshot 2026-04-06 at 21.39.08](https://hackmd.io/_uploads/ryiknY-3Ze.png)

However, deep learning approaches in computer vision are notoriously data- and compute-hungry. The complexity of VLAs only compounds this problem, requiring new architectures that often combine key DL computer vision techniques.
Moreover, robotics benchmarks must be interactive, consistent with the laws of physics, and representative of a large number of real-world tasks. Together, these constraints make VLA research:

- fragmented - with over 170 VLA submissions to ICLR 2026 (up from 1 in 2024), VLA research is a wild west with an extremely low signal-to-noise ratio,
- poorly generalizable - with just 3 mainstream VLA benchmarks (one of which is solved), existing VLA methods maximize rankings and overfit to a small subset of tasks that does not represent real-world tasks well,
- inefficient - in contrast to LLMs, VLAs lack an ecosystem that would promote reproducibility and lower hardware entry requirements for institutions and individuals with limited resources.

### Research goal

In our research we aim to address these weaknesses. We pose the following 3 research questions:

- How well do SOTA VLAs generalize to small-object manipulation tasks?
- How much do SOTA VLAs overfit to experimental setups in existing benchmarks?
- How compute efficient are SOTA VLAs?

Together, these questions form a comparative evaluation of current SOTA VLAs. Our hope is that researchers will use this comparison to gain insights into generalization power, efficiency, and performance on a held-out benchmark that paper authors could not optimize for.

## Experiments

We evaluate SOTA VLA models on BOMB under two conditions (with and without visual perturbations) across performance and efficiency axes. We first survey existing benchmarks and their limitations, then motivate the design of BOMB, describe the models and training procedures, and finally present results.

### Existing Benchmarks

Existing VLA benchmarks primarily test semantic generalization: following new instructions or manipulating unseen objects. They say much less about spatial precision: can the model place an object at a specific coordinate, not just roughly in the right area? We survey the landscape below, then introduce BOMB to target this gap.

| Benchmark | Simulator | Tasks | Focus | Limitations |
|---|---|---|---|---|
| **LIBERO** | MuJoCo (robosuite) | 130 | Lifelong/multi-task learning, language conditioning | Near-saturated (>97% SOTA); homogeneous train/test splits encourage memorization over true generalization |
| **LIBERO-PRO** | MuJoCo (robosuite) | ext. of LIBERO | Perturbation-based robustness (objects, positions, instructions, environments) | Models achieving >90% on LIBERO collapse to ~0% under position or instruction perturbations, exposing memorization |
| **LIBERO-Plus** | MuJoCo (robosuite) | ext. of LIBERO | Fine-grained robustness across 7 perturbation dimensions with 5 difficulty levels | Reveals extreme sensitivity to camera and robot state changes; models largely ignore language instructions |
| **SimplerEnv** | ManiSkill2 | 8 | Sim-to-real correlation proxy | Small task set; designed to *validate* real-world results, not stress-test capabilities |
| **CALVIN** | PyBullet | 34 | Long-horizon sequential instruction following (up to 5 chained) | Tabletop-only; structurally similar scenes limit visual diversity |
| **RoboCasa** | MuJoCo (robosuite) | 100 | Realistic kitchen environments, compositional tasks | Household-specific; large objects and coarse placement tolerances |

These benchmarks have been instrumental in driving VLA progress. Notably, the LIBERO extensions (LIBERO-PRO, LIBERO-Plus) have shown that SOTA VLAs are far more brittle than vanilla LIBERO scores suggest: models that score >90% collapse under even modest perturbations to object positions or scene layouts.

However, even these extended benchmarks share a blind spot: the manipulation targets are typically large, placement tolerances are generous, and success criteria rarely penalize centimeter-level inaccuracy. A model can score well on LIBERO by getting a block roughly onto a plate; it does not need to hit a very specific spatial location.
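
To make the tolerance point concrete, here is a minimal sketch of a placement success check. It assumes success is decided by thresholding the planar distance to the target; the function name `placement_success` and both tolerance values are illustrative assumptions, not taken from LIBERO or any other benchmark.

```python
import math


def placement_success(placed_xy, target_xy, tolerance_m):
    """Return True if the object lies within tolerance_m of the target (planar distance)."""
    dx = placed_xy[0] - target_xy[0]
    dy = placed_xy[1] - target_xy[1]
    return math.hypot(dx, dy) <= tolerance_m


# Hypothetical outcome: the object lands 3 cm away from the target.
placed, target = (0.03, 0.0), (0.0, 0.0)

coarse = placement_success(placed, target, tolerance_m=0.05)  # generous tolerance (assumed)
fine = placement_success(placed, target, tolerance_m=0.01)    # tight tolerance (assumed)
```

The same 3 cm error passes the generous check and fails the tight one, which is the regime a small-object benchmark is meant to probe.
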

### BOard Manipulation Benchmark (BOMB)

To expose the spatial reasoning and fine-motor control capabilities of SOTA VLAs we introduce **BOard Manipulation Benchmark (BOMB)**, a benchmark for board game playthrough evaluations. Due to the limited duration of the project, BOMB introduces a single Go stone manipulation task: given a command, e.g. "place the black stone at position (3,4)", the robot arm must place the nearest Go stone on the desired position of a 5x5 Go board.

![image (38)](https://hackmd.io/_uploads/rJDyVAxn-g.png)
_Example Go-playing task. The robot must locate and grasp a stone, then transport it to the chosen place on the board._

We choose Go as a starting point for BOMB due to the game's visual clarity: the board and stones are symmetric and have uniform coloring. Nevertheless, in contrast to LIBERO tasks, Go stones are much smaller and require finer observational and motor control skills. This moves BOMB out of the distribution of LIBERO, into a regime where we expected performance to degrade. As our results show, it collapses entirely under visual perturbations.

To enable BOMB's evaluation across a wider range of VLAs we implement our benchmark in two simulators: MuJoCo and RoboTwin 2.0. While MuJoCo is the most widely used in the deep learning robotics community due to its simplicity, newer benchmarks tend to leverage more powerful and accurate simulation environments. In particular, we note that it is difficult to tune MuJoCo's physics and simulation optimization parameters for realistic interactions between small rigid objects, as in our Go-playing task.

#### Perturbations

As one of BOMB's primary goals is evaluating VLA generalization, we introduce a set of optional perturbations to:

- table color and texture,
- board and stone reflectivity,
- number and layout of stones,
- initial stone position,
- image color balance and brightness,
- camera FoV and angle.

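
The perturbation options above could be sampled once per episode along the following lines. This is a hypothetical sketch, not BOMB's actual configuration: the `Perturbation` fields, the `sample_perturbation` helper, and all value ranges are assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Perturbation:
    table_color: tuple        # RGB, each channel in [0, 1]
    reflectivity: float       # board and stone reflectivity
    num_stones: int           # number of stones in the scene
    stone_offset_xy: tuple    # shift of the initial stone position, in meters
    brightness: float         # multiplicative image brightness
    camera_fov_deg: float     # camera field of view
    camera_yaw_deg: float     # camera angle offset


def sample_perturbation(rng: random.Random) -> Perturbation:
    """Draw one perturbation setting; all ranges are illustrative assumptions."""
    return Perturbation(
        table_color=tuple(rng.random() for _ in range(3)),
        reflectivity=rng.uniform(0.0, 0.5),
        num_stones=rng.randint(1, 5),
        stone_offset_xy=(rng.uniform(-0.02, 0.02), rng.uniform(-0.02, 0.02)),
        brightness=rng.uniform(0.7, 1.3),
        camera_fov_deg=rng.uniform(40.0, 60.0),
        camera_yaw_deg=rng.uniform(-10.0, 10.0),
    )


# A seeded generator keeps the sampled episodes reproducible.
rng = random.Random(0)
episode_perturbations = [sample_perturbation(rng) for _ in range(3)]
```

Seeding the generator means the same perturbed scenes can be replayed for every model, so differences in success rate are not caused by different draws.
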

![image (40)](https://hackmd.io/_uploads/HkRqkJW2Wx.png)

While these perturbations help better reflect real-world conditions, their main goal is to work as an implicit regularizer. Instead of letting the model memorize the exact scene layout, we force it to learn transferable cues, e.g. to recognize a Go stone through visual cues rather than its initial position. In turn, the model must generalize better.

### Models

In addition to OpenVLA, we perform our experiments on a set of recent open-source VLA models. The model choices are guided by favorable reproducibility characteristics (hardware requirements, boilerplate availability) and substantial architectural changes to the visual pipeline of OpenVLA. These models are:

1. **OpenVLA-OFT** (2025) - introduces a continuous action representation with parallel decoding. This improves inference efficiency and removes discretization errors, drastically improving real-world performance.
2. **SpatialVLA** (2025) - injects 3D spatial structure (egocentric positional information) into perception and action VLA blocks. By providing an explicit geometric grounding prior, SpatialVLA helps the model learn and exploit a better spatial representation.
3. **π₀** (2024) - introduces action generation through flow matching over continuous action chunks and a larger VLM backbone. The stronger pretrained vision backbone and control mechanism help the model execute more complex tasks reliably.
4. **Motus** (2025) - uses a latent action unified world model inside a Mixture-of-Transformers architecture. Motus jointly learns a latent world model and an action policy, sharing representations across vision, language, and dynamics to improve generalization and data efficiency.

#### Training

We finetune each model on a single RTX 6000 Pro or RTX 5090 (where the training fits in 32GB VRAM). During training we freeze VLM backbones.
The following hyperparameters are based on defaults recommended by each model's respective authors, adjusted where we observed training instability or divergence. They were not exhaustively optimized due to compute constraints.

| Hyperparam | OpenVLA | OpenVLA-OFT | SpatialVLA | π₀ | Motus |
|----------------------|------------:|------------:|-----------:|----------:|------:|
| Batch size | 128 | 16 | 256 | 32 | 2 |
| Training steps | 10,000 | 15,000 | 10,000 | 4,000 | 10,000 |
| Learning rate | 5e-4 | 1e-5 | 5e-4 | 2e-5 | 2e-5 |
| LoRA rank | 32 | 32 | 32 | - | - |

_Motus hyperparameters are listed for completeness; Motus was not evaluated due to compute and time constraints._

## Results

![Screenshot 2026-04-06 at 21.40.05](https://hackmd.io/_uploads/S1jmht-3bg.png)
_Unperturbed board configurations. Target intersection highlighted in blue._

#### Performance

Without perturbations

| Model | Train Loss | Train Acc | Val Loss | Val Acc | Val L1 | Success rate |
|-------|------------|-----------|----------|---------|--------|--------|
| OpenVLA | 0.015 | 98.2% | 0.046 | 65.3% | 0.048 | <b>34%</b> |
| OpenVLA-OFT | 0.013 | - | 0.046 | - | 0.049 | - |
| SpatialVLA | - | - | - | - | - | - |
| π₀ | - | - | - | - | - | - |

With perturbations

| Model | Train Loss | Train Acc | Val Loss | Val Acc | Val L1 | Success rate |
|-------|------------|-----------|----------|---------|--------|--------|
| OpenVLA | 0.255 | 89.1% | 2.59 | 45.0% | 0.077 | <b>0%</b> |
| OpenVLA-OFT | 0.045 | - | 0.069 | - | 0.065 | 0% |
| SpatialVLA | 0.069 | 97.3% | 2.13 | 47.8% | 0.063 | 0% |
| π₀ | 0.0068 | - | 0.0091 | - | 0.083 | 0% |

#### Efficiency

| Model | Params | VRAM (model only) | Fine-tune Method | Training Time | Inference Speed |
|-------|--------|-------|------------------|------------|-----------------|
| OpenVLA | 7B | ~14GB | LoRA | 2.5 hrs | ~3 Hz |
| OpenVLA-OFT | 7.7B | ~16GB | LoRA | 7.4 hrs | ~3 Hz |
| SpatialVLA | 4B | ~8GB | LoRA | 56 min | ~5 Hz |
| π₀ | 3.3B | ~7GB | Full bf16 | 2.1 hrs | ~8 Hz |

Accuracy metrics are omitted for OpenVLA-OFT and π₀ because these models use continuous action representations rather than discretized token prediction, making token-level accuracy undefined. More critically, several cells are missing because we simply did not have the time and compute to fully train every model; Motus is absent from the results entirely for this reason.

It is also important to note that the offline metrics (train/val loss, accuracy, and L1 error) measure single-step prediction quality on a fixed dataset. They do not capture what happens during an actual simulator rollout, where errors compound across hundreds of sequential action steps, the robot encounters visual states that never appeared in training, and a small early mistake can cascade into complete task failure. This is precisely why success rate is the metric that matters: a model can achieve low L1 error on individual action predictions while still failing every rollout, because the distribution of states it encounters during execution drifts far from what it was trained on. OpenVLA-OFT under perturbations is a direct illustration of this problem: its offline metrics look reasonable (validation L1 of 0.065), yet it fails every rollout.

Of the models we did train, OpenVLA is the only one to achieve a non-zero success rate, reaching 34% on the unperturbed benchmark. This is a meaningful result that confirms the task is learnable, but it is far from solved. Under perturbations, every evaluated model collapses to 0%: even the best-performing model has learned a policy that is entirely brittle to visual distribution shift.

Given these constraints, we made a deliberate choice to invest our limited hardware budget into understanding why models fail rather than exhaustively training all baselines. The intervention analysis and activation maps that follow are the result of that tradeoff.

### Failure modes

To study why OpenVLA fails on BOMB, we use intervention-based explanations rather than attention maps.
In the image setting, we mask one visual patch at a time and measure the drop in the likelihood of the originally predicted action tokens. We chose patch masking because it gives a causal test: if removing a region changes the action distribution, that region was functionally important for the action. In the text setting, we mask selected instruction tokens through the attention mask and again measure the change in the predicted action. We chose this because our task instruction is highly structured, letting us directly test whether the policy uses color words and target coordinates.

More formally, for a given image $o$, instruction $l$, and clean greedy action-token sequence $\hat{y}_{1:K}$, we first compute the forced log-probability of that clean sequence:

$$\mathcal{L}_{\text{clean}}(o,l)=\sum_{k=1}^{K}\log p_\theta(\hat{y}_k \mid o,l,\hat{y}_{<k}).$$

For image interventions, let $o^j$ denote the image where patch $j$ is replaced by a mean-color occluder. The patch effect score is

$$E_{\text{patch}}(j)=\mathcal{L}_{\text{clean}}(o,l)-\sum_{k=1}^{K}\log p_\theta(\hat{y}_k \mid o^j,l,\hat{y}_{<k}).$$

For text interventions, let $M_m$ be the prompt-token positions belonging to instruction span $m$, and let $l^{M_m}$ denote the same prompt with those attention-mask entries set to zero. The text effect score is

$$E_{\text{text}}(m)=\mathcal{L}_{\text{clean}}(o,l)-\sum_{k=1}^{K}\log p_\theta(\hat{y}_k \mid o,l^{M_m},\hat{y}_{<k}).$$

Large positive values of $E_{\text{patch}}$ or $E_{\text{text}}$ indicate that the removed visual region or text span was causally important for the original action prediction.

Our action intervention maps show a consistent pattern. Across many action steps, the model is strongly sensitive to patches containing the puck. During pickup and release, it also becomes sensitive to the gripper, which suggests that OpenVLA tracks the local interaction between end-effector and object reasonably well.
However, we rarely observe strong sensitivity to the target board intersection itself. In other words, *the model appears to know where the puck is, and where the gripper is, but not where the puck should end up.*

![Model picking up puck](https://hackmd.io/_uploads/B11gT5W3Zx.png)
*Model prediction is affected when the puck is occluded. Left: input image for the model. Center: effect of masking each image patch on the prediction of action tokens. Right: interpolated heatmap. Blue - no change in action tokens; red - highest change in the action tokens, normalised over the whole image.*

![Model about to pick up the puck](https://hackmd.io/_uploads/HJA_2c-n-x.png)
*Model prediction is affected when either the puck or the arm/gripper is occluded while the puck is being picked up.*

For interventions on text, we rerun the full simulator rollout after masking the most relevant instruction spans, such as the target row / column phrase. Our findings are similar to those of LIBERO-Plus: the model's behaviour seems almost independent of the textual information. **The dominant failure mode** in BOMB is **the model's inability to ground the final destination position in the image based on the text prompt**.
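
The effect scores defined above reduce to differences of summed log-probabilities, which the following sketch illustrates. It assumes the per-token log-probabilities of the clean action tokens have already been obtained from forced (teacher-forced) forward passes; the numbers and patch coordinates are made up, and `effect_score` and `patch_effect_map` are hypothetical helper names rather than functions from our code.

```python
def forced_logprob(token_logprobs):
    """Sum of log p(y_k | ...) over the K clean action tokens."""
    return sum(token_logprobs)


def effect_score(clean_logprobs, intervened_logprobs):
    """E = L_clean - L_intervened; a large positive value means the masked input mattered."""
    return forced_logprob(clean_logprobs) - forced_logprob(intervened_logprobs)


def patch_effect_map(clean_logprobs, logprobs_by_patch):
    """Compute E_patch(j) for every masked patch j."""
    return {
        patch: effect_score(clean_logprobs, logprobs)
        for patch, logprobs in logprobs_by_patch.items()
    }


# Made-up log-probabilities for K = 4 action tokens.
clean = [-0.1, -0.2, -0.1, -0.1]
by_patch = {
    (7, 5): [-2.0, -1.5, -1.0, -0.5],  # hypothetical patch covering the puck
    (0, 0): [-0.1, -0.2, -0.1, -0.2],  # hypothetical background patch
}

scores = patch_effect_map(clean, by_patch)
most_important_patch = max(scores, key=scores.get)
```

The text effect score uses the same `effect_score` call, with the intervened log-probabilities taken from a forward pass in which the attention-mask entries of one instruction span are set to zero.
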

![online_task_report_sim_demo_006_text copy 2](https://hackmd.io/_uploads/HJ_Qtob2Ze.png)
*Top down perspective on the movement (x, y) of the arm gripper during the whole simulator task.*

<div style="display: flex; flex-direction: column; align-items: center;">
  <i>Legend</i>
  <img src="https://hackmd.io/_uploads/S1djFo-hbx.png" alt="online_task_report_sim_demo_006_text" width="350">
</div>

<table style="border-collapse: collapse; margin: 0 auto;">
  <tr>
    <td style="width: 256px; text-align: center; vertical-align: top; padding: 8px;">
      <iframe src="https://player.vimeo.com/video/1180605657?badge=0&autopause=0&autoplay=1&loop=1&muted=1&player_id=0&app_id=58479" width="256" height="256" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="online_task_report_baseline_sim_demo_006_text" style="display: block; margin: 0 auto;"></iframe>
      <div style="margin-top: 6px;"><i>Unmasked run</i></div>
    </td>
    <td style="width: 256px; text-align: center; vertical-align: top; padding: 8px;">
      <iframe src="https://player.vimeo.com/video/1180606529?badge=0&autopause=0&autoplay=1&loop=1&muted=1&player_id=0&app_id=58479" width="256" height="256" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="online_task_report_baseline_sim_demo_006_text_masked_attempt_01" style="display: block; margin: 0 auto;"></iframe>
      <div style="margin-top: 6px;"><i>Mask 1: "row 0, column 2"</i></div>
    </td>
    <td style="width: 256px; text-align: center; vertical-align: top; padding: 8px;">
      <iframe src="https://player.vimeo.com/video/1180606722?badge=0&autopause=0&autoplay=1&loop=1&muted=1&player_id=0&app_id=58479" width="256" height="256" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="online_task_report_baseline_sim_demo_006_text_masked_attempt_02" style="display: block; margin: 0 auto;"></iframe>
      <div style="margin-top: 6px;"><i>Mask 2: "black"</i></div>
    </td>
    <td style="width: 256px; text-align: center; vertical-align: top; padding: 8px;">
      <iframe src="https://player.vimeo.com/video/1180606887?badge=0&autopause=0&autoplay=1&loop=1&muted=1&player_id=0&app_id=58479" width="256" height="256" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="online_task_report_baseline_sim_demo_006_text_masked_attempt_03" style="display: block; margin: 0 auto;"></iframe>
      <div style="margin-top: 6px;"><i>Mask 3: "row 0"</i></div>
    </td>
  </tr>
  <tr>
    <td style="width: 256px; text-align: center; vertical-align: top; padding: 8px;">
      <iframe src="https://player.vimeo.com/video/1180606995?badge=0&autopause=0&autoplay=1&loop=1&muted=1&player_id=0&app_id=58479" width="256" height="256" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="online_task_report_baseline_sim_demo_006_text_masked_attempt_04" style="display: block; margin: 0 auto;"></iframe>
      <div style="margin-top: 6px;"><i>Mask 4: "the board"</i></div>
    </td>
    <td style="width: 256px; text-align: center; vertical-align: top; padding: 8px;">
      <iframe src="https://player.vimeo.com/video/1180607140?badge=0&autopause=0&autoplay=1&loop=1&muted=1&player_id=0&app_id=58479" width="256" height="256" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="online_task_report_baseline_sim_demo_006_text_masked_attempt_05" style="display: block; margin: 0 auto;"></iframe>
      <div style="margin-top: 6px;"><i>Mask 5: "2"</i></div>
    </td>
    <td style="width: 256px; text-align: center; vertical-align: top; padding: 8px;">
      <iframe src="https://player.vimeo.com/video/1180607262?badge=0&autopause=0&autoplay=1&loop=1&muted=1&player_id=0&app_id=58479" width="256" height="256" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="online_task_report_baseline_sim_demo_006_text_masked_attempt_06" style="display: block; margin: 0 auto;"></iframe>
      <div style="margin-top: 6px;"><i>Mask 6: "column 2"</i></div>
    </td>
    <td style="width: 256px; text-align: center; vertical-align: top; padding: 8px;">
      <iframe src="https://player.vimeo.com/video/1180607401?badge=0&autopause=0&autoplay=1&loop=1&muted=1&player_id=0&app_id=58479" width="256" height="256" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="online_task_report_baseline_sim_demo_006_text_masked_attempt_07" style="display: block; margin: 0 auto;"></iframe>
      <div style="margin-top: 6px;"><i>Mask 7: "column 2"</i></div>
    </td>
  </tr>
</table>

### Activation maps

We complement the input interventions with localization of the effects of interventions within the model layers. Starting from the strongest intervention per step, we corrupt the input and then patch clean internal activations back into the model to measure which decoder layers and attention heads restore the clean action-token probability. This lets us ask not only which input mattered, but also where in the network that information is being recovered.

More formally, let $\mathcal{L}_{\text{clean}}$ be the clean forced sequence log-probability, $\mathcal{L}_{\text{corr}}$ the score after the strongest patch or text corruption, and $\mathcal{L}_{\text{patched}}^{(u)}$ the score obtained after restoring clean activations at unit $u$ (either a full decoder layer or a single attention head) during the corrupted forward pass.
We report the normalized restoration ratio

$$R(u)=\frac{\mathcal{L}_{\text{patched}}^{(u)} - \mathcal{L}_{\text{corr}}}{\mathcal{L}_{\text{clean}} - \mathcal{L}_{\text{corr}}}.$$

In the figures, $R(u)$ is color-scaled between minimal (blue) and maximal (red) restoration, separately for decoder layers and for attention heads.

<table style="margin: 0 auto; border-collapse: collapse;">
  <tr>
    <td style="width: 420px; text-align: center; vertical-align: top; padding: 8px;">
      <img src="https://hackmd.io/_uploads/Bk6B8ab3Wl.png" alt="image 1" style="width: 100%; height: auto;">
      <div style="margin-top: 6px;"><i>Masking of patch (5, 5), the gripper's left pincher claw</i></div>
    </td>
    <td style="width: 420px; text-align: center; vertical-align: top; padding: 8px;">
      <img src="https://hackmd.io/_uploads/SkjyUabh-e.png" alt="image 2" style="width: 100%; height: auto;">
      <div style="margin-top: 6px;"><i>Masking of patch (7, 5), containing the puck</i></div>
    </td>
  </tr>
</table>

When the corrupted input removes the gripper but the model is passed the correct layer activations, the model gets confused in the initial layers of the VLM backbone. When the corrupted input removes puck-relevant evidence, only a limited subset of layers and heads strongly restores the clean action prediction. This suggests that OpenVLA relies on object-centric cues such as puck location and gripper-object interaction, but also that it over-relies on the puck patch. This over-reliance has more to do with the design of our task than with inherent properties of VLAs.

## Conclusions

BOMB exposes a limitation that is easy to miss on existing VLA benchmarks: current open-source VLAs can look competent on coarse semantic manipulation while still failing at fine-grained spatial grounding. Even in the simplest version of our Go task, the models struggle to reliably map a language-specified board coordinate to an exact placement location.
OpenVLA reaches only modest success without perturbations, and the evaluated models collapse under perturbations, suggesting that their learned policies remain highly brittle outside the narrow visual distribution seen during training.

At the same time, BOMB helps separate *what* the models can do from *why* they fail. The intervention results indicate that OpenVLA tracks the stone and the gripper well enough to support local object interaction, but it does not robustly use the text-conditioned destination signal. The causal-localization analysis reinforces this picture: only a limited subset of internal components restores the clean action score after corrupting object evidence, pointing to a representation that is object-centric but not yet strongly grounded in board-level spatial structure.

Finally, the project highlights a practical issue for the field: reproducible VLA evaluation is still expensive and fragmented. Building BOMB required custom dataset conversion, simulator wrappers, perturbation controls, and explainability tooling. That engineering overhead is itself part of the result. Beyond the benchmark scores, we hope BOMB contributes a reusable evaluation and analysis framework for testing whether future VLAs truly generalize to precise, long-horizon manipulation rather than overfitting to familiar benchmark layouts.

## Further work

There are several clear next steps. First, BOMB should be scaled beyond a single stone-placement task into multi-move Go sequences and other board-manipulation settings. That would let us test whether the current failure is purely about precise placement or whether it compounds further in genuinely long-horizon planning.

Second, the perturbation study can be made more granular. Our current perturbation suite already shows that success collapses under distribution shift, but a stronger ablation would isolate which changes matter most.
An open question is what the activation maps would show when applied to the text tokens as well: would they confirm that the model is fully indifferent to them?

Third, the explainability results suggest concrete modeling directions. Methods that inject stronger spatial priors or more explicit goal representations may help, such as 3D-aware representations, coordinate-conditioned auxiliary objectives, contrastive alignment between language and board locations, or explicit state abstractions over board intersections. In other words, the problem may not be grasping a stone, but grounding a symbolic target into a metric destination.

Finally, the benchmark should be broadened across models and environments. Completing the comparison with the remaining VLA baselines, validating across both MuJoCo and RoboTwin 2.0, and eventually testing sim-to-real transfer would make it possible to distinguish architecture effects from simulator artifacts. A shared evaluation harness for data generation, fine-tuning, perturbation analysis, and intervention-based explainability would substantially lower the barrier to entry for future VLA research.

## References

#### VLA Models

1. Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., & Finn, C. (2024). *OpenVLA: An Open-Source Vision-Language-Action Model.* arXiv:2406.09246. CoRL 2024.
2. Kim, M.J., Finn, C., & Liang, P. (2025). *Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success* (OpenVLA-OFT). arXiv:2502.19645.
3. Qu, D., Song, H., Chen, Q., Yao, Y., Ye, X., Ding, Y., Wang, Z., Gu, J., Zhao, B., Wang, D., & Li, X. (2025). *SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model.* arXiv:2501.15830.
4. Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. (2024). *π₀: A Vision-Language-Action Flow Model for General Robot Control.* arXiv:2410.24164. Physical Intelligence.
5. Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y., Xiang, C., Rong, Y., Zhao, H., Liu, H., Su, Z., Ma, L., Su, H., & Zhu, J. (2025). *Motus: A Unified Latent Action World Model.* arXiv:2512.13030. Tsinghua University.

#### Vision-Language Models

6. OpenAI. (2023). *GPT-4V Technical Report.*
7. Liu, H., Li, C., Wu, Q., & Lee, Y.J. (2023). *Visual Instruction Tuning* (LLaVA). NeurIPS 2023.
8. Google DeepMind. (2023). *Gemini: A Family of Highly Capable Multimodal Models.*

#### Backbone Components

9. Touvron, H., Lavril, T., Izacard, G., et al. (2023). *LLaMA 2: Open Foundation and Fine-Tuned Chat Models.* Meta AI.
10. Oquab, M., Darcet, T., Moutakanni, T., et al. (2024). *DINOv2: Learning Robust Visual Features without Supervision.* TMLR.
11. Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). *Sigmoid Loss for Language Image Pre-Training* (SigLIP). ICCV 2023.
12. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). *LoRA: Low-Rank Adaptation of Large Language Models.* ICLR 2022.

#### Datasets

13. Open X-Embodiment Collaboration. (2024). *Open X-Embodiment: Robotic Learning Datasets and RT-X Models.* ICRA 2024.

#### Benchmarks

14. Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., & Stone, P. (2023). *LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning.* NeurIPS 2023.
15. Zhou, X., Xu, Y., Tie, G., Chen, Y., Zhang, G., Chu, D., Zhou, P., & Sun, L. (2025). *LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization.* arXiv:2510.03827.
16. Fei, Z., et al. (2025). *LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models.* arXiv:2510.13626.
17. Li, X., Hsu, K., Gu, J., Pertsch, K., Mees, O., Walke, H.R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., Levine, S., Wu, J., Finn, C., Su, H., Vuong, Q., & Xiao, T. (2024). *SimplerEnv: Simulated Manipulation Policy Evaluation Environments for Real Robot Setups.* CoRL 2024.
18. Mees, O., Hermann, L., Rosete-Beas, E., & Burgard, W. (2022). *CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks.* IEEE RA-L.
19. Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., & Zhu, Y. (2024). *RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots.* RSS 2024.

#### Explainability inspiration

20. Wang, Q., Hu, J., & Jiang, M. (2025). *V-SEAM: Visual semantic editing and attention modulating for causal interpretability of vision-language models.* arXiv.

#### Simulators

21. Todorov, E., Erez, T., & Tassa, Y. (2012). *MuJoCo: A Physics Engine for Model-Based Control.* IROS 2012.
22. Chen, Y., et al. (2025). *RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation.* arXiv:2506.18088.

#### Tools

23. Google DeepMind. *Gemini* was used for text refinement, brainstorming, and image generation during the preparation of this blog post.
