PokéVLA: Empowering Pocket-Sized
Vision-Language-Action Model with Comprehensive World Knowledge Guidance


Demonstration Video

Abstract

Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and lack high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning.

Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert.
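To make the two-stage recipe concrete, the sketch below shows how such a pipeline might be wired in a PyTorch-style setup: a compact vision-language model is first pre-trained on multimodal samples, then an action expert is attached and fine-tuned on manipulation demonstrations so that the pre-trained representations flow into the action space. This is a minimal sketch, assuming toy dimensions and data; the module names (TinyVLM, ActionExpert), losses, and hyperparameters are illustrative placeholders rather than the released implementation.

```python
# Minimal two-stage training sketch (assumed structure, not the released code).
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Stand-in for the compact vision-language backbone (PokeVLM)."""
    def __init__(self, dim=256):
        super().__init__()
        self.vision = nn.Linear(3 * 32 * 32, dim)   # toy image encoder
        self.text = nn.Embedding(1000, dim)          # toy token embedding
        self.head = nn.Linear(dim, 1000)             # toy answer/next-token head

    def forward(self, image, tokens):
        fused = self.vision(image.flatten(1)) + self.text(tokens).mean(1)
        return self.head(fused), fused               # logits and fused feature

class ActionExpert(nn.Module):
    """Stand-in for the action expert decoding fused features into actions."""
    def __init__(self, dim=256, action_dim=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, action_dim))

    def forward(self, fused):
        return self.mlp(fused)

vlm = TinyVLM()

# Stage 1: pre-train the VLM on multimodal (image, instruction, answer) samples.
opt1 = torch.optim.AdamW(vlm.parameters(), lr=1e-4)
for _ in range(10):                                   # stand-in for the curated corpus
    image = torch.randn(8, 3, 32, 32)
    tokens = torch.randint(0, 1000, (8, 16))
    answer = torch.randint(0, 1000, (8,))
    logits, _ = vlm(image, tokens)
    loss = nn.functional.cross_entropy(logits, answer)
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: attach the action expert and fine-tune on manipulation demonstrations.
expert = ActionExpert()
opt2 = torch.optim.AdamW(list(vlm.parameters()) + list(expert.parameters()), lr=1e-5)
for _ in range(10):
    image = torch.randn(8, 3, 32, 32)
    tokens = torch.randint(0, 1000, (8, 16))
    demo_action = torch.randn(8, 7)                   # demonstrated action
    _, fused = vlm(image, tokens)
    pred = expert(fused)
    loss = nn.functional.mse_loss(pred, demo_action)
    opt2.zero_grad(); loss.backward(); opt2.step()
```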

Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset.

Our approach effectively leverages pre-trained knowledge to learn representations relevant to robot manipulation. In both evaluation settings, PokeVLA demonstrates strong performance, maintaining high success rates even under significant environmental disturbances and showcasing remarkable generalization.

Method

The architecture of our proposed PokeVLA.

Results on LIBERO-Plus Benchmarks

Concurrent works are highlighted in gray. Results with a blue background indicate models trained on unperturbed data and evaluated via direct transfer.

Real-world Performance

P1–P5 denote: (P1) end-effector initial pose perturbation, (P2) object perturbation, (P3) background perturbation, (P4) lighting perturbation, and (P5) unseen language instructions. We also report the success rate under the original (unperturbed) setting for reference; it is excluded from the average.

Qualitative Results

Supervised with goal-aware segmentation as an auxiliary task, our model precisely directs its attention to the manipulation targets, achieving superior performance across diverse tasks.
On real-world tasks, the model also generalizes to complex instructions involving spatial and color references, demonstrating strong semantic understanding and task performance while maintaining accurate target segmentation and goal-oriented attention over long horizons.
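The goal-aware segmentation supervision can be viewed as an auxiliary loss added alongside the action objective. Below is a minimal sketch, assuming a PyTorch setup; the segmentation head, loss weight, and tensor shapes are illustrative assumptions rather than the exact configuration used in PokeVLA.

```python
# Sketch of an action loss combined with a goal-aware segmentation auxiliary loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GoalSegHead(nn.Module):
    """Predicts a per-pixel mask of the manipulation target from visual features."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, feat_map):                      # feat_map: (B, dim, H, W)
        return self.proj(feat_map)                    # (B, 1, H, W) mask logits

seg_head = GoalSegHead()
feat_map = torch.randn(4, 256, 16, 16, requires_grad=True)   # backbone features
gt_mask = (torch.rand(4, 1, 16, 16) > 0.5).float()           # target-object mask
pred_action = torch.randn(4, 7, requires_grad=True)          # action expert output
gt_action = torch.randn(4, 7)                                  # demonstration action

action_loss = F.mse_loss(pred_action, gt_action)
seg_loss = F.binary_cross_entropy_with_logits(seg_head(feat_map), gt_mask)

lambda_seg = 0.1                                    # assumed auxiliary weight
total_loss = action_loss + lambda_seg * seg_loss
total_loss.backward()
```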