Vision–Language–Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists—they often require task-specific fine-tuning, incur high compute costs, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism, Meta Action Reasoner (MAR)—derived from Attentive Neural Processes—to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%. These results show that scalable, low-resource post-training is achievable—paving the way toward general-purpose embodied agents.
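The Meta Action Reasoner is described as an Attentive-Neural-Process-style module that adapts from context demonstrations with minimal architectural change. As a rough illustration only (not the paper's implementation), the sketch below shows what such an adapter could look like in PyTorch: it cross-attends from the current observation embedding to a small set of (observation, action) context pairs and fuses the result back into the action-head input. All module names, dimensions, and the context-encoding scheme are our own assumptions.

```python
import torch
import torch.nn as nn

class MetaActionReasonerSketch(nn.Module):
    """Illustrative ANP-style adapter: cross-attends from the current
    observation embedding to a small set of context (observation, action)
    pairs and fuses the result back into the action-head input."""

    def __init__(self, embed_dim: int, action_dim: int, num_heads: int = 4):
        super().__init__()
        # Encode each context (obs embedding, action) pair into a value vector.
        self.context_encoder = nn.Sequential(
            nn.Linear(embed_dim + action_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Query = target obs embedding, Key = context obs embeddings,
        # Value = encoded context pairs (the attentive part of an ANP).
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, target_embed, context_embeds, context_actions):
        # target_embed:    (B, D)     current observation embedding
        # context_embeds:  (B, K, D)  embeddings of K context demonstrations
        # context_actions: (B, K, A)  their ground-truth actions
        values = self.context_encoder(
            torch.cat([context_embeds, context_actions], dim=-1))
        query = target_embed.unsqueeze(1)                      # (B, 1, D)
        attended, _ = self.cross_attn(query, context_embeds, values)
        attended = attended.squeeze(1)                          # (B, D)
        # Residual-style fusion keeps backbone features intact, so the
        # adapter adds little inference overhead.
        return self.fuse(torch.cat([target_embed, attended], dim=-1))
```

In a design like this, the fused embedding would feed the existing action head, and the context set could be a handful of demonstrations from the target task, which is one way to realize rapid adaptation without architectural changes to the backbone.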
Despite progress in task adaptation, current VLAs are not true generalists: they still rely on costly, task-specific post-training. Fine-tuning per task limits transfer, raises training costs, and often requires many brittle gradient steps before stable behaviors emerge, slowing adaptation and risking poor generalization. For instance, OpenVLA needs 240K training steps across the LIBERO suites, while OpenVLA-OFT requires 150K–500K, with long-horizon suites like LIBERO-Long becoming bottlenecks.
While prior work expands datasets or innovates at pretraining, we instead target post-training. Starting with multi-task co-training (SFT on all four LIBERO suites), we reduce GPU hours and improve success rates. This raises a question: can auxiliary tasks boost VLAs further? Surprisingly, naive inclusion hurts: convergence slows and performance drops. We attribute this to optimization instability from heterogeneous data distributions, where mismatches in feature spaces (e.g., camera views) and action spaces (e.g., degrees of freedom) offset the benefits of co-training.
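For concreteness, the sketch below outlines the naive multi-task co-training baseline described above: a single SFT run over a mixture of task suites rather than one fine-tuning run per suite. The loader interface, batch keys, uniform sampling scheme, and the `compute_loss` helper are hypothetical placeholders, not the released training code.

```python
import random

def endless(loader):
    # Re-create the iterator each pass instead of caching batches in memory.
    while True:
        for batch in loader:
            yield batch

def cotrain_sft(model, suite_loaders, optimizer, num_steps=75_000):
    """Naive multi-task co-training: one SFT run over a mixture of task
    suites (e.g., the four LIBERO suites) instead of one run per suite.
    `suite_loaders` maps suite name -> DataLoader; `compute_loss` is a
    hypothetical helper that runs the forward pass and action loss."""
    streams = {name: endless(loader) for name, loader in suite_loaders.items()}
    names = list(streams)
    model.train()
    for step in range(num_steps):
        # Uniformly pick which suite supplies this batch. Heterogeneous
        # suites (different camera views, action DoFs) share one optimizer
        # state, which is where naive mixing can destabilize optimization.
        batch = next(streams[random.choice(names)])
        loss = model.compute_loss(batch)  # hypothetical: forward + action loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```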
We then enhance MetaVLA with auxiliary tasks by adding GR00T data, which was unseen during OpenVLA pretraining and balances domain relevance to LIBERO with structural diversity.
We presented MetaVLA, a framework that addresses the inefficiencies and brittleness of current VLA post-training pipelines. By introducing Context-Aware Meta Co-Training, MetaVLA integrates auxiliary tasks without destabilizing optimization, achieving superior convergence speed, efficiency, and generalization. MetaVLA is lightweight, plug-and-play, and backbone-agnostic, making it easy to extend beyond supervised fine-tuning to reinforcement learning or hybrid pipelines. Empirical results on LIBERO show consistent gains over both per-task fine-tuning and naive multi-task SFT, while significantly reducing training cost and model count. Looking forward, we envision extending MetaVLA to broader backbones and wider benchmarks, incorporating web-scale multimodal data, and deploying on real robots. We hope this work inspires future research toward efficient, scalable, and truly generalist embodied VLA systems.
@misc{li2025metavlaunifiedmetacotraining,
      title={MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption},
      author={Chen Li and Zhantao Yang and Han Zhang and Fangyi Chen and Chenchen Zhu and Anudeepsekhar Bolimera and Marios Savvides},
      year={2025},
      eprint={2510.05580},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.05580},
}