Vision–Language–Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists—they often require task-specific fine-tuning, incur high compute costs, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism, Meta Action Reasoner (MAR)—derived from Attentive Neural Processes—to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%. These results show that scalable, low-resource post-training is achievable—paving the way toward general-purpose embodied agents.
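The Meta Action Reasoner is described as an Attentive-Neural-Process-style module that adapts from context demonstrations with minimal architectural change. As a rough illustration only (not the paper's implementation), the sketch below shows what such an adapter could look like in PyTorch: it cross-attends from the current observation embedding to a small set of (observation, action) context pairs and fuses the result back into the action-head input. All module names, dimensions, and the context-encoding scheme are our own assumptions.

```python
import torch
import torch.nn as nn

class MetaActionReasonerSketch(nn.Module):
    """Illustrative ANP-style adapter: cross-attends from the current
    observation embedding to a small set of context (observation, action)
    pairs and fuses the result back into the action-head input."""

    def __init__(self, embed_dim: int, action_dim: int, num_heads: int = 4):
        super().__init__()
        # Encode each context (obs embedding, action) pair into a value vector.
        self.context_encoder = nn.Sequential(
            nn.Linear(embed_dim + action_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Query = target obs embedding, Key = context obs embeddings,
        # Value = encoded context pairs (the attentive part of an ANP).
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, target_embed, context_embeds, context_actions):
        # target_embed:    (B, D)     current observation embedding
        # context_embeds:  (B, K, D)  embeddings of K context demonstrations
        # context_actions: (B, K, A)  their ground-truth actions
        values = self.context_encoder(
            torch.cat([context_embeds, context_actions], dim=-1))
        query = target_embed.unsqueeze(1)                      # (B, 1, D)
        attended, _ = self.cross_attn(query, context_embeds, values)
        attended = attended.squeeze(1)                          # (B, D)
        # Residual-style fusion keeps backbone features intact, so the
        # adapter adds little inference overhead.
        return self.fuse(torch.cat([target_embed, attended], dim=-1))
```

In a design like this, the fused embedding would feed the existing action head, and the context set could be a handful of demonstrations from the target task, which is one way to realize rapid adaptation without architectural changes to the backbone.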
Despite progress in task adaptation, current VLAs are not true generalists: they still rely on costly, task-specific post-training. Fine-tuning per task limits transfer, raises training costs, and often requires many brittle gradient steps before stable behaviors emerge, slowing adaptation and risking poor generalization. For instance, OpenVLA needs 240K training steps across the LIBERO suites, while OpenVLA-OFT requires 150K–500K, with long-horizon suites like LIBERO-Long becoming bottlenecks.
While prior work expands datasets or innovates at pretraining, we instead target post-training. Starting with multi-task co-training (SFT on all four LIBERO suites), we reduce GPU hours and improve success rates. This raises a question: can auxiliary tasks boost VLAs further? Surprisingly, naive inclusion hurts: convergence slows and performance drops. We attribute this to optimization instability from heterogeneous data distributions, where mismatches in feature spaces (e.g., camera views) and action spaces (e.g., degrees of freedom) offset the benefits of co-training.
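For concreteness, the sketch below outlines the naive multi-task co-training baseline described above: a single SFT run over a mixture of task suites rather than one fine-tuning run per suite. The loader interface, batch keys, uniform sampling scheme, and the `compute_loss` helper are hypothetical placeholders, not the released training code.

```python
import random

def endless(loader):
    # Re-create the iterator each pass instead of caching batches in memory.
    while True:
        for batch in loader:
            yield batch

def cotrain_sft(model, suite_loaders, optimizer, num_steps=75_000):
    """Naive multi-task co-training: one SFT run over a mixture of task
    suites (e.g., the four LIBERO suites) instead of one run per suite.
    `suite_loaders` maps suite name -> DataLoader; `compute_loss` is a
    hypothetical helper that runs the forward pass and action loss."""
    streams = {name: endless(loader) for name, loader in suite_loaders.items()}
    names = list(streams)
    model.train()
    for step in range(num_steps):
        # Uniformly pick which suite supplies this batch. Heterogeneous
        # suites (different camera views, action DoFs) share one optimizer
        # state, which is where naive mixing can destabilize optimization.
        batch = next(streams[random.choice(names)])
        loss = model.compute_loss(batch)  # hypothetical: forward + action loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```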
We then enhance MetaVLA with auxiliary tasks by adding GR00T data, which was unseen during OpenVLA pretraining and balances domain relevance to LIBERO with structural diversity.
We presented MetaVLA, a framework that addresses the inefficiencies and brittleness of current VLA post-training pipelines. By introducing Context-Aware Meta Co-Training, MetaVLA integrates auxiliary tasks without destabilizing optimization, achieving superior convergence speed, efficiency, and generalization. MetaVLA is lightweight, plug-and-play, and backbone-agnostic, making it easy to extend beyond supervised fine-tuning to reinforcement learning or hybrid pipelines. Empirical results on LIBERO show consistent gains over both per-task fine-tuning and naive multi-task SFT, while significantly reducing training cost and model count. Looking forward, we envision extending MetaVLA to broader backbones and wider benchmarks, incorporating web-scale multimodal data, and deploying on real robots. We hope this work inspires future research toward efficient, scalable, and truly generalist embodied VLA systems.
@misc{li2025metavlaunifiedmetacotraining,
      title={MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption},
      author={Chen Li and Zhantao Yang and Han Zhang and Fangyi Chen and Chenchen Zhu and Anudeepsekhar Bolimera and Marios Savvides},
      year={2025},
      eprint={2510.05580},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.05580},
}