Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, even though many tasks benefit from alternative topologies such as trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. We also propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM_S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution (OOD) benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared with Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and outperforms it consistently across all OOD benchmarks. Data and code will be made available.
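To make the notion of reasoning topologies concrete, the sketch below models a reasoning trace as a small step graph in which a linear chain-of-thought and a branching tree are both special cases. The `ReasoningStep` class, the `topology` classifier, and the example traces are illustrative assumptions for this page, not the paper's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    """One intermediate step in a reasoning trace (hypothetical format)."""
    text: str
    children: list["ReasoningStep"] = field(default_factory=list)

def topology(root: ReasoningStep) -> str:
    """Classify a trace: 'chain' if every step has at most one child,
    otherwise 'tree'. (A general graph would additionally allow shared
    children; omitted here for brevity.)"""
    stack, branching = [root], False
    while stack:
        node = stack.pop()
        if len(node.children) > 1:
            branching = True
        stack.extend(node.children)
    return "tree" if branching else "chain"

# A linear chain-of-thought trace ...
chain = ReasoningStep("read the figure",
                      [ReasoningStep("extract the equation",
                                     [ReasoningStep("solve for x")])])
# ... versus a tree that explores two candidate sub-solutions in parallel.
tree = ReasoningStep("read the figure", [
    ReasoningStep("try an algebraic route"),
    ReasoningStep("try a geometric route"),
])

assert topology(chain) == "chain"
assert topology(tree) == "tree"
```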
In this work, we propose STELAR-Vision, a training framework that enables topology-aware reasoning in VLMs via synthetically generated responses. STELAR-Vision enhances vision-language reasoning by leveraging diverse topological structures, achieving a 9.7% accuracy improvement over its base model and outperforming the larger Qwen2VL-72B-Instruct by 7.3%. The Frugal Learning variant reduces output length by 18.1% while maintaining comparable accuracy, surpassing Chain-Only baselines in both efficiency and task effectiveness. STELAR-Vision also generalizes well across five diverse OOD datasets and achieves 4.3% higher overall accuracy on in-distribution tasks, consistently outperforming Chain-Only training.
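Frugal Learning is described here only at the level of its effect (shorter outputs, comparable accuracy). Below is a minimal sketch of one way a length-aware reward could be composed for RL post-training; the `lambda_len` weight and the normalization are assumptions for illustration and may differ from the paper's actual objective.

```python
def frugal_reward(correct: bool, n_tokens: int,
                  max_tokens: int = 1024, lambda_len: float = 0.1) -> float:
    """Hypothetical reward: task accuracy minus a normalized length penalty.

    `lambda_len` trades accuracy against brevity; this only illustrates the
    accuracy/efficiency trade-off described above, not STELAR-Vision's exact
    formulation.
    """
    accuracy = 1.0 if correct else 0.0
    length_penalty = lambda_len * min(n_tokens, max_tokens) / max_tokens
    return accuracy - length_penalty

# A correct but verbose answer scores below a correct, concise one.
assert frugal_reward(True, 900) < frugal_reward(True, 200)
```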
@misc{li2025stelarvisionselftopologyawareefficientlearning,
      title={STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision},
      author={Chen Li and Han Zhang and Zhantao Yang and Fangyi Chen and Zihan Wang and Anudeepsekhar Bolimera and Marios Savvides},
      year={2025},
      eprint={2508.08688},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.08688},
}