STELAR-VISION

Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision

Carnegie Mellon University

Introduction

Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM_S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared to Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and consistently outperforms across all OOD benchmarks. The data and code will be available.

STELAR-Vision framework overview

An overview of the STELAR-Vision framework.

Motivation

Limitations of Chain-of-Thought reasoning

Limitations of the Popular Chain-of-Thought Reasoning Structures. The widely adopted Chain-of-Thought (CoT) reasoning paradigm (in green) often results in unnecessarily verbose reasoning processes, as demonstrated in the first example. Under CoT reasoning, the model redundantly counts each cube, whereas with Graph topology (in blue), it quickly identifies the key point of the question. In the bottom-row example, CoT reasoning begins with a detailed examination of each subplot but ultimately arrives at an incorrect answer. In contrast, Tree topology (in red) initiates reasoning with a high-level overview before delving into specific features. In both scenarios, CoT-style reasoning proves suboptimal.

Comparison of topology accuracy across subjects

Comparison of topology accuracy across subjects: Accuracy of the Chain, Tree, and Graph reasoning topological structures per subject of the MATH-V dataset. Chain remains the best overall reasoning structure, while Tree and Graph perform better on subjects such as "graph theory" and "statistics".

Distribution of reasoning token length

Distribution of generated reasoning token lengths for the Chain, Tree, and Graph topological structures in the TopoAug dataset. The box within each violin plot marks the median and the 25th and 75th percentiles.

Contributions

  • We propose STELAR-Vision, a training framework explicitly trained for topology-aware reasoning. It leverages diverse reasoning topologies such as chains, trees, and graphs, aligns reasoning paths with question characteristics, and enables adaptive and efficient multimodal inference.
  • We introduce TopoAug, a data generation pipeline that automatically produces diverse topological reasoning traces and annotates the optimal structure per question (see the sketch after this list). We also integrate Frugal Learning into the training framework, reducing output length with minimal accuracy tradeoff.
  • Through post-training with supervised fine-tuning and reinforcement learning, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%. The Frugal Learning variant reduces output length by 18.1% while maintaining comparable accuracy.
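
To make the TopoAug pipeline concrete, below is a minimal sketch of how such a generate-and-annotate loop could be organized, assuming a teacher VLM, an answer grader, and a tokenizer as inputs. The interfaces and the tie-breaking rule (prefer correct answers, then shorter traces) are illustrative assumptions, not the released implementation.

      # Minimal sketch of a TopoAug-style pipeline (illustrative; assumed interfaces).
      from dataclasses import dataclass

      TOPOLOGIES = ("chain", "tree", "graph")

      @dataclass
      class TopoSample:
          question: str
          image_path: str
          topology: str    # structure the trace was generated with
          reasoning: str   # generated reasoning trace
          answer: str
          correct: bool
          num_tokens: int

      def annotate_best_topology(samples):
          # Prefer correct answers; break ties with the shortest trace,
          # mirroring the accuracy-plus-efficiency goal described above.
          return max(samples, key=lambda s: (s.correct, -s.num_tokens)).topology

      def build_topoaug_record(question, image_path, teacher, grader, tokenizer):
          # `teacher.generate`, `grader`, and `tokenizer.encode` are hypothetical
          # interfaces standing in for the actual generation and checking tools.
          samples = []
          for topo in TOPOLOGIES:
              reasoning, answer = teacher.generate(question, image_path, topology=topo)
              samples.append(TopoSample(
                  question=question, image_path=image_path, topology=topo,
                  reasoning=reasoning, answer=answer,
                  correct=grader(question, answer),
                  num_tokens=len(tokenizer.encode(reasoning)),
              ))
          return {"samples": samples, "best_topology": annotate_best_topology(samples)}

Each record then carries all three traces plus a per-question topology label for the post-training stage to supervise on.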

Experiments

Quantitative evaluation results

Quantitative Evaluation. STELAR-Vision achieves strong gains across both in-distribution and out-of-distribution reasoning benchmarks. On ID datasets, it outperforms its base model Qwen2VL-7B-Instruct by 9.7%, and even surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On OOD benchmarks, it exceeds Phi-4-Multimodal-Instruct by up to 36% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%. Compared to Chain-Only training, STELAR-Vision achieves up to 13% higher accuracy, highlighting the power of topological augmentation.



Impact of TopoAug dataset and training methods

Impact of TopoAug Dataset and Training Methods. We present an ablation study on the in-distribution VLM_S2H and MATH-V datasets, comparing our models against counterparts trained exclusively on chain-based reasoning data under each training method. STELAR-Vision consistently outperforms all Chain-Only variants across all ID datasets; specifically, it improves on the strongest Chain-Only variant by 6% (from 25% to 31%) and boosts overall accuracy by 4.3%, highlighting the effectiveness of topological augmentation.



Comparison of accuracy and token length

Comparison of accuracy and generated token length across models: STELAR-Vision improves performance while using fewer generation tokens. Frugal Learning further improves generation efficiency.
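
As a rough illustration of the efficiency objective, Frugal Learning can be read as rewarding correctness while mildly penalizing output length during reinforcement learning. The reward below is an assumed formulation for intuition only; the penalty weight and token budget are hypothetical hyperparameters, not the paper's exact reward.

      # Illustrative length-penalized reward (assumed form, not the paper's exact objective).
      def frugal_reward(is_correct: bool, num_tokens: int,
                        max_tokens: int = 2048, length_weight: float = 0.1) -> float:
          # Full credit for a correct answer, minus a small penalty that grows
          # with the fraction of the token budget the response consumed.
          accuracy_term = 1.0 if is_correct else 0.0
          length_term = length_weight * min(num_tokens, max_tokens) / max_tokens
          return accuracy_term - length_term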



Impact of training on test-time topology selection

Impact of Training on Test-time Topology Selection. Percentage of reasoning topologies autonomously selected by each model on our evaluation datasets, without explicit prompting. ID denotes the in-distribution test split.
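
The selection percentages reported here can be tallied with a simple count over each model's generations. In the snippet below, detect_topology is a placeholder for whatever heuristic maps one response to "chain", "tree", or "graph"; it is not the paper's labeling procedure.

      from collections import Counter

      def topology_selection_rates(responses, detect_topology):
          # Count which topology each generated response falls into and
          # convert the counts to percentages.
          counts = Counter(detect_topology(r) for r in responses)
          total = sum(counts.values()) or 1
          return {topo: 100.0 * counts.get(topo, 0) / total
                  for topo in ("chain", "tree", "graph")}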



Conclusion

In this work, we propose STELAR-Vision, a training framework that enables topology-aware reasoning in VLMs through topologically diverse generated training responses. STELAR-Vision enhances vision-language reasoning by leveraging diverse topological structures, achieving a 9.7% accuracy improvement over its base model and outperforming the larger Qwen2VL-72B-Instruct by 7.3%. The Frugal Learning variant reduces output length by 18.1% while maintaining comparable accuracy, surpassing Chain-Only baselines in both efficiency and task effectiveness. STELAR-Vision demonstrates strong generalization across five diverse OOD datasets and achieves 4.3% higher overall accuracy on in-distribution tasks, consistently outperforming Chain-Only training.

BibTeX


      @misc{li2025stelarvisionselftopologyawareefficientlearning,
        title={STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision}, 
        author={Chen Li and Han Zhang and Zhantao Yang and Fangyi Chen and Zihan Wang and Anudeepsekhar Bolimera and Marios Savvides},
        year={2025},
        eprint={2508.08688},
        archivePrefix={arXiv},
        primaryClass={cs.AI},
        url={https://arxiv.org/abs/2508.08688}, 
      }