MATHEMETRIC (a.k.a. MathGlance)

Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs

*Project Lead, Corresponding Author
Email: shan.zhang@adelaide.edu.au; yanpeng_sun@nus.edu.sg; anton.vandenhengel@adelaide.edu.au

Although both natural images and symbolic diagrams can be represented as grids of pixels, they constitute very different forms of information. Natural images record samples of real-world light intensities, whereas diagrams are human-constructed and convey geometric concepts through structured symbols and their interrelationships. Figure (a) illustrates that diagrams pose unique challenges for current Multimodal Large Language Models (MLLMs), particularly on fine-grained grounding tasks. Figure (b) shows a positive correlation between low-level perception and high-level reasoning performance, indicating that accurate diagram perception yields substantial improvements in mathematical reasoning.

Abstract

Diagrams represent a form of visual language that encodes abstract concepts and relationships through structured symbols and their spatial arrangements. Unlike natural images, they are inherently symbolic, and entirely artificial. They thus pose unique challenges for Multimodal Large Language Models (MLLMs) distinct from natural image processing. Recent studies have shown that MLLMs often exhibit flawed reasoning and hallucinations when handling diagram inputs. We investigate here whether these limitations stem from shortcomings in the models' ability to interpret diagrams themselves. To this end, we develop a diagnostic test suite that isolates perception from reasoning. Our systematic evaluation reveals that MLLMs perform poorly on basic perceptual tasks, e.g., shape classification, object counting, relationship identification, and object grounding, with near-zero accuracy on fine-grained grounding. Further analysis shows that weak diagram perception leads to "blind faith in text", where models rely on textual shortcuts rather than visual understanding (that is, they are Math Blind). We hypothesize that enabling models to capture the inherent structural properties of diagrams, represented as graphs of primitives and their interrelationships, is essential for improving diagram understanding. Experiments with 7B and 32B MLLMs validate this assumption, with models trained on such representations achieving a +79% gain on the grounding task. Crucially, these gains transfer to reasoning, achieving 3–4% cross-suite improvements on three public benchmarks even without additional chain-of-thought reasoning data. Our findings demonstrate that low-level perception supports faithful high-level reasoning in mathematical MLLMs. We provide both methodological frameworks and empirical evidence to guide future research in this direction.
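The structural hypothesis above can be pictured as a graph whose nodes are geometric primitives and whose edges are their relationships. The sketch below is a minimal, hypothetical illustration of such a representation; the field names and serialization are assumptions for exposition, not the training format used in the paper.

import json

# A toy diagram-as-graph representation: nodes are geometric primitives,
# edges are relationships between them. Field names are illustrative only.
diagram_graph = {
    "nodes": [
        {"id": "O",  "type": "circle",  "center": [0.5, 0.5], "radius": 0.3},
        {"id": "AB", "type": "segment", "endpoints": [[0.2, 0.5], [0.8, 0.5]]},
    ],
    "edges": [
        {"source": "AB", "target": "O", "relation": "passes_through_center"},
    ],
}

# Serialized (e.g., as JSON or text), such a structure can accompany the
# rendered image during training to expose primitives and relations explicitly.
print(json.dumps(diagram_graph, indent=2))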

Leaderboard on MATHEMETRIC

Accuracy scores on the Plane Geometry (PG), Solid Geometry (SG), and Graphs (G) subsets of MATHEMETRIC.

# Model Source Avg. PG_ALL PG_cls PG_cnt PG_grd PG_rlat SG_ALL SG_cls SG_cnt SG_grd SG_rlat G_ALL G_cls G_cnt G_grd G_rlat
1 Qwen2.5-VL+-32B (ours) 🥇 Link 74.2 77.9 70.7 79.6 84.0 79.5 73.8 98.8 86.4 15.0 85.0 71.1 98.6 98.2 2.7 99.0
2 Qwen2.5-VL+-7B (ours) 🥈 Link 72.9 78.5 70.7 79.2 82.6 85.0 71.9 97.9 86.2 12.9 70.0 68.2 94.2 96.3 4.9 89.4
3 SVE-Math-DeepSeek+-7B (ours) 🥉 Link 68.4 84.6 75.8 88.4 82.9 97.5 54.1 85.3 65.8 20.3 45.0 60.7 85.1 78.4 1.6 75.7
4 InternVL2.5-38B Link 63.1 44.0 59.9 52.0 2.5 66.0 78.8 98.8 92.8 38.1 72.5 66.5 98.6 96.3 3.2 69.7
5 Qwen2.5-VL-32B Link 62.2 43.3 56.9 54.8 0.0 67.0 72.5 98.8 89.7 1.6 87.5 68.8 91.3 100.0 1.6 97.0
6 Qwen2-VL-72B Link 59.9 42.4 51.2 50.8 17.4 52.0 71.2 97.7 84.5 6.4 77.5 66.1 76.8 98.2 16.1 84.9
7 Qwen2.5-VL-7B Link 59.2 44.0 56.2 51.3 18.5 52.0 68.0 98.8 88.7 0.0 65.0 65.7 89.9 100.0 3.2 78.8
8 InternLM-XComposer2-7B Link 55.6 35.8 49.4 48.8 0.0 47.0 62.9 90.7 86.6 0.0 53.8 54.6 60.9 94.4 0.0 78.8
9 GPT-4o Link 53.3 42.8 58.4 53.2 1.1 62.5 60.7 72.1 84.5 1.6 66.3 56.4 92.8 72.2 1.6 57.6
10 DeepSeek-VL2-Small (16B) Link 51.5 37.6 47.6 43.6 12.5 48.5 63.8 98.8 70.1 11.1 60.0 53.2 76.8 53.7 11.3 81.8
11 Qwen2-VL-7B Link 51.4 37.9 47.6 41.2 12.8 53.0 64.1 93.0 78.4 14.3 55.0 52.3 84.1 88.9 3.2 18.2
12 InternVL2.5-8B Link 50.7 35.0 48.8 36.0 0.0 60.0 65.6 98.8 72.2 4.8 70.0 51.4 68.1 77.8 0.0 69.7
13 mPLUG-owl3-7B Link 50.0 36.4 46.7 41.6 3.9 58.5 65.3 95.4 83.5 0.0 62.5 48.2 59.4 77.8 0.0 66.7
14 InternVL2-8B Link 48.4 31.9 44.3 38.0 0.0 48.5 62.9 98.8 62.9 4.8 70.0 50.5 68.1 75.9 0.0 66.7
15 SVE-Math-DeepSeek-7B Link 46.6 35.4 52.4 36.0 3.56 51.0 49.4 77.9 62.9 0.0 41.3 55.1 81.2 75.9 0.0 69.7
16 MultiMath-7B Link 41.8 31.2 44.0 30.4 1.07 53.0 45.7 81.4 53.6 0.0 33.8 48.6 79.7 57.4 0.0 33.8
17 Math-LLaVA-13B Link 40.0 27.9 34.4 32.4 0.0 50.5 44.8 81.4 55.7 0.0 27.5 47.3 78.3 59.3 0.0 51.5
18 GPT-o1 Link 36.5 15.8 33.2 11.6 0.0 14.0 41.4 75.6 52.6 0.0 23.8 52.3 82.6 81.5 0.0 39.4
19 LLaVA-v1.5-13B Link 35.4 32.8 29.3 40.4 23.5 42.0 35.9 60.5 38.1 0.0 35.0 37.6 63.8 42.6 0.0 45.5
20 LLaVA-v1.5-7B Link 33.3 29.2 29.0 39.6 14.2 37.5 31.6 43.0 42.3 0.0 31.3 39.0 76.8 35.2 0.0 39.4
21 DeepSeek-VL2-Tiny Link 32.6 29.5 45.2 34.4 4.6 32.0 39.0 76.7 32.0 0.0 37.5 29.4 39.1 57.4 0.0 18.2
22 G-LLaVA-7B Link 30.3 25.6 27.8 41.2 0.4 38.0 31.3 45.4 38.1 0.0 32.5 33.9 58.0 37.0 0.0 42.4

MATHEMETRIC

Overview

The MATHEMETRIC benchmark is a novel evaluation framework designed to assess the mathematical perception abilities of Multimodal Large Language Models (MLLMs). Unlike existing benchmarks, which often conflate perception with high-level reasoning, MATHEMETRIC isolates perceptual skills by posing questions about mathematical figures that carry minimal reasoning load. It provides both quantitative and qualitative assessments across different granularity levels. The benchmark covers a diverse range of mathematical contexts, including Plane Geometry (66%), Solid Geometry (20%), and Graphical data representations (14%) such as line plots, bar charts, and pie charts. It comprises 1,609 questions over 1,198 unique images, formulated mainly as multiple-choice or true/false questions to streamline evaluation. MATHEMETRIC features four task categories:

1) Shape Classification: identifying object classes from visual attributes (e.g., vertices, material, color, size) across 16 plane-geometry categories, 3 CLEVR-defined solid objects, and 5 graph types.
2) Object Counting: counting either the total number of objects or the instances of a specific geometric shape in an image.
3) Relationship Identification: recognizing spatial and mathematical relationships between geometric primitives, covering 4 spatial and over 10 mathematical relationships.
4) Object Grounding: fine-grained localization, predicting object coordinates (x1, y1, x2, y2) from a textual description.

Together, these tasks challenge MLLMs' mathematical perception while minimizing high-level reasoning demands, offering a comprehensive and fine-grained evaluation of diagram understanding.
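For concreteness, below is a hedged sketch of what one item from each task category might look like; the field names, option formats, and coordinate convention are illustrative assumptions, not the released data schema.

# Illustrative items for the four MATHEMETRIC task categories.
# Field names and formats are assumptions for exposition only.
examples = [
    {   # 1) shape classification
        "category": "cls",
        "question": "Which shape is outlined in red? (A) circle (B) square (C) triangle (D) pentagon",
        "answer": "C",
    },
    {   # 2) object counting
        "category": "cnt",
        "question": "How many triangles appear in the figure? (A) 1 (B) 2 (C) 3 (D) 4",
        "answer": "B",
    },
    {   # 3) relationship identification
        "category": "rlat",
        "question": "Line AB is perpendicular to line CD. True or False?",
        "answer": "True",
    },
    {   # 4) object grounding: predict a bounding box (x1, y1, x2, y2)
        "category": "grd",
        "question": "Provide the bounding box of circle O.",
        "answer": [120, 45, 260, 185],
    },
]

for item in examples:
    print(item["category"], "->", item["answer"])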

Key statistics and subject-task distribution of MATHEMETRIC.

The synthetic construction process for plane geometry. We synthesize geometric figures by randomly sampling elements from the geometric shape pool and relationship pool, ensuring consistency through a verifier that enforces logical constraints based on manually designed rules, fundamental mathematical principles, and prerequisite points. All visual elements are structured and saved in JSON format. Images are rendered using the Matplotlib package, and corresponding Q&A pairs are generated using a template-based pipeline.
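A minimal Python sketch of this construction loop is given below. The shape and relationship pools, the verifier rule, the layout, and the Q&A template are toy placeholders standing in for the much richer pools, manually designed rules, and template pipeline described above.

import json
import random

import matplotlib
matplotlib.use("Agg")                      # headless rendering
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, RegularPolygon

# Hypothetical pools; the real pipeline samples from 16 plane-geometry
# categories and 10+ mathematical relationships.
SHAPE_POOL = ["circle", "triangle", "square"]
RELATION_POOL = ["tangent", "parallel", "inscribed"]

def verify(shapes, relation):
    """Toy verifier: reject logically inconsistent samples."""
    if relation == "parallel" and "circle" in shapes:
        return False                       # parallelism needs straight edges
    return True

def draw(ax, shape, center):
    """Draw a single primitive; the placement is a placeholder layout and
    does not geometrically realize the sampled relationship."""
    if shape == "circle":
        ax.add_patch(Circle(center, 0.15, fill=False))
    else:
        n = 3 if shape == "triangle" else 4
        ax.add_patch(RegularPolygon(center, n, radius=0.15, fill=False))

def synthesize(seed=0):
    rng = random.Random(seed)
    while True:                            # resample until the verifier accepts
        shapes = rng.sample(SHAPE_POOL, k=2)
        relation = rng.choice(RELATION_POOL)
        if verify(shapes, relation):
            break

    scene = {"shapes": shapes, "relation": relation}
    with open("scene.json", "w") as f:     # structured scene saved as JSON
        json.dump(scene, f, indent=2)

    fig, ax = plt.subplots(figsize=(3, 3)) # render with Matplotlib
    for shape, cx in zip(shapes, (0.3, 0.7)):
        draw(ax, shape, (cx, 0.5))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis("off")
    fig.savefig("scene.png", dpi=150)
    plt.close(fig)

    qa = {                                 # template-based Q&A pair
        "question": "How many shapes are in the figure? (A) 1 (B) 2 (C) 3 (D) 4",
        "answer": "B",
        "category": "counting",
    }
    return scene, qa

if __name__ == "__main__":
    print(synthesize())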

Experiment Results

Main Results of MATHEMETRIC


Performance comparison of different MLLMs on MATHEMETRIC across Plane Geometry, Solid Geometry, and Graphs. cls, cnt, grd, and rlat denote the question categories: shape classification, object counting, object grounding, and relationship identification, respectively. all indicates the overall accuracy within a subject, calculated as the ratio of correctly answered questions to the total number of questions in that subject, while Avg. denotes the average of the all scores across all subjects.
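To make the scoring concrete, the short sketch below computes the per-category cells (e.g., PG_grd), the per-subject all scores, and the cross-subject Avg. from a toy list of per-question records; the record fields are assumptions, not the benchmark's released format.

from collections import defaultdict

# Hypothetical per-question records: subject in {"PG", "SG", "G"},
# category in {"cls", "cnt", "grd", "rlat"}, correct is a boolean.
results = [
    {"subject": "PG", "category": "cls",  "correct": True},
    {"subject": "PG", "category": "grd",  "correct": False},
    {"subject": "SG", "category": "cnt",  "correct": True},
    {"subject": "G",  "category": "rlat", "correct": True},
]

def accuracy(records):
    return 100.0 * sum(r["correct"] for r in records) / len(records)

by_subject = defaultdict(list)             # e.g. PG_ALL, SG_ALL, G_ALL
by_cell = defaultdict(list)                # e.g. PG_grd, SG_cls, ...
for r in results:
    by_subject[r["subject"]].append(r)
    by_cell[(r["subject"], r["category"])].append(r)

all_scores = {s: accuracy(rs) for s, rs in by_subject.items()}
cell_scores = {f"{s}_{c}": accuracy(rs) for (s, c), rs in by_cell.items()}
avg = sum(all_scores.values()) / len(all_scores)   # "Avg." column

print(cell_scores, all_scores, {"Avg.": round(avg, 1)})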

More Results


Model Responses

BibTeX

@article{sun2025mathglance,
  author    = {Yanpeng Sun and Shan Zhang and Wei Tang and Aotian Chen and Piotr Koniusz and Kai Zou and Yuan Xue and Anton van den Hengel},
  title     = {MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams},
  journal   = {arXiv preprint arXiv:2503.20745},
  year      = {2025}
}
@inproceedings{zhang2025primitive,
  author    = {Shan Zhang and Aotian Chen and Yanpeng Sun and Jindong Gu and Yi-Yu Zheng and Piotr Koniusz and Kai Zou and Anton van den Hengel and Yuan Xue},
  title     = {Primitive Vision: Improving Diagram Understanding in MLLMs},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025}
}