MATHEMETRIC (a.k.a. MathGlance)

Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs

*Project Lead, Corresponding Author
Email: shan.zhang@adelaide.edu.au; yanpeng_sun@nus.edu.sg; anton.vandenhengel@adelaide.edu.au

Although both natural images and symbolic diagrams can be represented as grids of pixels, they constitute very different forms of information. Natural images record samples of real-world light intensities, whereas diagrams are human-constructed and convey geometric concepts through structured symbols and their interrelationships. Figure (a) illustrates that diagrams pose unique challenges for current Multimodal Large Language Models (MLLMs), particularly on fine-grained grounding tasks. Figure (b) shows a positive correlation between low-level perception and high-level reasoning performance, indicating that accurate diagram perception yields substantial improvements in mathematical reasoning.

Abstract

Diagrams represent a form of visual language that encodes abstract concepts and relationships through structured symbols and their spatial arrangements. Unlike natural images, they are inherently symbolic, and entirely artificial. They thus pose unique challenges for Multimodal Large Language Models (MLLMs) distinct from natural image processing. Recent studies have shown that MLLMs often exhibit flawed reasoning and hallucinations when handling diagram inputs. We investigate here whether these limitations stem from shortcomings in the models' ability to interpret diagrams themselves. To this end, we develop a diagnostic test suite that isolates perception from reasoning. Our systematic evaluation reveals that MLLMs perform poorly on basic perceptual tasks, e.g., shape classification, object counting, relationship identification, and object grounding, with near-zero accuracy on fine-grained grounding. Further analysis shows that weak diagram perception leads to "blind faith in text", where models rely on textual shortcuts rather than visual understanding (that is, they are Math Blind). We hypothesize that enabling models to capture the inherent structural properties of diagrams, represented as graphs of primitives and their interrelationships, is essential for improving diagram understanding. Experiments with 7B and 32B MLLMs validate this assumption, with models trained on such representations achieving a +79% gain on the grounding task. Crucially, these gains transfer to reasoning, achieving 3–4% cross-suite improvements on three public benchmarks even without additional chain-of-thought reasoning data. Our findings demonstrate that low-level perception supports faithful high-level reasoning in mathematical MLLMs. We provide both methodological frameworks and empirical evidence to guide future research in this direction.
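The structural hypothesis above can be pictured as a graph whose nodes are geometric primitives and whose edges are their relationships. The sketch below is a minimal, hypothetical illustration of such a representation; the field names and serialization are assumptions for exposition, not the training format used in the paper.

import json

# A toy diagram-as-graph representation: nodes are geometric primitives,
# edges are relationships between them. Field names are illustrative only.
diagram_graph = {
    "nodes": [
        {"id": "O",  "type": "circle",  "center": [0.5, 0.5], "radius": 0.3},
        {"id": "AB", "type": "segment", "endpoints": [[0.2, 0.5], [0.8, 0.5]]},
    ],
    "edges": [
        {"source": "AB", "target": "O", "relation": "passes_through_center"},
    ],
}

# Serialized (e.g., as JSON or text), such a structure can accompany the
# rendered image during training to expose primitives and relations explicitly.
print(json.dumps(diagram_graph, indent=2))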

Leaderboard on MATHEMETRIC

Accuracy scores on the Plane Geometry (PG), Solid Geometry (SG), and Graphs (G) subsets of MATHEMETRIC.

# Model Source Avg. PG_ALL PG_cls PG_cnt PG_grd PG_rlat SG_ALL SG_cls SG_cnt SG_grd SG_rlat G_ALL G_cls G_cnt G_grd G_rlat
1 Qwen2.5-VL+-32B (ours) 🥇 Link 74.2 77.9 70.7 79.6 84.0 79.5 73.8 98.8 86.4 15.0 85.0 71.1 98.6 98.2 2.7 99.0
2 Qwen2.5-VL+-7B (ours) 🥈 Link 72.9 78.5 70.7 79.2 82.6 85.0 71.9 97.9 86.2 12.9 70.0 68.2 94.2 96.3 4.9 89.4
3 SVE-Math-DeepSeek+-7B (ours) 🥉 Link 68.4 84.6 75.8 88.4 82.9 97.5 54.1 85.3 65.8 20.3 45.0 60.7 85.1 78.4 1.6 75.7
4 InternVL2.5-38B Link 63.1 44.0 59.9 52.0 2.5 66.0 78.8 98.8 92.8 38.1 72.5 66.5 98.6 96.3 3.2 69.7
5 Qwen2.5-VL-32B Link 62.2 43.3 56.9 54.8 0.0 67.0 72.5 98.8 89.7 1.6 87.5 68.8 91.3 100.0 1.6 97.0
6 Qwen2-VL-72B Link 59.9 42.4 51.2 50.8 17.4 52.0 71.2 97.7 84.5 6.4 77.5 66.1 76.8 98.2 16.1 84.9
7 Qwen2.5-VL-7B Link 59.2 44.0 56.2 51.3 18.5 52.0 68.0 98.8 88.7 0.0 65.0 65.7 89.9 100.0 3.2 78.8
8 InternLM-XComposer2-7B Link 55.6 35.8 49.4 48.8 0.0 47.0 62.9 90.7 86.6 0.0 53.8 54.6 60.9 94.4 0.0 78.8
9 GPT-4o Link 53.3 42.8 58.4 53.2 1.1 62.5 60.7 72.1 84.5 1.6 66.3 56.4 92.8 72.2 1.6 57.6
10 DeepSeek-VL2-Small (16B) Link 51.5 37.6 47.6 43.6 12.5 48.5 63.8 98.8 70.1 11.1 60.0 53.2 76.8 53.7 11.3 81.8
11 Qwen2-VL-7B Link 51.4 37.9 47.6 41.2 12.8 53.0 64.1 93.0 78.4 14.3 55.0 52.3 84.1 88.9 3.2 18.2
12 InternVL2.5-8B Link 50.7 35.0 48.8 36.0 0.0 60.0 65.6 98.8 72.2 4.8 70.0 51.4 68.1 77.8 0.0 69.7
13 mPLUG-owl3-7B Link 50.0 36.4 46.7 41.6 3.9 58.5 65.3 95.4 83.5 0.0 62.5 48.2 59.4 77.8 0.0 66.7
14 InternVL2-8B Link 48.4 31.9 44.3 38.0 0.0 48.5 62.9 98.8 62.9 4.8 70.0 50.5 68.1 75.9 0.0 66.7
15 SVE-Math-DeepSeek-7B Link 46.6 35.4 52.4 36.0 3.56 51.0 49.4 77.9 62.9 0.0 41.3 55.1 81.2 75.9 0.0 69.7
16 MultiMath-7B Link 41.8 31.2 44.0 30.4 1.07 53.0 45.7 81.4 53.6 0.0 33.8 48.6 79.7 57.4 0.0 33.8
17 Math-LLaVA-13B Link 40.0 27.9 34.4 32.4 0.0 50.5 44.8 81.4 55.7 0.0 27.5 47.3 78.3 59.3 0.0 51.5
18 GPT-o1 Link 36.5 15.8 33.2 11.6 0.0 14.0 41.4 75.6 52.6 0.0 23.8 52.3 82.6 81.5 0.0 39.4
19 LLaVA-v1.5-13B Link 35.4 32.8 29.3 40.4 23.5 42.0 35.9 60.5 38.1 0.0 35.0 37.6 63.8 42.6 0.0 45.5
20 LLaVA-v1.5-7B Link 33.3 29.2 29.0 39.6 14.2 37.5 31.6 43.0 42.3 0.0 31.3 39.0 76.8 35.2 0.0 39.4
21 DeepSeek-VL2-Tiny Link 32.6 29.5 45.2 34.4 4.6 32.0 39.0 76.7 32.0 0.0 37.5 29.4 39.1 57.4 0.0 18.2
22 G-LLaVA-7B Link 30.3 25.6 27.8 41.2 0.4 38.0 31.3 45.4 38.1 0.0 32.5 33.9 58.0 37.0 0.0 42.4

MATHEMETRIC

Overview

The MATHEMETRIC benchmark is a novel evaluation framework designed to assess the mathematical perception abilities of Multimodal Large Language Models (MLLMs). Unlike existing benchmarks, which often conflate perception with high-level reasoning, MATHEMETRIC isolates perceptual skills by posing questions about mathematical figures that carry minimal reasoning load. It provides both quantitative and qualitative assessments across different granularity levels. The benchmark covers a diverse range of mathematical contexts, including Plane Geometry (66%), Solid Geometry (20%), and Graphical data representations (14%) such as line plots, bar charts, and pie charts. It comprises 1,609 questions over 1,198 unique images, formulated mainly as multiple-choice or true/false questions to streamline evaluation. MATHEMETRIC features four task categories:

1) Shape Classification: identifying object classes from visual attributes (e.g., vertices, material, color, size) across 16 plane-geometry categories, 3 CLEVR-defined solid objects, and 5 graph types.
2) Object Counting: counting either the total number of objects or the instances of a specific geometric shape in an image.
3) Relationship Identification: recognizing spatial and mathematical relationships between geometric primitives, covering 4 spatial and over 10 mathematical relationships.
4) Object Grounding: fine-grained localization, predicting object coordinates (x1, y1, x2, y2) from a textual description.

Together, these tasks challenge MLLMs' mathematical perception while minimizing high-level reasoning demands, offering a comprehensive and fine-grained evaluation of diagram understanding.
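For concreteness, below is a hedged sketch of what one item from each task category might look like; the field names, option formats, and coordinate convention are illustrative assumptions, not the released data schema.

# Illustrative items for the four MATHEMETRIC task categories.
# Field names and formats are assumptions for exposition only.
examples = [
    {   # 1) shape classification
        "category": "cls",
        "question": "Which shape is outlined in red? (A) circle (B) square (C) triangle (D) pentagon",
        "answer": "C",
    },
    {   # 2) object counting
        "category": "cnt",
        "question": "How many triangles appear in the figure? (A) 1 (B) 2 (C) 3 (D) 4",
        "answer": "B",
    },
    {   # 3) relationship identification
        "category": "rlat",
        "question": "Line AB is perpendicular to line CD. True or False?",
        "answer": "True",
    },
    {   # 4) object grounding: predict a bounding box (x1, y1, x2, y2)
        "category": "grd",
        "question": "Provide the bounding box of circle O.",
        "answer": [120, 45, 260, 185],
    },
]

for item in examples:
    print(item["category"], "->", item["answer"])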

Key statistics and subject-task distribution of MATHEMETRIC.

The synthetic construction process for plane geometry. We synthesize geometric figures by randomly sampling elements from the geometric shape pool and relationship pool, ensuring consistency through a verifier that enforces logical constraints based on manually designed rules, fundamental mathematical principles, and prerequisite points. All visual elements are structured and saved in JSON format. Images are rendered using the Matplotlib package, and corresponding Q&A pairs are generated using a template-based pipeline.
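A minimal Python sketch of this construction loop is given below. The shape and relationship pools, the verifier rule, the layout, and the Q&A template are toy placeholders standing in for the much richer pools, manually designed rules, and template pipeline described above.

import json
import random

import matplotlib
matplotlib.use("Agg")                      # headless rendering
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, RegularPolygon

# Hypothetical pools; the real pipeline samples from 16 plane-geometry
# categories and 10+ mathematical relationships.
SHAPE_POOL = ["circle", "triangle", "square"]
RELATION_POOL = ["tangent", "parallel", "inscribed"]

def verify(shapes, relation):
    """Toy verifier: reject logically inconsistent samples."""
    if relation == "parallel" and "circle" in shapes:
        return False                       # parallelism needs straight edges
    return True

def draw(ax, shape, center):
    """Draw a single primitive; the placement is a placeholder layout and
    does not geometrically realize the sampled relationship."""
    if shape == "circle":
        ax.add_patch(Circle(center, 0.15, fill=False))
    else:
        n = 3 if shape == "triangle" else 4
        ax.add_patch(RegularPolygon(center, n, radius=0.15, fill=False))

def synthesize(seed=0):
    rng = random.Random(seed)
    while True:                            # resample until the verifier accepts
        shapes = rng.sample(SHAPE_POOL, k=2)
        relation = rng.choice(RELATION_POOL)
        if verify(shapes, relation):
            break

    scene = {"shapes": shapes, "relation": relation}
    with open("scene.json", "w") as f:     # structured scene saved as JSON
        json.dump(scene, f, indent=2)

    fig, ax = plt.subplots(figsize=(3, 3)) # render with Matplotlib
    for shape, cx in zip(shapes, (0.3, 0.7)):
        draw(ax, shape, (cx, 0.5))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis("off")
    fig.savefig("scene.png", dpi=150)
    plt.close(fig)

    qa = {                                 # template-based Q&A pair
        "question": "How many shapes are in the figure? (A) 1 (B) 2 (C) 3 (D) 4",
        "answer": "B",
        "category": "counting",
    }
    return scene, qa

if __name__ == "__main__":
    print(synthesize())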

Experiment Results

Main Results of MATHEMETRIC


Performance comparison of different MLLMs on MATHEMETRIC across Plane Geometry, Solid Geometry, and Graphs. cls, cnt, grd, and rlat denote the question categories: shape classification, object counting, object grounding, and relationship identification, respectively. all indicates the overall accuracy within a subject, calculated as the ratio of correctly answered questions to the total number of questions in that subject, while Avg. denotes the average of the all scores across all subjects.
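To make the scoring concrete, the short sketch below computes the per-category cells (e.g., PG_grd), the per-subject all scores, and the cross-subject Avg. from a toy list of per-question records; the record fields are assumptions, not the benchmark's released format.

from collections import defaultdict

# Hypothetical per-question records: subject in {"PG", "SG", "G"},
# category in {"cls", "cnt", "grd", "rlat"}, correct is a boolean.
results = [
    {"subject": "PG", "category": "cls",  "correct": True},
    {"subject": "PG", "category": "grd",  "correct": False},
    {"subject": "SG", "category": "cnt",  "correct": True},
    {"subject": "G",  "category": "rlat", "correct": True},
]

def accuracy(records):
    return 100.0 * sum(r["correct"] for r in records) / len(records)

by_subject = defaultdict(list)             # e.g. PG_ALL, SG_ALL, G_ALL
by_cell = defaultdict(list)                # e.g. PG_grd, SG_cls, ...
for r in results:
    by_subject[r["subject"]].append(r)
    by_cell[(r["subject"], r["category"])].append(r)

all_scores = {s: accuracy(rs) for s, rs in by_subject.items()}
cell_scores = {f"{s}_{c}": accuracy(rs) for (s, c), rs in by_cell.items()}
avg = sum(all_scores.values()) / len(all_scores)   # "Avg." column

print(cell_scores, all_scores, {"Avg.": round(avg, 1)})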

More Results


Model Responses

BibTeX

@article{sun2025mathglance,
  author    = {Yanpeng Sun and Shan Zhang and Wei Tang and Aotian Chen and Piotr Koniusz and Kai Zou and Yuan Xue and Anton van den Hengel},
  title     = {MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams},
  journal   = {arXiv preprint arXiv:2503.20745},
  year      = {2025}
}
@inproceedings{zhang2025primitive,
  author    = {Shan Zhang and Aotian Chen and Yanpeng Sun and Jindong Gu and Yi-Yu Zheng and Piotr Koniusz and Kai Zou and Anton van den Hengel and Yuan Xue},
  title     = {Primitive Vision: Improving Diagram Understanding in MLLMs},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025}
}