Hierarchical Process Reward Models are Symbolic Vision Learners

Shan Zhang1,5,*,†, Aotian Chen2,*, Kai Zou3, Jindong Gu4, Yuan Xue2,‡, Anton van den Hengel1
1Adelaide AIML 2Ohio State University 3NetMind.ai 4University of Oxford 5Data61 & CSIRO
*Core Contribution   †Project Lead   ‡Corresponding Author

Figure: Comparison of latent spaces formed by a semantic auto-encoder and our symbolic auto-encoder. Semantic auto-encoders capture color and texture, which are uninformative for semantically sparse diagrams; our symbolic auto-encoder forms a structured latent space representing dependencies among primitives, with the decoder reconstructing diagrams according to visual-logic rules.

Abstract

Symbolic computer vision represents diagrams through explicit logical rules and structured representations, enabling interpretable understanding in machine vision. This requires a fundamentally different learning paradigm from that of pixel-based visual models: symbolic visual learners parse diagrams into geometric primitives—points, lines, and shapes—whereas pixel-based learners operate on textures and colors. We propose a novel self-supervised symbolic auto-encoder that encodes diagrams into structured primitives and their interrelationships in the latent space, and decodes them through our executable engine to reconstruct the input diagrams. Central to this architecture is SymHPR (Symbolic Hierarchical Process Reward Modeling), which applies hierarchical step-level parsing rewards to enforce point-on-line, line-on-shape, and shape-on-relation consistency.

Evaluations across reconstruction, perception, and reasoning tasks demonstrate the effectiveness of our approach: a 98.2% reduction in MSE on geometric diagram reconstruction, a 0.6% gain over GPT-4o with a 7B model on chart reconstruction, a +13% improvement on the MathGlance perception benchmark, and +3% improvements on the MathVerse and GeoQA reasoning benchmarks.

Method

SymHPR Framework Overview

Key Innovations

  • Cost-Free Dataset Construction: A synthetic logic-form engine that constructs structured parsing paths (Point → Line → Shape → Shape Properties → Geometric Relations) paired with rendered diagrams, eliminating manual annotation.
  • Rule-Based Rewards: Computed via verifiable metrics such as F1 scores and L2 distances, eliminating hallucinated supervision.
  • Hierarchical Dependency Modeling: Enforces step-level constraints (point-on-line, line-on-shape, shape-on-relation) so that geometric rewards are granted only when lower-level detections are reliable (a minimal sketch follows this list).
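
To make the gating concrete, here is a minimal Python sketch of a hierarchical step-level reward. The function names, the 0.5 reliability gate, and the 5-pixel point tolerance are illustrative assumptions, not the paper's exact matchers or thresholds.

```python
import numpy as np

def f1(pred, gold, match):
    """Greedy set-level F1 between predicted and reference primitives (a sketch, not bipartite matching)."""
    if not pred or not gold:
        return 0.0
    tp = sum(any(match(p, g) for g in gold) for p in pred)
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def point_match(p, g, tol=5.0):
    """Two points match when their L2 distance falls within a pixel tolerance (assumed value)."""
    return float(np.linalg.norm(np.subtract(p, g))) <= tol

def hierarchical_reward(pred, gold, gate=0.5):
    """Sum step-level rewards; each level counts only when the level below it is reliable."""
    r_point = f1(pred["points"], gold["points"], point_match)
    # Gating encodes the point-on-line, line-on-shape, and shape-on-relation dependencies.
    r_line = f1(pred["lines"], gold["lines"], lambda a, b: a == b) if r_point >= gate else 0.0
    r_shape = f1(pred["shapes"], gold["shapes"], lambda a, b: a == b) if r_line >= gate else 0.0
    r_rel = f1(pred["relations"], gold["relations"], lambda a, b: a == b) if r_shape >= gate else 0.0
    return r_point + r_line + r_shape + r_rel
```

Because each level's reward is gated on the one below it, a parser cannot accumulate shape or relation rewards on top of unreliable point detections, which is exactly the property the step-level constraints are meant to enforce.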

Symbolic Auto-Encoder

Our auto-encoder uses a rendering engine as its decoder, reconstructing diagrams from symbolic logic forms. This enables self-supervised training through a perceptual loss between input diagrams and their reconstructions. To address training collapse caused by sparse supervision, we introduce two stabilization strategies:

  • Hard Negative Contrastive Learning: Injects Gaussian noise into parsed primitives to create hard negatives and increase reward variance.
  • Power Normalization Annealing: Amplifies reward differences while preserving their relative ranking (see the sketch after this list).
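
A minimal sketch of the two strategies, assuming scalar rewards collected over a batch; the noise scale, exponent schedule, and function names are assumptions for illustration, not the paper's hyperparameters.

```python
import numpy as np

def hard_negatives(points, sigma=3.0, rng=None):
    """Perturb parsed point coordinates with Gaussian noise to synthesize hard negatives."""
    if rng is None:
        rng = np.random.default_rng()
    pts = np.asarray(points, dtype=float)
    return pts + rng.normal(scale=sigma, size=pts.shape)

def power_normalize(rewards, p):
    """Rank-preserving power transform; p > 1 stretches the gaps among the highest rewards."""
    r = np.asarray(rewards, dtype=float)
    r = (r - r.min()) / (r.max() - r.min() + 1e-8)  # rescale to [0, 1]
    return r ** p  # monotonic in r, so the relative ranking is preserved

def annealed_exponent(step, total_steps, p_start=3.0, p_end=1.0):
    """Linearly anneal the exponent toward 1, returning to the raw reward scale late in training."""
    t = min(step / total_steps, 1.0)
    return p_start + t * (p_end - p_start)
```

Hard negatives give the reward signal contrastive pairs whose scores differ meaningfully, while the annealed power transform keeps early updates informative when most rewards cluster together.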

Main Results

We evaluate our symbolic auto-encoder (SymVAE) across three tasks: geometric diagram reconstruction, diagram understanding, and mathematical reasoning.

Geometric Diagram Reconstruction

Results on Synthetic Diagrams (first four metric columns) and Geo170K Diagrams (last four):

| Model | MSE↓ | LPIPS↓ | SSIM↑ | DINO↑ | MSE↓ | LPIPS↓ | SSIM↑ | DINO↑ |
|---|---|---|---|---|---|---|---|---|
| Pixel-Level Auto-Encoders | | | | | | | | |
| VAE | 12.9 | 0.29 | 0.62 | 0.81 | 37.9 | 0.37 | 0.64 | 0.80 |
| VQ-GAN | 11.7 | 0.21 | 0.69 | 0.81 | 37.2 | 0.31 | 0.73 | 0.83 |
| Closed-Source MLLMs | | | | | | | | |
| GPT-4o | 34.3 | 0.15 | 0.82 | 0.96 | 39.9 | 0.22 | 0.76 | 0.92 |
| Our Symbolic Models | | | | | | | | |
| SymParser-3B | 7.64 | 0.12 | 0.76 | 0.93 | 36.8 | 0.28 | 0.76 | 0.93 |
| SymHPR-3B | 7.02 | 0.08 | 0.83 | 0.96 | 27.7 | 0.26 | 0.77 | 0.95 |
| SymVAE-3B | 6.13 | 0.01 | 0.89 | 0.95 | 21.8 | 0.20 | 0.79 | 0.95 |
| SymVAE-7B | 6.01 | 0.05 | 0.94 | 0.96 | 16.8 | 0.17 | 0.83 | 0.96 |

Key Findings

  • Superior Reconstruction Fidelity: SymVAE-7B achieves the best performance across all metrics on both synthetic and real-world diagrams.
  • Outperforms Closed-Source Models: SymVAE-7B significantly surpasses GPT-4o on MSE (6.01 vs 34.3 on synthetic; 16.8 vs 39.9 on Geo170K).

Diagram Understanding

We evaluate on the MathGlance benchmark covering plane geometry, solid geometry, and graphs.

| Model | Size | Avg. | Plane Geo. | Solid Geo. | Graphs |
|---|---|---|---|---|---|
| GPT-4o | - | 53.3 | 42.8 | 60.7 | 56.4 |
| GPT-o4-mini-high | - | 48.0 | 19.1 | 64.7 | 60.1 |
| Qwen2.5-VL | 7B | 59.2 | 44.0 | 63.1 | 65.7 |
| Math-LLaVA | 13B | 40.0 | 27.9 | 44.8 | 47.3 |
| Primitive | 7B | 46.6 | 35.4 | 49.4 | 55.1 |
| SymVAE+GeoPeP | 7B | 72.6 | 77.9 | 67.6 | 72.2 |

Mathematical Reasoning

MathVerse Results

| Model | All | Vision Int. | Vision Only |
|---|---|---|---|
| G-LLaVA-7B | 16.6 | 17.2 | 9.4 |
| Math-LLaVA-13B | 24.1 | 17.6 | 16.4 |
| MAVIS-7B | 28.4 | 24.7 | 18.3 |
| Qwen2.5-VL-7B | 49.2 | 33.2 | 21.1 |
| SymVAE+CoTs-7B | 51.8 | 35.2 | 24.9 |

GeoQA Results

| Model | Accuracy (%) |
|---|---|
| G-LLaVA-7B | 64.2 |
| MAVIS-7B | 66.7 |
| MultiMath-7B | 74.1 |
| Qwen2.5-VL-7B | 76.4 |
| SymVAE+CoTs-7B | 79.4 |

Cross-Modal Attention Visualization

Figure: Cross-modal attention maps on MathGlance (geometric description task) and MathVerse (reasoning task).

Reconstruction Visualization

Comparison of geometric reconstructions between our approach and general-purpose multimodal models. General-purpose multimodal models struggle to preserve fine-grained geometric structure, whereas our approach yields substantially more faithful geometric and relational reconstructions.


Logic Form Examples

Our symbolic representations consist of compositional geometric primitives organized in a hierarchy: Points → Lines → Shapes → Shape Indicators → Relations.

Figures: Logic form examples 1 and 2.
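
For illustration, here is a hypothetical logic form for a right triangle, written as a Python structure. The concrete grammar below is our assumption; the paper's rendering engine defines its own syntax.

```python
# Hypothetical logic form for a right triangle; the grammar is illustrative, not the engine's exact syntax.
logic_form = {
    "points": {"A": (0, 0), "B": (4, 0), "C": (0, 3)},
    "lines": {"AB": ("A", "B"), "BC": ("B", "C"), "CA": ("C", "A")},
    "shapes": {"T1": ("Triangle", ("AB", "BC", "CA"))},
    "shape_indicators": [("RightAngle", "A")],       # shape properties
    "relations": [("Perpendicular", ("AB", "CA"))],  # geometric relations
}
```

Each level references only identifiers defined at the level below it, mirroring the dependency structure that the hierarchical rewards enforce.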

Cross-Domain Generalization

Our symbolic encoder demonstrates strong cross-domain generalization, extending beyond geometric diagrams to electrical circuits and chemical structures.

Figures: Chemical structure and electrical circuit reconstructions.

Cross-Lingual and Cross-Modal Transfer

Emergent Cross-Lingual Transfer

After fine-tuning on perception tasks, our model exhibits emergent cross-lingual transfer: formal logic forms are automatically translated into fluent natural-language chains of thought without any explicit CoT supervision.

Emergent Cross-Modal Adaptation

Although trained only on 2D geometric diagrams, the model automatically adapts point primitives to 3D scenes, enriching them with attributes such as object shape, color, and material.

Qualitative Reasoning Examples

Our model produces correct reasoning chains grounded in accurate visual perception.

BibTeX

@article{zhang2025symhpr,
  title={Hierarchical Process Reward Models are Symbolic Vision Learners},
  author={Zhang, Shan and Chen, Aotian and Zou, Kai and Gu, Jindong and Xue, Yuan and van den Hengel, Anton},
  journal={arXiv preprint arXiv:2512.03126},
  year={2025}
}