Wei Tang1,†, Yanpeng Sun2,†,‡, Shan Zhang3,5,†, Xiaofan Li4,*, Piotr Koniusz5, Wei Li6, Na Zhao2, Zechao Li1
1NJUST IMAG  2SUTD IMPL  3Adelaide AIML  4Baidu Inc.  5Data61/CSIRO  6SenseTime
*Project Leader   ViOcean Initiative Collaborators   Corresponding Author

Motivation of Artemis. Comparison between current perception-policy models and human perception. (a) Query: find the shortest player. (b) Perception-policy models rely on ungrounded language reasoning, leading to incorrect localization. (c) Humans perform structured visual reasoning, progressively refining attention to identify the correct player.

Abstract

Recent reinforcement-learning frameworks for visual perception policies have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in its form: these chains perform semantic reasoning in an unstructured linguistic space, whereas visual perception requires reasoning in a spatial, object-centric space.

In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states and direct supervision of proposal quality, and it avoids the ambiguity introduced by language-based reasoning.

Built on Qwen2.5-VL-3B, Artemis achieves strong performance on grounding and detection tasks and generalizes substantially to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.

Method

Artemis is a unified framework for RL-based perception-policy learning. Rollouts generated by the MLLM are encouraged to gather structured visual evidence before decision-making, guided by a structured visual-reasoning reward, while outcome rewards supervise format and answer generation. GRPO is employed to optimize the unified perception-policy learning framework.
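The reward composition and GRPO-style optimization described above can be made concrete with a short sketch. The snippet below is a minimal illustration, not the released training code: the tag-based format check, the exact-match answer reward, the unit reward weights, and the rollout dictionary fields are all assumptions.

import re
import statistics
from typing import Dict, List


def format_reward(text: str) -> float:
    # Outcome reward on format: the rollout must contain a well-formed <think> phase
    # followed by an <answer> span (tag convention assumed for illustration).
    return 1.0 if re.search(r"<think>.*</think>\s*<answer>.*</answer>", text, re.S) else 0.0


def answer_reward(pred: str, gold: str) -> float:
    # Outcome reward on the final answer; exact string match here as a stand-in
    # (box answers would instead be scored by overlap with the ground truth).
    return 1.0 if pred.strip() == gold.strip() else 0.0


def reasoning_reward(num_matched: int, num_refs: int) -> float:
    # Structured visual-reasoning reward: fraction of reference objects recovered by
    # the (label, bounding-box) proposals emitted during <think>.
    return num_matched / max(num_refs, 1)


def group_relative_advantages(rollouts: List[Dict]) -> List[float]:
    # GRPO-style credit assignment: each rollout's total reward is normalized by the
    # mean and standard deviation of its sampled group.
    rewards = [
        reasoning_reward(r["matched"], r["num_refs"])
        + format_reward(r["text"])
        + answer_reward(r["answer"], r["gold"])
        for r in rollouts
    ]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(x - mean) / std for x in rewards]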

Artemis Framework Overview

Key Innovations

  • Rethinking Perception-Policy Learning: Instead of reasoning in linguistic space or removing the thinking process, we rethink what form of thinking truly benefits perception and align learning with spatial, object-centric representations.
  • Structured Visual Reasoning: Intermediate steps are represented as (label, bounding-box) pairs, enabling explicit tracking of key and contextual objects and reducing ambiguity from language-based reasoning.
  • Cross-task Generalization: A single perception policy transfers from grounding to counting and from natural images to diagrams, achieving scalable improvements across diverse visual tasks.

Structured Visual Reasoning

Artemis explicitly generates structured visual evidence during the <think> phase. By tracking intermediate states as labeled bounding boxes, the model learns to locate key and contextual objects before producing final answers. This approach strengthens object-centric perception, reduces ambiguity from language-based reasoning, and enables robust generalization across multiple visual domains.
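To make the labeled-bounding-box intermediate states concrete, the sketch below shows one way (label, bounding-box) proposals produced during the <think> phase could be parsed and checked against reference objects by greedy IoU matching. The textual proposal format, the 0.5 IoU threshold, and the greedy matching rule are assumptions for illustration rather than the exact reward used by Artemis; the matched count is what a reasoning reward like the one sketched earlier would consume.

import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Assumed textual form of a proposal inside <think>, e.g. "player: [10, 20, 55, 120]".
PROPOSAL = re.compile(r"(\w[\w\s]*?)\s*:\s*\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]")


def parse_proposals(think_text: str) -> List[Tuple[str, Box]]:
    # Extract (label, box) pairs from the reasoning text.
    return [
        (m.group(1).strip(), tuple(map(float, m.groups()[1:])))
        for m in PROPOSAL.finditer(think_text)
    ]


def iou(a: Box, b: Box) -> float:
    # Standard intersection-over-union of two axis-aligned boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matched(proposals: List[Tuple[str, Box]],
                  references: List[Tuple[str, Box]], thr: float = 0.5) -> int:
    # Greedy one-to-one matching: a proposal counts if its label matches an unused
    # reference object and their IoU reaches the threshold.
    used, matched = set(), 0
    for label, box in proposals:
        best, best_iou = None, thr
        for i, (ref_label, ref_box) in enumerate(references):
            if i in used or label != ref_label:
                continue
            score = iou(box, ref_box)
            if score >= best_iou:
                best, best_iou = i, score
        if best is not None:
            used.add(best)
            matched += 1
    return matched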

In-domain Visual Perception

We evaluate our unified visual perception learning framework, Artemis, on two in-domain tasks: visual grounding and object detection. Models marked with † denote results from our own inference.
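For reference, the @50/@75/@95 columns in the grounding tables below report the fraction of referring expressions whose predicted box reaches the given IoU with the ground-truth box, and the Avg columns appear to correspond to the mean of the three threshold accuracies. The scoring sketch below reproduces that computation on hypothetical per-query IoUs; it is a convenience illustration, not the official evaluation script.

from typing import Dict, List


def rec_accuracy(ious: List[float], thresholds=(0.5, 0.75, 0.95)) -> Dict[str, float]:
    # Acc@t: percentage of queries with predicted-vs-ground-truth IoU >= t.
    scores = {
        f"Acc@{int(round(t * 100))}": 100.0 * sum(i >= t for i in ious) / len(ious)
        for t in thresholds
    }
    # The Avg column averages the three threshold accuracies.
    scores["Avg"] = sum(scores.values()) / len(thresholds)
    return scores


# Hypothetical example: three queries with IoUs 0.96, 0.80, 0.40
# -> Acc@50 ≈ 66.7, Acc@75 ≈ 66.7, Acc@95 ≈ 33.3, Avg ≈ 55.6
print(rec_accuracy([0.96, 0.80, 0.40]))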

Referring Expression Comprehension (Grounding)

RefCOCO
Method Size val@50 testA@50 testB@50 val@75 testA@75 testB@75 val@95 testA@95 testB@95 val-Avg testA-Avg testB-Avg
Expert Models
MDETR - 87.5 90.4 82.6 - - - - - - - - -
OFA - 88.4 90.6 83.3 - - - - - - - - -
General MLLMs
LLaVA-v1.5 7B 49.1 54.9 43.3 10.7 13.6 6.9 0.4 0.3 0.3 20.1 22.9 16.8
LLaVA-OV 7B 73.0 82.3 63.5 24.2 29.6 15.9 0.5 0.5 0.5 32.6 37.5 26.6
Qwen2-VL 2B 86.8 89.6 82.0 77.2 80.6 70.1 33.0 35.7 26.9 65.7 68.6 59.7
Qwen2.5-VL 3B 88.6 91.7 84.0 79.1 83.5 71.2 34.6 37.9 27.8 67.4 71.0 61.0
DeepSeek-VL2-Tiny 3B 83.5 86.7 77.9 69.7 74.1 60.0 24.6 29.2 19.3 59.3 63.3 52.4
RL-based MLLMs
Perception-R1 2B 89.1 91.4 84.5 79.5 83.6 72.4 35.0 38.5 28.8 67.9 71.2 61.9
Vision-R1 7B 89.6 92.9 84.9 80.0 84.7 72.6 33.6 36.8 28.6 67.7 71.5 62.0
VLM-R1 3B 90.7 92.8 85.9 81.6 84.7 73.5 35.6 37.9 27.7 69.3 71.8 62.4
Artemis 3B 91.3 93.4 87.0 83.6 86.4 76.5 40.1 42.8 33.4 71.7 74.2 65.6
RefCOCO+
Method Size val@50 testA@50 testB@50 val@75 testA@75 testB@75 val@95 testA@95 testB@95 val-Avg testA-Avg testB-Avg
Expert Models
MDETR - 81.1 85.5 72.9 - - - - - - - - -
OFA - 81.3 87.1 74.2 - - - - - - - - -
General MLLMs
LLaVA-v1.5 7B 42.4 49.7 36.4 9.8 12.4 6.4 0.5 0.5 0.2 17.6 20.8 14.3
LLaVA-OV 7B 65.8 79.0 57.2 23.6 28.8 15.3 0.6 0.6 0.4 30.0 36.1 24.3
Qwen2-VL 2B 77.1 82.5 70.1 68.7 73.8 60.0 29.4 32.3 23.0 58.4 62.9 51.0
Qwen2.5-VL 3B 81.9 87.3 74.7 73.2 79.3 63.9 32.3 35.8 25.4 62.5 67.5 54.7
DeepSeek-VL2-Tiny 3B 73.3 81.3 63.5 61.9 70.2 49.4 22.1 27.3 16.1 52.4 59.6 43.0
RL-based MLLMs
Perception-R1 2B 81.7 86.8 74.3 73.6 79.3 64.2 32.6 36.9 26.7 62.6 67.7 55.1
Vision-R1 7B 83.0 89.0 75.3 74.7 81.7 64.1 31.5 35.2 25.6 63.1 68.6 55.0
VLM-R1 3B 84.2 89.3 76.6 76.1 81.2 65.7 33.4 36.4 25.9 64.6 69.0 56.1
Artemis 3B 85.3 89.9 77.8 78.3 82.9 68.7 38.3 41.7 30.0 67.3 71.5 58.7
RefCOCOg
Method Size val@50 test@50 val@75 test@75 val@95 test@95 val-Avg test-Avg
Expert Models
MDETR - 83.3 83.3 - - - - - -
OFA - 82.2 82.3 - - - - - -
General MLLMs
LLaVA-v1.5 7B 43.2 45.1 8.5 9.3 0.3 0.3 17.3 18.2
LLaVA-OV 7B 70.8 70.8 23.3 23.6 0.6 0.7 31.6 31.7
Qwen2-VL 2B 83.3 83.1 72.7 73.0 28.9 27.9 61.6 61.3
Qwen2.5-VL 3B 85.1 85.7 74.4 75.8 32.1 33.1 63.9 64.9
DeepSeek-VL2-Tiny 3B 75.7 79.2 60.4 63.1 19.1 21.0 38.8 54.4
RL-based MLLMs
Perception-R1 2B 85.7 85.4 75.7 76.0 32.1 33.1 64.5 64.8
Vision-R1 7B 86.4 86.9 76.4 77.8 32.4 33.1 65.1 65.9
VLM-R1 3B 86.0 86.7 75.1 76.8 32.7 32.9 64.6 65.5
Artemis 3B 87.3 87.3 77.7 79.4 36.3 37.9 67.1 68.2

Object Detection (COCO2017 Val)

Method Size Epoch mAP AP50 AP75 AR100
Expert Models
YOLOv3 - 273 27.9 49.2 28.3 -
Faster-RCNN - 12 35.6 55.7 37.9 -
General MLLMs
Qwen2.5-VL† 3B 1 15.4 22.5 15.9 29.8
Griffon 13B 1 24.8 40.6 25.1 -
RL-based MLLMs
VLM-R1† 3B 1 21.6 35.6 21.7 33.2
Vision-R1 7B 1 26.6 40.0 27.8 -
Perception-R1 3B 1 31.9 46.7 33.4 41.2
Artemis 3B 1 31.0 48.0 31.9 46.6

Key Findings

  • Superior Visual Grounding: Artemis consistently outperforms SOTA methods across RefCOCO/+/g splits, especially at high IoU thresholds, producing highly precise bounding boxes.
  • Strong Object Detection: On COCO2017 val, Artemis achieves 31.0 mAP, 48.0 AP50, and 46.6 AR100; it is the only unified RL-based 3B-scale model to surpass 30 mAP, and its AP50 and AR100 are the highest among the compared MLLM-based detectors, indicating more complete coverage of the scene.
  • Enhanced In-domain Perception: Structured visual reasoning significantly boosts MLLM perception abilities, with Artemis outperforming other RL-based models in in-domain tasks, highlighting its effectiveness for accurate object localization and recognition.

Zero-shot Out-of-domain Visual Perception on Natural Scenes

In zero-shot settings, we evaluate Artemis on out-of-domain visual counting using the Pixmo dataset and on reasoning grounding using the LISA grounding dataset. Models marked with † denote results from our own inference; baselines that follow a detect-then-count paradigm derive their counts from predicted boxes.
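The two counting paradigms differ in where the count comes from, as the sketch below illustrates: detect-then-count baselines report the number of predicted boxes, whereas a model that enumerates instances during <think> can emit the number directly in its answer. The answer-extraction pattern is an assumption for illustration.

import re
from typing import List, Optional


def detect_then_count(predicted_boxes: List[List[float]]) -> int:
    # Detect-then-count paradigm: the count is the number of boxes the detector returns.
    return len(predicted_boxes)


def direct_count(answer_text: str) -> Optional[int]:
    # Direct paradigm: read the numeric count from the model's <answer> span
    # (tag convention assumed, matching the reasoning format sketched earlier).
    m = re.search(r"<answer>\s*(\d+)\s*</answer>", answer_text)
    return int(m.group(1)) if m else None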

Method Size Pixmo_val Pixmo_test LISA_test
General MLLMs
LLaVA-v1.5 7B 33.3 31.0 -
LLaVA-OV 7B 55.8 53.7 -
Qwen2-VL 2B 60.2 50.5 -
Qwen2.5-VL 3B 58.0 57.8 67.4
RL-based MLLMs
VisionReasoner 7B 70.1 69.5 -
Perception-R1 2B 78.1 75.6 -
UniVG-R1 7B - - 59.7
No-Thinking-RL 2B - - 61.8
VLM-R1 3B - - 63.1
Artemis 3B 81.4 76.92 78.3

Key Findings

  • Human-like Zero-shot Counting: On the Pixmo-Count dataset, Artemis internally enumerates instances during the <think> phase and directly outputs numeric counts, achieving strong zero-shot performance without any counting-specific training and outperforming detect-then-count baselines.
  • Enhanced Zero-shot Visual Perception: On the LISA test set, Artemis reaches 78.3 accuracy, substantially exceeding prior methods, showing that structured visual reasoning improves reasoning-dependent perception in natural scenes and reduces linguistic hallucinations.

Zero-shot Out-of-domain Visual Perception on Diagram Understanding

We evaluate Artemis on the MATHGLANCE benchmark covering plane geometry, solid geometry, and graphs. The evaluation reports zero-shot accuracy over all questions in each domain. Models marked with † denote results from our own inference.

Model Size Avg. Plane Geo. Solid Geo. Graphs
General MLLMs
G-LLaVA 7B 30.3 25.6 32.3 33.9
DeepSeek-VL2-Tiny 3B 32.6 29.5 39.0 29.4
Qwen2.5-VL† 3B 33.1 31.0 37.1 31.2
LLaVA-v1.5 7B 33.3 29.2 31.6 39.0
RL-based MLLMs
VLM-R1† 3B 34.4 26.9 43.6 32.6
No-Thinking-RL† 2B 45.3 33.2 56.4 46.3
Perception-R1† 2B 45.3 29.7 59.5 46.8
Artemis 3B 49.3 39.2 56.4 52.3

Key Findings

  • Superior Zero-shot Math Perception: On the MATHGLANCE benchmark, Artemis achieves the best overall average score (49.3), substantially outperforming general MLLMs such as Qwen2.5-VL and LLaVA-v1.5.
  • Outperforming RL-based Models: Artemis surpasses recent R1-based models including Perception-R1 and No-Thinking-RL, demonstrating stronger visual reasoning across plane geometry, solid geometry, and graph-based problems.
  • Cross-domain Generalization: Structured visual reasoning learned from natural images transfers effectively to math-related visual scenes, highlighting Artemis’ robust perceptual capabilities and generalization across diverse visual tasks.

Zero-shot Comprehensive Visual Perception

We evaluate Artemis on multiple mainstream multimodal benchmarks, including MMBench, MMVet, MMStar, ScienceQA, SeedBench, MME, AI2D, OCRBench, POPE, and BLINK. The evaluation reports zero-shot performance across all benchmarks. Models marked with † denote results from our own inference.

Model Size MMBench MMVet MMStar ScienceQA SeedBench MME(Sum) AI2D OCRBench POPE BLINK
General MLLMs
LLaVA-v1.5 7B 62.8 32.8 32.6 65.4 60.1 1338.3 51.9 - - -
Qwen2-VL 2B 71.9 45.6 46.3 74.0 72.7 1471.1 71.6 - - -
DeepSeek-VL2-Tiny 3B 74.6 52.5 - - - 1905.5 - 805 - -
Qwen2.5-VL† 3B 79.1 60.0 53.8 79.3 74.0 2200.0 78.3 826 85.9 48.8
RL-based MLLMs
VLM-R1† 3B 70.7 58.8 53.1 69.4 68.8 2156.2 73.3 774 79.3 46.9
Perception-R1 2B 71.8 48.9 45.7 73.4 73.0 1903.9 71.8 - - -
Artemis 3B 79.3 61.4 55.9 79.6 74.3 2229.7 78.2 828 88.6 48.5

Key Findings

  • Enhanced General Visual Comprehension: Across a wide range of multimodal benchmarks, Artemis demonstrates consistently improved zero-shot performance, indicating that structured visual reasoning strengthens general perception across diverse visual tasks without any task-specific tuning.

Visual Grounding Visualization

Comparison of visual grounding results among Artemis, the skip-reasoning Perception-R1, and the linguistic-reasoning VLM-R1. Skipping reasoning often leads to inaccurate predictions, while linguistic reasoning suffers from the difficulty of supervising perception-oriented reasoning in linguistic space, frequently causing mismatches between intermediate reasoning and the final answer. In contrast, Artemis produces precise, high-quality bounding boxes and delivers coherent, accurate grounding by explicitly linking structured visual reasoning with the predicted outputs.


Visual Counting Visualization

Comparison of zero-shot visual counting results among Perception-R1 (trained on Pixmo), Qwen2.5-VL baseline, and Artemis. Both Perception-R1 and Qwen2.5-VL rely on detect-then-count post-processing, while Artemis internally enumerates queried objects during the reasoning process and directly outputs numeric counts. By leveraging structured visual reasoning, Artemis acquires perceptual skills that transfer seamlessly to counting tasks, resulting in behavior remarkably similar to human intuition.


Mathematical Perception Visualization

Comparison of mathematical perception between Artemis and Qwen2.5-VL on MATHGLANCE. Artemis correctly identifies shapes and relational configurations in counting and relation tasks, while Qwen2.5-VL makes mistakes, such as selecting an invalid quadrilateral or misperceiving relative positions. This demonstrates that perception capabilities learned from natural images transfer effectively to visually distinct domains, enabling accurate perceptual reasoning in mathematical scenes.


Object Detection & Reasoning Grounding Visualization

Visualization of Artemis on detection and reasoning grounding tasks shows enhanced scene-level perceptual capabilities. In detection cases, Artemis accurately perceives all ground-truth objects, while in LISA grounding, it identifies relevant visual evidence, such as a person and a ladder, and uses it to infer answers. These results demonstrate that structured visual reasoning training strengthens object-centric perception and improves reasoning-based grounding across diverse tasks.


Training Data Examples

We supervise structured visual reasoning using a dedicated post-training dataset. To support this objective within our unified Artemis framework, we construct the Artemis-RFT dataset from MS-COCO, yielding roughly 77k post-training instances.
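As a rough, hypothetical illustration of how such an instance could be derived from MS-COCO annotations, the sketch below converts one image's boxes into (label, bounding-box) reasoning targets; the instance schema, field names, and question wording are assumptions rather than the released construction pipeline.

from typing import Dict, List


def coco_to_rft_instance(image_id: int, annotations: List[Dict],
                         categories: Dict[int, str], question: str) -> Dict:
    # COCO boxes are stored as (x, y, w, h); convert them to (x1, y1, x2, y2)
    # so they can serve as verifiable reasoning targets.
    proposals = [
        {
            "label": categories[a["category_id"]],
            "box": [a["bbox"][0], a["bbox"][1],
                    a["bbox"][0] + a["bbox"][2], a["bbox"][1] + a["bbox"][3]],
        }
        for a in annotations
    ]
    return {"image_id": image_id, "question": question, "reasoning_targets": proposals}


# Hypothetical example with a single annotated person:
example = coco_to_rft_instance(
    image_id=42,
    annotations=[{"category_id": 1, "bbox": [30.0, 40.0, 100.0, 200.0]}],
    categories={1: "person"},
    question="Locate every person in the image.",
)
print(example)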

A data example of Artemis-RFT is shown below.

(Figure: example instance from the Artemis-RFT dataset)

BibTeX

@misc{tang2026artemis,
  title={Artemis: Structured Visual Reasoning for Perception Policy Learning},
  author={Tang, Wei and Sun, Yanpeng and Zhang, Shan and Li, Xiaofan and Koniusz, Piotr and Li, Wei and Zhao, Na and Li, Zechao},
  eprint={2512.01988},
  archivePrefix={arXiv},
  year={2026}
}