Recent reinforcement-learning frameworks for visual perception policies have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in its form: these chains perform semantic reasoning in an unstructured linguistic space, whereas visual perception requires reasoning in a spatial, object-centric space.
In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states and direct supervision of proposal quality, and avoids the ambiguity introduced by language-based reasoning.
Built on Qwen2.5-VL-3B, Artemis achieves strong performance on grounding and detection tasks and exhibits substantial generalization to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.
Artemis is a unified framework for RL-based perception-policy learning. Rollouts generated by the MLLM are encouraged to perceive structured visual evidence before decision-making, guided by a structured visual-reasoning reward, while outcome rewards supervise format and answer generation. GRPO is employed to optimize the whole framework.
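As a rough sketch of how such an objective can be wired together (not the exact Artemis implementation; the reward weights, tag format, and helper names below are our assumptions), the per-rollout reward combines the structured visual-reasoning term with the outcome terms, and GRPO normalizes rewards within each group of rollouts sampled for the same prompt:

```python
import re
import numpy as np

def format_reward(text: str) -> float:
    # Hypothetical format check: reasoning wrapped in <think>...</think>,
    # followed by the final prediction in <answer>...</answer>.
    ok = re.search(r"<think>.*</think>\s*<answer>.*</answer>", text, re.S) is not None
    return 1.0 if ok else 0.0

def total_reward(rollout: str, gt_boxes, proposal_reward_fn, answer_reward_fn,
                 w_reason: float = 1.0, w_format: float = 0.5, w_answer: float = 1.0) -> float:
    # Weighted sum of the structured visual-reasoning reward and the outcome rewards.
    # The weights are illustrative, not the values used by Artemis.
    return (w_reason * proposal_reward_fn(rollout, gt_boxes)
            + w_format * format_reward(rollout)
            + w_answer * answer_reward_fn(rollout))

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    # GRPO: advantages are rewards normalized within the group of rollouts
    # sampled for the same prompt (no learned value function is needed).
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)
```

The group-relative normalization is the main appeal of GRPO here: it turns heterogeneous scalar rewards (reasoning, format, answer) into comparable advantages without training a separate value model.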
Artemis explicitly generates structured visual evidence during the `<think>` phase.
By tracking intermediate states as labeled bounding boxes, the model learns to locate key and contextual objects before producing final answers.
This approach strengthens object-centric perception, reduces ambiguity from language-based reasoning, and enables robust generalization across multiple visual domains.
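To make the notion of a verifiable visual state concrete, here is a minimal sketch (our own illustration; the actual proposal format and matching rule used by Artemis may differ) of parsing (label, bounding-box) proposals from a reasoning trace and scoring them against reference boxes with IoU:

```python
import re

def parse_proposals(think_text: str):
    # Assumed proposal format inside <think>, e.g. "dog: [x1, y1, x2, y2]".
    pattern = r"([\w ]+):\s*\[\s*([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\s*\]"
    return [(label.strip(), tuple(map(float, box)))
            for label, *box in re.findall(pattern, think_text)]

def iou(a, b) -> float:
    # Intersection-over-union of two boxes in (x1, y1, x2, y2) format.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def proposal_reward(think_text: str, gt: dict) -> float:
    # gt maps each label to its reference boxes. Every proposal is credited
    # with its best IoU against same-label references; the reward is the mean
    # credit (0 if nothing was proposed), so each intermediate step is verifiable.
    proposals = parse_proposals(think_text)
    if not proposals:
        return 0.0
    scores = [max((iou(box, g) for g in gt.get(label, [])), default=0.0)
              for label, box in proposals]
    return sum(scores) / len(scores)
```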
We evaluate our unified visual perception learning framework, Artemis, on two in-domain tasks: visual grounding and object detection. Models marked with † denote results from our own inference.
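For reference, the @50/@75/@95 columns in the grounding tables below read as accuracy at IoU thresholds of 0.5, 0.75, and 0.95, and the Avg columns as their mean. A minimal sketch of this metric, assuming one predicted box per referring expression, is:

```python
def box_iou(a, b) -> float:
    # IoU of two (x1, y1, x2, y2) boxes (same helper as in the earlier sketch).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-6)

def grounding_accuracy(pred_boxes, gt_boxes, thresholds=(0.5, 0.75, 0.95)):
    # Acc@t: fraction of expressions whose predicted box reaches IoU >= t with
    # the ground-truth box; Avg is the mean over the three thresholds.
    accs = {}
    for t in thresholds:
        hits = sum(box_iou(p, g) >= t for p, g in zip(pred_boxes, gt_boxes))
        accs[f"Acc@{int(round(t * 100))}"] = 100.0 * hits / len(gt_boxes)
    accs["Avg"] = sum(accs.values()) / len(thresholds)
    return accs
```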

**RefCOCO**

| Method | Size | val@50 | testA@50 | testB@50 | val@75 | testA@75 | testB@75 | val@95 | testA@95 | testB@95 | valAvg | testAAvg | testBAvg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Expert Models | |||||||||||||
| MDETR | - | 87.5 | 90.4 | 82.6 | - | - | - | - | - | - | - | - | - |
| OFA | - | 88.4 | 90.6 | 83.3 | - | - | - | - | - | - | - | - | - |
| General MLLMs | |||||||||||||
| LLaVA-v1.5 | 7B | 49.1 | 54.9 | 43.3 | 10.7 | 13.6 | 6.9 | 0.4 | 0.3 | 0.3 | 20.1 | 22.9 | 16.8 |
| LLaVA-OV | 7B | 73.0 | 82.3 | 63.5 | 24.2 | 29.6 | 15.9 | 0.5 | 0.5 | 0.5 | 32.6 | 37.5 | 26.6 |
| Qwen2-VL | 2B | 86.8 | 89.6 | 82.0 | 77.2 | 80.6 | 70.1 | 33.0 | 35.7 | 26.9 | 65.7 | 68.6 | 59.7 |
| Qwen2.5-VL† | 3B | 88.6 | 91.7 | 84.0 | 79.1 | 83.5 | 71.2 | 34.6 | 37.9 | 27.8 | 67.4 | 71.0 | 61.0 |
| DeepSeek-VL2-Tiny† | 3B | 83.5 | 86.7 | 77.9 | 69.7 | 74.1 | 60.0 | 24.6 | 29.2 | 19.3 | 59.3 | 63.3 | 52.4 |
| RL-based MLLMs | |||||||||||||
| Perception-R1 | 2B | 89.1 | 91.4 | 84.5 | 79.5 | 83.6 | 72.4 | 35.0 | 38.5 | 28.8 | 67.9 | 71.2 | 61.9 |
| Vision-R1† | 7B | 89.6 | 92.9 | 84.9 | 80.0 | 84.7 | 72.6 | 33.6 | 36.8 | 28.6 | 67.7 | 71.5 | 62.0 |
| VLM-R1† | 3B | 90.7 | 92.8 | 85.9 | 81.6 | 84.7 | 73.5 | 35.6 | 37.9 | 27.7 | 69.3 | 71.8 | 62.4 |
| Artemis | 3B | 91.3 | 93.4 | 87.0 | 83.6 | 86.4 | 76.5 | 40.1 | 42.8 | 33.4 | 71.7 | 74.2 | 65.6 |

**RefCOCO+**

| Method | Size | val@50 | testA@50 | testB@50 | val@75 | testA@75 | testB@75 | val@95 | testA@95 | testB@95 | valAvg | testAAvg | testBAvg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Expert Models | |||||||||||||
| MDETR | - | 81.1 | 85.5 | 72.9 | - | - | - | - | - | - | - | - | - |
| OFA | - | 81.3 | 87.1 | 74.2 | - | - | - | - | - | - | - | - | - |
| General MLLMs | |||||||||||||
| LLaVA-v1.5 | 7B | 42.4 | 49.7 | 36.4 | 9.8 | 12.4 | 6.4 | 0.5 | 0.5 | 0.2 | 17.6 | 20.8 | 14.3 |
| LLaVA-OV | 7B | 65.8 | 79.0 | 57.2 | 23.6 | 28.8 | 15.3 | 0.6 | 0.6 | 0.4 | 30.0 | 36.1 | 24.3 |
| Qwen2-VL | 2B | 77.1 | 82.5 | 70.1 | 68.7 | 73.8 | 60.0 | 29.4 | 32.3 | 23.0 | 58.4 | 62.9 | 51.0 |
| Qwen2.5-VL† | 3B | 81.9 | 87.3 | 74.7 | 73.2 | 79.3 | 63.9 | 32.3 | 35.8 | 25.4 | 62.5 | 67.5 | 54.7 |
| DeepSeek-VL2-Tiny† | 3B | 73.3 | 81.3 | 63.5 | 61.9 | 70.2 | 49.4 | 22.1 | 27.3 | 16.1 | 52.4 | 59.6 | 43.0 |
| RL-based MLLMs | |||||||||||||
| Perception-R1 | 2B | 81.7 | 86.8 | 74.3 | 73.6 | 79.3 | 64.2 | 32.6 | 36.9 | 26.7 | 62.6 | 67.7 | 55.1 |
| Vision-R1† | 7B | 83.0 | 89.0 | 75.3 | 74.7 | 81.7 | 64.1 | 31.5 | 35.2 | 25.6 | 63.1 | 68.6 | 55.0 |
| VLM-R1† | 3B | 84.2 | 89.3 | 76.6 | 76.1 | 81.2 | 65.7 | 33.4 | 36.4 | 25.9 | 64.6 | 69.0 | 56.1 |
| Artemis | 3B | 85.3 | 89.9 | 77.8 | 78.3 | 82.9 | 68.7 | 38.3 | 41.7 | 30.0 | 67.3 | 71.5 | 58.7 |

**RefCOCOg**

| Method | Size | val@50 | test@50 | val@75 | test@75 | val@95 | test@95 | valAvg | testAvg |
|---|---|---|---|---|---|---|---|---|---|
| Expert Models | | | | | | | | | |
| MDETR | - | 83.3 | 83.3 | - | - | - | - | - | - |
| OFA | - | 82.2 | 82.3 | - | - | - | - | - | - |
| General MLLMs | | | | | | | | | |
| LLaVA-v1.5 | 7B | 43.2 | 45.1 | 8.5 | 9.3 | 0.3 | 0.3 | 17.3 | 18.2 |
| LLaVA-OV | 7B | 70.8 | 70.8 | 23.3 | 23.6 | 0.6 | 0.7 | 31.6 | 31.7 |
| Qwen2-VL | 2B | 83.3 | 83.1 | 72.7 | 73.0 | 28.9 | 27.9 | 61.6 | 61.3 |
| Qwen2.5-VL† | 3B | 85.1 | 85.7 | 74.4 | 75.8 | 32.1 | 33.1 | 63.9 | 64.9 |
| DeepSeek-VL2-Tiny† | 3B | 75.7 | 79.2 | 60.4 | 63.1 | 19.1 | 21.0 | 38.8 | 54.4 |
| RL-based MLLMs | | | | | | | | | |
| Perception-R1 | 2B | 85.7 | 85.4 | 75.7 | 76.0 | 32.1 | 33.1 | 64.5 | 64.8 |
| Vision-R1† | 7B | 86.4 | 86.9 | 76.4 | 77.8 | 32.4 | 33.1 | 65.1 | 65.9 |
| VLM-R1† | 3B | 86.0 | 86.7 | 75.1 | 76.8 | 32.7 | 32.9 | 64.6 | 65.5 |
| Artemis | 3B | 87.3 | 87.3 | 77.7 | 79.4 | 36.3 | 37.9 | 67.1 | 68.2 |

**Object detection**

| Method | Size | Epochs | mAP | AP50 | AP75 | AR100 |
|---|---|---|---|---|---|---|
| Expert Models | ||||||
| YOLOv3 | - | 273 | 27.9 | 49.2 | 28.3 | - |
| Faster-RCNN | - | 12 | 35.6 | 55.7 | 37.9 | - |
| General MLLMs | ||||||
| Qwen2.5-VL† | 3B | 1 | 15.4 | 22.5 | 15.9 | 29.8 |
| Griffon | 13B | 1 | 24.8 | 40.6 | 25.1 | - |
| RL-based MLLMs | ||||||
| VLM-R1† | 3B | 1 | 21.6 | 35.6 | 21.7 | 33.2 |
| Vision-R1 | 7B | 1 | 26.6 | 40.0 | 27.8 | - |
| Perception-R1 | 3B | 1 | 31.9 | 46.7 | 33.4 | 41.2 |
| Artemis | 3B | 1 | 31.0 | 48.0 | 31.9 | 46.6 |
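For context on the detection columns, the following sketch shows how mAP, AP50, AP75, and AR100 are conventionally obtained with pycocotools, assuming the model's predicted boxes have already been exported to the standard COCO results JSON format (this is the standard COCO protocol, not code released with Artemis):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def evaluate_detections(ann_file: str, result_file: str) -> dict:
    # ann_file: COCO-format ground-truth annotations.
    # result_file: JSON list of {"image_id", "category_id", "bbox", "score"} predictions.
    coco_gt = COCO(ann_file)
    coco_dt = coco_gt.loadRes(result_file)
    coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()
    # COCOeval.stats indices: 0 = mAP@[.50:.95], 1 = AP50, 2 = AP75, 8 = AR@maxDets=100.
    return {
        "mAP": coco_eval.stats[0],
        "AP50": coco_eval.stats[1],
        "AP75": coco_eval.stats[2],
        "AR100": coco_eval.stats[8],
    }
```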
In zero-shot settings, we evaluate Artemis on out-of-domain visual counting using the Pixmo dataset and on reasoning grounding using the LISA grounding dataset. Models marked with † denote results from our own inference. Models marked with ‡ follow a detect-then-count paradigm, deriving counts from predicted boxes.
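For reference on the ‡ baselines, a detect-then-count pipeline can be sketched as below (the confidence threshold and prediction format are our assumptions, not the baselines' actual settings); Artemis instead enumerates the queried objects inside its reasoning trace and emits the count directly:

```python
def detect_then_count(detections, query_label: str, score_thresh: float = 0.3) -> int:
    # detections: list of dicts like {"label": "apple", "score": 0.91, "box": [...]}.
    # The count is simply the number of predicted boxes of the queried class
    # whose confidence exceeds the threshold, e.g. detect_then_count(boxes, "apple").
    return sum(1 for d in detections
               if d["label"] == query_label and d.get("score", 1.0) >= score_thresh)
```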
| Method | Size | Pixmo_val | Pixmo_test | LISA_test |
|---|---|---|---|---|
| General MLLMs | ||||
| LLaVA-v1.5† | 7B | 33.3 | 31.0 | - |
| LLaVA-OV | 7B | 55.8 | 53.7 | - |
| Qwen2-VL | 2B | 60.2 | 50.5 | - |
| Qwen2.5-VL† | 3B | 58.0 | 57.8 | 67.4 |
| RL-based MLLMs | ||||
| VisionReasoner‡ | 7B | 70.1 | 69.5 | - |
| Perception-R1‡ | 2B | 78.1 | 75.6 | - |
| UniVG-R1 | 7B | - | - | 59.7 |
| No-Thinking-RL | 2B | - | - | 61.8 |
| VLM-R1 | 3B | - | - | 63.1 |
| Artemis | 3B | 81.4 | 76.9 | 78.3 |
During counting, Artemis enumerates the queried objects inside the `<think>` phase and directly outputs numeric counts, achieving strong zero-shot performance without any counting-specific training and outperforming detect-then-count baselines.

We evaluate Artemis on the MATHGLANCE benchmark, covering plane geometry, solid geometry, and graphs. The evaluation reports zero-shot accuracy over all questions in each domain. Models marked with † denote results from our own inference.
| Model | Size | Avg. | Plane Geo. | Solid Geo. | Graphs |
|---|---|---|---|---|---|
| General MLLMs | |||||
| G-LLaVA | 7B | 30.3 | 25.6 | 32.3 | 33.9 |
| DeepSeek-VL2-Tiny | 3B | 32.6 | 29.5 | 39.0 | 29.4 |
| Qwen2.5-VL† | 3B | 33.1 | 31.0 | 37.1 | 31.2 |
| LLaVA-v1.5 | 7B | 33.3 | 29.2 | 31.6 | 39.0 |
| RL-based MLLMs | |||||
| VLM-R1† | 3B | 34.4 | 26.9 | 43.6 | 32.6 |
| No-Thinking-RL† | 2B | 45.3 | 33.2 | 56.4 | 46.3 |
| Perception-R1† | 2B | 45.3 | 29.7 | 59.5 | 46.8 |
| Artemis | 3B | 49.3 | 39.2 | 56.4 | 52.3 |
We evaluate Artemis on multiple mainstream multimodal benchmarks, including MMBench, MMVet, MMStar, ScienceQA, SeedBench, MME, AI2D, OCRBench, POPE, and BLINK. The evaluation reports zero-shot performance across all benchmarks. Models marked with † denote results from our own inference.
| Model | Size | MMBench Avg. | MMVet Avg. | MMStar Avg. | ScienceQA Avg. | SeedBench Avg. | MME Sum | AI2D Avg. | OCRBench Avg. | POPE Avg. | BLINK Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| General MLLMs | |||||||||||
| LLaVA-v1.5 | 7B | 62.8 | 32.8 | 32.6 | 65.4 | 60.1 | 1338.3 | 51.9 | - | - | - |
| Qwen2-VL | 2B | 71.9 | 45.6 | 46.3 | 74.0 | 72.7 | 1471.1 | 71.6 | - | - | - |
| DeepSeek-VL2-Tiny | 3B | 74.6 | 52.5 | - | - | - | 1905.5 | - | 805 | - | - |
| Qwen2.5-VL† | 3B | 79.1 | 60.0 | 53.8 | 79.3 | 74.0 | 2200.0 | 78.3 | 826 | 85.9 | 48.8 |
| RL-based MLLMs | |||||||||||
| VLM-R1† | 3B | 70.7 | 58.8 | 53.1 | 69.4 | 68.8 | 2156.2 | 73.3 | 774 | 79.3 | 46.9 |
| Perception-R1 | 2B | 71.8 | 48.9 | 45.7 | 73.4 | 73.0 | 1903.9 | 71.8 | - | - | - |
| Artemis | 3B | 79.3 | 61.4 | 55.9 | 79.6 | 74.3 | 2229.7 | 78.2 | 828 | 88.6 | 48.5 |
Comparison of visual grounding results among Artemis, the skip-reasoning Perception-R1, and the linguistic-reasoning VLM-R1. Skip-reasoning often leads to inaccurate predictions, while linguistic reasoning is difficult to supervise for perception-oriented goals, frequently causing mismatches between the intermediate reasoning and the final answer. In contrast, Artemis produces precise, high-quality bounding boxes, delivering coherent and accurate grounding by explicitly linking its structured visual reasoning to the predicted outputs.
Comparison of zero-shot visual counting results among Perception-R1 (trained on Pixmo), the Qwen2.5-VL baseline, and Artemis. Both Perception-R1 and Qwen2.5-VL rely on detect-then-count post-processing, whereas Artemis internally enumerates the queried objects during reasoning and directly outputs numeric counts. By leveraging structured visual reasoning, Artemis acquires perceptual skills that transfer seamlessly to counting, yielding behavior that closely mirrors human-style enumeration.
Comparison of mathematical perception between Artemis and Qwen2.5-VL on MATHGLANCE. Artemis correctly identifies shapes and relational configurations in counting and relation tasks, while Qwen2.5-VL makes mistakes, such as selecting an invalid quadrilateral or misperceiving relative positions. This demonstrates that perception capabilities learned from natural images transfer effectively to visually distinct domains, enabling accurate perceptual reasoning in mathematical scenes.
Visualization of Artemis on detection and reasoning grounding tasks shows enhanced scene-level perceptual capabilities. In detection cases, Artemis accurately perceives all ground-truth objects, while in LISA grounding, it identifies relevant visual evidence, such as a person and a ladder, and uses it to infer answers. These results demonstrate that structured visual reasoning training strengthens object-centric perception and improves reasoning-based grounding across diverse tasks.
We supervise structured visual reasoning using a dedicated post-training dataset. To support this objective within our unified Artemis framework, we construct the Artemis-RFT dataset from MS-COCO, yielding roughly 77k post-training instances.
A data example of Artemis-RFT is shown below.
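Purely as an illustration of the structure described above, and not the released schema (every field name and value here is a hypothetical assumption), such an instance might be organized along these lines:

```python
# Hypothetical Artemis-RFT-style instance: all field names and values are
# illustrative assumptions, not taken from the released dataset.
example = {
    "image": "path/to/coco_image.jpg",              # placeholder image path
    "question": "Where is the dog on the left?",    # grounding-style query
    "proposals": [                                  # (label, bounding-box) evidence
        {"label": "dog", "bbox": [34.0, 120.5, 210.0, 388.0]},
        {"label": "person", "bbox": [250.0, 60.0, 470.0, 400.0]},
    ],
    "answer": [34.0, 120.5, 210.0, 388.0],          # final grounded box
}
```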
@misc{tang2026artemis,
  title={Artemis: Structured Visual Reasoning for Perception Policy Learning},
  author={Tang, Wei and Sun, Yanpeng and Zhang, Shan and Li, Xiaofan and Koniusz, Piotr and Li, Wei and Zhao, Na and Li, Zechao},
  howpublished={arXiv:2512.01988},
  year={2026}
}