Perceive the world in its nature and explore the unknown
Perception, reasoning, and interaction: The world reveals itself not through pixels, but through the symbols and structures underneath. Vision speaks its own language, directing where to look and what to understand. To live in the world is to engage with the knowledge it offers—absorbing it and transforming experience into capability, not just caching isolated data.
A framework that enables structured visual reasoning in spatial and object-centric space, improving visual perception tasks through reinforcement learning with verifiable reasoning chains.
Current MLLMs perform poorly on basic diagram perception tasks, relying on textual shortcuts rather than visual understanding; in this sense they are "math blind". Representing diagrams as graphs of primitives is crucial; our results show that strong low-level perception drives faithful high-level mathematical reasoning.
A self-supervised symbolic auto-encoder that encodes diagrams into structured primitives and their interrelationships. It achieves a 98.2% reduction in MSE on geometric diagram reconstruction, a +13% gain on the diagram perception benchmark, and a +3% gain on the MathVerse and GeoQA reasoning benchmarks.
An agentic learning framework that enables progressive improvement through multimodal semantic memory, integrating visual and logical memory to refine both perception and reasoning for lifelong and cross-domain agentic learning.
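As a rough illustration of how a figure such as the 98.2% MSE reduction reported for the symbolic auto-encoder can be read, the sketch below computes a relative reduction in mean squared error between a baseline reconstruction and an improved one. The function names, the toy pixel values, and the choice of baseline are assumptions made for illustration only; the exact protocol behind the reported number is not stated here.

```python
def mse(target, reconstruction):
    """Mean squared error between two equal-length sequences of values."""
    if len(target) != len(reconstruction):
        raise ValueError("target and reconstruction must have the same length")
    return sum((t - r) ** 2 for t, r in zip(target, reconstruction)) / len(target)


def relative_mse_reduction(baseline_mse, model_mse):
    """Percentage by which model_mse is lower than baseline_mse."""
    if baseline_mse <= 0:
        raise ValueError("baseline_mse must be positive")
    return 100.0 * (baseline_mse - model_mse) / baseline_mse


# Toy data (hypothetical): a four-pixel "diagram" and two reconstructions.
target = [0.0, 0.0, 1.0, 1.0]
baseline_recon = [0.5, 0.5, 0.5, 0.5]   # blurry reconstruction
model_recon = [0.1, 0.0, 0.9, 1.0]      # sharper reconstruction

reduction = relative_mse_reduction(
    mse(target, baseline_recon),
    mse(target, model_recon),
)
print(f"relative MSE reduction: {reduction:.1f}%")
```

With these toy values the reduction comes out at roughly 98%. A relative measure like this depends heavily on which baseline is chosen, so it is best read alongside the absolute errors.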
Pre-trained Artemis models for structured visual reasoning and perception policy learning across various visual tasks.
The model trained with GEOMETRIC for enhanced geometric diagram understanding and mathematical reasoning.
A benchmark that isolates diagram perception from reasoning in MLLMs, featuring 1.2K diagrams and 1.6K curated questions across four tasks: shape classification, counting, relationship identification, and grounding.
A structure-aware geometric diagram-description dataset encoding shapes, attributes, and interrelationships as graphs with fine-grained spatial annotations for model training.
Official implementation of the Artemis framework for structured visual reasoning and perception policy learning with reinforcement learning.
Official implementation of the ViLoMem framework, featuring multimodal semantic memory architecture and agentic learning algorithms.
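The idea of describing a diagram as a graph of primitives, with shapes, attributes, and interrelationships, can be sketched as a small data structure. This is a minimal sketch under stated assumptions: the class names, field names, relation labels, and bounding-box convention below are all hypothetical and are not taken from the released dataset or code.

```python
from dataclasses import dataclass, field


@dataclass
class Primitive:
    pid: str                                  # hypothetical node identifier
    shape: str                                # e.g. "circle", "triangle", "line"
    attributes: dict = field(default_factory=dict)
    bbox: tuple = (0.0, 0.0, 0.0, 0.0)        # assumed (x_min, y_min, x_max, y_max)


@dataclass
class Relation:
    source: str
    target: str
    kind: str                                 # e.g. "inscribed_in", "tangent_to"


@dataclass
class DiagramGraph:
    primitives: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_primitive(self, primitive):
        self.primitives[primitive.pid] = primitive

    def add_relation(self, source, target, kind):
        # Both endpoints must already exist as nodes in the graph.
        if source not in self.primitives or target not in self.primitives:
            raise KeyError("both endpoints must be added before relating them")
        self.relations.append(Relation(source, target, kind))

    def count(self, shape):
        """Counting: number of primitives with the given shape."""
        return sum(1 for p in self.primitives.values() if p.shape == shape)

    def related(self, pid, kind):
        """Relationship identification: targets linked from pid by kind."""
        return [r.target for r in self.relations
                if r.source == pid and r.kind == kind]

    def ground(self, pid):
        """Grounding: spatial annotation of a primitive."""
        return self.primitives[pid].bbox


def build_example():
    """A circle inscribed in a triangle, with a line tangent to the circle."""
    g = DiagramGraph()
    g.add_primitive(Primitive("t1", "triangle", {"kind": "equilateral"}, (0, 0, 10, 9)))
    g.add_primitive(Primitive("c1", "circle", {"radius": 2.9}, (2, 0, 8, 6)))
    g.add_primitive(Primitive("l1", "line", {}, (0, 6, 10, 6)))
    g.add_relation("c1", "t1", "inscribed_in")
    g.add_relation("l1", "c1", "tangent_to")
    return g


graph = build_example()
print(graph.count("circle"), graph.related("c1", "inscribed_in"), graph.ground("c1"))
```

The query methods mirror the four perception tasks named for the benchmark: the `shape` field covers shape classification, `count` covers counting, `related` covers relationship identification, and `ground` covers grounding.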