CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

Shuhang Chen1, Yunqiu Xu1, Junjie Xie1, Aojun Lu4, Tao Feng3, Zeying Huang2, Ning Zhang2, Yi Sun2, Yi Yang1, Hangjie Yuan1*
1Zhejiang University 2Intelligent Learning 3Tsinghua University 4Sichuan University

Overview of the proposed visual mathematical reasoning framework CogFlow. Inspired by the canonical three-stage human reasoning flow, CogFlow adopts a hierarchical pipeline that integrates Synergistic Visual Rewards (SynVRs) for enhanced perception, a Knowledge Internalization Reward (IntlzR) to bridge perception and reasoning, and Visual-Gated Policy Optimization (VGPO) with Inference Reward (InfR) to anchor multi-step reasoning in perceptual accuracy.


Abstract

Despite recent advances, multimodal large language models continue to struggle with visual mathematical problem solving. Some recent works recognize that visual perception is a bottleneck in visual mathematical reasoning, but their solutions are limited to improving the extraction and interpretation of visual inputs. Notably, they all ignore a key issue: whether the extracted visual cues are faithfully integrated and properly utilized in subsequent reasoning. Motivated by this, we present CogFlow, a novel cognitive-inspired three-stage framework that incorporates an explicit knowledge internalization stage, simulating the hierarchical flow of human reasoning: perception ⇒ internalization ⇒ reasoning. In line with this hierarchical flow, we holistically enhance all of its stages. We devise synergistic visual rewards that boost perception in both parametric and semantic spaces, jointly improving visual information extraction from symbols and diagrams. To guarantee faithful integration of the extracted visual cues into subsequent reasoning, we introduce a visual-anchored reward model in the internalization stage, bridging perception and reasoning. We further design a visual-gated policy optimization algorithm that enforces grounding of the reasoning process in visual knowledge, preventing the model from taking shortcuts through reasoning chains that appear coherent but are visually ungrounded. Finally, we contribute MathCog, a new training dataset with over 120K high-quality perception-reasoning aligned annotations. Comprehensive experiments and analysis on three commonly used visual mathematical reasoning benchmarks validate the superiority of the proposed CogFlow.
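To make the training signal concrete, below is a minimal Python sketch of how a visual gate might combine the perception, internalization, and inference rewards before a VGPO-style policy update. It is an illustrative assumption only: the function name visual_gated_reward, the gate_threshold, the linear weights, and the hard 0/1 gating rule are hypothetical and do not correspond to the authors' released implementation.

# Minimal sketch (assumptions, not the official CogFlow code): combine the
# three reward signals and gate the reasoning reward on perceptual grounding.
from dataclasses import dataclass

@dataclass
class RolloutRewards:
    syn_vr: float    # Synergistic Visual Reward (perception quality)
    intlz_r: float   # Knowledge Internalization Reward (cue integration)
    inf_r: float     # Inference Reward (reasoning / answer correctness)

def visual_gated_reward(r: RolloutRewards,
                        gate_threshold: float = 0.5,
                        w_perc: float = 1.0,
                        w_intlz: float = 1.0,
                        w_inf: float = 1.0) -> float:
    """Suppress the inference reward when perception is weak, so a rollout
    cannot earn credit for a fluent but visually ungrounded reasoning chain."""
    gate = 1.0 if r.syn_vr >= gate_threshold else 0.0
    return w_perc * r.syn_vr + w_intlz * r.intlz_r + gate * w_inf * r.inf_r

# Example: poor perception (0.2) zeroes out an otherwise perfect inference reward.
print(visual_gated_reward(RolloutRewards(syn_vr=0.2, intlz_r=0.4, inf_r=1.0)))

Under such a gating scheme, rollouts with low perception scores still receive signal from the perception and internalization terms, which is consistent with the stated goal of anchoring multi-step reasoning in perceptual accuracy.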

CogFlow Framework

Experiments

Main results

Accuracy (%) and FlowVerse-style CoT-E (%) results on FlowVerse. Each cell reports CoT-E / Acc.
Model | All | Text Centric | Text Limited | Text Plus | Vision Dense | Vision Centric | Vision Primary
Claude-3.5-Sonnet | 55.5 / 45.1 | 60.8 / 52.6 | 58.7 / 50.3 | 64.0 / 58.3 | 45.0 / 25.4 | 56.5 / 48.0 | 48.1 / 45.2
GPT-4o | 56.9 / 49.7 | 61.0 / 56.8 | 58.7 / 54.4 | 62.2 / 58.2 | 45.2 / 30.0 | 58.6 / 52.6 | 54.1 / 51.0
GPT-4V | 64.2 / 58.7 | 69.1 / 57.1 | 65.0 / 55.0 | 72.0 / 61.4 | 48.1 / 30.3 | 61.8 / 46.3 | 42.0 / 36.7
MathFlow (GPT-4V) | 64.2 / 59.5 | 69.5 / 58.2 | 67.2 / 57.4 | 71.1 / 64.1 | 52.7 / 47.5 | 62.1 / 57.1 | 60.4 / 57.0
Gemini-2.5-Pro | 64.5 / 56.2 | 68.3 / 61.9 | 66.1 / 60.8 | 68.9 / 64.1 | 52.1 / 37.1 | 65.7 / 57.9 | 57.0 / 54.6
GPT-5 | 68.2 / 59.3 | 74.3 / 68.1 | 73.5 / 66.7 | 77.0 / 69.2 | 53.8 / 44.7 | 67.1 / 61.7 | 60.3 / 57.5
InfiMM-Math-7B | 37.8 / 29.5 | 43.8 / 38.1 | 40.6 / 36.7 | 46.1 / 40.1 | 28.8 / 15.4 | 39.6 / 30.3 | 26.1 / 23.2
InternVL2.5-8B | 46.3 / 40.1 | 49.2 / 41.3 | 40.5 / 38.4 | 49.6 / 42.7 | 38.4 / 20.2 | 41.0 / 35.9 | 35.8 / 33.9
Math-LLaVA-13B | 39.3 / 30.8 | 45.1 / 39.3 | 44.4 / 37.4 | - / - | 36.2 / 18.6 | 41.7 / 35.9 | 37.0 / 34.2
MultiMath-7B | 45.2 / 35.3 | 50.6 / 44.8 | 49.9 / 42.9 | - / - | 41.7 / 22.1 | 47.2 / 40.4 | 39.7 / 38.8
SVE-Math-Qwen2.5-7B | 47.9 / 38.7 | 53.1 / 47.3 | 53.4 / 45.8 | - / - | 44.2 / 28.6 | 48.9 / 44.2 | 45.8 / 42.0
VLM-R1-7B | 50.7 / 41.2 | 59.0 / 54.2 | 57.9 / 49.8 | 65.5 / 58.9 | 36.2 / 24.5 | 46.1 / 37.8 | 30.6 / 26.1
CogFlow-7B | 66.0 / 56.2 | 67.9 / 58.6 | 67.3 / 58.3 | 68.1 / 60.9 | 57.8 / 42.7 | 68.2 / 61.1 | 66.7 / 63.5
Accuracy (%) and MathVerse-style CoT-E (%) results on the testmini set of MathVerse. Each cell reports CoT-E / Acc.
Model | All | Text Dominant | Text Lite | Text Only | Vision Intensive | Vision Dominant | Vision Only
Qwen-VL-Plus | 21.3 / 11.8 | 26.0 / 15.7 | 21.2 / 11.1 | 25.2 / 14.5 | 18.5 / 9.0 | 19.1 / 13.0 | 21.8 / 10.0
Gemini-Pro | 35.3 / 23.5 | 39.8 / 26.3 | 34.7 / 23.5 | 44.5 / 27.3 | 32.0 / 23.0 | 36.8 / 22.3 | 33.3 / 22.2
Qwen-VL-Max | 37.2 / 25.3 | 42.8 / 30.7 | 37.7 / 26.1 | 47.9 / 28.9 | 33.6 / 24.1 | 35.9 / 24.1 | 35.9 / 21.4
GPT-4V | 54.4 / 39.4 | 63.1 / 54.7 | 56.6 / 41.4 | 60.3 / 48.7 | 51.4 / 34.9 | 50.8 / 34.4 | 50.3 / 31.6
MathFlow (GPT-4V) | 56.7 / 43.8 | 65.2 / 51.1 | 58.9 / 46.4 | 62.1 / 48.5 | 53.7 / 40.3 | 52.1 / 37.4 | 52.5 / 39.0
SPHINX-MoE-56B | 25.8 / 15.6 | 33.3 / 22.2 | 21.9 / 16.4 | 40.7 / 18.3 | 21.1 / 14.8 | 19.6 / 12.6 | 18.3 / 9.1
InternLM-XC2-7B | 25.9 / 16.5 | 36.9 / 22.3 | 28.3 / 17.0 | 42.5 / 16.5 | 20.1 / 15.7 | 24.4 / 16.4 | 19.8 / 11.0
Math-LLaVA-13B | - / 20.1 | - / 22.8 | - / 21.8 | - / - | - / 21.1 | - / 19.2 | - / 15.4
MultiMath-7B | - / 26.9 | - / 34.8 | - / 30.8 | - / - | - / 28.1 | - / 25.9 | - / 15.0
SVE-Math-Qwen2.5-7B | - / 31.4 | - / 37.6 | - / 36.8 | - / - | - / 34.9 | - / 31.5 | - / 16.0
DVLR-14B | 48.1 / - | 54.3 / - | 49.0 / - | - / - | 46.3 / - | 47.2 / - | 43.8 / -
SophiaVL-R1-7B | 48.8 / - | 45.4 / - | 43.9 / - | - / - | 45.1 / - | 58.5 / - | 51.3 / -
CogFlow-7B | 53.9 / 39.5 | 60.7 / 41.9 | 51.2 / 37.0 | 52.3 / 40.1 | 55.0 / 42.4 | 58.7 / 44.8 | 44.2 / 26.3
Accuracy (%) results on MathVista. CogFlow demonstrates consistent superiority.
Model | All | FQA | GPS | MWP | TQA | VQA
GPT-4V | 49.9 | 43.1 | 50.5 | 57.5 | 65.2 | 38.0
Claude-3.5-Sonnet | 67.7 | - | - | - | - | -
Doubao-pro-1.5 | 79.5 | 77.7 | 88.9 | 86.0 | 82.3 | 62.0
G-LLaVA-7B | 25.1 | 19.1 | 48.7 | 3.6 | 25.0 | 28.7
VCAR-7B | 33.7 | 30.9 | 34.6 | 38.7 | 37.3 | 28.5
SPHINX-Plus-56B | 36.7 | 54.6 | 16.4 | 23.1 | 41.8 | 43.0
SVE-Math-7B | 37.4 | 31.9 | 53.9 | 29.0 | 41.4 | 30.8
MultiMath-7B | 50.0 | 40.1 | 66.8 | 61.8 | 50.0 | 33.0
SophiaVL-R1-7B | 71.3 | - | - | - | 73.4 | -
ThinkLite-VL-7B | 71.6 | - | - | - | - | -
VL-Rethinker-7B | 73.7 | - | - | - | - | -
CogFlow-7B | 76.8 | 70.4 | 93.1 | 73.7 | 86.9 | 59.3
CogFlow shows competitive accuracy (%) results on more visual math benchmarks.
Model | WeMath | LogicVista | DynaMath
Claude-3.7-Sonnet | 49.3 | 58.2 | 39.7
GLM-4.5V | 68.8 | 62.4 | 53.9
Doubao-1.5-Pro | 65.7 | 64.2 | 44.9
GPT-5 | 71.1 | 70.0 | 60.9
Gemini-2.5-Pro | 78.0 | 73.8 | 56.3
Ovis-8B | 27.2 | 39.4 | 20.4
Qwen2.5-VL-8B | 35.2 | 44.1 | 21.0
InternVL3-8B | 37.1 | 44.1 | 25.5
Keye-VL-8B | 60.7 | 54.8 | 37.3
InternVL3.5-8B | 57.0 | 57.3 | 37.7
GLM-4.1V-9B | 63.8 | 60.4 | 42.5
CogFlow-7B | 64.1 | 58.1 | 46.2

More analysis

Ablation of the three proposed components (i.e., SynVRs, IntlzR, and VGPO). The visual gate is always enabled during inference. Each result cell reports CoT-E / Acc.
SynVRs | IntlzR | VGPO | FlowVerse | MathVerse
57.4 / 48.7 | 48.2 / 35.6
63.2 / 54.7 | 50.5 / 36.9
62.7 / 53.5 | 49.9 / 36.2
63.4 / 54.8 | 50.8 / 37.3
64.4 / 55.1 | 52.1 / 38.0
66.0 / 56.2 | 53.9 / 39.5
Ablation analysis of SynVRs. Variants exhibit consistent improvements.
(a) Impact of different error types, where All indicates all types are used.
(b) Impact of DPO variants.
Ablation analysis of the proposed Knowledge Internalization Reward.
The distribution of visual reward values among different post-training methods. A higher concentration of values indicates stronger perceptual grounding achieved by the corresponding training strategy.
Ablation analysis of the visual gate. Training and Inference indicate that the visual gate is used only in the training phase or only in the inference phase, respectively.
Error-type analysis. We analyze error-type distributions for CogFlow variants alongside specialized visual–math models, the GRPO-style model, and the decoupled method. The SFT+GRPO setting serves as the baseline.

Case study

Case study. We provide a case study of the proposed CogFlow framework.

BibTeX

@article{chen2026cogflow,
  title   = {CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving},
  author  = {Chen, Shuhang and Xu, Yunqiu and Xie, Junjie and Lu, Aojun and Feng, Tao and Huang, Zeying and Zhang, Ning and Sun, Yi and Yang, Yi and Yuan, Hangjie},
  journal = {arXiv preprint arXiv:2601.01874},
  year    = {2026}
}