TL;DR Large Reasoning Models (LRMs) implicitly know the appropriate time to stop thinking—we unlock this ability with SAGE, a self-aware sampling paradigm, and SAGE-RL, which integrates SAGE-discovered efficient reasoning patterns into standard pass@1 inference for +2.1% avg accuracy and 44.1% fewer tokens on six challenging mathematical benchmarks.
SAGE Unleashes Efficient Reasoning Potential. SAGE uncovers the optimal concise reasoning chains hidden in pass@k that are obscured by standard pass@1 sampling. SAGE-RL integrates these efficient patterns into LRMs, achieving higher accuracy with far fewer tokens across six challenging mathematical benchmarks (AIME, MATH-500, OlympiadBench, AMC23, etc.).
Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. Digging deeper into this phenomenon, we uncover and empirically verify a surprising fact: LRMs implicitly know the appropriate time to stop thinking, but this capability is obscured by current sampling paradigms.
Motivated by this insight, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential by leveraging the model's self-confidence to discover precise, concise reasoning chains. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) effectively incorporates SAGE-discovered efficient reasoning patterns into standard pass@1 inference. Extensive experiments on six challenging mathematical benchmarks (MATH-500, AIME 2024/2025, AMC23, OlympiadBench, Minerva) show that SAGE-RL markedly enhances both the reasoning accuracy and efficiency of LRMs, achieving an average +2.1% accuracy gain and 44.1% token reduction compared to state-of-the-art baselines.
Modern LRMs rely on long Chain of Thought (CoT) reasoning to solve complex mathematical and logical problems, enabled by Reinforcement Learning from Verifiable Rewards (RLVR) algorithms like GRPO and GSPO that incentivize "thinking longer". While this boosts performance on hard tasks, it introduces severe reasoning redundancy: models often continue generating steps long after deriving the correct answer, wasting computational resources and increasing latency.
To quantify the redundancy of reasoning chains, we define the RFCS metric: the step index of the first correct answer divided by the total number of reasoning steps. For over half of MATH-500 samples across all tested LRMs (DS-1.5B, DeepScaleR, Qwen3-8B), RFCS < 1—meaning the model finds the correct answer early but continues reasoning unnecessarily.
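The RFCS definition above can be sketched in a few lines. This is our own minimal illustration, not the paper's implementation; the `is_correct` predicate stands in for whatever answer checker is used to test each step.

```python
# Sketch of the RFCS metric: index of the first reasoning step that already
# contains the correct answer, divided by the total number of steps.
# RFCS < 1 means the model found the answer early but kept reasoning.
def rfcs(steps, is_correct):
    """steps: list of reasoning steps (strings).
    is_correct: predicate that checks whether a step contains the answer."""
    for i, step in enumerate(steps, start=1):
        if is_correct(step):
            return i / len(steps)
    return 1.0  # answer only appears at the end (or never early)

# Example: the answer first appears at step 3 of a 10-step chain.
steps = [f"step{k}" for k in range(1, 11)]
print(rfcs(steps, lambda s: s == "step3"))  # 0.3
```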

RFCS Statistics on MATH-500. All LRMs show a large fraction of ineffective reasoning steps (over 80% of samples have RFCS < 1) and a low average RFCS (~0.6), demonstrating widespread overthinking under standard pass@1 sampling.

Redundant Reasoning Example. The model derives the correct answer in 500 tokens but generates an additional 452 redundant tokens of double-checking—this is typical of LRMs under standard sampling.
Key Takeaway: The surprising performance of relatively shorter responses in pass@k reveals the inherent potential of the model for efficient reasoning. The pervasive redundancy of reasoning steps in pass@1 indicates that current sampling paradigms obscure this potential.
Through systematic experiments with token-wise reasoning path exploration, we uncover three critical observations that confirm LRMs have an inherent sense of optimal reasoning termination:
Observation 1: Using a cumulative log-probability score $\Phi$ to track high-confidence reasoning paths, we find that increasing the exploration width consistently reduces response length and improves accuracy. In contrast, paths selected by next-token probability ($\phi$) suffer rapid accuracy degradation (length collapse).
Observation 2: The eot (end-of-thinking) token consistently ranks first in high-confidence paths (selected by $\Phi$) when it appears, meaning the model is highly confident in stopping reasoning. For next-token-selected paths, the eot token's rank increases with exploration width, showing greater uncertainty about when to stop thinking. Greedy/random sampling misses these short, high-confidence chains.
Observation 3: As exploration width increases, LRMs converge to higher accuracy with shorter responses, and token efficiency (accuracy/length) rises steadily. This convergence holds across models (DS-1.5B, DeepScaleR) and datasets (MATH-500, AMC23), demonstrating a universal inherent efficient-reasoning capability.


Key Takeaway: LRMs have an innate ability to identify concise, correct reasoning chains—this capability is simply locked by current sampling paradigms that prioritize next-token probability over cumulative confidence.
Building on our core insight, we design SAGE—a simple, training-free sampling paradigm that unlocks the LRM's implicit efficient reasoning capability by leveraging cumulative self-confidence ($\Phi$) to discover concise, correct reasoning chains. SAGE replaces token-level exploration with step-wise reasoning chain exploration and uses the model's inherent confidence for automatic termination.
SAGE's core is the average cumulative log-probability score that measures the model's confidence in an entire reasoning chain (not just the next token):
$$\Phi(\mathbf{y}_{\le k})=\frac{1}{k} \sum_{i=1}^{k} \log \pi_\theta \bigl( y_i \mid \mathbf{y}_{\lt i}, \mathbf{x} \bigr)$$
Where $\mathbf{y}_{\leq k}$ is the reasoning chain up to step $k$, $\pi_\theta$ is the model's policy, and each summand is the per-token log-probability (the quantity $\phi$ used in next-token selection). This score captures the model's overall confidence in a reasoning path—critical for identifying high-quality, concise chains.
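Computationally, $\Phi$ is just the mean of the per-token log-probabilities accumulated along the chain. A minimal sketch (the log-prob values below are made up for illustration, not drawn from a real model):

```python
# Average cumulative log-probability score Phi for a reasoning chain.
# In practice `token_logprobs` would come from the model's policy pi_theta.
def phi_score(token_logprobs):
    """Higher (closer to 0) means higher overall confidence in the chain."""
    return sum(token_logprobs) / len(token_logprobs)

short_confident = [-0.1, -0.2, -0.1]        # concise, high-confidence chain
long_uncertain = [-0.1, -2.5, -1.8, -0.4]   # longer, low-confidence chain

# Phi prefers the short confident chain even though it has fewer tokens.
print(phi_score(short_confident) > phi_score(long_uncertain))  # True
```

Because $\Phi$ averages over the whole chain, a short chain the model is sure of can outscore a long chain padded with low-confidence tokens—exactly the behavior that next-token selection by $\phi$ alone misses.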
Step-wise expansion: SAGE extends candidate sequences by full reasoning steps (not individual tokens) using random sampling, maintaining the top-$m$ high-confidence sequences via $\Phi$. This aligns with human step-by-step reasoning and avoids token-level noise.
$$\mathbf{y}_{\leq i}^{(j, k)}=\mathbf{y}_{\leq i-1}^{(j)} \oplus r_{i}^{(j, k)}$$
Self-aware termination: reasoning terminates automatically when a candidate sequence ends with the eot (end-of-thinking) token—no manual tolerance parameters are needed. High-confidence paths lead to confident termination, so the model's own signal dictates stopping.
Add to the output set $O$ if $r_{i}^{(j, k)}$ ends with the eot token
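The two steps above amount to a step-level beam search ranked by $\Phi$ with eot-triggered termination. The sketch below is our hypothetical rendering, not the paper's code: `expand` stands in for the model, returning candidate next steps with the updated $\Phi$ score of each extended chain, and the eot marker string is an assumption.

```python
# Hypothetical sketch of SAGE's step-wise exploration.
EOT = "</think>"  # assumed end-of-thinking marker

def sage(prompt, expand, m=2, max_steps=50):
    """expand(seq) -> list of (phi_of_extended_chain, next_step_text)."""
    beams = [(0.0, prompt)]   # (Phi score, partial chain), top-m kept
    finished = []             # terminated chains: the output set O
    for _ in range(max_steps):
        candidates = []
        for _, seq in beams:
            for score, step in expand(seq):
                chain = seq + step
                if step.endswith(EOT):          # self-aware termination
                    finished.append((score, chain))
                else:
                    candidates.append((score, chain))
        if not candidates:
            break
        # keep the top-m sequences by cumulative confidence Phi
        beams = sorted(candidates, reverse=True)[:m]
    pool = finished or beams
    return max(pool)[1]  # most confident chain overall

def toy_expand(seq):
    """Toy stand-in for the model: one confident terminating step and
    one less confident continuation."""
    return [(-0.1, " answer " + EOT), (-0.5, " more reasoning")]

best = sage("Q: 2+2?", toy_expand)
```

With the toy model, the confident terminating step wins: `best` ends with the eot marker rather than continuing to reason.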
SAGE outperforms Degrade-SAGE (SAGE with greedy step sampling) across all step budgets, with two key scaling properties:

SAGE Inference Scaling on MATH-500 & AMC23. SAGE consistently outperforms Degrade-SAGE across step budgets, with larger gains on harder datasets (AMC23) and stronger models (DeepScaleR).
SAGE unlocks efficient reasoning during inference, but to incorporate these patterns into the model's core policy for standard pass@1 inference, we introduce SAGE-RL—a minimal modification to RLVR (GRPO/GSPO) that uses SAGE as mixed sampling in the rollout phase.
For each query, RLVR typically samples $G$ responses via random sampling. SAGE-RL instead draws $r$ of the $G$ rollouts with SAGE and the remaining $G-r$ via random sampling.
This is a one-line modification to existing RLVR implementations—no changes to the reward function, optimization objective, or training pipeline.
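A minimal sketch of the mixed rollout, with our own placeholder names (`sage_sample`, `random_sample` are stand-ins for the two samplers, not functions from the paper's codebase):

```python
# SAGE-RL mixed rollout: r SAGE-guided responses + (G - r) random ones.
def mixed_rollout(query, sage_sample, random_sample, G=8, r=2):
    """Return G rollouts for one query, mixing the two sampling modes."""
    rollouts = [sage_sample(query) for _ in range(r)]          # SAGE part
    rollouts += [random_sample(query) for _ in range(G - r)]   # random part
    return rollouts

# Toy samplers that just tag their origin, to show the split.
outs = mixed_rollout("q", lambda q: ("sage", q), lambda q: ("rand", q))
print(len(outs))  # 8
```

The rest of the RLVR loop (reward computation, advantage normalization, policy update) consumes these $G$ rollouts exactly as before, which is why the change is a one-liner.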
SAGE-RL retains the original GRPO/GSPO objectives but splits the rollout set into SAGE and random samples. For SAGE-GRPO, the objective is:
$$\mathcal{J}_{\text{SAGE-GRPO}}(\theta)=\mathbb{E}\Biggl[\frac{1}{G}\Biggl(\underbrace{\sum_{i=1}^{r} \frac{1}{|y_{i}|} \sum_{t=1}^{|y_{i}|} \min \bigl(w_{i,t}\hat{A}_{i,t},\ \operatorname{clip}(w_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_{i,t}\bigr)}_{\text{SAGE}(m,r)}+\underbrace{\sum_{i=r+1}^{G} \frac{1}{|y_{i}|} \sum_{t=1}^{|y_{i}|} \min \bigl(w_{i,t}\hat{A}_{i,t},\ \operatorname{clip}(w_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_{i,t}\bigr)}_{\text{Random sampling}}\Biggr)\Biggr]$$
Where $w_{i,t}$ is the token-level importance ratio, $\hat{A}_{i,t}$ is the advantage, and $G=8$, $r=2$ (our default) balances efficiency and exploration.
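The per-token clipped surrogate inside both sums can be illustrated with plain floats (a real implementation would operate on tensors; the helper names here are ours):

```python
# Per-token clipped surrogate: min(w*A, clip(w, 1-eps, 1+eps)*A),
# the standard PPO/GRPO-style trust-region term used in the objective.
def clipped_term(w, A, eps=0.2):
    w_clip = max(1.0 - eps, min(1.0 + eps, w))
    return min(w * A, w_clip * A)

def grpo_objective(responses, eps=0.2):
    """responses: list of (ratios, advantages) per rollout (SAGE or random).
    Returns the length-normalized average over the G responses."""
    total = 0.0
    for ratios, advs in responses:
        per_token = [clipped_term(w, A, eps) for w, A in zip(ratios, advs)]
        total += sum(per_token) / len(ratios)   # 1/|y_i| normalization
    return total / len(responses)               # 1/G normalization

# One overshooting ratio gets capped at 1 + eps = 1.2; the other passes through.
val = grpo_objective([([1.5], [1.0]), ([1.0], [-1.0])])
```

Because the SAGE and random rollouts share the same reward, advantage, and clipping machinery, the split in the objective changes only where the rollouts come from, not how they are optimized.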
SAGE-RL exhibits distinct training behavior compared to vanilla RLVR:


SAGE-RL Performance Scaling with Problem Difficulty. SAGE-RL achieves larger accuracy gains on harder benchmarks (AIME 2025, OlympiadBench) compared to easier ones (MATH-500), demonstrating its effectiveness on challenging reasoning tasks.
We measured the inference latency of SAGE, and the approximate latency of SAGE-RL-tuned models, under different exploration widths.

Once the exploration width exceeds 2, inference time begins to grow sharply. We therefore primarily set the exploration width $m = 2$, the transition point between the slow-growth and fast-growth regimes, to strike a balanced trade-off between efficiency and performance.
Through experiments, we investigated the impact of different hyperparameter combinations on SAGE-RL performance and obtained the following key findings:


Increasing the number of SAGE rollouts per group $r$ from 1 to 2 yields only limited performance gains and has minimal impact on policy updates, since rollouts with similar reasoning trajectories provide little additional information.
Increasing the exploration width $m$ from 1 to 2 brings significant improvements in both performance and efficiency. With a small $m$, SAGE-RL's optimization behavior approaches standard GRPO, confirming the critical role of exploration width in activating the model's efficient reasoning capability.
Among the hyperparameter combinations tested, SAGE(2,2)-GRPO achieves the best overall performance, striking a favorable balance between performance and exploration efficiency.
In this work, we make a surprising discovery: Large Reasoning Models (LRMs) implicitly know when to stop thinking. This capability is obscured by current sampling paradigms that prioritize next-token probability over cumulative confidence, leading to redundant reasoning and inefficient computation.
To unlock this potential, we introduce SAGE, a self-aware sampling paradigm that leverages cumulative self-confidence to discover concise, correct reasoning chains. We further integrate SAGE into reinforcement learning via SAGE-RL, a minimal modification to RLVR that incorporates efficient reasoning patterns into the model's core policy.
SAGE and SAGE-RL provide a new path toward efficient reasoning in LRMs, demonstrating that models can achieve both higher accuracy and lower computational cost by simply learning to trust their own confident reasoning chains.
@article{huang2026does,
title={Does Your Reasoning Model Implicitly Know When to Stop Thinking?},
author={Huang, Zixuan and Xia, Xin and Ren, Yuxi and Zheng, Jianbin and Wang, Xuanda and Zhang, Zhixia and Xie, Hongyan and Liang, Songshi and Chen, Zehao and Xiao, Xuefeng and others},
journal={arXiv preprint arXiv:2602.08354},
year={2026}
}