TL;DR Large Reasoning Models (LRMs) implicitly know the appropriate time to stop thinking—we unlock this ability with SAGE, a self-aware sampling paradigm, and SAGE-RL, which integrates SAGE-discovered efficient reasoning patterns into standard pass@1 inference for +2.1% avg accuracy and 44.1% fewer tokens on six challenging mathematical benchmarks.
SAGE Unleashes Efficient Reasoning Potential. SAGE uncovers the optimal concise reasoning chains hidden in pass@k that are obscured by standard pass@1 sampling. SAGE-RL integrates these efficient patterns into LRMs, achieving higher accuracy with far fewer tokens across six challenging mathematical benchmarks (AIME, MATH-500, OlympiadBench, AMC23, etc.).
Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. Digging deeper into this phenomenon, we uncover and empirically verify a surprising fact: LRMs implicitly know the appropriate time to stop thinking, but this capability is obscured by current sampling paradigms.
Motivated by this insight, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential by leveraging the model's self-confidence to discover precise, concise reasoning chains. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) effectively incorporates SAGE-discovered efficient reasoning patterns into standard pass@1 inference. Extensive experiments on six challenging mathematical benchmarks (MATH-500, AIME 2024/2025, AMC23, OlympiadBench, Minerva) show that SAGE-RL markedly enhances both the reasoning accuracy and efficiency of LRMs, achieving an average +2.1% accuracy gain and 44.1% token reduction compared to state-of-the-art baselines.
Modern LRMs rely on long Chain of Thought (CoT) reasoning to solve complex mathematical and logical problems, enabled by Reinforcement Learning from Verifiable Rewards (RLVR) algorithms like GRPO and GSPO that incentivize "thinking longer". While this boosts performance on hard tasks, it introduces severe reasoning redundancy: models often continue generating steps long after deriving the correct answer, wasting computational resources and increasing latency.
To quantify the redundancy of reasoning chains, we define the RFCS metric: the step index of the first correct answer divided by the total number of reasoning steps. For over half of MATH-500 samples across all tested LRMs (DS-1.5B, DeepScaleR, Qwen3-8B), RFCS < 1—meaning the model finds the correct answer early but continues reasoning unnecessarily.
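The RFCS definition above can be sketched in a few lines. This is our own minimal illustration, not the paper's implementation; the `is_correct` predicate stands in for whatever answer checker is used to test each step.

```python
# Sketch of the RFCS metric: index of the first reasoning step that already
# contains the correct answer, divided by the total number of steps.
# RFCS < 1 means the model found the answer early but kept reasoning.
def rfcs(steps, is_correct):
    """steps: list of reasoning steps (strings).
    is_correct: predicate that checks whether a step contains the answer."""
    for i, step in enumerate(steps, start=1):
        if is_correct(step):
            return i / len(steps)
    return 1.0  # answer only appears at the end (or never early)

# Example: the answer first appears at step 3 of a 10-step chain.
steps = [f"step{k}" for k in range(1, 11)]
print(rfcs(steps, lambda s: s == "step3"))  # 0.3
```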

RFCS Statistics on MATH-500. All LRMs show a large fraction of ineffective reasoning steps (over 80% of samples have RFCS < 1) and a low average RFCS (~0.6), demonstrating widespread overthinking under standard pass@1 sampling.

Redundant Reasoning Example. The model derives the correct answer in 500 tokens but generates an additional 452 redundant tokens of double-checking—this is typical of LRMs under standard sampling.
Key Takeaway: The surprising performance of relatively shorter responses in pass@k reveals the inherent potential of the model for efficient reasoning. The pervasive redundancy of reasoning steps in pass@1 indicates that current sampling paradigms obscure this potential.
Through systematic experiments with token-wise reasoning path exploration, we uncover three critical observations that confirm LRMs have an inherent sense of optimal reasoning termination:
Observation 1: Using a cumulative log-probability score $\Phi$ to track high-confidence reasoning paths, we find that increasing the exploration width consistently reduces response length and improves accuracy. In contrast, paths selected by next-token probability ($\phi$) suffer rapid accuracy degradation (length collapse).
Observation 2: The eot (end-of-thinking) token consistently ranks first in high-confidence paths (selected by $\Phi$) when it appears, meaning the model is highly confident in stopping reasoning. For next-token-selected paths, the eot token's rank increases with exploration width, showing greater uncertainty about when to stop thinking. Greedy/random sampling misses these short, high-confidence chains.
Observation 3: As exploration width increases, LRMs converge to higher accuracy with shorter responses, and token efficiency (accuracy/length) rises steadily. This convergence holds across models (DS-1.5B, DeepScaleR) and datasets (MATH-500, AMC23), demonstrating a universal inherent efficient-reasoning capability.


Key Takeaway: LRMs have an innate ability to identify concise, correct reasoning chains—this capability is simply locked by current sampling paradigms that prioritize next-token probability over cumulative confidence.
Building on our core insight, we design SAGE—a simple, training-free sampling paradigm that unlocks the LRM's implicit efficient reasoning capability by leveraging cumulative self-confidence ($\Phi$) to discover concise, correct reasoning chains. SAGE replaces token-level exploration with step-wise reasoning chain exploration and uses the model's inherent confidence for automatic termination.
SAGE's core is the average cumulative log-probability score that measures the model's confidence in an entire reasoning chain (not just the next token):
$$\Phi(\mathbf{y}_{\le k})=\frac{1}{k} \sum_{i=1}^{k} \log \pi_\theta \bigl( y_i \mid \mathbf{y}_{\lt i}, \mathbf{x} \bigr)$$
Where $\mathbf{y}_{\leq k}$ is the reasoning chain up to step $k$, $\pi_\theta$ is the model's policy, and each summand is the per-token log-probability (the quantity $\phi$ used in next-token selection). This score captures the model's overall confidence in a reasoning path—critical for identifying high-quality, concise chains.
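Computationally, $\Phi$ is just the mean of the per-token log-probabilities accumulated along the chain. A minimal sketch (the log-prob values below are made up for illustration, not drawn from a real model):

```python
# Average cumulative log-probability score Phi for a reasoning chain.
# In practice `token_logprobs` would come from the model's policy pi_theta.
def phi_score(token_logprobs):
    """Higher (closer to 0) means higher overall confidence in the chain."""
    return sum(token_logprobs) / len(token_logprobs)

short_confident = [-0.1, -0.2, -0.1]        # concise, high-confidence chain
long_uncertain = [-0.1, -2.5, -1.8, -0.4]   # longer, low-confidence chain

# Phi prefers the short confident chain even though it has fewer tokens.
print(phi_score(short_confident) > phi_score(long_uncertain))  # True
```

Because $\Phi$ averages over the whole chain, a short chain the model is sure of can outscore a long chain padded with low-confidence tokens—exactly the behavior that next-token selection by $\phi$ alone misses.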
Step-wise expansion: SAGE extends candidate sequences by full reasoning steps (not individual tokens) using random sampling, maintaining the top-$m$ high-confidence sequences via $\Phi$. This aligns with human step-by-step reasoning and avoids token-level noise.
$$\mathbf{y}_{\leq i}^{(j, k)}=\mathbf{y}_{\leq i-1}^{(j)} \oplus r_{i}^{(j, k)}$$
Self-aware termination: reasoning terminates automatically when a candidate sequence ends with the eot (end-of-thinking) token—no manual tolerance parameters are needed. High-confidence paths lead to confident termination, so the model's own signal dictates stopping.
Add to the output set $O$ if $r_{i}^{(j, k)}$ ends with the eot token
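The two steps above amount to a step-level beam search ranked by $\Phi$ with eot-triggered termination. The sketch below is our hypothetical rendering, not the paper's code: `expand` stands in for the model, returning candidate next steps with the updated $\Phi$ score of each extended chain, and the eot marker string is an assumption.

```python
# Hypothetical sketch of SAGE's step-wise exploration.
EOT = "</think>"  # assumed end-of-thinking marker

def sage(prompt, expand, m=2, max_steps=50):
    """expand(seq) -> list of (phi_of_extended_chain, next_step_text)."""
    beams = [(0.0, prompt)]   # (Phi score, partial chain), top-m kept
    finished = []             # terminated chains: the output set O
    for _ in range(max_steps):
        candidates = []
        for _, seq in beams:
            for score, step in expand(seq):
                chain = seq + step
                if step.endswith(EOT):          # self-aware termination
                    finished.append((score, chain))
                else:
                    candidates.append((score, chain))
        if not candidates:
            break
        # keep the top-m sequences by cumulative confidence Phi
        beams = sorted(candidates, reverse=True)[:m]
    pool = finished or beams
    return max(pool)[1]  # most confident chain overall

def toy_expand(seq):
    """Toy stand-in for the model: one confident terminating step and
    one less confident continuation."""
    return [(-0.1, " answer " + EOT), (-0.5, " more reasoning")]

best = sage("Q: 2+2?", toy_expand)
```

With the toy model, the confident terminating step wins: `best` ends with the eot marker rather than continuing to reason.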
SAGE outperforms Degrade-SAGE (SAGE with greedy step sampling) across all step budgets, with two key scaling properties:

SAGE Inference Scaling on MATH-500 & AMC23. SAGE consistently outperforms Degrade-SAGE across step budgets, with larger gains on harder datasets (AMC23) and stronger models (DeepScaleR).
SAGE unlocks efficient reasoning during inference, but to incorporate these patterns into the model's core policy for standard pass@1 inference, we introduce SAGE-RL—a minimal modification to RLVR (GRPO/GSPO) that uses SAGE as mixed sampling in the rollout phase.
For each query, RLVR typically samples $G$ responses via random sampling. SAGE-RL instead draws $r$ of the $G$ rollouts with SAGE and the remaining $G-r$ via random sampling.
This is a one-line modification to existing RLVR implementations—no changes to the reward function, optimization objective, or training pipeline.
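A minimal sketch of the mixed rollout, with our own placeholder names (`sage_sample`, `random_sample` are stand-ins for the two samplers, not functions from the paper's codebase):

```python
# SAGE-RL mixed rollout: r SAGE-guided responses + (G - r) random ones.
def mixed_rollout(query, sage_sample, random_sample, G=8, r=2):
    """Return G rollouts for one query, mixing the two sampling modes."""
    rollouts = [sage_sample(query) for _ in range(r)]          # SAGE part
    rollouts += [random_sample(query) for _ in range(G - r)]   # random part
    return rollouts

# Toy samplers that just tag their origin, to show the split.
outs = mixed_rollout("q", lambda q: ("sage", q), lambda q: ("rand", q))
print(len(outs))  # 8
```

The rest of the RLVR loop (reward computation, advantage normalization, policy update) consumes these $G$ rollouts exactly as before, which is why the change is a one-liner.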
SAGE-RL retains the original GRPO/GSPO objectives but splits the rollout set into SAGE and random samples. For SAGE-GRPO, the objective is:
$$\mathcal{J}_{\text{SAGE-GRPO}}(\theta)=\mathbb{E}\Biggl[\frac{1}{G}\Biggl(\underbrace{\sum_{i=1}^{r} \frac{1}{|y_{i}|} \sum_{t=1}^{|y_{i}|} \min \bigl(w_{i,t}\hat{A}_{i,t},\ \operatorname{clip}(w_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_{i,t}\bigr)}_{\text{SAGE}(m,r)}+\underbrace{\sum_{i=r+1}^{G} \frac{1}{|y_{i}|} \sum_{t=1}^{|y_{i}|} \min \bigl(w_{i,t}\hat{A}_{i,t},\ \operatorname{clip}(w_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_{i,t}\bigr)}_{\text{Random sampling}}\Biggr)\Biggr]$$
Where $w_{i,t}$ is the token-level importance ratio, $\hat{A}_{i,t}$ is the advantage, and $G=8$, $r=2$ (our default) balances efficiency and exploration.
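The per-token clipped surrogate inside both sums can be illustrated with plain floats (a real implementation would operate on tensors; the helper names here are ours):

```python
# Per-token clipped surrogate: min(w*A, clip(w, 1-eps, 1+eps)*A),
# the standard PPO/GRPO-style trust-region term used in the objective.
def clipped_term(w, A, eps=0.2):
    w_clip = max(1.0 - eps, min(1.0 + eps, w))
    return min(w * A, w_clip * A)

def grpo_objective(responses, eps=0.2):
    """responses: list of (ratios, advantages) per rollout (SAGE or random).
    Returns the length-normalized average over the G responses."""
    total = 0.0
    for ratios, advs in responses:
        per_token = [clipped_term(w, A, eps) for w, A in zip(ratios, advs)]
        total += sum(per_token) / len(ratios)   # 1/|y_i| normalization
    return total / len(responses)               # 1/G normalization

# One overshooting ratio gets capped at 1 + eps = 1.2; the other passes through.
val = grpo_objective([([1.5], [1.0]), ([1.0], [-1.0])])
```

Because the SAGE and random rollouts share the same reward, advantage, and clipping machinery, the split in the objective changes only where the rollouts come from, not how they are optimized.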
SAGE-RL exhibits distinct training behavior compared to vanilla RLVR:


SAGE-RL Performance Scaling with Problem Difficulty. SAGE-RL achieves larger accuracy gains on harder benchmarks (AIME 2025, OlympiadBench) compared to easier ones (MATH-500), demonstrating its effectiveness on challenging reasoning tasks.
We measured the inference latency of SAGE, and the approximate latency of SAGE-RL-tuned models, under different exploration widths.

Once the exploration width exceeds 2, inference time begins to grow sharply. We therefore primarily set the exploration width $m = 2$, the transition point between the slow-growth and fast-growth regimes, to strike a balanced trade-off between efficiency and performance.
Through experiments, we investigated the impact of different hyperparameter combinations on SAGE-RL performance and obtained the following key findings:


Increasing the number of SAGE rollouts per group $r$ from 1 to 2 yields only limited performance gains and has minimal impact on policy updates, since rollouts with similar reasoning trajectories provide little additional information.
Increasing the exploration width $m$ from 1 to 2 brings significant improvements in both performance and efficiency. With a small $m$, SAGE-RL's optimization behavior approaches standard GRPO, confirming the critical role of exploration width in activating the model's efficient reasoning capability.
Among the hyperparameter combinations tested, SAGE(2,2)-GRPO achieves the best overall performance, striking a favorable balance between performance and exploration efficiency.
In this work, we make a surprising discovery: Large Reasoning Models (LRMs) implicitly know when to stop thinking. This capability is obscured by current sampling paradigms that prioritize next-token probability over cumulative confidence, leading to redundant reasoning and inefficient computation.
To unlock this potential, we introduce SAGE, a self-aware sampling paradigm that leverages cumulative self-confidence to discover concise, correct reasoning chains. We further integrate SAGE into reinforcement learning via SAGE-RL, a minimal modification to RLVR that incorporates efficient reasoning patterns into the model's core policy.
SAGE and SAGE-RL provide a new path toward efficient reasoning in LRMs, demonstrating that models can achieve both higher accuracy and lower computational cost by simply learning to trust their own confident reasoning chains.
@article{huang2026does,
title={Does Your Reasoning Model Implicitly Know When to Stop Thinking?},
author={Huang, Zixuan and Xia, Xin and Ren, Yuxi and Zheng, Jianbin and Wang, Xuanda and Zhang, Zhixia and Xie, Hongyan and Liang, Songshi and Chen, Zehao and Xiao, Xuefeng and others},
journal={arXiv preprint arXiv:2602.08354},
year={2026}
}