
Figure 3: Overview of the training method. The ❌ and ✅ symbols indicate incorrect and correct responses, respectively.
We use reinforcement learning to post-train models for both accuracy and token efficiency. For each prompt, multiple solutions are sampled and rewarded based on correctness and response length. The shortest correct answers receive the highest rewards, followed by longer correct ones, while incorrect responses receive the lowest rewards. The models are then updated using policy gradients.
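The update step can be illustrated with a short REINFORCE-style sketch. The snippet below is a minimal illustration rather than the authors' implementation: the group size, the example reward values, the mean-reward baseline, and the use of a plain policy-gradient loss (without the clipping or KL terms that PPO/GRPO-style trainers typically add) are all assumptions, and the summed log-probabilities of the sampled responses are stubbed with dummy tensors.

```python
import torch

def policy_gradient_loss(logprobs, rewards):
    """REINFORCE-style loss for one prompt: each sampled response is
    weighted by its baseline-subtracted reward.

    logprobs: (G,) summed log-probabilities of the G sampled responses
    rewards:  (G,) scalar rewards (shortest correct highest, incorrect lowest)
    """
    advantages = rewards - rewards.mean()           # simple group baseline (assumption)
    return -(advantages.detach() * logprobs).mean()

# Toy example: 4 sampled responses for one prompt.
# Responses 0 and 2 are correct (0 is shorter, so it gets the larger reward);
# responses 1 and 3 are incorrect and get the lowest reward.
logprobs = torch.tensor([-12.3, -20.1, -15.7, -9.8], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 0.6, 0.0])

loss = policy_gradient_loss(logprobs, rewards)
loss.backward()
print(loss.item(), logprobs.grad)
```

The gradient pushes up the likelihood of responses whose reward exceeds the group average (short correct ones) and pushes down the rest, which is exactly the ranking described above.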

Figure 4: Reward function used for training. Incorrect responses receive a reward of 0, while correct responses are rewarded more highly the shorter they are relative to other correct responses. To compute the self-normalized length penalty of a response \( y \) to a prompt \( x \), the mean \( \texttt{MEAN}(x) \) and variance \( \texttt{VAR}(x) \) of correct-response lengths are computed for each prompt \( x \). Here \( \mathbb{I}_{\text{correct}}(x, y) \) is the indicator that response \( y \) is correct.
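Since the caption describes the reward only qualitatively, the function below is a hedged reconstruction, not the formula from Figure 4: it assumes the self-normalized length penalty is the z-score of the response length under \( \texttt{MEAN}(x) \) and \( \texttt{VAR}(x) \), squashed with a sigmoid and scaled by a hypothetical coefficient `alpha` so that every correct response still scores above the 0 assigned to incorrect ones. The names `compute_reward` and `alpha` are illustrative.

```python
import math

def compute_reward(length, correct, mean_len, var_len, alpha=0.5):
    """Sketch of a Figure-4-style reward for one response y to prompt x.

    length:   token length of y
    correct:  I_correct(x, y) -- whether y is a correct response
    mean_len: MEAN(x), mean length of correct responses to x
    var_len:  VAR(x), variance of correct-response lengths for x
    alpha:    hypothetical weight on the length penalty (0 < alpha < 1)
    """
    if not correct:
        return 0.0                                             # incorrect responses get reward 0
    z = (length - mean_len) / math.sqrt(var_len + 1e-8)        # self-normalized length
    return 1.0 - alpha * (1.0 / (1.0 + math.exp(-z)))          # shorter => larger reward

# Correct responses of length 120 and 480 tokens, with MEAN(x)=300, VAR(x)=90000:
print(compute_reward(120, True, 300.0, 90000.0))   # ~0.82 (short, higher reward)
print(compute_reward(480, True, 300.0, 90000.0))   # ~0.68 (long, lower reward)
print(compute_reward(200, False, 300.0, 90000.0))  # 0.0   (incorrect)
```

Normalizing by per-prompt statistics means a response is penalized only for being long relative to other correct answers to the same prompt, so inherently hard prompts that need long solutions are not penalized across the board.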