
Figure 3: Overview of the training method. The ❌ and ✅ symbols indicate incorrect and correct responses, respectively.
We use reinforcement learning to post-train models for both accuracy and token efficiency. For each prompt, multiple solutions are sampled and rewarded based on correctness and response length. The shortest correct answers receive the highest rewards, followed by longer correct ones, while incorrect responses receive the lowest rewards. The models are then updated using policy gradients.
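The update step can be illustrated with a short REINFORCE-style sketch. The snippet below is a minimal illustration rather than the authors' implementation: the group size, the example reward values, the mean-reward baseline, and the use of a plain policy-gradient loss (without the clipping or KL terms that PPO/GRPO-style trainers typically add) are all assumptions, and the summed log-probabilities of the sampled responses are stubbed with dummy tensors.

```python
import torch

def policy_gradient_loss(logprobs, rewards):
    """REINFORCE-style loss for one prompt: each sampled response is
    weighted by its baseline-subtracted reward.

    logprobs: (G,) summed log-probabilities of the G sampled responses
    rewards:  (G,) scalar rewards (shortest correct highest, incorrect lowest)
    """
    advantages = rewards - rewards.mean()           # simple group baseline (assumption)
    return -(advantages.detach() * logprobs).mean()

# Toy example: 4 sampled responses for one prompt.
# Responses 0 and 2 are correct (0 is shorter, so it gets the larger reward);
# responses 1 and 3 are incorrect and get the lowest reward.
logprobs = torch.tensor([-12.3, -20.1, -15.7, -9.8], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 0.6, 0.0])

loss = policy_gradient_loss(logprobs, rewards)
loss.backward()
print(loss.item(), logprobs.grad)
```

The gradient pushes up the likelihood of responses whose reward exceeds the group average (short correct ones) and pushes down the rest, which is exactly the ranking described above.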

Figure 4: Reward function used for training. Incorrect responses receive a reward of 0, while correct responses are rewarded more highly the shorter they are relative to other correct responses. To compute the self-normalized length penalty of a response \( y \) to a prompt \( x \), the mean \( \texttt{MEAN}(x) \) and variance \( \texttt{VAR}(x) \) of correct-response lengths are computed for each prompt \( x \). Here \( \mathbb{I}_{\text{correct}}(x, y) \) is the indicator that response \( y \) is correct.
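Since the caption describes the reward only qualitatively, the function below is a hedged reconstruction, not the formula from Figure 4: it assumes the self-normalized length penalty is the z-score of the response length under \( \texttt{MEAN}(x) \) and \( \texttt{VAR}(x) \), squashed with a sigmoid and scaled by a hypothetical coefficient `alpha` so that every correct response still scores above the 0 assigned to incorrect ones. The names `compute_reward` and `alpha` are illustrative.

```python
import math

def compute_reward(length, correct, mean_len, var_len, alpha=0.5):
    """Sketch of a Figure-4-style reward for one response y to prompt x.

    length:   token length of y
    correct:  I_correct(x, y) -- whether y is a correct response
    mean_len: MEAN(x), mean length of correct responses to x
    var_len:  VAR(x), variance of correct-response lengths for x
    alpha:    hypothetical weight on the length penalty (0 < alpha < 1)
    """
    if not correct:
        return 0.0                                             # incorrect responses get reward 0
    z = (length - mean_len) / math.sqrt(var_len + 1e-8)        # self-normalized length
    return 1.0 - alpha * (1.0 / (1.0 + math.exp(-z)))          # shorter => larger reward

# Correct responses of length 120 and 480 tokens, with MEAN(x)=300, VAR(x)=90000:
print(compute_reward(120, True, 300.0, 90000.0))   # ~0.82 (short, higher reward)
print(compute_reward(480, True, 300.0, 90000.0))   # ~0.68 (long, lower reward)
print(compute_reward(200, False, 300.0, 90000.0))  # 0.0   (incorrect)
```

Normalizing by per-prompt statistics means a response is penalized only for being long relative to other correct answers to the same prompt, so inherently hard prompts that need long solutions are not penalized across the board.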