SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning

Ruiqi Zhang, Daman Arora, Song Mei, Andrea Zanette

UC Berkeley, Carnegie Mellon University

We introduce SPEED, an online curriculum learning framework to accelerate reinforcement learning (RL) training of large reasoning models by training on prompts with high learning signal.

  • Traditional RL training often expends most of the compute resources on prompts with low learning signals (pass rates near 0% or 100%), which have low signal-to-noise ratio (SNR), a notion formalized later in the blog post.
  • SPEED efficiently identifies and excludes these low-SNR prompts before creating training batches for the RL trainer.
  • This ensures computational resources (inference and gradient updates) are focused on useful prompts (high-SNR ones).
  • SPEED is compatible with rule-based RL algorithms such as GRPO, REINFORCE, RLOO, and DAPO, achieving average training speedups of 2x to 6x over them.
Teaser image
Figure 1: SPEED expends some compute (left figure, red region) to identify and exclude low-signal (low-SNR) prompts from the training batch, ensuring the majority of compute is effectively utilized on informative prompts. This yields an average 4× speedup across various benchmarks and training configurations (right figure; see paper for details).

Introduction

Training large language models (LLMs) with reinforcement learning (RL) against verifiable rewards significantly enhances their reasoning capabilities. However, such methods remain computationally expensive, primarily due to inefficient uniform sampling of training prompts and the associated heavy inference costs.

  • In typical datasets (e.g., the one used in DAPO), many prompts are either trivially easy or excessively difficult relative to the model's current training state.
  • Most of the training time is spent on LLM inference, and a large share of it goes to generating all-correct or all-incorrect completions for exactly these prompts.
Pass rate distribution and timing
Figure 2: Pass rate distribution in DAPO-17k evaluated by Qwen2.5-Math-1.5B (left) and Qwen2.5-Math-7B (middle). Most of the prompts have close to zero pass rate, i.e., they are too difficult for the model. Right: Average per-step inference and training times while running RLOO on the Qwen2.5-Math-7B model. More than 2/3 of the training time is spent on inference.

Intuitively, if the model consistently succeeds or fails, it learns little. Formally, prompts with pass rates of 0% or 100% produce zero gradients, since the advantage of every completion is zero:

\[ \nabla_{\theta} J_\text{prompt}(\theta) = \mathbb{E}_{\text{response} \sim \pi(\cdot \mid \text{prompt})} \left[ \underbrace{A(\text{response})}_{=0} \nabla_{\theta} \log \pi(\text{response} \mid \text{prompt}) \right] = 0, \quad\quad \text{if the pass rate is } 0\% \text{ or } 100\%. \]

Consequently, such prompts can be excluded from training without losing any learning signal.
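As a concrete illustration, here is a minimal numerical sketch with a group-mean baseline (GRPO and RLOO use closely related baselines; group_advantages is a toy helper, not the training code). Whenever every completion in a group receives the same reward, every advantage, and hence the policy-gradient contribution, is zero:

    import numpy as np

    def group_advantages(rewards):
        # Group-mean baseline: each completion's reward minus the group's mean reward.
        rewards = np.asarray(rewards, dtype=float)
        return rewards - rewards.mean()

    print(group_advantages([1, 1, 1, 1]))  # 100% pass rate -> [0. 0. 0. 0.], zero gradient
    print(group_advantages([0, 0, 0, 0]))  #   0% pass rate -> [0. 0. 0. 0.], zero gradient
    print(group_advantages([1, 0, 1, 0]))  #  50% pass rate -> [ 0.5 -0.5  0.5 -0.5]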

SPEED-RL: Algorithm Design

We introduce Selective Prompting with Efficient Estimation of Difficulty (SPEED), which implements a curriculum learning strategy that efficiently screens prompts to identify those with intermediate pass rates—thus maximizing the signal-to-noise ratio (SNR)—without performing full inference on them (for more details on the Signal-to-Noise Ratio, see the end of the blog).

SPEED breaks down inference into two phases:

  • Screening Phase: Generate a small number (e.g., Ninit = 4) of completions for each prompt. Prompts clearly at the extremes (estimated pass rate of 0% or 100%) are immediately excluded.
  • Continuation Phase: Perform extensive inference (e.g., Ncont = 20 additional completions per prompt) only on the intermediate-difficulty prompts identified as informative.

This targeted approach allocates the majority of compute to high-value prompts. To avoid the overhead of multiple inference calls, SPEED combines both phases into a single inference batch: it simultaneously completes the continuation phase for current prompts and the preliminary screening phase for future prompts within a single call to the inference engine (e.g., vLLM). See the paper for additional technical details. Moreover, SPEED integrates seamlessly with popular rule-based RL algorithms, including RLOO, GRPO, and DAPO.
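The selection logic can be summarized with a short sketch (not the authors' implementation; generate and grade are hypothetical stand-ins for the inference engine and the rule-based verifier):

    def speed_select_prompts(prompts, generate, grade, n_init=4, n_cont=20):
        # Return (prompt, completions) pairs for intermediate-difficulty prompts only.
        selected = []
        for prompt in prompts:
            # Screening phase: a few cheap completions to estimate the pass rate.
            screen = generate(prompt, n=n_init)
            n_correct = sum(grade(prompt, c) for c in screen)
            if n_correct == 0 or n_correct == n_init:
                continue  # estimated pass rate of 0% or 100%: low SNR, skip
            # Continuation phase: spend the bulk of inference on the kept prompts.
            completions = screen + generate(prompt, n=n_cont)
            selected.append((prompt, completions))
        return selected

Note that this loop only conveys the selection criterion: in SPEED itself, the continuation phase for the current prompts and the screening phase for future prompts are fused into one batched call to the inference engine.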

Algorithm schematic
Algorithm schematic: Simplified two-phase SPEED procedure.

Experiments

We demonstrate SPEED's efficacy by comparing the wall-clock time required to reach a given target performance for baseline RL algorithms and for their SPEED variants.

  • Baselines: RLOO and DAPO. We compare RLOO vs. SPEED-RLOO and DAPO vs. SPEED-DAPO.
  • Training Datasets: DeepScaleR, NuminaMath, and DAPO-17k (with a 1k held-out set excluded).
  • Benchmarks: the 1k held-out set from DAPO-17k, MATH500, AMC23, AIME 2024, and AIME 2025.

Empirically, SPEED variants substantially accelerate training compared to the baseline methods, by 2x to 6x. For detailed empirical results, please refer to our full paper.

Main experimental figure
Figure 3: Validation accuracy on various mathematical reasoning benchmarks for SPEED variants of RL algorithms and base RL algorithms. Top: RLOO versus SPEED-RLOO; bottom: DAPO versus SPEED-DAPO. The initial model used is Qwen2.5-Math-7B, trained on the DeepScaleR dataset.

Insights: Why Moderately Difficult Prompts?

We provide an information-theoretic justification for training on intermediate-difficulty prompts. Recall that prompts with 0% or 100% pass rates yield zero advantage and hence zero gradient. What happens on prompts with pass rates close to these extremes?

Let \(J(\theta)\) be the objective function (on some prompt \(x\)). To update the model's parameters, we compute a stochastic estimate of the gradient of \(J(\theta)\). We quantify the information value of a prompt through the Signal-to-Noise Ratio (SNR), defined as the ratio between the squared norm of the gradient and the variance of its stochastic estimator.
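Written out in our notation (a formalization consistent with the description above; \(g\) denotes an unbiased stochastic estimate of \(\nabla_{\theta} J(\theta)\)):

\[ \text{SNR} = \frac{\left\| \nabla_{\theta} J(\theta) \right\|^2}{\mathbb{E}\left[ \left\| g - \nabla_{\theta} J(\theta) \right\|^2 \right]}. \]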

A corollary of the standard analysis of stochastic gradient descent shows that the expected reward improvement can be lower-bounded by a function of the SNR. More specifically, let \(\theta\) denote the model parameters and \(\theta^{+}\) the parameters after one update. Then, assuming the stochastic gradient estimator is unbiased and \(J(\theta)\) is 1-smooth, the expected reward improvement satisfies: \[ \mathbb{E}\left[J(\theta^+)\right] - J(\theta) \geq \frac{1}{2} \left\| \nabla_{\theta} J(\theta) \right\|^2 \left(1 - \frac{1}{\text{SNR}}\right). \] Therefore, if the SNR approaches zero, variance dominates and little improvement is expected from a single step.
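For intuition, here is a sketch of the standard one-step argument behind this bound, assuming, for this sketch, a unit step size \(\theta^{+} = \theta + g\). By 1-smoothness, \(J(\theta + g) \geq J(\theta) + \langle \nabla_{\theta} J(\theta), g \rangle - \tfrac{1}{2}\|g\|^2\). Taking expectations with \(\mathbb{E}[g] = \nabla_{\theta} J(\theta)\) and \(\mathbb{E}\|g\|^2 = \|\nabla_{\theta} J(\theta)\|^2 + \mathrm{Var}(g)\) gives

\[ \mathbb{E}\left[J(\theta^+)\right] - J(\theta) \geq \left\| \nabla_{\theta} J(\theta) \right\|^2 - \frac{1}{2}\left( \left\| \nabla_{\theta} J(\theta) \right\|^2 + \mathrm{Var}(g) \right) = \frac{1}{2} \left\| \nabla_{\theta} J(\theta) \right\|^2 \left(1 - \frac{1}{\text{SNR}}\right). \]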

In fact, we prove that the pass rate of a prompt is tightly linked to its SNR: prompts with extreme pass rates (near 0% or 100%) have very low SNR and thus yield little improvement according to the bound above.

Theorem (Fundamental Connection between SNR and Pass Rate, Informal) Fix a prompt. Let \( p \) denote its (expected) pass rate under the current policy, i.e., the probability that a sampled completion solves the question. We generate \( N \) completions for the prompt independently. The SNR of the stochastic gradient estimator (when \( N \geq 3 \) and either \( p < 1/4 \) or \( p > 3/4 \)) satisfies the bound: \[ \text{SNR} \leq 4N \cdot p (1 - p). \] Moreover, for fixed \( N \), we have: \[ \lim_{p \to 1} \text{SNR} = \lim_{p \to 0} \text{SNR} = 0. \]

This result is significant: it establishes that increasing the number of samples \( N \) increases the SNR at most at a linear rate, as we'd expect, but only in proportion to the pass rate \( p \) when \( p \) is close to zero. In that regime we have the upper bound \( \text{SNR} \leq 4Np \), while the cost of inference is proportional to \( N \). This suggests that the "utility / compute cost" ratio of a prompt scales at most linearly with its pass rate \( p \), justifying the idea of saving compute on prompts with \( p \approx 0\% \). Similar considerations apply for prompts with pass rates around \( 100\% \).
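To put rough numbers on the bound (purely illustrative; N = 20 completions and pass rates within the theorem's regime \( p < 1/4 \)):

    def snr_upper_bound(n, p):
        # Upper bound from the theorem: SNR <= 4 * N * p * (1 - p).
        return 4 * n * p * (1 - p)

    for p in [0.01, 0.05, 0.10, 0.20]:
        print(f"p = {p:.2f}: SNR <= {snr_upper_bound(20, p):.2f}")
    # p = 0.01: SNR <= 0.79
    # p = 0.05: SNR <= 3.80
    # p = 0.10: SNR <= 7.20
    # p = 0.20: SNR <= 12.80

Even with 20 completions, a prompt the model solves only 1% of the time has an SNR below one, while its inference cost is the same as that of an informative prompt.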

Citation

If you find this work useful, please cite it as follows:

  @misc{zhang2025speedrlfastertrainingreasoning,
    title={SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning},
    author={Ruiqi Zhang and Daman Arora and Song Mei and Andrea Zanette},
    year={2025},
    eprint={2506.09016},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2506.09016}
  }