Maximum Likelihood Reinforcement Learning

1CMU 2Tsinghua University 3Zhejiang University 4UC Berkeley 5Impossible, Inc. *Equal Contribution

TL;DR: A framework to optimize maximum likelihood with reinforcement learning.

MaxRL Teaser
MaxRL Results on Qwen3-4B

Large-scale results on Qwen3-4B. MaxRL Pareto-dominates GRPO across all benchmarks, achieving similar or better Pass@1 while simultaneously improving Pass@K. This translates into 2.3×–19.2× gains in test-time scaling efficiency.

We introduce Maximum Likelihood Reinforcement Learning (MaxRL), a framework that enables reinforcement learning to perform maximum likelihood optimization.

Why Maximum Likelihood?

Maximum likelihood has proven to be a principled objective for supervised learning: it reliably translates increases in model capacity, data, and compute into consistent performance improvements. In contrast, many modern learning problems such as code generation, mathematical reasoning, and multi-step planning involve non-differentiable generation but admit an implicit binary notion of correctness. For each input $x$, the model induces a probability of success $p_\theta(x) = p_\theta(y^* \mid x)$ over the correct answer $y^*$, defining an implicit likelihood over correct outcomes.

In these settings, reinforcement learning is typically applied. RL was conceived to handle problems where the likelihood cannot be optimized directly due to non-differentiable intermediate sampling. However, the two approaches optimize fundamentally different objectives:

Reinforcement Learning

$\nabla_\theta J_{\mathrm{RL}} = \mathbb{E}_x\left[\nabla_\theta p_\theta(x)\right]$

Maximizes expected correctness (pass@1)

Maximum Likelihood

$\nabla_\theta J_{\mathrm{ML}} = \mathbb{E}_x\left[\nabla_\theta \log p_\theta(x)\right]$

Reweights by inverse success probability

Taking the gradient of the log introduces a $1/p_\theta(x)$ factor, which places greater emphasis on hard, low-success inputs. This leads to very different optimization dynamics: maximum likelihood pushes learning signal toward difficult problems where the model struggles.

However, maximum likelihood is statistically challenging to estimate when $p_\theta(x)$ is small: with a finite number of samples, a hard input may yield no successes at all. Can we approximate maximum likelihood in a way that scales with compute?
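As a rough numerical illustration (our own sketch, not from the paper), the snippet below compares the weight each objective places on an input of success probability $p$, and shows how easily a finite batch of rollouts misses every success on a hard input:

# Illustrative sketch: 1/p reweighting and the difficulty of estimating log p.
# RL follows grad p directly (weight 1 per input); ML follows grad log p,
# which multiplies the same gradient by 1/p.
for p in [0.5, 0.1, 0.01]:
    print(f"p = {p:<5}  RL weight = 1   ML weight = 1/p = {1 / p:.0f}")

# Why a naive estimate of log p breaks down: with N rollouts, the empirical
# pass rate is exactly 0 with probability (1 - p)^N, and log 0 is undefined.
p, N = 0.01, 16
print(f"P(no success in {N} rollouts at p = {p}) = {(1 - p) ** N:.2f}")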

MaxRL: Approximating Maximum Likelihood with More Compute

We show that the challenge of estimating maximum likelihood admits a principled resolution that scales with compute. The key insight comes from a Maclaurin expansion of the log-likelihood.

Maclaurin Expansion of Maximum Likelihood

The log-likelihood admits a Maclaurin expansion in terms of failure events. Writing $p = p_\theta(x)$ and $\mathrm{fail@}k(x) = (1-p)^k$ for the probability that all of $k$ independent samples fail,

$J_{\mathrm{ML}}(x) = \log p = -\sum_{k=1}^{\infty}\frac{(1-p)^k}{k} = -\sum_{k=1}^{\infty}\frac{\mathrm{fail@}k(x)}{k}$

Differentiating term by term, and using $\mathrm{pass@}k(x) = 1 - \mathrm{fail@}k(x)$, yields the population-level gradient identity:

$\nabla_\theta J_{\mathrm{ML}}(x) = \sum_{k=1}^{\infty}\frac{1}{k}\,\nabla_\theta \mathrm{pass@}k(x)$

From this Maclaurin expansion, we see that maximum likelihood optimizes an infinite harmonic mixture of pass@k gradients. Higher-order terms encode learning signal from increasingly rare success patterns—critical when the pass rate $p$ is small.

In contrast, standard RL optimizes only $\nabla_\theta \mathrm{pass@}1(x)$—the first-order term of this expansion:

Reinforcement learning is a first-order approximation of maximum likelihood.
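A quick numerical check of the expansion (our own sketch; the value of $p$ and the truncation levels are arbitrary):

# Illustrative sketch: the truncated Maclaurin series converges to log p.
import math

p = 0.1  # success probability of a hard input

for T in [1, 4, 16, 64, 256]:
    # Truncated series: log p = -sum_{k>=1} (1 - p)^k / k
    partial = -sum((1 - p) ** k / k for k in range(1, T + 1))
    print(f"T = {T:>3}: truncated series = {partial:8.4f}   (log p = {math.log(p):.4f})")

The $T = 1$ truncation is $-(1-p)$, which differs from $\mathrm{pass@}1 = p$ only by a constant, consistent with the claim that RL is the first-order approximation.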

MaxRL Objective Function

Optimizing the full infinite mixture is infeasible. We define the truncated maximum likelihood objective at level $T$:

$\nabla_\theta J_{\mathrm{MaxRL}}^{(T)}(x) = \sum_{k=1}^{T}\frac{1}{k}\,\nabla_\theta \mathrm{pass@}k(x)$

This defines a compute-indexed family of objectives:

  • $T = 1$ recovers standard reinforcement learning (pass@1)
  • $T \to \infty$ recovers exact maximum likelihood
  • $T = N$ (number of rollouts) is what we use in practice

We will show that $T$ corresponds directly to the number of rollouts $N$, enabling us to trade sampling compute for higher-fidelity approximations to the maximum likelihood objective.
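As a back-of-envelope sketch of this compute-fidelity trade-off (our own, not from the paper): summing the expansion up to $T$ gives an effective difficulty weight of $\sum_{k=1}^{T}(1-p)^{k-1} = \bigl(1-(1-p)^T\bigr)/p$, so reaching a $(1-\varepsilon)$ fraction of the full $1/p$ maximum likelihood weight requires $T \ge \log\varepsilon / \log(1-p)$ rollouts:

# Illustrative sketch: rollouts needed for a given fidelity to the 1/p weight.
import math

eps = 0.05  # allow the truncated weight to fall 5% short of the full 1/p weight
for p in [0.5, 0.2, 0.05, 0.01]:
    T_needed = math.ceil(math.log(eps) / math.log(1 - p))
    print(f"p = {p:<5} -> T >= {T_needed:>3} rollouts for {100 * (1 - eps):.0f}% of the 1/p weight")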

Practical Gradient Estimator of MaxRL

A natural approach to estimating the truncated objective is to approximate each pass@k term separately. We take an alternative approach that leads to a simpler estimator and a new viewpoint.

Theorem 1: Conditional Form of the ML Gradient

The gradient of the maximum likelihood objective admits the following conditional expectation representation, where $z \sim \pi_\theta(\cdot \mid x)$ is a sampled trajectory and "success" denotes the event that $z$ is correct:

$\nabla_\theta J_{\mathrm{ML}}(x) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(z \mid x) \;\middle|\; \text{success}\right]$

Interpretation: The ML gradient equals the average score function over successful trajectories only.

This theorem suggests a simple estimator: sample $N$ trajectories from the policy, then average the score functions only over successful ones. Given $N$ rollouts with $K$ successes, we define:

$\widehat{g}_N(x) = \begin{cases} \displaystyle\frac{1}{K}\sum_{i=1}^N r_i S_i, & K \ge 1 \\[0.6em] 0, & K = 0 \end{cases}$

where $r_i \in \{0,1\}$ is the binary reward and $S_i = \nabla_\theta \log \pi_\theta(z_i \mid x)$ is the score function.
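A minimal NumPy sketch of this estimator (our own illustration; the function names are ours, and scores stands in for the per-rollout score functions $S_i$, flattened into vectors):

# Illustrative sketch, not the paper's implementation.
import numpy as np

def maxrl_gradient_estimate(rewards: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """g_hat_N = (1/K) * sum_i r_i * S_i, and 0 when K = 0.

    rewards: shape (N,), binary 0/1 rewards for the N rollouts of one prompt.
    scores:  shape (N, D), score functions grad_theta log pi(z_i | x) as D-vectors.
    """
    K = rewards.sum()
    if K == 0:
        return np.zeros(scores.shape[1])
    return (rewards[:, None] * scores).sum(axis=0) / K

def reinforce_gradient_estimate(rewards: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """REINFORCE differs only in the normalizer: divide by N instead of K."""
    return (rewards[:, None] * scores).sum(axis=0) / len(rewards)

rewards = np.array([1, 0, 0, 1])        # K = 2 successes out of N = 4 rollouts
scores = np.random.randn(4, 3)          # toy 3-dimensional score vectors
print(maxrl_gradient_estimate(rewards, scores))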

Theorem 2: Estimator–Objective Equivalence

The estimator $\widehat{g}_N(x)$ is an unbiased estimator for the MaxRL gradient of order $T = N$:

$\mathbb{E}\left[\widehat{g}_N(x)\right] = \nabla_\theta J_{\mathrm{MaxRL}}^{(N)}(x)$

Implication: Using $N$ rollouts automatically targets the $T=N$ truncated ML objective—no explicit pass@k estimation needed.
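As a sanity check of this equivalence (a toy experiment we added, not from the paper), the following Monte Carlo simulation uses a softmax policy over four outcomes, two of which count as correct, and compares the empirical mean of $\widehat{g}_N$ against the truncated ML gradient:

# Illustrative sanity check of Theorem 2 on a toy softmax policy.
import numpy as np

rng = np.random.default_rng(0)

logits = np.array([-1.0, -2.0, 0.5, 1.0])
pi = np.exp(logits) / np.exp(logits).sum()
success = np.array([0, 1])                 # outcomes 0 and 1 count as correct
p = pi[success].sum()                      # success probability p_theta(x)

def score(z):
    """grad_theta log pi(z) for a softmax policy: e_z - pi."""
    s = -pi.copy()
    s[z] += 1.0
    return s

# Analytic target: sum_{k=1}^N (1/k) grad pass@k = (1 - (1-p)^N) / p * grad p,
# with grad p = sum_{j in success} pi_j * (e_j - pi).
N = 8
grad_p = sum(pi[j] * score(j) for j in success)
target = (1 - (1 - p) ** N) / p * grad_p

# Monte Carlo estimate of E[g_hat_N] over many batches of N rollouts.
trials = 100_000
acc = np.zeros_like(pi)
for _ in range(trials):
    z = rng.choice(4, size=N, p=pi)
    r = np.isin(z, success)
    K = r.sum()
    if K > 0:
        acc += sum(score(zi) for zi in z[r]) / K

print("truncated ML gradient:", np.round(target, 3))
print("Monte Carlo E[g_hat]: ", np.round(acc / trials, 3))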

The difference between REINFORCE and MaxRL is remarkably simple at the estimator level:

REINFORCE

$\displaystyle\frac{1}{N}\sum_{i=1}^N r_i S_i$

Unbiased for: $\nabla_\theta\,\mathrm{pass@}1$

Normalize by total samples $N$

MaxRL (Ours)

$\displaystyle\frac{1}{K}\sum_{i=1}^N r_i S_i$

Unbiased for: $\sum_{k=1}^{N}\frac{1}{k}\nabla_\theta \mathrm{pass@}k$

Normalize by successful samples $K$

Increasing $N$ in REINFORCE reduces variance of a fixed pass@1 objective. In MaxRL, increasing $N$ improves the objective itself, approaching maximum likelihood.

In addition, we can reduce variance with a zero-mean control variate: the unconditional average score $V_N = \frac{1}{N}\sum_{i=1}^N S_i$ satisfies $\mathbb{E}[V_N]=0$, so subtracting it preserves unbiasedness while reducing variance. The resulting on-policy implementation differs from standard REINFORCE by a single-line modification to the advantage calculation: the advantage is normalized by the per-task mean reward $\hat{r}$, rather than left unnormalized (RLOO) or normalized by the standard deviation (GRPO).
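For concreteness, here is a minimal NumPy sketch of that advantage computation, written by us under the assumption that rewards are grouped per prompt as in GRPO-style trainers (the function name is ours, not the paper's):

# Illustrative sketch of the per-prompt advantage calculation.
import numpy as np

def maxrl_advantages(rewards: np.ndarray) -> np.ndarray:
    """Per-prompt MaxRL advantages: (r_i - r_bar) / r_bar.

    rewards: shape (N,), binary rewards for the N rollouts of one prompt.
    GRPO would divide the centered reward by the group standard deviation
    instead, and RLOO would leave it unnormalized.
    """
    r_bar = rewards.mean()                        # per-task mean reward, K / N
    if r_bar == 0:                                # no correct rollout (K = 0 case)
        return np.zeros_like(rewards, dtype=float)
    return (rewards - r_bar) / r_bar

# Example: 2 correct rollouts out of 8, so r_bar = 0.25
print(maxrl_advantages(np.array([1, 0, 0, 1, 0, 0, 0, 0])))
# correct rollouts get advantage +3.0, incorrect ones get -1.0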

A Unifying Weight Function View

All methods—RL, GRPO, MaxRL, and ML—admit population-level gradients of the form $\nabla_\theta J = \mathbb{E}_{x}\left[w(p_\theta(x)) \nabla_\theta p_\theta(x)\right]$, where $w(p)$ determines how learning signal is allocated across inputs of varying difficulty. The key distinction among objectives is how strongly they emphasize hard, low-pass-rate inputs. As $T$ increases, MaxRL uniquely approaches maximum likelihood weighting in the low pass rate regime.

GRPO's $1/\sqrt{p(1-p)}$ weighting provides moderate upweighting of hard inputs, which contributes to its superior performance compared with vanilla REINFORCE. However, GRPO assigns increased weight to very easy inputs ($p \to 1$), unlike likelihood-based objectives. In contrast, MaxRL's normalization by $K$ enables principled approximation to maximum likelihood as the number of rollouts increases.
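To make this concrete, here is a small table we computed from the weight functions above (our own sketch, using the truncated MaxRL weight $\sum_{k=1}^{T}(1-p)^{k-1} = (1-(1-p)^T)/p$ implied by the expansion, with $T = 8$):

# Illustrative sketch: difficulty weights w(p) for each objective.
import math

T = 8  # rollouts per prompt
print(f"{'p':>5} {'RL':>5} {'GRPO':>7} {'MaxRL(T=8)':>11} {'ML':>7}")
for p in [0.01, 0.1, 0.5, 0.9, 0.99]:
    w_grpo = 1.0 / math.sqrt(p * (1 - p))
    w_maxrl = (1 - (1 - p) ** T) / p
    print(f"{p:>5} {1.0:>5.1f} {w_grpo:>7.1f} {w_maxrl:>11.1f} {1 / p:>7.1f}")

At $p = 0.99$, GRPO's weight rises back to about 10 while the MaxRL and ML weights stay near 1, illustrating the upweighting of very easy inputs discussed above.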

Experiments

We first show that MaxRL closely approximates exact maximum likelihood on a toy image classification task where the exact objective is computable, then demonstrate consistent improvements on maze navigation and GSM8K math reasoning, and finally on large-scale Qwen3 training with challenging math reasoning problems.

📊 Takeaway 1: MaxRL approaches exact maximum likelihood given infinite compute.

ImageNet: Comparison with Exact Likelihood

We first validate MaxRL where exact maximum likelihood (cross-entropy) is computable. Image classification provides a clean testbed: reward is 1 if predicted class matches ground truth, 0 otherwise. When direct maximum-likelihood optimization is available, MaxRL converges to it as sampling compute increases.

ImageNet results

ImageNet training dynamics. With sufficient rollouts, MaxRL closely matches cross-entropy training, while REINFORCE fails to make progress from low initial pass rates.

🚀 Takeaway 2: MaxRL scales better with additional compute in the infinite data regime.

Maze Navigation: Infinite Data Regime

We study training with continually fresh data using procedurally generated mazes. Each training input is newly generated, and the model never encounters the same maze twice. In a data-rich training regime, MaxRL scales more favorably with additional compute compared to existing methods.

Maze visualization

Example maze: successful navigation (left) vs. failure case (right).

Maze scaling results

Scaling behavior with increasing rollouts per prompt. MaxRL consistently outperforms GRPO, which outperforms RLOO.

🛡️ Takeaway 3: MaxRL is more resistant to overfitting.

GSM8K: Data-Scarce Regime

In the data-scarce regime, models train for many epochs over a fixed dataset. This exposes differences in how objectives allocate learning signal under repeated training. MaxRL can sustain improvement over a large number of epochs, demonstrating less pass@k degradation (overfitting) and converging to a higher average performance.

GSM8K training dynamics

Training dynamics on GSM8K. MaxRL shows slower initial gains but sustained improvement, with substantially less pass@k degradation.

🧠 Takeaway 4: MaxRL's benefits transfer to larger scale mathematical reasoning.

Large-Scale LLM Training

We train Qwen3-1.7B-Base and Qwen3-4B-Base models on POLARIS-53K (~50K math reasoning prompts), and evaluate on AIME 2025, BeyondAIME, MATH-500, and Minerva. On larger scale mathematical reasoning, MaxRL Pareto-dominates GRPO and shows little to no diversity degradation with respect to the base model.

Large scale LLM results

Evaluation on math benchmarks. MaxRL consistently Pareto-dominates GRPO: similar or better pass@1 and improved pass@k. Improved coverage means achieving the same pass@k requires 2.3×–19.2× fewer samples than GRPO.

⚡ Takeaway 5: MaxRL shows characteristically different optimization dynamics.

Gradient Norm & Training Dynamics Analysis

Besides performance metrics, MaxRL exhibits different optimization dynamics. Most notably, it produces stronger gradients on harder prompts, and also leads to a larger fraction of prompts with at least one correct rollout during training.

Gradient norm analysis

Gradient norm analysis. MaxRL generates larger gradient norms over prompts with close to 0 pass rates, concentrating learning signal on harder problems.

Training dynamics

Training dynamics comparison. MaxRL consistently produces at least one correct rollout for more prompts, demonstrating effectiveness at extracting more learning signal.

Conclusion

We introduce Maximum Likelihood Reinforcement Learning (MaxRL), a framework that bridges the gap between reinforcement learning and maximum likelihood for correctness-based tasks. Through a Maclaurin expansion of the log-likelihood, we show that standard RL optimizes only the first-order term (pass@1), while maximum likelihood corresponds to an infinite harmonic mixture of pass@k objectives.

MaxRL provides a practical middle ground: by truncating the expansion at level $T = N$ (the number of rollouts), we obtain a compute-indexed family of objectives that progressively approaches maximum likelihood as compute increases. The resulting gradient estimator is remarkably simple—normalizing by the number of successes rather than total samples—yet yields consistent improvements across diverse domains.

Empirically, MaxRL demonstrates superior performance in image classification, maze navigation, math reasoning, and large-scale LLM training. Compared with GRPO, MaxRL achieves higher pass@k while maintaining similar or better pass@1, requiring 2.3×–19.2× fewer samples at deployment time to achieve the same pass@k coverage.

BibTeX

@article{tajwar2025maxrl,
  title={Maximum Likelihood Reinforcement Learning},
  author={Tajwar, Fahim and Zeng, Guanning and Zhou, Yueer and Song, Yuda and 
          Arora, Daman and Jiang, Yiding and Schneider, Jeff and 
          Salakhutdinov, Ruslan and Feng, Haiwen and Zanette, Andrea},
  journal={arXiv preprint},
  year={2025}
}