Maximum Likelihood Reinforcement Learning

1CMU 2Tsinghua University 3Zhejiang University 4UC Berkeley 5Impossible, Inc. *Equal Contribution

TL;DR  A framework for maximizing likelihood with reinforcement learning.

MaxRL Teaser
MaxRL Results on Qwen3-4B

Results on Qwen3-4B. MaxRL Pareto-dominates GRPO across all benchmarks, achieving similar or better Pass@1 while significantly improving Pass@K. This translates to 7.9×–19.2× gains in test-time scaling efficiency.

Why Maximum Likelihood?

Maximum likelihood is a principled objective in machine learning, reliably translating increases in model capacity, data, and compute into performance gains. Modern tasks like code generation and mathematical reasoning differ from this classical setting: they involve non-differentiable generation but have a binary notion of correctness. For each input $x$, the model induces a success probability $p_\theta(x) = p_\theta(y^* \mid x)$ over the correct answer $y^*$, an implicit likelihood over correct outcomes.

In these settings, reinforcement learning is typically applied. However, the two approaches optimize fundamentally different objectives:

Reinforcement Learning

$J_{\mathrm{RL}} = \mathbb{E}_x\left[p_\theta(x)\right]$

Maximum Likelihood

$J_{\mathrm{ML}} = \mathbb{E}_x\left[\log p_\theta(x)\right]$

MaxRL: Maximum Likelihood via Reinforcement Learning

MaxRL is a framework that turns more compute into increasingly better approximations of the maximum likelihood objective in sampling-based tasks.

Maclaurin Expansion of Maximum Likelihood

The maximum likelihood objective admits a Maclaurin expansion in terms of failure events (writing $p = p_\theta(x)$):

$J_{\mathrm{ML}}(x) = \log p = -\sum_{k=1}^{\infty}\frac{(1-p)^k}{k} = -\sum_{k=1}^{\infty}\frac{\mathrm{fail@}k(x)}{k}$

where $\mathrm{fail@}k(x)=1-\mathrm{pass@}k(x)$ denotes the probability that all $k$ i.i.d. samples from the model fail. Differentiating yields the population-level gradient identity:

$\nabla_\theta J_{\mathrm{ML}}(x) = \sum_{k=1}^{\infty}\frac{1}{k}\,\nabla_\theta \mathrm{pass@}k(x)$

Maximum likelihood optimizes an infinite weighted sum of pass@k gradients. Higher-order terms represent rare successes, critical when $p$ is small. In contrast, standard RL optimizes only the first-order term of this expansion.
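Both identities are easy to sanity-check numerically. A minimal sketch (our own check, with an illustrative value of $p$ and a far-out truncation standing in for the infinite sums):

```python
import math

p = 0.2  # illustrative success probability for a single input x

# Series identity: log p = -sum_{k>=1} (1-p)^k / k
series = -sum((1 - p) ** k / k for k in range(1, 5000))

# Gradient identity (in p): d/dp log p = 1/p, while
# sum_{k>=1} (1/k) d/dp pass@k = sum_{k>=1} (1-p)^(k-1), a geometric series
grad_series = sum((1 - p) ** (k - 1) for k in range(1, 5000))

print(series, math.log(p))   # both should be close to log 0.2
print(grad_series, 1 / p)    # both should be close to 5
```

The first-order truncation of the gradient series is simply $\nabla_\theta p$, which is exactly the RL gradient.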

Reinforcement Learning

$$\begin{aligned} J_{\mathrm{RL}}(x) &= p_\theta(x) \\ \nabla_\theta J_{\mathrm{RL}}(x) &= \nabla_\theta \mathrm{pass@}1(x) \end{aligned}$$

Optimizes pass@1

Maximum Likelihood

$$\begin{aligned} J_{\mathrm{ML}}(x) &= \log p_\theta(x) \\ \nabla_\theta J_{\mathrm{ML}}(x) &= \sum_{k=1}^{\infty}\frac{1}{k}\,\nabla_\theta \mathrm{pass@}k(x) \end{aligned}$$

Optimizes harmonic mixture of pass@k

Reinforcement learning optimizes a first-order approximation of the maximum likelihood objective.

MaxRL Objective Function

Optimizing the full infinite mixture is infeasible. We define the truncated maximum likelihood objective at level $T$:

$J_{\mathrm{MaxRL}}^{(T)}(x) = -\sum_{k=1}^{T}\frac{(1-p)^k}{k}$

Differentiating yields the truncated population gradient:

$\nabla_\theta J_{\mathrm{MaxRL}}^{(T)}(x) = \sum_{k=1}^{T}\frac{1}{k}\,\nabla_\theta \mathrm{pass@}k(x)$

The objective $J_{\mathrm{MaxRL}}^{(T)}(x)$ defines a compute-indexed family:

  • $T = 1$ recovers standard reinforcement learning (pass@1)
  • $T \to \infty$ recovers exact maximum likelihood
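A quick way to see the family interpolate is to evaluate the truncated objective at a fixed $p$ for growing $T$ (a small illustrative sketch, not from the paper):

```python
import math

def J_maxrl(p, T):
    """Truncated ML objective: -sum_{k=1}^T (1-p)^k / k."""
    return -sum((1 - p) ** k / k for k in range(1, T + 1))

p = 0.1
for T in (1, 4, 16, 64, 256):
    print(T, J_maxrl(p, T))   # climbs toward log p as T grows
print("log p =", math.log(p))

# T = 1 gives p - 1, i.e. the RL objective up to an additive constant
```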

Practical Gradient Estimator of MaxRL

A natural approach is to approximate each pass@k term separately. We instead take a different route that yields a simpler estimator and a new viewpoint.

Theorem 1: Conditional Form of the ML Gradient

The gradient of the maximum likelihood objective admits the following conditional expectation representation:

$\nabla_\theta J_{\mathrm{ML}}(x) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(z \mid x) \;\middle|\; \text{success}\right]$

Interpretation: The ML gradient equals the average score function over successful trajectories only.

This theorem suggests a simple estimator: sample $N$ trajectories from the policy, then average the score functions only over successful ones. Given $N$ rollouts with $K$ successes, we define:

$\widehat{g}_N(x) = \begin{cases} \displaystyle\frac{1}{K}\sum_{i=1}^N r_i S_i, & K \ge 1 \\[0.6em] 0, & K = 0 \end{cases}$

where $r_i \in \{0,1\}$ is the binary reward and $S_i = \nabla_\theta \log \pi_\theta(z_i \mid x)$ is the score function.

Theorem 2: Estimator–Objective Equivalence

The estimator $\widehat{g}_N(x)$ is an unbiased estimator for the MaxRL gradient of order $T = N$:

$\mathbb{E}\left[\widehat{g}_N(x)\right] = \nabla_\theta J_{\mathrm{MaxRL}}^{(N)}(x)$

Implication: Using $N$ rollouts automatically targets the $T=N$ truncated ML objective—no explicit pass@k estimation needed.
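Theorem 2 can be checked empirically on a toy problem. Below is our own sanity check (not from the paper) using a one-parameter sigmoid "policy" whose single action succeeds with probability $p = \sigma(\theta)$, for which the score function and the truncated gradient are available in closed form:

```python
import math
import random

random.seed(0)
theta = -1.0
p = 1 / (1 + math.exp(-theta))    # success probability of the sigmoid policy
N, trials = 8, 200_000

# Closed-form truncated gradient: sum_{k=1}^N (1/k) d/dtheta pass@k,
# with d/dtheta pass@k = k (1-p)^(k-1) * p(1-p)
grad_T = p * (1 - p) * sum((1 - p) ** (k - 1) for k in range(1, N + 1))

# Monte Carlo average of g_N: for this policy the score of every success
# is (1-p), so the estimator equals (1-p) whenever K >= 1 and 0 when K = 0
total = 0.0
for _ in range(trials):
    K = sum(random.random() < p for _ in range(N))
    if K >= 1:
        total += 1 - p
print(total / trials, grad_T)  # the two should agree closely
```

With $N = 8$ rollouts, the empirical mean of $\widehat{g}_N$ matches the $T=8$ truncated gradient, as the theorem predicts.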

The difference between REINFORCE and MaxRL is remarkably simple at the estimator level:

REINFORCE

$\displaystyle\frac{1}{N}\sum_{i=1}^N r_i S_i$

Unbiased for: $\nabla_\theta\,\mathrm{pass@}1$

Normalize by total samples $N$

MaxRL (Ours)

$\displaystyle\frac{1}{K}\sum_{i=1}^N r_i S_i$

Unbiased for: $\sum_{k=1}^{N}\frac{1}{k}\nabla_\theta \mathrm{pass@}k$

Normalize by successful samples $K$

Increasing $N$ in REINFORCE only reduces the variance of an estimate of a fixed pass@1 objective. In MaxRL, by contrast, increasing $N$ improves the objective itself, moving it closer to maximum likelihood.

In addition, we can reduce variance using a zero-mean control variate: the unconditional average score $V_N = \frac{1}{N}\sum_{i=1}^N S_i$, which satisfies $\mathbb{E}[V_N]=0$, so subtracting it preserves unbiasedness while reducing variance. The resulting on-policy implementation differs from REINFORCE and GRPO by only a single-line change to the advantage calculation: the advantage is normalized by the per-task mean reward $\hat{r}$, rather than left unnormalized (as in REINFORCE) or normalized by the standard deviation (as in GRPO).
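In code, the three advantage rules differ only in the divisor. A minimal NumPy sketch (the small stabilizing epsilon is our addition, not from the paper):

```python
import numpy as np

def advantages(rewards, method):
    """Per-task advantages for a group of binary rewards from one prompt."""
    r = np.asarray(rewards, dtype=float)
    mean = r.mean()
    if method == "reinforce":
        return r - mean                        # centered, left unnormalized
    if method == "grpo":
        return (r - mean) / (r.std() + 1e-8)   # normalized by std
    if method == "maxrl":
        return (r - mean) / (mean + 1e-8)      # normalized by mean reward r-hat
    raise ValueError(method)

print(advantages([1, 0, 0, 0], "maxrl"))   # hard prompt: large advantages
print(advantages([1, 1, 1, 0], "maxrl"))   # easy prompt: small advantages
```

On a hard prompt (one success out of four) the MaxRL advantage of the success is 3, versus 1/3 on an easy prompt with three successes, so the mean normalization automatically upweights hard inputs.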

A Unifying Weight Function View

All of the objectives mentioned above admit population-level gradients of the form:

$\nabla_\theta J = \mathbb{E}_{x}\left[w(p_\theta(x)) \nabla_\theta p_\theta(x)\right]$

where $w(p)$ determines how learning signal is allocated across inputs of varying difficulty. The key distinction among objectives is how strongly they emphasize hard, low-pass-rate inputs. Below, we present a visualization of the weight functions for each objective. From this visualization, we can see that as $T$ increases, MaxRL gradually approaches maximum likelihood weighting.

MaxRL Weight Functions

Notably, GRPO's normalization by standard deviation also provides moderate upweighting of hard inputs. However, unlike likelihood-based objectives, GRPO assigns increased weight to very easy inputs as $p \to 1$.
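The likelihood-family weight functions follow directly from the geometric series above; the GRPO form below is our population-level reading of its standard-deviation normalization, not an expression from the paper:

```python
import math

def w_rl(p):          # standard RL: uniform weight
    return 1.0

def w_ml(p):          # maximum likelihood: w(p) = 1/p, emphasizes hard inputs
    return 1.0 / p

def w_maxrl(p, T):    # truncated family: partial geometric sum (1-(1-p)^T)/p
    return (1 - (1 - p) ** T) / p

def w_grpo(p):        # assumed population form of std normalization
    return 1.0 / math.sqrt(p * (1 - p))

for p in (0.01, 0.1, 0.5, 0.99):
    print(f"p={p}: RL={w_rl(p):.1f}  MaxRL(T=8)={w_maxrl(p, 8):.1f}  "
          f"ML={w_ml(p):.1f}  GRPO={w_grpo(p):.1f}")
```

Note that `w_maxrl(p, T)` converges to `w_ml(p)` as $T$ grows, while the assumed GRPO weight diverges at both extremes, including $p \to 1$.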

Experiments

We first show that MaxRL closely approximates exact maximum likelihood on a toy image classification task where that objective is directly computable, then demonstrate consistent improvements on maze navigation and GSM8K math reasoning, and finally scale up to training Qwen3 models on challenging math reasoning problems.

ImageNet: Comparison with Exact Likelihood

We first validate MaxRL in a setting where maximum likelihood can be implemented exactly: standard image classification, where it corresponds to minimizing cross-entropy. Classification provides a clean testbed: the reward is 1 if the predicted class matches the ground truth and 0 otherwise. As sampling compute increases, MaxRL converges to exact maximum likelihood training.

ImageNet results

ImageNet training dynamics. With sufficient rollouts, MaxRL closely matches cross-entropy training, while REINFORCE fails to make progress from low initial pass rates.

📊 Takeaway 1: MaxRL approaches exact maximum likelihood given infinite compute.

Maze Navigation: Infinite Data Regime

We study training with continually fresh data using procedurally generated mazes. Each training input is newly generated, and the model never encounters the same maze twice. In a data-rich training regime, MaxRL scales more favorably with additional compute compared to existing methods.

Maze visualization

Example maze: successful navigation (left) vs. failure case (right).

Maze scaling results

Scaling behavior with increasing rollouts per prompt. MaxRL consistently outperforms GRPO and RLOO.

🚀 Takeaway 2: MaxRL scales better with additional compute in the infinite data regime.

GSM8K: Data-Scarce Regime

In the data-scarce regime, models train for multiple epochs over a fixed dataset until Pass@1 no longer climbs. This exposes differences in how objectives allocate learning signal under repeated training. MaxRL sustains improvement over a large number of epochs, showing less pass@k degradation (overfitting) and converging to higher average performance.

Pass rate distribution during training

Training dynamics on GSM8K and distribution of prompts across pass rate bins during GSM8K training. MaxRL shows slower initial gains but sustained improvement, with substantially less pass@k degradation. During training, MaxRL also shows a much lower bar at 0% pass rate compared to baselines, indicating that it solves more problems in the training set. Moreover, the distribution across pass rates is more uniform for MaxRL, while GRPO and RLOO cluster around the two extremes (0% and 100% pass rates). This demonstrates that training with MaxRL mitigates sharpening and extracts more learning signal from a fixed dataset.

🛡️ Takeaway 3: MaxRL is more resistant to overfitting.

Large-Scale LLM Training

We train Qwen3-1.7B-Base and Qwen3-4B-Base models on POLARIS-53K (~50K math reasoning prompts), and evaluate on various mathematical reasoning benchmarks. On larger scale mathematical reasoning, MaxRL Pareto-dominates GRPO and shows little to no diversity degradation with respect to the base model.

Large scale LLM results

Evaluation on math benchmarks. MaxRL consistently Pareto-dominates GRPO: similar or better pass@1 and improved pass@k. Improved coverage means achieving the same pass@k requires 2.3×–19.2× fewer samples than GRPO.

🧠 Takeaway 4: MaxRL's benefits transfer to larger scale mathematical reasoning.

Training Dynamics Analysis

Besides performance metrics, MaxRL exhibits different optimization dynamics. Most notably, it produces stronger gradients on harder prompts, and also leads to a larger fraction of prompts with at least one correct rollout during training.

Gradient norm analysis

Gradient norm analysis. MaxRL generates larger gradient norms over prompts with close to 0 pass rates, concentrating learning signal on harder problems.

Training dynamics

Fraction of prompts with at least one correct rollout. MaxRL maintains a higher fraction of solvable prompts throughout training, enabling continued learning even in later epochs.

⚡ Takeaway 5: MaxRL shows characteristically different optimization dynamics.

BibTeX

@misc{tajwar2026maxrl,
  title   = {Maximum Likelihood Reinforcement Learning}, 
  author  = {Fahim Tajwar and Guanning Zeng and Yueer Zhou and Yuda Song
             and Daman Arora and Yiding Jiang and Jeff Schneider and Ruslan Salakhutdinov
             and Haiwen Feng and Andrea Zanette},
  year    = {2026},
  eprint  = {2602.02710},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url     = {https://arxiv.org/abs/2602.02710}, 
}