Jackpot: A Technique to Reduce LLM Reinforcement Learning Costs by 80% [Paper]

Jackpot: 3 Key Insights for Training Big Models with Small Models

  • Rollout generation accounts for 80% of the total cost of LLM reinforcement learning
  • Jackpot keeps training stable even when rollouts come from a smaller model
  • It matched on-policy RL performance when training Qwen3-8B

The Rollout Cost Problem and OBRS

In LLM reinforcement learning, rollout generation accounts for 80% of the total cost[Jackpot Paper]. Using smaller models for rollouts reduces costs, but the distribution difference between the two models (actor-policy mismatch) destabilizes training.

Jackpot solves this with OBRS (Optimal Budgeted Rejection Sampling)[Jackpot Paper]. It keeps for training only those small-model tokens that lie close to the large model's distribution. Rather than matching the distribution perfectly, it finds the optimal acceptance strategy within a fixed acceptance budget.
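To make the budget idea concrete, here is a minimal Python sketch with synthetic probabilities. It assumes lambda is the knob that sets the acceptance budget, consistent with the acceptance rule quoted in "How it Works" below; none of the numbers come from the paper.

```python
import numpy as np

# Toy demonstration of the acceptance budget: under the rule
# a(x) = min(1, p_target / (lambda * p_rollout)), a larger lambda accepts
# fewer tokens, but the accepted ones track the target distribution more
# closely. All probabilities below are synthetic.
rng = np.random.default_rng(0)
p_target = rng.dirichlet(np.ones(1000))    # pretend large-model token probabilities
p_rollout = rng.dirichlet(np.ones(1000))   # pretend small-model token probabilities
tokens = rng.choice(1000, size=10_000, p=p_rollout)  # tokens sampled from the small model

for lam in (1.0, 2.0, 4.0):
    accept_prob = np.minimum(1.0, p_target[tokens] / (lam * p_rollout[tokens]))
    print(f"lambda={lam}: expected acceptance rate ~ {accept_prob.mean():.2f}")
```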

Qwen3-8B Experimental Results

Generating rollouts with Qwen3-1.7B while training Qwen3-8B yielded 93.57% on GSM8K and 82.65% on MATH-500[Jackpot Paper], equal to or better than the on-policy baseline (93.29% and 79.50%).

The existing TIS baseline reached only 76.45% on MATH-500 and became unstable in the later stages of training, whereas Jackpot remained stable through 300 steps.

How it Works

Each sampled token is filtered with acceptance probability a(x) = min(1, p_target(x) / (lambda * p_inf(x))), where p_target is the probability under the model being trained, p_inf is the probability under the small rollout (inference) model, and lambda is the budget parameter. A top-k approximation reduces the computation, and because the method operates on already-generated trajectories, the additional overhead is low[PPO Paper].
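A minimal sketch of this filter in Python (the function name and example probabilities are mine, not from the paper; the paper's top-k approximation is not reproduced here):

```python
import numpy as np

def obrs_accept_mask(p_target, p_rollout, lam, rng):
    """Per-token acceptance a(x) = min(1, p_target / (lam * p_rollout)).

    p_target:  probability the model being trained assigns to each sampled token
    p_rollout: probability the small rollout model assigned to the same token
    lam:       budget parameter (larger lam -> stricter acceptance)
    """
    accept_prob = np.minimum(1.0, np.asarray(p_target) / (lam * np.asarray(p_rollout)))
    # Stochastic acceptance: keep each token with probability a(x).
    return rng.random(accept_prob.shape) < accept_prob

# Toy usage on five sampled tokens (all probabilities made up).
rng = np.random.default_rng(0)
mask = obrs_accept_mask([0.30, 0.05, 0.20, 0.01, 0.40],
                        [0.25, 0.30, 0.18, 0.02, 0.35],
                        lam=1.5, rng=rng)
print(mask)  # True where the token is kept for the policy-gradient update
```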

Frequently Asked Questions (FAQ)

Q: When is Jackpot useful?

A: It is effective whenever you want to cut rollout costs in LLM reinforcement learning, and it is most advantageous when the model being trained is large and a smaller model can handle rollout generation. The bigger the gap in model size, the larger the stability benefit over existing methods.

Q: Why is actor-policy mismatch a problem?

A: When the rollout model's distribution differs from the training model's, the likelihood ratio spikes sharply on rare tokens, which can destabilize the gradient and cause training to diverge. The KL divergence between the two policies is an order of magnitude larger than in asynchronous training.
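A quick numeric illustration (the probabilities are invented for the example, not taken from the paper):

```python
# A rare token that the rollout model almost never produces, but that the
# training model assigns noticeable probability to, yields a huge importance ratio.
p_train   = [0.30, 0.25, 0.05]     # probabilities under the model being trained
p_rollout = [0.28, 0.24, 0.0005]   # probabilities under the small rollout model

ratios = [pt / pr for pt, pr in zip(p_train, p_rollout)]
print(ratios)  # roughly [1.07, 1.04, 100.0] -> one rare token dominates the update
```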

Q: How is it different from existing importance sampling?

A: TIS clips the likelihood ratio to reduce variance but does not correct the underlying distribution. OBRS accepts or rejects samples so that the rollout distribution itself moves closer to the target, and this difference shows up as a gap in training stability. The sketch below contrasts the two corrections.
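A side-by-side sketch in Python (the clipping threshold C = 2.0, lambda = 1.5, and all probabilities are illustrative choices, not values from either paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p_target  = np.array([0.30, 0.25, 0.05, 0.40])
p_rollout = np.array([0.28, 0.24, 0.0005, 0.35])
ratio = p_target / p_rollout

# TIS: keep every token, but cap its importance weight (here at C = 2.0).
# The weight is bounded, yet the data still follows the rollout distribution.
tis_weights = np.minimum(ratio, 2.0)

# OBRS: accept or reject each token with probability min(1, ratio / lambda),
# so the distribution of the tokens that survive moves toward the target model.
lam = 1.5
obrs_accepted = rng.random(ratio.shape) < np.minimum(1.0, ratio / lam)

print("TIS weights:   ", tis_weights)
print("OBRS accepted: ", obrs_accepted)
```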


If you found this helpful, please subscribe to AI Digester.

