MINGHAO LI, YING ZENG, CONG MA, SIYAO SONG, KAI JIA
September 16, 2025
In the development of large models, the Scaling Law has long been the dominant theme: larger parameter counts, richer data, and greater compute typically lead to better performance. But as scaling model parameters and dataset size runs into diminishing returns, research attention has shifted to Test-Time Scaling: extracting more capability from a model of fixed size during inference through longer, more structured chains of thought and tool chains.
Reinforcement Learning with Verifiable Rewards (RLVR) is a representative paradigm in this direction. By pairing verifiable rewards with challenging prompts, RLVR guides the model toward deeper reasoning during sampling, enabling it to solve more complex problems; this scaling shows up in the RL process as a natural increase in response length. Three elements are key in this process: robust rewards that can continuously guide the model’s optimization, hard problems that trigger deep-thinking behavior, and sampled trajectories that actually reach the correct answer. In practice, however, RLVR runs into a concrete difficulty: sampled trajectories lack diversity, which reduces the efficiency of finding correct answers.
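To make “verifiable reward” concrete, below is a minimal sketch of a rule-based checker for a math-style task; the `\boxed{}` answer marker and the exact-match comparison are illustrative assumptions, not the specific reward used in any particular RLVR system.

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Rule-based binary reward: 1.0 if the final answer matches, else 0.0.

    Assumes (illustratively) that the model writes its final answer as
    \\boxed{...}; no learned reward model is involved.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no parseable final answer, so no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```

The essential property is that the reward comes from a deterministic check rather than a learned model, which is what makes it “verifiable”.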
A more pragmatic engineering problem is that when the policy model scores very low on a batch of tasks initially, the system struggles to sample high-quality examples early on; optimization becomes slow or even stalls, forcing us to rely on human annotation or on supervised fine-tuning (SFT) with data distilled from stronger models to cold-start the system and raise the early hit rate. Such remedies are capped by the capabilities of the human annotators or the stronger models involved.
Thus, the problem is naturally restated as: How can we more efficiently sample correct solution trajectories?
<aside> 🔖 Shift Test-Time Scaling from “single, independently sampled trajectories” to “experience-driven inter-trajectory evolution”
</aside>
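In rough code terms, this shift means replacing independent draws with a loop over a persistent experience store. The sketch below is schematic only: `ExperienceBank`, `generate`, and `evaluate` are hypothetical placeholders, and selecting the top-k prior attempts as conditioning context is just one of many possible strategies.

```python
from dataclasses import dataclass, field

@dataclass
class ExperienceBank:
    """Persistent store of past trajectories with their feedback and scores."""
    records: list = field(default_factory=list)  # entries: (trajectory, feedback, score)

    def add(self, trajectory, feedback, score):
        self.records.append((trajectory, feedback, score))

    def top_k(self, k=4):
        # Surface the most promising prior attempts as conditioning context.
        return sorted(self.records, key=lambda r: r[2], reverse=True)[:k]

def evolve(task, generate, evaluate, rounds=8):
    """Experience-conditioned sampling: each round sees earlier attempts and feedback.

    `generate(task, context)` samples a trajectory conditioned on prior experience,
    and `evaluate(trajectory)` returns (score, feedback); both are user-supplied.
    """
    bank = ExperienceBank()
    for _ in range(rounds):
        context = bank.top_k()                # externalized experience
        trajectory = generate(task, context)  # x ~ P_theta(x | task, experience)
        score, feedback = evaluate(trajectory)
        if score == 1.0:                      # verifiable success: stop early
            return trajectory
        bank.add(trajectory, feedback, score)
    return None                               # no correct trajectory within the budget
```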
We propose SamplingEvolve, which reconstructs inference from “single-shot independent sampling” into an “experience-driven inter-trajectory evolution system”. Intuitively, the system persists model-generated historical trajectories together with their evaluation feedback, and uses these externalized experiences to drive unlimited rounds of evolution during sampling. Unlike the conventional
$$ x \sim P_\theta(x) $$
SamplingEvolve explicitly conditions sampling on historical experience and feedback, so the sampling distribution becomes