MINGHAO LI, YING ZENG, CONG MA, SIYAO SONG, KAI JIA

September 16, 2025

[Chinese Version] [GitHub]

In the development of large models, the Scaling Law has long been the dominant theme: larger parameter counts, richer data, and greater compute typically lead to better performance. But as scaling up model parameters and dataset size runs into diminishing returns, research attention shifts to Test-Time Scaling: extracting more capability from a fixed model size during inference by using longer, more structured chains of thought and tool chains.

Reinforcement Learning with Verifiable Rewards (RLVR) is a representative paradigm in this direction. By combining verifiable rewards with challenging prompts, RLVR guides models to reason more deeply during sampling, enabling them to solve more complex problems. This kind of scaling shows up in the RL process as a natural increase in response length. Three elements are key here: robust rewards that can continuously guide the model’s optimization, hard problems that trigger deep-thinking behavior, and sampled trajectories that actually reach the correct answer. In practice, however, RLVR runs into a central problem: sampled trajectories lack diversity, which makes correct answers harder to find:

  1. The model only reflects within a single sampled trajectory, and the attempts inside that trajectory receive no external feedback, so the model struggles to pinpoint and correct mistaken steps or to try fundamentally different approaches.
  2. Once a sampled trajectory has received feedback and been consumed as a gradient signal, it is neither explicitly stored nor reused as experience; there is no guarantee that the model will behave more diversely on the next sampling run.
  3. During RL, the model is prone to entropy collapse: as training proceeds, the sampled trajectories gradually converge and become homogeneous.
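
To make the sampling-and-scoring step above concrete, here is a minimal sketch, assuming a math-style task whose final answer is written in `\boxed{...}`; the `verifiable_reward` and `rlvr_rollout` names and the `policy.generate` interface are illustrative assumptions, not part of the original system:

```python
import re


def verifiable_reward(response: str, reference_answer: str) -> float:
    """Rule-based reward: 1.0 if the last \\boxed{...} answer in the response
    matches the reference exactly, 0.0 otherwise. No learned reward model."""
    answers = re.findall(r"\\boxed\{([^}]*)\}", response)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip() == reference_answer.strip() else 0.0


def rlvr_rollout(policy, prompt: str, reference_answer: str, k: int = 8):
    """Draw k independent trajectories for one prompt and score each of them.
    `policy.generate` stands in for whatever sampling API is actually used."""
    trajectories = [policy.generate(prompt) for _ in range(k)]
    rewards = [verifiable_reward(t, reference_answer) for t in trajectories]
    return trajectories, rewards
```

Because the k draws are independent, a failed trajectory carries no information into the others, which is exactly the diversity problem listed above.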

A more pragmatic engineering problem is that when the policy model's initial score on a batch of tasks is very low, the system struggles to sample high-quality examples early on; optimization slows down or even stalls, forcing us to cold-start with human annotation or with SFT on data distilled from stronger models in order to raise the early hit rate. Such remedies are tightly bounded by the capabilities of the human annotators or the stronger models involved.

Thus, the problem is naturally restated as: How can we more efficiently sample correct solution trajectories?

<aside> 🔖 Shift Test-Time Scaling from “single, independently sampled trajectories” to “experience-driven inter-trajectory evolution”

</aside>

SamplingEvolve

We propose SamplingEvolve, which reconstructs inference from “single-shot independent sampling” into an “experience-driven inter-trajectory evolution system”. Intuitively, the system persists model-generated historical trajectories together with their evaluation feedback, and uses these externalized experiences to drive an unlimited number of evolution rounds during sampling.
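
As a rough sketch of this idea, and nothing more, the evolution loop might look like the following; the `ExperienceStore` class, the `policy.generate(prompt, context=...)` call, and the `evaluate` hook are all assumptions made for illustration rather than the authors' implementation:

```python
from dataclasses import dataclass, field


@dataclass
class ExperienceStore:
    """Persists past trajectories together with their evaluation feedback."""
    records: list = field(default_factory=list)  # list of (trajectory, feedback)

    def add(self, trajectory: str, feedback: str) -> None:
        self.records.append((trajectory, feedback))

    def render(self, top_k: int = 4) -> str:
        """Serialize the most recent experiences into a context block."""
        return "\n\n".join(
            f"Previous attempt:\n{t}\nFeedback:\n{f}"
            for t, f in self.records[-top_k:]
        )


def sampling_evolve(policy, evaluate, prompt: str, rounds: int = 8):
    """Evolve trajectories across rounds: each new sample is conditioned on
    the stored history instead of being drawn independently."""
    store = ExperienceStore()
    for _ in range(rounds):
        # x ~ P_theta(x | prompt, experience) instead of x ~ P_theta(x | prompt)
        trajectory = policy.generate(prompt, context=store.render())
        feedback, solved = evaluate(trajectory)  # e.g. verifiable check + critique
        if solved:
            return trajectory
        store.add(trajectory, feedback)
    return None  # no correct trajectory found within the round budget
```

Each round can see what earlier rounds tried and why they failed. Contrast this with the conventional setup, in which every trajectory is drawn independently as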

$$ x \sim P_\theta(x) $$

SamplingEvolve explicitly conditions sampling on historical experience and feedback, so the sampling distribution becomes