

<aside> 🔬

BandAI is dedicated to cutting-edge AI research and the exploration of next-generation AI products. Our research spans LLMs, VLMs, and agents, with a focus on frontier topics such as Deep Research, Agentic RL, and Self-Evolution. At the same time, we are committed to applying these technologies on Douyin, leveraging our data and technical strengths to deliver unprecedented experiences to users.

Join us — help ensure AI creates real-world utility.

</aside>



Feb 2026 | Immersion in the GitHub Universe: Scaling Coding Agents to Mastery

Achieving mastery in real-world software engineering tasks is fundamentally bottlenecked by the scarcity of large-scale, high-quality training data. Scaling such data has been limited by the complexity of environment setup, unit test generation, and problem statement curation. In this paper, we propose ScaleSWE, an automated, sandboxed multi-agent workflow designed to construct high-quality SWE data at scale. The system coordinates three specialized agents for environment setup, test creation, and problem description synthesis to process 6 million pull requests across 5,200 repositories, producing ScaleSWE-Data: 100k verified SWE instances, the largest such dataset to date. It substantially surpasses existing real-world datasets in repository diversity and reflects realistic task complexity. We further demonstrate the dataset's utility for training by distilling 71,498 high-quality trajectories and fine-tuning Qwen3-30B-A3B-Instruct to produce ScaleSWE-Agent. Our agent achieves a 64% resolve rate on SWE-bench Verified, a nearly three-fold improvement over the base model. ScaleSWE provides a scalable, reproducible approach to data construction for advancing LLM-based software engineering. ScaleSWE will be publicly available.
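A hypothetical sketch of the three-agent pipeline the abstract describes, with each agent reduced to a plain function; all names (`setup_environment`, `create_tests`, `SWEInstance`, etc.) are illustrative assumptions, not the paper's actual API:

```python
# Illustrative ScaleSWE-style pipeline: three specialized "agents"
# (stubbed as functions) turn a pull request into a verified SWE
# training instance. Instances without verifying tests are discarded.

from dataclasses import dataclass, field

@dataclass
class SWEInstance:
    repo: str
    env_spec: str                      # how to reproduce the sandbox environment
    fail_to_pass: list = field(default_factory=list)  # tests failing pre-PR, passing post-PR
    problem: str = ""                  # natural-language problem statement

def setup_environment(pr):
    # Agent 1: infer dependencies and build steps for a sandboxed install.
    return f"pip install -e {pr['repo']}"

def create_tests(pr):
    # Agent 2: derive unit tests that discriminate pre- vs post-PR behavior.
    return [f"test_{pr['id']}_regression"]

def synthesize_problem(pr):
    # Agent 3: rewrite the PR description into a self-contained task statement.
    return f"Fix the bug addressed by PR #{pr['id']} in {pr['repo']}."

def build_instance(pr):
    env = setup_environment(pr)
    tests = create_tests(pr)
    if not tests:                      # no verifying tests -> drop the instance
        return None
    return SWEInstance(pr["repo"], env, tests, synthesize_problem(pr))

pr = {"repo": "example/repo", "id": 42}
inst = build_instance(pr)
print(inst.fail_to_pass)               # ['test_42_regression']
```

In the real system each stage is an LLM agent running in a sandbox over millions of PRs; the point of the sketch is only the division of labor and the filter on verifiable tests.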

Feb 2026 | Towards Better RL Training Data Utilization via Second-Order Rollout

Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL focuses mainly on improving generation capability by training with only first-order rollouts (generating multiple responses for a question). We argue that this approach fails to fully exploit the potential of the training data because it neglects critique-capability training. To tackle this problem, we introduce the concept of second-order rollouts (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach utilizes training data more effectively than vanilla RL and achieves better performance with the same training data. Additionally, we uncover several insightful findings regarding second-order rollouts and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for the further advancement of RL training.
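The first-order vs. second-order distinction can be sketched as two sampling loops; `generate` below is a stand-in for sampling from the policy, and the counts are arbitrary, not the paper's settings:

```python
# Illustrative sketch (not the paper's code): first-order rollout samples
# responses per question; second-order rollout additionally samples
# critiques per response, yielding joint generation + critique data.

def generate(prompt, n):
    # placeholder for drawing n samples from the current policy
    return [f"{prompt}::sample{i}" for i in range(n)]

def first_order_rollout(question, n_responses=4):
    # vanilla RL: several responses for one question
    return generate(question, n_responses)

def second_order_rollout(question, n_responses=4, n_critiques=2):
    # for each sampled response, also sample critiques of that response,
    # so the same question supplies training signal for both capabilities
    responses = generate(question, n_responses)
    critiques = {r: generate(f"critique of {r}", n_critiques) for r in responses}
    return responses, critiques

responses, critiques = second_order_rollout("Q1")
print(len(responses))                              # 4 responses
print(sum(len(c) for c in critiques.values()))     # 8 critiques
```

Each response and each critique would then be scored and fed to the RL objective; the sketch only shows how second-order rollouts multiply the data drawn from one question.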

Jan 2026 | TeachBench: A Syllabus-Grounded Framework for Evaluating Teaching Ability in Large Language Models

Large language models (LLMs) show promise as teaching assistants, yet their teaching capability remains insufficiently evaluated. Existing benchmarks mainly focus on problem-solving or problem-level guidance, leaving knowledge-centered teaching underexplored. We propose a syllabus-grounded evaluation framework that measures LLM teaching capability via student performance improvement after multi-turn instruction. By restricting teacher agents to structured knowledge points and example problems, the framework avoids information leakage and enables reuse of existing benchmarks.

We instantiate the framework on Gaokao data across multiple subjects. Experiments reveal substantial variation in teaching effectiveness across models and domains: some models perform well in mathematics, while teaching remains challenging in physics and chemistry. We also find that incorporating example problems does not necessarily improve teaching, as models often shift toward example-specific error correction. Overall, our results highlight teaching ability as a distinct and measurable dimension of LLM behavior.
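The core measurement is simple enough to state in a few lines. A minimal sketch, assuming the teaching score is the student's accuracy gain on held-out problems after instruction (function and variable names are my own, not the benchmark's):

```python
# Toy version of the syllabus-grounded metric: teaching ability is scored
# by how much the student model improves after multi-turn instruction.

def accuracy(answers, key):
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def teaching_score(pre_answers, post_answers, answer_key):
    # post-teaching accuracy minus pre-teaching accuracy
    return accuracy(post_answers, answer_key) - accuracy(pre_answers, answer_key)

key  = ["A", "C", "B", "D"]
pre  = ["A", "B", "B", "A"]   # student before teaching: 2/4 correct
post = ["A", "C", "B", "A"]   # student after teaching:  3/4 correct
print(teaching_score(pre, post, key))  # 0.25
```

Because the teacher agent only sees the syllabus's knowledge points and example problems, never the evaluation items, a positive score reflects transferred understanding rather than leaked answers.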

Jan 2026 | CoLT: Reasoning with Chain of Latent Tool Calls

Chain-of-Thought (CoT) is a critical technique for enhancing the reasoning ability of Large Language Models (LLMs), and latent reasoning methods have been proposed to accelerate the inefficient token-level reasoning chain. We observe that existing latent reasoning methods generally require model-structure augmentation and exhaustive training, limiting their broader applicability. In this paper, we propose CoLT, a novel framework that implements latent reasoning as "tool calls". Instead of reasoning entirely in the latent space, CoLT generates seed tokens that encode the information of a reasoning step. When a latent tool call is triggered, a smaller external model takes the hidden states of the seed tokens as its input and unpacks them back into a full reasoning step. In this way, the main model keeps reasoning in the explicit token space, preserving its ability while improving efficiency. Experimental results on four mathematical datasets demonstrate that CoLT achieves higher accuracy and shorter reasoning length than baseline latent models, and is compatible with reinforcement learning algorithms and different decoder structures.
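The control flow of a latent tool call can be sketched as follows; the seed tokens, hidden states, and decoder are all stubbed, and every name here is an illustrative assumption rather than CoLT's interface:

```python
# Conceptual sketch of a CoLT-style latent tool call: the main model emits
# a few compact seed tokens, and a smaller external model expands their
# hidden states back into an explicit reasoning step.

def main_model_step(context):
    # emit seed tokens plus their hidden states (stubbed with fixed values)
    seeds = ["<seed:carry>", "<seed:add>"]
    hidden = [[0.1, 0.2], [0.3, 0.4]]   # stand-in for real hidden states
    return seeds, hidden

def latent_tool(hidden_states):
    # smaller decoder unpacks hidden states into a full explicit step
    return "Add the units digits, carry the 1, then add the tens digits."

context = "Compute 17 + 25 step by step."
seeds, hidden = main_model_step(context)
full_step = latent_tool(hidden)         # explicit text re-enters the context
print(len(seeds) < len(full_step.split()))  # True: seeds are far shorter
```

The efficiency gain comes from the main model decoding only the short seed sequence, while the expanded step keeps the visible reasoning trace in token space.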

Jan 2026 | A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization

Reinforcement learning (RL) has become a central technique for improving the reasoning and alignment capabilities of large language models (LLMs). In practice, however, large-scale RL training often prioritizes efficiency, where rollouts are generated by slightly outdated policies, leading to off-policy optimization. While many practical RL algorithms rely on token-level importance sampling ratios to correct the resulting distribution mismatch, training on such stale rollouts can still cause instability or even collapse in LLM RL post-training, especially when the policy drift becomes large.

In this work, we revisit the theoretical foundations of policy optimization for LLMs and show that the correct correction term should be the prefix importance ratio, rather than the token-level approximations commonly used in practice. Based on this insight, we propose a simple and effective objective that preserves essential prefix-level information while avoiding numerical instability, leading to more stable RL post-training for large language models.
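As a numerical illustration of the distinction (toy probabilities, not values from the paper): the token-level ratio corrects only the current step, while the prefix importance ratio is the cumulative product of per-token ratios over the whole prefix, so per-step drift compounds:

```python
# Token-level vs. prefix importance ratios for a 3-token response.
# pi_old: per-token probabilities under the (stale) rollout policy.
# pi_new: per-token probabilities under the current policy.

pi_old = [0.4, 0.5, 0.2]
pi_new = [0.2, 0.5, 0.4]

# token-level ratio at step t: r_t = pi_new(y_t | prefix) / pi_old(y_t | prefix)
token_ratios = [n / o for n, o in zip(pi_new, pi_old)]

# prefix ratio at step t: product of r_1 ... r_t
prefix_ratios = []
acc = 1.0
for r in token_ratios:
    acc *= r
    prefix_ratios.append(acc)

print(token_ratios)    # [0.5, 1.0, 2.0]
print(prefix_ratios)   # [0.5, 0.5, 1.0]
```

Note how the token-level view sees step 2 as perfectly on-policy (ratio 1.0) even though the prefix it conditions on was half as likely under the new policy; the prefix ratio (0.5) retains that information. Long products of such ratios can explode or vanish, which is why the paper's objective keeps prefix-level information while guarding against numerical instability.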

Dec 2025 | Thinking with Programming Vision: Towards a Unified View for Thinking with Images [CVPR 2026]

Current Multimodal Large Language Models (MLLMs) often exhibit surprising brittleness when facing simple real-world perturbations, such as image rotation or flipping, and are typically constrained by a narrow set of pre-defined tools like cropping. To address these limitations, we propose CodeVision, a novel framework that introduces a "code-as-tool" paradigm. Instead of relying on a fixed registry of tools, our approach empowers the model to generate executable code as a universal interface, enabling it to dynamically invoke a virtually unlimited range of image operations. This shift not only eliminates the need for manual tool specification but also significantly enhances the model's flexibility and scalability in handling complex visual reasoning tasks.
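A toy sketch of the code-as-tool idea (not the CodeVision implementation): the model emits a code snippet rather than selecting from a tool registry, and the snippet is executed against the image. Here the "image" is a nested list and the "model output" is a hard-coded string for illustration:

```python
# Execute model-generated code against an image. A real system would run
# this in a proper sandbox; passing an empty globals dict here is only a
# minimal illustration of scoping the execution.

def run_image_tool(code, image):
    scope = {"image": image}
    exec(code, {}, scope)       # generated code reads `image`, sets `result`
    return scope["result"]

image = [[1, 2],
         [3, 4]]

# e.g. the model decides the image should be rotated 90 degrees clockwise,
# an operation a fixed crop-only tool registry could not express
generated_code = "result = [list(row) for row in zip(*image[::-1])]"
print(run_image_tool(generated_code, image))  # [[3, 1], [4, 2]]
```

Because the interface is arbitrary code, the same mechanism covers rotation, flipping, color transforms, or any composition of operations without enumerating tools in advance, which is the flexibility the abstract claims.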