
<aside> 🔬
BandAI is dedicated to cutting-edge AI research and the exploration of next-generation AI products. Our research spans LLMs, VLMs, and agents, with a focus on frontier topics such as Deep Research, Agentic RL, and Self-Evolution. At the same time, we are committed to applying these technologies on Douyin, leveraging our data and technical strengths to deliver unprecedented experiences to users.
Join us and help ensure AI creates real-world utility.
</aside>
Current Multimodal Large Language Models (MLLMs) often exhibit surprising brittleness when facing simple real-world perturbations, such as image rotation or flipping, and are typically constrained by a narrow set of pre-defined tools like cropping. To address these limitations, we propose CodeVision, a novel framework that introduces a "code-as-tool" paradigm. Instead of relying on a fixed registry of tools, our approach empowers the model to generate executable code as a universal interface, enabling it to dynamically invoke a virtually unlimited range of image operations. This shift not only eliminates the need for manual tool specification but also significantly enhances the model's flexibility and scalability in handling complex visual reasoning tasks.
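To make the paradigm concrete, below is a minimal sketch of what a single code-as-tool turn might look like, assuming a PIL-based executor; the function names, restricted namespace, and file paths are illustrative assumptions, not CodeVision's actual interface.

```python
# Illustrative "code-as-tool" turn (names and paths are hypothetical, not the
# paper's interface). Instead of calling a fixed tool from a registry, the model
# emits arbitrary Python; the executor runs it and feeds the result back.
from PIL import Image, ImageEnhance

def execute_model_code(code: str, image: Image.Image) -> Image.Image:
    """Run model-generated code in a restricted namespace and return the edited image."""
    namespace = {"Image": Image, "ImageEnhance": ImageEnhance, "img": image}
    exec(code, namespace)  # a real system would sandbox this call
    return namespace["img"]

# The model can compose operations never registered as tools, e.g. undoing a
# rotation and then boosting contrast before re-inspecting the image.
model_generated_code = """
img = img.rotate(-90, expand=True)
img = ImageEnhance.Contrast(img).enhance(1.8)
"""

image = Image.open("input.png")                      # hypothetical input
image = execute_model_code(model_generated_code, image)
image.save("turn_1.png")                             # returned to the model for the next turn
```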
To achieve robust tool-use capabilities, we employ a two-stage training methodology consisting of cold-start Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). By utilizing a novel dense process reward function, we guide the model to develop strategic reasoning and error recovery skills throughout multi-turn interactions. Experiments on the Qwen series demonstrate that CodeVision significantly improves robustness and fosters emergent behaviors, such as the spontaneous composition of unseen tools (e.g., contrast enhancement) to solve novel problems. On challenging benchmarks like MVToolBench, our model achieves state-of-the-art performance, surpassing leading models such as GPT-5 and Gemini.
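As a rough illustration of how a dense process reward can complement the sparse outcome reward in multi-turn training, the sketch below blends per-turn signals with final correctness; the specific terms and weights are assumptions, not CodeVision's exact formulation.

```python
from dataclasses import dataclass

# Hypothetical dense process reward for a multi-turn tool-use trajectory; the
# reward terms and weights here are assumptions, not the paper's definition.
@dataclass
class Turn:
    code_executed_ok: bool      # the generated code ran without error
    recovered_from_error: bool  # a failed turn was followed by a successful retry
    made_progress: bool         # the intermediate result moved closer to the goal

def process_reward(turn: Turn) -> float:
    return 0.2 * turn.code_executed_ok + 0.3 * turn.recovered_from_error + 0.5 * turn.made_progress

def trajectory_reward(turns: list, answer_correct: bool, alpha: float = 0.3) -> float:
    """Blend the per-turn (dense) signal with the final (sparse) outcome reward."""
    dense = sum(process_reward(t) for t in turns) / max(len(turns), 1)
    outcome = 1.0 if answer_correct else 0.0
    return alpha * dense + (1.0 - alpha) * outcome
```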
ShoppingComp fundamentally reshapes the evaluation paradigm for e-commerce agents. Rather than relying on closed sandbox tests, it places models directly into an open, verifiable, real-world shopping environment. The benchmark, meticulously constructed by 35 domain experts, covers 120 tasks and 1,026 authentic scenarios. Within these expert-designed, complex contexts, models are required to use real online search tools to complete comprehensive shopping tasks. A clearly defined rubric-based framework is employed to systematically measure three core capabilities: precision of product retrieval, professionalism of shopping reports, and reliability of safety-critical decisions. Unlike prior datasets, ShoppingComp exposes models to long-overlooked real-world challenges such as dynamic product information, online noise, and misleading marketing content.
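As a rough illustration of how such a rubric can aggregate into an overall score, the sketch below averages weighted item-level verdicts across the three capability dimensions; the item names and weights are assumptions, not ShoppingComp's actual rubric.

```python
# Hypothetical rubric aggregation; the dimensions follow the three capabilities
# above, but the individual items and weights are illustrative only.
RUBRIC = {
    "retrieval_precision": [("matches_requirements", 0.6), ("no_hallucinated_products", 0.4)],
    "report_quality":      [("covers_key_attributes", 0.5), ("cites_sources", 0.5)],
    "safety_reliability":  [("flags_unsafe_use_cases", 0.7), ("ignores_misleading_claims", 0.3)],
}

def score(judgments: dict) -> float:
    """judgments maps each rubric item to a 0/1 verdict from a grader."""
    per_dim = [
        sum(weight * judgments.get(item, 0) for item, weight in items)
        for items in RUBRIC.values()
    ]
    return sum(per_dim) / len(per_dim)  # average across the three capability dimensions
```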
Results reveal severe limitations of current large language models in real-world shopping environments: even state-of-the-art systems such as GPT-5 and Gemini 2.5 Pro achieved very low overall scores (11.22% and 7.7%, respectively). These models frequently made critical mistakes, such as failing to identify unsafe use cases or being misled by promotional claims. Consequently, ShoppingComp effectively bridges the gap between existing benchmarks and practical deployment, establishing a key performance baseline for the development of next-generation shopping assistants.
Large language models (LLMs) have demonstrated significant improvements when optimized with reinforcement learning (RL), particularly with verifiable rewards, which show strong promise for eliciting long chains of reasoning. However, existing methods rely solely on the model's own ability to explore reasoning trajectories, so feedback arrives only from the final reward and is therefore sparse and delayed.
To address this limitation, we propose Expert-Assisted Policy Optimization (EAPO), a novel method that treats consulting an expert as a learnable action. This enables the policy to learn when and how to query the expert during RL training. During evaluation, to ensure a fair judgement of the optimized model’s capability, the policy generates responses entirely on its own, without expert guidance. Extensive experiments show that EAPO effectively enhances policy optimization by leveraging expert assistance, providing richer feedback signals beyond the final verifiable reward and ultimately improving the reasoning ability of LLMs.
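A minimal sketch of treating the expert query as a learnable action during rollouts is shown below; the special token, stop marker, and policy/expert interfaces are illustrative assumptions, not EAPO's actual implementation.

```python
# Sketch of "consulting an expert" as a learnable action during RL rollouts.
# Token names and the policy/expert interfaces are hypothetical.
ASK_EXPERT = "<ask_expert>"

def rollout(policy, expert, prompt: str, training: bool, max_steps: int = 32):
    context, trajectory = prompt, []
    for _ in range(max_steps):
        step = policy.generate_step(context)           # next reasoning step or special action
        if step.strip() == ASK_EXPERT and training:
            hint = expert.advise(context)              # expert guidance, available only in training
            context += f"\n[expert]: {hint}"
            trajectory.append(("expert_query", hint))
            continue
        context += "\n" + step
        trajectory.append(("policy_step", step))
        if "<final_answer>" in step:                   # hypothetical stop marker
            break
    return trajectory
```

At evaluation time the same loop runs with `training=False`, so expert queries are never answered and the policy must reason entirely on its own, matching the evaluation protocol described above.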
Existing text-image generation approaches typically fall into two categories: (1) unified generative-understanding models, and (2) two-stage pipelines that first generate text and then images. Both approaches rely heavily on diffusion models for image synthesis, making it difficult to produce the accurate, factual visual content required for tasks such as news reporting or data-analysis reports.
To address this gap, we propose LLM-I, which deeply integrates online search, image editing, diffusion generation, and code visualization within an LLM. Leveraging the reasoning capabilities of large language models, LLM-I can generate rich, reliable multimodal reports for use cases like data analytics and product comparisons. In addition, we introduce a reinforcement learning (RL) strategy that significantly enhances the model’s ability to intelligently select and orchestrate tools while maintaining tight alignment between text and visuals. We validate our approach on Qwen3-4B, Qwen3-30B, Qwen2.5-VL-7B, and Qwen2.5-VL-32B, achieving new SOTA performance across four diverse benchmarks.
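The sketch below illustrates one way the four tool families could be exposed to the LLM behind a single dispatch interface; the function names and call schema are assumptions for exposition, not LLM-I's actual API.

```python
# Illustrative dispatcher over the four tool families LLM-I integrates; all
# names and signatures here are hypothetical.
from typing import Callable, Dict

def web_search(query: str) -> str:
    raise NotImplementedError  # retrieve up-to-date facts and images from the web

def edit_image(path: str, instruction: str) -> str:
    raise NotImplementedError  # e.g. crop, annotate, or recolor an existing image

def diffusion_generate(prompt: str) -> str:
    raise NotImplementedError  # synthesize an illustrative image from text

def render_chart(code: str) -> str:
    raise NotImplementedError  # execute plotting code (e.g. matplotlib) for factual charts

TOOLS: Dict[str, Callable[..., str]] = {
    "search": web_search,
    "image_edit": edit_image,
    "diffusion": diffusion_generate,
    "code_viz": render_chart,
}

def dispatch(tool_call: dict) -> str:
    """Execute one tool call emitted by the LLM, e.g. {"name": "code_viz", "args": {"code": "..."}}."""
    return TOOLS[tool_call["name"]](**tool_call["args"])
```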
SamplingEvolve implements test-time scaling as an experience-driven trajectory evolution loop: candidate trajectories are persistently stored in a trajectory pool (including full messages, tool calls, metadata, and parent node IDs). They are then iteratively refined by the evolution engine, evolutioner, and evaluator, leveraging reusable failure cases and natural language feedback as a form of soft gradient to guide optimization.
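A minimal sketch of the trajectory pool and evolution loop described above is given below; field names beyond those in the text, and the evolutioner/evaluator interfaces, are illustrative assumptions rather than the actual implementation.

```python
# Sketch of a persistent trajectory pool and an experience-driven evolution loop.
# Interfaces of `evolutioner` and `evaluator` are hypothetical.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Trajectory:
    messages: list                    # full message history
    tool_calls: list                  # tool invocations made along the way
    metadata: dict = field(default_factory=dict)
    parent_id: Optional[int] = None   # links a refined trajectory to its ancestor

def evolve(pool: list, evolutioner, evaluator, steps: int = 10) -> Trajectory:
    for _ in range(steps):
        candidate = max(pool, key=evaluator.score)       # pick the current best trajectory
        feedback = evaluator.critique(candidate)         # natural-language feedback ("soft gradient")
        child = evolutioner.refine(candidate, feedback)  # produce an improved variant
        child.parent_id = id(candidate)
        pool.append(child)                               # failures stay in the pool as reusable experience
    return max(pool, key=evaluator.score)
```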