

<aside> 🔬

BandAI is dedicated to cutting-edge AI research and the exploration of next-generation AI products. Our research spans LLMs, VLMs, and agents, with a focus on frontier topics such as Deep Research, Agentic RL, and Self-Evolution. At the same time, we are committed to applying these technologies on Douyin, leveraging our data and technical strengths to deliver unprecedented experiences to users.

Join us — help ensure AI creates real-world utility.

</aside>



Sep 2025 | ShoppingComp: Are LLMs Really Ready for Your Shopping Cart?

ShoppingComp fundamentally reshapes the evaluation paradigm for e-commerce agents. Rather than relying on closed sandbox tests, it places models directly into an open, verifiable, real-world shopping environment. The benchmark, carefully constructed by 35 domain experts, covers 120 tasks and 1,026 authentic scenarios. Within these expert-designed, complex contexts, models must use real online search tools to complete comprehensive shopping tasks. A clearly defined rubric-based framework systematically measures three core capabilities: precision of product retrieval, professionalism of shopping reports, and reliability of safety-critical decisions. Unlike prior datasets, ShoppingComp exposes models to long-overlooked real-world challenges such as dynamic product information, online noise, and misleading marketing content.
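The rubric-based scoring can be pictured with a minimal sketch like the one below, which aggregates per-item pass/fail judgments along the three axes. The names (`RubricItem`, `score_task`) and the equal-weight averaging are illustrative assumptions, not ShoppingComp's actual implementation.

```python
# A minimal sketch of rubric-based scoring over the three axes described above.
# All names and the equal-weight averaging are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class RubricItem:
    axis: str          # "retrieval", "report", or "safety"
    description: str   # what the grader checks, e.g. "flags unsafe use case"
    passed: bool       # whether the model's output satisfied this item

def score_task(items: list[RubricItem]) -> dict[str, float]:
    """Per-axis pass rate for one task; the overall score averages the three axes."""
    scores = {}
    for axis in ("retrieval", "report", "safety"):
        axis_items = [i for i in items if i.axis == axis]
        scores[axis] = sum(i.passed for i in axis_items) / len(axis_items) if axis_items else 0.0
    scores["overall"] = sum(scores[a] for a in ("retrieval", "report", "safety")) / 3
    return scores

if __name__ == "__main__":
    example = [
        RubricItem("retrieval", "returns an in-stock product matching all constraints", True),
        RubricItem("report", "cites verifiable product specifications", False),
        RubricItem("safety", "flags the unsafe use case in the prompt", False),
    ]
    print(score_task(example))
```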

Results reveal severe limitations of current large language models in real-world shopping environments: even state-of-the-art systems such as GPT-5 and Gemini 2.5 Pro achieved very low overall scores (11.22% and 7.7%, respectively). These models frequently made critical mistakes, such as failing to identify unsafe use cases or being misled by promotional claims. By surfacing these failures, ShoppingComp bridges the gap between existing benchmarks and practical deployment, establishing a key performance baseline for the development of next-generation shopping assistants.

Sep 2025 | EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance

Large language models (LLMs) have demonstrated significant improvements when optimized with reinforcement learning (RL), particularly RL with verifiable rewards, which has shown strong promise for eliciting long chains of reasoning. However, existing methods rely solely on the model's own ability to explore reasoning trajectories, so the only feedback comes from the final reward and is therefore delayed and sparse.

To address this limitation, we propose Expert-Assisted Policy Optimization (EAPO), a novel method that treats consulting an expert as a learnable action. This enables the policy to learn when and how to query the expert during RL training. During evaluation, to ensure a fair judgement of the optimized model’s capability, the policy generates responses entirely on its own, without expert guidance. Extensive experiments show that EAPO effectively enhances policy optimization by leveraging expert assistance, providing richer feedback signals beyond the final verifiable reward and ultimately improving the reasoning ability of LLMs.
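A minimal sketch of the core idea, treating the expert call as just another action the policy may emit during training rollouts but not at evaluation time, might look like the following. The `<ask_expert>` token, the function names, and the toy policy/expert are assumptions for illustration, not EAPO's released code.

```python
# Illustrative sketch: "consult the expert" as a learnable action during RL rollouts.
# The token, interfaces, and toy stand-ins below are assumptions, not EAPO's actual code.
import random

ASK_EXPERT = "<ask_expert>"

def rollout(policy_step, expert_answer, prompt, max_steps=8, allow_expert=True):
    """Generate a reasoning trajectory; expert calls are only honored in training."""
    trajectory, context = [], prompt
    for _ in range(max_steps):
        action = policy_step(context)            # policy proposes the next step (may be ASK_EXPERT)
        if action == ASK_EXPERT and allow_expert:
            hint = expert_answer(context)        # expert injects guidance mid-trajectory
            trajectory.append((action, hint))
            context += f"\n[expert hint] {hint}"
        else:
            trajectory.append((action, None))
            context += f"\n{action}"
            if action.startswith("FINAL:"):
                break
    return trajectory, context

# Toy stand-ins so the sketch runs end to end.
def toy_policy(ctx):
    return random.choice([ASK_EXPERT, "reason about the problem", "FINAL: 42"])

def toy_expert(ctx):
    return "consider decomposing the problem first"

if __name__ == "__main__":
    # Training-time rollout: expert calls permitted, providing dense intermediate feedback.
    print(rollout(toy_policy, toy_expert, "Q: ...", allow_expert=True)[0])
    # Evaluation: allow_expert=False, so the policy answers entirely on its own.
    print(rollout(toy_policy, toy_expert, "Q: ...", allow_expert=False)[0])
```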

Sep 2025 | LLM-I: LLMs are Naturally Interleaved Multimodal Creators

Existing text-image generation approaches typically fall into two categories: (1) unified generative-understanding models; (2) two-stage pipelines that first generate text and then images. Both approaches rely heavily on diffusion models for image synthesis, making it difficult to produce the accurate, factual visual content required for tasks such as news reporting or data-analysis reports.

To address this gap, we propose LLM-I, which deeply integrates online search, image editing, diffusion generation, and code visualization within an LLM. Leveraging the reasoning capabilities of large language models, LLM-I can generate rich, reliable multimodal reports for use cases like data analytics and product comparisons. In addition, we introduce a reinforcement learning (RL) strategy that significantly enhances the model’s ability to intelligently select and orchestrate tools while maintaining tight alignment between text and visuals. We validate our approach on Qwen3-4B, Qwen3-30B, Qwen2.5-VL-7B, and Qwen2.5-VL-32B, achieving new SOTA performance across four diverse benchmarks.
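To make the tool-orchestration idea concrete, here is a hedged sketch of a dispatcher that interleaves text segments with tool outputs. The tool names and call format are assumptions for exposition, not the actual LLM-I interface.

```python
# Hedged sketch of the tool-routing idea behind LLM-I: an LLM emits interleaved text
# plus tool calls, and a dispatcher realizes each image. Names are illustrative only.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "search_image":  lambda q: f"<image retrieved for '{q}'>",    # factual, real-world images
    "edit_image":    lambda q: f"<edited image: {q}>",            # modify an existing image
    "diffusion_gen": lambda q: f"<generated illustration: {q}>",  # creative synthesis
    "code_viz":      lambda q: f"<chart rendered from: {q}>",     # plots for data analysis
}

def render(segments: list[tuple[str, str]]) -> str:
    """Interleave text segments with the outputs of the tool calls they request."""
    parts = []
    for kind, payload in segments:
        parts.append(payload if kind == "text" else TOOLS[kind](payload))
    return "\n".join(parts)

if __name__ == "__main__":
    report = [
        ("text", "Q3 revenue grew 12% quarter over quarter."),
        ("code_viz", "bar chart of quarterly revenue"),
        ("text", "The flagship product shipped in a new colorway:"),
        ("search_image", "flagship product official press photo"),
    ]
    print(render(report))
```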

Sep 2025 | SamplingEvolve: Test-Time Scaling through Experience-Guided Trajectory Evolution

SamplingEvolve implements test-time scaling as an experience-driven trajectory evolution loop: candidate trajectories are persistently stored in a trajectory pool (including full messages, tool calls, metadata, and parent node IDs). They are then iteratively refined by the evolution engine, evolutioner, and evaluator, leveraging reusable failure cases and natural language feedback as a form of soft gradient to guide optimization.
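The trajectory pool can be imagined as a small parent-linked store. The sketch below, with assumed class and field names rather than the system's real ones, shows how a node carries its full messages, score, feedback, and parent ID so the evolution loop can branch from any ancestor.

```python
# Minimal sketch of a trajectory pool with parent links, as described above.
# Class and field names are assumptions for illustration.
from dataclasses import dataclass, field
import itertools

_ids = itertools.count()

@dataclass
class TrajectoryNode:
    messages: list[dict]               # full conversation, including tool calls
    score: float | None = None         # set by the evaluator
    feedback: str = ""                 # natural-language critique (the "soft gradient")
    parent_id: int | None = None       # node this trajectory was evolved from
    node_id: int = field(default_factory=lambda: next(_ids))

class TrajectoryPool:
    def __init__(self):
        self.nodes: dict[int, TrajectoryNode] = {}

    def add(self, node: TrajectoryNode) -> int:
        self.nodes[node.node_id] = node
        return node.node_id

    def best(self) -> TrajectoryNode:
        return max(self.nodes.values(),
                   key=lambda n: n.score if n.score is not None else float("-inf"))

if __name__ == "__main__":
    pool = TrajectoryPool()
    root = TrajectoryNode(messages=[{"role": "user", "content": "solve task"}],
                          score=0.3, feedback="missed the second sub-question")
    root_id = pool.add(root)
    # An evolved child keeps a pointer back to its parent so lineage is preserved.
    child = TrajectoryNode(messages=root.messages + [{"role": "assistant", "content": "retry"}],
                           score=0.7, parent_id=root_id)
    pool.add(child)
    print(pool.best().node_id)
```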

We evaluated the system on GAIA-text (20 rounds) and BrowseComp (10 rounds). On GAIA, cumulative accuracy improved to 70.37% after only three iterations (versus 62.96% with Pass@N), and continued evolution reached 86.42% (+23.46 points over the Pass@N baseline). On the first 100 BrowseComp problems, peak performance reached 43.00%. All experiments used either LLM-based or rule-based evaluators to provide verifiable feedback, which was written back into the trajectory pool so that results remain reproducible.

Aug 2025 | ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

ReportBench fills a gap in evaluating “Deep Research” agents by creating a reproducible benchmark from arXiv survey papers that measures two concrete axes: citation relevance/coverage and factual faithfulness of generated claims. It reframes evaluation away from subjective writing style and toward verifiable research skills.
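The two axes lend themselves to simple, verifiable metrics. The sketch below illustrates one plausible form of them; the function names and the set-overlap formulation are chosen for exposition rather than taken from the benchmark's actual scoring code.

```python
# Illustrative sketch of the two axes ReportBench measures; names and formulas are assumptions.
def citation_coverage(predicted_refs: set[str], survey_refs: set[str]) -> dict[str, float]:
    """Precision/recall of the agent's citations against the source survey's references."""
    hit = predicted_refs & survey_refs
    return {
        "precision": len(hit) / len(predicted_refs) if predicted_refs else 0.0,
        "recall": len(hit) / len(survey_refs) if survey_refs else 0.0,
    }

def claim_faithfulness(claims: list[tuple[str, bool]]) -> float:
    """Fraction of generated claims judged supported (the bool would come from a verifier)."""
    return sum(ok for _, ok in claims) / len(claims) if claims else 0.0

if __name__ == "__main__":
    print(citation_coverage({"arXiv:2401.00001", "arXiv:2312.99999"},
                            {"arXiv:2401.00001", "arXiv:2305.12345"}))
    print(claim_faithfulness([("LLMs improve with RL", True), ("X outperforms Y by 50%", False)]))
```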