Jiahao Wu, Zhongwen Xu, Qiang Fu, and Wei Yang
Tencent · TEG · AIPD
December 2025
TL;DR
- Using the **rLLM [1] training framework + Qwen3-8B [2] base model + offline Wikipedia / BrowseComp-Plus [3] corpora + the synthetic multi-turn data from ASearcher [4]**, you can cost-effectively train a search agent that reliably performs 10+ retrieval turns.
- SFT is not strictly necessary: starting from the base model and running GRPO-based [5] RL is enough to build stable multi-turn search capability. SFT can provide a better starting point for long-horizon training, but it tends to lock the model into a fixed response pattern, making it overuse multi-turn search even for simple questions.
- A multi-turn search training dataset [4] is critical: standard single-hop / two-hop datasets (HotpotQA [6], etc.) do not teach long-horizon multi-turn search; a dedicated synthetic multi-turn dataset is required.
- Training stability is mostly an engineering problem: the key issues are train/inference mismatch, strict token-in-token-out alignment, and handling “abnormal trajectories”.
- Summarizing retrieved documents with an auxiliary LLM [7] improves training stability, supports longer horizons, and generalizes to setups that directly return raw search results.
1. The Research Bottleneck for Multi-Turn Search Agents
Recently, LLM reasoning and tool-use capabilities have improved rapidly, and the open-source community has released many strong models [2,8]. However, in multi-turn, long-horizon search-and-verify settings (multiple retrievals, iterative disambiguation, evidence aggregation, and converging to a final answer), a clear gap remains: smaller open-source models often lag far behind closed-source commercial models in both effective use of search turns and final accuracy.