Jiahao Wu, Zhongwen Xu, Qiang Fu, and Wei Yang
Tencent · TEG · AIPD
December 2025
TL;DR
- Using the **rLLM [1] training framework + Qwen3-8B [2] base model + offline Wikipedia / BrowseComp-Plus [3] corpora + the synthetic multi-turn data from ASearcher [4]**, you can cost-effectively train a search agent that reliably performs 10+ retrieval turns.
- SFT is not strictly necessary: starting from the base model and running GRPO-based [5] RL is enough to build stable multi-turn search capability. SFT can provide a better starting point for long-horizon training, but it tends to lock the model into a fixed response pattern, making it overuse multi-turn search even for simple questions.
- A multi-turn search training dataset [4] is critical: standard single-hop / two-hop datasets (HotpotQA [6], etc.) do not teach long-horizon multi-turn search; a dedicated synthetic multi-turn dataset is required.
- Training stability is mostly an engineering problem: the key issues are train/inference mismatch, strict token-in-token-out alignment, and handling “abnormal trajectories”.
- Summarizing retrieved documents with an auxiliary LLM [7] improves training stability, supports longer horizons, and generalizes to setups that directly return raw search results.
1. The Research Bottleneck for Multi-Turn Search Agents
Recently, LLM reasoning and tool-use capabilities have improved rapidly, and the open-source community has released many strong models [2,8]. However, in multi-turn, long-horizon search-and-verify settings (multiple retrievals, iterative disambiguation, evidence aggregation, and converging to a final answer), a clear gap remains: smaller open-source models often lag far behind closed-source commercial models in both effective use of search turns and final accuracy.