Jiahao Wu, Zhongwen Xu, Qiang Fu, and Wei Yang
Tencent · TEG · AIPD
December 2025
Recently, LLM reasoning and tool-use capabilities have improved rapidly, and the open-source community has released many strong models [2,8]. However, in multi-turn, long-horizon search-and-verify settings (multiple retrievals, iterative disambiguation, evidence aggregation, and convergence to a final answer), a clear gap remains: smaller open-source models often lag far behind closed-source commercial models in both effective use of search turns and final accuracy.

To narrow this gap, the community has proposed many effective approaches [4,7,9,10,11,12]. Most follow a pipeline of synthetic data construction → SFT alignment → RL optimization to gradually push smaller models toward more robust long-horizon search strategies. For individual researchers, however, these methods remain hard to adopt as a starting point: many are not fully open-sourced, key training details are undisclosed, and training stability is highly sensitive to engineering details (data synthesis and cleaning, tool-environment stability, concurrency and timeouts, truncation and abnormal-trajectory handling, train/inference alignment, etc.). Multi-stage training is also harder to reproduce, and paid web search APIs further raise iteration cost.
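To make the "tool-environment stability" point concrete, here is a minimal sketch (not from any of the cited works; the function names and sentinel value are our own illustration) of one common trick: wrapping a flaky search tool in retries with backoff, and returning a sentinel observation on persistent failure so a single tool error does not crash or silently truncate a rollout.

```python
import time


def call_tool_with_retries(tool, query, retries=3, backoff=0.5):
    """Call a possibly-flaky search tool, retrying with exponential backoff.

    On persistent failure, return a sentinel observation instead of raising,
    so the trajectory stays well-formed (abnormal-trajectory handling).
    """
    for attempt in range(retries):
        try:
            return tool(query)
        except Exception:
            # hypothetical policy: sleep backoff * 2^attempt, then retry
            time.sleep(backoff * (2 ** attempt))
    return "[TOOL_ERROR]"  # sentinel kept in the trajectory, visible to the model
```

Keeping the sentinel in the trajectory (rather than dropping the turn) lets the policy learn to recover from tool failures instead of training on silently corrupted rollouts.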
This post is a researcher-oriented tutorial for deploying and training a multi-turn search-agent baseline. We systematize the key engineering choices and stability tricks we found in practice, so that you can train your first multi-turn search agent at lower cost and with better reproducibility, raising average search turns on classic multi-turn search benchmarks from ~2 to 15+ and accuracy from single digits to ~30%.
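To fix intuition for the search-turn budget discussed above, the sketch below shows the skeleton of a multi-turn search-and-verify loop. It is a toy, not the trained agent: the LLM policy is replaced by a fixed list of queries, and `toy_search` is a stub over an in-memory dictionary standing in for a real search API; all names here are hypothetical.

```python
MAX_TURNS = 16  # turn budget, mirroring the 15+ average search turns mentioned above


def toy_search(query, kb):
    """Stub retrieval: look the query up in a tiny in-memory 'web'."""
    return kb.get(query, "no results")


def run_agent(question, query_plan, kb, max_turns=MAX_TURNS):
    """Issue queries turn by turn, aggregating evidence, until an answer
    is found or the turn budget is exhausted.

    In a real agent, `query_plan` would be produced adaptively by the LLM
    from `question` and the evidence so far, not fixed in advance.
    """
    evidence = []
    for turn, query in enumerate(query_plan[:max_turns], start=1):
        obs = toy_search(query, kb)
        evidence.append((query, obs))
        if obs != "no results":
            # toy stopping rule: treat the first hit as the final answer
            return {"answer": obs, "turns": turn, "evidence": evidence}
    return {"answer": None, "turns": len(evidence), "evidence": evidence}
```

The per-turn structure (query → observation → decide to stop or continue) is exactly what the RL stage later optimizes: a stronger policy spends its turns on disambiguating queries rather than stopping early or looping.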