LitReview Arena: Evaluating Literature Review Agents with a Battle-style Peer Review Platform
Ruotong Zhao ⋅ Zhiyu Chen ⋅ Xurui Liu ⋅ Haidong Xue ⋅ Dong Liang ⋅ Jigao Fu ⋅ Wu YanBiao ⋅ Yuanyi Zhen ⋅ Fengli Xu ⋅ Yong Li
Abstract
Literature reviews are essential for capturing the landscape of a research field. Large language models, especially deep research agents, have recently shown strong capabilities in automated literature review generation. However, rigorously evaluating the scientific value of the generated reviews remains challenging: human expert annotation is difficult to scale, and LLM-as-a-judge approaches lack convincing criteria. To address this gap, we introduce LitReview Arena, a battle-style evaluation platform with a structured protocol tailored to literature review quality. Our protocol imitates academic peer review: we recruit domain experts with research paper-writing experience and match each query to reviewers within their area of expertise. Each battle is judged with dimension-wise outcomes over five literature-review-specific criteria, enabling reproducible and diagnostic comparisons across systems. We collect a large-scale human preference dataset of expert votes (4,984 votes × 5 dimensions) and systematically measure how far current models are from human drafts. Results show that even the most advanced models win only 23.0\% of decisive matches against humans on overall utility, leaving substantial room for improvement. Meanwhile, agentic LLMs such as Sonar Deep Research outperform base language models by over 60\%. We also find that existing LLM-as-a-judge evaluation methods are severely misaligned with human experts (Spearman's $\rho \approx 0.467$). Based on the collected preference data, we provide an expert-calibrated evaluator, \emph{LitJudge}, which improves alignment to $\rho \approx 0.78$, comparable to inter-expert consistency. Code and datasets are publicly available at https://anonymous.4open.science/r/LitReview-Arena-3B82/.
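To make the battle-style protocol concrete, the following is a minimal sketch (not the authors' released code) of how dimension-wise expert votes could be aggregated into per-model win rates over decisive matches, and how a judge's alignment with human experts could be scored with Spearman's $\rho$. The `Vote` record, the `win_rates` and `judge_alignment` helpers, and the example dimension name are illustrative assumptions, not the platform's actual data schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from scipy.stats import spearmanr

# Hypothetical vote record: one expert judgment on one battle, for one
# of the five literature-review-specific criteria.
@dataclass
class Vote:
    model_a: str
    model_b: str
    dimension: str   # e.g., "overall_utility"
    winner: str      # "a", "b", or "tie"

def win_rates(votes, dimension):
    """Per-model win rate over decisive (non-tie) matches in one dimension."""
    wins, games = defaultdict(int), defaultdict(int)
    for v in votes:
        if v.dimension != dimension or v.winner == "tie":
            continue
        winner = v.model_a if v.winner == "a" else v.model_b
        for m in (v.model_a, v.model_b):
            games[m] += 1
        wins[winner] += 1
    return {m: wins[m] / games[m] for m in games}

def judge_alignment(human_scores, judge_scores):
    """Spearman's rho between human-expert and automatic-judge model scores."""
    models = sorted(human_scores)
    rho, _ = spearmanr([human_scores[m] for m in models],
                       [judge_scores[m] for m in models])
    return rho

if __name__ == "__main__":
    votes = [
        Vote("human_draft", "sonar_deep_research", "overall_utility", "a"),
        Vote("human_draft", "base_llm", "overall_utility", "a"),
        Vote("sonar_deep_research", "base_llm", "overall_utility", "a"),
    ]
    rates = win_rates(votes, "overall_utility")
    print(rates)  # {'human_draft': 1.0, 'sonar_deep_research': 0.5, 'base_llm': 0.0}
    print(judge_alignment(rates, rates))  # rho = 1.0 when rankings agree perfectly
```

Under this reading, the reported $\rho \approx 0.467$ for existing LLM judges versus $\rho \approx 0.78$ for LitJudge would be computed by comparing judge-derived rankings against the human-expert rankings in exactly this fashion.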