Accepted for/Published in: JMIR AI

Date Submitted: Sep 24, 2025
Open Peer Review Period: Oct 8, 2025 - Dec 3, 2025
Date Accepted: Jan 5, 2026

The final, peer-reviewed published version of this preprint can be found here:

Shen SM, Wang Z, Paul K, Li MH, Huang X, Koizumi N

Evaluation of Large Language Models for Peer Review in Transplantation Research: Algorithm Validation Study

JMIR AI 2026;5:e84322

DOI: 10.2196/84322

PMID: 41672474

PMCID: 12936655

Comparative Study of Open-Source Large Language Models for Peer Review in Transplantation Research: Accuracy, Affiliation Bias, and Prompting Engineering

  • Selena Ming Shen; 
  • Zifu Wang; 
  • Krittika Paul; 
  • Meng-Hao Li; 
  • Xiao Huang; 
  • Naoru Koizumi

ABSTRACT

Background:

Peer review remains central to research quality assurance, yet it suffers from reviewer fatigue and human bias. The rapid rise in scientific publishing has worsened these challenges, prompting interest in whether large language models (LLMs) can support or improve the peer review process.

Objective:

This study aims to address critical gaps in the use of LLMs for peer review of transplantation papers by: (1) comparing the performance of five recent open-source LLMs; (2) evaluating the impact of author affiliations—prestigious, less prestigious, and none—on LLM review outcomes; and (3) examining the influence of prompt engineering strategies, including zero-shot, few-shot, tree-of-thought (ToT), and retrieval-augmented generation (RAG), on review decisions.
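The three objectives above imply a factorial evaluation grid. A minimal sketch of that grid, assuming three temperature values (the study reports only "multiple temperature settings") and short labels for the models and prompting strategies:

```python
from itertools import product

# Model, prompt, and affiliation labels follow the study design;
# the three temperature values below are assumed, not reported.
models = ["llama-3.3", "mistral-7b", "gemma-2",
          "deepseek-r1-distill-qwen", "qwen-2.5"]
prompts = ["zero-shot", "few-shot", "tot", "rag"]
affiliations = ["none", "prestigious", "less-prestigious"]
temperatures = [0.0, 0.5, 1.0]  # assumed values

# Every condition under which a single paper would be reviewed
runs = list(product(models, prompts, affiliations, temperatures))
print(len(runs))  # 5 * 4 * 3 * 3 = 180 configurations per paper
```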

Methods:

A dataset of 200 transplantation papers published between 2024 and 2025 across four journal quartiles was evaluated using five state-of-the-art open-source LLMs (Llama 3.3, Mistral 7B, Gemma 2, DeepSeek R1-Distill-Qwen, and Qwen 2.5). Four prompting techniques (zero-shot, few-shot, tree-of-thought [ToT], and retrieval-augmented generation [RAG]) were tested under multiple temperature settings. Models were instructed to categorize papers into journal quartiles. To assess fairness, each paper was evaluated three times: with no affiliation, a prestigious affiliation, and a less prestigious affiliation. Accuracy, decisions, runtime, and computing resource usage were recorded. Chi-square tests and adjusted Pearson residuals were used to examine the presence of affiliation bias.
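The affiliation-bias analysis (chi-square test plus adjusted Pearson residuals) can be sketched as follows. The counts are hypothetical, chosen only to illustrate the shape of the contingency table: affiliation condition crossed with the quartile the model assigned.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = affiliation condition shown to the model,
# columns = quartile the model assigned (Q1..Q4).
obs = np.array([
    [12, 68, 95, 25],   # no affiliation
    [15, 72, 90, 23],   # prestigious affiliation
    [10, 65, 98, 27],   # less prestigious affiliation
])

chi2, p, dof, expected = chi2_contingency(obs)

# Adjusted Pearson residuals: (O - E) / sqrt(E * (1 - p_row) * (1 - p_col));
# a cell with |residual| > 1.96 deviates from independence at the 5% level.
n = obs.sum()
p_row = obs.sum(axis=1, keepdims=True) / n
p_col = obs.sum(axis=0, keepdims=True) / n
adj_resid = (obs - expected) / np.sqrt(expected * (1 - p_row) * (1 - p_col))

print(f"chi2={chi2:.2f}, p={p:.3f}, dof={dof}")
```

With a 3x4 table the test has (3-1)*(4-1) = 6 degrees of freedom; a non-significant p-value, as reported for most models here, indicates no detectable affiliation bias.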

Results:

RAG at temperature 0.5 achieved the best overall performance (exact match accuracy: 0.35; close match accuracy: 0.78). Across all models, LLMs frequently assigned manuscripts to Q2 and Q3 while avoiding the extreme quartiles (Q1 and Q4). None of the models demonstrated affiliation bias, though Gemma 2 and Qwen 2.5 approached statistical significance in some cases. Each model displayed a distinct "personality" in its quartile predictions, which influenced consistency. Mistral achieved the highest exact match accuracy (0.35) despite having the lowest average runtime (1246.378 s) and the smallest model size (7 billion parameters). While accuracy was insufficient for independent review, LLMs showed value in supporting preliminary triage tasks.
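The two accuracy measures can be made concrete with a small helper, assuming "close match" means a prediction within one quartile of the true label (the abstract does not define the term):

```python
def quartile_accuracy(true_q, pred_q):
    """Exact match: predicted quartile equals the true quartile.
    Close match (assumed definition): prediction within one quartile."""
    n = len(true_q)
    exact = sum(t == p for t, p in zip(true_q, pred_q)) / n
    close = sum(abs(t - p) <= 1 for t, p in zip(true_q, pred_q)) / n
    return exact, close

# Toy example with quartiles encoded 1..4
true_q = [1, 2, 3, 4, 2, 3]
pred_q = [2, 2, 3, 2, 3, 3]
exact, close = quartile_accuracy(true_q, pred_q)
print(exact, round(close, 2))  # 0.5 0.83
```

Under this reading, the reported gap between exact (0.35) and close (0.78) match accuracy means the models were usually off by at most one quartile.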

Conclusions:

Current open-source LLMs are not reliable enough to replace human peer reviewers but can meaningfully reduce workload by supporting early-stage manuscript triage. Importantly, affiliation bias was largely absent, suggesting LLMs may offer a pathway to more equitable peer review. RAG with moderate temperature emerged as the most effective prompting strategy. A hybrid system integrating LLMs with human oversight may enhance efficiency while maintaining rigor and integrity in scholarly publishing.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license upon publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.