Accepted for/Published in: JMIR AI
Date Submitted: Sep 24, 2025
Open Peer Review Period: Oct 8, 2025 - Dec 3, 2025
Date Accepted: Jan 5, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Comparative Study of Open-Source Large Language Models for Peer Review in Transplantation Research: Accuracy, Affiliation Bias, and Prompt Engineering
ABSTRACT
Background:
Peer review remains central to research quality assurance, yet it suffers from reviewer fatigue and human bias. The rapid rise in scientific publishing has worsened these challenges, prompting interest in whether large language models (LLMs) can support or improve the peer review process.
Objective:
This study aims to address critical gaps in the use of LLMs for peer review of transplantation papers by: (1) comparing the performance of five recent open-source LLMs; (2) evaluating the impact of author affiliations—prestigious, less prestigious, and none—on LLM review outcomes; and (3) examining the influence of prompt engineering strategies, including zero-shot, few-shot, tree-of-thought (ToT), and retrieval-augmented generation (RAG), on review decisions.
Methods:
A dataset of 200 transplantation papers published between 2024 and 2025 across four journal quartiles was evaluated using five state-of-the-art open-source LLMs (Llama 3.3, Mistral 7B, Gemma 2, DeepSeek r1-distill Qwen, and Qwen 2.5). Four prompting techniques (zero-shot, few-shot, tree-of-thought [ToT], and retrieval-augmented generation [RAG]) were tested under multiple temperature settings. Models were instructed to categorize papers into quartiles. To assess fairness, each paper was evaluated three times: with no affiliation, a prestigious affiliation, and a less prestigious affiliation. Accuracy, decisions, runtime, and computing resource usage were recorded. χ² (chi-square) tests and adjusted Pearson residuals were used to examine the presence of affiliation bias.
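The χ² test with adjusted Pearson residuals described above can be sketched as follows. This is a minimal illustration, not the study's analysis code: the contingency table counts are invented, and the rows/columns merely mirror the design (three affiliation conditions by four predicted quartiles).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = affiliation condition,
# columns = predicted quartile (Q1..Q4). Counts are illustrative only.
table = np.array([
    [10, 70, 90, 30],   # no affiliation
    [12, 75, 85, 28],   # prestigious affiliation
    [9, 68, 92, 31],    # less prestigious affiliation
])

# Omnibus chi-square test of independence between affiliation and quartile.
chi2, p, dof, expected = chi2_contingency(table)

# Adjusted (standardized) Pearson residuals: cell-level deviations from
# independence, approximately N(0, 1) under the null, so |residual| > 1.96
# flags cells driving any association.
n = table.sum()
row = table.sum(axis=1, keepdims=True)
col = table.sum(axis=0, keepdims=True)
adj_resid = (table - expected) / np.sqrt(
    expected * (1 - row / n) * (1 - col / n)
)
```

With three conditions and four quartiles, the test has (3 − 1) × (4 − 1) = 6 degrees of freedom; a nonsignificant omnibus p with no large residuals is consistent with the "no affiliation bias" finding reported in the Results.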
Results:
RAG at temperature 0.5 achieved the best overall performance (exact match accuracy: 0.35; close match accuracy: 0.78). Across all models, LLMs frequently assigned manuscripts to Q2 and Q3 while avoiding the extreme quartiles (Q1 and Q4). None of the models demonstrated affiliation bias, though Gemma 2 and Qwen 2.5 approached statistical significance in some cases. Each model displayed a distinct “personality” in its quartile predictions, which influenced consistency. Mistral 7B achieved the highest exact match accuracy (0.35) despite having the lowest average runtime (1246.378 s) and the smallest parameter count (7 billion). While this accuracy is insufficient for independent review, LLMs showed value in supporting preliminary triage tasks.
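The two accuracy metrics reported above can be illustrated with a minimal sketch. The quartile labels below are invented, and "close match" is assumed here to mean a prediction within one quartile of the true label, a definition the abstract itself does not spell out.

```python
# Hypothetical true vs. predicted journal quartiles (1 = Q1 .. 4 = Q4);
# values are illustrative, not the study's data.
true_q = [1, 2, 3, 4, 2, 3]
pred_q = [2, 2, 3, 3, 2, 4]

# Exact match: predicted quartile equals the true quartile.
exact = sum(p == t for p, t in zip(pred_q, true_q)) / len(true_q)

# Close match (assumption): prediction within one quartile of the truth.
close = sum(abs(p - t) <= 1 for p, t in zip(pred_q, true_q)) / len(true_q)
```

Under this assumed definition, close match accuracy is always at least exact match accuracy, which matches the relationship between the reported figures (0.78 vs. 0.35).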
Conclusions:
Current open-source LLMs are not reliable enough to replace human peer reviewers but can meaningfully reduce workload by supporting early-stage manuscript triage. Importantly, affiliation bias was largely absent, suggesting LLMs may offer a pathway to more equitable peer review. RAG with moderate temperature emerged as the most effective prompting strategy. A hybrid system integrating LLMs with human oversight may enhance efficiency while maintaining rigor and integrity in scholarly publishing.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.