Accepted for/Published in: JMIR AI
Date Submitted: Sep 24, 2025
Open Peer Review Period: Oct 8, 2025 - Dec 3, 2025
Date Accepted: Jan 5, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Comparative Study of Open-Source Large Language Models for Peer Review in Transplantation Research: Accuracy, Affiliation Bias, and Prompt Engineering
ABSTRACT
Background:
Peer review remains central to research quality assurance, yet it suffers from reviewer fatigue and human bias. The rapid rise in scientific publishing has worsened these challenges, prompting interest in whether large language models (LLMs) can support or improve the peer review process.
Objective:
This study aims to address critical gaps in the use of LLMs for peer review of transplantation papers by: (1) comparing the performance of five recent open-source LLMs; (2) evaluating the impact of author affiliations—prestigious, less prestigious, and none—on LLM review outcomes; and (3) examining the influence of prompt engineering strategies, including zero-shot, few-shot, tree-of-thought (ToT), and retrieval-augmented generation (RAG), on review decisions.
Methods:
A dataset of 200 transplantation papers published between 2024 and 2025 across four journal quartiles was evaluated using five state-of-the-art open-source LLMs (Llama 3.3, Mistral 7B, Gemma 2, DeepSeek r1-distill Qwen, and Qwen 2.5). Four prompting techniques (zero-shot, few-shot, tree-of-thought [ToT], and retrieval-augmented generation [RAG]) were tested under multiple temperature settings. Models were instructed to categorize papers into quartiles. To assess fairness, each paper was evaluated three times: with no affiliation, a prestigious affiliation, and a less prestigious affiliation. Accuracy, decisions, runtime, and computing resource usage were recorded. χ² (chi-square) tests and adjusted Pearson residuals were used to examine the presence of affiliation bias.
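The χ² test with adjusted Pearson residuals described above can be sketched as follows. This is a minimal illustration, not the study's analysis code: the contingency table counts are invented, and the rows/columns merely mirror the design (three affiliation conditions by four predicted quartiles).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = affiliation condition,
# columns = predicted quartile (Q1..Q4). Counts are illustrative only.
table = np.array([
    [10, 70, 90, 30],   # no affiliation
    [12, 75, 85, 28],   # prestigious affiliation
    [9, 68, 92, 31],    # less prestigious affiliation
])

# Omnibus chi-square test of independence between affiliation and quartile.
chi2, p, dof, expected = chi2_contingency(table)

# Adjusted (standardized) Pearson residuals: cell-level deviations from
# independence, approximately N(0, 1) under the null, so |residual| > 1.96
# flags cells driving any association.
n = table.sum()
row = table.sum(axis=1, keepdims=True)
col = table.sum(axis=0, keepdims=True)
adj_resid = (table - expected) / np.sqrt(
    expected * (1 - row / n) * (1 - col / n)
)
```

With three conditions and four quartiles, the test has (3 − 1) × (4 − 1) = 6 degrees of freedom; a nonsignificant omnibus p with no large residuals is consistent with the "no affiliation bias" finding reported in the Results.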
Results:
RAG at temperature 0.5 achieved the best overall performance (exact match accuracy: 0.35; close match accuracy: 0.78). Across all models, LLMs frequently assigned manuscripts to Q2 and Q3 while avoiding the extreme quartiles (Q1 and Q4). None of the models demonstrated affiliation bias, though Gemma 2 and Qwen 2.5 approached statistical significance in some cases. Each model displayed a distinct “personality” in its quartile predictions, which influenced consistency. Mistral 7B achieved the highest exact match accuracy (0.35) despite having the lowest average runtime (1246.378 s) and the smallest parameter count (7 billion). While this accuracy is insufficient for independent review, LLMs showed value in supporting preliminary triage tasks.
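The two accuracy metrics reported above can be illustrated with a minimal sketch. The quartile labels below are invented, and "close match" is assumed here to mean a prediction within one quartile of the true label, a definition the abstract itself does not spell out.

```python
# Hypothetical true vs. predicted journal quartiles (1 = Q1 .. 4 = Q4);
# values are illustrative, not the study's data.
true_q = [1, 2, 3, 4, 2, 3]
pred_q = [2, 2, 3, 3, 2, 4]

# Exact match: predicted quartile equals the true quartile.
exact = sum(p == t for p, t in zip(pred_q, true_q)) / len(true_q)

# Close match (assumption): prediction within one quartile of the truth.
close = sum(abs(p - t) <= 1 for p, t in zip(pred_q, true_q)) / len(true_q)
```

Under this assumed definition, close match accuracy is always at least exact match accuracy, which matches the relationship between the reported figures (0.78 vs. 0.35).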
Conclusions:
Current open-source LLMs are not reliable enough to replace human peer reviewers but can meaningfully reduce workload by supporting early-stage manuscript triage. Importantly, affiliation bias was largely absent, suggesting LLMs may offer a pathway to more equitable peer review. RAG with moderate temperature emerged as the most effective prompting strategy. A hybrid system integrating LLMs with human oversight may enhance efficiency while maintaining rigor and integrity in scholarly publishing.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.