Accepted for/Published in: JMIR Formative Research
Date Submitted: Dec 5, 2024
Date Accepted: Dec 12, 2025
EVIDENCE SUMMARIES USING LARGE LANGUAGE MODELS: EXPLORING THE COMPARABILITY OF HUMAN- AND AI-ANNOTATED BIBLIOGRAPHIES
ABSTRACT
Background:
Annotated bibliographies comprise summaries of relevant literature, and creating useful annotations requires training, experience, and time. However, summaries generated by artificial intelligence (AI) can contain serious errors.
Objective:
We directly compared the quality of human- and AI-generated annotations to determine the strengths and weaknesses of each.
Methods:
We compared human- and ChatGPT-produced annotations for 15 academic papers on five criteria (word count, readability, capture of main points, presence of errors, and broader contextualization/quality) using descriptive statistics and nonparametric tests.
Results:
Humans produced shorter annotations (90.20 vs 111.47 words; Z=2.82, P=.01) with better readability than AI (15.25 vs 8.03; Z=-2.28, P=.02), although readability was low for all annotations. There was no difference in the capture of main points (χ²=6.12, P=.18) or the presence of errors (χ²=5.27, P=.16). AI-produced annotations provided better contextualization than human annotations (χ²=11.28, P<.001).
Conclusions:
AI-produced summaries of academic literature are comparable in quality to human annotations. Annotations generated by AI and verified by humans could reduce the time needed to produce evidence summaries on a given subject.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.