
Accepted for/Published in: JMIR Formative Research

Date Submitted: Dec 5, 2024
Date Accepted: Dec 12, 2025

The final, peer-reviewed published version of this preprint has been published in JMIR Formative Research; the full reference appears in the Citation section below.

Warning: This is an author submission that has not been peer reviewed or edited. Preprints, unless they show as "accepted," should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

AI Generation of Evidence Summaries: A Descriptive Study of the Comparability With Human Annotations

  • Michelle Colder Carras
  • Riaz Qureshi
  • Faisal Aldayel
  • Mayank Date
  • Dahlia AlJuboori
  • Johannes Thrul

ABSTRACT

Background:

Annotated bibliographies comprise summaries of relevant literature, and creating useful annotations requires training, experience, and time. However, summaries generated by artificial intelligence (AI) can contain serious errors.

Objective:

We directly compared the quality of human- and AI-generated annotations to determine the strengths and weaknesses of each.

Methods:

We compared five criteria (word count, readability, capture of main points, presence of errors, and broader contextualization/quality) between human-written and native ChatGPT-generated annotations for 15 academic papers, using descriptive statistics and non-parametric tests.
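As a concrete illustration of this analytic approach (a minimal sketch, not the authors' actual code), such a comparison can be run with SciPy: a Wilcoxon signed-rank test for the paired continuous measures and a chi-square test for the categorical ratings. All data values and variable names below are hypothetical.

```python
# Minimal sketch of the non-parametric tests described above.
# Assumes paired measures for the same 15 papers; all values hypothetical.
from scipy.stats import wilcoxon, chi2_contingency

# Hypothetical paired word counts (human vs AI) for 15 papers
human_words = [85, 92, 78, 101, 95, 88, 90, 83, 97, 91, 86, 94, 89, 99, 80]
ai_words = [110, 120, 105, 115, 108, 112, 118, 109, 114, 111, 107, 116, 113, 119, 104]

# Wilcoxon signed-rank test: paired samples, no normality assumption
stat, p = wilcoxon(human_words, ai_words)
print(f"Wilcoxon signed-rank: statistic={stat:.2f}, P={p:.3f}")

# Chi-square test on a contingency table of categorical ratings,
# e.g. rows = annotation source, columns = rating levels (hypothetical)
ratings = [[3, 7, 5],   # human: low / medium / high
           [1, 4, 10]]  # AI
chi2, p, dof, _ = chi2_contingency(ratings)
print(f"Chi-square: chi2={chi2:.2f}, dof={dof}, P={p:.3f}")
```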

Results:

Humans produced shorter annotations than AI (90.20 vs 111.47 words; Z=2.82, P=.01) with better readability (15.25 vs 8.03; Z=-2.28, P=.02), although readability was low for all annotations. There was no difference in the capture of main points (χ²=6.12, P=.18) or in the presence of errors (χ²=5.27, P=.16). AI-produced annotations provided better contextualization than human annotations (χ²=11.28, P<.001).
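The abstract does not name the readability metric used; as one plausible assumption, Flesch Reading Ease yields scores in the range reported here (higher scores indicate easier text, and values below roughly 30 are considered very difficult, consistent with readability being low for all annotations). A hedged sketch using the third-party textstat package:

```python
# Hypothetical sketch: Flesch Reading Ease is assumed here, since the
# abstract does not specify the readability metric actually used.
# Requires: pip install textstat
import textstat

annotation = (
    "This randomized controlled trial evaluated the efficacy of a "
    "brief digital intervention for reducing problematic technology use "
    "among young adults recruited from primary care settings."
)
# Higher = easier to read; dense academic prose typically scores low.
print(textstat.flesch_reading_ease(annotation))
```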

Conclusions:

AI-produced summaries of academic literature are comparable in quality to human-written annotations. Annotations generated by AI and then verified by humans should reduce the time needed to produce summaries on a given subject.


 Citation

Please cite as:

Colder Carras M, Qureshi R, Aldayel F, Date M, AlJuboori D, Thrul J

Using Large Language Models to Summarize Evidence in Biomedical Articles: Exploratory Comparison Between AI- and Human-Annotated Bibliographies

JMIR Form Res 2026;10:e69707

DOI: 10.2196/69707

PMID: 41678657

PMCID: 12900274


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.