Accepted for/Published in: JMIR Formative Research
Date Submitted: Dec 5, 2024
Date Accepted: Dec 12, 2025
EVIDENCE SUMMARIES USING LARGE LANGUAGE MODELS: EXPLORING THE COMPARABILITY OF HUMAN- AND AI-ANNOTATED BIBLIOGRAPHIES
ABSTRACT
Background:
Annotated bibliographies comprise summaries of relevant literature, and creating useful annotations requires training, experience, and time. However, summaries generated by artificial intelligence (AI) can contain serious errors.
Objective:
We directly compared the quality of human- and AI-generated annotations to determine the strengths and weaknesses of each.
Methods:
We compared human- and ChatGPT-produced annotations for 15 academic papers on five criteria (word count, readability, capture of main points, presence of errors, and broader contextualization/quality) using descriptive statistics and nonparametric tests.
Results:
Humans produced shorter annotations (90.20 vs 111.47 words; Z=2.82, P=.01) with better readability than AI (15.25 vs 8.03; Z=-2.28, P=.02), although readability was low for all annotations. There was no difference in the capture of main points (χ²=6.12, P=.18) or the presence of errors (χ²=5.27, P=.16). AI-produced annotations provided better contextualization than human annotations (χ²=11.28, P<.001).
Conclusions:
AI-produced summaries of academic literature are comparable in quality to human annotations. Annotations generated by AI and verified by humans could reduce the time needed to produce evidence summaries on a given subject.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.