JMIR Preprints #52935: Citations and References in Scholarly Writing: A cross-disciplinary Evaluation of Large Language Model Performance and Reliability.


Citations and References in Scholarly Writing: A cross-disciplinary Evaluation of Large Language Model Performance and Reliability.

Joseph Mugaanyi;
Liuying Cai;
Sumei Cheng;
Caide Lu;
Jing Huang

ABSTRACT

Background:

Recent advancements in natural language processing have given rise to Large Language Models (LLMs), such as ChatGPT (GPT-3.5), capable of generating scholarly content, including citations and references. Assessing the accuracy of these AI-generated citations is imperative for maintaining scholarly rigor.

Objective:

The aim of this study was to assess the accuracy of citations and references generated by ChatGPT (GPT-3.5) in two distinct academic domains: Natural Sciences and Humanities.

Methods:

Two researchers independently prompted ChatGPT to write an introduction section for a manuscript, with citations included, and then evaluated the citations and the accuracy of their DOIs. Results were compared between the two disciplines.

Results:

Ten topics were included, 5 in the natural sciences and 5 in the humanities. A total of 102 citations were generated, 55 in the natural sciences and 47 in the humanities. Of these, 40 citations (72.7%) in the natural sciences and 36 (76.6%) in the humanities were real (P = 0.415). DOI presence differed significantly between the disciplines (natural sciences: 70.9% vs. humanities: 38.3%), as did DOI accuracy (32.7% vs. 8.5%). DOI hallucination was more prevalent in the humanities (89.4%). Levenshtein distance was significantly higher in the humanities, indicating lower DOI accuracy.
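The abstract does not describe how the Levenshtein distance was computed, so the following is only a minimal sketch of the standard dynamic-programming definition, not the authors' implementation. The function name `levenshtein` and the "generated" DOI are hypothetical and used for illustration; the reference DOI is the one listed for this article.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn string a into string b."""
    # previous[j] = distance between the prefix of a processed so far
    # (initially empty) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from the first i chars of a to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


reference_doi = "10.2196/52935"  # DOI listed for this article
generated_doi = "10.2196/52953"  # hypothetical model output, illustration only

print(levenshtein(generated_doi, reference_doi))  # prints 2
```

A distance of 0 means the generated DOI matches the reference exactly; larger values mean more character edits separate the two strings, which is why a higher distance is read as lower DOI accuracy.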

Conclusions:

ChatGPT's performance in generating citations and references varies across disciplines. Differences in DOI standards and disciplinary nuances contribute to performance variations. Researchers should consider AI writing tools' strengths and limitations in citation accuracy. Domain-specific models may enhance accuracy.

Citation

Please cite as:

Mugaanyi J, Cai L, Cheng S, Lu C, Huang J

Evaluation of Large Language Model Performance and Reliability for Citations and References in Scholarly Writing: Cross-Disciplinary Study

J Med Internet Res 2024;26:e52935

DOI: 10.2196/52935

PMID: 38578685

PMCID: 11031695


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Sep 19, 2023

Date Accepted: Mar 12, 2024
