Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Sep 19, 2023
Date Accepted: Mar 12, 2024

The final, peer-reviewed published version of this preprint can be found here:

Mugaanyi J, Cai L, Cheng S, Lu C, Huang J

Evaluation of Large Language Model Performance and Reliability for Citations and References in Scholarly Writing: Cross-Disciplinary Study

J Med Internet Res 2024;26:e52935

DOI: 10.2196/52935

PMID: 38578685

PMCID: 11031695

Citations and References in Scholarly Writing: A Cross-Disciplinary Evaluation of Large Language Model Performance and Reliability

  • Joseph Mugaanyi
  • Liuying Cai
  • Sumei Cheng
  • Caide Lu
  • Jing Huang

ABSTRACT

Background:

Recent advancements in natural language processing have given rise to Large Language Models (LLMs), such as ChatGPT (GPT-3.5), capable of generating scholarly content, including citations and references. Assessing the accuracy of these AI-generated citations is imperative for maintaining scholarly rigor.

Objective:

The aim of this study was to assess the accuracy of citations and references generated by ChatGPT (GPT-3.5) in two distinct academic domains: Natural Sciences and Humanities.

Methods:

Two researchers independently prompted ChatGPT to write an introduction section for a manuscript, including citations, and then evaluated the generated citations and the accuracy of their DOIs. Results were compared between the two disciplines.

Results:

Ten topics were included: 5 in the natural sciences and 5 in the humanities. A total of 102 citations were generated, 55 in the natural sciences and 47 in the humanities. Of these, 40 (72.7%) in the natural sciences and 36 (76.6%) in the humanities were real (P = 0.415). There were significant disparities in DOI presence (natural sciences: 70.9% vs humanities: 38.3%) and DOI accuracy (32.7% vs 8.5%). DOI hallucination was more prevalent in the humanities (89.4%). The Levenshtein distance was significantly higher in the humanities, indicating lower DOI accuracy.
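To illustrate how a Levenshtein distance can quantify DOI accuracy, the following is a minimal Python sketch, not the authors' implementation. It assumes the metric is computed between a generated DOI string and the correct DOI of the same work. The function names `levenshtein` and `doi_distance`, the whitespace and case normalization, and the example "generated" DOI are all hypothetical choices made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,          # delete ch_a
                current[j - 1] + 1,       # insert ch_b
                previous[j - 1] + cost,   # substitute (or match)
            ))
        previous = current
    return previous[-1]


def doi_distance(generated: str, actual: str) -> int:
    """Distance between two DOIs after trimming whitespace and lowercasing.
    The normalization is an assumption of this sketch, not a step
    described in the study."""
    return levenshtein(generated.strip().lower(), actual.strip().lower())


if __name__ == "__main__":
    actual = "10.2196/52935"      # DOI of this article
    generated = "10.2196/52953"   # hypothetical model output, two digits swapped
    print(doi_distance(generated, actual))  # 2
```

A distance of 0 means the generated DOI matches exactly; larger values mean more character edits separate it from the correct DOI, which is why a higher average distance indicates lower DOI accuracy.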

Conclusions:

ChatGPT's performance in generating citations and references varies across disciplines. Differences in DOI standards and disciplinary nuances contribute to performance variations. Researchers should consider AI writing tools' strengths and limitations in citation accuracy. Domain-specific models may enhance accuracy.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.