
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Aug 20, 2023
Date Accepted: Apr 26, 2024

The final, peer-reviewed published version of this preprint can be found here:

Kim HJ, Yang JH, Chang DG, Lenke LG, Pizones J, Castelein R, Watanabe K, Trobisch PD, Mundis GM Jr, Suh SW, Suk SI

Assessing the Reproducibility of the Structured Abstracts Generated by ChatGPT and Bard Compared to Human-Written Abstracts in the Field of Spine Surgery: Comparative Analysis

J Med Internet Res 2024;26:e52001

DOI: 10.2196/52001

PMID: 38924787

PMCID: 11237793

Assessing the Reproducibility of the Structured Abstracts Generated by ChatGPT and Bard Compared to Human-written Abstracts in the Field of Spine Surgery: A Comparative Analysis of Scientific Abstracts Between Artificial Intelligence and Human

  • Hong Jin Kim; 
  • Jae Hyuk Yang; 
  • Dong-Gune Chang; 
  • Lawrence G Lenke; 
  • Javier Pizones; 
  • René Castelein; 
  • Kota Watanabe; 
  • Per D Trobisch; 
  • Gregory M Mundis Jr; 
  • Seung Woo Suh; 
  • Se-Il Suk

ABSTRACT

Background:

Due to recent advances in artificial intelligence (AI), language model applications such as ChatGPT and Bard can generate logical text output that is difficult to distinguish from human writing. The use of AI to write scientific abstracts in the field of spine surgery is the center of much debate and controversy.

Objective:

To assess the reproducibility of the structured abstracts generated by ChatGPT and Bard compared to human-written abstracts in the field of spine surgery.

Methods:

Sixty abstracts dealing with spine sections were randomly selected from seven reputable journals and used as ChatGPT and Bard input statements to generate abstracts based on the supplied article titles. A total of 174 abstracts, divided into human-written, ChatGPT-generated, and Bard-generated abstracts, were evaluated for compliance with the structured format of the journal guidelines and for consistency of content. The likelihood of plagiarism and the likelihood of AI-generated output were assessed using the iThenticate and zeroGPT programs, respectively. Eight reviewers in the spinal field evaluated 30 randomly extracted abstracts to determine whether they were produced by AI or by human authors.
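As a concrete illustration of the guideline-compliance check described above, the Python sketch below tests whether an abstract contains the required structured headings and respects a word limit. It is a minimal, hypothetical example: the heading list, the 300-word limit, and the helper name check_compliance are assumptions introduced for demonstration, not the authors' actual evaluation procedure or any individual journal's guidelines.

    import re

    # Hypothetical journal guideline: required structured headings and a word limit.
    # These values are illustrative only; each of the 7 journals has its own rules.
    REQUIRED_HEADINGS = ["Background", "Methods", "Results", "Conclusions"]
    WORD_LIMIT = 300

    def check_compliance(abstract: str) -> dict:
        """Check structured-format and word-count compliance of one abstract."""
        has_headings = all(
            re.search(rf"\b{h}\b", abstract, flags=re.IGNORECASE)
            for h in REQUIRED_HEADINGS
        )
        word_count = len(abstract.split())
        return {
            "meets_format": has_headings,
            "meets_word_count": word_count <= WORD_LIMIT,
            "word_count": word_count,
        }

    # Toy example with a schematic AI-generated abstract
    toy_abstract = "Background: ... Methods: ... Results: ... Conclusions: ..."
    print(check_compliance(toy_abstract))

In the study, such checks would have to be applied per journal, since each of the seven source journals imposes its own structured format and word limit.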

Results:

The proportion of abstracts that met journal formatting guidelines was greater among ChatGPT-generated abstracts (56.6%) than among Bard-generated abstracts (11.1%) (P<.001). However, a higher proportion of Bard-generated abstracts (90.7%) had word counts that met journal guidelines compared with ChatGPT-generated abstracts (50%) (P<.001). There was no significant difference between the two AI-generated groups in the consistency of conclusions (P=.851). The cohort sample size in the human group was significantly correlated with that of the ChatGPT group (r=0.955, P<.001) and the Bard group (r=0.998, P<.001). The plagiarism rate was significantly lower among ChatGPT-generated abstracts (20.7%) than among Bard-generated abstracts (32.1%) (P<.001). The AI-detection program predicted that 21.7% of the human group, 63.3% of the ChatGPT group, and 87.0% of the Bard group were possibly generated by AI, with an area under the curve of 0.863 (P<.001). Human reviewers identified human-written abstracts with a sensitivity of 56.3% and a specificity of 48.4%.
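For readers who want to see how the reported statistics fit together, the following minimal Python sketch computes a Pearson correlation, the sensitivity and specificity of reviewer classifications, and an area under the curve (AUC) for detector scores. All numbers in the sketch are made-up toy values, not the study's data, and the snippet is illustrative rather than the authors' analysis code.

    from scipy.stats import pearsonr
    from sklearn.metrics import confusion_matrix, roc_auc_score

    # Toy data for illustration only -- not the study's measurements.
    human_cohort_sizes = [120, 85, 240, 60, 150]    # sample sizes in human-written abstracts
    chatgpt_cohort_sizes = [118, 85, 236, 62, 150]  # sample sizes in the matching ChatGPT abstracts

    r, p = pearsonr(human_cohort_sizes, chatgpt_cohort_sizes)
    print(f"Pearson r = {r:.3f}, P = {p:.3f}")

    # Reviewer classification of abstracts: 1 = human-written, 0 = AI-generated.
    true_labels = [1, 1, 1, 1, 0, 0, 0, 0]
    reviewer_labels = [1, 0, 1, 1, 0, 1, 0, 0]
    tn, fp, fn, tp = confusion_matrix(true_labels, reviewer_labels).ravel()
    sensitivity = tp / (tp + fn)  # human-written abstracts correctly identified
    specificity = tn / (tn + fp)  # AI-generated abstracts correctly identified
    print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")

    # AI-detection program output: score that an abstract is AI-generated.
    ai_truth = [0, 0, 0, 1, 1, 1]               # 1 = actually AI-generated
    ai_scores = [0.2, 0.4, 0.1, 0.9, 0.7, 0.6]  # detector scores for the same abstracts
    print(f"AUC = {roc_auc_score(ai_truth, ai_scores):.3f}")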

Conclusions:

Both ChatGPT and Bard can be used to help write abstracts, but most AI-generated abstracts are currently considered unethical because of their high plagiarism and AI-detection rates. ChatGPT-generated abstracts appear to be superior to Bard-generated abstracts in meeting journal formatting guidelines. Because humans were unable to accurately distinguish abstracts written by humans from those produced by AI programs, it is crucial to exercise special caution and to examine the ethical boundaries of employing AI programs, including ChatGPT and Bard.


 Citation

Please cite as:

Kim HJ, Yang JH, Chang DG, Lenke LG, Pizones J, Castelein R, Watanabe K, Trobisch PD, Mundis GM Jr, Suh SW, Suk SI

Assessing the Reproducibility of the Structured Abstracts Generated by ChatGPT and Bard Compared to Human-Written Abstracts in the Field of Spine Surgery: Comparative Analysis

J Med Internet Res 2024;26:e52001

DOI: 10.2196/52001

PMID: 38924787

PMCID: 11237793


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.