
Accepted for/Published in: JMIR Medical Education

Date Submitted: May 31, 2024
Date Accepted: Jan 16, 2025

The final, peer-reviewed published version of this preprint can be found here:

Detecting Artificial Intelligence–Generated Versus Human-Written Medical Student Essays: Semirandomized Controlled Study

Doru B, Maier C, Busse JS, Lücke T, Schönhoff J, Enax-Krumova E, Hessler S, Berger M, Tokic M

Detecting Artificial Intelligence–Generated Versus Human-Written Medical Student Essays: Semirandomized Controlled Study

JMIR Med Educ 2025;11:e62779

DOI: 10.2196/62779

PMID: 40053752

PMCID: 11914838

Detecting AI-generated versus human-written medical student essays: a semi-randomized controlled study

  • Berin Doru; 
  • Christoph Maier; 
  • Johanna Sophie Busse; 
  • Thomas Lücke; 
  • Judith Schönhoff; 
  • Elena Enax-Krumova; 
  • Steffen Hessler; 
  • Maria Berger; 
  • Marianne Tokic

ABSTRACT

Background:

Large language models (LLMs), exemplified by ChatGPT, have reached a level of sophistication that makes distinguishing between human-written and AI-generated texts increasingly challenging. This has raised concerns in academia, and especially in medicine, where the accuracy and authenticity of written work are of paramount importance.

Objective:

The aim of this semi-randomized, experimental study was to investigate the ability of two blinded expert groups with different levels of content familiarity, medical professionals and humanities scholars with expertise in textual analysis, to differentiate between longer scientific texts in German written by medical students and those generated by ChatGPT. The study further sought to analyze the reasoning behind their identification choices, in particular the role of content familiarity and linguistic features.

Methods:

Between May and August 2023, a total of 35 experts (medical: n=22; humanities: n=13) were each presented with two pairs of texts on two different medical topics. Each pair was similar in content and structure: one text was written by a medical student and the other was generated by ChatGPT (version 3.5, March 2023). Experts were asked to identify the AI-generated text and to justify their choice. These justifications were analyzed in a multi-stage, interdisciplinary qualitative analysis in which textual features were identified. Before unblinding, experts rated seven characteristics of each text, among them linguistic fluency, scientific quality, logical coherence, expression of knowledge limitations, formulation of future research questions, and spelling and grammatical accuracy. Univariate tests and multivariate logistic regression analyses were used to examine associations between participants' characteristics, their stated reasons for authorship decisions, and the likelihood of correctly identifying a text's authorship.

Results:

Overall, 70% of participants accurately identified the AI-generated texts, with minimal difference between groups (medical: 72%; humanities: 65%; OR 1.37; 95% CI 0.5-3.9). While content errors had minimal impact on identification accuracy, stylistic features, particularly redundancy (OR 6.90; 95% CI 1.01-47.1), repetition (OR 8.05; 95% CI 1.25-51.7), and thread/coherence (OR 6.62; 95% CI 1.25-35.2), were pivotal in participants’ decisions to identify a text as AI-generated.

Conclusions:

The findings suggest that both medical and humanities experts can identify ChatGPT-generated texts in medical contexts, with decisions based largely on linguistic attributes. The frequency of correct identification appears to be independent of the experts’ familiarity with the text content. Because the decision rests mainly on linguistic attributes, that is, stylistic and text coherence-related features, further quasi-experimental studies with texts from other academic disciplines should examine whether instructions based on these features can further improve lecturers’ ability to distinguish between student-authored and AI-generated work.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license upon publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.