
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jan 17, 2024
Date Accepted: Jul 9, 2024

The final, peer-reviewed published version of this preprint can be found here:

Comparing GPT-4 and Human Researchers in Health Care Data Analysis: Qualitative Description Study

Li KD, Fernandez AM, Schwartz R, Rios N, Carlisle MN, Amend GM, Patel HV, Breyer BN


J Med Internet Res 2024;26:e56500

DOI: 10.2196/56500

PMID: 39167785

PMCID: 11375389

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Man Versus Machine: Harnessing Artificial Intelligence for Qualitative Analysis

  • Kevin Danis Li
  • Adrian M Fernandez
  • Rachel Schwartz
  • Natalie Rios
  • Marvin Nathaniel Carlisle
  • Gregory M Amend
  • Hiren V Patel
  • Benjamin N Breyer

ABSTRACT

Background:

Large language models (LLMs) like GPT-4 have opened new avenues in healthcare and qualitative research. Traditional qualitative methods are time-consuming and require expertise to capture nuance. Although LLMs have demonstrated enhanced contextual understanding and inference compared with traditional natural language processing, their performance in qualitative analysis relative to that of human researchers remains unexplored.

Objective:

We evaluated the effectiveness of GPT-4 versus human researchers in the qualitative analysis of interviews with patients with adult-acquired buried penis (AABP).

Methods:

Qualitative data were obtained from semi-structured interviews with 20 patients with AABP. Human analysis followed a structured thematic process in three stages: initial observations, line-by-line coding, and consensus discussions to refine themes. In contrast, artificial intelligence (AI) analysis with GPT-4 proceeded in two phases: a naïve phase, in which GPT-4 outputs were independently evaluated by a blinded reviewer to identify themes and subthemes, and a comparison phase, in which AI-generated themes were compared with human-identified themes to assess agreement.

Results:

The study population (n=20) comprised predominantly white (85%), married (60%), heterosexual (95%) men, with a mean age of 58.8 years and BMI of 41.1 kg/m². Human thematic analysis identified "urinary issues" in 95% and GPT-4 in 75% of interviews, with the subtheme "spray/stream" noted in 60% and 35%, respectively. "Sexual issues" were prominent (95% humans vs. 80% GPT-4), though humans identified a wider range of subthemes, including "pain with sex or masturbation" (35%) and "difficulty with sex or masturbation" (20%). Both analyses similarly highlighted "mental health issues" (55% humans vs. 44% GPT-4), although humans coded "depression" more frequently (50% humans vs. 20% GPT-4). Humans frequently cited "issues using public restrooms" (60%) as impacting social life, whereas GPT-4 emphasized "struggles with romantic relationships" (45%). "Hygiene issues" were consistently recognized (70% humans vs. 65% GPT-4). Humans uniquely identified "contributing factors" as a theme in all interviews. There was moderate agreement between human and GPT-4 coding (Cohen's kappa=0.401). Reliability assessments of GPT-4's analyses showed consistent coding for themes like "Body image struggles" and "Chronic pain" (100%), and "Depression" (90%). Other themes like "Motivation for surgery" and "Weight challenges" were reliably coded (80%), while less frequent themes were variably identified across multiple iterations.
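The moderate agreement reported above (Cohen's kappa=0.401) can be verified for any pair of coding vectors. A minimal sketch in Python, assuming each theme is coded as present/absent (1/0) per interview; the 20-interview vectors below are illustrative only, not the study's actual codes:

```python
def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length lists of categorical codes."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    labels = set(coder_a) | set(coder_b)
    # Observed agreement: proportion of items both coders labeled identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement, from each coder's marginal label frequencies.
    p_e = sum((coder_a.count(l) / n) * (coder_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical presence/absence coding of one theme across 20 interviews.
human = [1] * 15 + [0] * 5
gpt4 = [1] * 12 + [0] * 3 + [1] * 2 + [0] * 3
print(round(cohens_kappa(human, gpt4), 3))  # → 0.375
```

Values near 0.4 sit at the lower end of the conventional "moderate" band, which is consistent with the authors' characterization of the human-GPT-4 agreement.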

Conclusions:

Large language models like GPT-4 can effectively identify key themes in analyzing qualitative healthcare data, showing moderate agreement with human analysis. While human analysis provided a richer diversity of subthemes, the consistency of AI suggests its utility as a complementary tool in qualitative research. With AI rapidly advancing, future studies should iterate analyses and circumvent token limitations by segmenting data, furthering the breadth and depth of large language model-driven qualitative analyses.
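The token-limit workaround suggested above, segmenting interview data before analysis, could be sketched as follows. This is an illustrative approach only: it approximates tokens with a whitespace word count (real model tokenizers differ) and the budget value is arbitrary, not one used in the study:

```python
def segment_transcript(text, max_tokens=3000):
    """Split a transcript into chunks of at most max_tokens words,
    breaking on paragraph boundaries (blank lines) where possible."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        # Flush the current chunk if adding this paragraph would exceed the budget.
        if current and count + words > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph that alone exceeds the budget becomes its own chunk; a production version would substitute the target model's actual tokenizer for the word count and analyze each chunk in a separate prompt.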




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.