Accepted for/Published in: JMIR AI
Date Submitted: Nov 11, 2023
Date Accepted: Jun 6, 2024
Comparing the Efficacy and Efficiency of Human and GenAI Qualitative Thematic Analyses
ABSTRACT
Background:
Qualitative methods are highly beneficial to the dissemination and implementation of new digital health interventions; however, these methods can be time-intensive and can slow dissemination when timely knowledge from the data sources is needed in ever-changing health systems. Recent advancements in generative artificial intelligence (GenAI) and the underlying large language models (LLMs) may provide a promising opportunity to expedite the qualitative analysis of textual data, but their validity and reliability remain unknown.
Objective:
The primary objectives of our study were to evaluate the consistency in themes, reliability of coding, and time needed for inductive and deductive thematic analyses between GenAI (i.e., ChatGPT, Bard) and human coders.
Methods:
The qualitative data for the present study consisted of 40 brief text message reminder prompts used in a digital health intervention for promoting antiretroviral medication adherence among people with HIV who use methamphetamine. Inductive and deductive thematic analyses of these text messages were conducted by two independent teams of human coders. An independent human analyst conducted inductive and deductive analyses using both ChatGPT and Bard. The consistency in themes (or extent to which themes were the same) and reliability (or agreement in coding of themes) between methods were compared.
Results:
The themes generated by GenAI were consistent with 71.4% of the themes identified by human analysts following inductive thematic analysis (ChatGPT = 71.4%; Bard = 71.4%). Consistency was lower between human- and GenAI-generated themes following the deductive thematic analysis procedure (ChatGPT = 50.0%; Bard = 58.3%). The percent agreement (i.e., intercoder reliability) for these congruent themes between human coders and GenAI ranged from fair to moderate (ChatGPT, inductive = 47.0%; ChatGPT, deductive = 37.3%; Bard, inductive = 37.0%; Bard, deductive = 36.2%). In general, ChatGPT and Bard performed similarly to each other across both types of qualitative analyses in terms of consistency of themes (inductive = 100%; deductive = 83.3%) and reliability of coding (inductive = 37.1%; deductive = 46.8%). On average, GenAI required substantially less overall time than human coders conducting qualitative analysis (20 vs. 567 minutes).
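The percent agreement statistic reported above can be illustrated with a minimal sketch. The theme labels below are hypothetical placeholders, not the study's actual codes or data:

```python
# Minimal sketch of percent agreement (intercoder reliability) between
# two coders' theme assignments. All labels here are hypothetical
# illustrations, not the study's actual coding.

def percent_agreement(coder_a, coder_b):
    """Share of items coded identically by both coders, as a percentage."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must rate the same set of items")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)

# Example: theme labels assigned to 8 text messages by a human coder
# and by a GenAI analyst (hypothetical labels).
human = ["support", "reminder", "health", "support",
         "reminder", "health", "support", "reminder"]
genai = ["support", "reminder", "support", "support",
         "health", "health", "support", "support"]

print(f"{percent_agreement(human, genai):.1f}%")  # 5 of 8 match: 62.5%
```

Note that simple percent agreement does not correct for chance agreement; chance-corrected statistics such as Cohen's kappa are often reported alongside it.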
Conclusions:
The consistency in themes generated by human coders and GenAI suggests that these technologies hold promise for reducing the resource intensiveness of qualitative thematic analysis; however, the relatively low reliability in coding between them suggests that hybrid approaches are necessary. Human coders appeared better than GenAI at identifying subtle and nuanced themes. Moreover, outstanding ethical challenges remain in applying such technologies to human subjects research, including confidentiality. Future studies should consider how these technologies can best be used in collaboration with human coders in hybrid approaches to improve the efficiency of research, while remaining alert to the potential harms they may pose.