Currently submitted to: JMIR Formative Research

Date Submitted: Apr 15, 2026
Open Peer Review Period: Apr 28, 2026 - Jun 23, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Feasibility of using ChatGPT to generate exposure hierarchies for body dysmorphic disorder: Proof-of-concept study

  • Geneva K. Jonathan; 
  • Adam C. Jaroszewski; 
  • Heyli T. Arcese; 
  • Natasha H. Bailen; 
  • Ivar Snorrason; 
  • Anne Chosak; 
  • Jennifer Ragan; 
  • Susan Sprich; 
  • Jessica Rasmussen; 
  • Emily E. Bernstein; 
  • Sabine Wilhelm

ABSTRACT

Background:

Body dysmorphic disorder (BDD) is a chronic, impairing condition for which cognitive behavior therapy including exposure and response prevention (ERP) is the first-line treatment, yet few clinicians are trained in BDD-specific ERP. Constructing personalized exposure hierarchies is time-intensive and requires clinical expertise, particularly for patients with poor insight who may hold appearance beliefs with delusional conviction. Large language models (LLMs) have shown preliminary utility in generating clinical content, but their capacity to produce safe, appropriate ERP hierarchies for BDD has not been examined.

Objective:

This study aimed to evaluate the feasibility of using ChatGPT (GPT-5) to generate ERP exposure hierarchies for BDD, assess the influence of clinical and demographic characteristics on output quality, and benchmark AI-generated hierarchies against those created by doctoral-level BDD specialists.

Methods:

ChatGPT generated 10-item graded exposure hierarchies for 72 simulated patient vignettes systematically varied by body area of concern (hair, skin, nose), insight level (good, limited, absent), symptom specificity (low, high), patient age (15, 40 years), and patient gender (woman, man). Expert clinicians independently generated hierarchies for a subset of 18 vignettes. Two researchers rated all ChatGPT hierarchies for task completion and input information integration. Three BDD expert clinicians, blinded to the study aims and generation source, rated all 90 hierarchies on relevance, specificity, variability, safety, and overall quality. After unblinding, raters attempted to identify which hierarchy in each of 18 matched pairs was AI-generated.

Results:

ChatGPT generated complete hierarchies for 69 of 72 vignettes (95.8%) and integrated most input information (mean 4.66/5, SD 0.42). Blinded experts rated ChatGPT hierarchies as highly relevant (mean 4.81, SD 0.23), specific (mean 4.70, SD 0.28), variable (mean 4.39, SD 0.26), safe (mean 4.90, SD 0.16), and of high overall quality (mean 4.66, SD 0.26), with no significant differences from expert-generated hierarchies on any dimension except variability (P=.02, r=0.24). After unblinding, raters correctly identified AI-generated hierarchies only 31.5% of the time, significantly below chance (P=.009). However, insight level was the most consistent predictor of hierarchy quality: absent-insight hierarchies received significantly lower safety (P<.001, r=0.73) and overall quality ratings (P=.01, r=0.43). This pattern was not unique to ChatGPT; safety ratings for expert-generated hierarchies also declined significantly by insight level (P=.031), with a comparable magnitude of decline. Qualitative comments revealed that difficulty miscalibration was the most common rater concern (30/43 comments, 69.8%), disproportionately directed at ChatGPT hierarchies.

Conclusions:

LLMs show considerable potential for generating clinician-facing ERP hierarchies for BDD and produced output that blinded specialists rated as comparable to expert work on most dimensions of quality. However, hierarchies for patients with poor insight received lower safety and quality ratings from both AI and human sources, with qualitative comments suggesting that AI-generated hierarchies were less well calibrated to the clinical demands of delusional-level presentations. These findings underscore the need for clinician oversight when treating patients with delusional conviction, which is common in BDD.


 Citation

Please cite as:

Jonathan GK, Jaroszewski AC, Arcese HT, Bailen NH, Snorrason I, Chosak A, Ragan J, Sprich S, Rasmussen J, Bernstein EE, Wilhelm S

Feasibility of using ChatGPT to generate exposure hierarchies for body dysmorphic disorder: Proof-of-concept study

JMIR Preprints. 15/04/2026:98393

DOI: 10.2196/preprints.98393

URL: https://preprints.jmir.org/preprint/98393


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.