
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jun 10, 2025
Date Accepted: Dec 29, 2025

The final, peer-reviewed published version of this preprint can be found here:

Evaluating GPT-4 Responses on Scars or Keloids for Patient Education: Large Language Model Evaluation Study

Rao M

Evaluating GPT-4 Responses on Scars or Keloids for Patient Education: Large Language Model Evaluation Study

JMIR Med Inform 2026;14:e78838

DOI: 10.2196/78838

PMID: 41773665

PMCID: 12954683

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Evaluating ChatGPT Responses on Scars or Keloids for Patient Education

  • Mingjun Rao

ABSTRACT

Background:

Scars and keloids impose significant physical and psychological burdens on patients, often leading to functional limitations, cosmetic concerns, and mental health issues such as anxiety and depression. Patients increasingly turn to online platforms for information, yet existing web-based resources on scars and keloids are frequently unreliable, fragmented, or difficult to understand. Large language models (LLMs) such as ChatGPT-4 show promise in delivering medical information, but their accuracy, readability, and potential for generating hallucinated content require validation before use in patient education.

Objective:

To systematically evaluate ChatGPT-4’s performance in providing patient education on scars and keloids, focusing on its accuracy, reliability, readability, and reference quality.

Methods:

This study collected 354 questions from the Reddit communities r/Keloids, r/SCAR, and r/PlasticSurgery, covering topics including treatment options, preoperative and postoperative care, and psychological impacts. Each question was entered into ChatGPT-4 in an independent session to mimic real-world patient interactions. Responses were evaluated with multiple tools: PEMAT-AI for understandability and actionability, DISCERN-AI for treatment information quality, the Global Quality Scale (GQS) for overall information quality, and standard readability metrics (Flesch Reading Ease, Gunning Fog Index, and others). Three plastic surgeons used the NLAT-AI tool to rate accuracy, safety, and clinical appropriateness, while REF-AI assessed references for hallucination, relevance, and source quality.
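The readability metrics named above are fixed formulas over sentence, word, and syllable counts. A minimal sketch in Python, using a crude vowel-group syllable heuristic for illustration only (published analyses typically rely on a dedicated readability package rather than this approximation):

```python
import re

def count_syllables(word: str) -> int:
    # Approximation: count runs of consecutive vowels, minimum 1 per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    return {
        # Flesch Reading Ease: higher = easier (60-70 ~ plain English)
        "flesch": 206.835 - 1.015 * wps - 84.6 * spw,
        # Gunning Fog Index: estimated years of schooling needed
        "gunning_fog": 0.4 * (wps + 100 * complex_words / len(words)),
    }
```

A Flesch score near 50 and a Fog index near 13, as reported below in the Results, correspond roughly to a 12th-grade reading level on these scales.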

Results:

ChatGPT-4 demonstrated high accuracy and reliability: PEMAT-AI showed 75.5% understandability, DISCERN-AI rated responses as "Good" (26.3/35), and the GQS score was 4.28/5. Surgeons’ evaluations averaged 3.94–4.43/5 across dimensions, with strong internal consistency (Cronbach’s alpha = 0.81). Readability analyses indicated moderate complexity (Flesch Reading Ease: 50.13; Gunning Fog Index: 12.68), corresponding to a 12th-grade reading level. REF-AI identified 11.8% (383/3250) of references as hallucinated; the remaining 88.2% were real, and 95.1% of these came from authoritative sources (e.g., government guidelines and peer-reviewed literature).
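Cronbach’s alpha, used here to gauge internal consistency across the three raters, follows directly from the item variances and the variance of the summed scores. A minimal sketch (the rating arrays in the test are made-up numbers for demonstration, not data from the study):

```python
def cronbach_alpha(items):
    """items: k lists of ratings, each of length n (one score per response).

    alpha = k/(k-1) * (1 - sum(item variances) / variance(row totals))
    """
    k = len(items)
    n = len(items[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(var(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - item_vars / var(totals))
```

Perfectly agreeing raters yield alpha = 1.0; values above roughly 0.8, as reported above, are conventionally read as good consistency.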

Conclusions:

ChatGPT-4 exhibits substantial potential as a patient education tool for scars and keloids, offering reliable and accurate information. However, improvements in readability (to align with 6th–8th-grade standards) and reduction of reference hallucinations are essential to enhance accessibility and trustworthiness. Future LLM optimizations should prioritize simplifying medical language and strengthening reference validation mechanisms to maximize clinical utility. Clinical Trial: not applicable


 Citation

Please cite as:

Rao M

Evaluating GPT-4 Responses on Scars or Keloids for Patient Education: Large Language Model Evaluation Study

JMIR Med Inform 2026;14:e78838

DOI: 10.2196/78838

PMID: 41773665

PMCID: 12954683


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.