Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jun 10, 2025
Date Accepted: Dec 29, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Evaluating ChatGPT Responses on Scar or Keloid for Patient Education
ABSTRACT
Background:
Scars and keloids impose significant physical and psychological burdens on patients, often leading to functional limitations, cosmetic concerns, and mental health issues such as anxiety and depression. Patients increasingly turn to online platforms for information, yet existing web-based resources on scars and keloids are frequently unreliable, fragmented, or difficult to understand. Large language models (LLMs) such as ChatGPT-4 show promise for delivering medical information, but their accuracy, readability, and propensity to generate hallucinated content require validation before use in patient education.
Objective:
To systematically evaluate ChatGPT-4’s performance in providing patient education on scars and keloids, focusing on its accuracy, reliability, readability, and reference quality.
Methods:
We collected 354 questions from Reddit communities (r/Keloids, r/SCAR, r/PlasticSurgery), covering topics including treatment options, preoperative and postoperative care, and psychological impacts. Each question was submitted to ChatGPT-4 in a separate, independent session to mimic real-world patient interactions. Responses were evaluated with multiple instruments: the Patient Education Materials Assessment Tool adapted for AI (PEMAT-AI) for understandability and actionability, the DISCERN-AI for treatment information quality, the Global Quality Scale (GQS) for overall information quality, and standard readability metrics (Flesch Reading Ease, Gunning Fog Index, and others). Three plastic surgeons rated accuracy, safety, and clinical appropriateness using the NLAT-AI tool, while the REF-AI tool assessed cited references for hallucination, relevance, and source quality.
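The readability metrics named above follow published formulas: Flesch Reading Ease = 206.835 − 1.015 × (words/sentence) − 84.6 × (syllables/word), and Gunning Fog Index = 0.4 × [(words/sentence) + 100 × (complex words/total words)]. Below is a minimal Python sketch of this scoring step, not the authors' actual pipeline (which the abstract does not specify); the syllable counter is a rough heuristic, and a validated library such as textstat would normally be preferred.

```python
import re

def count_syllables(word: str) -> int:
    """Heuristic syllable count: vowel groups, minus a common silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> dict:
    """Flesch Reading Ease and Gunning Fog Index from their standard formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Gunning Fog treats words of three or more syllables as "complex"
    # (the full definition also excludes proper nouns and familiar jargon).
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word

    flesch = 206.835 - 1.015 * wps - 84.6 * spw
    fog = 0.4 * (wps + 100 * complex_words / len(words))
    return {"flesch_reading_ease": flesch, "gunning_fog": fog}

print(readability("Keloids are raised scars that grow beyond the original wound. "
                  "Treatment options include corticosteroid injections and surgery."))
```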
Results:
ChatGPT-4 demonstrated high accuracy and reliability: PEMAT-AI showed 75.5% understandability, DISCERN-AI rated responses as "Good" (26.3/35), and the GQS score was 4.28/5. Surgeons' ratings averaged 3.94–4.43/5 across dimensions, with strong internal consistency (Cronbach's alpha = 0.81). Readability analyses indicated moderate complexity (Flesch Reading Ease 50.13, Gunning Fog Index 12.68), corresponding to approximately a 12th-grade reading level. REF-AI flagged 383 of 3250 references (11.8%) as hallucinated; the remaining 88.2% were verifiable, with 95.1% drawn from authoritative sources (e.g., government guidelines and peer-reviewed literature).
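The internal-consistency statistic quoted above has a standard closed form: alpha = k/(k−1) × (1 − Σ per-rater variance / variance of per-question totals), where k is the number of raters. A minimal sketch with made-up ratings (not the study's data; the function name cronbach_alpha is ours):

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for a (questions x raters) score matrix."""
    k = ratings.shape[1]
    item_var = ratings.var(axis=0, ddof=1).sum()  # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of per-question totals
    return k / (k - 1) * (1 - item_var / total_var)

# Hypothetical 1-5 ratings from three surgeons on five questions (illustration only).
scores = np.array([[4, 4, 4],
                   [3, 3, 4],
                   [5, 5, 5],
                   [4, 4, 3],
                   [5, 4, 5]])
print(round(cronbach_alpha(scores), 2))  # prints 0.84 for this toy matrix
```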
Conclusions:
ChatGPT-4 exhibits substantial potential as a patient education tool for scars and keloids, offering reliable and accurate information. However, improving readability (to align with 6th- to 8th-grade standards) and reducing reference hallucinations are essential to enhance accessibility and trustworthiness. Future LLM optimizations should prioritize simplifying medical language and strengthening reference validation mechanisms to maximize clinical utility.
Clinical Trial: not applicable
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.