
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jul 29, 2025
Date Accepted: Dec 19, 2025

The final, peer-reviewed published version of this preprint can be found here:

Evaluating and Validating Large Language Models for Health Education on Developmental Dysplasia of the Hip: 2-Phase Study With Expert Ratings and a Pilot Randomized Controlled Trial

Ouyang H, Lin G, Li Y, Yao Z, Li Y, Yan H, Qin F, Yao J, Chen Y

J Med Internet Res 2026;28:e73326

DOI: 10.2196/73326

PMID: 41554120

PMCID: 12865344

Evaluating and validating large language models for health education on developmental dysplasia of the hip: Two-phase study with expert ratings and a pilot randomized controlled trial

  • Hui Ouyang; 
  • Gan Lin; 
  • Yiyuan Li; 
  • Zhixing Yao; 
  • Yating Li; 
  • Han Yan; 
  • Fang Qin; 
  • Jinghui Yao; 
  • Yun Chen

ABSTRACT

Background:

Developmental dysplasia of the hip (DDH) is a common pediatric orthopedic disease, and health education is vital to its management and rehabilitation. The emergence of large language models (LLMs) has created new opportunities for health education. However, the effectiveness and applicability of LLMs in DDH education have not been systematically evaluated.

Objective:

This study conducted an integrated two-phase evaluation to assess the quality and educational effectiveness of LLM-generated DDH education materials.

Methods:

This study comprised two phases. In Phase 1, a 16-item DDH question bank grounded in Bloom's taxonomy was created through literature analysis and expert collaboration. Four LLMs (ChatGPT-4, DeepSeek-V3, Gemini 2.0 Flash, and Copilot) were queried using standardized prompts. All responses were independently evaluated by five pediatric orthopedic experts using 5-point Likert scales for accuracy, fluency, and richness, together with the PEMAT-P and DISCERN instruments. Readability was measured with readability formulas. Data were analyzed using Kruskal-Wallis tests, ANOVA, and post hoc comparisons. In Phase 2, an assessor-blinded, two-arm pilot randomized controlled trial was conducted: 127 caregivers were randomized to an LLM-assisted education group or a web search control group. The intervention included structured LLM training, supervised practice, and two weeks of reinforcement training. Outcomes, measured at baseline, post-intervention, and 2-week follow-up, were eHealth literacy (primary outcome), DDH knowledge, health risk perception, perceived usefulness, information self-efficacy, and health information-seeking behavior. Linear mixed-effects models and Cohen d effect sizes were computed on an intention-to-treat basis.

Results:

The four LLMs differed significantly in accuracy, richness, fluency, PEMAT-P understandability, and DISCERN scores (P < .05). ChatGPT-4 (63.67 [IQR 63.67–64.67]) and DeepSeek-V3 (63.67 [IQR 63.33–64.67]) generated more accurate text than Copilot (59.00 [IQR 58.67–59.67]). DeepSeek-V3 (64.00 [IQR 64.00–64.00]) produced richer language than Copilot (52.33 [IQR 51.33–52.67]), and Gemini 2.0 Flash (72.67 [IQR 72.33–73.00]) was more fluent than Copilot (65.67 [IQR 63.33–65.67]). In Phase 2, the intervention group showed higher eHealth literacy at T1 (33.62 [95% CI 32.76–34.49]; d=0.20 [95% CI 0.13–0.56]) and T2 (33.27 [95% CI 32.38–34.17]; d=0.36 [95% CI 0.01–0.80]), greater DDH knowledge at T1 (7.87 [95% CI 7.48–8.25]; d=0.71 [95% CI 0.33–1.11]) and T2 (7.12 [95% CI 6.72–7.51]; d=0.54 [95% CI 0.17–0.96]), and slight improvements in health risk perception and perceived usefulness. Other outcomes showed positive but nonsignificant trends.

Conclusions:

Mainstream LLMs demonstrate varying capacities for generating DDH educational content. They effectively generated caregiver education materials, improving eHealth literacy and DDH knowledge. Although LLMs can address general informational needs, they cannot fully substitute for clinical evaluation. Future research should focus on optimizing plain language, refining dialogue design, and enhancing audience personalization to improve the quality of LLM-generated materials.

Clinical Trial: Chinese Clinical Trial Registry ChiCTR2000038980; http://www.chictr.org.cn/showproj.aspx?proj=62659





© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.