Accepted for/Published in: JMIR Formative Research

Date Submitted: Jul 16, 2025
Open Peer Review Period: Jul 17, 2025 - Sep 11, 2025
Date Accepted: Oct 15, 2025

The final, peer-reviewed published version of this preprint can be found here:

Quality Assessment of Large Language Model–Generated Medical Dialogue for Clinical Vignettes: Evaluation Study

Yanagita Y, Yokokawa D, Ihara S, Yoshida R, Okano Y, Uehara T

JMIR Form Res 2025;9:e80752

DOI: 10.2196/80752

PMID: 41183323

PMCID: 12624296

Can GPT Generate Medical Dialogue for Clinical Vignettes: An Evaluation

  • Yasutaka Yanagita; 
  • Daiki Yokokawa; 
  • Shiichi Ihara; 
  • Ryo Yoshida; 
  • Yoshihide Okano; 
  • Takanori Uehara

ABSTRACT

Background:

Clinical vignettes often focus on prototypical presentations; require substantial time and effort to develop; and fail to represent patient diversity, the complexity of clinical conditions, patients’ perspectives, and the dynamic nature of physician–patient interactions.

Objective:

We evaluated the quality of physician–patient dialogues produced by generative AI in Japanese, focusing on their medical accuracy and overall appropriateness as medical interviews.

Methods:

To generate a physician–patient dialogue, we created an AI prompt that included a specific clinical history and instructed the model to simulate a cooperative patient responding to the physician’s questions. The target diseases were those covered by the Japanese National Medical Licensing Examination. Each dialogue consisted of 25 turns by the physician and 25 by the patient, reflecting the typical volume of conversation in Japanese outpatient settings. Three internists independently evaluated each generated dialogue on a 7-point Likert scale across six criteria: coherence of the conversation, medical accuracy of the patient’s responses, medical accuracy of the physician’s responses, content of the medical history, communication skills, and professionalism. In addition, the composite score for each dialogue was calculated as the overall mean of these six criteria.
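The scoring scheme described above can be sketched in a few lines of Python: three raters score each dialogue on six criteria using a 7-point Likert scale, and the composite is the overall mean of all ratings. This is a minimal illustration only; the rater and criterion identifiers below are hypothetical shorthand, not names taken from the study.

```python
from statistics import mean

# Shorthand labels for the six criteria described in the Methods
# (hypothetical identifiers, not the study's own variable names).
CRITERIA = [
    "coherence",            # coherence of the conversation
    "patient_accuracy",     # medical accuracy of the patient's responses
    "physician_accuracy",   # medical accuracy of the physician's responses
    "history_content",      # content of the medical history
    "communication",        # communication skills
    "professionalism",      # professionalism
]

def composite_score(ratings):
    """Composite score for one dialogue: the overall mean of all
    7-point Likert ratings across raters and the six criteria.

    `ratings` maps rater id -> {criterion: score in 1..7}.
    """
    scores = [r[c] for r in ratings.values() for c in CRITERIA]
    return mean(scores)

# Example: three hypothetical internists rating one dialogue.
ratings = {
    "rater_a": dict(zip(CRITERIA, [6, 6, 5, 6, 5, 5])),
    "rater_b": dict(zip(CRITERIA, [6, 7, 6, 6, 6, 6])),
    "rater_c": dict(zip(CRITERIA, [5, 6, 5, 6, 5, 5])),
}
print(round(composite_score(ratings), 1))  # prints 5.7
```

Averaging over all 18 ratings (3 raters × 6 criteria) is equivalent to first averaging each criterion across raters and then averaging those six means, since every criterion receives the same number of ratings.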

Results:

The mean scores (standard deviation) for the six criteria were as follows: coherence of the conversation: 5.9 (0.9); medical accuracy of the patient’s responses: 6.0 (0.9); medical accuracy of the physician’s responses: 5.6 (1.1); content of the medical history: 5.9 (0.9); communication skills: 5.6 (0.9); and professionalism: 5.5 (1.1). The composite score was 5.7 (1.0).

Conclusions:

While physician oversight remains essential, it is feasible to efficiently create AI-generated educational materials for medical education that overcome the limitations of traditional clinical vignettes. This approach may reduce time and financial burdens, enhancing opportunities to practice clinical interviewing in settings that closely mirror real-world encounters.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.