
Accepted for/Published in: JMIR Medical Education

Date Submitted: Mar 4, 2025
Date Accepted: Jul 28, 2025

The final, peer-reviewed published version of this preprint can be found here:

Development and Validation of a Large Language Model–Based System for Medical History-Taking Training: Prospective Multicase Study on Evaluation Stability, Human-AI Consistency, and Transparency

Liu Y, Shi C, Wu L, Lin X, Chen X, Zhu Y, Tan H, Zhang W

JMIR Med Educ 2025;11:e73419

DOI: 10.2196/73419

PMID: 40882613

PMCID: 12396829

Development and Validation of a Large Language Model–Based System for Medical History-Taking Training: Prospective Multicase Study on Evaluation Stability, Human-AI Consistency, and Transparency

  • Yang Liu; 
  • Chujun Shi; 
  • Liping Wu; 
  • Xiule Lin; 
  • Xiaoqin Chen; 
  • Yiying Zhu; 
  • Haizhu Tan; 
  • Weishan Zhang

ABSTRACT

Background:

History-taking is crucial in medical training; however, current methods often lack consistent feedback and standardized evaluation, and access to standardized patient (SP) resources is limited. Artificial intelligence (AI)-powered simulated patients offer a promising solution, but challenges such as human-AI consistency, evaluation stability, and transparency remain underexplored in multicase clinical scenarios.

Objective:

This study aimed to develop and validate the AI-powered Medical History-Taking Training and Evaluation System (AMTES), based on DeepSeek-V2.5, to assess its stability, human-AI consistency, and transparency in clinical scenarios with varying symptoms and difficulty levels.

Methods:

We developed AMTES on a Browser/Server (B/S) architecture, employing multiple strategies, including specified evaluation output formats, cross-checking against the original dialogue text, keyword verification, and split-parallel evaluation, to enhance dialogue quality and automated assessment. A prospective study with 31 medical students evaluated AMTES's performance across 3 cases of varying complexity, alongside a posttraining questionnaire for user feedback.
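The evaluation safeguards described here (a fixed structured output format, cross-checking claimed evidence against the original dialogue text, and keyword verification) can be illustrated with a minimal sketch. The checklist items, keywords, and dialogue below are illustrative assumptions, not the system's actual prompts or rubric:

```python
import json

# Hypothetical checklist: each scoring item has keywords that the
# quoted evidence must contain before a match is accepted.
CHECKLIST = {
    "onset": ["when", "start", "begin"],
    "duration": ["day", "week", "how long"],
}

def verify_item(item_id: str, quoted_evidence: str, dialogue: str) -> bool:
    """Accept an LLM-claimed checklist match only if the quoted evidence
    literally appears in the dialogue (cross-check against the original
    text) and contains an expected keyword (keyword verification)."""
    if quoted_evidence not in dialogue:
        return False
    return any(kw in quoted_evidence.lower() for kw in CHECKLIST[item_id])

dialogue = "Student: When did the cough start? Patient: Three days ago."
# The evaluator is asked to return a specified JSON format, which makes
# its claims machine-checkable rather than free text.
llm_output = json.loads(
    json.dumps({"item": "onset", "evidence": "When did the cough start?"})
)
print(verify_item(llm_output["item"], llm_output["evidence"], dialogue))  # True
```

Requiring the evaluator to quote its evidence in a fixed format is what makes this kind of post hoc verification, and hence evaluation transparency, possible.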

Results:

A total of 31 students practiced with AMTES. (1) AMTES achieved high dialogue accuracy: 98.6% (SD 1.5%) for cough, 99% (SD 1.1%) for frequent urination, and 97.9% (SD 2.2%) for abdominal pain. (2) It provided transparent, structured, and stable evaluations, with low coefficients of variation (CV) in total scores (<1.2%), matched-item counts (<0.75%), and key history categories (<1%). (3) Human-AI consistency was strong, with high intraclass correlation coefficients (ICC) for total scores: 0.978 (95% CI 0.955-0.989) for the cough case, 0.923 (95% CI 0.849-0.962) for the frequent urination case, and 0.972 (95% CI 0.943-0.986) for the abdominal pain case. Item-level discrepancies were minimal across all 3 cases, averaging 1.89 of 66 items (2.87%) for cough, 2.06 of 59 (3.49%) for frequent urination, and 2.85 of 67 (4.25%) for abdominal pain. (4) Most students found AMTES helpful, with 15 (48%) agreeing and 12 (39%) strongly agreeing; furthermore, 11 (35%) agreed and 15 (48%) strongly agreed that they would like to use AMTES in the future.
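The stability metric reported here, the coefficient of variation, is simply the SD of repeated scores divided by their mean, expressed as a percentage. A minimal sketch with made-up repeat scores (not study data):

```python
import statistics

def coefficient_of_variation(scores):
    """CV (%) = 100 * sample SD / mean. Lower values mean the automated
    evaluator returns nearly identical scores on repeated runs; the study
    reports CVs below 1.2% for total scores."""
    mean = statistics.mean(scores)
    return 100 * statistics.stdev(scores) / mean

# Illustrative repeated AI total scores for one student encounter.
repeat_scores = [88.0, 88.5, 87.9, 88.2, 88.1]
print(f"CV = {coefficient_of_variation(repeat_scores):.2f}%")
```

A CV under 1.2% on a 100-point scale corresponds to run-to-run score fluctuations of roughly a point or less, which is why it is a natural headline metric for evaluation stability.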

Conclusions:

Our data showed that AMTES can enhance the reliability and transparency of automated scoring, demonstrating high stability and human-AI consistency across diverse clinical scenarios. Its strong user approval highlights its potential as a valuable tool for medical history-taking training.


 Citation

Please cite as:

Liu Y, Shi C, Wu L, Lin X, Chen X, Zhu Y, Tan H, Zhang W

Development and Validation of a Large Language Model–Based System for Medical History-Taking Training: Prospective Multicase Study on Evaluation Stability, Human-AI Consistency, and Transparency

JMIR Med Educ 2025;11:e73419

DOI: 10.2196/73419

PMID: 40882613

PMCID: 12396829


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.