Accepted for/Published in: JMIR Medical Education
Date Submitted: Mar 4, 2025
Date Accepted: Jul 28, 2025
Development and Validation of a Large Language Model-Based System for Medical History-Taking Training: A Prospective Multi-Case Study on Evaluation Stability, Human-AI Consistency, and Transparency
ABSTRACT
Background:
History-taking is crucial in medical training; however, current methods often lack consistent feedback and standardized evaluation, and access to standardized patient (SP) resources is limited. Artificial intelligence (AI)-powered simulated patients offer a promising solution; however, challenges such as human-AI consistency, evaluation stability, and transparency remain underexplored in multi-case clinical scenarios.
Objective:
This study aimed to develop and validate the AI-powered Medical History-Taking Training and Evaluation System (AMTES), based on DeepSeek-V2.5, to assess its stability, human-AI consistency, and transparency in clinical scenarios with varying symptoms and difficulty levels.
Methods:
We developed AMTES on a Browser/Server (B/S) architecture, employing multiple strategies to enhance dialogue quality and automated assessment, including a specified evaluation output format, cross-checking against the original dialogue text, keyword verification, and split-parallel evaluation. A prospective study with 31 medical students evaluated AMTES's performance across three cases of varying complexity, alongside a post-training questionnaire for user feedback.
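The keyword-verification and cross-checking strategies can be illustrated with a minimal sketch: an LLM-assigned "covered" judgment for a checklist item is accepted only if supporting keywords actually appear in the original dialogue transcript. All function names, checklist items, and keywords below are hypothetical illustrations, not the published system's implementation.

```python
# Hypothetical sketch of keyword verification against the original transcript.
# Checklist items and keywords are illustrative examples only.

def verify_item(transcript: str, keywords: list[str]) -> bool:
    """Return True if any expected keyword appears in the transcript."""
    text = transcript.lower()
    return any(kw.lower() in text for kw in keywords)

def cross_check(llm_scores: dict[str, bool], transcript: str,
                checklist: dict[str, list[str]]) -> dict[str, bool]:
    """Downgrade LLM 'covered' judgments that lack textual evidence."""
    return {item: covered and verify_item(transcript, checklist[item])
            for item, covered in llm_scores.items()}

transcript = "Doctor: How long have you had the cough? Patient: About two weeks."
checklist = {"duration": ["how long", "since when"],
             "sputum": ["sputum", "phlegm"]}
llm_scores = {"duration": True, "sputum": True}  # LLM claims both were asked
print(cross_check(llm_scores, transcript, checklist))
# "duration" keeps its score (evidence found); "sputum" is downgraded
```

In this sketch, the cross-check acts as a guardrail on the LLM evaluator: a score survives only when the transcript itself contains evidence for it, which is one way the abstract's reported evaluation stability could be supported.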
Results:
A total of 31 students practiced with AMTES. 1) AMTES achieved high dialogue accuracy: 98.6% (SD 1.5%) for cough, 99.0% (SD 1.1%) for frequent urination, and 97.9% (SD 2.2%) for abdominal pain. 2) It provided transparent, structured, and stable evaluations, with low coefficients of variation (CV) in total scores (<1.2%), matched-item counts (<0.75%), and key history categories (<1%). 3) Human-AI consistency was strong, with high intraclass correlation coefficients (ICC) for total scores: 0.978 (95% CI 0.955-0.989) in the cough case, 0.923 (95% CI 0.849-0.962) in the frequent urination case, and 0.972 (95% CI 0.943-0.986) in the abdominal pain case. Item-level discrepancies were minimal across all three cases, averaging 1.89 of 66 items (2.87%) for cough, 2.06 of 59 (3.49%) for frequent urination, and 2.85 of 67 (4.25%) for abdominal pain. 4) A large proportion of students found AMTES helpful, with 15 (48%) agreeing and 12 (39%) strongly agreeing; furthermore, 11 (35%) agreed and 15 (48%) strongly agreed that they would like to use AMTES in the future.
Conclusions:
Our data showed that AMTES can enhance the reliability and transparency of automated scoring, demonstrating high stability and human-AI consistency across diverse clinical scenarios. Its strong user approval highlights its potential as a valuable tool for medical history-taking training.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.