Accepted for/Published in: JMIR Medical Education
Date Submitted: Aug 22, 2023
Date Accepted: Sep 24, 2024
Evaluation of a Computer-based Morphological Analysis Method for Free-text Responses in the General Medicine In-training Examination: A Pilot Study
ABSTRACT
Background:
The General Medicine In-training Examination (GM-ITE) tests the clinical knowledge of residents in Japan's two-year postgraduate residency program. In academic year 2021, the GM-ITE included questions in the medical safety domain that assessed diagnosis from the medical history and physical findings obtained through video viewing, as well as case presentation skills. Examinees watched a video/audio recording of a patient examination and provided free-text responses. The human cost of scoring free-text answers may limit the implementation of the GM-ITE. A simple morphological analysis and word-matching model can be used to score free-text responses instead.
Objective:
We compared human and computer scoring of free-text responses and qualitatively examined the discrepancies between human- and machine-generated scores to assess the efficacy of machine scoring.
Methods:
After obtaining consent for participation in the study, the authors used text data from residents who voluntarily answered the GM-ITE patient-reproduction video-based questions involving simulated patients. The GM-ITE used video-based questions to simulate a patient's consultation in the emergency room with a diagnosis of pulmonary embolism following a fracture. Residents provided statements for the case presentation. In 2022, we obtained human-generated scores by collating the results of two independent scorers, and machine-generated scores by converting the free-text responses into a word sequence through segmentation and morphological analysis and matching the words against a prepared list of correct answers.
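The abstract does not name the morphological analyzer or the answer list the authors used. As a minimal sketch of the segmentation-and-matching pipeline described above, the following Python example uses the open-source Janome tokenizer as a stand-in analyzer and a hypothetical keyword list; neither is taken from the study.

# A minimal sketch (not the authors' actual code) of the scoring pipeline:
# segment a free-text response into morphemes and count matches against a
# prepared answer list. Janome stands in for an unspecified Japanese
# morphological analyzer; the keyword list below is hypothetical.
from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()

def to_words(text: str) -> list[str]:
    """Segment text into surface-form morphemes via morphological analysis."""
    return [token.surface for token in tokenizer.tokenize(text)]

def score_response(response: str, answers: list[str]) -> int:
    """Count correct-answer phrases whose morpheme sequence appears in the
    response's morpheme sequence. Tokenizing both sides with the same
    analyzer tolerates differences in segmentation granularity."""
    words = to_words(response)
    hits = 0
    for answer in answers:
        ans_words = to_words(answer)
        n = len(ans_words)
        if any(words[i:i + n] == ans_words for i in range(len(words) - n + 1)):
            hits += 1
    return hits

# Hypothetical correct-answer keywords for a pulmonary embolism case.
CORRECT_ANSWERS = ["肺塞栓症", "骨折", "呼吸困難"]
response = "下肢の骨折後に呼吸困難が出現し、肺塞栓症が疑われる。"
print(score_response(response, CORRECT_ANSWERS))  # 3, if segmentation is consistent

Because the answer phrases and the response are tokenized by the same analyzer, multi-morpheme answers can still match; in practice, accuracy of this kind of matching depends on maintaining the answer list and dictionaries, which is the calibration the study discusses.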
Results:
Of the 104 responses collected (63 from postgraduate year 1 [PGY-1] and 41 from postgraduate year 2 [PGY-2] residents), 39 cases remained for the final analysis after invalid responses were excluded. The authors found discrepancies between human and machine scoring in 7.2% of the cases; some were due to shortcomings in machine scoring that could be resolved by maintaining the list of correct words and the dictionaries, and others were due to human error.
Conclusions:
Machine scoring was comparable to human scoring. Although it requires a simple program and calibration, machine scoring can potentially reduce the cost of scoring free-text responses.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.