
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Mar 13, 2024
Date Accepted: Sep 24, 2024

The final, peer-reviewed published version of this preprint can be found here:

Seo J, Choi D, Kim T, Cha WC, Kim M, Yoo H, Oh N, Yi Y, Lee KH, Choi E

Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study

J Med Internet Res 2024;26:e58329

DOI: 10.2196/58329

PMID: 39566044

PMCID: 11618017

Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study

  • Junhyuk Seo
  • Dasol Choi
  • Taerim Kim
  • Won Chul Cha
  • Minha Kim
  • Haanju Yoo
  • Namkee Oh
  • Yongjin Yi
  • Kye Hwa Lee
  • Edward Choi

ABSTRACT

Background:

The advancement of large language models (LLMs) offers significant opportunities for healthcare, particularly in generating medical documentation. However, challenges in ensuring the accuracy and reliability of LLM outputs, coupled with the absence of established quality standards, have raised concerns about their clinical application.

Objective:

This study aims to introduce and validate an evaluation framework for assessing the accuracy and clinical applicability of LLM-generated emergency department (ED) records, with the broader goal of supporting AI integration in healthcare documentation.

Methods:

We organized the Healthcare Prompt-a-thon, a competitive event designed to explore the capabilities of LLMs in generating accurate medical records. The event involved 52 participants, who generated 33 initial ED records using HyperCLOVA X, a Korean-specialized LLM. We applied a dual evaluation approach: (1) clinical evaluation, in which four medical professionals rated the records on a 5-point Likert scale across five criteria (appropriateness, accuracy, structure/format, conciseness, and clinical validity); and (2) quantitative evaluation, for which we developed a framework to categorize and count errors in the LLM outputs, identifying seven key error types. Statistical methods, including the Pearson correlation and intraclass correlation coefficients (ICCs), were used to assess consistency and agreement among evaluators.
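
For illustration only (this is not the study's code, and all rating values below are hypothetical), the following Python sketch shows how reliability statistics of the kind reported here can be computed: an ICC from a records-by-evaluators matrix of Likert scores, and a Pearson correlation between two rating sessions for test-retest reliability. The abstract does not specify which ICC form was used; ICC(2,1) (two-way random effects, single rater, absolute agreement) is assumed here as one common choice.

    import numpy as np
    from scipy import stats

    # ratings[i, j] = 5-point Likert score for record i from evaluator j
    # (hypothetical values for illustration)
    ratings = np.array([
        [4, 5, 4, 4],
        [3, 3, 2, 3],
        [5, 5, 4, 5],
        [2, 3, 2, 2],
    ], dtype=float)
    n, k = ratings.shape

    # Two-way ANOVA decomposition underlying the ICC
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()    # between records
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()    # between evaluators
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1): two-way random effects, single rater,
    # absolute agreement
    icc = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    print(f"ICC(2,1) = {icc:.3f}")

    # Test-retest reliability: Pearson r between two rating sessions
    session1 = np.array([4.2, 3.0, 4.8, 2.4])
    session2 = np.array([4.0, 3.2, 4.6, 2.6])
    r, p = stats.pearsonr(session1, session2)
    print(f"Test-retest Pearson r = {r:.3f} (P = {p:.3f})")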

Results:

The clinical evaluation demonstrated strong inter-rater reliability, with ICC values ranging from 0.653 to 0.887 (P < .001) and a test-retest reliability Pearson correlation coefficient of 0.776 (P < .001). Quantitative analysis revealed that Invalid Generation Errors were the most common, constituting 35.38% of total errors, while Structural Malformation Errors had the strongest negative impact on clinical validity (Pearson r = -0.654; P < .001). A strong negative correlation was also found between the number of quantitative errors and clinical evaluation scores (Pearson r = -0.633; P < .001), indicating that higher error rates corresponded to lower clinical validity.
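
As a minimal sketch of how such an error-count versus clinical-score correlation can be computed (the arrays below are hypothetical, not the study's data):

    import numpy as np
    from scipy import stats

    # Hypothetical per-record totals: errors found in each LLM-generated
    # record, and the mean clinical evaluation score that record received
    error_counts = np.array([1, 4, 0, 6, 2, 3])
    mean_scores = np.array([4.5, 3.1, 4.8, 2.2, 4.0, 3.4])

    r, p = stats.pearsonr(error_counts, mean_scores)
    # A negative r indicates that records with more errors tend to
    # receive lower clinical scores
    print(f"Pearson r = {r:.3f}, P = {p:.4f}")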

Conclusions:

Our findings support the reliability and clinical acceptability of the proposed evaluation framework and underscore its potential to mitigate clinical burden and foster the responsible integration of AI technologies into healthcare documentation, suggesting a promising direction for future research and practical applications in the field.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.