
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Mar 13, 2024
Date Accepted: Sep 24, 2024

The final, peer-reviewed published version of this preprint can be found here:

Seo J, Choi D, Kim T, Cha WC, Kim M, Yoo H, Oh N, Yi Y, Lee KH, Choi E

Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study

J Med Internet Res 2024;26:e58329

DOI: 10.2196/58329

PMID: 39566044

PMCID: 11618017

Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study

  • Junhyuk Seo
  • Dasol Choi
  • Taerim Kim
  • Won Chul Cha
  • Minha Kim
  • Haanju Yoo
  • Namkee Oh
  • Yongjin Yi
  • Kye Hwa Lee
  • Edward Choi

ABSTRACT

Background:

The advancement of large language models (LLMs) offers significant opportunities for healthcare, particularly in generating medical documentation. However, challenges in ensuring the accuracy and reliability of LLM outputs, coupled with the absence of established quality standards, have raised concerns about their clinical application.

Objective:

This study aims to introduce and validate an evaluation framework for assessing the accuracy and clinical applicability of LLM-generated emergency department (ED) records, with the broader goal of supporting AI integration in healthcare documentation.

Methods:

We organized the Healthcare Prompt-a-thon, a competitive event designed to explore the capabilities of LLMs in generating accurate medical records. The event involved 52 participants, who generated 33 initial ED records using HyperCLOVA X, a Korean-specialized LLM. We applied a dual evaluation approach: (1) clinical evaluation, in which four medical professionals rated the records on a 5-point Likert scale across five criteria (appropriateness, accuracy, structure/format, conciseness, and clinical validity); and (2) quantitative evaluation, for which we developed a framework to categorize and count errors in the LLM outputs, identifying seven key error types. Statistical methods, including the Pearson correlation and intraclass correlation coefficients (ICCs), were used to assess consistency and agreement among evaluators.
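
For illustration only (this is not the study's code, and all rating values below are hypothetical), the following Python sketch shows how reliability statistics of the kind reported here can be computed: an ICC from a records-by-evaluators matrix of Likert scores, and a Pearson correlation between two rating sessions for test-retest reliability. The abstract does not specify which ICC form was used; ICC(2,1) (two-way random effects, single rater, absolute agreement) is assumed here as one common choice.

    import numpy as np
    from scipy import stats

    # ratings[i, j] = 5-point Likert score for record i from evaluator j
    # (hypothetical values for illustration)
    ratings = np.array([
        [4, 5, 4, 4],
        [3, 3, 2, 3],
        [5, 5, 4, 5],
        [2, 3, 2, 2],
    ], dtype=float)
    n, k = ratings.shape

    # Two-way ANOVA decomposition underlying the ICC
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()    # between records
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()    # between evaluators
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1): two-way random effects, single rater,
    # absolute agreement
    icc = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    print(f"ICC(2,1) = {icc:.3f}")

    # Test-retest reliability: Pearson r between two rating sessions
    session1 = np.array([4.2, 3.0, 4.8, 2.4])
    session2 = np.array([4.0, 3.2, 4.6, 2.6])
    r, p = stats.pearsonr(session1, session2)
    print(f"Test-retest Pearson r = {r:.3f} (P = {p:.3f})")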

Results:

The clinical evaluation demonstrated strong inter-rater reliability, with ICC values ranging from 0.653 to 0.887 (P < .001) and a test-retest reliability Pearson correlation coefficient of 0.776 (P < .001). Quantitative analysis revealed that Invalid Generation Errors were the most common, constituting 35.38% of total errors, while Structural Malformation Errors had the strongest negative impact on clinical validity (Pearson r = -0.654; P < .001). A strong negative correlation was also found between the number of quantitative errors and clinical evaluation scores (Pearson r = -0.633; P < .001), indicating that higher error rates corresponded to lower clinical validity.
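
As a minimal sketch of how such an error-count versus clinical-score correlation can be computed (the arrays below are hypothetical, not the study's data):

    import numpy as np
    from scipy import stats

    # Hypothetical per-record totals: errors found in each LLM-generated
    # record, and the mean clinical evaluation score that record received
    error_counts = np.array([1, 4, 0, 6, 2, 3])
    mean_scores = np.array([4.5, 3.1, 4.8, 2.2, 4.0, 3.4])

    r, p = stats.pearsonr(error_counts, mean_scores)
    # A negative r indicates that records with more errors tend to
    # receive lower clinical scores
    print(f"Pearson r = {r:.3f}, P = {p:.4f}")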

Conclusions:

Our findings support the reliability and clinical acceptability of the proposed evaluation framework and underscore its potential to mitigate clinical burden and foster the responsible integration of AI technologies into healthcare documentation, suggesting a promising direction for future research and practical applications in the field.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.