Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Mar 16, 2024
Date Accepted: Oct 21, 2024
The Transformative Potential of Large Language Models in Mining Electronic Health Records Data: A Study in Data Science and Health Informatics
ABSTRACT
Background:
Clinical Natural Language Processing (cNLP), a subfield of artificial intelligence dedicated to the analysis of clinical texts, has developed significantly over recent decades. Recent advancements in computing power and algorithms have enabled its expanded application in oncological research.
Objective:
To explore the potential of Large Language Models (LLMs) to extract and structure information from free-text clinical reports, with a specific focus on identifying and classifying patient comorbidities in oncology electronic health records. We specifically evaluate the gpt-3.5-turbo-1106 and gpt-4-1106-preview models in comparison with the capabilities of specialized human evaluators.
Methods:
We implemented a script using the OpenAI API to extract structured information in JSON format from comorbidities reported in 250 personal history reports. These reports were manually reviewed in batches of 50 by five specialists in radiation oncology. We compared the results using metrics such as Sensitivity, Specificity, Precision, Accuracy, F-value, Kappa index, and the McNemar test, in addition to examining the common causes of errors in both humans and GPT models.
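The agreement metrics named in the Methods can be illustrated with a minimal sketch. It assumes that each comorbidity decision has been reduced to a binary outcome against a reference standard and summarized as confusion counts, and that "F-value" refers to the F1 score. The function names (`confusion_metrics`, `mcnemar_exact`) and the choice of the exact binomial form of the McNemar test are assumptions made for illustration, not details taken from the study.

```python
from math import comb


def confusion_metrics(tp, fp, fn, tn):
    """Compute agreement metrics from binary confusion counts.

    Assumes every denominator is non-zero (at least one positive and one
    negative case, and at least one positive prediction).
    """
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)          # recall on true comorbidities
    specificity = tn / (tn + fp)          # recall on absent comorbidities
    precision = tp / (tp + fp)            # positive predictive value
    accuracy = (tp + tn) / total
    f1 = 2 * precision * sensitivity / (precision + sensitivity)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_both_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_both_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_expected = p_both_yes + p_both_no
    kappa = (p_observed - p_expected) / (1 - p_expected)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "kappa": kappa,
    }


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases rater A got right and rater B got wrong.
    c: cases rater A got wrong and rater B got right.
    Only these discordant pairs enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts, not data from the study.
    print(confusion_metrics(tp=90, fp=5, fn=10, tn=95))
    print(mcnemar_exact(b=0, c=10))
```

The McNemar test compares two raters on the same items, so it uses only the discordant pairs; the other metrics each compare one rater against the reference standard.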
Results:
The GPT-3.5 model exhibited slightly lower performance compared to physicians across all metrics, though the differences were not statistically significant. GPT-4 demonstrated clear superiority in several key metrics. Notably, it achieved a sensitivity of 96.8%, compared to 88.2% for GPT-3.5 and 88.8% for physicians. However, physicians marginally outperformed GPT-4 in precision (97.7% vs. 96.8%). GPT-4 showed greater consistency, replicating exact results in 76% of the reports after 10 analyses, in contrast to 59% for GPT-3.5. Physicians were more likely to miss explicit comorbidities, while the GPT models more frequently inferred non-explicit comorbidities, sometimes correctly, though this also resulted in more false positives.
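The consistency figure (exact replication across 10 analyses) could be computed along the following lines. This is a sketch under stated assumptions: the abstract does not say how "exact results" were compared, so the code treats each run's output as an unordered set of comorbidity labels and counts a report as replicated only when all runs yield the same set. The function name `exact_replication_rate` and the input layout are hypothetical.

```python
def exact_replication_rate(runs_by_report):
    """Fraction of reports whose repeated runs all produced the same output.

    runs_by_report: dict mapping a report ID to a list of outputs, one per
    run, where each output is an iterable of comorbidity labels.
    Order and duplicates within one output are ignored (an assumption).
    """
    if not runs_by_report:
        raise ValueError("no reports supplied")

    replicated = 0
    for outputs in runs_by_report.values():
        distinct = {frozenset(output) for output in outputs}
        if len(distinct) == 1:
            replicated += 1
    return replicated / len(runs_by_report)


if __name__ == "__main__":
    # Hypothetical outputs for three reports, three runs each.
    example = {
        "r1": [["diabetes", "hypertension"], ["hypertension", "diabetes"], ["diabetes", "hypertension"]],
        "r2": [["copd"], ["copd"], ["copd", "asthma"]],
        "r3": [[], [], []],
    }
    print(exact_replication_rate(example))
```

A stricter comparison (for example, on the raw JSON string) would give a lower rate, so the comparison rule should be fixed before models are compared.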
Conclusions:
The studied LLMs, with carefully designed prompts, demonstrate competence comparable to that of medical specialists in interpreting clinical reports, even in complex and confusingly written texts. Considering also their superior efficiency in terms of time and costs, these models represent a preferable option over human analysis for data mining and structuring information in large collections of clinical reports.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.