Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jan 18, 2024
Date Accepted: Sep 25, 2024
The Accuracy and Capability of Artificial Intelligence Solutions in Healthcare Exams and Certificates: A Systematic Review and Meta-Analysis
ABSTRACT
Background:
Large language models (LLMs) have dominated public interest owing to their apparent ability to accurately reproduce learned knowledge in narrative text.
Objective:
In response to this rapidly progressing field, we aimed to establish a baseline performance and quality standard for the current generation of LLMs in narrative medical response tasks.
Methods:
We quantified the accuracy of LLMs in responding to healthcare examination questions and evaluated the consistency and quality of study reporting. The protocol was registered with OSF (https://osf.io/xqzkw). The search included all papers up until 09/10/2023, at which point a preliminary search was conducted and piloting of the study selection process commenced using MEDLINE, Embase, Global Health, the Cochrane Library, and the Health Technology Assessment Database, via the OVID search interface. The literature search used the following MeSH terms in all possible combinations: ‘artificial intelligence’, ‘ChatGPT’, ‘GPT’, ‘LLM’, ‘Large Language Model’, ‘machine learning’, ‘neural network’, ‘Generative Pre-trained Transformer’, ‘Generative Transformer’, ‘Generative Language Model’, ‘Generative Model’, ‘medical exam’, ‘healthcare exam’, and ‘clinical exam’. Sensitivity, accuracy, and precision data were extracted, including the relevant confidence intervals.
Results:
The search identified 1673 relevant citations. After removal of duplicates, 1268 articles were screened by title and abstract, and 32 studies were included for full-text review. Our meta-analysis suggests a pooled accuracy of 0.61 (CI 0.58, 0.64) for LLMs across medical exams overall, 0.51 (CI 0.46, 0.56) for LLMs on the USMLE, and 0.64 (CI 0.60, 0.67) for ChatGPT across medical exams overall.
Conclusions:
To guide policy and deployment decisions on the use of large language models to advance healthcare, we propose a new framework called RUBRICC - Regulatory, Usability, Bias, Reliability (Evidence & Safety), Interoperability, Cost, & Co-design-PPIE. This presents a valuable opportunity to direct the clinical commissioning of new LLM capabilities into health services while respecting patient safety considerations. Clinical Trial: OSF (https://osf.io/xqzkw)
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.