Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jan 18, 2024
Date Accepted: Sep 25, 2024
The Accuracy and Capability of Artificial Intelligence Solutions in Healthcare Exams and Certificates: A Systematic Review and Meta-Analysis
ABSTRACT
Background:
Large language models (LLMs) have dominated public interest owing to their apparent ability to accurately reproduce learned knowledge in narrative text.
Objective:
In response to this rapidly progressing field, we aimed to establish a baseline performance and quality standard for the current generation of LLMs in narrative medical response tasks.
Methods:
We quantified the accuracy of LLMs in responding to healthcare examination questions and evaluated the consistency and quality of study reporting. The protocol was registered with OSF (https://osf.io/xqzkw). The search included all papers up until 09/10/2023, at which point a preliminary search was conducted and piloting of the study selection process commenced using MEDLINE, Embase, Global Health, the Cochrane Library, and the Health Technology Assessment Database, via the OVID search interface. The literature search used the following MeSH terms in all possible combinations: ‘artificial intelligence’, ‘ChatGPT’, ‘GPT’, ‘LLM’, ‘Large Language Model’, ‘machine learning’, ‘neural network’, ‘Generative Pre-trained Transformer’, ‘Generative Transformer’, ‘Generative Language Model’, ‘Generative Model’, ‘medical exam’, ‘healthcare exam’, and ‘clinical exam’. Sensitivity, accuracy, and precision data were extracted, including the relevant confidence intervals.
Results:
The search identified 1673 relevant citations. After removal of duplicates, 1268 articles were screened by title and abstract, and 32 studies were included for full-text review. Our meta-analysis suggests a pooled accuracy of 0.61 (CI 0.58, 0.64) for LLMs across medical exams overall, 0.51 (CI 0.46, 0.56) for LLMs on the USMLE, and 0.64 (CI 0.60, 0.67) for ChatGPT across medical exams overall.
Conclusions:
To guide policy and deployment decisions on the use of large language models to advance healthcare, we propose a new framework called RUBRICC - Regulatory, Usability, Bias, Reliability (Evidence & Safety), Interoperability, Cost, & Co-design-PPIE. This presents a valuable opportunity to direct the clinical commissioning of new LLM capabilities into health services while respecting patient safety considerations. Clinical Trial: OSF (https://osf.io/xqzkw)
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.