Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jan 29, 2025
Date Accepted: May 14, 2025
Implementing Large Language Models in Healthcare: A Clinician’s Review and Guideline
ABSTRACT
Background:
Large language models (LLMs) can generate human-understandable outputs, such as answers to medical questions and radiology reports. With the rapid development of LLMs, clinicians face a growing challenge in determining which algorithms best support their work.
Objective:
We aim to provide clinicians and other healthcare practitioners with systematic guidance in selecting an LLM relevant and appropriate to their needs, and to facilitate the integration of LLMs into healthcare.
Methods:
We conducted a literature search for full-text publications in English on clinical applications of LLMs published between January 1, 2022 and March 31, 2025 in PubMed, Science Direct, Scopus, and IEEE Xplore. We excluded papers in journals below a set citation threshold, as well as papers that did not focus on LLMs, were not research-based, or had no clinical applications. We also searched arXiv over the same period and included papers on the clinical applications of innovative multimodal LLMs. This yielded a total of 270 studies.
Results:
We collected 330 LLMs and recorded how frequently each was applied to clinical tasks and how frequently it performed best in its context. Based on a five-stage clinical workflow, we found that Stages II, III, and IV are the key stages, encompassing numerous clinical subtasks and LLMs. However, the diversity of LLMs that perform optimally in each context remains limited. GPT-3.5 and GPT-4 are the most versatile models across the five-stage clinical workflow: they were applied to 52% and 63% of the clinical subtasks and performed best in 29% and 54% of them, respectively. General-purpose LLMs may not perform well in specialized areas, as they often require lightweight prompt engineering or fine-tuning on specific datasets to improve performance. Most LLMs with multimodal abilities are closed-source models; they therefore lack transparency, cannot be customized or fine-tuned for specific clinical tasks, and may pose challenges to data protection and privacy, which are common requirements in clinical settings.
Conclusions:
In this review, we found that LLMs may help clinicians with a variety of clinical tasks. However, we did not find evidence of generalist clinical LLMs successfully applicable to a wide range of clinical tasks; their clinical deployment therefore remains challenging. Based on this review, we propose an interactive online guideline to help clinicians select suitable LLMs by clinical task. Written from a clinical perspective and free of unnecessary technical jargon, this guideline may serve as a reference for successfully applying LLMs in clinical settings.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.