Currently submitted to: Journal of Medical Internet Research

Date Submitted: Jun 8, 2026
Open Peer Review Period: Jun 9, 2026 - Aug 4, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Augmenting Large Language Models for Clinical Decision-Making in Healthcare: A Systematic Review of Fine-Tuning, Retrieval-Augmented Generation, and Hybrid Methods

  • Anshum Patel
  • Yugant Khand
  • Sai Krishna Vallamchetla
  • Pengze Li
  • Cui Tao
  • Joseph Y Cheung

ABSTRACT

Background:

Large language models (LLMs) are increasingly evaluated for clinical decision support, but their practical value depends on post-training adaptation rather than raw benchmark performance. Fine-tuning, retrieval-augmented generation (RAG), and hybrid approaches represent the principal strategies for improving clinical reliability and relevance, yet evidence remains fragmented across specialties, model architectures, and evaluation designs.

Objective:

To synthesize evidence on fine-tuning, RAG, and hybrid post-training strategies for improving LLM clinical performance across healthcare tasks.

Methods:

We searched PubMed/MEDLINE, EMBASE, and Scopus (January 2018 – January 2026). Eligible studies evaluated transformer-based LLMs with post-training adaptation or retrieval augmentation applied to clinical datasets and reported quantitative performance outcomes. Risk of bias was assessed using PROBAST+AI. Results were synthesized descriptively by enhancement strategy. This review was prospectively registered (PROSPERO: CRD420261308522).

Results:

Of 1,890 records identified, 35 studies (published 2024–2026) met inclusion criteria across 12 countries and multiple clinical domains, including oncology, radiology, neurology, and emergency medicine. Three enhancement strategies were identified: SFT/PEFT (n=7, 20%), RAG (n=17, 49%), and hybrid pipelines combining fine-tuning, retrieval, and structured prompting (n=11, 31%). Fine-tuning was most effective for narrow, labeled classification tasks. External AUROC reached 0.912 for cancer detection and 0.938 for hepatocellular carcinoma from cell-free DNA signatures; macro-sensitivity was 0.918 for acute infarct detection from radiology reports; and AUC was 0.892 for major depressive disorder on UK Biobank data. RAG produced the largest gains when corpora were authoritative and aligned with the clinical task. Incorporating the ESC acute coronary syndrome guideline raised accuracy from 71.1% to 92.1% (GPT-4o) and from 78.9% to 94.7% (DeepSeek R1). A trauma-radiology chatbot improved injury grading accuracy from 48% to 87%, and a guideline-grounded urology pipeline reached 95.5% concordance versus 62.3% among junior clinicians. RAG reduced performance in two studies with noisy or poorly structured corpora, and reasoning-class models showed limited incremental benefit from retrieval. Hybrid systems achieved the strongest results for complex tasks. A stroke pipeline fine-tuned using LoRA reached 99.0% internal and 95.5%/79.1% external accuracy; a federated multimodal dermatology system achieved 90.2% diagnostic accuracy across 11 lesion types; and a multimodal osteonecrosis pipeline reached 96.0% expert-rated accuracy. Structured prompting (persona, chain-of-thought, task decomposition) shifted accuracy by 5–15 percentage points across studies. Only 10 studies (29%) reported external validation, and 24 (69%) lacked formal safety moderation.
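
The strategy proportions and the RAG accuracy gains reported above can be re-derived with simple arithmetic. The following is a minimal Python sketch for readers who want to check the figures; it is not part of the authors' analysis, the variable names are illustrative, and the only inputs are the counts and accuracies stated in the Results. The urology figure is omitted because it compares the pipeline with junior clinicians rather than with an unaugmented model.

```python
# Study counts by enhancement strategy, as stated in the Results.
counts = {"SFT/PEFT": 7, "RAG": 17, "Hybrid": 11}
total = sum(counts.values())  # number of included studies

# Share of included studies per strategy, rounded to whole percentages.
shares = {name: round(100 * n / total) for name, n in counts.items()}

# (baseline accuracy %, accuracy with retrieval %) for the RAG examples
# that report a before/after pair in the Results.
rag_results = {
    "ESC ACS guideline, GPT-4o": (71.1, 92.1),
    "ESC ACS guideline, DeepSeek R1": (78.9, 94.7),
    "Trauma-radiology chatbot": (48.0, 87.0),
}

# Absolute gain in percentage points (not relative improvement).
gains = {
    name: round(after - before, 1)
    for name, (before, after) in rag_results.items()
}

if __name__ == "__main__":
    print(f"Included studies: {total}")
    for name, pct in shares.items():
        print(f"{name}: {counts[name]}/{total} = {pct}%")
    for name, gain in gains.items():
        print(f"{name}: +{gain} percentage points")
```

The gains are expressed as absolute differences in percentage points, which is the same unit the Results use for the 5–15 point shifts attributed to structured prompting.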

Conclusions:

Post-training adaptation and retrieval augmentation improved clinical LLM performance across diverse tasks. Strategy-task alignment, corpus quality, and prompt design were primary determinants of benefit. Evidence remains predominantly retrospective with limited external validation. Future studies should prioritize prospective, clinically embedded evaluations incorporating safety and fairness reporting.


 Citation

Please cite as:

Patel A, Khand Y, Vallamchetla SK, Li P, Tao C, Cheung JY

Augmenting Large Language Models for Clinical Decision-Making in Healthcare: A Systematic Review of Fine-Tuning, Retrieval-Augmented Generation, and Hybrid Methods

JMIR Preprints. 08/06/2026:104092

DOI: 10.2196/preprints.104092

URL: https://preprints.jmir.org/preprint/104092


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.