Currently submitted to: Journal of Medical Internet Research
Date Submitted: Jan 1, 2026
Open Peer Review Period: Jan 1, 2026 - Feb 26, 2026
Warning: This is an unreviewed preprint. Readers are warned that the document has not been peer-reviewed by expert/patient reviewers or an academic editor, may contain misleading claims, and is likely to undergo changes before final publication, if accepted, or may have been rejected/withdrawn (a note "no longer under consideration" will appear above).
Peer review me: Readers with interest and expertise are encouraged to sign up as peer reviewers while the paper is within an open peer-review period (in this case, a "Peer Review Me" button to sign up as a reviewer is displayed above). All preprints currently open for review are listed here. Outside of the formal open peer-review period, we encourage you to tweet about the preprint.
Citation: Please cite this preprint only for review purposes or for grant applications and CVs (if you are the author).
Final version: If our system detects a final peer-reviewed "version of record" (VoR) published in any journal, a link to that VoR will appear below. Readers are then encouraged to cite the VoR instead of this preprint.
Settings: If you are the author, you can log in and change the preprint display settings, but the preprint URL/DOI is intended to be stable and citable, so it should not be removed once posted.
Submit: To post your own preprint, simply submit to any JMIR journal, and choose the appropriate settings to expose your submitted version as preprint.
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
A Comparative Evaluation of Fine-Tuning, Retrieval-Augmented Generation, and Hybrid Large Language Models for Postoperative Clinical Decision Support
ABSTRACT
Background:
Large language models (LLMs) have shown growing potential for clinical decision support. However, effectively integrating domain-specific medical knowledge into LLMs while maintaining accuracy, safety, and interpretability remains a key challenge for postoperative discharge instructions and patient education. Fine-tuning (FT), retrieval-augmented generation (RAG), and hybrid FT+RAG approaches represent three prominent strategies for knowledge integration, yet their comparative performance in postoperative clinical contexts has not been systematically evaluated.
Objective:
We aimed to compare the clinical performance, reliability, and safety characteristics of baseline, fine-tuned, retrieval-augmented, and hybrid FT+RAG LLM configurations for postoperative clinical decision support.
Methods:
We conducted a controlled comparative evaluation of four LLM configurations using Google Gemini 2.5 Flash. A total of 600 postoperative question–answer pairs were used for model adaptation and validation, while 150 queries were reserved for final evaluation. Queries included routine postoperative care questions, emergency escalation scenarios, and deliberately out-of-scope questions. Model outputs were independently assessed by three blinded clinical experts for accuracy, completeness, and relevance. Automated metrics were used to evaluate readability, faithfulness, and hallucination propensity.
Results:
All knowledge-enhanced models significantly outperformed the baseline model in clinical accuracy (baseline 68.0% vs FT 92.7%, RAG 91.3%, FT+RAG 97.3%; p<.001). The hybrid FT+RAG model achieved the highest overall performance, including 100% precision, 96.7% recall, and the lowest hallucination rate. FT and RAG alone yielded comparable gains across accuracy, completeness, relevance, faithfulness, and hallucination reduction, with no statistically significant differences between them. While enhanced models produced shorter and more concise responses, they demonstrated reduced readability compared with the baseline model.
Conclusions:
Incorporating domain knowledge substantially improves the clinical performance of LLMs for postoperative decision support. Hybrid FT+RAG approaches provide the strongest overall accuracy and safety profile, although trade-offs in readability, interpretability, and rater variability remain. These findings support the use of knowledge-augmented LLMs in postoperative care while underscoring the need for careful governance, transparency, and human oversight prior to clinical deployment.
Clinical Trial: Not applicable
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.