Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Mar 27, 2025
Open Peer Review Period: Mar 27, 2025 - Apr 11, 2025
Date Accepted: May 12, 2025
Predicting 30-Day Postoperative Mortality and American Society of Anesthesiologists Physical Status Using Retrieval-Augmented Large Language Models: Development and Validation Study
ABSTRACT
Background:
Accurately assessing perioperative risk is critical for informed surgical planning and patient safety. However, current prediction models often rely solely on structured data and overlook the nuanced clinical reasoning embedded in free-text preoperative notes. Recent advances in large language models (LLMs) have opened new opportunities for harnessing unstructured clinical data, yet their application in perioperative prediction remains limited by concerns about factual accuracy. Retrieval-augmented generation (RAG) offers a promising solution—enhancing LLM performance by grounding outputs in authoritative medical sources, potentially improving both predictive accuracy and clinical interpretability.
Objective:
This study aimed to investigate whether integrating LLMs with RAG can improve the prediction of 30-day postoperative mortality and American Society of Anesthesiologists physical status classification using unstructured preoperative clinical notes.
Methods:
We conducted a retrospective cohort study using 24,491 medical records from a tertiary medical center, including preoperative anesthesia assessments, discharge summaries, and surgical information. To extract clinical insights from free-text data, we employed the LLaMA 3.1-8B language model with RAG, using MedEmbed for text embedding and Miller’s Anesthesia as the primary retrieval source. We systematically evaluated model performance under various configurations (embedding models, chunk sizes, and few-shot prompting), using weighted area under the precision-recall curve (AUPRC) for mortality prediction and micro F1 score for American Society of Anesthesiologists (ASA) physical status classification.
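The retrieval step described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for an embedding model such as MedEmbed, the reference text is placeholder wording rather than textbook content, and the function names, word-based chunking, and prompt wording are all assumptions made for illustration.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(w.strip(".,;:()").lower() for w in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(note, passages):
    """Assemble a prompt that grounds the model in the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Reference passages:\n{context}\n\n"
            f"Preoperative note:\n{note}\n\n"
            "Assign an ASA physical status class.")


# Placeholder reference text (hypothetical, not quoted from any source).
reference = ("placeholder passage about cardiac disease and risk "
             "placeholder passage about fasting before surgery")
chunks = chunk_text(reference, chunk_size=7)
note = "History of cardiac disease."
prompt = build_prompt(note, retrieve(note, chunks, k=1))
```

Varying `chunk_size` and swapping the `embed` function correspond to the chunk-size and embedding-model configurations that the study reports comparing.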
Results:
The LLaMA-RAG model consistently outperformed traditional machine learning baselines. For 30-day postoperative mortality, it achieved the highest area under the receiver operating characteristic curve (AUROC) of 0.9570 (95% CI 0.9543–0.9597) and the highest AUPRC of 0.6536 (95% CI 0.6479–0.6593). For ASA classification, it attained the highest micro F1 score of 0.8409 (95% CI 0.8238–0.8551). Notably, the model demonstrated exceptional sensitivity in identifying rare but high-risk cases, such as ASA Class 5 patients and postoperative deaths.
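For readers unfamiliar with the reported metric, the following sketch shows how a micro F1 score and a 95% confidence interval could be computed. The abstract does not state how the intervals were derived; the percentile bootstrap used here is an assumption, and the labels are toy values, not study data.

```python
import random


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all classes."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1
            elif t == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumption: not necessarily the study's method)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        # Resample cases with replacement and rescore.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(micro_f1([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_boot)]
    upper = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy ASA-style labels for illustration only.
y_true = [1, 2, 3, 3, 2]
y_pred = [1, 2, 3, 2, 2]
score = micro_f1(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred)
```

When each case receives exactly one label, micro F1 equals overall accuracy, so it is dominated by the common classes; per-class sensitivity is the more informative measure for rare categories such as ASA Class 5.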
Conclusions:
The LLaMA-RAG model significantly improved prediction of postoperative mortality and ASA classification, especially for rare high-risk cases. By grounding outputs in domain knowledge, retrieval-augmented generation enhanced both accuracy and interpretability.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.