Identifying Deprescribing Opportunities with Large Language Models in Older Adults: Retrospective Cohort Study
ABSTRACT
Background:
Polypharmacy, the concurrent use of multiple medications, is prevalent among older adults and is associated with an increased risk of adverse drug events, including falls. Deprescribing, the systematic process of discontinuing potentially inappropriate medications (PIMs), aims to mitigate these risks. However, practical application of deprescribing criteria in emergency settings remains limited due to time constraints and the complexity of the criteria.
Objective:
This study evaluates the performance of a large language model (LLM)-based pipeline in identifying deprescribing opportunities for older emergency department (ED) patients with polypharmacy, utilizing 3 different sets of criteria: Beers, Screening Tool of Older People’s Prescriptions (STOPP), and GEMS-Rx. It further evaluates LLM confidence calibration and its ability to improve recommendation performance.
Methods:
We conducted a retrospective cohort study of older adults presenting to the ED of a large academic medical center in the northeastern United States from January to March 2022. A convenience sample of 100 patients (712 total oral medications) was randomly selected for detailed analysis. The LLM pipeline consisted of two steps: (1) filtering for high-yield deprescribing criteria based on patients' medication lists, and (2) applying these criteria to both structured and unstructured patient data to recommend deprescribing. Model performance was assessed by comparing model recommendations with those of trained medical students, with discrepancies adjudicated by board-certified ED physicians. Selective prediction, a method that allows a model to abstain from low-confidence predictions to improve overall reliability, was applied to assess the model's confidence and decision-making thresholds.
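The selective prediction approach described above can be illustrated with a minimal sketch: predictions below a confidence threshold are withheld, and performance is then reported as coverage (the fraction of cases answered) and selective accuracy on the answered cases. All names and values here are illustrative assumptions, not the study's actual pipeline or data.

```python
# Minimal sketch of selective prediction via confidence thresholding.
# `predictions`, `confidences`, `labels`, and `threshold` are hypothetical
# names for illustration only.

def selective_predict(predictions, confidences, labels, threshold=0.8):
    """Abstain on predictions below the confidence threshold; return
    coverage (fraction answered) and accuracy on answered cases."""
    answered = [(p, y) for p, c, y in zip(predictions, confidences, labels)
                if c >= threshold]
    coverage = len(answered) / len(predictions)
    accuracy = (sum(p == y for p, y in answered) / len(answered)
                if answered else 0.0)
    return coverage, accuracy

# Toy example: with well-calibrated confidences, accuracy on the retained
# predictions should rise as coverage falls.
preds = [1, 0, 1, 1, 0]
confs = [0.95, 0.6, 0.9, 0.55, 0.85]
labels = [1, 0, 0, 1, 0]
cov, acc = selective_predict(preds, confs, labels, threshold=0.8)
```

Poor calibration, as observed in this study, shows up in exactly this computation: raising the threshold lowers coverage without a corresponding gain in selective accuracy.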
Results:
The LLM achieved high accuracy in identifying deprescribing criteria (PPV: 0.83; NPV: 0.93) relative to medical students, but showed limitations in making specific deprescribing recommendations (PPV: 0.47; NPV: 0.93). Adjudication revealed that while the model excelled at identifying when a deprescribing criterion related to one of the patient's medications, it often struggled to determine whether that criterion applied to the specific case due to complex inclusion/exclusion criteria (54.5% of errors) and ambiguous clinical contexts (e.g., missing information; 39.3% of errors). Selective prediction only marginally improved LLM performance due to poorly calibrated confidence estimates.
Conclusions:
This study highlights the potential of LLMs to support deprescribing decisions in the ED by effectively filtering relevant criteria. However, challenges remain in applying these criteria to complex clinical scenarios, as the LLM demonstrated poor performance on more intricate decision-making tasks, with its reported confidence often failing to align with its actual success in these cases. The findings underscore the need for clearer deprescribing guidelines, improved LLM calibration for real-world use, and better integration of human-AI workflows to balance AI recommendations with clinician judgment.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.