Real-world implementation of large language models for writing clinical discharge summaries within a secure data environment: development and expert evaluation
ABSTRACT
Background:
A discharge summary is a clinical report that documents a patient’s hospital stay, including investigation results, diagnoses, management, and follow-up. Currently, discharge summaries are written by clinicians who manually locate pertinent information across the electronic health record (EHR), of which around 80% is free text. This process is time-consuming and may be suitable for automation using large language models (LLMs).
Objective:
This study aimed to develop a template-based prompting system that can produce clinically acceptable discharge summaries, specifically the “clinical summary” and “plan and requested actions” sections, from routinely collected electronic patient records.
Methods:
This study used EHR data from Imperial College Healthcare NHS Trust (ICHT), a network of five hospitals providing acute and specialist care in North-West London for over 1.3 million patients annually. The data were hosted in the Imperial Secure Data Environment (SDE), which has been refreshed weekly since 1 April 2023. Ethical approval was granted by the UK Health Research Authority (REC: 21/SW/0120; IRAS project ID 282093) and the ICHT security and data protection offices. Data were accessed via the SDE under this approval. Fifty-two inpatient encounters were selected by the clinical team to ensure diversity in clinical specialty, reason for admission, complexity, length of stay, and sociodemographic characteristics. Forty-two cases were allocated to the development dataset and nine to the test dataset; one case was excluded due to incomplete data. The system synthesised clinical notes relating to an inpatient hospital encounter and used structured template prompts with OpenAI’s GPT-4 to generate a discharge summary. The prompt was co-designed across three iterations. Resident doctors completed an evaluation form to assess the clinical acceptability of generated summaries, including the primary outcome (global confidence rating) and secondary outcomes (accuracy, completeness, readability, formatting, sociodemographic bias, and potential clinical harm). Sensitivity analyses assessed the effect of length of stay and admission type (emergency department vs other, surgical vs other) on the primary outcome. (See the supplementary material for further details.)
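The template-based prompting step described above can be illustrated with a minimal sketch. This is not the study's actual prompt or code: the template wording, the `ClinicalNote` fields, and the `build_prompt` helper are all hypothetical, and the example notes are synthetic. The call to the language model is deliberately left out, since the abstract does not describe how GPT-4 was invoked inside the SDE.

```python
from dataclasses import dataclass
from datetime import datetime
from string import Template

# Hypothetical template wording; the prompts co-designed in the study
# are not reproduced in this abstract.
SECTION_TEMPLATE = Template(
    "You are drafting the '$section' section of a hospital discharge summary.\n"
    "Use only the information in the notes below; do not infer missing facts.\n\n"
    "Clinical notes (oldest first):\n$notes\n\n"
    "Write the '$section' section:"
)


@dataclass(frozen=True)
class ClinicalNote:
    """One free-text note from an inpatient encounter (illustrative fields)."""
    recorded_at: datetime
    author_role: str
    text: str


def build_prompt(notes, section):
    """Order the notes chronologically and fill the section template."""
    ordered = sorted(notes, key=lambda note: note.recorded_at)
    rendered = "\n".join(
        f"[{note.recorded_at:%Y-%m-%d %H:%M}] ({note.author_role}) {note.text.strip()}"
        for note in ordered
    )
    return SECTION_TEMPLATE.substitute(section=section, notes=rendered)


# Synthetic notes, deliberately supplied out of order.
example_notes = [
    ClinicalNote(datetime(2024, 1, 3, 9, 0), "doctor", "Afebrile. Plan: discharge today."),
    ClinicalNote(datetime(2024, 1, 1, 14, 30), "nurse", "Admitted with fever and cough."),
]
example_prompt = build_prompt(example_notes, "clinical summary")
print(example_prompt)
```

Keeping the template separate from the note-rendering logic means the wording can be revised between design iterations without touching the code that assembles the encounter's notes.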
Results:
Fifty-two patients (61.5% female) were included, with a mean age of 44.8 years (standard deviation, SD=27.1) and a mean length of stay of 15.2 days (SD=21.1). In the test dataset, 88.9% of GPT-generated summaries received a positive global confidence rating (“yes” or “yes with minor changes”). Secondary outcomes were positive for the clinical summary section (89% complete and 78% accurate) and the plan and requested actions section (78% complete and 78% accurate). Readability, formatting, sociodemographic bias, and potential clinical harm also showed positive results in the test dataset. Sensitivity analyses showed no statistically significant variation in the primary outcome (global confidence rating) across length of stay or admission type.
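To give a sense of how much uncertainty a proportion from nine test cases carries, the sketch below computes a 95% Wilson score interval. It assumes that 88.9% of nine summaries corresponds to 8 of 9 rated positively. This is an illustrative calculation, not an analysis reported by the authors, and the choice of the Wilson method is an assumption made here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Assumed counts: 8 positive global confidence ratings out of 9 test cases.
low, high = wilson_interval(8, 9)
print(f"95% Wilson interval for 8/9: {low:.3f} to {high:.3f}")
```

With so few cases the interval is wide, so the point estimate alone says little about performance in a larger sample.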
Conclusions:
Our results demonstrate the feasibility of the pipeline, but rigorous statistical evaluation in a larger, adequately powered sample is needed.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.