Accepted for/Published in: JMIR AI

Date Submitted: Dec 2, 2025
Date Accepted: Jun 11, 2026

The final, peer-reviewed published version of this preprint can be found here:

Real-World Implementation of Large Language Models for Writing Clinical Discharge Summaries Within a Secure Data Environment: Development and Expert Evaluation Study

Carenzo C, Goldsmith K, Arribas M, Atkins B, Ko I, Chong HL, Raja A, Riad A, Lear R, Abdullahi Y, Glampson B, Orchard T, Mayer E

JMIR AI 2026;5:e88816

DOI: 10.2196/88816

PMID: 42398933

Real-world implementation of large language models for writing clinical discharge summaries within a secure data environment: development and expert evaluation

  • Catalina Carenzo
  • Kathleen Goldsmith
  • Maite Arribas
  • Benjamin Atkins
  • Ina Ko
  • Ho Lun Chong
  • Asmita Raja
  • Aya Riad
  • Rachael Lear
  • Yusuf Abdullahi
  • Ben Glampson
  • Tim Orchard
  • Erik Mayer

ABSTRACT

Background:

A discharge summary is a clinical report that documents a patient’s hospital stay, including investigation results, diagnoses, management, and follow-up. Currently, discharge summaries are written by clinicians who manually locate pertinent information across the electronic health record (EHR), around 80% of which is free text. This process is time-consuming and may be suitable for automation using large language models (LLMs).

Objective:

This study aimed to develop a template-based prompting system that can produce clinically acceptable discharge summaries, specifically the clinical summary section and the plan and requested actions section, from routinely collected electronic patient records.

Methods:

This study used EHR data from Imperial College Healthcare NHS Trust (ICHT), a network of five hospitals providing acute and specialist care in North-West London for over 1.3 million patients annually. The data were hosted in the Imperial Secure Data Environment (SDE), refreshed weekly since 1 April 2023. Ethical approval was granted by the UK Health Research Authority (REC: 21/SW/0120; IRAS project ID 282093) and the ICHT security and data protection offices, and data were accessed via the SDE under this approval. Fifty-two inpatient encounters were selected by the clinical team to ensure diversity in clinical specialty, reason for admission, complexity, length of stay, and sociodemographic characteristics. Forty-two cases were allocated to the development dataset and 9 to the test dataset, with 1 case excluded due to incomplete data. The system synthesised clinical notes relating to an inpatient hospital encounter and used structured template prompts with OpenAI’s GPT-4 to generate a discharge summary. The prompt was co-designed across 3 iterations. Resident doctors completed an evaluation form to assess the clinical acceptability of the generated summaries, covering the primary outcome (global confidence rating) and secondary outcomes (accuracy, completeness, readability, formatting, sociodemographic bias, and potential clinical harm). Sensitivity analyses assessed the effect of length of stay and admission type (emergency department vs other, surgical vs other) on the primary outcome. (See the supplementary material for further details.)
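As a rough illustration of what template-based prompting can look like, the sketch below fills one prompt per summary section from a list of clinical notes. The template wording, the `build_prompts` helper, and the example notes are all hypothetical; the abstract does not give the study's actual prompt, and no model call is made here.

```python
from string import Template

# Hypothetical template text: the study's actual prompt wording is not
# given in the abstract.
SECTION_TEMPLATE = Template(
    "You are drafting the '$section' section of a hospital discharge summary.\n"
    "Use only information found in the clinical notes below.\n"
    "If a detail is not documented, write 'Not documented'.\n\n"
    "Clinical notes:\n$notes\n"
)

# The two sections named in the abstract.
SECTIONS = ("Clinical summary", "Plan and requested actions")


def build_prompts(notes: list[str]) -> dict[str, str]:
    """Return one filled-in prompt per discharge summary section."""
    # Drop empty notes and join the rest with a visible separator.
    joined = "\n---\n".join(note.strip() for note in notes if note.strip())
    return {
        section: SECTION_TEMPLATE.substitute(section=section, notes=joined)
        for section in SECTIONS
    }


if __name__ == "__main__":
    # Made-up notes for demonstration only.
    example_notes = ["Day 1: admitted with chest pain.", "Day 2: troponin negative."]
    for section, prompt in build_prompts(example_notes).items():
        print(f"=== {section} ===\n{prompt}")
```

Keeping the template separate from the notes means the fixed instructions can be revised between design iterations without touching the code that gathers the notes.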

Results:

Fifty-two patients (61.5% female) were included, with a mean age of 44.8 years (standard deviation [SD] 27.1) and an average length of stay of 15.2 days (SD 21.1). In the test dataset, 88.9% of GPT-generated summaries received a positive global confidence rating (“yes” or “yes with minor changes”). Secondary outcomes were positive for the clinical summary section (89% complete and 78% accurate) and the plan and requested actions section (78% complete and 78% accurate). Readability, formatting, sociodemographic bias, and potential clinical harm also showed positive results in the test dataset. Sensitivity analyses showed no statistically significant variation in the primary outcome (global confidence rating) across length of stay or admission type.
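To give a sense of how much uncertainty a 9-case test set carries, the sketch below computes a Wilson score interval for a proportion. It assumes that 88.9% corresponds to 8 of 9 test cases, which is an inference from the reported figures. The Wilson interval is a swapped-in illustration and is not an analysis reported in the abstract.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Assumption: 88.9% of the 9 test cases corresponds to 8 positive ratings.
    low, high = wilson_interval(8, 9)
    print(f"8/9 = {8 / 9:.1%}, approx. 95% interval {low:.1%} to {high:.1%}")
```

With only 9 cases the interval is wide, which is in line with the call for a larger sample.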

Conclusions:

Our results demonstrate the feasibility of the pipeline, but rigorous statistical evaluation in a larger, adequately powered sample is needed.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.