Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Sep 15, 2024
Date Accepted: Jun 16, 2025

The final, peer-reviewed published version of this preprint can be found here:

Improving Large Language Models’ Summarization Accuracy by Adding Highlights to Discharge Notes: Comparative Evaluation

Koohi Habibi Dehkordi M, Perl Y, Deek FP, He Z, Keloth VK, Liu H, Elhanan G, Einstein AJ

Improving Large Language Models’ Summarization Accuracy by Adding Highlights to Discharge Notes: Comparative Evaluation

JMIR Med Inform 2025;13:e66476

DOI: 10.2196/66476

PMID: 40705416

PMCID: 12332456

Improving Large Language Models' Summarization by Highlighting Discharge Notes: A Comparative Evaluation

  • Mahshad Koohi Habibi Dehkordi; 
  • Yehoshua Perl; 
  • Fadi P Deek; 
  • Zhe He; 
  • Vipina K Keloth; 
  • Hao Liu; 
  • Gai Elhanan; 
  • Andrew J Einstein

ABSTRACT

Background:

The American Medical Association recommends that electronic health record (EHR) notes, which are often dense and written in nuanced clinical language, be made readable for patients and laypeople, a practice we refer to as the simplification of EHR notes. Our approach simplifies EHR notes through a series of incremental steps; in this paper we present the first step of that process. Large language models (LLMs) have demonstrated considerable success in text summarization, and LLM-generated summaries can re-present the content of EHR notes in easier-to-read language. However, such summaries can also introduce inaccuracies.

Objective:

Our objective is to obtain more accurate summaries of EHR notes. To this end, we aim to test the hypothesis that LLM-generated summaries of highlighted EHR notes are more accurate than such summaries of the original, unhighlighted notes.

Methods:

To test our hypothesis, we randomly sampled 15 EHR notes from the MIMIC-III database and highlighted them. Highlighting is performed automatically using an interface technology we previously designed with machine learning techniques. To calibrate the LLM summaries for our simplification goal, we chose GPT-4o and used prompt engineering to ensure high-quality prompts and to address output inconsistency and prompt sensitivity. We provided both the highlighted and unhighlighted versions of each EHR note, along with their corresponding prompts, to GPT-4o. Each generated summary was manually evaluated for quality using three metrics: completeness, correctness, and structural integrity.
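The comparative setup described above can be sketched as follows. This is a minimal, hypothetical illustration only: the prompt wording, function names, and the `**...**` highlight markers are assumptions for the sketch, not the authors' actual prompts or highlighting format.

```python
# Hypothetical sketch of the comparative setup: the same summarization
# instruction is paired with either the original (unhighlighted) or the
# highlighted version of a discharge note, and each is sent to GPT-4o.

def build_prompt(note_text: str) -> str:
    """Combine a fixed summarization instruction with one EHR note.

    The instruction text here is illustrative, not the study's prompt.
    """
    instruction = (
        "Summarize the following discharge note for a layperson. "
        "Organize the summary under clear section headers and do not "
        "add information that is not present in the note."
    )
    return f"{instruction}\n\nNOTE:\n{note_text}"

def summarize(note_text: str, model: str = "gpt-4o") -> str:
    """Send one prompt to the model and return the generated summary."""
    from openai import OpenAI  # requires the `openai` package and an API key
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(note_text)}],
    )
    return response.choices[0].message.content

# Toy note pair; **...** marks highlighted spans in this sketch.
original = "Pt admitted with CHF exacerbation. Furosemide 40 mg IV given."
highlighted = "Pt admitted with **CHF exacerbation**. **Furosemide 40 mg IV** given."

u_prompt = build_prompt(original)     # input for the U-summary
h_prompt = build_prompt(highlighted)  # input for the H-summary
```

In the study, each resulting H-summary/U-summary pair was then scored manually on completeness, correctness, and structural integrity.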

Results:

On average, summaries of highlighted notes (H-summaries) achieved 96% completeness, 8% higher than summaries of unhighlighted notes (U-summaries). H-summaries also demonstrated better correctness, with fewer instances of erroneous information, and contained fewer structural errors, such as improper headers and misplaced information. These findings support the hypothesis that summarizing highlighted EHR notes improves accuracy.

Conclusions:

Feeding highlighted EHR notes to the LLM, combined with prompt engineering, generates higher-quality summaries in terms of correctness, completeness, and structural integrity compared with unhighlighted EHR notes. The summaries generated with this approach will later be used to further simplify EHR notes for patients and laypeople, as recommended by the NIH.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.