Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Aug 5, 2025
Date Accepted: Jan 22, 2026

The final, peer-reviewed published version of this preprint can be found here:

Improving the Understandability of Clinical Guidelines: Development and Evaluation of a GPT-4–Based Pipeline

Jones MD, Torgbi M, Tayyar Madabushi H

Improving the Understandability of Clinical Guidelines: Development and Evaluation of a GPT-4–Based Pipeline

J Med Internet Res 2026;28:e81915

DOI: 10.2196/81915

PMID: 41730207

PMCID: 12928683

Large Language Models to Improve the Understandability of Clinical Guidelines: an Evaluation of Readability Improvements and Unintended Content Changes Produced by GPT-4

  • Matthew D Jones
  • Melissa Torgbi
  • Harish Tayyar Madabushi

ABSTRACT

Background:

Difficulty finding and understanding information in clinical guidelines contributes to medication errors. Large language models (LLMs) can simplify complex text to aid understanding, but their use to improve the quality of guidelines has not been investigated. LLMs are also known to “hallucinate”, generating outputs that may not align with reality.

Objective:

To develop and evaluate an LLM pipeline to improve the readability of clinical guidelines while ensuring the preservation of critical content.

Methods:

The National Health Service Injectable Medicines Guide (IMG) was used as a case study, both to align LLM revisions with research evidence and to enable comparison with manual editing. A GPT-4-based pipeline was applied to IMG guidelines, with prompts based on recommendations for IMG authors derived from user testing. This enabled readability comparisons between several versions of IMG guidelines: the original, a version revised manually using the user-testing-derived recommendations, a version revised by GPT-4 using the same recommendations, and a fully user-tested version. Readability was evaluated using readability metrics and ratings from three expert pharmacists. Content similarity before and after LLM revision was assessed using BERT scores and expert pharmacist review.
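The two readability metrics reported in the Results, the SMOG grade and the Flesch-Kincaid reading grade, follow standard published formulas. The abstract does not describe how the authors computed them, so the following is only a minimal sketch under stated assumptions: the vowel-group syllable counter is a crude heuristic, the sentence splitter is naive, and the two example sentences are invented for illustration and are not taken from the IMG.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption): count runs of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def split_sentences(text: str) -> list[str]:
    """Naive sentence split on terminal punctuation."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def split_words(text: str) -> list[str]:
    """Alphabetic tokens only; numbers and symbols are ignored."""
    return re.findall(r"[A-Za-z]+", text)


def smog_grade(text: str) -> float:
    """SMOG grade: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291."""
    sentences = split_sentences(text)
    polysyllables = sum(1 for w in split_words(text) if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39 * words/sentence + 11.8 * syllables/word - 15.59."""
    sentences = split_sentences(text)
    words = split_words(text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Invented example sentences, not real guideline text.
original = "Administer the reconstituted solution intravenously over thirty minutes."
revised = "Give the mixed solution into a vein over thirty minutes."

for label, text in (("original", original), ("revised", revised)):
    print(label, round(smog_grade(text), 2), round(flesch_kincaid_grade(text), 2))
```

Lower grades indicate text that should be easier to read. Both formulas are designed for longer passages (SMOG was originally defined on 30-sentence samples), so scores for single sentences like these are illustrative only, and both functions raise a division error on empty input.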

Results:

Across 20 IMG guidelines used in practice, BERT scores indicated high semantic similarity between the original and LLM-revised guidelines (0.88 to 0.96). Of the 153 guideline sub-sections, at least one pharmacist identified an omission in 30 (20%), an addition in 7 (5%), and a change in meaning in 18 (12%). The SMOG grade showed a small but significant improvement in readability for the LLM-revised guidelines (mean difference 0.32, 95% CI 0.10-0.55, P=.02) and the manually revised versions (mean difference 0.46, 95% CI 0.13-0.79, P=.03), with no significant difference between the LLM-revised and manually revised versions (P>.99). There were no significant differences in Flesch-Kincaid reading grades (P=.91). Expert ratings favoured the LLM-revised versions for understandability. In two IMG guidelines from previous research, user testing produced a greater improvement in readability than LLM revision.

Conclusions:

Authors should not use current LLMs to modify clinical guidelines without carefully checking the revised text for unintended omissions, additions or changes of meaning. Further work should investigate the potential of LLMs to augment manual user testing and reduce the barriers to the wider use of this approach to improve the safety of clinical guidelines.

Clinical Trial: N/A




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.