Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Aug 5, 2025
Date Accepted: Jan 22, 2026
Large Language Models to Improve the Understandability of Clinical Guidelines: An Evaluation of Readability Improvements and Unintended Content Changes Produced by GPT-4
ABSTRACT
Background:
Difficulty finding and understanding information in clinical guidelines contributes to medication errors. Large language models (LLMs) can simplify complex text to aid understanding, but this approach to improving the quality of guidelines has not been investigated. LLMs are also known to "hallucinate," generating outputs that may not align with reality.
Objective:
To develop and evaluate an LLM pipeline to improve the readability of clinical guidelines while ensuring the preservation of critical content.
Methods:
To align LLM revisions with research evidence and enable comparison with manual editing, the National Health Service Injectable Medicines Guide (IMG) was used as a case study. A GPT-4-based pipeline was applied to the IMG, with prompts based on user-testing-derived recommendations for IMG authors. This enabled readability comparisons among IMG guideline versions: original, manually revised using the user-testing-derived recommendations, GPT-4-revised using the same recommendations, and fully user tested. Readability was evaluated using readability metrics and ratings from three expert pharmacists. Content similarity before and after LLM revision was assessed using BERT scores and expert pharmacist review.
Results:
For the 20 IMG guidelines used in practice, BERT scores indicated high semantic similarity between the original and LLM-revised guidelines (0.88 to 0.96). At least one pharmacist identified an omission in 30 (20%), an addition in 7 (5%), and a change in meaning in 18 (12%) of the 153 guideline sub-sections. The SMOG grade showed a small but significant improvement in readability for the LLM-revised guidelines (mean difference 0.32, 95% CI 0.10-0.55, P=.02) and the manually revised versions (mean difference 0.46, 95% CI 0.13-0.79, P=.03). There was no significant difference between the LLM-revised and manually revised versions (P>.99), and no significant differences between versions in Flesch-Kincaid reading grade (P=.91). Expert ratings favoured the LLM-revised versions for understandability. For two IMG guidelines from previous research, user testing produced a greater improvement in readability than LLM revision.
Conclusions:
Authors should not use current LLMs to modify clinical guidelines without carefully checking the revised text for unintended omissions, additions, or changes of meaning. Further work should investigate the potential of LLMs to augment manual user testing and reduce the barriers to the wider use of this approach to improve the safety of clinical guidelines. Clinical Trial: N/A
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.