Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Sep 19, 2025
Open Peer Review Period: Sep 19, 2025 - Nov 14, 2025
Date Accepted: Feb 27, 2026
Large Language Model-Generated Patient Instructions for Prescriptions in Primary Health Care: A Preclinical Evaluation
ABSTRACT
Objective:
We evaluated the performance of Large Language Models (LLMs) in generating medication usage instructions to complement prescriptions in Primary Health Care.
Methods:
This randomized, blinded experimental study used prescription-inducing scenarios, assigned to 62 healthcare professionals, to validate instructions generated by LLMs during e-prescribing. The instructions were generated by three models: ChatGPT-4.0, Llama3.1-8B, and Llama3.1-8B-RAG, the last of which used Retrieval-Augmented Generation (RAG) based on patient information leaflets. Performance metrics assessed Adequacy, Completeness, Clarity, Personalization, Usefulness, and errors in the generated instructions. Scores were used to analyse overall and individual metrics, based on all evaluations (n=198) and on the consensus among evaluators for each test (n=46).
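To make the RAG step concrete, the following is a minimal sketch of how leaflet excerpts might be retrieved and placed into a prompt before generation. It is not the study's pipeline: the abstract does not describe the retrieval method, so the token-overlap scoring, the function names (tokenize, overlap_score, retrieve, build_prompt), the prompt wording, the placeholder drug names, and the leaflet excerpts are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, passage):
    """Count the tokens shared by query and passage (multiset intersection)."""
    shared = Counter(tokenize(query)) & Counter(tokenize(passage))
    return sum(shared.values())


def retrieve(query, passages, k=2):
    """Return the k passages with the highest overlap score.

    sorted() is stable, so passages with equal scores keep their input order.
    """
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(prescription, passages, k=2):
    """Assemble a prompt that grounds the model in the retrieved excerpts."""
    context = "\n".join(f"- {p}" for p in retrieve(prescription, passages, k))
    return (
        "Using only the leaflet excerpts below, write clear instructions "
        "for the patient.\n"
        f"Leaflet excerpts:\n{context}\n"
        f"Prescription: {prescription}\n"
    )


# Hypothetical excerpts with placeholder drug names, invented for illustration.
LEAFLET_PASSAGES = [
    "Swallow drugx capsules whole with water.",
    "Store drugy tablets away from heat.",
    "Take drugx every 8 hours, with or without food.",
]

# Hypothetical prescription text.
PRESCRIPTION = "drugx 500 mg every 8 hours for 7 days"

prompt = build_prompt(PRESCRIPTION, LEAFLET_PASSAGES)
print(prompt)
```

In a real deployment the overlap score would typically be replaced by an embedding-based or other dedicated retriever, and the resulting prompt would be sent to the model; both choices are outside what the abstract specifies.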
Results:
The three models yielded similar scores for producing qualified instructions, by consensus among evaluators (n=46 tests), with median (IQR) values of: ChatGPT-4.0: 89.3 (12.5), Llama3.1-8B: 79.5 (46.1), and Llama3.1-8B-RAG: 85.7 (21.9), P=.282. RAG rendered the Llama3.1-8B model equivalent to ChatGPT-4.0 regarding Adequacy, Completeness, Clarity, and Usefulness, and the RAG-enhanced model presented fewer errors in the generated instructions: ChatGPT-4.0 (n=5), Llama3.1-8B (n=11), and Llama3.1-8B-RAG (n=4), P=.040. Concerning specific criteria across all 198 evaluations, Llama3.1-8B-RAG received scores equivalent to those of ChatGPT-4.0 in Adequacy, with mean (SD) 6.24 (2.3) and 6.82 (2.1), respectively, P=.536; Completeness, with mean (SD) 5.94 (2.2) and 6.55 (1.8), respectively, P=.376; Clarity, with mean (SD) 5.77 (2.4) and 6.68 (1.9), respectively, P=.086; and Usefulness, with mean (SD) 5.42 (2.4) and 5.96 (2.2), respectively, P=.627. ChatGPT-4.0 received higher scores in the Personalization criterion, with mean (SD) 7.05 (1.5) compared with 5.44 (2.6) for Llama3.1-8B-RAG, P<.001.
Conclusions:
The open-source LLM enhanced with external information presented performance similar to that of the closed-source model. LLM generation demonstrated potential for instructing patients on medication use. Nonetheless, introducing this innovation into the e-prescribing workflow demands prescriber validation and governance of LLM performance.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.