Accepted for/Published in: JMIR Formative Research
Date Submitted: Jul 18, 2025
Date Accepted: Sep 30, 2025
Date Submitted to PubMed: Oct 1, 2025
Impact of Detailed Versus Generic Instructions on Fine-Tuned Language Models for Patient Discharge Instructions Generation: Comparative Statistical Analysis
ABSTRACT
Background:
Discharge instructions are essential for patient post-hospital care, but are time-consuming to write. With the rise of large language models (LLMs), there is strong potential to automate this process. This study explores the use of open-source LLMs for generating discharge instructions.
Objective:
We investigated whether a Mistral model can reliably generate patient-oriented discharge instructions. Two distinct instruction-tuning paradigms were compared, each using a different mechanism for embedding guidance during fine-tuning.
Methods:
In our experiment, we fine-tuned Mistral-Nemo-Instruct, an open-source large language model, under two distinct instruction strategies. The first was a detailed instruction set tailored to the task of discharge instruction generation. The second was a generic instruction with minimal guidance and no task-specific detail. The independent variable in this study is the instruction strategy (detailed vs generic), while the dependent variables are the evaluation scores of the generated discharge instructions. The generated discharge instructions were evaluated against 3,621 ground-truth references. We used BLEU-1 through BLEU-4, METEOR, ROUGE (ROUGE-1, ROUGE-2, and ROUGE-L), SentenceTransformer similarity, and BERTScore as evaluation metrics to assess the quality of the generated outputs against the corresponding ground-truth instructions for the same discharge summaries.
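To illustrate the kind of overlap scoring reported below, here is a minimal, self-contained sketch of ROUGE-L: the F-measure over the longest common subsequence (LCS) of candidate and reference token sequences. This is a simplified illustration with whitespace tokenization and beta = 1; it is not the exact evaluation library used in the study.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-score between a generated text and one reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # fraction of candidate tokens in the LCS
    recall = lcs / len(ref)       # fraction of reference tokens in the LCS
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

In the study, such scores are averaged over all 3,621 generated/reference pairs; the n-gram metrics (BLEU, ROUGE-1/2) differ only in comparing fixed-length n-grams rather than an LCS.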
Results:
The tailored, detailed instruction model consistently outperformed the generic instruction model across all evaluation metrics. For example, ROUGE-L improved from 8.59% (Generic) to 26.52% (Detailed), and METEOR increased from 15.33% to 18.47%. BLEU-4 rose from 0.81% to 21.24%, while SentenceTransformer similarity improved from 11.91% to 74.90%. BERTScore increased from 78.92% with the Generic model to 87.05% with the Detailed model (P < .001).
Conclusions:
The use of detailed, task-specific instruction strategies significantly enhances the effectiveness of open-source large language models in generating discharge instructions. These findings indicate that carefully designed instructions during fine-tuning substantially improve model performance. Clinical Trial: No
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.