Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Aug 26, 2025
Date Accepted: Jan 29, 2026
Improving Retrieval Augmented Generation for Healthcare by Fine-tuning Clinical Embedding Models: Development and Evaluation Study
ABSTRACT
Background:
Embedding models can be integrated into Retrieval Augmented Generation systems to search and retrieve unstructured data. However, these models are typically trained on publicly available English data, which limits their effectiveness in non-English healthcare settings. More importantly, they are not trained on real-world clinical data, leading to inaccurate retrieval when integrated into Retrieval Augmented Generation systems for healthcare use cases.
Objective:
This retrospective study addresses this gap by developing embedding models specifically trained on real-world clinical documents for medical information retrieval.
Methods:
Embedding models were fine-tuned using eleven million question-answer pairs generated from 400,000 clinical documents from a large German hospital, including radiology reports, discharge letters, pathology reports, and operation notes. All datasets were additionally translated into English and pseudonymized so that the resulting models could be published for use by other healthcare institutions. A Large Language Model generated medically relevant questions for each document section, creating training data intended to reflect real-world clinical scenarios. Evaluation was performed in two scenarios: information retrieval and Retrieval Augmented Generation.
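In the information retrieval scenario described above, an embedding model maps a clinician's question and each document section to vectors, and sections are ranked by cosine similarity to the query. A minimal sketch of that ranking step, assuming embeddings are plain Python lists and using hypothetical section identifiers (the actual models and document structure are as described in the study, not shown here):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query_vec, section_vecs, top_k=3):
    """Rank document sections by similarity to the query embedding
    and return the IDs of the top-k sections."""
    ranked = sorted(section_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [section_id for section_id, _ in ranked[:top_k]]

# Toy 2-dimensional embeddings for three document sections (illustrative only).
sections = {"s1": [1.0, 0.0], "s2": [0.0, 1.0], "s3": [0.7, 0.7]}
print(retrieve([1.0, 0.1], sections))  # → ['s1', 's3', 's2']
```

In a Retrieval Augmented Generation pipeline, the top-ranked sections would then be passed as context to a language model answering the question.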
Results:
The fine-tuned models demonstrated superior performance on real-world German and translated English evaluation datasets, surpassing the state-of-the-art multilingual-e5-large, bge-m3, and gte-multilingual-base models in both evaluation scenarios. In the information retrieval evaluation, the fine-tuned model achieved a mAP@100 of 0.268, compared to 0.135 for the next best model, multilingual-e5-large. In the Retrieval Augmented Generation evaluation, the fine-tuned model achieved a BERTScore F1-score of 0.769 compared to 0.756.
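The mAP@100 figures above average, over all evaluation queries, the precision at each rank (up to 100) where a relevant document appears. A minimal pure-Python sketch of this standard metric (function names are illustrative, not from the study's code):

```python
def average_precision_at_k(retrieved, relevant, k=100):
    """Average precision over the top-k retrieved document IDs.

    retrieved: ranked list of document IDs returned for one query.
    relevant:  set of document IDs judged relevant for that query.
    """
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank  # precision at this relevant rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def mean_average_precision_at_k(all_retrieved, all_relevant, k=100):
    """mAP@k: mean of per-query average precision values."""
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(all_retrieved, all_relevant)]
    return sum(aps) / len(aps)

# Two toy queries: AP = (1/1 + 2/3)/2 and AP = 1/2, so mAP = 2/3.
print(mean_average_precision_at_k(
    [["a", "b", "c"], ["x", "y"]],
    [{"a", "c"}, {"y"}]))  # → 0.666...
```

The BERTScore F1 reported for the Retrieval Augmented Generation evaluation is a separate, model-based metric comparing generated answers to references and is not reproduced here.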
Conclusions:
By using a real-world dataset consisting of reports from different medical specialties and incorporating a Large Language Model to generate questions based on these reports, a large training dataset was created and used to fine-tune an embedding model. This model surpassed the performance of state-of-the-art models and holds promise for improving Retrieval Augmented Generation in the healthcare domain.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.