Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Aug 26, 2025
Date Accepted: Jan 29, 2026

The final, peer-reviewed published version of this preprint can be found here:

Improving Retrieval Augmented Generation for Health Care by Fine-Tuning Clinical Embedding Models: Development and Evaluation Study

Arzideh K, Schäfer H, Idrissi-Yaghir A, Schmidt CS, Eryilmaz B, Bahn M, Turki AT, Pollok OB, Hartmann EM, Winnekens P, Borys K, Haubold J, Nensa F, Hosch R

J Med Internet Res 2026;28:e82997

DOI: 10.2196/82997

PMID: 41880603

Improving Retrieval Augmented Generation for Healthcare by Fine-tuning Clinical Embedding Models: Development and Evaluation Study

  • Kamyar Arzideh
  • Henning Schäfer
  • Ahmad Idrissi-Yaghir
  • Cynthia Sabrina Schmidt
  • Bahadir Eryilmaz
  • Mikel Bahn
  • Amin T. Turki
  • Olivia Barbara Pollok
  • Eva Maria Hartmann
  • Philipp Winnekens
  • Katarzyna Borys
  • Johannes Haubold
  • Felix Nensa
  • René Hosch

ABSTRACT

Background:

Embedding models can be integrated into Retrieval Augmented Generation systems to search and retrieve unstructured data. These models are largely trained on publicly available English data, which limits their effectiveness in non-English healthcare settings. More importantly, they are not trained on real-world clinical data, which leads to inaccurate results when they are integrated into Retrieval Augmented Generation systems for healthcare use cases.

Objective:

This retrospective study addresses this gap by developing embedding models specifically trained on real-world clinical documents for medical information retrieval.

Methods:

Embedding models were fine-tuned using eleven million question-answer pairs generated from 400,000 clinical documents from a large German hospital, including radiology reports, discharge letters, pathology reports, and operation notes. A Large Language Model generated medically relevant questions for each document section, yielding training data intended to reflect real-world clinical scenarios. Furthermore, all datasets were translated into English and pseudonymized so that the resulting models could be published for use by other healthcare institutions. Evaluation was performed in two scenarios: information retrieval and Retrieval Augmented Generation.
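As a rough illustration of the information retrieval scenario, the sketch below ranks passages by the cosine similarity between a question embedding and each passage embedding. The three-dimensional toy vectors, the function names `cosine` and `retrieve`, and the use of cosine similarity are all assumptions made for this example; the abstract does not state which similarity function or index the study used.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, passage_vecs, k=2):
    """Return the indices of the k passages most similar to the query."""
    scored = [(cosine(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical toy embeddings; a real system would obtain these
# from an embedding model applied to report sections and a question.
passages = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query = [0.9, 0.1, 0.0]

top_passages = retrieve(query, passages, k=2)
print(top_passages)
```

In a Retrieval Augmented Generation setting, the passages selected this way would be passed to a language model as context for answering the question.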

Results:

The fine-tuned models demonstrated superior performance on real-world German and translated English evaluation datasets, surpassing the state-of-the-art multilingual-e5-large, bge-m3, and gte-multilingual-base models in both evaluation scenarios. For the information retrieval evaluation, the fine-tuned model achieved a mAP@100 of 0.268 compared to the next best model, multilingual-e5-large, which reached a mAP@100 of 0.135. For the Retrieval Augmented Generation evaluation, the fine-tuned model showed a BERTScore F1-score of 0.769 compared to 0.756.
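For readers unfamiliar with the retrieval metric, the following minimal sketch shows one common way to compute mean average precision at a cutoff k (mAP@k, with k = 100 in the results above). The normalization by min(k, number of relevant documents) is one of several conventions and is an assumption here, as are the function names and the toy document identifiers; the abstract does not specify which variant was used.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=100):
    """Average precision over the top-k ranked results for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant))

def mean_average_precision_at_k(runs, k=100):
    """Mean of the per-query average precision values."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Hypothetical toy data: (ranked result ids, set of relevant ids) per query.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d4", "d5"], {"d5"}),
]

map_score = mean_average_precision_at_k(runs, k=100)
print(round(map_score, 4))
```

A higher value means that relevant passages tend to appear closer to the top of the ranked list, averaged over all evaluation questions.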

Conclusions:

By using a real-world dataset consisting of reports from different medical specialties and incorporating a Large Language Model to generate questions based on these reports, a large training dataset was created and used to fine-tune an embedding model. This model surpassed the performance of state-of-the-art models and holds promise for improving Retrieval Augmented Generation in the healthcare domain.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.