Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Sep 9, 2024
Open Peer Review Period: Sep 10, 2024 - Nov 5, 2024
Date Accepted: Jan 31, 2025

The final, peer-reviewed published version of this preprint can be found here:

Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study

Šuvalov H, Lepson M, Kukk V, Malk M, Kuulmets HA, Kolde R

Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study

J Med Internet Res 2025;27:e66279

DOI: 10.2196/66279

PMID: 40101227

PMCID: 11962312

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Using synthetic healthcare data to leverage LLMs for named entity recognition: Method

  • Hendrik Šuvalov
  • Mihkel Lepson
  • Veronika Kukk
  • Maria Malk
  • Hele-Andra Kuulmets
  • Raivo Kolde

ABSTRACT

Background:

Named entity recognition (NER) is critical for extracting medical entities from healthcare texts, enabling key applications in clinical decision support and data mining. However, developing NER models for low-resource languages like Estonian is challenging because both annotated data and pre-trained models are scarce. Large language models (LLMs) have shown promise in understanding text across languages and domains.

Objective:

This paper aims to address the challenge of developing high-quality medical NER models for low-resource languages like Estonian by leveraging synthetic Estonian healthcare data annotated with LLMs. The focus is on training an effective NER model on synthetic data for downstream tasks and then applying it to real-world, highly sensitive medical data.

Methods:

To tackle the scarcity of annotated data in Estonian healthcare texts, we employ a novel three-step approach. First, synthetic Estonian healthcare data are generated using a locally trained model. Second, the data are annotated using LLMs. Finally, the annotated synthetic data are used to fine-tune an NER model. This paper compares the performance of different prompts, assesses the impact of using GPT-3.5-Turbo, GPT-4, and a local LLM, and explores the relationship between the amount of annotated synthetic data and model performance.
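The hand-off between the second and third steps can be illustrated with a minimal sketch. It assumes, hypothetically, that the LLM annotator returns a list of (surface string, label) pairs for each synthetic text and that the NER model is fine-tuned on token-level BIO tags; the abstract does not specify either format. The function names, the whitespace tokenizer, the label names `DRUG` and `PROCEDURE`, and the English example sentence are all illustrative assumptions, not the authors' implementation.

```python
import re


def tokenize(text):
    """Whitespace tokenizer returning (token, start, end) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_bio(text, entities):
    """Convert (surface_string, label) pairs into one BIO tag per token.

    `entities` mimics what an LLM annotator might return (an assumption);
    every occurrence of each surface string in `text` is tagged.
    """
    tokens = tokenize(text)
    tags = ["O"] * len(tokens)
    for surface, label in entities:
        if not surface:
            continue  # skip empty strings, which would match everywhere
        for match in re.finditer(re.escape(surface), text):
            first = True
            for i, (_, start, end) in enumerate(tokens):
                # A token belongs to the entity if the character ranges overlap
                # and the token has not already been claimed by another entity.
                overlaps = start < match.end() and end > match.start()
                if overlaps and tags[i] == "O":
                    tags[i] = ("B-" if first else "I-") + label
                    first = False
    return [tok for tok, _, _ in tokens], tags


# Hypothetical synthetic sentence and LLM annotation.
text = "Patient was given ibuprofen 400 mg after knee arthroscopy."
llm_entities = [("ibuprofen", "DRUG"), ("knee arthroscopy", "PROCEDURE")]
tokens, tags = spans_to_bio(text, llm_entities)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Matching on surface strings rather than character offsets is a deliberate simplification: LLM output rarely includes reliable offsets, so the sketch re-locates each entity in the text. A real pipeline would likely use the NER model's own subword tokenizer instead of whitespace splitting.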

Results:

Our approach yields promising results in the extraction of named entities from real-world medical texts. Specifically, our best setup achieves an F1 score of 0.757 for extracting drugs and an F1 score of 0.395 for extracting procedures.
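For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it under one common convention, exact match on (start, end, label) triples; the abstract does not say whether the reported scores use exact, partial, or token-level matching, so this choice is an assumption. The function name and the example spans are hypothetical and are not data from the study.

```python
def entity_f1(gold, predicted):
    """Return (precision, recall, F1) for exact-match entity comparison.

    Each entity is a hashable triple such as (start, end, label).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three gold entities, two predictions, one exact match.
gold = [(18, 27, "DRUG"), (41, 57, "PROCEDURE"), (60, 67, "DRUG")]
predicted = [(18, 27, "DRUG"), (41, 52, "PROCEDURE")]
precision, recall, f1 = entity_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this example the second prediction has the right label but the wrong end offset, so it counts as both a false positive and a false negative under exact matching, giving precision 0.5, recall 1/3, and F1 0.4.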

Conclusions:

In this paper, we show that LLMs can be leveraged to train NER models on synthetic texts, without risking the privacy of sensitive medical data. These results are achieved without relying on real human-annotated data, highlighting the effectiveness of our methodology.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.