Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Sep 1, 2024
Date Accepted: Mar 29, 2025
Evaluation and Bias Analysis of Large Language Models in Generating Synthetic Electronic Health Records: A Comparative Study
ABSTRACT
Background:
Recent advancements in artificial intelligence, particularly with large language models (LLMs), have shown potential in the automated generation of synthetic clinical electronic health records (EHRs). However, concerns regarding the performance of these models and the manifestation of gender and racial biases in their outputs necessitate a thorough examination, especially as these models are increasingly applied in healthcare settings.
Objective:
This study aims to systematically assess the performance of various LLMs in generating synthetic EHRs and to critically evaluate the presence of gender and racial biases in the generated outputs. The study introduces the Electronic Health Record Performance Score (EPS) as a novel metric for comparing the efficacy of different LLMs, particularly focusing on bilingual English-Chinese models versus predominantly English models.
Methods:
We evaluated seven open-source LLMs across 20 diseases, analyzing the completeness of and bias in the generated EHRs. Gender and racial biases were quantified with statistical methods, including chi-square tests. In total, 140,000 synthetic patient cases were generated and assessed using the EPS and its attribute-specific variants (EPS_gender and EPS_race). Model performance was analyzed in relation to model size, cultural background, and training data diversity.
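To illustrate the bias-quantification step, the Python sketch below runs a chi-square goodness-of-fit test comparing generated gender counts for one disease against a reference prevalence distribution. The function name, counts, and prevalence split are hypothetical placeholders for exposition, not values or code from the study, and the EPS formula itself is not reproduced here.

    # Minimal sketch of the chi-square bias check described above.
    # All counts and prevalence figures are hypothetical placeholders.
    from scipy.stats import chisquare

    def gender_bias_test(observed_counts, expected_proportions):
        # Goodness-of-fit test of generated gender counts against
        # a reference prevalence distribution for one disease.
        total = sum(observed_counts)
        expected_counts = [p * total for p in expected_proportions]
        return chisquare(f_obs=observed_counts, f_exp=expected_counts)

    # Example: 1,000 synthetic cases for one disease (made-up numbers).
    observed = [720, 280]      # generated counts: [female, male]
    reference = [0.55, 0.45]   # assumed real-world gender split
    stat, p_value = gender_bias_test(observed, reference)
    print(f"chi2 = {stat:.2f}, p = {p_value:.4g}")  # small p: counts deviate from reference

A significant result indicates that the generated gender distribution deviates from the reference; the same test extends to racial categories with more than two groups.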
Results:
The findings revealed significant differences in the accuracy of synthetic EHRs generated by different models: larger models generally performed better but also exhibited more pronounced biases. Gender biases increased with model size and tended to align with the gender prevalence of specific diseases. Racial biases were more complex, with white patients consistently overrepresented across most diseases. Notably, increasing the diversity of training data did not necessarily reduce racial biases.
Conclusions:
This study underscores the critical need for ongoing interdisciplinary research to enhance the fairness and reliability of LLMs in healthcare. The pervasive gender and racial biases identified in LLM-generated EHRs emphasize the importance of developing methods for bias detection and mitigation to ensure equitable healthcare delivery and education. Future research should focus on refining these models to better represent diverse patient populations while maintaining high performance in EHR generation.