Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Sep 1, 2024
Date Accepted: Mar 29, 2025
Evaluation and Bias Analysis of Large Language Models in Generating Synthetic Electronic Health Records: A Comparative Study
ABSTRACT
Background:
Recent advancements in artificial intelligence, particularly with large language models (LLMs), have shown potential in the automated generation of synthetic clinical electronic health records (EHRs). However, concerns regarding the performance of these models and the manifestation of gender and racial biases in their outputs necessitate a thorough examination, especially as these models are increasingly applied in healthcare settings.
Objective:
This study aims to systematically assess the performance of various LLMs in generating synthetic EHRs and to critically evaluate the presence of gender and racial biases in the generated outputs. The study introduces the Electronic Health Record Performance Score (EPS) as a novel metric for comparing the efficacy of different LLMs, particularly focusing on bilingual English-Chinese models versus predominantly English models.
Methods:
We evaluated seven open-source LLMs across 20 diseases, analyzing the completeness of and bias in the generated EHRs. Gender and racial biases were quantified with statistical methods, including chi-square tests. In total, 140,000 synthetic patient cases were generated and assessed using the EPS and its attribute-specific variants (EPS_gender and EPS_race). Model performance was analyzed in relation to model size, cultural background, and training data diversity.
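To illustrate the bias-quantification step, the Python sketch below runs a chi-square goodness-of-fit test comparing generated gender counts for one disease against a reference prevalence distribution. The function name, counts, and prevalence split are hypothetical placeholders for exposition, not values or code from the study, and the EPS formula itself is not reproduced here.

    # Minimal sketch of the chi-square bias check described above.
    # All counts and prevalence figures are hypothetical placeholders.
    from scipy.stats import chisquare

    def gender_bias_test(observed_counts, expected_proportions):
        # Goodness-of-fit test of generated gender counts against
        # a reference prevalence distribution for one disease.
        total = sum(observed_counts)
        expected_counts = [p * total for p in expected_proportions]
        return chisquare(f_obs=observed_counts, f_exp=expected_counts)

    # Example: 1,000 synthetic cases for one disease (made-up numbers).
    observed = [720, 280]      # generated counts: [female, male]
    reference = [0.55, 0.45]   # assumed real-world gender split
    stat, p_value = gender_bias_test(observed, reference)
    print(f"chi2 = {stat:.2f}, p = {p_value:.4g}")  # small p: counts deviate from reference

A significant result indicates that the generated gender distribution deviates from the reference; the same test extends to racial categories with more than two groups.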
Results:
The findings revealed significant differences in the accuracy of synthetic EHRs generated by different models: larger models generally performed better but also exhibited more pronounced biases. Gender biases increased with model size and tended to align with the gender prevalence of specific diseases. Racial biases were more complex, with white patients consistently overrepresented across most diseases. Notably, increasing the diversity of training data did not necessarily reduce racial biases.
Conclusions:
This study underscores the critical need for ongoing interdisciplinary research to enhance the fairness and reliability of LLMs in healthcare. The pervasive gender and racial biases identified in LLM-generated EHRs emphasize the importance of developing methods for bias detection and mitigation to ensure equitable healthcare delivery and education. Future research should focus on refining these models to better represent diverse patient populations while maintaining high performance in EHR generation.