Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Feb 9, 2024
Date Accepted: Jan 16, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Assessing Racial and Ethnic Bias in Text Generation for Healthcare-Related Tasks by GPT-3.5-turbo: Cross Sectional Study
ABSTRACT
Background:
Racial and ethnic bias in Large Language Models (LLMs) used for healthcare tasks is a growing concern, as it may contribute to health disparities. In response, LLM operators have implemented safeguards against prompts that overtly seek biased output.
Objective:
Our study investigates potential racial and ethnic bias in GPT-3.5-turbo, a popular LLM, in generating healthcare consumer-directed text in the absence of overtly biased queries.
Methods:
In this cross-sectional study, GPT-3.5-turbo was prompted to generate discharge instructions for patients with Human Immunodeficiency Virus (HIV). De-identified metadata for each patient encounter, including race/ethnicity as a variable, were passed in table format through a prompt four times, altering only the race/ethnicity information (African American, Asian, Hispanic White, Non-Hispanic White) each time while keeping all other information constant. The prompt asked the model to write discharge instructions for each encounter without explicitly mentioning race, ethnicity, or insurance type. The LLM-generated instructions were analyzed for sentiment, subjectivity, reading ease, and word usage by race/ethnicity and insurance type.
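The readability part of such an analysis can be sketched in a few lines. The snippet below is a minimal illustration, not the study's actual pipeline: it implements the published Flesch reading-ease formula with a rough syllable heuristic and compares scores across hypothetical per-group instruction texts (the group labels and sample sentences are invented for demonstration).

```python
import re

def count_syllables(word: str) -> int:
    """Rough English syllable count: vowel groups, minus a silent final 'e'."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text: str) -> float:
    """Flesch reading ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

# Hypothetical LLM-generated instruction snippets, one per group, with only
# the race/ethnicity field varied in the original prompt:
instructions = {
    "African American": "Take your medicine every day. Call us if you feel unwell.",
    "Asian": "Take your medicine every day. Call us if you feel unwell.",
}
scores = {group: flesch_reading_ease(text) for group, text in instructions.items()}
```

Identical generated texts yield identical scores, so any between-group difference in this metric reflects a real difference in the model's output rather than in the scoring. The study's sentiment and subjectivity measures would be computed analogously with a sentiment library rather than this hand-rolled formula.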
Results:
The average polarity of GPT-3.5-turbo generated patient instructions across the different racial/ethnic groups was comparable, ranging from 0.14 to 0.15, with an average subjectivity of 0.46 for all groups. Differences in polarity and subjectivity across racial/ethnic groups were not statistically significant. However, word frequency varied across racial/ethnic groups, and subjectivity differed across insurance types, with commercial insurance eliciting the most subjective responses.
Conclusions:
GPT-3.5-turbo was relatively invariant to race/ethnicity and insurance type in terms of linguistic and readability measures. Further studies are needed to validate these results and assess their implications.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.