Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Feb 9, 2024
Date Accepted: Jan 16, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Assessing Racial and Ethnic Bias in Text Generation for Healthcare-Related Tasks by GPT-3.5-turbo: Cross Sectional Study
ABSTRACT
Background:
Racial and ethnic bias in Large Language Models (LLMs) used for healthcare tasks is a growing concern, as it may contribute to health disparities. In response, LLM operators have implemented safeguards against prompts that overtly seek biased output.
Objective:
Our study investigates potential racial and ethnic bias in GPT-3.5-turbo, a popular LLM, in generating healthcare consumer-directed text in the absence of overtly biased queries.
Methods:
In this cross-sectional study, GPT-3.5-turbo was prompted to generate discharge instructions for patients with Human Immunodeficiency Virus (HIV). De-identified metadata for each patient encounter, including race/ethnicity as a variable, were passed in table format through a prompt four times, altering only the race/ethnicity information (African American, Asian, Hispanic White, Non-Hispanic White) each time while keeping all other information constant. The prompt asked the model to write discharge instructions for each encounter without explicitly mentioning race, ethnicity, or insurance type. The LLM-generated instructions were analyzed for sentiment, subjectivity, reading ease, and word usage by race/ethnicity and insurance type.
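The readability part of such an analysis can be sketched in a few lines. The snippet below is a minimal illustration, not the study's actual pipeline: it implements the published Flesch reading-ease formula with a rough syllable heuristic and compares scores across hypothetical per-group instruction texts (the group labels and sample sentences are invented for demonstration).

```python
import re

def count_syllables(word: str) -> int:
    """Rough English syllable count: vowel groups, minus a silent final 'e'."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text: str) -> float:
    """Flesch reading ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

# Hypothetical LLM-generated instruction snippets, one per group, with only
# the race/ethnicity field varied in the original prompt:
instructions = {
    "African American": "Take your medicine every day. Call us if you feel unwell.",
    "Asian": "Take your medicine every day. Call us if you feel unwell.",
}
scores = {group: flesch_reading_ease(text) for group, text in instructions.items()}
```

Identical generated texts yield identical scores, so any between-group difference in this metric reflects a real difference in the model's output rather than in the scoring. The study's sentiment and subjectivity measures would be computed analogously with a sentiment library rather than this hand-rolled formula.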
Results:
The average polarity of GPT-3.5-turbo generated patient instructions across the different racial/ethnic groups was comparable, ranging from 0.14 to 0.15, with an average subjectivity of 0.46 for all groups. Differences in polarity and subjectivity across racial/ethnic groups were not statistically significant. However, word frequency varied across racial/ethnic groups, and subjectivity differed across insurance types, with commercial insurance eliciting the most subjective responses.
Conclusions:
GPT-3.5-turbo was relatively invariant to race/ethnicity and insurance type in terms of linguistic and readability measures. Further studies are needed to validate these results and assess their implications.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.