Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Aug 14, 2023
Date Accepted: Nov 27, 2023

The final, peer-reviewed published version of this preprint can be found here:

What’s in a Name? Experimental Evidence of Gender Bias in Recommendation Letters Generated by ChatGPT

Kaplan DM, Palitsky R, Arconada Alvarez SJ, Pozzo NS, Greenleaf MN, Atkinson CA, Lam WA

J Med Internet Res 2024;26:e51837

DOI: 10.2196/51837

PMID: 38441945

PMCID: 10951834

What’s in a name? Experimental evidence of gender bias in letters of recommendation generated by ChatGPT

  • Deanna M. Kaplan; 
  • Roman Palitsky; 
  • Santiago J. Arconada Alvarez; 
  • Nicole S. Pozzo; 
  • Morgan N. Greenleaf; 
  • Ciara A. Atkinson; 
  • Wilbur A. Lam

ABSTRACT

Background:

Artificial intelligence chatbots such as ChatGPT have garnered excitement about their potential for delegating writing tasks ordinarily performed by humans. Many of these tasks (e.g., writing letters of recommendation) have social and professional ramifications, making potential social biases in ChatGPT’s underlying language model a serious concern.

Objective:

Three pre-registered studies used the text analysis program Linguistic Inquiry and Word Count (LIWC) to investigate gender bias in letters of recommendation written by ChatGPT-3.5 in human-use sessions (n = 1,400 total letters).

Methods:

Analyses used 22 existing LIWC dictionaries, as well as six newly created dictionaries based on systematic reviews of gender bias in recommendation letters, to compare recommendation letters generated for the 200 most historically popular “male” and “female” names in the USA. Study 1 used three different letter writing prompts intended to accentuate professional accomplishments associated with male stereotypes, female stereotypes, or neither. Study 2 examined whether lengthening each of the three prompts, while holding between-prompt word count constant, modified the extent of bias. Study 3 examined variability within letters generated for the same name and prompt. We hypothesized that when prompted with gender-stereotyped professional accomplishments, ChatGPT would evidence gender-based language differences replicating those found in systematic reviews of human-written letters of recommendation (e.g., more affiliative, social and communal language for female names; more agentic and skill-based language for male names).
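
As an illustration only (not the authors' pipeline), the core comparison logic can be sketched in a few lines of Python: the made-up "communal" word list stands in for the LIWC and custom dictionaries, the toy letters are invented, and Welch's t-test stands in for the study's full set of statistical comparisons.

```python
import re
from scipy import stats

# Placeholder "communal" terms; the study used 22 existing LIWC dictionaries
# plus 6 custom dictionaries, which are not reproduced here.
COMMUNAL = {"team", "supportive", "warm", "helpful", "caring", "community"}

def dictionary_rate(text, dictionary):
    """Percentage of words in `text` that match `dictionary` (LIWC-style score)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return 100.0 * sum(w in dictionary for w in words) / len(words)

# Toy stand-ins for letters generated for historically female vs. male names.
female_letters = [
    "She is a warm and supportive member of the team, always helpful to others.",
    "A caring colleague who builds community and supports her team.",
]
male_letters = [
    "He delivered outstanding results and mastered every technical skill quickly.",
    "A skilled researcher who leads his team with exceptional analytical ability.",
]

female_scores = [dictionary_rate(t, COMMUNAL) for t in female_letters]
male_scores = [dictionary_rate(t, COMMUNAL) for t in male_letters]

# Welch's t-test (unequal variances) for a between-group difference in mean
# dictionary scores; the real analyses span 28 dictionaries and 3 prompts.
t_stat, p_value = stats.ttest_ind(female_scores, male_scores, equal_var=False)
print(f"communal-language difference: t = {t_stat:.2f}, p = {p_value:.3f}")
```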

Results:

Significant differences in language between letters generated for female vs. male names were observed across all prompts, including the prompt hypothesized to be neutral, and across nearly all language categories tested. Historically female names received significantly more social references (5/6 studies), communal and/or doubt-raising language (4/6 studies), personal pronouns (4/6 studies), and clout language (5/6 studies). Contradicting study hypotheses, some gender differences (e.g., achievement language, agentic language) were significant in either the hypothesized or the non-hypothesized direction depending on the prompt. Heteroscedasticity between male and female names was observed for multiple linguistic categories, with greater variance for a historically female name than for a historically male name.
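
The variance finding can be illustrated with a generic heteroscedasticity check. Levene's test below is one standard option; the score vectors are invented, so this is a sketch of the concept rather than the study's actual analysis.

```python
# Illustrative heteroscedasticity check between per-letter scores for one
# historically female name and one historically male name (values are made up).
from scipy import stats

female_name_scores = [4.1, 7.9, 2.3, 9.0, 5.5, 1.8, 8.2, 3.6]  # wider spread
male_name_scores   = [5.0, 5.4, 4.8, 5.2, 5.1, 4.9, 5.3, 5.0]  # tighter spread

w_stat, p_value = stats.levene(female_name_scores, male_name_scores)
print(f"Levene's test: W = {w_stat:.2f}, p = {p_value:.3f}")
# A small p-value indicates unequal variances, i.e., more variability in the
# letters generated for one name than for the other.
```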

Conclusions:

ChatGPT duplicates many of the gender-based language biases that have been reliably identified in investigations of human-written reference letters, although these differences vary across prompts and language categories. Caution should be taken when using ChatGPT for tasks such as reference letter writing that have social consequences. The methods developed for this study may be useful for ongoing bias testing in progressive generations of chatbots across a range of real-world scenarios. Clinical Trial: https://osf.io/7mbu6




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.