Currently submitted to: JMIR Mental Health
Date Submitted: Apr 24, 2026
Open Peer Review Period: Apr 29, 2026 - Jun 24, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Evaluating Fairness and Generalizability of Large Language Models for Social Isolation Extraction from Electronic Health Records: Multisite Evaluation
ABSTRACT
Background:
Recent advancements in large language models (LLMs) have improved the identification of social isolation from clinical narratives, which vary widely in linguistic patterns and documentation practices. However, LLMs fine-tuned on a single dataset often show reduced performance when applied to different healthcare settings or clinical note types. Rigorous evaluation of cross-site generalizability and fairness is therefore essential to ensure accurate and equitable detection of social isolation across diverse populations and clinical contexts.
Objective:
This study aimed to evaluate a span-level fine-tuned FLAN-T5-Large model for extracting social isolation indicators from unstructured clinical text and to assess its generalizability and fairness across diverse populations and healthcare data sources.
Methods:
A total of 2,967 unique annotated spans from 9,578 clinical notes across three healthcare systems were used to fine-tune a FLAN-T5-Large model within a contextualized span classification framework. A Gemma-2-2B model was evaluated in a sensitivity analysis to assess architecture-related performance differences. Performance was assessed using precision, recall, and macro-F1. Fairness was evaluated across demographic variables, social vulnerability strata, and note types using statistical parity difference (SPD) and equal opportunity difference (EOD).
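The two fairness metrics follow their standard definitions: SPD is the gap in positive prediction rates between two subgroups, and EOD is the gap in true positive rates (recall). A minimal sketch, with illustrative group labels and data (not the study's actual cohorts):

```python
# Minimal sketch of the two fairness metrics, using standard definitions.
# Group labels, predictions, and gold labels below are illustrative only.

def spd(y_pred, groups, a, b):
    """SPD = P(y_hat = 1 | group a) - P(y_hat = 1 | group b)."""
    def rate(g):
        members = [p for p, grp in zip(y_pred, groups) if grp == g]
        return sum(members) / len(members)
    return rate(a) - rate(b)

def eod(y_true, y_pred, groups, a, b):
    """EOD = TPR(group a) - TPR(group b): the recall gap on true positives."""
    def tpr(g):
        pos = [p for t, p, grp in zip(y_true, y_pred, groups)
               if grp == g and t == 1]
        return sum(pos) / len(pos)
    return tpr(a) - tpr(b)
```

Values near zero on both metrics indicate that the model flags social isolation at similar rates, and misses true cases at similar rates, across the compared subgroups.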
Results:
Incorporating contextual windows around annotated spans improved macro-F1 from 0.90 to 0.94 during validation. In full-note evaluation across 900 manually reviewed notes, FLAN-T5-Large achieved high recall for social isolation (0.94–0.98) and macro-F1 values ranging from 0.69 to 0.81 across sites. Fairness analysis showed generally consistent performance across age, gender, race, and social vulnerability groups, with equitable sensitivity (EOD 0.02–0.04) and moderate variation in positive prediction rates (SPD). Note type drove substantially greater variability in both performance and fairness metrics than patient demographic factors.
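The gain from contextual windows comes from classifying each annotated span together with its surrounding note text rather than in isolation. A minimal sketch of such input construction, assuming a token-based window and hypothetical span-marker tokens (the authors' exact window size and markers are not given in the abstract):

```python
# Hedged sketch of contextualized span input construction: the annotated
# span is flanked by up to `window` surrounding tokens and delimited by
# hypothetical [SPAN] markers. Window size and markers are assumptions,
# not the study's actual configuration.

def contextualize(tokens, start, end, window=50):
    """Return the span tokens[start:end] with up to `window` tokens of
    left and right context, the span itself wrapped in marker tokens."""
    left = tokens[max(0, start - window):start]
    span = tokens[start:end]
    right = tokens[end:end + window]
    return " ".join(left + ["[SPAN]"] + span + ["[/SPAN]"] + right)
```

The resulting string would then be fed to the sequence-to-sequence model as its classification input, letting the model use negation cues and discourse context near the span.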
Conclusions:
The fine-tuned FLAN-T5-Large model demonstrated strong capability in detecting social isolation from clinical narratives while maintaining sensitivity parity across subgroups. The observed heterogeneity was largely driven by documentation context rather than by patient characteristics, highlighting the importance of note-type-aware evaluation in clinical NLP. These findings support the use of instruction-tuned LLMs for equitable extraction of social context information from EHR text.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.