Currently submitted to: JMIR Mental Health
Date Submitted: Apr 24, 2026
Open Peer Review Period: Apr 29, 2026 - Jun 24, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Evaluating Fairness and Generalizability of Large Language Models for Social Isolation Extraction from Electronic Health Records: Multisite Evaluation
ABSTRACT
Background:
Recent advancements in large language models (LLMs) have improved the identification of social isolation from clinical narratives, which vary widely in linguistic patterns and documentation practices. However, LLMs fine-tuned on a single dataset often show reduced performance when applied to different healthcare settings or clinical note types. Rigorous evaluation of cross-site generalizability and fairness is therefore essential to ensure accurate and equitable detection of social isolation across diverse populations and clinical contexts.
Objective:
This study aimed to evaluate a span-level fine-tuned FLAN-T5-Large model for extracting social isolation indicators from unstructured clinical text and to assess its generalizability and fairness across diverse populations and healthcare data sources.
Methods:
A total of 2,967 unique annotated spans from 9,578 clinical notes across three healthcare systems were used to fine-tune a FLAN-T5-Large model within a contextualized span classification framework. A Gemma-2-2B model was evaluated in a sensitivity analysis to assess architecture-related performance differences. Performance was assessed using precision, recall, and macro-F1. Fairness was evaluated across demographic variables, social vulnerability strata, and note types using statistical parity difference (SPD) and equal opportunity difference (EOD).
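The two fairness metrics follow their standard definitions: SPD is the gap in positive prediction rates between two subgroups, and EOD is the gap in true positive rates (recall). A minimal sketch, with illustrative group labels and data (not the study's actual cohorts):

```python
# Minimal sketch of the two fairness metrics, using standard definitions.
# Group labels, predictions, and gold labels below are illustrative only.

def spd(y_pred, groups, a, b):
    """SPD = P(y_hat = 1 | group a) - P(y_hat = 1 | group b)."""
    def rate(g):
        members = [p for p, grp in zip(y_pred, groups) if grp == g]
        return sum(members) / len(members)
    return rate(a) - rate(b)

def eod(y_true, y_pred, groups, a, b):
    """EOD = TPR(group a) - TPR(group b): the recall gap on true positives."""
    def tpr(g):
        pos = [p for t, p, grp in zip(y_true, y_pred, groups)
               if grp == g and t == 1]
        return sum(pos) / len(pos)
    return tpr(a) - tpr(b)
```

Values near zero on both metrics indicate that the model flags social isolation at similar rates, and misses true cases at similar rates, across the compared subgroups.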
Results:
Incorporating contextual windows around annotated spans improved macro-F1 from 0.90 to 0.94 during validation. In full-note evaluation across 900 manually reviewed notes, FLAN-T5-Large achieved high recall for social isolation (0.94–0.98) and macro-F1 values ranging from 0.69 to 0.81 across sites. Fairness analysis showed generally consistent performance across age, gender, race, and social vulnerability groups, with equitable sensitivity (EOD 0.02–0.04) and moderate variation in positive prediction rates (SPD). Note type drove substantially greater variability in both performance and fairness metrics than patient demographic factors.
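The gain from contextual windows comes from classifying each annotated span together with its surrounding note text rather than in isolation. A minimal sketch of such input construction, assuming a token-based window and hypothetical span-marker tokens (the authors' exact window size and markers are not given in the abstract):

```python
# Hedged sketch of contextualized span input construction: the annotated
# span is flanked by up to `window` surrounding tokens and delimited by
# hypothetical [SPAN] markers. Window size and markers are assumptions,
# not the study's actual configuration.

def contextualize(tokens, start, end, window=50):
    """Return the span tokens[start:end] with up to `window` tokens of
    left and right context, the span itself wrapped in marker tokens."""
    left = tokens[max(0, start - window):start]
    span = tokens[start:end]
    right = tokens[end:end + window]
    return " ".join(left + ["[SPAN]"] + span + ["[/SPAN]"] + right)
```

The resulting string would then be fed to the sequence-to-sequence model as its classification input, letting the model use negation cues and discourse context near the span.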
Conclusions:
The fine-tuned FLAN-T5-Large model demonstrated strong capability in detecting social isolation from clinical narratives while maintaining sensitivity parity across subgroups. The observed heterogeneity was largely driven by documentation context rather than by patient characteristics, highlighting the importance of note-type-aware evaluation in clinical NLP. These findings support the use of instruction-tuned LLMs for equitable extraction of social context information from EHR text.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.