JMIR Preprints #89534: Extracting Social Determinants of Health from Electronic Health Records: Development and Comparison of Rule-Based and Large Language Model Methods

Extracting Social Determinants of Health from Electronic Health Records: Development and Comparison of Rule-Based and Large Language Model Methods

Bo Wang;
Dia Kabir;
Cheryl Renee Clark;
Karmel Choi;
Jordan Smoller

ABSTRACT

Background:

Social determinants of health (SDoH) are critical drivers of health outcomes but are often under-documented in structured electronic health record (EHR) data. Instead, SDoH are more commonly recorded in unstructured clinical notes, and unlocking this information could have far-reaching implications for advancing population health research and informing clinical decision-making.

Objective:

This study aimed to develop and systematically evaluate scalable methods for extracting SDoH information from unstructured clinical notes using rule-based natural language processing (NLP) and large language model (LLM)-based approaches.

Methods:

We constructed a gold-standard annotated corpus comprising clinical text segments from 171 patients in the Mass General Brigham Research Patient Data Registry, covering seven SDoH domain categories and 23 subcategories. A rule-based system (RBS) was developed and evaluated alongside seven OpenAI GPT models (GPT-4o, GPT-4.1, GPT-4.1-mini, o4-mini, GPT-5, GPT-5-mini, and o3) under zero-shot and few-shot settings with multiple prompting strategies. We additionally implemented late-fusion ensemble approaches that combined outputs from rule-based and LLM-based methods. Performance was assessed using precision, recall, and F1 score, with subgroup analyses conducted across demographic characteristics.
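
To make the late-fusion and scoring steps concrete, the following is a minimal sketch and not the authors' implementation. It assumes each system emits a set of SDoH domain labels per text segment, that fusion is a simple per-segment union (the abstract does not state the fusion rule), and that scores are micro-averaged over segment-label pairs. The function names, segment IDs, and domain labels are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: segment id -> set of predicted SDoH domain labels.
Labels = Dict[str, Set[str]]


def late_fusion_union(rule_based: Labels, llm: Labels) -> Labels:
    """Fuse two systems' outputs by taking the union of labels per segment.

    Union is an assumed fusion rule chosen for illustration only.
    """
    segment_ids = set(rule_based) | set(llm)
    return {
        seg: rule_based.get(seg, set()) | llm.get(seg, set())
        for seg in segment_ids
    }


def micro_prf(gold: Labels, predicted: Labels) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over (segment, label) pairs."""
    tp = fp = fn = 0
    for seg in set(gold) | set(predicted):
        g = gold.get(seg, set())
        p = predicted.get(seg, set())
        tp += len(g & p)   # labels both annotated and predicted
        fp += len(p - g)   # predicted but not annotated
        fn += len(g - p)   # annotated but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (invented for illustration, not drawn from the study corpus).
gold = {"s1": {"housing", "employment"}, "s2": {"food"}, "s3": set()}
rbs_out = {"s1": set(), "s2": {"food"}, "s3": set()}
llm_out = {"s1": {"housing", "employment"}, "s2": set(), "s3": {"transport"}}

fused = late_fusion_union(rbs_out, llm_out)
for name, pred in [("RBS", rbs_out), ("LLM", llm_out), ("Fused", fused)]:
    p, r, f = micro_prf(gold, pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy case the rule-based output is precise but misses labels, the LLM output recovers more labels at the cost of a false positive, and the union recovers every gold label. A union can only raise recall and may lower precision, which is why other fusion rules (for example, trusting the rule-based system only for selected domains) are plausible alternatives.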

Results:

The RBS achieved the highest precision for SDoH domain categories (0.97) but substantially lower recall (0.62). GPT-based models consistently outperformed RBS in overall F1 scores. The best domain-level performance was observed for GPT-5 and GPT-5-mini in few-shot settings (F1=0.88), while o4-mini achieved the highest subcategory-level performance (F1=0.79). A late-fusion ensemble integrating RBS and GPT outputs further improved domain-level performance (F1=0.89), with balanced precision (0.90) and recall (0.89). Model performance was consistent across demographic subgroups.
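
As a rough consistency check on the figures above, F1 is the harmonic mean of precision P and recall R. The arithmetic below assumes the reported precision and recall are aggregated in the same way as the reported F1, which the abstract does not state. The RBS value is derived here for illustration and is not a figure reported by the authors.

```latex
\begin{align*}
F_1 &= \frac{2PR}{P + R} \\
\text{Ensemble: } F_1 &= \frac{2(0.90)(0.89)}{0.90 + 0.89} = \frac{1.602}{1.79} \approx 0.89 \\
\text{RBS: } F_1 &= \frac{2(0.97)(0.62)}{0.97 + 0.62} = \frac{1.2028}{1.59} \approx 0.76
\end{align*}
```

The ensemble value agrees with the reported domain-level F1 of 0.89, and the derived RBS value illustrates how low recall pulls F1 down despite very high precision.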

Conclusions:

Recent GPT models with advanced reasoning capabilities, including the newly released “mini” models (e.g., o4-mini and GPT-5-mini), demonstrated strong performance for SDoH extraction without task-specific fine-tuning and consistently outperformed the rule-based NLP system. Integrating rule-based and LLM-based methods via late fusion further enhanced performance. Our results demonstrate a scalable and cost-efficient framework for the accurate identification of SDoH from clinical text, facilitating downstream population health research and clinical informatics applications.

Citation

Please cite as:

Wang B, Kabir D, Clark CR, Choi K, Smoller J

Extracting Social Determinants of Health From Electronic Health Records: Development and Comparison of Rule-Based and Large Language Model Methods

JMIR Med Inform 2026;14:e89534

DOI: 10.2196/89534

PMID: 42155986

PMCID: 13231107

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Dec 15, 2025

Date Accepted: Apr 14, 2026
