Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Dec 15, 2025
Date Accepted: Apr 14, 2026

The final, peer-reviewed published version of this preprint can be found here:

Wang B, Kabir D, Clark CR, Choi K, Smoller J

Extracting Social Determinants of Health From Electronic Health Records: Development and Comparison of Rule-Based and Large Language Model Methods

JMIR Med Inform 2026;14:e89534

DOI: 10.2196/89534

PMID: 42155986

Extracting Social Determinants of Health from Electronic Health Records: Development and Comparison of Rule-Based and Large Language Model Methods

  • Bo Wang
  • Dia Kabir
  • Cheryl Renee Clark
  • Karmel Choi
  • Jordan Smoller

ABSTRACT

Background:

Social determinants of health (SDoH) are critical drivers of health outcomes but are often under-documented in structured electronic health record (EHR) data. Instead, SDoH are more commonly recorded in unstructured clinical notes, and unlocking this information could have far-reaching implications for advancing population health research and informing clinical decision making.

Objective:

This study aimed to develop and systematically evaluate scalable methods for extracting SDoH information from unstructured clinical notes using rule-based natural language processing (NLP) and large language model (LLM)-based approaches.

Methods:

We constructed a gold-standard annotated corpus comprising clinical text segments from 171 patients in the Mass General Brigham Research Patient Data Registry, covering seven SDoH domain categories and 23 subcategories. A rule-based system (RBS) was developed and evaluated alongside seven OpenAI GPT models (GPT-4o, GPT-4.1, GPT-4.1-mini, o4-mini, GPT-5, GPT-5-mini, and o3) under zero-shot and few-shot settings with multiple prompting strategies. We additionally implemented late-fusion ensemble approaches that combined outputs from rule-based and LLM-based methods. Performance was assessed using precision, recall, and F1 score, with subgroup analyses conducted across demographic characteristics.
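As a rough illustration of the pipeline described above, the following Python sketch pairs a toy rule-based extractor with a late-fusion step and a micro-averaged precision/recall/F1 evaluation. Everything specific in it is an assumption made for illustration: the keyword rules, the domain names, the example notes, the stand-in LLM outputs, and the choice of a simple set union as the fusion rule. The abstract does not specify the study's lexicon, fusion strategy, or averaging scheme.

```python
import re

# Hypothetical keyword rules; the study's actual rule set is not given in the abstract.
RULES = {
    "housing": re.compile(r"\b(homeless|evicted|shelter)\b", re.IGNORECASE),
    "employment": re.compile(r"\b(unemployed|laid off)\b", re.IGNORECASE),
}


def rule_based_labels(text):
    """Return the set of SDoH domains whose pattern matches the text."""
    return {domain for domain, pattern in RULES.items() if pattern.search(text)}


def late_fusion(rule_labels, llm_labels):
    """Combine two label sets after each method has run (assumed rule: union)."""
    return rule_labels | llm_labels


def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over per-note label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented example notes and gold labels, for illustration only.
notes = [
    "Patient is currently homeless and staying at a shelter.",
    "Reports being laid off last month; worries about paying for food.",
]
gold = [{"housing"}, {"employment", "food"}]

# Stand-in for LLM outputs; a real run would call a model with a prompt.
llm_pred = [{"housing"}, {"food"}]

rule_pred = [rule_based_labels(note) for note in notes]
fused_pred = [late_fusion(r, l) for r, l in zip(rule_pred, llm_pred)]

rule_scores = micro_prf(gold, rule_pred)
llm_scores = micro_prf(gold, llm_pred)
fused_scores = micro_prf(gold, fused_pred)
```

In this toy data each method alone misses one gold label, while the union recovers both. A union tends to raise recall at some cost to precision; other fusion rules (for example, voting or domain-specific precedence) would trade these off differently.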

Results:

The RBS achieved the highest precision for SDoH domain categories (0.97) but substantially lower recall (0.62). GPT-based models consistently outperformed the RBS in overall F1 scores. The best domain-level performance was observed for GPT-5 and GPT-5-mini in few-shot settings (F1=0.88), while o4-mini achieved the highest subcategory-level performance (F1=0.79). A late-fusion ensemble integrating RBS and GPT outputs further improved domain-level performance (F1=0.89), with balanced precision (0.90) and recall (0.89). Model performance was consistent across demographic subgroups.
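To make the precision and recall trade-off concrete, the short Python sketch below computes F1 as the harmonic mean of the precision and recall values quoted above. It assumes the quoted figures can be combined directly. If the reported scores are averaged across categories, the published F1 need not equal the harmonic mean of the aggregate precision and recall, so the RBS value here is an illustrative calculation and not a figure reported in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Domain-level precision and recall as quoted in the Results paragraph.
rbs_f1 = f1_score(0.97, 0.62)        # rule-based system
ensemble_f1 = f1_score(0.90, 0.89)   # late-fusion ensemble

print(f"RBS F1 (illustrative): {rbs_f1:.2f}")
print(f"Ensemble F1: {ensemble_f1:.2f}")
```

Under that assumption the ensemble value rounds to 0.89, consistent with the reported F1, while the RBS's high precision is pulled down by its lower recall to roughly 0.76.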

Conclusions:

Recent GPT models with advanced reasoning capabilities, including the newly released “mini” models (e.g., o4-mini and GPT-5-mini), demonstrated strong performance for SDoH extraction without task-specific fine-tuning and consistently outperformed the rule-based NLP system. Integrating rule-based and LLM-based methods via late-fusion further enhanced performance. Our results demonstrate a scalable and cost-efficient framework for the accurate identification of SDoH from clinical text, facilitating downstream population health research and clinical informatics applications.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.