Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Dec 15, 2025
Date Accepted: Apr 14, 2026
Extracting Social Determinants of Health from Electronic Health Records: Development and Comparison of Rule-Based and Large Language Model Methods
ABSTRACT
Background:
Social determinants of health (SDoH) are critical drivers of health outcomes but are often under-documented in structured electronic health record (EHR) data. Instead, SDoH are more commonly recorded in unstructured clinical notes, and unlocking this information could have far-reaching implications for advancing population health research and informing clinical decision-making.
Objective:
This study aimed to develop and systematically evaluate scalable methods for extracting SDoH information from unstructured clinical notes using rule-based natural language processing (NLP) and large language model (LLM)-based approaches.
Methods:
We constructed a gold-standard annotated corpus comprising clinical text segments from 171 patients in the Mass General Brigham Research Patient Data Registry, covering seven SDoH domain categories and 23 subcategories. A rule-based system (RBS) was developed and evaluated alongside seven OpenAI GPT models (GPT-4o, 4.1, 4.1-mini, o4-mini, GPT-5, GPT-5-mini, and o3) under zero-shot and few-shot settings with multiple prompting strategies. We additionally implemented late-fusion ensemble approaches that combined outputs from rule-based and LLM-based methods. Performance was assessed using precision, recall, and F1 score, with subgroup analyses conducted across demographic characteristics.
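As a rough illustration of the late-fusion and evaluation steps described above, the following minimal sketch fuses two systems' label sets per text segment and scores them with micro-averaged precision, recall, and F1. The union fusion rule, the micro-averaging, the function names, and the toy labels are all assumptions made for illustration; the abstract does not specify how the outputs were combined or how the metrics were aggregated.

```python
from typing import List, Set, Tuple

# One set of SDoH labels per clinical text segment.
Labels = List[Set[str]]


def late_fusion_union(rbs: Labels, llm: Labels) -> Labels:
    """Fuse two systems' outputs segment by segment.

    Set union is an assumed fusion rule, not necessarily the one used in the study.
    """
    return [r | l for r, l in zip(rbs, llm)]


def micro_prf(gold: Labels, pred: Labels) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over all segments."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy annotations with hypothetical label names, invented for illustration only.
gold = [{"housing", "employment"}, {"substance_use"}, {"social_isolation"}]
rbs = [{"housing"}, set(), {"social_isolation"}]
llm = [{"housing", "employment"}, {"substance_use", "food_insecurity"}, set()]

fused = late_fusion_union(rbs, llm)
for name, pred in [("RBS", rbs), ("LLM", llm), ("Fused", fused)]:
    p, r, f = micro_prf(gold, pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy setup the rule-based output is precise but misses labels, the LLM output recovers more labels at the cost of one false positive, and the union keeps the extra recall from both. Other fusion rules (for example, trusting the rule-based system only for selected categories) would trade precision and recall differently.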
Results:
The RBS achieved the highest precision for SDoH domain categories (0.97) but substantially lower recall (0.62). GPT-based models consistently outperformed the RBS in overall F1 scores. The best domain-level performance was observed for GPT-5 and GPT-5-mini in few-shot settings (F1=0.88), while o4-mini achieved the highest subcategory-level performance (F1=0.79). A late-fusion ensemble integrating RBS and GPT outputs further improved domain-level performance (F1=0.89), with balanced precision (0.90) and recall (0.89). Model performance was consistent across demographic subgroups.
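For readers who want to relate the reported precision and recall to F1, the short sketch below applies the standard harmonic-mean formula to the figures quoted above. It assumes that each precision/recall pair was aggregated in the same way as the corresponding F1 (for example, micro-averaged); under macro-averaging, F1 need not equal the harmonic mean of the averaged precision and recall. The RBS F1 computed here is implied by that assumption and is not a value reported in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the Results paragraph.
ensemble_f1 = f1_score(0.90, 0.89)  # late-fusion ensemble, domain level
rbs_f1 = f1_score(0.97, 0.62)       # rule-based system, domain level

print(f"Ensemble F1: {ensemble_f1:.2f}")
print(f"Implied RBS F1: {rbs_f1:.2f}")
```

The ensemble value agrees with the reported F1 of 0.89. The gap between that and the implied RBS value shows how low recall pulls F1 down even when precision is very high.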
Conclusions:
Recent GPT models with advanced reasoning capabilities, including the newly released “mini” models (e.g., o4-mini and GPT-5-mini), demonstrated strong performance for SDoH extraction without task-specific fine-tuning and consistently outperformed the rule-based NLP system. Integrating rule-based and LLM-based methods via late fusion further enhanced performance. Our results demonstrate a scalable and cost-efficient framework for the accurate identification of SDoH from clinical text, facilitating downstream population health research and clinical informatics applications.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.