Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Sep 8, 2025
Open Peer Review Period: Sep 9, 2025 - Nov 4, 2025
Date Accepted: Mar 9, 2026
Applications of Natural Language Processing and Large Language Models for Social Determinants of Health: A Systematic Review
ABSTRACT
Background:
Social determinants of health (SDOH) are the social, economic, and environmental conditions that influence health outcomes. SDOH information is often embedded in unstructured text, such as notes in electronic health records (EHRs) and social media posts. Advances in natural language processing (NLP), particularly the ubiquity of large language models (LLMs), offer emerging opportunities to extract, analyze, and interpret SDOH information from these sources and relate it to clinical outcomes. However, existing NLP studies are scattered across disciplines, use varied methodologies, and differ in quality and scope, making it difficult to draw cohesive insights or benchmark progress.
Objective:
This systematic review aims to map the current landscape of NLP and LLM applications in SDOH-related research. Specifically, it identifies common NLP task areas, models, data sources, evaluation practices, and key findings, while highlighting methodological gaps and opportunities for future work.
Methods:
Following PRISMA guidelines, we searched PubMed, Web of Science, IEEE Xplore, Scopus, PsycINFO, Health Source: Academic Nursing, and ACL Anthology to find studies published in English between 2014 and 2024. Eligible studies used NLP methods, including deep learning, transformer models, and LLMs, to identify, classify, or predict SDOH from text. Screening and data extraction were conducted by independent reviewers, with conflicts resolved by consensus. The review protocol was registered in PROSPERO (registration number: CRD42024578082).
Results:
A total of 129 studies met the inclusion criteria. The field has grown rapidly since 2021 (79% of studies were published in 2021-2024). EHRs were the most common data source, although institutional access restrictions were frequent. Most studies focused on extraction or classification tasks, using transformer-based models such as BERT (n=28) and LLMs (n=13). Housing instability (46.9%), financial context (46.2%), employment (39.2%), substance use (32.3%), and social connection or isolation (31.5%) were among the most studied SDOH. Although several studies reported strong model performance, reproducibility remains limited due to restricted data and code availability.
Conclusions:
This systematic review highlights the expanding role of NLP and LLMs in SDOH research and the potential to support scalable, data-driven approaches to address health disparities. Future work should prioritize longitudinal analysis, structural determinants, public benchmarks, and real-world implementation. Enhancing transparency and inclusivity in datasets and model development is critical to realizing the promise of NLP for equitable health outcomes.
Clinical Trial:
International Registered Report Identifier (IRRID): DERR1-10.2196/66094. Review registered on PROSPERO 2024 CRD42024578082. JMIR Research Protocol 2025;14:e66094. doi:10.2196/66094
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.