Currently submitted to: JMIR Formative Research
Date Submitted: Mar 30, 2026
Open Peer Review Period: Apr 21, 2026 - Jun 16, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Using Natural Language Processing to Facilitate Qualitative Research: Dementia Experiences in EHR Free Text
ABSTRACT
Background:
Electronic health records (EHRs) contain extensive unstructured free text that is difficult to incorporate into qualitative research at scale. Existing natural language processing (NLP) approaches in health research primarily focus on structured data extraction or predictive modeling, leaving qualitative applications underdeveloped.
Objective:
To develop and evaluate an interpretable, rule-based NLP pipeline to curate large EHR text corpora into analytically tractable sub-corpora suitable for qualitative research.
Methods:
We applied a deterministic NLP algorithm (pyTAKES) to 161,111 free-text EHR notes from 335 participants diagnosed with dementia in the Adult Changes in Thought (ACT) study. Using a concept dictionary informed by prior qualitative research and clinical expertise, we tagged notes with semantically meaningful concepts and applied filters (eg, concept density, note type, temporal proximity to diagnosis) to distill three focused sub-corpora. We evaluated concept performance through manual review and assessed corpus relevance before and after filtration.
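The tag-then-filter workflow described in the Methods can be illustrated with a minimal Python sketch. It does not use the pyTAKES API or the study's concept dictionary: the concept names, trigger phrases, note types, thresholds, and helper functions (`tag_note`, `concept_density`, `keep_note`) are all hypothetical, chosen only to show how dictionary matching and the three filters might combine.

```python
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical concept dictionary: concept name -> trigger phrases.
# The study's actual dictionary is not reproduced here.
CONCEPTS = {
    "caregiver": ["caregiver", "daughter", "spouse"],
    "wandering": ["wandering", "got lost"],
    "driving": ["driving", "car keys"],
}

# One case-insensitive, whole-word pattern per concept.
PATTERNS = {
    name: re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    for name, terms in CONCEPTS.items()
}


@dataclass
class Note:
    note_id: str
    note_type: str
    note_date: date
    text: str


def tag_note(note):
    """Return {concept: match count} for every concept found in the note."""
    counts = {name: len(p.findall(note.text)) for name, p in PATTERNS.items()}
    return {name: n for name, n in counts.items() if n > 0}


def concept_density(note, tags):
    """Concept matches per 100 whitespace-delimited tokens."""
    n_tokens = len(note.text.split())
    return 100.0 * sum(tags.values()) / n_tokens if n_tokens else 0.0


def keep_note(note, diagnosis_date, *, min_density=5.0,
              note_types=frozenset({"telephone", "progress"}), max_days=365):
    """Apply the three illustrative filters; all thresholds are assumptions."""
    tags = tag_note(note)
    if not tags:
        return False  # no concept match at all
    if note.note_type not in note_types:
        return False  # note-type filter
    if abs((note.note_date - diagnosis_date).days) > max_days:
        return False  # temporal proximity to diagnosis
    return concept_density(note, tags) >= min_density


# Invented example notes, not study data.
notes = [
    Note("n1", "telephone", date(2010, 3, 1),
         "Daughter called; patient got lost while driving and caregiver is worried."),
    Note("n2", "progress", date(2010, 3, 5),
         "Blood pressure stable. Continue current medication."),
    Note("n3", "telephone", date(2014, 1, 1),
         "Spouse reports wandering at night."),
]
diagnosis = date(2010, 1, 15)
sub_corpus = [n.note_id for n in notes if keep_note(n, diagnosis)]
```

Because every step is a deterministic rule, each retained note can be traced back to the specific concept matches and filter thresholds that admitted it, which is the interpretability property the abstract emphasizes.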
Results:
Sixty-two percent of notes contained at least one concept match. Concept review demonstrated acceptable agreement between retrieved text and target phenomena. Filtering reduced the corpus by over 95% while increasing the proportion of caregiving-relevant notes from 23.5% to 84.5%. Each sub-corpus supported distinct qualitative research questions.
Conclusions:
NLP methods can efficiently curate large EHR text corpora for qualitative analysis. This approach offers a reproducible and resource-efficient alternative to black-box machine learning models, enabling qualitative researchers to leverage EHR data at scale.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.