Accepted for/Published in: JMIR Aging

Date Submitted: Aug 7, 2024
Date Accepted: Mar 10, 2025

The final, peer-reviewed published version of this preprint can be found here:

West M, Cheng Y, He Y, Leng Y, Magdamo C, Hyman BT, Dickson JR, Serrano-Pozo A, Blacker D, Das S

Unsupervised Deep Learning of Electronic Health Records to Characterize Heterogeneity Across Alzheimer Disease and Related Dementias: Cross-Sectional Study

JMIR Aging 2025;8:e65178

DOI: 10.2196/65178

PMID: 40163031

PMCID: 11997524

Unsupervised Deep Learning of Electronic Health Records Characterizes Heterogeneity Across Alzheimer’s Disease and Related Dementias

  • Matthew West
  • You Cheng
  • Yingnan He
  • Yu Leng
  • Colin Magdamo
  • Bradley T. Hyman
  • John R. Dickson
  • Alberto Serrano-Pozo
  • Deborah Blacker
  • Sudeshna Das

ABSTRACT

Background:

Alzheimer's disease and related dementias (ADRD) exhibit prominent heterogeneity. Identifying clinically meaningful ADRD subtypes is essential for tailoring treatments to specific patient phenotypes.

Objective:

To employ unsupervised learning techniques on electronic health records (EHRs) from memory clinic patients to identify ADRD subtypes.

Methods:

We used pre-trained embeddings of non-ADRD International Classification of Diseases (ICD) diagnosis codes and large language model (LLM)-derived embeddings of clinical notes from patient EHRs. We applied hierarchical clustering to each set of embeddings to identify ADRD subtypes, then characterized the resulting clusters by their demographic and clinical features.
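The clustering step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not state the linkage criterion, distance metric, or embedding dimensionality, so average linkage, Euclidean distance, and the two-dimensional `toy_embeddings` below are all assumptions chosen for illustration. A real analysis would more likely use a library implementation such as SciPy's hierarchical clustering routines.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative_clusters(embeddings, n_clusters):
    """Naive average-linkage agglomerative clustering for small inputs.

    Returns one integer cluster label per input row.
    """
    # Start with every patient in its own cluster (lists of row indices).
    clusters = [[i] for i in range(len(embeddings))]

    def average_linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(euclidean(embeddings[i], embeddings[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected by the deletion

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical 2-D "patient embeddings" forming three well-separated groups.
toy_embeddings = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (5.0, 5.1), (5.2, 4.9),
    (10.0, 0.0), (10.1, 0.3),
]
labels = agglomerative_clusters(toy_embeddings, n_clusters=3)
print(labels)
```

Each resulting label group could then be summarized by demographic and clinical variables, which corresponds to the characterization step mentioned in the Methods.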

Results:

We analyzed a cohort of 3,454 ADRD memory clinic patients at Massachusetts General Hospital, each with a specialist diagnosis. Clustering pre-trained embeddings of the non-ADRD diagnosis codes in patient EHRs revealed three patient subtypes: one with skin conditions, another with psychiatric disorders and an earlier age of onset, and a third with diabetes complications. Similarly, clustering LLM-derived embeddings of clinical notes identified three subtypes: one with psychiatric manifestations and a higher prevalence of females (prevalence ratio: 1.59), another with cardiovascular and motor problems and a higher prevalence of males (prevalence ratio: 1.75), and a third with geriatric health disorders. Notably, we observed significant overlap between the clusters from the two data modalities.
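For readers unfamiliar with the prevalence ratios reported above, the following sketch shows one common way such a ratio is computed. The abstract does not specify the comparison group, so comparing a cluster against the rest of the cohort is an assumption here, and the counts are hypothetical values that are not taken from the study.

```python
def prevalence_ratio(cases_in_group, group_size, cases_in_reference, reference_size):
    """Ratio of the prevalence in a group to the prevalence in a reference group."""
    prevalence_group = cases_in_group / group_size
    prevalence_reference = cases_in_reference / reference_size
    return prevalence_group / prevalence_reference


# Hypothetical counts: 60 of 100 patients in a cluster are female,
# versus 80 of 200 patients outside the cluster.
pr = prevalence_ratio(60, 100, 80, 200)
print(round(pr, 2))
```

A ratio above 1 indicates that the characteristic is more common in the cluster than in the reference group, and a ratio below 1 indicates that it is less common.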

Conclusions:

By integrating ICD codes and LLM-derived embeddings, our analysis delineated two distinct ADRD subtypes with sex-specific comorbid and clinical presentations, offering insights for potential precision medicine approaches.

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.