Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 6, 2020
Date Accepted: Jun 5, 2021

The final, peer-reviewed published version of this preprint can be found here:

Patient Representation From Structured Electronic Medical Records Based on Embedding Technique: Development and Validation Study

Huang Y, Wang N, Zhang Z, Liu H, Fei X, Wei L, Chen H

JMIR Med Inform 2021;9(7):e19905

DOI: 10.2196/19905

PMID: 34297000

PMCID: 8367145

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Representation on feature and patient levels from structured Electronic Medical Records based on Skip-gram algorithm

  • Yanqun Huang
  • Ni Wang
  • Zhiqiang Zhang
  • Honglei Liu
  • Xiaolu Fei
  • Lan Wei
  • Hui Chen

ABSTRACT

Background:

The secondary use of structured electronic medical record (sEMR) data is challenging because of the diversity, sparsity, and high dimensionality of the data representation.

Objective:

We aimed to explore the feasibility of embedding-based feature and patient representations for sEMR data and to demonstrate the efficiency and superiority of the embedding-based patient representation.

Methods:

The training corpus consisted of the records of 104,752 hospitalized patients with 21 variables, including demographic characteristics, disease diagnoses, procedures, medications, laboratory tests, and other hospitalization indicators. The discrete values of the original categorical variables and of the binned continuous variables were treated as words (concepts), and each record as a sentence. To eliminate the influence of concept order on the embedding algorithm, we randomly shuffled the concepts within each sentence 20 times. Each feature concept in a patient record was embedded into a 200-dimensional real-valued vector using the Skip-gram algorithm, and the patient was represented by the average of all of the record's concept vectors. To assess the effectiveness of the embedding-based feature representations, we used the cosine distances among the features' embedding vectors to capture the latent relationships among the concepts of different features. We then conducted cluster analysis on stroke patients to evaluate the embedding-based patient representation and compare it with benchmark representations; for this analysis, the embedding vectors were trained on either all patients or only the stroke patients, each with and without concept shuffling. Two benchmark representations were used: multi-hot codes, and one-hot codes plus the original continuous values.
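The data preparation described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the variable names, bin edges, window size, and the tiny 2-dimensional "embeddings" are all hypothetical stand-ins (the study used 200-dimensional vectors learned by Skip-gram), and it assumes that each shuffled copy of a record is added to the training corpus as a separate sentence. The `skipgram_pairs` helper only lists the (center, context) pairs that Skip-gram learns from, which shows why concept order matters when the context window is shorter than the record.

```python
import math
import random


def record_to_sentence(record, bin_edges):
    """Turn one record (variable -> value) into a list of concept tokens."""
    sentence = []
    for name, value in record.items():
        if name in bin_edges:
            # Continuous variable: replace the number by the index of its bin.
            bin_index = sum(value >= edge for edge in bin_edges[name])
            sentence.append(f"{name}=bin{bin_index}")
        else:
            # Categorical variable: the discrete value itself is the concept.
            sentence.append(f"{name}={value}")
    return sentence


def shuffled_copies(sentence, n_copies=20, seed=0):
    """Return n_copies random reorderings of the same concepts."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        copy = list(sentence)
        rng.shuffle(copy)
        copies.append(copy)
    return copies


def skipgram_pairs(sentence, window):
    """List the (center, context) pairs Skip-gram would be trained on."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs


def patient_vector(sentence, embeddings):
    """Average the concept vectors of one record, dimension by dimension."""
    vectors = [embeddings[concept] for concept in sentence]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical record with two categorical variables and one continuous one.
record = {"sex": "F", "diagnosis": "ischemic_stroke", "age": 71}
sentence = record_to_sentence(record, bin_edges={"age": [18, 45, 65]})
corpus = shuffled_copies(sentence)  # 20 reorderings of the same record

# Toy vectors standing in for trained Skip-gram embeddings.
toy_embeddings = {
    "sex=F": [1.0, 0.0],
    "diagnosis=ischemic_stroke": [0.0, 1.0],
    "age=bin3": [1.0, 1.0],
}
patient = patient_vector(sentence, toy_embeddings)
```

In practice the vectors would come from a trained Skip-gram model rather than a hand-written dictionary; only the averaging step and the cosine distance carry over unchanged.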

Results:

According to the Silhouette index, stroke patients were clustered into two groups, characterized by patients with a primary diagnosis of hemorrhagic stroke (HS) and ischemic stroke (IS), respectively. Cluster analyses of patients with the embedding representations showed higher applicability (Hopkins statistic, 0.925), higher aggregation (Silhouette index, 0.862), and lower dispersion (Davies-Bouldin index, 0.551) than those of patients with the benchmark representations. With the embedding-based representation learned from all records after concept shuffling, the two clusters achieved the highest F1 scores: 0.944 for IS and 0.717 for HS.
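For readers unfamiliar with the indices reported above, the following sketch computes the Silhouette index, the Davies-Bouldin index, and an F1 score using their standard textbook definitions. It is an illustration only: the use of Euclidean distance, the four toy points, and the function names are assumptions, not details taken from the study, and the Hopkins statistic is not covered.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def group_by_label(points, labels):
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette_index(points, labels):
    """Mean of (b - a) / max(a, b) over all points; higher is better."""
    clusters = group_by_label(points, labels)
    scores = []
    for i, (point, label) in enumerate(zip(points, labels)):
        own = [euclidean(point, q)
               for j, q in enumerate(points) if labels[j] == label and j != i]
        if not own:
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(euclidean(point, q) for q in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def davies_bouldin_index(points, labels):
    """Mean over clusters of the worst scatter-to-separation ratio; lower is better."""
    clusters = group_by_label(points, labels)
    centroids = {lab: [sum(col) / len(pts) for col in zip(*pts)]
                 for lab, pts in clusters.items()}
    scatter = {lab: sum(euclidean(p, centroids[lab]) for p in pts) / len(pts)
               for lab, pts in clusters.items()}
    worst = [
        max((scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in clusters if j != i)
        for i in clusters
    ]
    return sum(worst) / len(worst)


def f1_score(true_labels, predicted_labels, positive):
    """Harmonic mean of precision and recall for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)


# Two tight, well-separated toy clusters in the plane.
toy_points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
toy_labels = [0, 0, 1, 1]
sil = silhouette_index(toy_points, toy_labels)
dbi = davies_bouldin_index(toy_points, toy_labels)
f1 = f1_score(["IS", "IS", "HS", "HS"], ["IS", "IS", "IS", "HS"], positive="IS")
```

A Silhouette index close to 1 and a Davies-Bouldin index close to 0 both indicate compact, well-separated clusters, which is the direction of the improvement reported for the embedding representations.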

Conclusions:

The feature-level embeddings can reflect, to some degree, the potential associations among medical concepts. The patient-level embeddings can be used directly as continuous input to standard machine learning algorithms and can improve their performance. We expect the embedding-based representation to be helpful for a wide range of secondary uses of sEMR data.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.