Accepted for/Published in: JMIR Medical Informatics
Date Submitted: May 6, 2020
Date Accepted: Jun 5, 2021
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Representation on feature and patient levels from structured Electronic Medical Records based on Skip-gram algorithm
ABSTRACT
Background:
The secondary use of structured electronic medical record (sEMR) data is challenging because of the diversity, sparsity, and high dimensionality of the data representation.
Objective:
We aimed to explore the feasibility of the embedding-based feature and patient representation for sEMR data and demonstrate the efficiency and superiority of the embedding-based patient representation.
Methods:
The training corpus consisted of records of 104,752 hospitalized patients with 21 variables, including demographic characteristics, disease diagnoses, procedures, medications, laboratory tests, and other hospitalization indicators. Discrete values of the original categorical variables and of the binned continuous variables were treated as words (concepts), and each record was thus treated as a sentence in a text. To eliminate the influence of concept order on the embedding algorithm, we randomly shuffled the concepts within each sentence 20 times. For each patient record, every feature concept was embedded into a 200-dimensional real-valued vector using the Skip-gram algorithm, and the average of all concept embedding vectors represented the patient. To assess the effectiveness of the embedding-based feature representations, we used the cosine distances among the features’ embedding vectors to capture latent relationships among the concepts of different features. We further conducted cluster analysis on stroke patients to evaluate and compare the efficiency and superiority of the embedding-based patient representation; the embedding vectors were trained on either all patients or only the stroke patients, each with and without concept shuffling. Two representations, multi-hot codes and one-hot codes plus the original continuous values, served as benchmarks.
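The preprocessing and aggregation steps described in the Methods can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the feature names, bin edges, 3-dimensional toy embeddings, and helper names (`to_concepts`, `shuffled_copies`, `patient_vector`, `cosine_distance`) are all assumptions made for the example, and the Skip-gram training itself (200 dimensions in the study) is not reproduced here.

```python
import math
import random


def to_concepts(record, bins):
    """Turn one record (a dict) into a 'sentence' of feature=value concepts.

    Continuous features listed in `bins` are replaced by a bin label;
    categorical features keep their discrete value.
    """
    sentence = []
    for feature, value in record.items():
        if feature in bins:
            # Bin index = number of (hypothetical) edges the value reaches.
            index = sum(value >= edge for edge in bins[feature])
            sentence.append(f"{feature}=bin{index}")
        else:
            sentence.append(f"{feature}={value}")
    return sentence


def shuffled_copies(sentence, times=20, seed=0):
    """Return `times` randomly shuffled copies of a sentence."""
    rng = random.Random(seed)
    copies = []
    for _ in range(times):
        copy = list(sentence)
        rng.shuffle(copy)
        copies.append(copy)
    return copies


def patient_vector(sentence, embeddings):
    """Average the embedding vectors of all concepts in a record."""
    vectors = [embeddings[c] for c in sentence if c in embeddings]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


# Hypothetical record, bin edges, and toy embeddings (illustration only).
record = {"sex": "F", "age": 67, "diagnosis": "I63"}
bins = {"age": [40, 60, 80]}
sentence = to_concepts(record, bins)
corpus = shuffled_copies(sentence, times=20)

toy_embeddings = {
    "sex=F": [1.0, 0.0, 0.0],
    "age=bin2": [0.0, 1.0, 0.0],
    "diagnosis=I63": [0.0, 0.0, 1.0],
}
patient = patient_vector(sentence, toy_embeddings)
```

In practice the shuffled sentences would be passed to a Skip-gram trainer, and `toy_embeddings` would be replaced by the learned concept vectors.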
Results:
According to the Silhouette index, stroke patients were clustered into two groups, characterized by patients with a primary diagnosis of hemorrhagic stroke (HS) and ischemic stroke (IS), respectively. Cluster analyses conducted on patients with the embedding representations showed higher applicability (Hopkins statistic, 0.925), higher aggregation (Silhouette index, 0.862), and lower dispersion (Davies-Bouldin index, 0.551) than those conducted on patients with the benchmark representations. The two clusters obtained with the embedding-based representation learned from all records after concept shuffling achieved the highest F1-scores: 0.944 for IS and 0.717 for HS.
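For readers unfamiliar with the two internal cluster-validity indices reported above, the following sketch computes them from their standard definitions. It assumes Euclidean distance, which the abstract does not specify, and uses four made-up 2-dimensional points rather than study data; the function names are illustrative.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette_index(points, labels):
    """Mean silhouette score; higher values mean tighter, better-separated clusters."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from p to every other point, grouped by cluster.
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(euclidean(p, q))
        own = dists.get(labels[i])
        if not own:
            scores.append(0.0)  # singleton cluster: score 0 by convention
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin_index(points, labels):
    """Davies-Bouldin index; lower values mean less dispersed clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids = {
        lab: [sum(col) / len(pts) for col in zip(*pts)]
        for lab, pts in clusters.items()
    }
    # Scatter = mean distance of a cluster's points to its centroid.
    scatter = {
        lab: sum(euclidean(p, centroids[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Toy data: two well-separated clusters (illustration only).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
sil = silhouette_index(points, labels)
dbi = davies_bouldin_index(points, labels)
```

Both functions assume at least two clusters with distinct centroids.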
Conclusions:
The feature-level embeddings can, to some degree, reflect the potential associations among medical concepts. The patient-level embeddings can readily be used as continuous input to standard machine learning algorithms and can improve their performance. We expect the embedding-based representation to be helpful in a wide range of secondary uses of sEMR data.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.