Accepted for/Published in: JMIR AI
Date Submitted: Jul 4, 2022
Open Peer Review Period: Jul 4, 2022 - Aug 29, 2022
Date Accepted: Mar 18, 2023
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Pat2Vec: ICD-10 patient embeddings based on 10 mio cohort for healthcare prediction tasks
ABSTRACT
Background:
ICD-10-based diagnoses in claims data are a crucial part of patients' electronic health records (EHRs). Any analysis that uses patients' diagnosis codes to predict subsequent outcomes requires a numerical representation of these string-encoded profiles. So far, the most common approach has been binary encoding, usually restricted to a subset of diagnoses. In real-world applications, several problems arise: patients' profiles show high variability even when the underlying diseases are the same; profiles may have gaps and do not contain all available information; they cannot be shared without serious privacy concerns; and a large number of relevant diagnoses has to be considered.
Objective:
We present a self-supervised machine learning framework, inspired by natural language processing, that embeds complete ICD-10 diagnosis profiles into a privacy-preserving, real-valued numerical vector of small and flexible size.
Methods:
Multiple alternative vectorizations were evaluated using supervised machine learning algorithms on a set of relevant calibration tasks against a baseline model that binary-encodes a list of the most common diagnoses. In an additional analysis, we identified clusters and visualized the patient vectors in two dimensions to describe subpopulations in the context of healthcare. Furthermore, we tested our vectorization model on the healthcare-relevant task of predicting prospective drug prescription costs from patients' diagnosis histories.
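To make the baseline concrete, the following is a minimal sketch of binary encoding over the most common diagnoses. It is an illustration under stated assumptions, not the authors' implementation: the function names `fit_top_codes` and `binary_encode`, the toy patient profiles, and the choice to count each code once per patient are all hypothetical.

```python
from collections import Counter


def fit_top_codes(profiles, k):
    """Return the k most frequent ICD-10 codes across patients.

    Assumption: each code is counted at most once per patient.
    """
    counts = Counter(code for profile in profiles for code in set(profile))
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [code for code, _ in ranked[:k]]


def binary_encode(profile, vocabulary):
    """Map one diagnosis profile to a 0/1 vector aligned with `vocabulary`."""
    present = set(profile)
    return [1 if code in present else 0 for code in vocabulary]


# Hypothetical toy profiles (made-up patients, not data from the study).
profiles = [
    ["E11", "I10", "E78"],
    ["I10", "J45"],
    ["E11", "I10", "M54", "M54"],
]

vocabulary = fit_top_codes(profiles, k=3)
vectors = [binary_encode(p, vocabulary) for p in profiles]

for vector in vectors:
    print(vector)
```

In a vector of this kind, each dimension corresponds to exactly one diagnosis code, so any code outside the chosen vocabulary (here `J45` and `M54`) is dropped entirely. A learned embedding of the same dimension is not tied to individual codes in this way.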
Results:
Our final models outperform the baseline model of equal dimension. They are also more robust to missing data, and their gains are larger at lower dimensions, which illustrates how the embedding procedure compresses non-linear information.
Conclusions:
We envision multiple applications for the resulting numerical vector embeddings that will benefit the quality of healthcare, including personalized prevention recommendations, signal detection in patient/drug safety and surveillance, patient clustering in healthcare resource planning, as well as statistical matching in observational data.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.