JMIR Preprints #40755: Pat2Vec: ICD-10 patient embeddings based on 10 mio cohort for healthcare prediction tasks

Pat2Vec: ICD-10 patient embeddings based on 10 mio cohort for healthcare prediction tasks

Edgar Steiger; Lars Eric Kroll

ABSTRACT

Background:

ICD-10-based diagnoses in claims data are a crucial part of patients' electronic health records (EHRs). Any analysis that uses patients' diagnosis codes to predict subsequent outcomes requires a numerical representation of these string-encoded profiles. So far, the most common approach has been binary encoding, usually on a subset of diagnoses. In real-world applications, several problems arise: patients' profiles show high variability even with the same underlying diseases, they may have gaps and do not contain all available information, they cannot be shared without serious privacy concerns, and a large number of appropriate diagnoses has to be considered.

Objective:

We present a self-supervised machine learning framework, inspired by natural language processing, that embeds complete ICD-10 diagnosis profiles into a privacy-preserving, real-valued numerical vector of small and flexible size.
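To make the idea of a fixed-size, real-valued patient vector concrete, the following is a minimal sketch and explicitly not the Pat2Vec model: it involves no training at all. As a stand-in for learned embeddings, each diagnosis code is assigned a deterministic pseudo-random vector, and a profile is represented by the average of its code vectors. The function names `code_vector` and `patient_vector`, the dimension values, and the example codes are assumptions made for illustration.

```python
import hashlib
import random

def code_vector(code, dim):
    """Deterministic pseudo-random vector for one diagnosis code.

    Assumption: this replaces a learned code embedding purely for illustration.
    """
    # Derive a stable seed from the code string so the vector is reproducible.
    seed = int.from_bytes(hashlib.sha256(code.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def patient_vector(profile, dim):
    """Average the code vectors of a profile into one vector of length dim."""
    if not profile:
        return [0.0] * dim
    vectors = [code_vector(code, dim) for code in profile]
    # zip(*vectors) groups the i-th components of all code vectors together.
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Hypothetical profile of three-character ICD-10 codes, not data from the study.
profile = ["I10", "E11", "E78"]
vec_small = patient_vector(profile, dim=8)
vec_large = patient_vector(profile, dim=16)
```

The only point of the sketch is the shape of the mapping: a profile of any length becomes a real-valued vector whose dimension is chosen freely. Whether such a vector is useful or privacy preserving depends on how the code vectors are learned, which this sketch does not address.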

Methods:

Multiple alternative vectorizations were evaluated using supervised machine learning algorithms on a set of relevant calibration tasks against a baseline model that binary-encodes a list of the most common diagnoses. In an additional analysis, we identified clusters and visualized the patient vectors in two dimensions to describe subpopulations in the context of healthcare. Furthermore, we tested our vectorization model on the healthcare-relevant task of predicting prospective drug prescription costs from patients' diagnosis histories.
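The baseline described above, binary encoding of the most common diagnoses, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `top_k_codes` and `binary_encode`, the value of `k`, and the toy profiles are all hypothetical.

```python
from collections import Counter

def top_k_codes(profiles, k):
    """Return the k codes that occur in the most patient profiles."""
    # set(profile) counts each code at most once per patient.
    counts = Counter(code for profile in profiles for code in set(profile))
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [code for code, _ in ranked[:k]]

def binary_encode(profile, vocabulary):
    """Multi-hot encode one profile: 1 if the patient has the code, else 0."""
    present = set(profile)
    return [1 if code in present else 0 for code in vocabulary]

# Hypothetical toy profiles (three-character ICD-10 codes), not data from the study.
profiles = [
    ["I10", "E11", "E78"],
    ["I10", "J45"],
    ["E11", "I10", "M54"],
    ["J45"],
]

vocabulary = top_k_codes(profiles, k=3)
encoded = [binary_encode(p, vocabulary) for p in profiles]
```

In this toy data the first and third patients receive the same vector, because the codes that distinguish them fall outside the top-k vocabulary. This is the kind of information loss that restricting the encoding to a subset of diagnoses entails.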

Results:

Our final models outperform the baseline model of equal dimension. They are also more robust to missing data, and their gains over the baseline are larger at lower dimensions, which illustrates how the vectorization embedding procedure compresses non-linear information.

Conclusions:

We envision multiple applications for the resulting numerical vector embeddings that will benefit the quality of healthcare, including personalized prevention recommendations, signal detection in patient/drug safety and surveillance, patient clustering in healthcare resource planning, as well as statistical matching in observational data.

Citation

Please cite as:

Steiger E, Kroll LE

Patient Embeddings From Diagnosis Codes for Health Care Prediction Tasks: Pat2Vec Machine Learning Framework

JMIR AI 2023;2:e40755

DOI: 10.2196/40755

PMID: 38875541

PMCID: 11041498

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR AI

Date Submitted: Jul 4, 2022

Open Peer Review Period: Jul 4, 2022 - Aug 29, 2022

Date Accepted: Mar 18, 2023
