Accepted for/Published in: JMIR AI

Date Submitted: Jul 4, 2022
Open Peer Review Period: Jul 4, 2022 - Aug 29, 2022
Date Accepted: Mar 18, 2023

The final, peer-reviewed published version of this preprint can be found here:

Steiger E, Kroll LE

Patient Embeddings From Diagnosis Codes for Health Care Prediction Tasks: Pat2Vec Machine Learning Framework

JMIR AI 2023;2:e40755

DOI: 10.2196/40755

PMID: 38875541

PMCID: 11041498

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Pat2Vec: ICD-10 patient embeddings based on a 10 million cohort for healthcare prediction tasks

  • Edgar Steiger
  • Lars Eric Kroll

ABSTRACT

Background:

ICD-10-based diagnoses in claims data are a crucial part of patients' electronic health records (EHRs). Any analysis that uses patients' diagnosis codes to predict subsequent outcomes requires a numerical representation of these string-encoded profiles. So far, binary encodings, usually of a subset of diagnoses, have most often been used for this purpose. In real-world applications, several problems arise: patients' profiles show high variability even with the same underlying diseases, they may have gaps and do not contain all available information, they cannot be shared without serious privacy concerns, and a large number of appropriate diagnoses have to be considered.
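To make the binary encoding mentioned above concrete, the following is a minimal sketch in plain Python. The toy profiles and the helper names `fit_vocabulary` and `binary_encode` are illustrative assumptions and are not taken from the paper; the sketch only shows what "binary encoding a subset of diagnoses" typically looks like.

```python
from collections import Counter

def fit_vocabulary(profiles, top_k):
    """Pick the top_k most frequent codes, counting each code once per patient."""
    counts = Counter(code for profile in profiles for code in set(profile))
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [code for code, _ in ranked[:top_k]]

def binary_encode(profile, vocabulary):
    """Return a 0/1 vector with one position per vocabulary code."""
    present = set(profile)
    return [1 if code in present else 0 for code in vocabulary]

# Toy diagnosis profiles (hypothetical patients, three-character ICD-10 codes).
profiles = [
    ["E11", "I10", "E78"],
    ["I10", "J45"],
    ["E11", "I10"],
]

vocabulary = fit_vocabulary(profiles, top_k=2)
encoded = [binary_encode(profile, vocabulary) for profile in profiles]
print(vocabulary)
print(encoded)
```

In this toy example the first and third patients receive the same vector even though their profiles differ, because every code outside the chosen subset is discarded. That loss of information is one of the limitations the background describes.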

Objective:

We present a self-supervised machine learning framework, inspired by natural language processing, that embeds complete ICD-10 diagnosis profiles into a privacy-preserving, real-valued numerical vector of small and flexible size.
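The abstract does not specify the model architecture, so the following is only a sketch of one common NLP-style setup, not the authors' method. It treats a patient's profile as a "sentence" of codes, derives (target, context) pairs that a word2vec-like model could use as its self-supervised training signal, and pools per-code vectors into a fixed-size patient vector. The helper names and the two-dimensional code vectors are hypothetical placeholders; in a real pipeline the vectors would be learned.

```python
from itertools import permutations

def cooccurrence_pairs(profile):
    """All ordered (target, context) pairs of distinct codes in one profile."""
    codes = sorted(set(profile))
    return list(permutations(codes, 2))

def mean_pool(profile, code_vectors):
    """Average the vectors of all known codes in a profile; None if none are known."""
    vectors = [code_vectors[code] for code in sorted(set(profile)) if code in code_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

# Placeholder code vectors (assumed for illustration, not learned).
code_vectors = {"E11": [1.0, 0.0], "I10": [0.0, 1.0]}

pairs = cooccurrence_pairs(["I10", "E11", "I10"])
patient_vector = mean_pool(["I10", "E11", "I10"], code_vectors)
print(pairs)
print(patient_vector)
```

The output dimension here is fixed by the code vectors rather than by the number of distinct diagnoses, which is what allows a "small and flexible size" regardless of how many codes appear in the data.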

Methods:

Multiple alternative vectorizations were evaluated using supervised machine learning algorithms on a set of relevant calibration tasks against a baseline model that binary encodes a list of the most common diagnoses. In an additional analysis, we identified clusters and visualized the patient vectors in two dimensions to describe subpopulations in the context of healthcare. Furthermore, we tested our vectorization model on the healthcare-relevant task of predicting prospective drug prescription costs from patients' diagnosis histories.

Results:

Our final models surpass the performance of the baseline model of equal dimension. They also show greater robustness to missing data and larger gains at lower dimensions, which illustrates how the embedding procedure compresses non-linear information.
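The abstract does not describe how robustness to missing data was measured. One possible way to probe it, shown here purely as an assumption, is to remove codes from each profile at random before vectorization and compare downstream performance against the complete profiles. The function name `drop_codes` and the drop probability are hypothetical.

```python
import random

def drop_codes(profile, drop_prob, rng):
    """Return a copy of the profile with each code independently removed with probability drop_prob."""
    return [code for code in profile if rng.random() >= drop_prob]

rng = random.Random(0)  # fixed seed so the simulated gaps are reproducible
profile = ["E11", "I10", "E78", "J45"]
degraded = drop_codes(profile, 0.5, rng)
print(degraded)
```

Feeding both `profile` and `degraded` through the same vectorization and prediction pipeline would then show how much each representation suffers from the simulated gaps.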

Conclusions:

We envision multiple applications for the resulting numerical vector embeddings that will benefit the quality of healthcare, including personalized prevention recommendations, signal detection in patient/drug safety and surveillance, patient clustering in healthcare resource planning, as well as statistical matching in observational data.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.