Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Representing Injuries in Trauma Patients: Development and Evaluation of Embeddings for Injuries
ABSTRACT
Background:
Trauma patients present with heterogeneous injury patterns that are challenging to represent in statistical models. Traditional approaches either use high-dimensional one-hot encoding, resulting in sparse features, or aggregate injuries into summary scores that lose patient-specific detail.
Objective:
This study developed data-driven ICD-10 embeddings for trauma injuries and evaluated their ability to preserve injury information.
Methods:
Using the National Trauma Data Bank, we trained autoencoder models on all trauma patients from 2018 to generate dense vector representations of ICD-10 injury codes. We evaluated embeddings of dimensions 2, 4, 8, 16, and 32 against one-hot encoding using three prediction tasks: in-hospital mortality, emergency department disposition, and blood transfusion within 24 hours. For each hospital included, we trained separate logistic regression and LightGBM models using 2018 data from that hospital, then evaluated performance on 2019 data from the same hospital. Performance was measured using area under the receiver operating characteristic curve (AUC) and stratified by hospital size.
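To make the two injury representations compared above concrete, the following minimal sketch contrasts a one-hot (multi-hot) patient vector with a dense patient vector built from per-code embeddings. It is an illustration under stated assumptions, not the authors' implementation: the four-code vocabulary, the 2-dimensional embedding values, and the helper names `multi_hot` and `embed_patient` are all hypothetical, and summing code vectors is only one plausible way to pool codes (it corresponds to applying a bias-free linear encoder layer to the multi-hot vector). The abstract does not state how the study combines codes into a patient-level vector.

```python
from typing import Dict, List, Sequence

# Hypothetical vocabulary of ICD-10 injury codes (illustrative only,
# not the code list used in the study).
VOCAB: List[str] = ["S06.5X0A", "S27.0XXA", "S36.031A", "S72.001A"]

# Hypothetical 2-dimensional embeddings. In a real pipeline these values
# would be learned (for example, as autoencoder weights), not hand-written.
EMBEDDINGS: Dict[str, List[float]] = {
    "S06.5X0A": [0.5, 0.25],
    "S27.0XXA": [0.25, 0.75],
    "S36.031A": [0.125, 0.5],
    "S72.001A": [-0.5, 0.25],
}


def multi_hot(codes: Sequence[str], vocab: Sequence[str] = VOCAB) -> List[float]:
    """Sparse representation: one position per code in the vocabulary."""
    present = set(codes)
    return [1.0 if code in present else 0.0 for code in vocab]


def embed_patient(
    codes: Sequence[str], embeddings: Dict[str, List[float]] = EMBEDDINGS
) -> List[float]:
    """Dense representation: element-wise sum of the patient's code vectors.

    Assumption: sum pooling. Duplicate codes are counted once.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for code in sorted(set(codes)):
        for i, value in enumerate(embeddings[code]):
            total[i] += value
    return total


patient_codes = ["S27.0XXA", "S72.001A"]
sparse_vector = multi_hot(patient_codes)      # length grows with the vocabulary
dense_vector = embed_patient(patient_codes)   # length fixed at the embedding dimension
```

With a realistic vocabulary of thousands of injury codes, the multi-hot vector would have thousands of mostly zero entries, while the dense vector keeps the chosen embedding dimension (2 to 32 in the study).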
Results:
In LightGBM models, 8-dimensional embeddings improved AUC compared to one-hot encoding by 0.08 (95% CI: 0.06, 0.10) in small hospitals, 0.03 (0.02, 0.04) in medium hospitals, and 0.02 (0.01, 0.02) in large hospitals, with comparable performance in major hospitals (0.00 [-0.01, 0.01]). In logistic regression, 32-dimensional embeddings showed AUC improvements of 0.03 (0.01, 0.05), 0.02 (0.01, 0.03), and 0.02 (0.02, 0.03) for small, medium, and large hospitals respectively, with similar performance in major hospitals (0.01 [0.00, 0.01]).
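The AUC differences reported above compare two models scored on the same test patients. As a rough sketch of that comparison for a single hospital, the snippet below computes AUC from its pairwise-ranking definition and takes the difference between two sets of predicted scores. The labels and scores are toy values invented for illustration and are not study data; the function name `auc` and the variable names are hypothetical. The abstract does not describe how the confidence intervals were derived, so none is computed here.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive case is scored higher
    than a random negative case, with ties counted as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy outcome labels for one hospital's test year (1 = event, 0 = no event).
labels = [1, 0, 1, 0, 0, 1]

# Toy predicted probabilities from two hypothetical models on the same patients.
scores_one_hot = [0.6, 0.5, 0.4, 0.3, 0.7, 0.8]
scores_embedding = [0.7, 0.4, 0.6, 0.3, 0.5, 0.9]

# Positive values favor the embedding-based model.
auc_difference = auc(labels, scores_embedding) - auc(labels, scores_one_hot)
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a sketch; a rank-based formulation would be the usual choice for large test sets.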
Conclusions:
ICD-10 code injury embeddings with ≥8 dimensions preserve clinically relevant information and can outperform one-hot encoding while reducing dimensionality. The embeddings and software are openly available to support further trauma research and applications.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.