Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Apr 21, 2021
Open Peer Review Period: Apr 21, 2021 - May 11, 2021
Date Accepted: Jul 26, 2021
Development of patient level cancer prediction models from a nationwide patient cohort: Model development and validation
ABSTRACT
Background:
Nationwide population-based cohorts provide a new opportunity to build automated risk prediction models at the patient level, as claim data are one of the useful resources to that end. To avoid unnecessary diagnostic interventions after cancer screening tests, patient-level prediction models should be developed.
Objective:
We aimed to develop cancer prediction models that are explainable and easily applicable in real-world environments, using nationwide claim databases and machine learning algorithms.
Methods:
As source data, we used the Korean National Insurance System Database. Every Korean aged ≥40 years undergoes a national health check-up every two years. We gathered all variables from the database, including demographic information, basic laboratory values, anthropometric values, and previous medical history. We applied conventional logistic regression, light gradient boosting, neural networks, and survival analysis, as well as one-class embedding classifier methods, which are based on deep learning anomaly detection, to analyze high-dimensional data effectively. Performance was measured with the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). We validated our models externally with a health check-up database from a tertiary hospital.
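To make the two performance measures concrete, the following is a minimal sketch of how AUROC and AUPRC can be computed from binary outcome labels and predicted risk scores. It is not the authors' evaluation code: the function names `auroc` and `auprc` are hypothetical, AUPRC is approximated here by step-wise average precision (one common convention), and the labels and scores are made-up toy values, not study data.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive case is scored above a random
    negative case; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auprc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: sum of precision values weighted by the increase
    in recall at each distinct score threshold."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    prev_recall = 0.0
    area = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every case that shares this score so ties form one step.
        while i < len(ranked) and ranked[i][0] == threshold:
            true_pos += ranked[i][1]
            i += 1
        precision = true_pos / i
        recall = true_pos / n_pos
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Toy data (illustrative only): 1 = cancer diagnosed, 0 = no cancer.
toy_labels = [0, 0, 1, 0, 1, 0, 0, 1]
toy_scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.20, 0.05, 0.80]

print(f"AUROC = {auroc(toy_labels, toy_scores):.3f}")
print(f"AUPRC = {auprc(toy_labels, toy_scores):.3f}")
```

When the outcome is rare, a classifier that ranks at random has an AUPRC close to the outcome prevalence, whereas its AUROC stays near 0.5, so reading the two measures together is generally more informative than reading either alone.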
Results:
The one-class embedding classifier model achieved the highest AUROC scores, with values of 0.868, 0.849, 0.798, 0.746, 0.800, 0.749, and 0.790 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively. For AUPRC, the light gradient boosting models had the highest scores, with values of 0.383, 0.401, 0.387, 0.300, 0.385, 0.357, and 0.296 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively.
Conclusions:
Our results show that applicable cancer prediction models can be developed easily from nationwide claim data using machine learning. The seven models have acceptable performance and explainability, and they can be distributed easily in real-world environments.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.