Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 21, 2021
Open Peer Review Period: Apr 21, 2021 - May 11, 2021
Date Accepted: Jul 26, 2021

The final, peer-reviewed published version of this preprint can be found here:

Patient-Level Cancer Prediction Models From a Nationwide Patient Cohort: Model Development and Validation

Lee E, Jung S, Hwang HJ, Jung J

JMIR Med Inform 2021;9(8):e29807

DOI: 10.2196/29807

PMID: 34459743

PMCID: 8438609

Development of patient level cancer prediction models from a nationwide patient cohort: Model development and validation

  • Eunsaem Lee
  • Seyoung Jung
  • Hyung Ju Hwang
  • Jaewoo Jung

ABSTRACT

Background:

Nationwide population-based cohorts provide a new opportunity to build automated patient-level risk prediction models, and claims data are a useful resource to that end. Patient-level prediction models should be developed to avoid unnecessary diagnostic interventions after cancer screening tests.

Objective:

We aimed to develop cancer prediction models from nationwide claims databases using machine learning algorithms that are explainable and easily applicable in real-world environments.

Methods:

As source data, we used the Korean National Insurance System Database. Every Korean aged ≥40 years undergoes a national health check-up every 2 years. We gathered all variables from the database, including demographic information, basic laboratory values, anthropometric values, and previous medical history. We applied conventional logistic regression, light gradient boosting, neural networks, and survival analysis, as well as a one-class embedding classifier, a deep learning-based anomaly detection method, to analyze the high-dimensional data effectively. Performance was measured with the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). We validated our models externally with a health check-up database from a tertiary hospital.
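
To make the two performance measures concrete, the following is a minimal sketch of how AUROC and AUPRC can be computed from binary labels and predicted risk scores. It is not the authors' evaluation code. The function names `auroc` and `auprc` and the toy data are illustrative assumptions. AUPRC is approximated here by average precision, and tied scores are assumed absent in the AUPRC calculation.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auprc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUPRC approximated by average precision: the mean of the precision
    values observed at the rank of each positive case, with cases sorted
    by descending score. Assumes no tied scores."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total_precision = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total_precision += hits / rank
    return total_precision / n_pos


# Hypothetical toy data: 1 = cancer diagnosed, 0 = no cancer.
toy_labels = [0, 0, 1, 0, 1, 0, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.90, 0.20, 0.30, 0.70]

print(f"AUROC = {auroc(toy_labels, toy_scores):.3f}")
print(f"AUPRC = {auprc(toy_labels, toy_scores):.3f}")
```

Because AUPRC depends on the proportion of positive cases, it is typically much lower than AUROC when the outcome is rare, which is why both measures are worth reporting side by side.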

Results:

The one-class embedding classifier achieved the highest AUROC, with values of 0.868, 0.849, 0.798, 0.746, 0.800, 0.749, and 0.790 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively. For AUPRC, the light gradient boosting models had the highest scores, with values of 0.383, 0.401, 0.387, 0.300, 0.385, 0.357, and 0.296 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively.

Conclusions:

Our results show that applicable cancer prediction models can be developed easily from nationwide claims data using machine learning. The seven models have acceptable performance and explainability and can be distributed easily in real-world environments.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.