Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jul 13, 2025
Open Peer Review Period: Aug 1, 2025 - Sep 26, 2025
Date Accepted: Mar 6, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Development of a Deep Learning Model to Predict 5-Year Mortality in Non-Small Cell Lung Cancer Using the Korean Central Cancer Registry
ABSTRACT
Background:
Non-small cell lung cancer (NSCLC) is one of the most common cancers and a leading cause of cancer-related mortality, making prognostic prediction clinically essential. Machine learning models are increasingly being utilized to assess prognosis; however, developing systems that combine high discrimination with clear, clinically interpretable reasoning remains challenging.
Objective:
To develop deep learning models that predict 5-year mortality in NSCLC using data from the Korea Central Cancer Registry (KCCR) and to quantify feature importance through permutation testing.
Methods:
We identified patients diagnosed between 2014 and 2017 who had complete clinical data, pulmonary function test results, histological information, genomic data, and staging details. After preprocessing, the cohort was divided into stratified training, validation, and test sets in a 70%:15%:15% ratio. Five models were tuned using Hyperband across ten predefined feature groups. The primary metric for evaluation was the area under the receiver operating characteristic curve (AUC); additional metrics reported included accuracy, F1 score, precision, and recall. Group-wise permutation importance was calculated for each model, and the concordance of importance rankings was assessed using the Friedman test. A Cox proportional hazards (CPH) model was utilized as a baseline comparator.
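As an illustration of the group-wise permutation importance step described above, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes importance is measured as the mean drop in AUC when every column belonging to one feature group is shuffled together across patients. The function names, the toy predictor, the column layout, and the "Noise" group are hypothetical; only the 'Stage' and 'Pulmonary Function Test' group names come from the abstract.

```python
import random

def auc(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) identity, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def group_permutation_importance(predict, X, y, groups, n_repeats=10, seed=0):
    """Mean AUC drop when all columns of a group are shuffled jointly across rows."""
    rng = random.Random(seed)
    baseline = auc(y, predict(X))
    drops = {}
    for name, cols in groups.items():
        total = 0.0
        for _ in range(n_repeats):
            perm = list(range(len(X)))
            rng.shuffle(perm)
            X_perm = [row[:] for row in X]
            for i, src in enumerate(perm):
                for c in cols:  # same row permutation for every column in the group
                    X_perm[i][c] = X[src][c]
            total += baseline - auc(y, predict(X_perm))
        drops[name] = total / n_repeats
    return drops

# Toy data (entirely synthetic): column 0 = stage, 1 = a lung-function value, 2 = noise.
data_rng = random.Random(1)
X = [[data_rng.randint(1, 4), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [1 if row[0] + data_rng.random() > 2.5 else 0 for row in X]

def toy_predict(rows):
    """Hypothetical stand-in for a trained model; it ignores the noise column."""
    return [row[0] - 0.5 * row[1] for row in rows]

groups = {"Stage": [0], "Pulmonary Function Test": [1], "Noise": [2]}
importance = group_permutation_importance(toy_predict, X, y, groups)
for name, drop in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: mean AUC drop = {drop:.3f}")
```

Shuffling a group's columns with one shared row permutation preserves the correlations inside the group while breaking its link to the outcome, which is why grouped importance can differ from the sum of single-feature importances. In practice the score would be computed on the held-out test set with the fitted network in place of `toy_predict`.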
Results:
All five models yielded comparable discrimination on the test set (AUC 0.875–0.879; accuracy 0.796–0.822; F1 0.815–0.846). Permuting the 'Stage' group produced the largest decrease in AUC, followed by 'Pulmonary Function Test', 'Symptoms', and 'Age'. The 'Gene Mutation' group had a modest overall impact but became more influential within the adenocarcinoma subset. The Friedman test showed no statistically significant differences in importance rankings across the models (p = .928).
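To make the concordance check concrete, here is a hedged sketch of a Friedman test in plain Python. The abstract does not specify the layout, so this assumes each feature group is a block and each model is a treatment, with the group's AUC drop as the score. It also assumes no ties within a block and uses a closed-form chi-square tail that is valid only for an even number of degrees of freedom (five models give df = 4). The function names and the demonstration table are hypothetical and are not study data.

```python
import math

def rank_within_block(values):
    """1-based ranks of the values in one block (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def friedman_test(table):
    """table[b][t] = score of treatment t in block b. Returns (Q, p)."""
    n, k = len(table), len(table[0])
    rank_sums = [0] * k
    for row in table:
        for t, r in enumerate(rank_within_block(row)):
            rank_sums[t] += r
    # Friedman chi-square statistic without tie correction.
    q = 12 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3 * n * (k + 1)
    q = max(q, 0.0)
    df = k - 1
    if df % 2:
        raise ValueError("this closed-form p-value needs an even df")
    # Chi-square survival function for even df: exp(-x/2) * sum_{i<df/2} (x/2)^i / i!
    half = q / 2
    p = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
    return q, p

# Synthetic demonstration: 5 blocks x 5 treatments, built as cyclic shifts so that
# every treatment receives each rank exactly once (perfectly balanced ranks).
base = [0.01, 0.02, 0.03, 0.04, 0.05]
balanced = [base[s:] + base[:s] for s in range(5)]
q_stat, p_value = friedman_test(balanced)
print(f"Q = {q_stat:.3f}, p = {p_value:.3f}")
```

A large p-value, as reported above, indicates no detectable systematic difference between the models' importance profiles. For real analyses, `scipy.stats.friedmanchisquare` handles ties and arbitrary degrees of freedom.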
Conclusions:
A meticulously tuned, grouped-input deep learning framework offered reliable and interpretable predictions for 5-year mortality in NSCLC. Group-level permutation importance provided stable and reproducible insights into the clinical factors influencing risk, which may guide future model refinement and clinical decision-making.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.