Accepted for/Published in: JMIR Public Health and Surveillance
Date Submitted: Oct 22, 2024
Date Accepted: Jan 29, 2025
Identifying Data-Driven Clinical Subgroups for Cervical Cancer Prevention with Machine Learning: A Population-based, External, and Diagnostic Validation Study
ABSTRACT
Background:
Successful scale-up of high-performance and cost-effective cervical cancer prevention (CCP) is key to identifying gaps and progressing towards cervical cancer elimination.
Objective:
We aimed to propose a computational phenomapping strategy to discover CCP subgroups with differential risks of cervical cancer and to validate them on population-representative data.
Methods:
We explored data-driven CCP subgroups by applying unsupervised machine learning to a deeply phenotyped, population-based discovery cohort. We estimated CCP-specific risks of cervical intraepithelial neoplasia grade 2 or worse (CIN2+) and grade 3 or worse (CIN3+) through weighted logistic regression analyses, which provided odds ratio (OR) estimates. We trained a supervised machine learning model and developed pathways to classify individuals, before evaluating diagnostic validity and usability in an external cohort.
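As an illustration of how a weighted logistic regression yields an OR, the following is a minimal sketch and not the study's code. It assumes a single binary indicator (membership in one CCP subgroup versus a reference subgroup), treats the weights as simple frequency weights, and uses hypothetical counts; the function and variable names are invented for this example.

```python
import math

def weighted_logistic_or(exposed, outcome, weights, iters=50):
    """Fit logit P(outcome=1) = b0 + b1*exposed by weighted Newton-Raphson
    and return exp(b1), the odds ratio for the exposed group."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in zip(exposed, outcome, weights):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            r = w * (y - p)          # weighted score contribution
            v = w * p * (1.0 - p)    # weighted information contribution
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step for the intercept
        d1 = (h00 * g1 - h01 * g0) / det   # Newton step for the slope
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return math.exp(b1)

# Hypothetical grouped data (NOT from the study): each row is one cell of a
# 2x2 table, with the weight giving the (weighted) number of women in it.
exposed = [0, 0, 1, 1]              # 0 = reference subgroup, 1 = subgroup of interest
outcome = [0, 1, 0, 1]              # 1 = CIN2+ detected
weights = [90.0, 10.0, 60.0, 40.0]

odds_ratio = weighted_logistic_or(exposed, outcome, weights)
```

With one binary predictor, the fitted OR coincides with the weighted cross-product ratio of the 2x2 table, here (40 × 90) / (60 × 10). Confidence intervals under survey weighting generally require a design-based or robust variance estimator, which this sketch does not attempt.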
Results:
We included 551,934 and 47,130 women from the discovery and external cohorts, respectively. After identifying five CCP subgroups, we labelled them as (0) healthy, (1) early onset, (2) screening-targeted, (3) late onset, and (4) carcinoma-specific. In external validation, CCP subgroups were similar across datasets. In internal and external diagnostic validity analyses, women in CCP2-4 exhibited differential and increased risk of both CIN2+ (CCP2: OR 5.54, 95% CI [3.27-8.86]; CCP3 & 4: OR 26.56, 95% CI [24.44-28.88]) and CIN3+. CCP-specific risks of CIN2+/CIN3+ were evident in almost all subgroups. We proposed a computational phenomapping strategy and developed a prototype app to promote translation into real-world screening.
Conclusions:
Across six data sources, multiple machine learning algorithms, and multiple validation methods, we identified five CCP subgroups with good accuracy and diagnostic validity for CIN2+/CIN3+ within and across cohorts, and proposed a triple screening strategy. This new substratification and strategy might provide the global potential to tailor and target adequate follow-up surveillance visits and early treatment with prioritization of those in greatest need, thereby facilitating precision medicine towards cervical cancer elimination.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.