Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Apr 11, 2025
Open Peer Review Period: Apr 23, 2025 - Jun 18, 2025
Date Accepted: Mar 12, 2026
Machine Learning Analysis: Data Mining and Feature Selection in Cancer Information Seeking among United States Adults
ABSTRACT
Background:
Feature selection is essential in machine learning (ML) for identifying relevant variables. The Boruta algorithm and the least absolute shrinkage and selection operator (LASSO) are two widely used methods.
Objective:
This study aimed to (1) compare feature selection approaches (Boruta, LASSO, their combination, principal component analysis [PCA], and no feature selection), and (2) develop ML models to predict cancer information seeking among U.S. adults.
Methods:
Data from 5505 individuals (2630 cancer information seekers and 2975 non-seekers) were selected from the 2022 Health Information National Trends Survey (HINTS 6). Four feature selection approaches and five ML algorithms (support vector machines [SVMs], logistic regression [LR], random forest [RF], k-nearest neighbor [KNN], and extreme gradient boosting [XGBoost]) were applied to develop models predicting cancer information seeking.
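The comparison described above can be sketched in Python with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic (generated with `make_classification`, not HINTS 6), an L1-penalized logistic regression stands in for LASSO on a binary outcome, the regularization strength `C=0.05` and the 21 principal components are illustrative choices, and Boruta is omitted because it requires a separate third-party package.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for survey data (assumption: NOT the HINTS 6 data).
X, y = make_classification(n_samples=2000, n_features=60, n_informative=10,
                           n_redundant=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)


def random_forest():
    """Return a fresh random forest so each pipeline trains its own model."""
    return RandomForestClassifier(n_estimators=200, random_state=0)


# L1-penalized logistic regression plays the role of LASSO here:
# features whose coefficients shrink to (near) zero are dropped.
lasso_selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.05),
    threshold=1e-5,
)

pipelines = {
    "none": make_pipeline(StandardScaler(), random_forest()),
    "lasso": make_pipeline(StandardScaler(), lasso_selector, random_forest()),
    # 21 components is an illustrative choice, not a tuned value.
    "pca": make_pipeline(StandardScaler(), PCA(n_components=21), random_forest()),
}

results = {}
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    prob = pipe.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "auc": roc_auc_score(y_test, prob),
        "accuracy": accuracy_score(y_test, pipe.predict(X_test)),
    }

# Count how many of the 60 features survived the L1 selection step.
n_selected = int(
    pipelines["lasso"].named_steps["selectfrommodel"].get_support().sum())

for name, scores in results.items():
    print(f"{name:>5}: AUC={scores['auc']:.3f}  accuracy={scores['accuracy']:.3f}")
print("features kept by L1 selection:", n_selected, "of", X.shape[1])
```

Fitting each selector inside a pipeline keeps the selection step restricted to the training split, so the held-out AUC and accuracy are not inflated by information leaking from the test data.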
Results:
The prevalence of cancer information seeking was 47.2% (42.8% for males and 49.7% for females). Boruta and LASSO selected 45 and 55 variables, respectively, with 36 in common. PCA identified 21 uncorrelated factors. RF performed best, with similar AUCs (≈0.950) and accuracy (≈0.860) for Boruta, LASSO, and no feature selection. Using PCA-selected variables yielded a slightly lower AUC (0.931) but comparable accuracy (0.853). Stepwise regression confirmed 21 of the 36 commonly selected variables as key predictors, including personal/family cancer history, health information access, education, income, social media use, smoking, alcohol beliefs, and healthcare visits.
Conclusions:
Feature selection effectively reduced dimensionality while retaining predictive power. Boruta and LASSO performed comparably in terms of the selected variables. PCA-based selection of uncorrelated factors also proved useful. The key factors identified as associated with cancer information seeking can guide future cancer intervention and prevention strategies.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.