JMIR Preprints #75862: Machine Learning Analysis: Data Mining and Feature Selection in Cancer Information Seeking among United States Adults

Machine Learning Analysis: Data Mining and Feature Selection in Cancer Information Seeking among United States Adults

Ying Liu;
Kesheng Wang

ABSTRACT

Background:

Feature selection is essential in machine learning (ML) for identifying relevant variables. The Boruta algorithm and the least absolute shrinkage and selection operator (LASSO) are two widely used methods.
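For orientation, the LASSO penalty can be written out explicitly. The abstract does not state the loss function the authors used; the form below assumes a logistic loss, which is the usual choice for a binary outcome such as seeker versus non-seeker. The symbols (n observations, p candidate variables, outcome y_i in {0, 1}, predictor vector x_i, penalty weight lambda) are introduced here for illustration only.

```latex
\hat{\beta}^{\mathrm{LASSO}}
  = \arg\min_{\beta_0,\,\beta}
    \left\{
      -\frac{1}{n}\sum_{i=1}^{n}
        \left[
          y_i\,(\beta_0 + x_i^{\top}\beta)
          - \log\!\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right)
        \right]
      + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
    \right\}
```

Larger values of lambda shrink more coefficients to exactly zero, and the variables with nonzero coefficients are the ones treated as selected.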

Objective:

This study aimed to (1) compare feature selection approaches (Boruta, LASSO, their combination, principal component analysis (PCA), and no feature selection), and (2) develop ML tools to predict cancer information seeking among U.S. adults.

Methods:

Data from 5505 individuals (2630 cancer information seekers and 2975 non-seekers) were drawn from the 2022 Health Information National Trends Survey (HINTS 6). Four feature selection approaches and five ML algorithms (support vector machine (SVM), logistic regression (LR), random forest (RF), k-nearest neighbor (KNN), and extreme gradient boosting (XGBoost)) were used to develop models predicting cancer information seeking.
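To make the selection mechanism concrete, the following is a minimal sketch of how an L1 penalty sets weak coefficients to exactly zero. It is not the authors' pipeline: it uses a squared-error loss rather than a logistic one, a hypothetical 8-row toy design rather than HINTS 6 data, and an arbitrary penalty `lam=0.5`. The function names are illustrative.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.

    X is a list of rows and y a list of targets; columns are assumed centred.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # covariance of feature j with the partial residual
            norm = 0.0  # sum of squares of feature j
            for i in range(n):
                # Fit from every feature except j, using current coefficients.
                fitted_without_j = sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - fitted_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta


# Toy design (hypothetical): three mutually orthogonal +/-1 columns, 8 rows.
X = [[a, b, c] for a in (1, -1) for b in (1, -1) for c in (1, -1)]
# Outcome driven strongly by column 0 and only weakly by column 2.
y = [2.0 * row[0] + 0.1 * row[2] for row in X]

beta = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("coefficients:", beta)
print("selected feature indices:", selected)
```

In this toy case only column 0 survives: its coefficient is shrunk from 2.0 to about 1.5, while the weak effect of column 2 (0.1) falls below the penalty and is dropped along with the irrelevant column 1.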

Results:

The prevalence of cancer information seeking was 47.2% (42.8% for males and 49.7% for females). Boruta and LASSO selected 45 and 55 variables, respectively, with 36 in common. PCA identified 21 uncorrelated factors. RF performed best, with similar AUCs (≈0.950) and accuracies (≈0.860) for Boruta, LASSO, and no feature selection. Using PCA-selected variables yielded a slightly lower AUC (0.931) but comparable accuracy (0.853). Stepwise regression confirmed 21 of the 36 selected key predictors, including personal/family cancer history, health information access, education, income, social media use, smoking, alcohol beliefs, and healthcare visits.
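The two metrics reported above measure different things, which is why a model can lose some AUC while keeping a similar accuracy. The sketch below computes both from scratch on six hypothetical predictions (not study data). AUC is taken as the probability that a randomly chosen seeker receives a higher score than a randomly chosen non-seeker, and accuracy assumes a 0.5 decision threshold. The abstract does not state the threshold the authors used.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Share of cases classified correctly at the given score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


# Hypothetical labels (1 = seeker, 0 = non-seeker) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

print("AUC:", round(roc_auc(y_true, scores), 3))
print("accuracy:", round(accuracy(y_true, scores), 3))
```

Here 8 of the 9 seeker/non-seeker pairs are ranked correctly (AUC of about 0.889), while 4 of the 6 cases fall on the right side of the threshold (accuracy of about 0.667). AUC depends only on the ranking of scores, whereas accuracy depends on where the threshold sits.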

Conclusions:

Feature selection effectively reduces dimensionality while retaining predictive power. Boruta and LASSO performed comparably in terms of the selected variables. PCA-based selection of uncorrelated factors also proved useful. The key identified factors associated with cancer information seeking can guide future cancer intervention and prevention strategies.
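The agreement between Boruta and LASSO can be summarised from the counts given in the Results (45 and 55 selected variables, 36 in common). The short sketch below derives the implied overlap figures. The Jaccard index is added here as one conventional overlap measure and is not reported by the authors. The function name is illustrative.

```python
def overlap_summary(n_a, n_b, n_common):
    """Derive overlap statistics for two selected-variable sets from their sizes."""
    union = n_a + n_b - n_common
    return {
        "union": union,                  # variables chosen by at least one method
        "only_a": n_a - n_common,        # chosen by the first method alone
        "only_b": n_b - n_common,        # chosen by the second method alone
        "jaccard": n_common / union,     # |A and B| / |A or B|
        "share_of_a": n_common / n_a,    # fraction of A also found in B
        "share_of_b": n_common / n_b,    # fraction of B also found in A
    }


# Counts from the abstract: Boruta = 45, LASSO = 55, shared = 36.
summary = overlap_summary(45, 55, 36)
for name, value in summary.items():
    print(name, round(value, 3))
```

With these counts, 64 distinct variables were selected by at least one method, 9 by Boruta alone and 19 by LASSO alone. The shared set covers 80% of the Boruta selection and about 65% of the LASSO selection, for a Jaccard index of about 0.56.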

Citation

Please cite as:

Liu Y, Wang K

Comparison of Feature Selection Methods in Machine Learning Models of Cancer Information Seeking Among United States Adults: Cross-Sectional Study

JMIR Med Inform 2026;14:e75862

DOI: 10.2196/75862

PMID: 42061850

PMCID: 13139833

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 11, 2025

Open Peer Review Period: Apr 23, 2025 - Jun 18, 2025

Date Accepted: Mar 12, 2026
