Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 11, 2025
Open Peer Review Period: Apr 23, 2025 - Jun 18, 2025
Date Accepted: Mar 12, 2026
The final, peer-reviewed published version of this preprint can be found here:

Comparison of Feature Selection Methods in Machine Learning Models of Cancer Information Seeking Among United States Adults: Cross-Sectional Study

Liu Y, Wang K

Comparison of Feature Selection Methods in Machine Learning Models of Cancer Information Seeking Among United States Adults: Cross-Sectional Study

JMIR Med Inform 2026;14:e75862

DOI: 10.2196/75862

PMID: 34898427

Machine Learning Analysis: Data Mining and Feature Selection in Cancer Information Seeking among United States Adults

  • Ying Liu
  • Kesheng Wang

ABSTRACT

Background:

Feature selection is essential in machine learning (ML) for identifying relevant variables. The Boruta algorithm and the least absolute shrinkage and selection operator (LASSO) are two widely used methods.
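To illustrate why LASSO acts as a feature selector, the following is a minimal sketch of L1-penalised regression solved by cyclic coordinate descent. It is not the authors' implementation: the toy data, the function names, and the penalty value `alpha=0.1` are all assumptions made for illustration, and squared-error loss is used for simplicity even though a binary outcome such as information seeking would normally call for a penalised logistic model.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become exactly 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=50):
    """Minimise (1/(2n)) * ||y - Xb||^2 + alpha * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # sum of squares of feature j
            for i in range(n):
                pred_i = sum(X[i][k] * coef[k] for k in range(p))
                # Partial residual: add feature j's own contribution back in.
                rho += X[i][j] * (y[i] - pred_i + X[i][j] * coef[j])
                z += X[i][j] ** 2
            coef[j] = soft_threshold(rho / n, alpha) / (z / n)
    return coef


# Hypothetical toy data (not HINTS): three mutually orthogonal +/-1 features.
x_strong = [1, -1, 1, -1, 1, -1, 1, -1]
x_none = [1, 1, -1, -1, 1, 1, -1, -1]
x_weak = [1, 1, 1, 1, -1, -1, -1, -1]
X = [list(row) for row in zip(x_strong, x_none, x_weak)]
# The outcome depends strongly on x_strong, very weakly on x_weak, not at all on x_none.
y = [2.0 * a + 0.05 * c for a, c in zip(x_strong, x_weak)]

coef = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print("coefficients:", coef)
print("selected feature indices:", selected)
```

In this toy setting the penalty shrinks the strong coefficient from 2 to about 1.9 and sets the weak and irrelevant coefficients to exactly zero, which is the mechanism that lets LASSO discard variables rather than merely down-weight them.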

Objective:

This study aimed to (1) compare feature selection approaches (Boruta, LASSO, their combination, principal component analysis [PCA], and no feature selection), and (2) develop ML models to predict cancer information seeking among U.S. adults.

Methods:

Data from 5505 individuals (2630 cancer information seekers and 2975 non-seekers) were selected from the 2022 Health Information National Trends Survey (HINTS 6). Four feature selection approaches and five ML algorithms (support vector machine [SVM], logistic regression [LR], random forest [RF], k-nearest neighbor [KNN], and extreme gradient boosting [XGBoost]) were applied to develop models predicting cancer information seeking.

Results:

The prevalence of cancer information seeking was 47.2% (42.8% for males and 49.7% for females). Boruta and LASSO selected 45 and 55 variables, respectively, with 36 in common. PCA identified 21 uncorrelated factors. RF performed best, with similar AUCs (≈0.950) and accuracies (≈0.860) across Boruta, LASSO, and no feature selection. Using PCA-selected variables yielded a slightly lower AUC (0.931) but comparable accuracy (0.853). Stepwise regression confirmed 21 of the 36 selected key predictors, including personal/family cancer history, health information access, education, income, social media use, smoking, alcohol beliefs, and healthcare visits.
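The agreement between the two selectors can be quantified directly from the counts reported above (45 Boruta variables, 55 LASSO variables, 36 shared). The short sketch below works through that arithmetic; the helper name `overlap_summary` and the choice of the Jaccard index as a summary are illustrative assumptions, not measures reported in the study.

```python
def overlap_summary(n_a, n_b, n_common):
    """Summarise the overlap of two selected-variable sets from their sizes alone."""
    if n_common > min(n_a, n_b):
        raise ValueError("the intersection cannot exceed either set")
    union = n_a + n_b - n_common  # inclusion-exclusion
    return {
        "only_a": n_a - n_common,      # variables picked by the first method only
        "only_b": n_b - n_common,      # variables picked by the second method only
        "union": union,                # variables picked by at least one method
        "jaccard": n_common / union,   # shared variables as a share of the union
        "share_of_a": n_common / n_a,  # fraction of the first set that is shared
        "share_of_b": n_common / n_b,  # fraction of the second set that is shared
    }


# Counts taken from the Results: Boruta = 45, LASSO = 55, common = 36.
summary = overlap_summary(45, 55, 36)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With these counts, 9 variables are unique to Boruta and 19 to LASSO, the union holds 64 variables, and the Jaccard index is 36/64 = 0.5625; 80% of Boruta's selections also appear in the LASSO set.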

Conclusions:

Feature selection effectively reduces dimensionality while retaining predictive power. Boruta and LASSO performed comparably in terms of the selected variables. PCA-based selection of uncorrelated factors also proved useful. The key factors identified as associated with cancer information seeking can guide future cancer intervention and prevention strategies.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.