Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jun 16, 2025
Date Accepted: Aug 29, 2025

The final, peer-reviewed published version of this preprint can be found here:

Analysis of Breast Cancer Information on Facebook Using Neural Network–Based Topic Modeling and Metadata Analysis of English and Spanish Content: Comparative Study

Muralidharan R, Soto-Vasquez AD, Montenegro M, Valdez D

J Med Internet Res 2025;27:e79161

DOI: 10.2196/79161

PMID: 41091542

PMCID: 12572747

Comparative Analysis of the Breast Cancer Information Landscape on Facebook: Neural Network Topic Modeling and Metadata Analysis of English and Spanish Content

  • Rasika Muralidharan
  • Arthur D Soto-Vasquez
  • Maria Montenegro
  • Danny Valdez

ABSTRACT

Background:

Breast cancer is the most common cancer diagnosis among women, with approximately 2.3 million new cases annually. When faced with life-changing news such as a cancer diagnosis, individuals often turn to the internet for information or reassurance, despite the significant risk of encountering low-quality or incorrect information. Although this pattern is well documented for English-language content, limited work to date has examined the scope and quality of breast cancer information in Spanish, the second most commonly spoken language in the U.S.

Objective:

This study uses natural language processing methods and quantitative modeling to analyze English and Spanish breast cancer posts from Facebook, a vital source of health-related information for 40% of English-speaking and 60% of Spanish-speaking adults in the U.S.

Methods:

Using the CrowdTangle API, we collected and processed N = 243,029 English-language and N = 104,056 Spanish-language Facebook posts. We applied BERTopic with the AllMiniLM-L6 model and k-means clustering to infer thematic structures and used coherence scores to determine the optimal number of topics for each language. Descriptive statistics were used to compare metadata differences across languages. To enable equitable comparison, we calculated the coefficient of variation (standard deviation divided by the mean) for likes, comments, and shares. Finally, we examined the top 1% of the most engaged content to analyze differences in poster characteristics across languages.
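As an illustration of the coefficient of variation described above, the following is a minimal sketch rather than the authors' code. The function name and the toy counts are hypothetical, and the use of the population standard deviation is an assumption, since the abstract does not say whether the population or sample form was used.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the standard deviation divided by the mean.

    Assumption: population standard deviation (pstdev); the abstract
    does not specify population versus sample.
    """
    avg = mean(values)
    if avg == 0:
        raise ValueError("coefficient of variation is undefined when the mean is 0")
    return pstdev(values) / avg


# Hypothetical toy like counts, NOT data from the study.
# The second list is the first scaled by 10, mimicking two corpora
# whose raw engagement volumes differ in magnitude.
likes_small = [10, 12, 11, 9, 13]
likes_large = [100, 120, 110, 90, 130]

cv_small = coefficient_of_variation(likes_small)
cv_large = coefficient_of_variation(likes_large)

print(f"CV (small scale): {cv_small:.4f}")
print(f"CV (large scale): {cv_large:.4f}")
```

Because the coefficient of variation is unitless, the two toy lists yield the same value even though their raw counts differ tenfold, which is why it suits comparisons between corpora of different sizes and engagement volumes.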

Results:

Coherence scores indicated an optimal topic solution of k = 40 for English (0.58) and k = 30 for Spanish (0.52). Thematically, we observed similar content in English and Spanish, with topics spanning mammography, breast cancer events, pink ribbon month, and personal narratives about breast cancer. However, topics referring to local and municipal breast cancer events emerged in Spanish but not in English. Spanish breast cancer prevention topics were also more likely to mention at-home breast exams, which are no longer recommended in the US. Regarding engagement, liking and sharing behavior was more consistent in English, whereas commenting was more variable. In Spanish, commenting behavior was more consistent, whereas likes and shares were more variable. The top 1% of engaged content in English consistently originated from leading breast cancer non-profits and authorities. In Spanish, the top 1% of engaged content originated from local governments or food and beverage companies.
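One way the top 1% of engaged content referred to above could be selected is sketched below. This is an assumption-laden illustration, not the authors' procedure: the abstract does not define the engagement score, so the sum of likes, comments, and shares used here is hypothetical, as are the function name, the field names, and the toy posts.

```python
def top_one_percent(posts):
    """Return the top 1% of posts by total engagement, highest first.

    Assumption: engagement = likes + comments + shares. The abstract
    does not state how engagement was scored.
    """
    if not posts:
        return []
    # Integer arithmetic avoids floating-point rounding; keep at least one post.
    top_n = max(1, len(posts) // 100)
    ranked = sorted(
        posts,
        key=lambda p: p["likes"] + p["comments"] + p["shares"],
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical toy posts, NOT data from the study.
posts = [
    {"page": f"page_{i}", "likes": i, "comments": 0, "shares": 0}
    for i in range(200)
]

top = top_one_percent(posts)
print([p["page"] for p in top])
```

The posting pages in the resulting subset could then be reviewed by hand and grouped by type, which is the kind of comparison of poster characteristics the abstract reports.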

Conclusions:

Our results indicate that Facebook breast cancer content is generally consistent across languages. However, differences in engagement behavior suggest that English- and Spanish-speaking populations engage with content differently, which may reflect cultural variability that should be explored further. Critically, our results also suggest that leading cancer authorities may not have as strong a presence in Spanish. If so, what is likely the most accurate and up-to-date information may not be reaching a population especially prone to worse breast cancer prognoses.

