Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Feb 20, 2025
Open Peer Review Period: Feb 20, 2025 - Apr 17, 2025
Date Accepted: Mar 24, 2025

The final, peer-reviewed published version of this preprint can be found here:

Decoding Digital Discourse Through Multimodal Text and Image Machine Learning Models to Classify Sentiment and Detect Hate Speech in Race- and Lesbian, Gay, Bisexual, Transgender, Queer, Intersex, and Asexual Community–Related Posts on Social Media: Quantitative Study

Nguyen TT, Yue X, Mane H, Steelman K, Mullaputi PSP, Dennard E, Alibilli A, Merchant JS, Criss S, Hswen Y, Nguyen QC

J Med Internet Res 2025;27:e72822

DOI: 10.2196/72822

PMID: 40354116

PMCID: PMC12107201

Decoding Digital Discourse: Multimodal Text and Image Machine Learning Models to Classify Sentiment, Hate, and Anti-Hate of Race and LGBTQIA+ Related Posts on Social Media

  • Thu T. Nguyen; 
  • Xiaohe Yue; 
  • Heran Mane; 
  • Kyle Steelman; 
  • Penchala Sai Priya Mullaputi; 
  • Elizabeth Dennard; 
  • Amrutha Alibilli; 
  • Junaid S. Merchant; 
  • Shaniece Criss; 
  • Yulin Hswen; 
  • Quynh C. Nguyen

ABSTRACT

Background:

A major challenge in sentiment analysis on social media is the increasing prevalence of image-based content, particularly memes, which integrate text and visuals to convey nuanced messages. Traditional text-based approaches have been widely used to assess public attitudes and beliefs; however, they often fail to fully capture the meaning of multimodal content, where cultural, contextual, and visual elements play a significant role.

Objective:

This study aims to provide practical guidance for collecting, processing, and analyzing social media data using multimodal machine learning models. Specifically, it focuses on training and fine-tuning models to classify sentiment (positive or negative) and to detect hate speech (hate or anti-hate content).

Methods:

Social media data were collected from Facebook and Instagram using CrowdTangle, a now-discontinued public insights tool from Meta, and from X (formerly Twitter) via its Academic API. The dataset was filtered to include only race- and LGBTQIA+-related posts with image attachments, ensuring a focus on multimodal content. Human annotators labeled 13,000 posts into four categories: negative sentiment, positive sentiment, hate, or anti-hate. We evaluated unimodal models (BERT for text and VGG-16 for images) and multimodal models (CLIP, VisualBERT, and an intermediate fusion model). To enhance model performance, the Synthetic Minority Oversampling Technique (SMOTE) was applied to address class imbalance, and Latent Dirichlet Allocation (LDA) was used to improve semantic representations.
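The intermediate fusion model is named only briefly here. The sketch below illustrates the general pattern, assuming BERT's pooled text embedding is concatenated with flattened VGG-16 image features ahead of a shared classifier head; the layer sizes, dropout rate, and head depth are illustrative assumptions, not the authors' exact architecture.

    import torch
    import torch.nn as nn
    from torchvision.models import vgg16
    from transformers import BertModel

    class IntermediateFusionClassifier(nn.Module):
        """Concatenates BERT text embeddings with VGG-16 image features
        before a shared classification head (intermediate fusion)."""

        def __init__(self, num_classes: int = 2):
            super().__init__()
            # Text branch: pretrained BERT; the pooled output is 768-d.
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            # Image branch: pretrained VGG-16 convolutional features
            # (512 x 7 x 7 = 25,088 values for a 224 x 224 input).
            cnn = vgg16(weights="IMAGENET1K_V1")
            self.image_encoder = nn.Sequential(cnn.features, cnn.avgpool, nn.Flatten())
            self.image_proj = nn.Linear(512 * 7 * 7, 512)  # compress image features
            # Fusion head: operates on the concatenated text + image vector.
            self.classifier = nn.Sequential(
                nn.Linear(768 + 512, 256),
                nn.ReLU(),
                nn.Dropout(0.3),
                nn.Linear(256, num_classes),
            )

        def forward(self, input_ids, attention_mask, pixel_values):
            text_feat = self.bert(
                input_ids=input_ids, attention_mask=attention_mask
            ).pooler_output                                  # (batch, 768)
            img_feat = self.image_proj(self.image_encoder(pixel_values))
            fused = torch.cat([text_feat, img_feat], dim=1)  # intermediate fusion
            return self.classifier(fused)

In a pipeline like this, SMOTE (for example, from the imbalanced-learn package) is typically applied to extracted feature vectors rather than to raw images, oversampling minority classes such as anti-hate before the classifier head is trained.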

Results:

Our findings highlight key differences in model performance. Among unimodal models, BERT outperformed VGG-16, achieving higher accuracy and macro F1 scores across all tasks. Among multimodal models, CLIP achieved the highest accuracy (0.86) in negative sentiment detection, followed by VisualBERT (0.84). For positive sentiment, VisualBERT outperformed the other models with the highest accuracy (0.76). In hate speech detection, the intermediate fusion model demonstrated the highest accuracy (0.91) with a macro F1 score of 0.64, ensuring balanced performance. Meanwhile, VisualBERT performed best in anti-hate classification, achieving an accuracy of 0.78. Applying LDA and SMOTE improved minority class detection, particularly for anti-hate content. Overall, the intermediate fusion model provided the most balanced performance across tasks, while CLIP excelled in accuracy-driven classifications. Although VisualBERT performed well in certain areas, it struggled to maintain a balance between precision and recall. These results emphasize the effectiveness of multimodal approaches over unimodal models in analyzing social media sentiment.
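For readers reproducing these comparisons, accuracy and macro F1 can be computed with standard scikit-learn metrics; the labels below are placeholder values for illustration, not data from the study.

    from sklearn.metrics import accuracy_score, f1_score

    # Placeholder gold labels and predictions (e.g., 1 = hate, 0 = not hate).
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print("accuracy:", accuracy_score(y_true, y_pred))
    # Macro F1 averages per-class F1 scores with equal weight, so it penalizes
    # models that do well on the majority class but poorly on minority classes
    # such as anti-hate, which motivates reporting it alongside raw accuracy.
    print("macro F1:", f1_score(y_true, y_pred, average="macro"))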

Conclusions:

This study contributes to the growing research on multimodal machine learning by demonstrating how advanced models, data augmentation techniques, and diverse datasets can enhance the analysis of social media content. The findings offer valuable insights for researchers, policymakers, and public health professionals seeking to leverage AI for social media monitoring and addressing broader societal challenges.


Citation

Please cite as:

Nguyen TT, Yue X, Mane H, Steelman K, Mullaputi PSP, Dennard E, Alibilli A, Merchant JS, Criss S, Hswen Y, Nguyen QC

Decoding Digital Discourse Through Multimodal Text and Image Machine Learning Models to Classify Sentiment and Detect Hate Speech in Race- and Lesbian, Gay, Bisexual, Transgender, Queer, Intersex, and Asexual Community–Related Posts on Social Media: Quantitative Study

J Med Internet Res 2025;27:e72822

DOI: 10.2196/72822

PMID: 40354116

PMCID: PMC12107201


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.