Currently submitted to: JMIR Infodemiology

Date Submitted: Apr 15, 2026
Open Peer Review Period: Apr 27, 2026 - Jun 22, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Identifying Public Response Topics to CDC COVID-19 Communications on Social Media: Infoveillance Study Using Large Language Model–Based Rephrasing

  • Wangjiaxuan Xin; 
  • Shuhua Yin; 
  • Shi Chen; 
  • Yaorong Ge

ABSTRACT

Background:

Public health agencies increasingly rely on social media platforms such as X (formerly Twitter) to monitor public responses to health communications and to support timely, data-driven decision-making during health crises such as the COVID-19 pandemic. In particular, public replies to official communications from agencies such as the Centers for Disease Control and Prevention (CDC) provide valuable insights into population-level perceptions, concerns, and engagement with health policies. However, effectively analyzing these large-scale short-text responses remains challenging due to their brevity, informality, and linguistic variability. Topic modeling offers a scalable approach to identifying thematic patterns in such data, but its performance is often limited when applied to short, noisy social media text, resulting in low-quality and difficult-to-interpret topics. Although recent advances in large language models (LLMs) provide new opportunities to enhance text representation, their potential to improve topic modeling for public health–oriented social media analysis remains underexplored.

Objective:

This study aims to enhance public health surveillance by improving the analysis of public responses to official health communications on social media. To achieve this, we develop and evaluate TM-Rephrase, a model-agnostic framework that leverages LLM–based rephrasing to improve the quality, interpretability, and semantic relevance of topics derived by topic models for public health-related short texts on social media.

Methods:

We analyzed 25,027 public replies to official CDC posts on X, collected between May 2020 and November 2022, to examine public responses to health communications. We applied an LLM–based rephrasing framework (TM-Rephrase) to transform informal short texts into more standardized and context-enriched representations using two schemes: general rephrasing and colloquial-to-formal rephrasing. Both original and rephrased texts were analyzed with multiple topic models, and topic quality was evaluated using coherence, uniqueness, redundancy, and diversity metrics, along with qualitative assessment of interpretability and post-topic semantic alignment.
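The rephrasing step described above can be sketched as a thin preprocessing layer that sits in front of any topic model. The following is a minimal illustration only, not the authors' implementation: the prompt wording, the names `PROMPTS`, `rephrase_corpus`, and `build_conditions`, and the `echo_llm` stand-in are all assumptions introduced here, since the abstract does not give the actual prompts or code.

```python
from typing import Callable, Dict, List

# Hypothetical prompt templates for the two schemes named in the abstract.
# The study's real prompts are not given there; these are placeholders.
PROMPTS: Dict[str, str] = {
    "general": (
        "Rephrase the following social media reply clearly, "
        "keeping its meaning:\n{text}"
    ),
    "colloquial_to_formal": (
        "Rewrite the following colloquial social media reply in formal "
        "standard English, keeping its meaning:\n{text}"
    ),
}


def rephrase_corpus(
    texts: List[str], scheme: str, llm: Callable[[str], str]
) -> List[str]:
    """Rephrase each text with the given scheme; `llm` maps prompt -> reply."""
    template = PROMPTS[scheme]
    rephrased = []
    for text in texts:
        cleaned = text.strip()
        if not cleaned:
            # Nothing to rephrase; keep the slot so corpora stay aligned.
            rephrased.append(cleaned)
            continue
        rephrased.append(llm(template.format(text=cleaned)).strip())
    return rephrased


def build_conditions(
    texts: List[str], llm: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Return the original corpus plus one rephrased corpus per scheme."""
    conditions = {"original": list(texts)}
    for scheme in PROMPTS:
        conditions[scheme] = rephrase_corpus(texts, scheme, llm)
    return conditions


def echo_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: returns the prompt's last line unchanged."""
    return prompt.splitlines()[-1]


if __name__ == "__main__":
    replies = ["lol no way im getting that shot", "  masks dont work tho  "]
    for name, corpus in build_conditions(replies, echo_llm).items():
        print(name, corpus)
```

Because the topic model only ever sees a list of strings, each corpus in the returned dictionary can be passed unchanged to whichever topic model is being compared, which is one way to read the "model-agnostic" claim. In practice `echo_llm` would be replaced by a call to an actual LLM client.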

Results:

TM-Rephrase consistently improved topic quality across models and evaluation metrics. For latent Dirichlet allocation (LDA), topic coherence increased from Cv = 0.3094 (no rephrasing) to Cv = 0.5004 with colloquial-to-formal rephrasing (a relative improvement of more than 60%). For BERTopic, coherence improved from Cv = 0.4078 to Cv = 0.4734. Diversity-related metrics also improved: under rephrased conditions, TSCTM achieved topic uniqueness (TU) = 1.0, topic diversity (TD) = 1.0, and topic redundancy (TR) = 0, indicating fully distinct topics. These improvements were robust across multiple LLMs (Gemini, GPT-4o-mini, and Mistral-7B). Qualitative results further showed that rephrased texts produced topics whose representative keywords were more interpretable and semantically coherent, enabling clearer identification of public concerns and responses to health communications, including themes related to vaccination attitudes, perceived risks, and public health measures. This improvement facilitates more reliable characterization of public discourse, which is critical for supporting social media–based public health surveillance and informing communication strategies.
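The arithmetic behind the coherence figures, and the meaning of "fully distinct topics", can be checked with a short sketch. The coherence values are taken from the text above; everything else is an assumption for illustration. In particular, the abstract does not state which formulas the study used for uniqueness, diversity, or redundancy, so the functions below use commonly cited definitions (unique-word proportion for diversity, inverse topic count for uniqueness) and a simple mean pairwise overlap as a stand-in for redundancy. The toy topics are invented.

```python
from collections import Counter
from itertools import combinations
from typing import List


def relative_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


def topic_diversity(topics: List[List[str]]) -> float:
    """Share of unique words among all top words of all topics."""
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words)


def topic_uniqueness(topics: List[List[str]]) -> float:
    """Mean over topics of the average 1 / (number of topics containing the word)."""
    counts = Counter(w for topic in topics for w in set(topic))
    per_topic = [sum(1 / counts[w] for w in topic) / len(topic) for topic in topics]
    return sum(per_topic) / len(per_topic)


def mean_pairwise_overlap(topics: List[List[str]]) -> float:
    """Assumed redundancy proxy: mean shared-word fraction over topic pairs."""
    pairs = list(combinations(topics, 2))
    shared = [len(set(a) & set(b)) / min(len(a), len(b)) for a, b in pairs]
    return sum(shared) / len(shared)


if __name__ == "__main__":
    # Coherence values reported in the Results paragraph.
    print(f"LDA:      {relative_improvement(0.3094, 0.5004):.1%}")  # 61.7%
    print(f"BERTopic: {relative_improvement(0.4078, 0.4734):.1%}")  # 16.1%

    # Invented toy topics: no word is shared, so topics are fully distinct.
    distinct = [["vaccine", "dose", "booster"], ["mask", "mandate", "school"]]
    print(topic_diversity(distinct), topic_uniqueness(distinct),
          mean_pairwise_overlap(distinct))
```

Under these definitions, a set of topics whose top words never repeat scores 1.0 on diversity and uniqueness and 0 on overlap, which matches the pattern reported for TSCTM; any shared keyword pulls the first two below 1 and the third above 0.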

Conclusions:

TM-Rephrase enhances the reliability and interpretability of topic modeling for short, noisy social media data in public health contexts, enabling more actionable insights for public health infoveillance. By improving semantic clarity without modifying underlying models, this framework provides a scalable and practical technique for monitoring public discourse and supporting evidence-based public health communication and decision-making.


 Citation

Please cite as:

Xin W, Yin S, Chen S, Ge Y

Identifying Public Response Topics to CDC COVID-19 Communications on Social Media: Infoveillance Study Using Large Language Model–Based Rephrasing

JMIR Preprints. 15/04/2026:98319

DOI: 10.2196/preprints.98319

URL: https://preprints.jmir.org/preprint/98319


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.