JMIR Preprints #98319: Identifying Public Response Topics to CDC COVID-19 Communications on Social Media: Infoveillance Study Using Large Language Model

Identifying Public Response Topics to CDC COVID-19 Communications on Social Media: Infoveillance Study Using Large Language Model–Based Rephrasing

Wangjiaxuan Xin;
Shuhua Yin;
Shi Chen;
Yaorong Ge

ABSTRACT

Background:

Public health agencies increasingly rely on social media platforms such as X (formerly Twitter) to monitor public responses to health communications and to support timely, data-driven decision-making during health crises such as the COVID-19 pandemic. In particular, public replies to official communications from agencies such as the Centers for Disease Control and Prevention (CDC) provide valuable insights into population-level perceptions, concerns, and engagement with health policies. However, effectively analyzing these large-scale short-text responses remains challenging due to their brevity, informality, and linguistic variability. Topic modeling offers a scalable approach to identifying thematic patterns in such data, but its performance is often limited when applied to short, noisy social media text, resulting in low-quality and difficult-to-interpret topics. Although recent advances in large language models (LLMs) provide new opportunities to enhance text representation, their potential to improve topic modeling for public health–oriented social media analysis remains underexplored.

Objective:

This study aims to enhance public health surveillance by improving the analysis of public responses to official health communications on social media. To this end, we develop and evaluate TM-Rephrase, a model-agnostic framework that uses LLM–based rephrasing to improve the quality, interpretability, and semantic relevance of the topics that topic models derive from short, public health–related social media texts.

Methods:

We analyzed 25,027 public replies to official CDC posts on X, collected between May 2020 and November 2022, to examine public responses to health communications. We applied an LLM–based rephrasing framework (TM-Rephrase) to transform informal short texts into more standardized and context-enriched representations using two schemes: general rephrasing and colloquial-to-formal rephrasing. Both original and rephrased texts were analyzed with multiple topic models, and topic quality was evaluated using coherence, uniqueness, redundancy, and diversity metrics, along with a qualitative assessment of interpretability and post-topic semantic alignment.
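The model-agnostic design described in the Methods can be sketched as a thin preprocessing layer in front of any topic model. The following is a minimal illustration only, not the authors' implementation: the prompt wording, the function names (`rephrase_corpus`, `run_topic_model`), and the `echo_llm` stand-in are all hypothetical, and a real run would replace `echo_llm` with a call to an actual LLM.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional

# Hypothetical prompt templates, one per rephrasing scheme named in the Methods.
# The wording is invented for illustration; the study's actual prompts are not shown here.
PROMPTS: Dict[str, str] = {
    "general": "Rephrase the reply below clearly, preserving its meaning.\n{text}",
    "colloquial_to_formal": (
        "Rewrite the colloquial reply below in formal standard English, "
        "preserving its meaning.\n{text}"
    ),
}


def rephrase_corpus(
    texts: Iterable[str], llm: Callable[[str], str], scheme: str = "general"
) -> List[str]:
    """Send each text through the LLM using the prompt for the chosen scheme."""
    template = PROMPTS[scheme]
    return [llm(template.format(text=t)) for t in texts]


def run_topic_model(
    texts: Iterable[str],
    topic_model: Callable[[List[str]], Any],
    llm: Optional[Callable[[str], str]] = None,
    scheme: str = "general",
) -> Any:
    """Fit any topic model on the original texts (llm=None) or on rephrased texts.

    The topic model is passed in as a plain callable, so it is not modified.
    """
    docs = list(texts)
    if llm is not None:
        docs = rephrase_corpus(docs, llm, scheme)
    return topic_model(docs)


def echo_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: returns the reply text unchanged."""
    return prompt.split("\n", 1)[1]
```

Because the topic model is an ordinary callable, the same corpus can be run with and without rephrasing and the resulting topics compared on identical metrics.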

Results:

TM-Rephrase consistently improved topic quality across models and evaluation metrics. For latent Dirichlet allocation (LDA), topic coherence increased from Cv=0.3094 (no rephrasing) to Cv=0.5004 with colloquial-to-formal rephrasing (>60% relative improvement). For BERTopic, coherence improved from Cv=0.4078 to Cv=0.4734. Diversity-related metrics also improved, with TSCTM achieving topic uniqueness (TU)=1.0, topic diversity (TD)=1.0, and topic redundancy (TR)=0 under rephrased conditions, indicating fully distinct topics. These improvements were robust across multiple LLMs (Gemini, GPT-4o-mini, and Mistral-7B). Qualitative results further showed that rephrased texts produced more interpretable and semantically coherent topics, as reflected in their representative keywords. This enabled clearer identification of public concerns and responses to health communications, including themes related to vaccination attitudes, perceived risks, and public health measures. The improvement supports more reliable characterization of public discourse, which is critical for social media–based public health surveillance and for informing communication strategies.
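The arithmetic behind the reported coherence gains, and the meaning of TU=1.0 and TD=1.0, can be checked with a short sketch. The coherence values are taken from the Results above. The TD and TU functions follow commonly used definitions (fraction of unique top words, and the mean inverse frequency of each top word across topics); this is an assumption, since the exact definitions used in the study are not given here. The keyword lists are toy examples, not study output.

```python
from collections import Counter
from typing import List


def relative_improvement(before: float, after: float) -> float:
    """Relative change of a score with respect to its baseline value."""
    return (after - before) / before


def topic_diversity(topics: List[List[str]]) -> float:
    """Share of unique words among all top words of all topics (1.0 = no overlap)."""
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words)


def topic_uniqueness(topics: List[List[str]]) -> float:
    """Mean over topics of the average 1/count(word), where count is taken
    over the top words of all topics (1.0 = every word appears in one topic only)."""
    counts = Counter(w for topic in topics for w in topic)
    per_topic = [sum(1 / counts[w] for w in topic) / len(topic) for topic in topics]
    return sum(per_topic) / len(per_topic)


# Coherence values reported in the Results.
lda_gain = relative_improvement(0.3094, 0.5004)        # about 0.617, i.e. >60%
bertopic_gain = relative_improvement(0.4078, 0.4734)   # about 0.161

# Toy keyword lists (illustrative only).
distinct = [["vaccine", "dose", "booster"], ["mask", "school", "mandate"]]
overlapping = [["vaccine", "dose", "mask"], ["mask", "school", "mandate"]]
```

Under these definitions, the `distinct` topics score 1.0 on both metrics, which is what fully non-overlapping keyword sets look like, while the shared word in `overlapping` lowers both scores to about 0.83.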

Conclusions:

TM-Rephrase enhances the reliability and interpretability of topic modeling for short, noisy social media data in public health contexts, enabling more actionable insights for public health infoveillance. By improving semantic clarity without modifying underlying models, this framework provides a scalable and practical technique for monitoring public discourse and supporting evidence-based public health communication and decision-making.

Citation

Please cite as:

Xin W, Yin S, Chen S, Ge Y

Identifying Public Response Topics to CDC COVID-19 Communications on Social Media: Infoveillance Study Using Large Language Model–Based Rephrasing

JMIR Preprints. 15/04/2026:98319

DOI: 10.2196/preprints.98319

URL: https://preprints.jmir.org/preprint/98319

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Currently submitted to: JMIR Infodemiology

Date Submitted: Apr 15, 2026

Open Peer Review Period: Apr 27, 2026 - Jun 22, 2026

(currently open for review)
