Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Aug 9, 2024
Date Accepted: May 6, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Use of large language models and synthetic social media to optimize and validate assessment of epidemiological characteristics in social media posts about outbreaks: Infodemiology Study
ABSTRACT
Background:
Online search and social media data can help identify epidemics, potentially earlier than clinical methods, and may even reveal otherwise unreported outbreaks. Monitoring for eye-related epidemics can facilitate early public health intervention to reduce transmission and the ocular comorbidities associated with outbreaks. However, use of social media for such monitoring is hindered by the cost of laborious manual content review. To address this limitation, we have shown the utility of large language models (LLMs) for assessing the probability of an outbreak from social media posts. Knowing the probability alone, though, may not be as informative for public health action as also knowing further epidemiological characteristics of the outbreaks, for example, the outbreak type, the outbreak size, or which outbreaks are the most severe.
Objective:
We assessed whether and how well LLMs can classify essential epidemiological features from individual social media posts beyond outbreak probability, including outbreak type, size, severity, etiology, and location, as well as other health conditions. We employed a validation framework comprising synthetic, Twitter/X, and forum posts, comparing an LLM's classifications with those of other, independent LLMs and with those of human experts.
Methods:
To develop effective prompts and test the capability of multiple LLMs, synthetic social media posts were generated. These synthetic posts were embedded with specific preclassified epidemiological features to simulate various outbreak and control scenarios. To gauge the LLMs' practical utility in real-world epidemiological surveillance, inter-model comparisons among the top-performing LLMs were made using Twitter/X and forum posts. Finally, human graders classified a subset of posts, and their classifications were compared with those of a leading LLM for validation. Comparisons used correlation or sensitivity and specificity statistics.
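As an illustration of the comparison statistics named above, the following is a minimal sketch, not the study's actual analysis code. It assumes binary labels for sensitivity and specificity and uses Pearson correlation for numeric ratings; the abstract does not state which correlation coefficient was used. The function names and all example values are hypothetical.

```python
from math import sqrt


def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for paired binary labels (1 = feature present)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical data: labels embedded in synthetic posts vs. an LLM's classifications.
known_outbreak = [1, 1, 1, 0, 0, 0, 1, 0]
llm_outbreak = [1, 1, 0, 0, 0, 1, 1, 0]
sens, spec = sensitivity_specificity(known_outbreak, llm_outbreak)

# Hypothetical data: embedded outbreak probability (%) vs. LLM-estimated probability (%).
known_prob = [0, 25, 50, 75, 100]
llm_prob = [10, 20, 55, 70, 95]
r = pearson_r(known_prob, llm_prob)

print(f"sensitivity={sens:.2f} specificity={spec:.2f} r={r:.3f}")
```

In this sketch, categorical features (for example, whether a post describes an outbreak) are scored with sensitivity and specificity, while graded features (for example, outbreak probability) are scored with correlation.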
Results:
Seven LLMs were assessed for their effectiveness in classifying epidemiological data from diverse social media posts. Notably, GPT-4 and Mixtral 8x22b exhibited high performance in predicting outbreak characteristics such as probability, size, and type. Lower-performing LLMs were successful for some classifications but not others. Despite strong correlations in comparative validations and with known values, discrepancies were noted in a few categories of human assessments. Overall, however, the models demonstrated a reliable capacity for nuanced epidemiological analysis across various data sources.
Conclusions:
This investigation into the potential of LLMs for public health infoveillance suggests that they are effective in classifying key epidemiological characteristics from social media content about conjunctivitis outbreaks. While LLMs have potential to support public health monitoring, future studies may show that their optimal role is to act as a first line of documentation, assessment, and classification of potential outbreaks, alerting public health organizations to follow up on small, early outbreaks that LLMs have detected and classified, with a focus on the most severe ones.
Clinical Trial: n/a
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.