Accepted for/Published in: Online Journal of Public Health Informatics
Date Submitted: Jul 17, 2025
Date Accepted: Feb 17, 2026
Comparison of Artificial Intelligence (AI) Tools With Human-Coding for Sentiment Analysis, Topic and Thematic Analysis Tasks of Public Health Datasets: Case Study During the COVID-19 Pandemic in Australia
ABSTRACT
Background:
Public opinion, which may be influenced by personal experiences, news, and social media, can impact compliance with public health measures (PHMs) during health emergencies. Research on automated methods to measure public opinion increased during the COVID-19 pandemic, enabling rapid analysis of vast online datasets. Challenges remain with data quality, representativeness, and interpretive depth when compared with traditional qualitative methods.
Objective:
This study evaluated the performance of natural language processing (NLP) and large language model (LLM)-based artificial intelligence (AI) tools when compared to human coding for sentiment analysis, topic modelling, and thematic analysis of public health datasets. Tools were selected to reflect those available to public health analysts and decision-makers.
Methods:
Data were collected via Google Alerts (GA) and social media posts from X (formerly Twitter) relevant to COVID-19 mitigation PHMs from December 2022 to February 2023. Keyword searches focused on vaccines, masks, and related topics. Following relevance screening, sentiment analysis was performed on a subset of 400 GA results and 400 tweets by two human raters. Human-coded sentiment analysis was compared with five AI tools: VADER, SentimentGI, SentimentQDAP, Microsoft Azure, and OpenAI’s ChatGPT-4. Topic modelling of the GA and X datasets was conducted using Latent Dirichlet Allocation (LDA) in R and zero-shot prompting in ChatGPT-4, and compared with manual topic summaries. Thematic analysis of positive and negative sentiment datasets was conducted by a human rater and ChatGPT-4, with outputs cross-matched for proficiency and reasonableness.
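To compare tool output with human coding, each document's score must first be collapsed to the same three nominal labels the raters used. As a minimal sketch (not taken from the paper), the mapping below uses VADER's conventional ±0.05 cut-offs on the compound score; the function name and thresholds are illustrative assumptions.

```python
# Hypothetical sketch: collapse a VADER-style compound score (range -1..1)
# into the three sentiment labels used for comparison with human raters.
# The ±0.05 cut-offs are VADER's conventional defaults, assumed here,
# not reported by the study itself.
def label_from_compound(compound: float) -> str:
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(label_from_compound(0.62))   # positive
print(label_from_compound(-0.41))  # negative
print(label_from_compound(0.01))   # neutral
```

With both the human and tool outputs expressed as the same label set, agreement can then be quantified per tool and per sentiment category.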
Results:
Of 2,227 GA results and 3,484 tweets, 58% and 71% respectively were relevant to PHM. Human-coded sentiment analysis showed mostly neutral reporting in news media, while social media expressed more polarised views. Across both datasets, AI tools demonstrated poor concordance with human-coded sentiment (Cohen’s Kappa <0.5 for all tools and sentiment categories). For topic modelling, LLM outputs were more closely aligned with human-generated topics than LDA. For thematic analysis, LLM themes were rated “proficient” or “partially proficient” in 20/20 categories and always “very reasonable”. Human and LLM thematic analyses both identified themes of vaccine effectiveness, debate regarding PHM, and public trust.
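The concordance statistic reported above is Cohen's Kappa, which corrects raw agreement for the agreement expected by chance from each rater's label frequencies. A minimal stdlib-only sketch (the rating sequences are invented for illustration, not study data):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is chance agreement estimated from each rater's label marginals.
    """
    assert len(r1) == len(r2) and r1, "ratings must be equal-length, non-empty"
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[label] * c2[label] for label in set(c1) | set(c2)) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Invented example labels, not study data:
human = ["pos", "neg", "neu", "neu", "pos", "neg", "neu", "pos"]
tool  = ["pos", "neu", "neu", "neg", "pos", "neg", "pos", "neu"]
print(round(cohens_kappa(human, tool), 3))  # 0.238 -- well below the 0.5 cut-off
```

Values below roughly 0.4–0.5 are conventionally read as poor-to-fair agreement, which is why Kappa < 0.5 across all tools and categories supports the "poor concordance" conclusion.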
Conclusions:
Widely available AI tools currently perform poorly in sentiment analysis of public health datasets. LLMs demonstrate moderate alignment with human analysts for topic modelling and strong alignment for thematic analysis tasks, offering a promising approach for rapid, scalable qualitative assessment of public opinion to inform public health responses in real-time. These tools could complement traditional qualitative research. Further research is needed for non-English datasets.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.