Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Benchmarking Large Language Models for Sentiment Analysis in Pregnancy-Related Weight Discussions on Reddit: A Human Performance-Aligned Evaluation
ABSTRACT
Background:
Pregnancy-related weight gain is both physiologically complex and socially charged, resulting in online discourse where clinical facts often blend with implicit emotional distress. While automated sentiment analysis could support maternal health surveillance, model performance in this emotionally complex context remains inadequately characterized relative to expert human judgment.
Objective:
To benchmark contemporary sentiment classification models on pregnancy-related weight discussions, with emphasis on agreement with expert consensus, robustness under interpretive ambiguity, and practical deployment feasibility.
Methods:
We curated 15,619 Reddit posts and comments using an LLM-guided keyword expansion strategy. A "Gold Standard" was established through expert adjudication of 200 posts. We evaluated 10 models (3 commercial LLMs, 5 open-source LLMs, and 2 masked language models) across zero-shot and few-shot configurations. Performance was measured using weighted F1-score and Cohen's κ relative to adjudicated and pre-adjudication human baselines. Validated models were then applied to 15,019 unlabeled comments to assess scalability and inter-model agreement.
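Agreement in this evaluation is reported as Cohen's κ, which corrects raw agreement for chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected from each annotator's label marginals. As an illustrative sketch only (not the study's code, and using hypothetical labels), κ for two annotators can be computed as:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement derived from each rater's label marginals.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement: sum over labels of the product of marginal probabilities
    p_e = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical sentiment labels from two annotators (3-class scheme)
ann1 = ["Neg", "Neg", "Neu", "Pos", "Neg"]
ann2 = ["Neg", "Neu", "Neu", "Pos", "Neg"]
print(round(cohens_kappa(ann1, ann2), 4))  # -> 0.6875
```

Values near 0.34 (the pre-adjudication inter-annotator agreement reported here) fall in the conventionally "fair to moderate" range, which is what bounds achievable model performance on this task.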
Results:
Expert annotators achieved moderate pre-adjudication agreement (κ = 0.34), while expert–gold standard agreement established an empirical performance ceiling. The top-performing model (Gemini-3-Flash) achieved F1 = 0.66 and κ = 0.48, exceeding pre-adjudication inter-annotator agreement by 0.14. Moderate-parameter open-source models (e.g., GPT-OSS: F1 = 0.65, κ = 0.44) achieved comparable performance. Classification errors concentrated at the Neutral–Negative boundary. Inter-model agreement on unlabeled posts was high (κ = 0.80) but dropped sharply for comments (κ = 0.10) in a stratified sample, indicating a substantial discourse-type dependency.
Conclusions:
Sentiment classification in pregnancy-related weight discussions is fundamentally constrained by human interpretive disagreement. Open-source LLMs can substitute for commercial systems on narrative-style discourse, but performance remains bounded by task-intrinsic ambiguity. Model selection should prioritize human-relative agreement and discourse-type alignment over marginal performance gains.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.