Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Benchmarking Large Language Models for Sentiment Analysis in Pregnancy-Related Weight Discussions on Reddit: A Human Performance-Aligned Evaluation
ABSTRACT
Background:
Pregnancy-related weight gain is both physiologically complex and socially charged, resulting in online discourse where clinical facts often blend with implicit emotional distress. While automated sentiment analysis could support maternal health surveillance, model performance in this emotionally complex context remains inadequately characterized relative to expert human judgment.
Objective:
To benchmark contemporary sentiment classification models on pregnancy-related weight discussions, with emphasis on agreement with expert consensus, robustness under interpretive ambiguity, and practical deployment feasibility.
Methods:
We curated 15,619 Reddit posts and comments using an LLM-guided keyword expansion strategy. A "Gold Standard" was established through expert adjudication of 200 posts. We evaluated 10 models (3 commercial LLMs, 5 open-source LLMs, and 2 masked language models) across zero-shot and few-shot configurations. Performance was measured using weighted F1-score and Cohen's κ relative to adjudicated and pre-adjudication human baselines. Validated models were then applied to 15,019 unlabeled comments to assess scalability and inter-model agreement.
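Agreement in this evaluation is reported as Cohen's κ, which corrects raw agreement for chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected from each annotator's label marginals. As an illustrative sketch only (not the study's code, and using hypothetical labels), κ for two annotators can be computed as:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement derived from each rater's label marginals.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement: sum over labels of the product of marginal probabilities
    p_e = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical sentiment labels from two annotators (3-class scheme)
ann1 = ["Neg", "Neg", "Neu", "Pos", "Neg"]
ann2 = ["Neg", "Neu", "Neu", "Pos", "Neg"]
print(round(cohens_kappa(ann1, ann2), 4))  # -> 0.6875
```

Values near 0.34 (the pre-adjudication inter-annotator agreement reported here) fall in the conventionally "fair to moderate" range, which is what bounds achievable model performance on this task.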
Results:
Expert annotators achieved moderate pre-adjudication agreement (κ = 0.34), while expert–gold standard agreement established an empirical performance ceiling. The top-performing model (Gemini-3-Flash) achieved F1 = 0.66 and κ = 0.48, exceeding pre-adjudication inter-annotator agreement by 0.14. Moderate-parameter open-source models (e.g., GPT-OSS: F1 = 0.65, κ = 0.44) achieved comparable performance. Classification errors concentrated at the Neutral–Negative boundary. Inter-model agreement on unlabeled posts was high (κ = 0.80) but dropped sharply for comments (κ = 0.10) in a stratified sample, indicating a substantial discourse-type dependency.
Conclusions:
Sentiment classification in pregnancy-related weight discussions is fundamentally constrained by human interpretive disagreement. Open-source LLMs can substitute for commercial systems on narrative-style discourse, but performance remains bounded by task-intrinsic ambiguity. Model selection should prioritize human-relative agreement and discourse-type alignment over marginal performance gains.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.