Currently submitted to: Journal of Medical Internet Research

Date Submitted: Mar 7, 2026
Open Peer Review Period: Mar 9, 2026 - May 4, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

A Precision-Optimised Framework for Extracting Self-Reported Health Outcomes from User-Generated Content: A Large-Scale Analysis of YouTube Comments

  • Ricardo Ribeiro
  • Aneesh Zutshi

ABSTRACT

Background:

YouTube is increasingly used for Healthcasting, the sharing of evidence-based dietary and lifestyle interventions by expert researchers and clinicians. In the metabolic health domain, channels focused on Therapeutic Carbohydrate Restriction (TCR) have accumulated audiences of millions. A distinctive feature of these channels is the comment section, where viewers share first-person accounts of health changes: weight loss, normalised biomarkers, and reversed chronic conditions. At scale, these comments constitute a unique source of real-world outcome data. However, extracting structured health information from hundreds of thousands of unstructured comments with the precision required for outcomes research presents significant computational challenges.

Objective:

To develop and validate a precision-optimised computational framework for systematically extracting self-reported health outcomes from Healthcasting YouTube comments, and to characterise the nature, distribution, and channel-level variation of reported outcomes across a large-scale metabolic health corpus.

Methods:

We collected 209,661 comments from 110 videos across 11 TCR-focused Healthcasting channels (37,742 unique authors; 2013–2026). A four-phase methodology was employed: (1) exploratory corpus characterisation; (2) iterative development of a 35-aspect hierarchical health outcome ontology; (3) a precision-optimised rule-based classification pipeline with manual validation (n=500) and negative-sample recall estimation (n=105); and (4) Aspect-Based Sentiment Analysis using dual-model LLM consensus coding.
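To illustrate what a precision-optimised rule in phase (3) might look like, the following minimal sketch flags a comment only when a first-person marker and an explicit outcome phrase co-occur and no hypothetical or question cue is present. The patterns and the function name `is_positive_outcome` are hypothetical, chosen purely for illustration; the abstract does not disclose the authors' actual rules or ontology.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the authors' rules.
FIRST_PERSON = re.compile(r"\b(i|i've|my|me)\b", re.IGNORECASE)
OUTCOME = re.compile(
    r"\b(lost \d+\s?(lbs?|pounds|kg)"
    r"|a1c (dropped|went down|is now)"
    r"|(pain|inflammation) (is |has )?(gone|disappeared|reduced))\b",
    re.IGNORECASE,
)
# Cues for wishes, plans, and questions, which are not outcome reports.
EXCLUDE = re.compile(
    r"\b(hope|hoping|want to|will|if i|can i|should i)\b|\?",
    re.IGNORECASE,
)


def is_positive_outcome(comment: str) -> bool:
    """Flag a comment only if every condition holds (favours precision)."""
    return bool(
        FIRST_PERSON.search(comment)
        and OUTCOME.search(comment)
        and not EXCLUDE.search(comment)
    )


comments = [
    "I lost 30 lbs in four months and my A1c dropped to 5.4",
    "My knee pain is gone after two months",
    "Will I lose weight on this diet?",
    "Great video, thanks doctor",
]
flagged = [c for c in comments if is_positive_outcome(c)]
```

Requiring all conditions at once means many genuine reports phrased differently are missed, which is the usual trade of recall for precision in conjunctive rule sets.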

Results:

The framework identified 6,671 positive health outcome reports (3.18% prevalence), achieving a precision of 97.6% (95% CI: 95.7%–98.6%) and an estimated recall of 16.5% (95% CI: 11.6%–23.6%). Outcomes extended well beyond weight loss: pain and inflammation reduction (17.0%), type 2 diabetes improvement (14.6%), skin health (11.8%), and psychological well-being (11.0%), with 2,032 outcomes spanning 18 named disease conditions. Over half (50.3%) spanned multiple research objectives simultaneously. Significant channel-level variation was observed (χ²=3,509, p<0.001), with positive outcome rates ranging from 1.14% to 8.06% (OR=7.61). A complementary Aspect-Based Sentiment Analysis confirmed a positive-to-negative ratio of 4.6:1, with negative experiences (11.9% of health-related comments) primarily involving gastrointestinal adaptation and cardiovascular concerns.
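The headline figures can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the prevalence from the reported counts, a confidence interval for precision, and the odds ratio between the highest- and lowest-rate channels. Two assumptions are made here and are not stated in the abstract: that 97.6% of the n=500 validation sample corresponds to 488 confirmed positives, and that a Wilson score interval is appropriate. The authors' interval method is not given, so the bounds may differ slightly from the reported ones. The odds ratio is computed from the rounded rates, so it can likewise differ from the reported value in the last digit.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def odds_ratio(p_high: float, p_low: float) -> float:
    """Odds ratio between two proportions."""
    return (p_high / (1 - p_high)) / (p_low / (1 - p_low))


# Counts and rates taken from the abstract.
prevalence = 6_671 / 209_661

# Assumption: 97.6% of the 500 validated comments = 488 confirmed positives.
precision_lo, precision_hi = wilson_interval(488, 500)

# Highest and lowest channel-level positive outcome rates (rounded values).
channel_or = odds_ratio(0.0806, 0.0114)

print(f"prevalence: {prevalence:.2%}")
print(f"precision 95% CI (Wilson): {precision_lo:.1%} to {precision_hi:.1%}")
print(f"channel odds ratio: {channel_or:.2f}")
```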

Conclusions:

Healthcasting YouTube comment sections contain a substantial, structured signal of self-reported health outcomes amenable to systematic computational extraction. The framework generates a high-confidence corpus of 6,510 estimated true positives across 35 health aspects, documenting the breadth and scale of metabolic health improvement reported by users of TCR-focused expert content. These findings provide a validated methodological foundation for AI-augmented digital health platform design.
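The figure of 6,510 estimated true positives is consistent with scaling the 6,671 flagged reports by the 97.6% precision point estimate. A minimal check follows; rounding down is an assumption made here because it matches the reported figure, and the abstract does not say how the value was rounded.

```python
import math

flagged_reports = 6_671   # positive outcome reports identified (from the abstract)
precision = 0.976         # precision point estimate (from the abstract)

# Assumption: the product is rounded down to a whole number of comments.
estimated_true_positives = math.floor(flagged_reports * precision)
estimated_false_positives = flagged_reports - estimated_true_positives

print(estimated_true_positives, estimated_false_positives)
```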


 Citation

Please cite as:

Ribeiro R, Zutshi A

A Precision-Optimised Framework for Extracting Self-Reported Health Outcomes from User-Generated Content: A Large-Scale Analysis of YouTube Comments

JMIR Preprints. 07/03/2026:94855

DOI: 10.2196/preprints.94855

URL: https://preprints.jmir.org/preprint/94855


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.