
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Mar 17, 2025
Open Peer Review Period: Mar 17, 2025 - May 12, 2025
Date Accepted: Sep 5, 2025

The final, peer-reviewed published version of this preprint can be found here:

Assessing Large Language Models in Building a Structured Dataset From AskDocs Subreddit Data: Methodological Study

Snell Q, Westhoff C, Westhoff J, Low E, Hanson C, Tass S

Assessing Large Language Models in Building a Structured Dataset From AskDocs Subreddit Data: Methodological Study

J Med Internet Res 2025;27:e74094

DOI: 10.2196/74094

PMID: 41124662

PMCID: 12543290

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Assessing Large Language Models in Building a Structured Dataset from Reddit Data: A Methodological Study

  • Quinn Snell
  • Chase Westhoff
  • John Westhoff
  • Ethan Low
  • Carl Hanson
  • Shannon Tass

ABSTRACT

Background:

In an era marked by growing reliance on digital platforms for health care consultation, the subreddit r/AskDocs has emerged as a pivotal forum. However, the vast, unstructured nature of forum data presents a formidable challenge: extracting and meaningfully analyzing such data requires advanced tools that can navigate the complexities of language and context inherent in user-generated content.

Objective:

Our objective was to evaluate the use of large language models (LLMs) to systematically transform the rich, unstructured textual data from r/AskDocs into a structured dataset, an approach that aligns more closely with human cognitive processes than traditional data extraction methods.

Methods:

We developed a dataset of Reddit posts from r/AskDocs by having human annotators extract key information. We then used state-of-the-art LLMs with specially engineered prompts to extract the same data from the posts and compared the results. Variation among the LLMs was further compared with variation among the human annotators to assess their similarity.
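The abstract does not include the prompts or extraction code. As a minimal, hypothetical sketch of the general technique it describes (prompting an LLM to emit structured records from free-text posts), the helpers below build an extraction prompt and parse a JSON reply; the field names, function names, and prompt wording are illustrative assumptions, not the authors' actual schema, and the call to a specific LLM API is deliberately omitted.

```python
import json

# Hypothetical field schema for illustration; the study's actual fields are not given here.
FIELDS = ["age", "sex", "chief_complaint", "duration", "medications"]

def build_extraction_prompt(post_text: str) -> str:
    """Construct a prompt instructing an LLM to return one JSON object per post."""
    schema = ", ".join(f'"{f}"' for f in FIELDS)
    return (
        "Extract the following fields from the Reddit post below and reply with "
        f"a single JSON object with keys {schema}. "
        "Use null for any field that is not stated.\n\n"
        f"Post:\n{post_text}"
    )

def parse_llm_reply(reply: str) -> dict:
    """Parse the model's JSON reply, tolerating any surrounding prose."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in reply")
    record = json.loads(reply[start : end + 1])
    # Keep only the expected keys so rows align across models and annotators.
    return {f: record.get(f) for f in FIELDS}
```

A dataset would then be assembled by sending `build_extraction_prompt(post)` to each model and collecting `parse_llm_reply(...)` rows, alongside the human annotators' rows for comparison.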

Results:

Our findings indicate that LLMs not only match but, in several aspects, surpass even highly educated humans in extracting information, including both demographic and contextual details, from unstructured text.
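The abstract does not name the metric used to compare variation among LLMs with variation among human annotators. One simple option, shown here purely as an assumed illustration (the study may well use a different statistic, such as a chance-corrected agreement coefficient), is mean pairwise percent agreement over the extracted fields:

```python
from itertools import combinations

def pairwise_agreement(annotations: list) -> float:
    """Mean fraction of fields on which each pair of annotators (or models) agrees.

    Each element of `annotations` is a dict mapping field name -> extracted value,
    all sharing the same keys.
    """
    if len(annotations) < 2:
        raise ValueError("need at least two annotators")
    fields = list(annotations[0].keys())
    scores = []
    for a, b in combinations(annotations, 2):
        matches = sum(a[f] == b[f] for f in fields)
        scores.append(matches / len(fields))
    return sum(scores) / len(scores)
```

Computing this separately for the group of human annotators and the group of LLMs gives two comparable agreement scores per post.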

Conclusions:

This study not only validates the use of LLMs for analyzing digital healthcare communications but also opens new avenues for understanding online behaviors and interactions, signaling a shift towards more sophisticated methodologies in digital research and practice.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.