
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Mar 17, 2025
Open Peer Review Period: Mar 17, 2025 - May 12, 2025
Date Accepted: Sep 5, 2025

The final, peer-reviewed published version of this preprint can be found here:

Assessing Large Language Models in Building a Structured Dataset From AskDocs Subreddit Data: Methodological Study

Snell Q, Westhoff C, Westhoff J, Low E, Hanson C, Tass S

Assessing Large Language Models in Building a Structured Dataset From AskDocs Subreddit Data: Methodological Study

J Med Internet Res 2025;27:e74094

DOI: 10.2196/74094

PMID: 41124662

PMCID: 12543290

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Assessing Large Language Models in Building a Structured Dataset from Reddit Data: A Methodological Study

  • Quinn Snell
  • Chase Westhoff
  • John Westhoff
  • Ethan Low
  • Carl Hanson
  • Shannon Tass

ABSTRACT

Background:

In an era marked by growing reliance on digital platforms for health care consultation, the subreddit r/AskDocs has emerged as a pivotal forum. However, the vast, unstructured nature of forum data presents a formidable challenge: extracting and meaningfully analyzing such data requires advanced tools that can navigate the complexities of language and context inherent in user-generated content.

Objective:

Our objective was to evaluate the use of large language models (LLMs) to systematically transform the rich, unstructured textual data from r/AskDocs into a structured dataset, an approach that aligns more closely with human cognitive processes than traditional data extraction methods.

Methods:

We developed a dataset of Reddit posts from r/AskDocs by having human annotators extract key information. We then used state-of-the-art LLMs with specially engineered prompts to extract the same data from the posts and compared the results. Variation among the LLMs was further compared with variation among the human annotators to assess their similarity.
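The abstract does not include the prompts or extraction code. As a minimal, hypothetical sketch of the general technique it describes (prompting an LLM to emit structured records from free-text posts), the helpers below build an extraction prompt and parse a JSON reply; the field names, function names, and prompt wording are illustrative assumptions, not the authors' actual schema, and the call to a specific LLM API is deliberately omitted.

```python
import json

# Hypothetical field schema for illustration; the study's actual fields are not given here.
FIELDS = ["age", "sex", "chief_complaint", "duration", "medications"]

def build_extraction_prompt(post_text: str) -> str:
    """Construct a prompt instructing an LLM to return one JSON object per post."""
    schema = ", ".join(f'"{f}"' for f in FIELDS)
    return (
        "Extract the following fields from the Reddit post below and reply with "
        f"a single JSON object with keys {schema}. "
        "Use null for any field that is not stated.\n\n"
        f"Post:\n{post_text}"
    )

def parse_llm_reply(reply: str) -> dict:
    """Parse the model's JSON reply, tolerating any surrounding prose."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in reply")
    record = json.loads(reply[start : end + 1])
    # Keep only the expected keys so rows align across models and annotators.
    return {f: record.get(f) for f in FIELDS}
```

A dataset would then be assembled by sending `build_extraction_prompt(post)` to each model and collecting `parse_llm_reply(...)` rows, alongside the human annotators' rows for comparison.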

Results:

Our findings indicate that LLMs not only match but, in several aspects, surpass even highly educated humans in extracting information, including both demographic and contextual details, from unstructured text.
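The abstract does not name the metric used to compare variation among LLMs with variation among human annotators. One simple option, shown here purely as an assumed illustration (the study may well use a different statistic, such as a chance-corrected agreement coefficient), is mean pairwise percent agreement over the extracted fields:

```python
from itertools import combinations

def pairwise_agreement(annotations: list) -> float:
    """Mean fraction of fields on which each pair of annotators (or models) agrees.

    Each element of `annotations` is a dict mapping field name -> extracted value,
    all sharing the same keys.
    """
    if len(annotations) < 2:
        raise ValueError("need at least two annotators")
    fields = list(annotations[0].keys())
    scores = []
    for a, b in combinations(annotations, 2):
        matches = sum(a[f] == b[f] for f in fields)
        scores.append(matches / len(fields))
    return sum(scores) / len(scores)
```

Computing this separately for the group of human annotators and the group of LLMs gives two comparable agreement scores per post.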

Conclusions:

This study not only validates the use of LLMs for analyzing digital healthcare communications but also opens new avenues for understanding online behaviors and interactions, signaling a shift towards more sophisticated methodologies in digital research and practice.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.