Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Mar 17, 2025
Open Peer Review Period: Mar 17, 2025 - May 12, 2025
Date Accepted: Sep 5, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Assessing Large Language Models in Building a Structured Dataset from Reddit Data: A Methodological Study
ABSTRACT
Background:
In an era of growing reliance on digital platforms for healthcare consultation, the subreddit r/AskDocs has emerged as a pivotal forum. However, forum data are vast and unstructured, and extracting and meaningfully analyzing them requires advanced tools that can handle the complexities of language and context inherent in user-generated content.
Objective:
Our objective was to evaluate the use of large language models (LLMs) to systematically transform the rich, unstructured textual data from r/AskDocs into a structured dataset, an approach that aligns more closely with human cognitive processes than traditional data extraction methods do.
Methods:
We developed a dataset of Reddit posts from r/AskDocs in which human annotators extracted key information. We then used specially engineered prompts with state-of-the-art LLMs to extract the same information from the posts and compared the results with the human annotations. Variation among the LLMs was further compared with that of the human annotators to assess their similarity.
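The comparison step described above can be illustrated with a minimal sketch. It assumes that each post yields one record of extracted fields and that agreement is scored by exact match after light normalization. The abstract does not name the fields, the metric, or the models, so the function `field_agreement`, the field names `age` and `sex`, and the toy records below are all hypothetical and are not taken from the study.

```python
from typing import Dict, List, Optional

# One extracted record per post: field name -> extracted value (None if absent).
Record = Dict[str, Optional[str]]


def normalize(value: Optional[str]) -> Optional[str]:
    """Lowercase and trim a value so that 'F' and ' f ' count as the same answer."""
    if value is None:
        return None
    text = value.strip().lower()
    return text or None


def field_agreement(
    human: List[Record], model: List[Record], fields: List[str]
) -> Dict[str, float]:
    """Return, per field, the share of posts where model and human values match."""
    if len(human) != len(model):
        raise ValueError("human and model lists must describe the same posts")
    if not human:
        raise ValueError("at least one annotated post is required")
    rates: Dict[str, float] = {}
    for field in fields:
        matches = sum(
            1
            for h, m in zip(human, model)
            if normalize(h.get(field)) == normalize(m.get(field))
        )
        rates[field] = matches / len(human)
    return rates


# Toy, made-up records for illustration only (not data from the study).
human_labels: List[Record] = [
    {"age": "34", "sex": "F"},
    {"age": "52", "sex": "M"},
    {"age": None, "sex": "F"},
    {"age": "19", "sex": "M"},
]
model_labels: List[Record] = [
    {"age": "34", "sex": "f"},
    {"age": "52", "sex": "M"},
    {"age": "27", "sex": "F"},
    {"age": "19", "sex": None},
]

rates = field_agreement(human_labels, model_labels, ["age", "sex"])
print(rates)
```

Exact match is only one possible scoring rule. Free-text fields such as symptoms or context would likely need a more tolerant comparison, and a chance-corrected statistic may be preferable when a field has few possible values.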
Results:
Our findings indicate that LLMs not only match but, in several aspects, surpass even highly educated humans in extracting information, including both demographic and context details, from unstructured texts.
Conclusions:
This study not only validates the use of LLMs for analyzing digital healthcare communications but also opens new avenues for understanding online behaviors and interactions, signaling a shift towards more sophisticated methodologies in digital research and practice.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.