Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jan 6, 2025
Date Accepted: Jul 9, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Subreddit to Symptomatology: A Lexicon-based Approach to Extract Symptoms of Complex Conditions from Online Discourse
ABSTRACT
Background:
Millions of people affected by complex medical conditions with diverse symptoms often turn to online discourse to share their experiences. While some studies have explored natural language processing (NLP) methods and medical information extraction tools, these typically focus on generic symptoms in clinical notes and struggle to identify subtle, disease-specific symptoms in the informal language used on social media.
Objective:
We aim to extract disease-specific symptoms from people's lived experiences shared on social media and to explore their characteristics, prevalence, and occurrence patterns.
Methods:
We propose a lexicon-based symptom extraction (LSE) method to identify a comprehensive list of disease symptoms, capturing nuanced mentions in social media posts. We first used a large language model (LLM) to automate the extraction of symptom-related keyphrases that form the lexicon, and we evaluated the effectiveness of this lexicon extraction against human annotation using the Jaccard index. We then learned representations of these symptom-related keyphrases with embedding models such as BERT-base, BioBERT, and Phrase-BERT, and grouped them with clustering techniques such as k-means and HDBSCAN. We chose BioBERT-based k-means clustering to characterize the unique symptoms. Additionally, we applied symptom normalization to eliminate duplicate and redundant entries in the comprehensive symptom list. We evaluated the outcome against prevalent baselines and major health guidelines.
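As an illustration of the Jaccard index mentioned above, the following minimal sketch compares a set of LLM-extracted keyphrases with a set of human-annotated ones. The function names, the whitespace and case normalization, and the example phrases are all assumptions made for illustration; they are not taken from the authors' implementation or data.

```python
def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match.

    This normalization step is an assumption, not the authors' procedure.
    """
    return " ".join(phrase.lower().split())


def jaccard_index(extracted, annotated) -> float:
    """Return |A ∩ B| / |A ∪ B| for two collections of keyphrases."""
    a = {normalize(p) for p in extracted}
    b = {normalize(p) for p in annotated}
    if not a and not b:
        # Two empty sets are treated as identical by convention here.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical keyphrases, invented for this example only.
llm_phrases = ["Irregular periods", "hair loss", "weight gain"]
human_phrases = ["irregular  periods", "hair loss", "acne"]

score = jaccard_index(llm_phrases, human_phrases)
print(f"Jaccard index: {score:.2f}")
```

In this toy case two of the four distinct phrases are shared, so the score is 0.50. A score of 1 would mean the two lexicons agree exactly, and 0 would mean they share no phrase.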
Results:
On a real-world Polycystic Ovary Syndrome (PCOS) subreddit dataset, we find that LSE significantly outperforms state-of-the-art baselines, achieving F1 scores at least 41% and 20% higher than automatic medical extraction tools and LLMs, respectively. Notably, the comprehensive list of 64 PCOS symptoms generated by LSE extensively covers the symptoms reported in 7 reputable e-health forums. Analyzing PCOS symptomatology reveals 28 emerging symptoms, including 8 self-reported comorbidities co-occurring with PCOS.
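For readers unfamiliar with the F1 score reported above, the sketch below shows one common way to compute it when an extracted symptom list is compared with a reference list as sets. Whether the authors scored matches exactly this way is an assumption; the function name and the example symptom lists are hypothetical.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 for an extracted symptom list."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical lists, invented for this example only.
predicted = ["acne", "hirsutism", "fatigue", "insomnia"]
reference = ["acne", "hirsutism", "fatigue", "weight gain", "hair loss"]

p, r, f1 = precision_recall_f1(predicted, reference)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Here three of the four predicted symptoms are correct (precision 0.75) and three of the five reference symptoms are found (recall 0.60), giving an F1 of about 0.67.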
Conclusions:
The comprehensive patient-reported disease-specific symptom list potentially helps patients and health practitioners resolve the uncertainties surrounding the disease, eliminating the variability of PCOS symptoms prevailing in the community. Analyzing PCOS symptomatology across varied dimensions provides valuable insights for public health research.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.