Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jan 6, 2025
Date Accepted: Jul 9, 2025
Subreddit to Symptomatology: A Lexicon-based Approach to Extract Symptoms of Complex Conditions from Online Discourse
ABSTRACT
Background:
Millions of people affected by complex medical conditions with diverse symptoms often turn to online discourse to share their experiences. While some studies have explored natural language processing (NLP) methods and medical information extraction tools, these typically focus on generic symptoms in clinical notes and struggle to identify the subtle, disease-specific, patient-reported symptoms found in online health discourse.
Objective:
We aim to extract patient-reported, disease-specific symptoms shared on social media, which reflect the lived experiences of thousands of affected individuals, and to explore the characteristics, prevalence, and occurrence patterns of these symptoms.
Methods:
We propose a lexicon-based symptom extraction (LSE) method to identify a diverse list of disease-specific, patient-reported symptoms. We first use a large language model (LLM) to accelerate the extraction of symptom-related keyphrases that form the lexicon, and we evaluate the effectiveness of this lexicon extraction against human annotation using the Jaccard index. We then use BERT-base, BioBERT, and Phrase-BERT embeddings to learn representations of these symptom-related keyphrases and cluster similar symptoms using k-means and HDBSCAN. Among the options explored in our experiments, BioBERT-based k-means clustering was the most effective. Finally, we apply symptom normalization to eliminate duplicate and redundant entries in the comprehensive symptom list.
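As an illustration of the lexicon evaluation step, the following minimal sketch computes a Jaccard index between a set of LLM-extracted keyphrases and a set of human-annotated keyphrases. The abstract does not state how phrases are matched, so exact matching after lowercasing and whitespace normalization is an assumption made here, as are the function names `normalize` and `jaccard_index` and the example phrases, which are hypothetical and not drawn from the study data.

```python
def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match.

    Assumption: the study's actual matching rule is not given in the abstract.
    """
    return " ".join(phrase.lower().split())


def jaccard_index(extracted, annotated) -> float:
    """Return |A ∩ B| / |A ∪ B| for two collections of keyphrases."""
    a = {normalize(p) for p in extracted}
    b = {normalize(p) for p in annotated}
    if not a and not b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical example phrases, not taken from the study's dataset.
llm_phrases = ["Irregular periods", "hair loss", "weight gain", "acne"]
human_phrases = ["irregular  periods", "hair loss", "fatigue", "acne"]

score = jaccard_index(llm_phrases, human_phrases)
print(f"Jaccard index: {score:.2f}")
```

In this toy case the two sets share three phrases out of five distinct phrases overall, so a score closer to 1 would indicate closer agreement between the LLM-assisted lexicon and the human annotation.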
Results:
On a real-world Polycystic Ovary Syndrome (PCOS) subreddit dataset, LSE significantly outperforms state-of-the-art baselines, achieving at least 41% and 20% higher F1 scores (avg. 86.10) than automatic medical extraction tools and LLMs, respectively. Notably, the comprehensive list of 64 PCOS symptoms generated by LSE provides extensive coverage of the symptoms reported in 7 reputable e-health forums. Analyzing PCOS symptomatology reveals 28 potentially emerging symptoms and 8 self-reported comorbidities co-occurring with PCOS.
Conclusions:
The comprehensive, patient-reported, disease-specific symptom list could help patients and health practitioners resolve the uncertainties surrounding the disease and address the variability of PCOS symptoms reported across the community. Analyzing PCOS symptomatology across varied dimensions provides valuable insights for public health research.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.