JMIR Preprints #70940: Subreddit to Symptomatology: A Lexicon-based Approach to Extract Symptoms of Complex Conditions from Online Discourse

Bushra Hossain;
Sarah M Preum;
Md Fazle Rabbi;
Rifat Ara;
Mohammed Eunus Ali

ABSTRACT

Background:

Millions of people affected by complex medical conditions with diverse symptoms often turn to online discourse to share their experiences. While some studies have explored natural language processing (NLP) methods and medical information extraction tools, these typically focus on generic symptoms in clinical notes and struggle to identify subtle, patient-reported, disease-specific symptoms in online health discourse.

Objective:

We aim to extract patient-reported, disease-specific symptoms shared on social media, which reflect the lived experiences of thousands of affected individuals, and to explore the characteristics, prevalence, and occurrence patterns of these symptoms.

Methods:

We propose a lexicon-based symptom extraction (LSE) method to identify a diverse list of disease-specific, patient-reported symptoms. We first use a large language model (LLM) to accelerate the extraction of symptom-related keyphrases that form the lexicon, and we evaluate the effectiveness of this lexicon extraction against human annotation using the Jaccard index. We then use BERT-base, BioBERT, and Phrase-BERT embeddings to learn representations of these symptom-related keyphrases and cluster similar symptoms using k-means and HDBSCAN. Among the options explored in our experiments, BioBERT-based k-means clustering was the most effective. Finally, we apply symptom normalization to eliminate duplicate and redundant entries in the comprehensive symptom list.
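
As an illustration of the Jaccard index used to compare LLM-extracted keyphrases with human annotation, the following is a minimal sketch. The function name, the case and whitespace normalization, and the example keyphrases are assumptions made for illustration; the abstract does not specify how phrases were matched.

```python
def jaccard_index(predicted, reference):
    """Jaccard index |A ∩ B| / |A ∪ B| between two keyphrase collections."""
    # Light normalization (an assumption): ignore case and surrounding spaces.
    a = {p.strip().lower() for p in predicted}
    b = {r.strip().lower() for r in reference}
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical keyphrases, not taken from the study's data.
llm_keyphrases = ["irregular periods", "Hair loss", "acne", "fatigue"]
human_keyphrases = ["irregular periods", "hair loss", "weight gain", "acne"]

score = jaccard_index(llm_keyphrases, human_keyphrases)
print(score)  # 3 shared phrases out of 5 distinct ones -> 0.6
```

A score of 1.0 would mean the two keyphrase sets are identical, and 0.0 would mean they share nothing.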

Results:

On a real-world Polycystic Ovary Syndrome (PCOS) subreddit dataset, LSE significantly outperforms state-of-the-art baselines, achieving F1 scores (avg. 86.10) at least 41% and 20% higher than automatic medical extraction tools and LLMs, respectively. Notably, the comprehensive list of 64 PCOS symptoms generated by LSE provides extensive coverage of the symptoms reported in 7 reputable e-health forums. Analyzing PCOS symptomatology reveals 28 potentially emerging symptoms and 8 self-reported comorbidities co-occurring with PCOS.
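
For readers less familiar with the F1 metric, the sketch below shows one common way to compute precision, recall, and F1 for a set of extracted symptoms against a reference set. Exact set matching, the function name, and the example symptoms are assumptions for illustration; the evaluation protocol actually used in the study is not described in the abstract.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 using exact-match comparison."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical symptom sets, not taken from the study's data.
extracted = ["acne", "hair loss", "fatigue", "bloating"]
reference = ["acne", "hair loss", "weight gain"]

p, r, f1 = precision_recall_f1(extracted, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case, 2 of the 4 extracted symptoms are correct and 2 of the 3 reference symptoms are found, so F1 equals 4/7.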

Conclusions:

The comprehensive list of patient-reported, disease-specific symptoms can potentially help patients and health practitioners resolve the uncertainties surrounding the disease and eliminate the variability in PCOS symptoms prevailing in the community. Analyzing PCOS symptomatology across varied dimensions provides valuable insights for public health research.

Citation

Please cite as:

Hossain B, Preum SM, Rabbi MF, Ara R, Ali ME

Extracting Symptoms of Complex Conditions From Online Discourse (Subreddit to Symptomatology): Lexicon-Based Approach

JMIR Med Inform 2025;13:e70940

DOI: 10.2196/70940

PMID: 40939164

PMCID: 12475878

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jan 6, 2025

Date Accepted: Jul 9, 2025
