JMIR Preprints #67513: Context-Aware Biomedical Word Embeddings Enhance ADR Prediction: A Shift from Word2Vec to BERT


Context-Aware Biomedical Word Embeddings Enhance ADR Prediction: A Shift from Word2Vec to BERT

Woohyuk Jeon;
Minjae Park;
Doyeon An;
Wonshik Nam;
Ju-Young Shin;
Seunghee Lee;
Suehyun Lee

ABSTRACT

Background:

Adverse drug reactions (ADRs) pose serious risks to patient health, and predicting and managing them effectively is an important public health challenge. Biomedical text is complex and highly domain specific, and the traditional context-independent language model Word2Vec cannot fully reflect that domain specificity. Therefore, to predict drug-side effect relationships more accurately, we applied a Bidirectional Encoder Representations from Transformers (BERT) model specialized for biomedical applications.

Objective:

This study aimed to propose a method for extracting drug-side effect relationships from embedding vectors generated by biomedical language models, specifically BERT-based models pre-trained on biomedical corpora. The approach is intended to overcome the limitations of the traditional Word2Vec model in capturing complex relationships in biomedical data.

Methods:

Using 158,096 known drug-side effect pairs from the Side Effect Resource (SIDER) database, we generated an adjacency matrix and calculated the cosine similarity between the word embedding vectors of drugs and side effects. From this similarity, relation scores were calculated for a total of 8,235,435 drug-side effect pairs. To evaluate prediction accuracy, the area under the curve (AUC) was measured by comparing the calculated relation scores against the 158,096 known drug-side effect relationships provided by SIDER.
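The scoring and evaluation steps described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the relation score is derived from the cosine similarity, so the sketch simply uses the cosine similarity itself as the score. The embedding vectors, the drug and side effect names, and the helper names cosine_similarity and roc_auc are all hypothetical. The AUC is computed with the rank-sum (Mann-Whitney) identity, with known pairs as positives and all other pairs as negatives.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)

def roc_auc(labels, scores):
    """AUC via the rank-sum identity; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative pair")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical 2-dimensional embeddings; real models produce far larger vectors.
drug_vecs = {"drug_a": [1.0, 0.0], "drug_b": [0.0, 1.0]}
effect_vecs = {
    "effect_x": [1.0, 0.0],
    "effect_y": [0.0, 1.0],
    "effect_z": [1.0, 1.0],
}
# Hypothetical known relationships, standing in for the SIDER pairs.
known = {("drug_a", "effect_x"), ("drug_b", "effect_y")}

# Score every drug-side effect pair, then label the known pairs as positives.
pairs = [(d, e) for d in drug_vecs for e in effect_vecs]
scores = [cosine_similarity(drug_vecs[d], effect_vecs[e]) for d, e in pairs]
labels = [1 if p in known else 0 for p in pairs]
auc = roc_auc(labels, scores)
```

In this toy example the two known pairs receive the highest scores, so the AUC is at its maximum. At the scale reported in the abstract, a vectorized numerical library would be a more practical choice than pure Python loops.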

Results:

The clagator/biobert_v1.1 model achieved an AUC of 0.915 at an optimal threshold of 0.289, substantially outperforming the existing Word2Vec model, which had an AUC of 0.848. The BERT-based model pre-trained on the biomedical corpus also outperformed the vanilla BERT model, which had an AUC of 0.857. Furthermore, external validation with the FDA Adverse Event Reporting System (FAERS) data, using Fisher’s exact test based on 8,235,435 predicted drug-side effect pairs and 901,361 known relationships, confirmed high statistical significance (P<.001) with an odds ratio of 4.830. Additionally, a literature review was conducted for predicted drug-side effect relationships. This review revealed that these relationships had been reported in studies published after 2016.
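The external validation statistics can be illustrated on a 2x2 contingency table that cross-tabulates predicted pairs against relationships known in an external source. The sketch below is an illustration only: the abstract does not report the table's cell counts or whether the test was one-sided or two-sided, so the counts are toy values, a one-sided test is assumed, and the helper names odds_ratio and fisher_exact_greater are hypothetical.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the table [[a, b], [c, d]]."""
    if b == 0 or c == 0:
        return float("inf")
    return (a * d) / (b * c)

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact P value for the table [[a, b], [c, d]].

    With both margins fixed, the top-left cell follows a hypergeometric
    distribution under the null hypothesis of no association. The P value
    is the probability of a top-left count at least as large as observed.
    """
    row1 = a + b          # e.g. pairs predicted as related
    col1 = a + c          # e.g. pairs known in the external source
    n = a + b + c + d     # all pairs
    upper = min(row1, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, col1)

# Toy counts (assumption), laid out as:
#                  known   not known
# predicted          3         1
# not predicted      1         3
a, b, c, d = 3, 1, 1, 3
toy_or = odds_ratio(a, b, c, d)
toy_p = fisher_exact_greater(a, b, c, d)
```

For tables with millions of pairs, exact binomial coefficients become expensive, and a library routine such as scipy.stats.fisher_exact is the more practical option.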

Conclusions:

This study introduces a method for extracting the drug-side effect relationship information embedded in the parameters of language models pre-trained on biomedical corpora and for using this information to predict the probability of previously unknown drug-side effect relationships. Using BERT-based models instead of the Word2Vec model improved the accuracy of drug-side effect relationship prediction. BERT-based models pre-trained on biomedical corpora take contextual information into account and achieved better performance on this task. External validation using the FAERS dataset showed high statistical significance, and a literature review of selected cases supported the predicted relationships, demonstrating the practical applicability of this approach. These results highlight the utility of natural language processing-based approaches for predicting and managing ADRs.

Citation

Please cite as:

Jeon W, Park M, An D, Nam W, Shin JY, Lee S, Lee S

Predicting Drug–Side Effect Relationships From Parametric Knowledge Embedded in Biomedical BERT Models: Methodological Study With a Natural Language Processing Approach

JMIR Med Inform 2025;13:e67513

DOI: 10.2196/67513

PMID: 40638775

PMCID: 12287980


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Oct 14, 2024

Open Peer Review Period: Oct 21, 2024 - Dec 16, 2024

Date Accepted: May 13, 2025
