Accepted for/Published in: JMIR Formative Research
Date Submitted: May 28, 2023
Date Accepted: Feb 6, 2024
A Machine Learning–Based Approach for Identifying Research Gaps: COVID-19 as a Case Study
ABSTRACT
Background:
Research gaps refer to unanswered questions in the existing body of knowledge, due either to a lack of studies or to inconclusive results. Research gaps are essential starting points and a motivation for scientific research. Traditional methods for identifying research gaps, such as literature reviews and expert opinions, can be time-consuming, labor-intensive, and prone to bias. They may also fall short when dealing with rapidly evolving or time-sensitive subjects. Thus, innovative, scalable approaches are needed to identify research gaps, systematically assess the literature, and prioritize areas for further study within a topic of interest.
Objective:
In this paper, we propose a machine learning-based approach for identifying research gaps through the analysis of scientific literature. We used the novel coronavirus (COVID-19) pandemic as a case study.
Methods:
We conducted an analysis to identify research gaps in COVID-19 literature, utilizing the CORD-19 dataset, which comprises 1,121,433 articles related to the COVID-19 pandemic. Our approach is based on the BERTopic topic modeling technique, which leverages transformers and c-TF-IDF (class-based term frequency-inverse document frequency) to create dense clusters allowing for easily interpretable topics. Our BERTopic-based approach involves three stages: embedding documents, clustering documents (dimension reduction and clustering), and representing topics (generating candidates and maximizing candidate relevance).
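The topic representation stage can be illustrated with a minimal sketch of c-TF-IDF. It is not the authors' code, and it assumes the weighting commonly described for BERTopic, W(t,c) = tf(t,c) * log(1 + A / f(t)), where tf(t,c) is the count of term t in cluster c, f(t) is the count of t across all clusters, and A is the average number of tokens per cluster. The function names and the toy clusters below are hypothetical; in practice the BERTopic library performs the embedding, clustering, and weighting steps internally.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster.

    clusters: dict mapping a cluster id to a list of tokenized documents.
    All documents in a cluster are merged and treated as one "class document".
    """
    # Term counts per cluster (tf(t, c)).
    tf = {
        cid: Counter(tok for doc in docs for tok in doc)
        for cid, docs in clusters.items()
    }
    # Term counts across all clusters (f(t)).
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of tokens per cluster (A).
    avg_tokens = sum(total.values()) / len(tf)
    return {
        cid: {t: f * math.log(1 + avg_tokens / total[t]) for t, f in counts.items()}
        for cid, counts in tf.items()
    }


def top_terms(scores, n=3):
    """Return the n highest-scoring terms per cluster (ties broken alphabetically)."""
    return {
        cid: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for cid, s in scores.items()
    }


# Hypothetical toy clusters standing in for clustered abstracts.
toy_clusters = {
    "vaccine": [["vaccine", "dose", "trial"], ["vaccine", "efficacy", "covid"]],
    "mental": [["anxiety", "lockdown", "covid"], ["anxiety", "depression", "students"]],
}
scores = c_tf_idf(toy_clusters)
print(top_terms(scores, n=2))
```

In this toy example, a term shared by both clusters ("covid") receives a lower weight than a term of equal count that is specific to one cluster, which is the property that makes the resulting topic labels easier to interpret.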
Results:
After applying the study selection criteria, we included 33,206 abstracts in the analysis. The final list of research gaps comprised 21 areas, which were grouped into 6 principal topics: Virus of COVID-19, Risk Factors of COVID-19, Prevention of COVID-19, Treatment of COVID-19, Healthcare Delivery during COVID-19, and Impact of COVID-19. The most prominent topic, observed in over half of the analyzed studies, was “Impact of COVID-19.”
Conclusions:
The proposed machine learning-based approach has the potential to identify research gaps in scientific literature. This study is not intended to replace individual literature research within a selected topic. Instead, it can serve as a guide to formulate precise literature search queries in specific areas associated with research questions that previous publications have earmarked for future exploration. Future research should leverage an up-to-date list of studies retrieved from the most common databases in the target area. When feasible, full texts or, at minimum, discussion sections should be analyzed, rather than limiting the analysis to abstracts. Furthermore, future studies could evaluate more efficient modeling algorithms, especially those combining topic modeling with statistical uncertainty quantification such as conformal prediction.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.