Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jan 25, 2021
Date Accepted: Nov 10, 2021
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Improving the evidence-based clinical decision-making process: Interactive classification and topic discovery on diabetes-related biomedical literature
ABSTRACT
Background:
The amount of available textual health data, such as scientific and biomedical literature, is constantly growing, making it increasingly challenging for health professionals to properly summarize those data and, consequently, to practice evidence-based clinical decision-making. Moreover, exploring large unstructured health text data is challenging for nonexperts because of limited time, resources, and skills. Current tools for exploring text data lack ease of use, require high computational effort, and have difficulty incorporating domain knowledge and focusing on topics of interest.
Objective:
We developed a methodology to explore and target topics of interest via an interactive user interface for experts and nonexperts. We aimed to reach near state-of-the-art performance while reducing memory consumption, increasing scalability, and minimizing user interaction effort, in order to improve the clinical decision-making process. Performance was evaluated on diabetes-related abstracts from PubMed.
Methods:
The methodology consists of four parts: 1) a novel interpretable hierarchical clustering of documents, in which each node is defined by headwords (the words that best describe the documents in that node); 2) an efficient classification system to target topics; 3) minimized user interaction effort through active learning; and 4) a visual user interface through which the user interacts. We evaluated our approach on 50,911 diabetes-related abstracts from PubMed, which are annotated with hierarchical Medical Subject Headings (MeSH) codes, each a unique identifier for a topic. Hierarchical clustering performance was compared against the implementation in the machine learning library scikit-learn. On a subset of 2000 randomly chosen diabetes abstracts, our active learning strategy was compared against three other strategies: random selection of training instances; uncertainty sampling, which chooses the instances the model is most uncertain about; and an expected gradient length strategy based on convolutional neural networks (CNN).
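To illustrate the uncertainty sampling baseline mentioned above, the following is a minimal sketch of one common variant (least-confidence selection). It is not the authors' implementation: the function name `least_confident`, the two-class setup, and the probability values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def least_confident(probabilities: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k instances whose highest predicted
    class probability is lowest, i.e. where the model is least confident."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:k]


# Hypothetical predicted class probabilities for four unlabeled abstracts.
pool = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
]

# The two abstracts the model is least sure about are sent for labeling.
selected = least_confident(pool, k=2)
print(selected)
```

In an active learning loop, the selected instances would be labeled by the user, added to the training set, and the classifier retrained before the next selection round.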
Results:
For hierarchical clustering, we achieved an F1-score of 0.73, compared with 0.76 for scikit-learn. For active learning, after 200 training samples had been chosen by each strategy, the weighted F1-score over all MeSH codes was 0.62 for our approach, compared with 0.61 for the uncertainty strategy, 0.61 for the CNN-based strategy, and 0.45 for the random strategy. Moreover, our methodology showed constantly low memory use as the number of documents increased, but at the cost of increased execution time.
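As a reminder of how a weighted F1-score aggregates per-class results, the sketch below computes it for a single-label setting: each class's F1 is weighted by its share of the true labels. The single-label simplification, the function name `weighted_f1`, and the placeholder labels "A" and "B" are assumptions made for illustration; the evaluation over MeSH codes reported above may differ in detail.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Average the per-class F1 scores, weighting each class by its support
    (the number of true instances of that class)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical labels: three abstracts of class "A", one of class "B".
y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Here class "A" has an F1 of 0.8 and class "B" an F1 of about 0.67; because "A" accounts for three of the four true labels, it dominates the weighted result.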
Conclusions:
We proposed an easy-to-use tool for experts and nonexperts that combines domain knowledge with topic exploration and targets specific topics of interest while improving transparency. Furthermore, our approach is memory efficient and highly parallelizable, making it attractive for large data sets. This approach can be used by health professionals to rapidly gain deep insights into biomedical literature and ultimately improve the evidence-based clinical decision-making process.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.