Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Dec 7, 2022
Open Peer Review Period: Dec 7, 2022 - Dec 27, 2022
Date Accepted: Sep 6, 2023

The final, peer-reviewed published version of this preprint can be found here:

Zhang Y, Li X, Liu Y, Li A, Yang X, Tang X

A Multilabel Text Classifier of Cancer Literature at the Publication Level: Methods Study of Medical Text Classification

JMIR Med Inform 2023;11:e44892

DOI: 10.2196/44892

PMID: 37796584

PMCID: 10587805

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

A method of multi-label text classifier at the publication level for cancer literature

  • Ying Zhang
  • Xiaoying Li
  • Yi Liu
  • Aihua Li
  • Xuemei Yang
  • Xiaoli Tang

ABSTRACT

Background:

Cancer poses a serious threat to human health, and the volume of data in the cancer field is growing rapidly as interdisciplinary and cooperative research attracts increasing attention. Classifying reported research at the journal level offers too low a resolution to meet advanced research demands, and a single label does not adequately characterize a publication. A higher-resolution multi-label classifier is therefore needed to support cancer research.

Objective:

This paper presents a scalable multi-label classifier that classifies cancer research literature directly at the publication level and assigns appropriate content-based labels, supporting classification at the highest resolution. The model could support academic statistics and address the low resolution of subject classification in cancer research that results from the ambiguity of journal-level classifiers.

Methods:

We propose a probabilistic classifier for literature classification based on a “BERT + X” model and identify TextRNN as the best option for “X.” First, a corpus of 50,000 records collected from DIMENSIONS was divided into a training set and a test set at a ratio of 7:3. Second, using ICRP CT, a classification scheme for cancer, we compared the performance of classifiers that combine BERT with classical deep learning models, namely recurrent neural networks (RNN), convolutional neural networks (CNN), TextRNN, TextCNN, and FastText, and analyzed their metrics. Finally, through visualization and statistical analysis, we concluded that “BERT + TextRNN” is the best fit for a publication-level multi-label classifier of cancer research and of other areas with similar text structure and label distribution.
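
As a rough illustration of the data preparation described above, the following sketch splits a corpus at a 7:3 ratio and encodes each record's labels as a multi-hot vector, the usual target format for a multi-label classifier. The record layout, the label names, and the helpers `split_corpus` and `multi_hot` are assumptions made for illustration; the abstract does not state how the split was drawn or how labels were encoded.

```python
import random


def split_corpus(records, train_parts=7, test_parts=3, seed=42):
    """Shuffle a copy of the records and split them train:test by integer parts.

    The 7:3 ratio comes from the abstract; the shuffling and the seed are
    assumptions of this sketch.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


def multi_hot(labels, label_index):
    """Encode a list of label names as a 0/1 vector over the full label set."""
    vector = [0] * len(label_index)
    for label in labels:
        vector[label_index[label]] = 1
    return vector


# Hypothetical toy label set and corpus (not the ICRP CT scheme or real data).
LABELS = ["breast cancer", "lung cancer", "leukemia"]
label_index = {name: i for i, name in enumerate(LABELS)}

corpus = [
    {"text": f"abstract {i}", "labels": [LABELS[i % len(LABELS)]]}
    for i in range(10)
]
# A publication may carry more than one label.
corpus[0]["labels"] = ["breast cancer", "lung cancer"]

train_set, test_set = split_corpus(corpus)
train_targets = [multi_hot(r["labels"], label_index) for r in train_set]
```

Using integer parts rather than a floating-point ratio keeps the cut point exact for any corpus size.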

Results:

Using the “BERT + X” approach, we trained a multi-label classifier that labels literature directly at the publication level rather than categorizing from coarse to fine. After comparing the constructed models, we selected the optimal model, “BERT + TextRNN,” which could be applied directly in production and research, with precision (P) = 0.9142, recall (R) = 0.8560, and F1 = 0.8842. We also examined why the model is effective in the cancer field, found that articles published in this field have distinctive text structure and label distribution, and concluded through quantitative analysis that the model has the potential to generalize to other fields with similar characteristics.
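
For readers unfamiliar with how precision, recall, and F1 are computed for multi-label output, here is a minimal sketch. It assumes micro-averaging over all (publication, label) pairs and a 0.5 cutoff on predicted probabilities; the abstract does not state which averaging scheme or threshold was used, and the helper names and toy values are hypothetical.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn per-label probabilities into 0/1 predictions (cutoff is an assumption)."""
    return [[1 if p >= cutoff else 0 for p in row] for row in probabilities]


def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over all (sample, label) pairs."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1  # label predicted and present
            elif p == 1 and t == 0:
                fp += 1  # label predicted but absent
            elif p == 0 and t == 1:
                fn += 1  # label present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy multi-hot ground truth and made-up model probabilities for three labels.
y_true = [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
probs = [[0.9, 0.4, 0.1], [0.2, 0.8, 0.6], [0.7, 0.3, 0.9]]

y_pred = threshold(probs)
precision, recall, f1 = micro_prf(y_true, y_pred)
```

F1 is the harmonic mean of precision and recall, so it stays close to the lower of the two values when they diverge.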

Conclusions:

This paper presents a scalable and extensible model, based on “BERT + TextRNN,” that is suitable for high-resolution subject classification of cancer literature at the publication level. The model is also applicable to other literature consisting of highly professional, systematic, and uniformly standardized long-form text. Verification of the publication-level multi-label classifier indicates that it could provide effective support for academic statistics and clinical research.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.