Accepted for/Published in: JMIR Formative Research
Date Submitted: Apr 14, 2024
Date Accepted: Nov 28, 2024
Synthetic Data-Driven Approaches for Chinese Medical Abstract Sentence Classification: A Computational Study
ABSTRACT
Background:
Medical abstract sentence classification is essential for improving medical database searches, supporting literature reviews, and generating new abstracts. However, research on Chinese medical abstract classification is limited by a shortage of suitable Chinese datasets. Given the volume of Chinese medical literature and the unique value of traditional Chinese medicine, precise classification of these abstracts is important for global medical research.
Objective:
This study primarily aims to generate a large volume of labeled Chinese abstract sentences without manual annotation and to assemble them into new training datasets. Building on these datasets, we also aim to develop more accurate text classification algorithms.
Methods:
We developed three training datasets (Dataset#1, Dataset#2, Dataset#3) and a test dataset to evaluate our models. Dataset#1 contains 15,000 abstract sentences translated into Chinese from the PubMed dataset. Dataset#2 and Dataset#3, each containing 15,000 sentences, were generated using GPT-3.5 from the 40,000 Chinese medical abstracts in the CSL database. Dataset#2 used titles and keywords to derive pseudo-labels, and Dataset#3 aligned abstracts with category labels. The test dataset comprises 87,000 sentences from 20,000 abstracts. Using SBERT embeddings to capture sentence semantics, we evaluated a clustering method (SBERT-DocSCAN) and a supervised method (SBERT-MEC), and validated their effectiveness and robustness through ablation studies and feature analysis.
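As a rough illustration of how a classifier can sit on top of fixed sentence embeddings, the following minimal sketch fits one centroid per category and assigns a new sentence to the most similar centroid. It is not SBERT-DocSCAN or SBERT-MEC: the 3-dimensional toy vectors stand in for real SBERT embeddings, the nearest-centroid rule is a swapped-in simplification, and the category names are hypothetical, since the abstract does not list the label set.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fit_centroids(embeddings, labels):
    """Average the embedding vectors of each label into one centroid."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Toy vectors standing in for SBERT sentence embeddings (assumption).
train_vecs = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.1],   # hypothetical "background" sentences
    [0.1, 0.9, 0.0], [0.2, 0.8, 0.1],   # hypothetical "methods" sentences
    [0.0, 0.1, 0.9], [0.1, 0.2, 0.8],   # hypothetical "results" sentences
]
train_labels = ["background", "background", "methods",
                "methods", "results", "results"]

centroids = fit_centroids(train_vecs, train_labels)
print(predict(centroids, [0.15, 0.85, 0.05]))
```

In practice the toy vectors would be replaced by embeddings from a sentence encoder, and the centroid rule by the clustering or supervised head actually used in the study.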
Results:
We trained the clustering and supervised models separately on each of the three training datasets and evaluated them on the test dataset. Both models outperformed the baselines. When trained on Dataset#1, the SBERT-DocSCAN model achieved an accuracy and F1 score of 89.85% on the test dataset, while the SBERT-MEC algorithm performed comparably, with an accuracy of 89.38% and an identical F1 score. When trained on Dataset#2, the SBERT-DocSCAN model achieved an accuracy and F1 score of 89.83%, while the SBERT-MEC algorithm reached an accuracy of 86.73% and an F1 score of 86.51%. Training on Dataset#3 gave the best results: the SBERT-DocSCAN model attained an accuracy and F1 score of 91.30%, and the SBERT-MEC algorithm obtained an accuracy of 90.39% and an F1 score of 90.35%. The ablation analysis further showed that the integrated features and methods each contribute to classification performance.
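For readers less familiar with the two reported metrics, the sketch below computes accuracy and F1 on a toy set of labels. The abstract does not state which F1 averaging scheme was used, so both micro- and macro-averaged variants are shown as assumptions; the labels and predictions are made up for illustration and are not data from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of sentences whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def _counts(y_true, y_pred, label):
    """True positives, false positives, false negatives for one label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp, fp, fn = _counts(y_true, y_pred, label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = fp = fn = 0
    for label in labels:
        c_tp, c_fp, c_fn = _counts(y_true, y_pred, label)
        tp, fp, fn = tp + c_tp, fp + c_fp, fn + c_fn
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels: B = background, M = methods, R = results (hypothetical).
y_true = ["B", "M", "M", "R", "R", "R"]
y_pred = ["B", "M", "R", "R", "R", "M"]

acc = accuracy(y_true, y_pred)
micro = micro_f1(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} micro_f1={micro:.4f} macro_f1={macro:.4f}")
```

For single-label classification, micro-averaged F1 always equals accuracy, whereas macro-averaged F1 weights every class equally and can differ when classes are imbalanced.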
Conclusions:
Our approach addresses the limited availability of datasets for classifying Chinese medical abstracts by generating new datasets. In addition, the SBERT-DocSCAN and SBERT-MEC models improve the accuracy of Chinese medical abstract classification. This improvement holds even when the models are trained on synthetic datasets with pseudo-labels, which indicates that the method can overcome dataset constraints.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.