Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Feb 14, 2025
Date Accepted: Dec 23, 2025
Classification of Cochrane Plain Language Summaries on Conclusiveness of Recommendations: Comparing BERT-based Language Models and ChatGPT
ABSTRACT
Background:
Cochrane Plain Language Summaries (PLSs) aim to make systematic review findings more accessible to the general public. However, inconsistencies in how conclusions are presented may impact comprehension and decision-making. Classifying PLSs based on conclusiveness can improve clarity and facilitate informed health decisions.
Objective:
To develop and evaluate deep learning language models for the classification of PLSs according to three levels of conclusiveness (conclusive, inconclusive, and unclear) and to compare their performance with a general-purpose large language model (ChatGPT-4o).
Methods:
We used a publicly available dataset of 4,405 Cochrane PLSs of systematic reviews published up to 2019, previously classified by human raters into nine categories of conclusiveness regarding the intervention's effectiveness and safety. We merged these categories into three classes based on the strength of conclusiveness: conclusive, inconclusive, and unclear. For fine-tuning, we used SciBERT, a language model pretrained on 1.14 million papers, primarily from the health sciences, and Longformer, a transformer model designed specifically to process long documents. The scripts were developed in the Python programming language with the PyTorch framework. We computed evaluation metrics with the scikit-learn machine learning library and determined the area under the receiver operating characteristic curve (AUCROC) to measure each model's trade-off between sensitivity and specificity. We also analysed a separate set of 213 PLSs, comparing the predictions of our fine-tuned models with both manual verification and outputs generated by ChatGPT.
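The per-class AUCROC and balanced-accuracy evaluation described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' code: the labels and predicted probabilities below are hypothetical stand-ins for a fine-tuned model's outputs over the three conclusiveness classes.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score
from sklearn.preprocessing import label_binarize

# Hypothetical gold labels and predicted class probabilities for eight PLSs;
# in the study these would come from the fine-tuned SciBERT or Longformer model.
classes = ["conclusive", "inconclusive", "unclear"]
y_true = np.array([0, 1, 2, 0, 1, 2, 0, 0])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
])

# One-vs-rest AUCROC per class, as reported for each conclusiveness class
y_bin = label_binarize(y_true, classes=[0, 1, 2])
per_class_auc = {
    name: roc_auc_score(y_bin[:, i], y_prob[:, i])
    for i, name in enumerate(classes)
}

# Balanced accuracy from the hard (argmax) predictions
y_pred = y_prob.argmax(axis=1)
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(per_class_auc, bal_acc)
```

Balanced accuracy (the mean of per-class recalls) is a reasonable headline metric here because the three conclusiveness classes are unlikely to be equally frequent in the corpus.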
Results:
After seven epochs of training, the SciBERT-based model achieved a balanced accuracy of 56.6%, with AUCROCs of 0.91, 0.67, and 0.75 for the 'conclusive', 'inconclusive', and 'unclear' classes, respectively. The Longformer-based model achieved a balanced accuracy of 60.9%, with AUCROCs of 0.86, 0.67, and 0.72 for the same classes. Both models underperformed compared with ChatGPT, which demonstrated higher accuracy (74.2%), better precision and recall, and a higher Cohen's kappa (0.57).
Conclusions:
Fine-tuning two transformer-based language models yielded mixed results in classifying Cochrane PLSs by conclusiveness. Performance was satisfactory for conclusive PLSs but limited in distinguishing inconclusive from unclear ones, likely due to semantic overlap and subtle linguistic differences between these classes. These findings suggest that while fine-tuned models show potential, general-purpose LLMs such as ChatGPT may currently offer more reliable results for practical classification tasks in biomedical applications.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.