Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Feb 14, 2025
Date Accepted: Dec 23, 2025
Classification of Cochrane Plain Language Summaries on Conclusiveness of Recommendations: Comparing BERT-based Language Models and ChatGPT
ABSTRACT
Background:
Cochrane Plain Language Summaries (PLSs) aim to make systematic review findings more accessible to the general public. However, inconsistencies in how conclusions are presented may impact comprehension and decision-making. Classifying PLSs based on conclusiveness can improve clarity and facilitate informed health decisions.
Objective:
To develop and evaluate deep learning language models for the classification of PLSs according to three levels of conclusiveness (conclusive, inconclusive, and unclear) and to compare their performance with a general-purpose large language model (ChatGPT-4o).
Methods:
We used a publicly available dataset of 4,405 Cochrane PLSs of systematic reviews published up to 2019, previously classified by human raters into nine categories of conclusiveness regarding the intervention's effectiveness and safety. We merged these categories into three classes based on the strength of conclusiveness: conclusive, inconclusive, and unclear. For fine-tuning, we used SciBERT, a language model pretrained on 1.14 million papers, primarily from the health sciences, and Longformer, a transformer model designed specifically to process long documents. The scripts were developed in the Python programming language with the PyTorch framework. We computed evaluation metrics with the scikit-learn machine learning library and determined the area under the receiver operating characteristic curve (AUCROC) to measure each model's trade-off between sensitivity and specificity. We also analysed a separate set of 213 PLSs, comparing the predictions of our fine-tuned models with both manual verification and outputs generated by ChatGPT.
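The per-class AUCROC and balanced-accuracy evaluation described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' code: the labels and predicted probabilities below are hypothetical stand-ins for a fine-tuned model's outputs over the three conclusiveness classes.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score
from sklearn.preprocessing import label_binarize

# Hypothetical gold labels and predicted class probabilities for eight PLSs;
# in the study these would come from the fine-tuned SciBERT or Longformer model.
classes = ["conclusive", "inconclusive", "unclear"]
y_true = np.array([0, 1, 2, 0, 1, 2, 0, 0])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
])

# One-vs-rest AUCROC per class, as reported for each conclusiveness class
y_bin = label_binarize(y_true, classes=[0, 1, 2])
per_class_auc = {
    name: roc_auc_score(y_bin[:, i], y_prob[:, i])
    for i, name in enumerate(classes)
}

# Balanced accuracy from the hard (argmax) predictions
y_pred = y_prob.argmax(axis=1)
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(per_class_auc, bal_acc)
```

Balanced accuracy (the mean of per-class recalls) is a reasonable headline metric here because the three conclusiveness classes are unlikely to be equally frequent in the corpus.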
Results:
After seven epochs of training, the SciBERT-based model achieved a balanced accuracy of 56.6%, with AUCROCs of 0.91, 0.67, and 0.75 for the 'conclusive', 'inconclusive', and 'unclear' classes, respectively. The Longformer-based model achieved a balanced accuracy of 60.9%, with AUCROCs of 0.86, 0.67, and 0.72 for the same classes. Both models underperformed compared with ChatGPT, which demonstrated higher accuracy (74.2%), better precision and recall, and a higher Cohen's kappa (0.57).
Conclusions:
Fine-tuning two transformer-based language models yielded mixed results in classifying Cochrane PLSs by conclusiveness. Performance was satisfactory for conclusive PLSs but limited in distinguishing inconclusive from unclear ones, likely due to semantic overlap and subtle linguistic differences between these classes. These findings suggest that while fine-tuned models show potential, general-purpose LLMs such as ChatGPT may currently offer more reliable results for practical classification tasks in biomedical applications.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.