
Accepted for/Published in: JMIR AI

Date Submitted: May 12, 2025
Open Peer Review Period: May 14, 2025 - Jul 9, 2025
Date Accepted: Mar 17, 2026

The final, peer-reviewed published version of this preprint can be found here:

Fine-Tuning and Benchmarking Transformer Models for Multiclass Classification of Clinical Research Papers: Retrospective Modeling Study

Zhou F, Lokker C, Parrish R, Haynes RB, Iorio A, Saha A, Afzal M

JMIR AI 2026;5:e77311

DOI: 10.2196/77311

PMID: 42061835

Fine-tuning and Benchmarking of Transformer Models for Multiclass Classification of Clinical Research Articles

  • Fangwen Zhou; 
  • Cynthia Lokker; 
  • Rick Parrish; 
  • R. Brian Haynes; 
  • Alfonso Iorio; 
  • Ashirbani Saha; 
  • Muhammad Afzal

ABSTRACT

Background:

The exponential growth of digital information has led to an unprecedented expansion in the volume of unstructured text data. Efficient classification of these articles is critical for timely evidence synthesis and informed decision-making in healthcare. Machine learning techniques have shown considerable promise for text classification tasks. However, multiclass classification of articles by study publication type has been largely overlooked compared to binary or multilabel classification. Addressing this gap could significantly enhance knowledge translation workflows and support systematic review processes.

Objective:

The objective of this study was to fine-tune and evaluate domain-specific transformer-based language models on a gold-standard dataset for multiclass classification of clinical literature into mutually exclusive categories: original study, review, evidence-based guideline, and non-experimental.

Methods:

The titles and abstracts of McMaster’s Premium LiteratUre Service (PLUS) dataset of 162,380 articles were used for fine-tuning 7 domain-specific transformers. Clinical experts classified articles into 4 mutually exclusive publication types. PLUS data were split 80:10:10 for training, validation, and testing, with Clinical Hedges used for external validation. A grid search evaluated the impact of class weight adjustments, learning rate, batch size, warmup ratio, and weight decay, totaling 1,890 configurations. Models were assessed using 10 metrics, including area under the receiver operating characteristic curve (AUROC), F1 score, and Matthews correlation coefficient (MCC). Performance of individual classes was assessed using a one-vs-rest approach, and overall performance was assessed using the macro average. Optimal models identified from validation results were further tested on both the PLUS and Clinical Hedges datasets, with calibration assessed visually.
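The grid search described above can be sketched as an enumeration over a hyperparameter dictionary. The value lists below are illustrative assumptions only; they are not the study's actual grid (which totaled 1,890 configurations):

```python
# Illustrative hyperparameter grid search enumeration; the specific values
# are hypothetical stand-ins, not the study's actual search space.
from itertools import product

grid = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "batch_size": [16, 32, 64, 128],
    "warmup_ratio": [0.0, 0.1],
    "weight_decay": [0.005, 0.01, 0.05],
    "class_weights": ["none", "balanced"],
}

# Cartesian product of all value lists -> one dict per configuration.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 3 * 4 * 2 * 3 * 2 = 144 in this toy grid
```

Each configuration would then be used to fine-tune a model on the training split and be scored on the validation split, with the best configurations carried forward to the held-out test sets.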

Results:

The 10 best-performing models achieved macro AUROC ≥0.99, F1 ≥0.89, and MCC ≥0.88 on the validation and test sets. Performance declined on Clinical Hedges. Models were consistently better at classifying original studies and reviews. BioBERT-based models had superior calibration performance, especially for original studies and reviews. Optimal configurations from the search included lower learning rates (1E-5 and 3E-5), mid-range batch sizes (32–128), and lower weight decay (0.005–0.010). Class weight adjustments improved recall but generally reduced performance on other metrics. Models generally struggled with accurately classifying non-experimental and guideline articles, potentially due to class imbalance and content heterogeneity.
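As an illustration of how the macro-averaged and one-vs-rest metrics reported here can be computed, the sketch below uses scikit-learn on random synthetic predictions (the data and class names are placeholders, not the study's outputs):

```python
# Hedged sketch: computing macro AUROC, macro F1, MCC, and per-class
# (one-vs-rest) F1 with scikit-learn. All data below is synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, matthews_corrcoef

CLASSES = ["original study", "review", "guideline", "non-experimental"]

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=200)        # gold-standard labels
probs = rng.dirichlet(np.ones(4), size=200)  # model class probabilities
y_pred = probs.argmax(axis=1)                # hard predictions

macro_auroc = roc_auc_score(y_true, probs, multi_class="ovr", average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
mcc = matthews_corrcoef(y_true, y_pred)

# Per-class F1 via the one-vs-rest decomposition (average=None).
for name, score in zip(CLASSES, f1_score(y_true, y_pred, average=None)):
    print(f"{name}: F1 = {score:.3f}")
print(f"macro AUROC = {macro_auroc:.3f}, macro F1 = {macro_f1:.3f}, MCC = {mcc:.3f}")
```

With random predictions these scores hover near chance level; the point is the metric plumbing, not the numbers.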

Conclusions:

This study used a comprehensive hyperparameter search to highlight the effectiveness of fine-tuned transformer models, notably BioBERT variants, for multiclass clinical literature classification. While class weighting generally decreased overall performance, addressing class imbalance through alternative methods such as hierarchical classification or targeted resampling warrants future exploration. Optimal hyperparameter configurations were crucial for robust performance, aligning with previous literature. These findings support future modeling research and practical deployment in human-in-the-loop systems to support knowledge synthesis and translation workflows using the optimal configurations found in this work.




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.