Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jul 31, 2025
Date Accepted: Apr 29, 2026
Understanding Transformer-based Classifications of Medical Text: Proof of Concept using LLM for Attribution of Feature Importance
ABSTRACT
Background:
Deep learning has demonstrated excellent performance in biomedical literature classification. However, the opacity of these models’ decision-making processes limits their interpretability and adoption. Explainable artificial intelligence (XAI) methods, including SHapley Additive exPlanations (SHAP) and integrated gradients (IG), have been proposed to address this issue, yet their computational cost remains high. Generative large language models (LLMs) may offer a novel approach for generating interpretable and context-aware explanations.
Objective:
To investigate the effectiveness of Generative Pre-trained Transformer 4o (GPT-4o) as a perturbation-based explainer for a BioLinkBERT text classifier by comparing the faithfulness of its explanations with those of the SHAP partition explainer and IG.
Methods:
A stratified sample of 200 articles from the McMaster PLUS and Clinical Hedges databases was classified by BioLinkBERT. GPT-4o, the SHAP partition explainer, and IG were used to generate token-level feature attributions. GPT-based explanations were derived through iterative masking perturbation. Explanations were evaluated using a modified version of the area over the perturbation curve (AOPC), correlation analyses, and qualitative assessment of feature importance attribution.
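For readers unfamiliar with AOPC, the following is a minimal sketch of the standard (unmodified) metric, not the modified version used in this study. It masks the highest-ranked tokens one at a time and averages the resulting drop in the classifier's predicted probability. The function names, the `[MASK]` placeholder, and the toy classifier are illustrative assumptions and are not taken from the manuscript.

```python
from typing import Callable, List, Sequence


def aopc(
    tokens: Sequence[str],
    scores: Sequence[float],
    predict_proba: Callable[[List[str]], float],
    max_k: int = 5,
    mask_token: str = "[MASK]",  # assumed placeholder, not from the paper
) -> float:
    """Average drop in predicted probability after cumulatively masking
    the top-ranked tokens, for k = 1..max_k."""
    base = predict_proba(list(tokens))
    # Rank token positions from most to least important.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    masked = list(tokens)
    drops = []
    for i in order[:max_k]:
        masked[i] = mask_token  # masking is cumulative across steps
        drops.append(base - predict_proba(masked))
    return sum(drops) / len(drops) if drops else 0.0


def toy_predict_proba(tokens: List[str]) -> float:
    """Hypothetical stand-in for a classifier: probability rises when
    rigour-related words are present."""
    weights = {"randomized": 0.5, "blind": 0.3}
    return 0.1 + sum(w for t, w in weights.items() if t in tokens)


tokens = ["a", "randomized", "double", "blind", "trial"]
faithful_scores = [0.0, 0.5, 0.0, 0.3, 0.0]  # ranks the influential words first
poor_scores = [0.9, 0.0, 0.8, 0.0, 0.0]      # ranks irrelevant words first

aopc_faithful = aopc(tokens, faithful_scores, toy_predict_proba, max_k=2)
aopc_poor = aopc(tokens, poor_scores, toy_predict_proba, max_k=2)
```

In this toy setting, the attribution that ranks the influential words first yields the larger AOPC, which is the sense in which a higher AOPC indicates a more faithful explanation.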
Results:
SHAP (AOPC 0.222; 95% confidence interval [CI] 0.200 to 0.244) and IG (AOPC 0.225; 95% CI 0.202 to 0.247) provided consistent and faithful explanations, effectively identifying tokens relevant to study rigour (e.g., "randomized," "blind"). Conversely, GPT-4o explanations were poor (AOPC 0.029; 95% CI 0.014 to 0.043) with nonsensical token attributions. Correlation analysis showed moderate alignment between SHAP and IG (Pearson’s r 0.367), whereas GPT-4o had minimal correlation with these established methods (Pearson’s r ≤0.032).
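As an illustration of the correlation analysis, the sketch below computes Pearson's r between two token-level attribution vectors for the same text. It assumes both explainers score the same token sequence. The vectors shown are made-up values for demonstration only and are not data from this study.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical attributions for one five-token text (illustrative values only).
shap_like = [0.02, 0.41, 0.05, 0.30, 0.08]
ig_like = [0.01, 0.38, 0.10, 0.25, 0.06]

r = pearson_r(shap_like, ig_like)
```

A value of r near 1 would indicate that two methods rank tokens similarly, whereas a value near 0 indicates little linear agreement between their attributions.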
Conclusions:
GPT-4o, despite its advanced contextual capabilities, performed poorly as a standalone explainer compared with established methods such as SHAP and IG. These findings highlight the need for further research into specialized prompt engineering and potential hybrid methods that integrate LLMs with traditional XAI techniques to improve interpretability without sacrificing computational efficiency or explanation quality.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.