Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jun 13, 2025
Date Accepted: Mar 17, 2026
Date Submitted to PubMed: Mar 25, 2026
Evaluating Encoder and Decoder Models for Extended Clinical Concept Recognition in Japanese Clinical Texts: A Comparative Study with Weighted Soft Matching
ABSTRACT
Background:
The digitization of medical documents has resulted in vast amounts of information being stored electronically. Extracting medical knowledge for secondary purposes, such as diagnostic support, continues to pose a substantial challenge. While conventional named entity recognition (NER) has focused on short terms (e.g., genes, diseases, chemicals), the extraction and assessment of longer, complex expressions remain underexplored. Clinically vital concepts, such as diseases, pathologies, symptoms, and findings, often manifest as long phrases, whose accurate extraction is crucial for advanced applications like constructing causal knowledge from case reports. Consequently, a comprehensive framework that covers both short terms and clinically meaningful long phrases, termed extended named entity recognition (E-NER), is essential.
Objective:
This study, the first comprehensive investigation into E-NER model selection, aimed to identify optimal strategies by comparing encoder versus decoder models and general-purpose versus domain-specific pretraining. We also analyzed variations in model effectiveness with respect to target length and proposed a novel E-NER evaluation metric.
Methods:
We evaluated the extraction performance of 17 encoder and decoder models using the J-CaseMap database, which comprises approximately 20,000 case reports from Japan annotated with clinical concepts. Performance was primarily assessed using our novel “weighted soft matching score,” which distinctively penalizes the fragmentation of long extraction targets and weights scores by target length to account for the increased difficulty of extracting longer expressions.
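As an illustration only, the idea of a length-weighted score that penalizes fragmentation can be sketched as follows. The abstract does not give the formula, so everything here is an assumption rather than the authors' definition: spans are character offsets with an exclusive end, spans within one list do not overlap each other, and the names `weighted_soft_score`, `weighted_soft_f1`, and `frag_penalty` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlap(a: Span, b: Span) -> int:
    """Number of characters shared by two spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def weighted_soft_score(targets: List[Span], candidates: List[Span],
                        frag_penalty: float = 0.5) -> float:
    """Length-weighted mean of soft match scores over `targets`.

    Assumed scoring rule (not taken from the paper): a target's score is
    the fraction of its characters covered by candidates, divided by
    1 + frag_penalty * (number of extra pieces covering it).
    """
    total_weight = 0.0
    total_score = 0.0
    for t in targets:
        length = t[1] - t[0]
        if length <= 0:
            continue  # ignore empty or malformed spans
        pieces = [c for c in candidates if overlap(t, c) > 0]
        covered = sum(overlap(t, c) for c in pieces)
        coverage = min(1.0, covered / length)
        extra_pieces = max(0, len(pieces) - 1)
        score = coverage / (1.0 + frag_penalty * extra_pieces)
        # Longer targets contribute more to the final score.
        total_weight += length
        total_score += length * score
    return total_score / total_weight if total_weight else 0.0


def weighted_soft_f1(gold: List[Span], pred: List[Span],
                     frag_penalty: float = 0.5) -> float:
    """Harmonic mean of the score computed in both directions."""
    recall = weighted_soft_score(gold, pred, frag_penalty)
    precision = weighted_soft_score(pred, gold, frag_penalty)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 10)]
exact = weighted_soft_f1(gold, [(0, 10)])
fragmented = weighted_soft_f1(gold, [(0, 5), (5, 10)], frag_penalty=0.5)
```

In this sketch, raising `frag_penalty` lowers the score only for targets that were recovered in several pieces, which is the behavior the fragmentation penalty is meant to capture.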
Results:
The encoder model JMedDeBERTa(s), pretrained on domain-specific medical texts, demonstrated the highest performance (F1-score = 0.7582). Model performance generally declined with higher penalties for fragmentation, although substantial deterioration was not consistently observed. Overall, encoder models significantly outperformed decoder models despite having fewer parameters, and token classification was more effective than instruction tuning. The advantage provided by domain-specific pretraining was apparent but modest, suggesting that syntactic information may be more critical than specialized terminology for E-NER.
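For readers unfamiliar with token classification, the following minimal sketch shows how per-token labels can be turned into extracted spans. It assumes a BIO tagging scheme, which the abstract does not specify, and the label names `Finding` and `Disease` and the function `bio_to_spans` are hypothetical.

```python
from typing import List, Optional, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO labels into (start, end, type) token spans, end exclusive.

    An I- tag that does not continue an open span of the same type is
    treated as outside any span (one of several common conventions).
    """
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    current: Optional[str] = None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            # A new span begins; close any span that is still open.
            if start is not None:
                spans.append((start, i, current))
            start, current = i, label[2:]
        elif label.startswith("I-") and start is not None and label[2:] == current:
            continue  # the open span extends over this token
        else:
            if start is not None:
                spans.append((start, i, current))
            start, current = None, None
    if start is not None:
        spans.append((start, len(labels), current))
    return spans


labels = ["O", "B-Finding", "I-Finding", "I-Finding", "O", "B-Disease"]
spans = bio_to_spans(labels)
```

Under this scheme a long expression is recovered as one span as long as its tokens receive an unbroken B-/I- sequence, whatever its length.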
Conclusions:
This study demonstrates that for the E-NER task, a token classification approach employing an encoder model, particularly JMedDeBERTa(s) pretrained on medical texts, delivers optimal performance. Notably, no decoder model outperformed its encoder counterpart, underscoring that encoder-based methods can achieve high accuracy with fewer parameters, offering benefits in resource-constrained environments. Our findings on domain-specific pretraining suggest that, although such pretraining is beneficial, syntactic understanding may matter more than specialized terminology for E-NER. Models trained on limited domain-specific text, or even on general text if domain-specific data are scarce, may therefore perform comparably. Furthermore, token classification proved more effective for extended phrases than instruction tuning, which is better suited for shorter terms. Evaluation using the weighted soft matching score also showed that model performance did not substantially deteriorate with increased fragmentation penalties, which suggests that marker position splits are infrequent during the extraction of long expressions. These findings offer broadly applicable insights for information extraction tasks across varied medical texts.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.