JMIR Preprints #78681: Evaluating Encoder and Decoder Models for Extended Clinical Concept Recognition in Japanese Clinical Texts: A Comparative Study with Weighted Soft Matching

Evaluating Encoder and Decoder Models for Extended Clinical Concept Recognition in Japanese Clinical Texts: A Comparative Study with Weighted Soft Matching

Yuya Tsukiji;
Satoshi Kataoka;
Masafumi Itokazu;
Ryozo Nagai;
Takeshi Imai

ABSTRACT

Background:

The digitization of medical documents has resulted in vast amounts of information being stored electronically, yet extracting medical knowledge from it for secondary purposes, such as diagnostic support, remains a substantial challenge. While conventional named entity recognition (NER) has focused on short terms (e.g., genes, diseases, chemicals), the extraction and assessment of longer, complex expressions remain underexplored. Clinically vital concepts, such as diseases, pathologies, symptoms, and findings, often appear as long phrases, and extracting them accurately is crucial for advanced applications such as constructing causal knowledge from case reports. Consequently, a comprehensive framework that covers both short terms and clinically meaningful long phrases, termed extended named entity recognition (E-NER), is essential.

Objective:

This study, the first comprehensive investigation into E-NER model selection, aimed to identify optimal strategies by comparing encoder versus decoder models and general-purpose versus domain-specific pretraining. We also analyzed variations in model effectiveness with respect to target length and proposed a novel E-NER evaluation metric.

Methods:

We evaluated the extraction performance of 17 encoder and decoder models using the J-CaseMap database, which comprises approximately 20,000 case reports from Japan annotated with clinical concepts. Performance was primarily assessed using our novel “weighted soft matching score,” which distinctively penalizes the fragmentation of long extraction targets and weights scores by target length to account for the increased difficulty of extracting longer expressions.
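The abstract does not state the formula for the weighted soft matching score, so the Python sketch below is only one hypothetical way to realize the two properties described above: partial credit that shrinks when a target is recovered in several fragments, and averaging weighted by target length. The span representation, the function names (`soft_score`, `weighted_soft_f1`), the `penalty` parameter, and the exact discount formula are all assumptions made for illustration, not the authors' definition.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlap(a: Span, b: Span) -> int:
    """Number of characters shared by two spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def soft_score(target: Span, others: List[Span], penalty: float) -> float:
    """Fraction of `target` covered by `others`, discounted per extra fragment.

    Assumed form: coverage / (1 + penalty * (fragments - 1)).
    Assumes the spans in `others` do not overlap each other.
    """
    pieces = [o for o in others if overlap(target, o) > 0]
    if not pieces:
        return 0.0
    covered = sum(overlap(target, o) for o in pieces)
    coverage = min(1.0, covered / (target[1] - target[0]))
    return coverage / (1.0 + penalty * (len(pieces) - 1))


def weighted_average(targets: List[Span], others: List[Span], penalty: float) -> float:
    """Average of soft scores, each weighted by the length of its target span."""
    total = sum(end - start for start, end in targets)
    if total == 0:
        return 0.0
    weighted = sum(
        (end - start) * soft_score((start, end), others, penalty)
        for start, end in targets
    )
    return weighted / total


def weighted_soft_f1(gold: List[Span], pred: List[Span], penalty: float = 0.5) -> float:
    """Length-weighted soft F1; the fragmentation penalty applies to recall only."""
    recall = weighted_average(gold, pred, penalty)
    precision = weighted_average(pred, gold, 0.0)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 10)]
whole = weighted_soft_f1(gold, [(0, 10)])            # target recovered in one piece
split = weighted_soft_f1(gold, [(0, 5), (5, 10)])    # same characters, two fragments
```

In this sketch, setting `penalty` to 0 makes a fragmented extraction score the same as an unbroken one, so sweeping `penalty` upward is a simple way to see how sensitive a model's score is to fragmentation.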

Results:

The encoder model JMedDeBERTa(s), pretrained on domain-specific medical texts, demonstrated the highest performance (F1-score = 0.7582). Model performance generally declined with higher penalties for fragmentation, although substantial deterioration was not consistently observed. Overall, encoder models significantly outperformed decoder models despite having fewer parameters, and token classification was more effective than instruction tuning. The advantage provided by domain-specific pretraining was apparent but modest, suggesting that syntactic information may be more critical than specialized terminology for E-NER.
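To make the notion of fragmentation in a token classification setting concrete, the following Python sketch decodes per-token labels into spans. It assumes a simple BIO labelling scheme with a single entity type, which the abstract does not specify; the function name `bio_to_spans` and the example label sequences are hypothetical.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Convert per-token BIO labels into (start, end) token spans, end exclusive.

    An "I" that does not follow an open span is treated leniently as a span start.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open span began, if any
    for i, label in enumerate(labels):
        if label == "B" or (label == "I" and start is None):
            if start is not None:
                spans.append((start, i))  # close the previous span
            start = i
        elif label == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))  # close a span that runs to the end
    return spans


# A four-token phrase labelled correctly yields a single span.
intact = bio_to_spans(["B", "I", "I", "I", "O"])

# One token mislabelled as "O" splits the same phrase into two fragments.
fragmented = bio_to_spans(["B", "I", "O", "I", "O"])
```

Under these assumptions, a single mislabelled token inside a long phrase is enough to turn one extraction into two fragments, which is the kind of error a fragmentation penalty is meant to capture.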

Conclusions:

This study demonstrates that for the E-NER task, a token classification approach employing an encoder model, particularly JMedDeBERTa(s) pretrained on medical texts, delivers optimal performance. Notably, no decoder model outperformed its encoder counterpart, underscoring that encoder-based methods can achieve high accuracy with fewer parameters, which is an advantage in resource-constrained environments. Our findings on domain-specific pretraining suggest that, although such pretraining is beneficial, syntactic understanding may be more essential than specialized terminology for E-NER. As a result, models trained on limited domain-specific text, or even on general text if domain-specific data are scarce, may perform comparably. Furthermore, token classification proved more effective for extended phrases than instruction tuning, which is better suited for shorter terms. Evaluation using the weighted soft matching score also showed that model performance did not substantially deteriorate with increased fragmentation penalties, indicating that long expressions were rarely split into fragments during extraction. These findings offer broadly applicable insights for information extraction tasks across varied medical texts.

Citation

Please cite as:

Tsukiji Y, Kataoka S, Itokazu M, Nagai R, Imai T

Evaluating Encoder and Decoder Models for Extended Clinical Concept Recognition in Japanese Clinical Texts: Comparative Study With Weighted Soft Matching

J Med Internet Res 2026;28:e78681

DOI: 10.2196/78681

PMID: 42133937

PMCID: 13175525

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jun 13, 2025

Date Accepted: Mar 17, 2026

Date Submitted to PubMed: Mar 25, 2026
