Currently accepted at: JMIR Medical Informatics
Date Submitted: Oct 26, 2025
Open Peer Review Period: Oct 26, 2025 - Dec 21, 2025
Date Accepted: Mar 6, 2026
This paper has been accepted and is currently in production. It will appear shortly under DOI 10.2196/86533.
Automated ICD-10–Anchored Classification of Primary Care Text Data: Development and Evaluation of a Custom Multi-Label Classifier
ABSTRACT
Background:
Electronic health records are a vast and valuable source of information, useful for tasks such as estimating disease prevalence. However, much of this information, particularly doctors’ notes, is in free-text format rather than in a structured form and is therefore not readily amenable to analysis. Manual coding of this textual data is both time-consuming and resource-intensive, making it impractical for large datasets. The advent of powerful open-source language models offers innovative solutions to this scalability challenge.
Objective:
This study aims to demonstrate effective and accurate automatic classification of free-text notes using a language model fine-tuned for automated International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10) coding, and to provide hands-on guidance for applied health researchers.
Methods:
Building on the extensive ‘FIRE’ routine database of the Institute of Primary Care at the University Hospital Zurich and the University of Zurich, we trained a large language model–based multi-label classifier on a dataset of 38,728 free-text notes that had been manually categorized into 48 classes using specific ICD-10 codes and code ranges or non-diagnostic/ad hoc labels (e.g., ‘unclear diagnosis’, ‘status post’). We stratified the labelled data into training (70%), validation (15%), and post-training test (15%) sets, ensuring similar label distributions across the three sets. Using the Transformers Python library, we trained the model for 10 epochs and evaluated it on the post-training test set.
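The 70/15/15 split described above can be sketched as follows. This is a minimal illustration on placeholder note indices, not the study's pipeline: `note_ids` and the random seed are assumptions, and a genuinely multi-label stratified split (keeping label distributions similar across sets, as the authors did) would require an iterative-stratification method rather than the plain random split shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder indices standing in for the labelled free-text notes.
note_ids = np.arange(1000)

# First split off 30% for validation + test, then halve that holdout
# into two equal parts of 15% each.
train_ids, holdout_ids = train_test_split(note_ids, test_size=0.30, random_state=42)
val_ids, test_ids = train_test_split(holdout_ids, test_size=0.50, random_state=42)

print(len(train_ids), len(val_ids), len(test_ids))  # 700 150 150
```

The two-step split is a common idiom because `train_test_split` only produces two partitions per call.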
Results:
Across 48 classes, the FIRE classifier achieved strong performance on the held-out post-training set: F1 = 0.85 (micro, overall across all predictions), 0.86 (macro, mean of per-class scores treating classes equally), and 0.84 (weighted, per-class scores weighted by class frequency).
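The three F1 averages reported above can be reproduced with scikit-learn. The arrays below are toy multi-label indicator matrices invented for illustration, not the study's data; they merely show how the micro, macro, and weighted averages diverge when class supports and per-class scores differ.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label ground truth and predictions (rows = notes, columns = classes).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [1, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [1, 0, 1]])

micro = f1_score(y_true, y_pred, average="micro")        # pools TP/FP/FN over all predictions
macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean of per-class F1
weighted = f1_score(y_true, y_pred, average="weighted")  # per-class F1 weighted by support

print(round(micro, 4), round(macro, 4), round(weighted, 4))
```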
Conclusions:
This study demonstrates steps for training open-source models and highlights the potential of large language models to streamline and scale the extraction of diagnostic information for practical applications. Our model can be robustly deployed, for example, for pre-screening and labelling of free-text information, thus potentially reducing the burden of repetitive and error-prone manual handling.
Clinical Trial: N/A
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.