Currently accepted at: JMIR Medical Informatics
Date Submitted: Oct 26, 2025
Open Peer Review Period: Oct 26, 2025 - Dec 21, 2025
Date Accepted: Mar 6, 2026
This paper has been accepted and is currently in production. It will appear shortly under DOI 10.2196/86533.
Automated ICD-10–Anchored Classification of Primary Care Text Data: Development and Evaluation of a Custom Multi-Label Classifier
ABSTRACT
Background:
Electronic health records are a vast and valuable source of information, useful for tasks such as estimating disease prevalence. However, much of this information, particularly doctors’ notes, is in free-text format rather than in a structured form and is therefore not readily amenable to analysis. Manual coding of this textual data is both time-consuming and resource-intensive, making it impractical for large datasets. The advent of powerful open-source language models offers innovative solutions to this scalability challenge.
Objective:
This study aims to demonstrate effective and accurate automatic classification of free-text notes using a language model fine-tuned for automated International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10) coding, and to provide hands-on guidance for applied health researchers.
Methods:
Building on the extensive ‘FIRE’ routine database of the Institute of Primary Care at the University Hospital Zurich and the University of Zurich, we trained a large language model–based multi-label classifier on a dataset of 38,728 free-text notes that had been manually categorized into 48 classes using specific ICD-10 codes and code ranges or non-diagnostic/ad hoc labels (e.g., ‘unclear diagnosis’, ‘status post’). We stratified the labelled data into training (70%), validation (15%), and post-training test (15%) sets, ensuring similar label distributions across the three sets. Using the Transformers Python library, we trained the model for 10 epochs and evaluated it on the post-training test set.
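The 70/15/15 split described above can be sketched as follows. This is a minimal illustration on placeholder note indices, not the study's pipeline: `note_ids` and the random seed are assumptions, and a genuinely multi-label stratified split (keeping label distributions similar across sets, as the authors did) would require an iterative-stratification method rather than the plain random split shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder indices standing in for the labelled free-text notes.
note_ids = np.arange(1000)

# First split off 30% for validation + test, then halve that holdout
# into two equal parts of 15% each.
train_ids, holdout_ids = train_test_split(note_ids, test_size=0.30, random_state=42)
val_ids, test_ids = train_test_split(holdout_ids, test_size=0.50, random_state=42)

print(len(train_ids), len(val_ids), len(test_ids))  # 700 150 150
```

The two-step split is a common idiom because `train_test_split` only produces two partitions per call.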
Results:
Across 48 classes, the FIRE classifier achieved strong performance on the held-out post-training set: F1 = 0.85 (micro, overall across all predictions), 0.86 (macro, mean of per-class scores treating classes equally), and 0.84 (weighted, per-class scores weighted by class frequency).
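The three F1 averages reported above can be reproduced with scikit-learn. The arrays below are toy multi-label indicator matrices invented for illustration, not the study's data; they merely show how the micro, macro, and weighted averages diverge when class supports and per-class scores differ.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label ground truth and predictions (rows = notes, columns = classes).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [1, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [1, 0, 1]])

micro = f1_score(y_true, y_pred, average="micro")        # pools TP/FP/FN over all predictions
macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean of per-class F1
weighted = f1_score(y_true, y_pred, average="weighted")  # per-class F1 weighted by support

print(round(micro, 4), round(macro, 4), round(weighted, 4))
```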
Conclusions:
This study demonstrates steps for training open-source models and highlights the potential of large language models to streamline and scale the extraction of diagnostic information for practical applications. Our model can be robustly deployed, for example, for pre-screening and labelling of free-text information, thus potentially reducing the burden of repetitive and error-prone manual handling.
Clinical Trial: N/A
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.