Accepted for/Published in: JMIR Medical Informatics
Date Submitted: May 13, 2025
Date Accepted: Dec 27, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
An iterative and generic framework to build effective medical status extractors based on sentence classification of Electronic Health Records under constraint resources
ABSTRACT
Background:
Clinical data warehouses store large amounts of patient information as unstructured text, from which medical status can be extracted for research using natural language processing (NLP). Current machine learning-based extraction systems rely on named entity recognition (NER), which requires extensive manual annotation by medical experts. The limited number of available experts and their heavy workload make these systems difficult to build and replicate, highlighting the need for solutions that reduce annotation time and effort.
Objective:
We introduce the MSEP pipeline, an iterative and generic framework for extracting patients’ medical status from electronic health records, designed to be more efficient and reproducible than NER-based approaches.
Methods:
Our medical status extraction pipeline uses NLP methods to classify sentences in health records rather than to extract entities. The deep learning approach fine-tunes a pre-trained CamemBERT model for sentence classification, while rule-based extractors and large language model (LLM) prompts serve as comparisons. Two specialists annotated the data, aided by rule-based pre-annotation. Performance and stability were assessed using stratified cross-validation.
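As an illustration of the rule-based pre-annotation and stratified splitting described above, here is a minimal sketch using only the Python standard library. The keyword rule, the example record (in English for readability), and the helper names `split_sentences`, `pre_annotate`, and `stratified_folds` are all hypothetical; they are assumptions made for this example and are not the authors' rules or code.

```python
import re
from collections import defaultdict

# Hypothetical keyword rule for a single status (smoking); not the authors' rule set.
SMOKING_RULE = re.compile(r"\b(smok\w*|tobacco|cigarette\w*)\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pre_annotate(sentences, rule=SMOKING_RULE):
    """Suggest label 1 if the rule matches the sentence, else 0.

    The suggestions are meant to be reviewed by human annotators.
    """
    return [(s, 1 if rule.search(s) else 0) for s in sentences]


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, class by class (round-robin),
    so that every fold receives a similar share of each label."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


# Invented example record, for illustration only.
record = (
    "Patient seen for follow-up. Former smoker, quit in 2010. "
    "No alcohol use. Reports occasional tobacco use at parties."
)
annotated = pre_annotate(split_sentences(record))
labels = [y for _, y in annotated]
folds = stratified_folds(labels, k=2)
```

In this sketch each fold ends up with one positive and one negative sentence, which is the property stratification is meant to preserve when positive examples are rare.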
Results:
The pipeline extracted patient status for six conditions, achieving an F-score above 90% for five of them with fine-tuned CamemBERT, outperforming LLM prompts and rule-based extractors. However, extracting a medical status with few training samples, such as family history of cancer, remained challenging (80% F-score). Dataset construction was accelerated by using sentence classification instead of named entity annotation (1.2-2.9 s/sentence versus 7.8-16.5 s/sentence). The minimum time needed to build and refine an extractor using the pipeline was 8 hours.
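To make the reported figures concrete, the sketch below computes the range of annotation speedups implied by the per-sentence times above and shows how an F-score is derived from prediction counts. It assumes the balanced F1 variant, and the counts passed to `f_score` are hypothetical. Because the abstract does not say which sentence-classification time corresponds to which NER time, only the lower and upper bounds of the ratio are computed.

```python
def f_score(tp, fp, fn):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-sentence annotation times reported in the abstract, in seconds (min, max).
SENTENCE_CLASSIFICATION_S = (1.2, 2.9)
NER_ANNOTATION_S = (7.8, 16.5)

# Most conservative ratio: fastest NER time over slowest sentence-classification time.
min_speedup = NER_ANNOTATION_S[0] / SENTENCE_CLASSIFICATION_S[1]
# Most favorable ratio: slowest NER time over fastest sentence-classification time.
max_speedup = NER_ANNOTATION_S[1] / SENTENCE_CLASSIFICATION_S[0]

# Hypothetical counts: 90 true positives, 10 false positives, 10 false negatives.
example_f1 = f_score(tp=90, fp=10, fn=10)

print(f"speedup between {min_speedup:.1f}x and {max_speedup:.1f}x")
print(f"example F-score: {example_f1:.2f}")
```

Under these assumptions, sentence-level annotation is somewhere between roughly 2.7 and 13.8 times faster per sentence than entity-level annotation.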
Conclusions:
MSEP is a pipeline for extracting medical status from clinical texts. It achieved state-of-the-art performance and can be built and replicated efficiently with few requirements. It is highly customizable according to research objectives and resource availability.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.