JMIR Preprints #77409: An iterative and generic framework to build effective medical status extractors based on sentence classification of Electronic Health Records under constraint resources

An iterative and generic framework to build effective medical status extractors based on sentence classification of Electronic Health Records under constraint resources

Chuanming Dong;
Boris Delange;
Alex Poiron;
Mohamed El Azzouzi;
Clément François;
Guillaume Bouzillé;
Marc Cuggia;
Sandie Cabon

ABSTRACT

Background:

Clinical data warehouses store large amounts of patient information as unstructured text, from which medical status can be extracted for research using natural language processing (NLP). Current machine learning-based extraction systems rely on named entity recognition (NER), which requires extensive manual annotation by medical experts. However, the limited number of available experts and their heavy workload make these systems difficult to build and replicate, highlighting the need for solutions that reduce time and effort.

Objective:

We introduce the MSEP pipeline, an iterative and generic framework for extracting patients’ medical status from electronic health records more efficiently and reproducibly.

Methods:

Our medical status extraction pipeline uses NLP methods to classify sentences in health records rather than to extract entities. A pre-trained CamemBERT model is fine-tuned for sentence classification, while rule-based extractors and large language model (LLM) prompts serve as comparisons. Two specialists annotated the data, aided by rule-based pre-annotation. Performance and stability were assessed using stratified cross-validation.
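As an illustration only, the following minimal Python sketch shows what rule-based pre-annotation at the sentence level and a stratified fold assignment might look like. It is not the authors' implementation: the keyword rule, the smoking example, the English sample text, and the function names `split_sentences`, `pre_annotate`, and `stratified_folds` are all assumptions made for this sketch.

```python
import re
from collections import defaultdict

# Hypothetical keyword rule for a single status (smoking); not taken from the paper.
SMOKING_RULE = re.compile(r"\b(smok\w*|tobacco|cigarettes?)\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: cut after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pre_annotate(sentences, rule=SMOKING_RULE):
    """Suggest a label per sentence: 1 if the rule matches, else 0."""
    return [1 if rule.search(s) else 0 for s in sentences]


def stratified_folds(labels, k):
    """Assign sample indices to k folds, round-robin within each class,
    so that every fold keeps roughly the overall class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


if __name__ == "__main__":
    text = ("Patient smokes 10 cigarettes a day. No fever reported. "
            "Former tobacco use. Blood pressure is normal.")
    sentences = split_sentences(text)
    labels = pre_annotate(sentences)
    for sentence, label in zip(sentences, labels):
        print(label, sentence)
    print(stratified_folds(labels, k=2))
```

In such a setup, annotators would only confirm or correct one suggested label per sentence, and each fold would serve once as the held-out set while a classifier is trained on the remaining folds.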

Results:

The pipeline extracted patient status for six conditions. Fine-tuned CamemBERT achieved an F-score above 90% for five of them, outperforming LLM prompts and rule-based extractors. However, extracting medical status from few training samples, as for family history of cancer, remained challenging (80% F-score). Dataset construction was accelerated by annotating at the sentence level instead of annotating named entities (1.2-2.9 s/sentence versus 7.8-16.5 s/sentence). The minimum time required to build and refine an extractor with the pipeline was 8 hours.
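For readers unfamiliar with the metric, the F-score reported above is the harmonic mean of precision and recall. The sketch below computes it for a binary sentence label. The abstract does not state which averaging scheme the authors used, so the binary form and the function name `f_score` are assumptions for illustration.

```python
def f_score(y_true, y_pred, positive=1):
    """Binary F-score (F1): harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels, not data from the study.
    gold = [1, 1, 0, 0, 1]
    predicted = [1, 0, 0, 1, 1]
    print(round(f_score(gold, predicted), 3))
```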

Conclusions:

MSEP is a pipeline for extracting medical status from clinical texts. It achieved state-of-the-art performance and can be built and replicated efficiently with few resource requirements. It is highly customizable according to research objectives and resource availability.

Citation

Please cite as:

Dong C, Delange B, Poiron A, El Azzouzi M, François C, Bouzillé G, Cuggia M, Cabon S

A Sentence Classification–Based Medical Status Extraction Pipeline for Electronic Health Records: Institutional Case Study

JMIR Med Inform 2026;14:e77409

DOI: 10.2196/77409

PMID: 41923446

PMCID: 13044345

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 13, 2025

Date Accepted: Dec 27, 2025
