Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 13, 2025
Date Accepted: Dec 27, 2025

The final, peer-reviewed published version of this preprint can be found here:

A Sentence Classification–Based Medical Status Extraction Pipeline for Electronic Health Records: Institutional Case Study

Dong C, Delange B, Poiron A, El Azzouzi M, François C, Bouzillé G, Cuggia M, Cabon S

JMIR Med Inform 2026;14:e77409

DOI: 10.2196/77409

Warning: This is an author submission that is not peer-reviewed or edited. Preprints, unless they show as "accepted," should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

An iterative and generic framework to build effective medical status extractors based on sentence classification of Electronic Health Records under constraint resources

  • Chuanming Dong
  • Boris Delange
  • Alex Poiron
  • Mohamed El Azzouzi
  • Clément François
  • Guillaume Bouzillé
  • Marc Cuggia
  • Sandie Cabon

ABSTRACT

Background:

Clinical data warehouses store large amounts of patient information as unstructured text, from which medical status can be extracted for research using natural language processing (NLP). Current machine learning-based extraction systems rely on named entity recognition (NER), which requires extensive manual annotation by medical experts. However, the limited number of available experts and their heavy workload make these systems difficult to build and replicate, highlighting the need for solutions that reduce annotation time and effort.

Objective:

We introduce the medical status extraction pipeline (MSEP), an iterative and generic framework for extracting patients’ medical status from electronic health records more efficiently and reproducibly.

Methods:

Our medical status extraction pipeline uses NLP methods to classify sentences in health records rather than extract entities. The deep learning approach fine-tunes a pretrained CamemBERT model for classification, while rule-based and large language model (LLM) prompt-based extractors serve as comparisons. Two specialists annotated the data, aided by rule-based pre-annotation. Performance and stability were assessed using stratified cross-validation.
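To make the sentence-level approach more concrete, the following is a minimal sketch of rule-based pre-annotation followed by stratified fold assignment. It is an illustration under stated assumptions, not the authors' implementation: the regular expressions, the `SMOKING_RULES` name, the label names, and the example record are all hypothetical, and the example is written in English although CamemBERT is a French language model.

```python
import re
from collections import defaultdict

# Hypothetical keyword rules for one condition (smoking status).
# The actual rules used in the pipeline are not given in the abstract.
SMOKING_RULES = [
    (re.compile(r"\b(non[- ]smoker|never smoked|does not smoke)\b", re.I), "non_smoker"),
    (re.compile(r"\b(smoker|smokes|tobacco use)\b", re.I), "smoker"),
]

def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def pre_annotate(sentence, rules=SMOKING_RULES, default="no_mention"):
    """Return the label of the first matching rule, or `default` if none match."""
    for pattern, label in rules:
        if pattern.search(sentence):
            return label
    return default

def stratified_folds(labels, k):
    """Assign sample indices to k folds, distributing each label
    round-robin so that label proportions stay similar across folds."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Hypothetical record used only for illustration.
record = ("Patient is a non-smoker. He smokes daily. "
          "Blood pressure is normal. She never smoked.")
sentences = split_sentences(record)
labels = [pre_annotate(s) for s in sentences]
for sentence, label in zip(sentences, labels):
    print(f"{label:12s} {sentence}")
```

In a workflow like the one described, such suggested labels would be reviewed and corrected by annotators before the sentences are used to train and evaluate a classifier fold by fold.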

Results:

The pipeline extracted patient status for six conditions; the fine-tuned CamemBERT model achieved an F-score above 90% for five of them, outperforming the LLM prompt-based and rule-based extractors. However, extracting a medical status with few training samples, such as family history of cancer, remained challenging (80% F-score). Annotating at the sentence level rather than the named-entity level accelerated dataset construction (1.2-2.9 s/sentence versus 7.8-16.5 s/sentence). The minimum time needed to build and refine an extractor using the pipeline was 8 hours.
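The reported metrics can be related to concrete quantities with a short sketch. Only the per-sentence timings below come from the abstract; the confusion counts and the corpus size of 10,000 sentences are made-up values chosen for illustration, and the function names are hypothetical.

```python
def f_score(tp, fp, fn, beta=1.0):
    """F-beta score from true positive, false positive, and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def annotation_hours(n_sentences, seconds_per_sentence):
    """Total annotation time in hours for a given per-sentence cost."""
    return n_sentences * seconds_per_sentence / 3600

# Hypothetical confusion counts: precision and recall are both 0.9 here.
print(f"F1 = {f_score(tp=90, fp=10, fn=10):.2f}")

# Per-sentence timings from the abstract, applied to a hypothetical corpus size.
N_SENTENCES = 10_000
for name, (low, high) in {
    "sentence classification": (1.2, 2.9),
    "named entity annotation": (7.8, 16.5),
}.items():
    print(f"{name}: {annotation_hours(N_SENTENCES, low):.1f} to "
          f"{annotation_hours(N_SENTENCES, high):.1f} hours")
```

Under these assumed numbers, the difference in per-sentence cost translates into several hours versus several working days of expert time for the same corpus.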

Conclusions:

MSEP is a pipeline for extracting medical status from clinical texts. It achieved state-of-the-art performance and can be built and replicated efficiently with few requirements. It is highly customizable according to research objectives and resource availability.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.