Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 20, 2020
Date Accepted: Jun 17, 2021

The final, peer-reviewed published version of this preprint can be found here:

Assessing the Performance of Clinical Natural Language Processing Systems: Development of an Evaluation Methodology

Canales L, Menke S, Marchesseau S, D’Agostino A, del Rio-Bermudez C, Taberna M, Tello J

JMIR Med Inform 2021;9(7):e20492

DOI: 10.2196/20492

PMID: 34297002

PMCID: 8367121

Evaluation methodology for clinical NLP systems including the creation of an unbiased validation dataset

  • Lea Canales
  • Sebastian Menke
  • Stephanie Marchesseau
  • Ariel D’Agostino
  • Carlos del Rio-Bermudez
  • Miren Taberna
  • Jorge Tello

ABSTRACT

Background:

Clinical natural language processing (NLP) systems are increasingly important because of their growing role in driving decisions about clinical practice. However, carrying out a sound evaluation of an NLP system is complex, and there is little guidance on how to approach it.

Objective:

This research aims to provide a state-of-the-art methodology for evaluating a clinical NLP system, guiding NLP researchers through the process with the ultimate goal of ensuring that the performance metrics are robust and representative.

Methods:

We developed a methodology that guides researchers through the evaluation of a clinical NLP system, and we applied it to Savana’s ‘EHRead technology’ in a real use case on chronic obstructive pulmonary disease (COPD). In addition, we introduce SLiCE, a software tool that assists NLP specialists in creating a statistically useful gold standard.

Results:

The gold standard contained 49.6% positive and 50.4% negative examples for COPD. For the COPD study, the confidence intervals (CIs) of the primary variable COPD, calculated using SLiCE, had widths of 0.074 for precision, 0.046 for recall, and 0.061 for F1, supporting the statistical usefulness of the gold standard.
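To make the notion of a CI width concrete, the following is a minimal sketch of how an interval around precision and recall could be computed from gold-standard counts. It is not SLiCE's implementation, which the abstract does not describe: the use of the Wilson score interval, the function names `wilson_interval` and `metrics_with_ci`, and the example counts are all assumptions chosen for illustration. F1 is reported as a point estimate only, since it is not a simple binomial proportion.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    Assumption: z=1.96 corresponds to a 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half


def metrics_with_ci(tp, fp, fn, z=1.96):
    """Precision and recall with Wilson intervals, plus the F1 point estimate."""
    prec_lo, prec_hi = wilson_interval(tp, tp + fp, z)
    rec_lo, rec_hi = wilson_interval(tp, tp + fn, z)
    return {
        "precision": tp / (tp + fp),
        "precision_ci_width": prec_hi - prec_lo,
        "recall": tp / (tp + fn),
        "recall_ci_width": rec_hi - rec_lo,
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


if __name__ == "__main__":
    # Hypothetical counts, not taken from the study.
    for name, value in metrics_with_ci(tp=450, fp=40, fn=30).items():
        print(f"{name}: {value:.3f}")
```

Under this sketch, a larger gold standard yields a narrower interval, which is why the number and balance of annotated examples matter for how much trust the reported metrics deserve.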

Conclusions:

Our proposed methodology is intended to support the evaluation of clinical NLP systems. Researchers can follow our suggestions step by step and use SLiCE to statistically back up their gold standard. We successfully evaluated Savana’s ‘EHRead technology’ using the proposed methodology in a real use case. We share here what we have learned from developing NLP solutions for the clinical domain, hoping that it helps others establish sound protocols for evaluating their own NLP systems.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.