JMIR Preprints #20492: Evaluation methodology for clinical NLP systems including the creation of an unbiased validation dataset

Evaluation methodology for clinical NLP systems including the creation of an unbiased validation dataset

Lea Canales;
Sebastian Menke;
Stephanie Marchesseau;
Ariel D’Agostino;
Carlos del Rio-Bermudez;
Miren Taberna;
Jorge Tello

ABSTRACT

Background:

Clinical natural language processing (NLP) systems are of crucial importance because they increasingly drive decisions about clinical practice. However, carrying out a sound evaluation of an NLP system is complex, and the task is hindered by a lack of guidance on how to approach it.

Objective:

This research aims to provide a state-of-the-art methodology for evaluating a clinical NLP system, guiding NLP researchers through the process with the ultimate goal of ensuring that the performance metrics are robust and representative.

Methods:

We developed a methodology that guides researchers through the process of evaluating a clinical NLP system, and we illustrate it with Savana’s ‘EHRead technology’ applied to a real use case on chronic obstructive pulmonary disease (COPD). We also introduce SLiCE, a software tool that helps NLP specialists create a statistically useful gold standard.

Results:

The gold standard contained 49.6% positive and 50.4% negative examples for COPD. For the primary variable COPD, the confidence intervals (CIs) calculated using SLiCE had widths of 0.074 for precision, 0.046 for recall, and 0.061 for F1, which indicates a statistically useful gold standard.

Conclusions:

Our proposed methodology is intended to assist in designing the evaluation of a clinical NLP system. Researchers can follow our suggestions step by step and use SLiCE to statistically back up their gold standard. We successfully evaluated Savana’s ‘EHRead technology’ using the proposed methodology on a real use case. We share here what we have learned from developing NLP solutions for the clinical domain, in the hope that it helps others establish sound protocols for evaluating their own NLP systems.

Citation

Please cite as:

Canales L, Menke S, Marchesseau S, D’Agostino A, del Rio-Bermudez C, Taberna M, Tello J

Assessing the Performance of Clinical Natural Language Processing Systems: Development of an Evaluation Methodology

JMIR Med Inform 2021;9(7):e20492

DOI: 10.2196/20492

PMID: 34297002

PMCID: 8367121

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 20, 2020

Date Accepted: Jun 17, 2021
