Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Dec 25, 2020
Date Accepted: Aug 12, 2021

The final, peer-reviewed published version of this preprint can be found here:

Natural Language Processing and Machine Learning Methods to Characterize Unstructured Patient-Reported Outcomes: Validation Study

Huang IC, Lu Z, Sim JA, Forrest C, Krull K, Srivastava K, Hudson M, Robison L, Baker J

J Med Internet Res 2021;23(11):e26777

DOI: 10.2196/26777

PMID: 34730546

PMCID: 8600437

Natural Language Processing and Machine Learning Methods to Characterize Unstructured Patient-Reported Outcomes: A Validation Study

  • I-Chan Huang
  • Zhaohua Lu
  • Jin-Ah Sim
  • Christopher Forrest
  • Kevin Krull
  • Kumar Srivastava
  • Melissa Hudson
  • Leslie Robison
  • Justin Baker

ABSTRACT

Background:

Assessing patient-reported outcomes (PROs) through interviews or conversations during clinical encounters provides insightful information about survivorship.

Objective:

This study aimed to test the validity of natural language processing (NLP) and machine learning (ML) algorithms in identifying different attributes of pain interference and fatigue symptoms experienced by child/adolescent cancer survivors, using the judgment of PRO content experts as the gold standard.

Methods:

This cross-sectional study focused on child/adolescent cancer survivors aged 8-17.9 years and caregivers, from whom 391 meaning units in the pain interference domain and 423 in the fatigue domain were generated for analysis. Data were collected from After the Completion of Therapy Clinic at St. Jude Children’s Research Hospital. Experiences of pain interference and fatigue symptoms were reported through in-depth interviews. After verbatim transcription, analyzable sentences (i.e., meaning units) were semantically labeled by 2 content experts for each attribute (physical, cognitive, social, or unclassified). Two NLP/ML approaches were used to extract and validate the semantic features: 1) Bidirectional Encoder Representations from Transformers (BERT) and 2) Word2vec combined with one of two ML methods, either the Support Vector Machine (SVM) or Extreme Gradient Boosting (XGBoost). Receiver operating characteristic (ROC) and precision-recall (PR) curves were used to evaluate the accuracy and validity of the NLP/ML methods.
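As a rough illustration of the ROC and PR evaluation described above, the sketch below computes the area under each curve for one binary attribute (for example, "cognitive" vs. not). It is not the authors' code: the function names and the toy labels and scores are made up for illustration, the ROC area uses the rank-based (Mann-Whitney) formulation, and the PR area is approximated by average precision.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (Mann-Whitney U)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the PR curve: mean precision at each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


# Hypothetical data: expert label (1 = attribute present) and model score
# for six meaning units. These values are invented for the example.
expert_labels = [1, 0, 1, 1, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc_roc = roc_auc(expert_labels, model_scores)
auc_pr = average_precision(expert_labels, model_scores)
print(f"ROC AUC = {auc_roc:.3f}, PR AUC (average precision) = {auc_pr:.3f}")
```

With four attributes, one way to apply this is one attribute at a time, treating the expert label for that attribute as the positive class; whether the study used exactly this one-vs-rest setup is an assumption here.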

Results:

Compared to Word2vec/SVM and Word2vec/XGBoost, BERT demonstrated higher accuracy on both symptom domains, including 0.931 (95% CI=0.905, 0.957) and 0.916 (95% CI=0.887, 0.941) for problems with cognitive and social attributes on pain interference, and 0.929 (95% CI=0.903, 0.953) and 0.917 (95% CI=0.891, 0.943) for problems with cognitive and social attributes on fatigue. In addition, BERT yielded superior areas under the ROC curve for the cognitive attribute on the pain interference and fatigue domains (0.923 [95% CI=0.879, 0.997]; 0.948 [95% CI=0.922, 0.979]), and superior areas under the PR curve for the cognitive attribute on the pain interference and fatigue domains (0.818 [95% CI=0.735, 0.917]; 0.855 [95% CI=0.791, 0.930]).
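The abstract reports accuracy with 95% confidence intervals but does not say how the intervals were obtained. One common choice is a percentile bootstrap over meaning units; the sketch below shows that approach under this assumption only. The function names, the seed, and the toy labels are hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of meaning units where the prediction matches the expert label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy (resampling units with replacement)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(accuracy([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Invented toy data: 20 expert labels and predictions with 3 disagreements.
toy_true = [1, 0] * 10
toy_pred = list(toy_true)
for i in (2, 7, 15):
    toy_pred[i] = 1 - toy_pred[i]

point = accuracy(toy_true, toy_pred)
ci_low, ci_high = bootstrap_ci(toy_true, toy_pred)
print(f"accuracy = {point:.3f} (95% CI = {ci_low:.3f}, {ci_high:.3f})")
```

With only 20 units the interval is wide; the study's several hundred meaning units per domain would give a much narrower one.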

Conclusions:

The BERT method outperformed the other methods. As an alternative to using standard PRO surveys, collecting unstructured PROs via interviews or conversations during clinical encounters and applying NLP/ML methods can facilitate PRO assessment in child/adolescent cancer survivors.




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.