Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 7, 2021
Date Accepted: Sep 18, 2021

The final, peer-reviewed published version of this preprint can be found here:

Predicting the Easiness and Complexity of English Health Materials for International Tertiary Students With Linguistically Enhanced Machine Learning Algorithms: Development and Validation Study

Xie W, Ji M, Hao T, Chow CY

JMIR Med Inform 2021;9(10):e25110

DOI: 10.2196/25110

PMID: 34698644

PMCID: 8579219

Developing Linguistically Enhanced Machine Learning Algorithms for Predicting the Easiness and Complexity of English Health Materials for International Tertiary Students

  • Wenxiu Xie; 
  • Meng Ji; 
  • Tianyong Hao; 
  • Chi-Yin Chow

ABSTRACT

Background:

There is an increasing body of research on the development of machine learning algorithms for the evaluation of online health educational resources for specific readerships. Machine learning algorithms are known for their lack of interpretability compared with statistical methods. Given their high predictive precision, improving the interpretability of these algorithms can help increase their applicability and replicability in health educational research and applied linguistics, as well as in the development and review of new health education resources for effective, accessible health education.

Objective:

Our study aimed to develop a linguistically enriched machine learning model to predict binary outcomes of online English health educational resources, in terms of their easiness and complexity for international tertiary students.

Methods:

Logistic regression (LR) emerged as the best-performing algorithm compared with support vector machine (SVM) with a linear kernel, SVM with a radial basis function (RBF) kernel, random forest (RF), and extreme gradient boost tree (XGB) on the dataset transformed using L2 normalisation. We applied recursive feature elimination (RFE) with SVM to perform automatic feature selection. The automatically selected (AS) feature set (67 features) was then further streamlined through expert review. The finalised set of 22 semantic features achieved area under the curve (AUC), sensitivity, specificity, and accuracy similar to those of the original feature set (115 features) and the AS feature set (67 features). LR with the linguistically enhanced (LE) feature set (22 features) exhibited notable stability and robustness on training data of different sizes (20%, 40%, 60%, 80%) and consistently high performance compared with the other 4 algorithms (SVM_linear, SVM_RBF, RF, XGB).
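The feature selection step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses only the Python standard library, it ranks features with a small gradient-descent logistic regression instead of the SVM named in the Methods, and the feature names and synthetic data are hypothetical. It only shows the general shape of L2 normalisation followed by recursive feature elimination.

```python
import math
import random

def l2_normalise(row):
    """Scale one sample so its feature vector has unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)

def fit_logistic(X, y, epochs=200, lr=0.5):
    """Batch gradient descent for logistic regression; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, row)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - label
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def recursive_feature_elimination(X, y, names, keep):
    """Refit, drop the feature with the smallest |weight|, repeat until `keep` remain."""
    active = list(range(len(names)))
    while len(active) > keep:
        X_sub = [[row[j] for j in active] for row in X]
        w, _ = fit_logistic(X_sub, y)
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        active.pop(weakest)
    return [names[j] for j in active]

# Synthetic toy data (an assumption, not the study's corpus):
# two informative features and four pure-noise features.
rng = random.Random(0)
names = ["signal_easy", "signal_hard", "noise_0", "noise_1", "noise_2", "noise_3"]
X, y = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    row = [rng.random() + 1.5 * label, rng.random() + 1.5 * (1 - label)]
    row += [rng.random() for _ in range(4)]
    X.append(l2_normalise(row))
    y.append(label)

selected = recursive_feature_elimination(X, y, names, keep=2)
print(selected)
```

In the study, the automatically selected set was further reduced by expert review, a step that a script like this cannot reproduce.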

Results:

We identified semantic features with positive regression coefficients, which contributed to the prediction of easy-to-understand online health texts, and semantic features with negative regression coefficients, which contributed to the prediction of hard-to-understand health materials for readers from non-native English backgrounds. Language complexity was explained by lexical difficulty (rarity, medical terminology), verbs typical of medical discourse, and syntactic complexity. Language easiness of online health materials was associated with features such as common speech act verbs, personal pronouns, and familiar reasoning verbs. Successive permutation of features illustrated the interaction between these features and their impact on key performance indicators of the machine learning algorithms.
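As a rough illustration of how coefficient signs can be read in a logistic regression, the sketch below splits features by sign and then measures how accuracy changes when a single feature column is shuffled. The coefficient values, the bias, and the toy data are invented for illustration and are not results from the study; the feature names only echo the categories mentioned above. Shuffling one feature at a time is a simplification and may differ from the successive permutation procedure the authors used.

```python
import math
import random

# Invented coefficients for illustration only; NOT values reported by the study.
COEFS = {
    "personal_pronouns": 1.8,
    "speech_act_verbs": 1.1,
    "medical_terms": -2.0,
    "rare_words": -1.3,
}
BIAS = 0.2  # hypothetical intercept

def split_by_sign(coefs):
    """Return (easy, hard) feature names, each ordered by coefficient magnitude."""
    ranked = sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)
    easy = [n for n in ranked if coefs[n] > 0]   # push towards the "easy" class
    hard = [n for n in ranked if coefs[n] < 0]   # push towards the "hard" class
    return easy, hard

def predict(row, coefs, bias):
    """Predict 1 (easy) when the logistic probability is at least 0.5."""
    z = bias + sum(coefs[n] * row[n] for n in coefs)
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0

def accuracy(rows, labels, coefs, bias):
    hits = sum(predict(r, coefs, bias) == t for r, t in zip(rows, labels))
    return hits / len(rows)

def permutation_drop(rows, labels, coefs, bias, feature, rng):
    """Accuracy lost when one feature column is shuffled across samples."""
    shuffled = [row[feature] for row in rows]
    rng.shuffle(shuffled)
    permuted = [dict(row, **{feature: v}) for row, v in zip(rows, shuffled)]
    return accuracy(rows, labels, coefs, bias) - accuracy(permuted, labels, coefs, bias)

# Toy data: random feature values, labelled by the model itself,
# so the unshuffled accuracy is 1.0 by construction.
rng = random.Random(0)
rows = [{n: rng.random() for n in COEFS} for _ in range(200)]
labels = [predict(r, COEFS, BIAS) for r in rows]

easy, hard = split_by_sign(COEFS)
baseline = accuracy(rows, labels, COEFS, BIAS)
drops = {n: permutation_drop(rows, labels, COEFS, BIAS, n, rng) for n in COEFS}
print("easy:", easy)
print("hard:", hard)
print("accuracy drop per shuffled feature:", drops)
```

A larger drop suggests the model leans more heavily on that feature, which is one way to connect a coefficient table to a performance indicator.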

Conclusions:

The new logistic regression model exhibited consistency, scalability, and, more importantly, interpretability based on existing health and linguistic research. Low and high linguistic accessibility of online health materials were explained by two distinct sets of semantic features. This revealed the inherent complexity of effective health communication beyond current readability analyses, which were limited to syntactic complexity and lexical difficulty.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.