JMIR Preprints #25110: Developing Linguistically Enhanced Machine Learning Algorithms for Predicting the Easiness and Complexity of English Health Materials for International Tertiary Students

Developing Linguistically Enhanced Machine Learning Algorithms for Predicting the Easiness and Complexity of English Health Materials for International Tertiary Students

Wenxiu Xie;
Meng Ji;
Tianyong Hao;
Chi-Yin Chow

ABSTRACT

Background:

There is a growing body of research on developing machine learning algorithms to evaluate online health educational resources for specific readerships. Machine learning algorithms are known for their lack of interpretability compared with statistical methods. Given their high predictive precision, improving the interpretability of these algorithms can help increase their applicability and replicability in health educational research and applied linguistics, as well as in the development and review of new health education resources for effective, accessible health education.

Objective:

Our study aimed to develop a linguistically enriched machine learning model to predict binary outcomes of online English health educational resources, in terms of their easiness and complexity for international tertiary students.

Methods:

Logistic regression (LR) emerged as the best-performing algorithm compared with support vector machine (SVM) with a linear kernel, SVM with a radial basis function (RBF) kernel, random forest (RF), and extreme gradient boosting tree (XGB) on the dataset transformed using L2 normalisation. We applied recursive feature elimination (RFE) with SVM to perform automatic feature selection. The 67 automatically selected (AS) features were then further streamlined through expert review. The finalised set of 22 semantic features achieved AUC, sensitivity, specificity, and accuracy similar to those of the original feature set (115 features) and the AS feature set (67 features). LR with the linguistically enhanced (LE) feature set (22 features) exhibited stability and robustness on training data of different sizes (20%, 40%, 60%, 80%), and consistently high performance compared with the other 4 algorithms (SVM_linear, SVM_RBF, RF, XGB).
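
The automatic part of this workflow (L2 normalisation, RFE driven by a linear SVM, then logistic regression) can be sketched with scikit-learn. This is only an illustrative sketch, not the authors' code: the data are synthetic, the per-sample reading of "L2 normalisation", the fixed target of 67 features, the train/test split, and all hyperparameters are assumptions, and the expert review step that reduces 67 features to 22 cannot be reproduced in code.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC

# Synthetic stand-in for the corpus (assumption): 115 numeric features
# and a binary label (1 = easy, 0 = hard).
X, y = make_classification(
    n_samples=400, n_features=115, n_informative=20, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

pipeline = make_pipeline(
    # Scale each sample vector to unit L2 norm (assumed interpretation).
    Normalizer(norm="l2"),
    # Recursive feature elimination ranked by linear SVM weights;
    # 67 mirrors the size of the AS feature set reported above.
    RFE(estimator=SVC(kernel="linear"), n_features_to_select=67, step=1),
    # Final interpretable classifier.
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)

# Count the retained features and score the held-out split by AUC.
n_selected = int(pipeline.named_steps["rfe"].support_.sum())
auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
print(f"features kept: {n_selected}, test AUC: {auc:.3f}")
```

Placing the normaliser and the selector inside one pipeline means both are fitted on the training split only, which avoids leaking information from the test split into feature selection.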

Results:

We identified semantic features (with positive regression coefficients) contributing to the prediction of easy-to-understand online health texts, and semantic features (with negative regression coefficients) contributing to the prediction of hard-to-understand health materials for readers from non-native English backgrounds. Language complexity was explained by lexical difficulty (rarity, medical terminology), verbs typical of medical discourse, and syntactic complexity. Language easiness of online health materials was associated with features such as common speech act verbs, personal pronouns, and familiar reasoning verbs. Successive permutation of features illustrated the interaction between these features and their impact on key performance indicators of the machine learning algorithms.
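
The sign-based reading of the coefficients described above can be illustrated with a small sketch. It assumes the "easy" class is coded as the positive outcome, so a positive coefficient raises the log-odds of an easy text and a negative one lowers them. The feature names and coefficient values are hypothetical placeholders, not results from the study.

```python
def split_by_coefficient_sign(coefficients):
    """Group feature names by the sign of their regression coefficient.

    Returns (easy, hard): features with positive coefficients, strongest
    first, and features with negative coefficients, strongest first.
    """
    easy = sorted(
        (name for name, value in coefficients.items() if value > 0),
        key=lambda name: -coefficients[name],
    )
    hard = sorted(
        (name for name, value in coefficients.items() if value < 0),
        key=lambda name: coefficients[name],
    )
    return easy, hard


# Hypothetical coefficients for illustration only.
example_coefficients = {
    "personal_pronouns": 0.8,
    "speech_act_verbs": 0.5,
    "medical_terminology": -0.9,
    "rare_words": -0.4,
}

easy_features, hard_features = split_by_coefficient_sign(example_coefficients)
print("push towards easy:", easy_features)
print("push towards hard:", hard_features)
```

With a fitted scikit-learn logistic regression, the input dictionary could be built by pairing the feature names with the values in the model's `coef_` attribute.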

Conclusions:

The new logistic regression model exhibited consistency, scalability, and, more importantly, interpretability based on existing health and linguistic research. Low and high linguistic accessibility of online health materials were explained by two distinct sets of semantic features. This revealed the inherent complexity of effective health communication beyond current readability analyses, which were limited to syntactic complexity and lexical difficulty.

Citation

Please cite as:

Xie W, Ji M, Hao T, Chow CY

Predicting the Easiness and Complexity of English Health Materials for International Tertiary Students With Linguistically Enhanced Machine Learning Algorithms: Development and Validation Study

JMIR Med Inform 2021;9(10):e25110

DOI: 10.2196/25110

PMID: 34698644

PMCID: 8579219

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 7, 2021

Date Accepted: Sep 18, 2021
