Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 1, 2021
Date Accepted: Jun 15, 2021

The final, peer-reviewed published version of this preprint can be found here:

Predicting Writing Styles of Web-Based Materials for Children’s Health Education Using the Selection of Semantic Features: Machine Learning Approach

Xie W, Ji M, Liu Y, Hao T, Chow CY

JMIR Med Inform 2021;9(7):e30115

DOI: 10.2196/30115

PMID: 34292167

PMCID: 8367110

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Predicting Health Information Suitability for Children Using Machine-Learning Assisted Selection of Semantic Features

  • Wenxiu Xie
  • Meng Ji
  • Yanmeng Liu
  • Tianyong Hao
  • Chi-Yin Chow

ABSTRACT

Background:

The suitability of health resources for specific readerships is a critical yet underexplored area of health informatics research, despite its importance for health literacy and health education. Highly relevant health information can improve the suitability and readability of online health educational resources for young readers, and it plays an important role in developing the health literacy of children, who are increasingly exposed to online health information. Existing research on health resource evaluation is limited to the analysis of morphological and syntactic complexity. Moreover, no empirical instruments exist to evaluate the suitability of online health information for children.

Objective:

We aimed to develop algorithms that predict the suitability of online health information for children, an understudied user group, using a small number of semantic features, and thereby to provide accurate and convenient tools for automatic suitability assessment.

Methods:

Combining machine learning with linguistic insights, we identified semantic features that predict the suitability of online health information for children, an emerging and large readership of online health information. The natural language features used as predictor variables were selected in two stages: initial automatic feature selection using a ridge classifier, a support vector machine, and extreme gradient boosting, followed by revision by linguists and education experts based on principles of effective health information design. We compared algorithms trained on the automatically selected features (19) and on the linguistically enhanced features (20), using the initial feature set (115) as the baseline.
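The two-stage selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how the outputs of the three selectors were combined, so the majority vote below is an assumption, and the feature names (`sem_A` to `sem_F`), the `min_votes` threshold, and both function names are hypothetical.

```python
from collections import Counter


def vote_select(selections, min_votes=2):
    """Keep features chosen by at least `min_votes` automatic selectors.

    `selections` maps a selector name to the features it retained.
    Majority voting is an assumed combination rule, not the paper's.
    """
    votes = Counter(f for chosen in selections.values() for f in set(chosen))
    return sorted(f for f, n in votes.items() if n >= min_votes)


def expert_revise(features, add=(), drop=()):
    """Apply a manual revision: remove `drop`, then append `add`."""
    dropped = set(drop)
    revised = [f for f in features if f not in dropped]
    revised += [f for f in add if f not in revised]
    return revised


# Hypothetical output of the three automatic selectors.
selections = {
    "ridge": ["sem_A", "sem_B", "sem_C"],
    "svm": ["sem_A", "sem_C", "sem_D"],
    "xgboost": ["sem_B", "sem_C", "sem_E"],
}

auto_selected = vote_select(selections)
# Hypothetical expert decision: one extra feature judged relevant.
final_features = expert_revise(auto_selected, add=["sem_F"])

print(auto_selected)
print(final_features)
```

Keeping the expert step as a separate function makes the manual changes explicit and auditable, so the automatically selected set and the revised set can be compared against the same baseline.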

Results:

Using 5-fold cross-validation, the Gaussian Naive Bayes model with 20 features achieved statistically significantly higher performance than the baseline (115 features): mean sensitivity (P = 0.0206, 95% CI: -0.016, 0.1929), mean specificity (P = 0.0205, 95% CI: -0.016, 0.199), mean AUC (P = 0.017, 95% CI: -0.007, 0.140), and mean macro F1 (P = 0.0061, 95% CI: 0.016, 0.167). The statistically improved performance of the final model (20 features) stands in contrast with the statistically nonsignificant changes between the original feature set (115) and the automatically selected features (19): mean sensitivity (P = 0.134, 95% CI: -0.1699, 0.0681), mean specificity (P = 0.1001, 95% CI: -0.1389, 0.4017), mean AUC (P = 0.0082, 95% CI: 0.0059, 0.1126), and mean macro F1 (P = 0.9796, 95% CI: -0.0555, 0.0548). This demonstrates the importance and effectiveness of combining automatic feature selection with expert-based linguistic revision to develop the most effective machine learning algorithms from high-dimensional datasets.
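For readers unfamiliar with the four reported metrics, the following minimal sketch shows how sensitivity, specificity, macro F1, and AUC can be computed for a single cross-validation fold; the reported means would then be averages over the 5 folds. The labels and scores are made-up toy values, the function names are hypothetical, and the convention that 1 marks the positive class is an assumption. The code also assumes both classes occur in the fold.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, and macro F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # F1 per class, written as 2*TP / (2*TP + FP + FN) for that class.
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "macro_f1": (f1_pos + f1_neg) / 2,  # unweighted mean over classes
    }


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy fold: 4 positive and 6 negative documents (made-up values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
fold_metrics = binary_metrics(y_true, y_pred)
fold_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])

print(fold_metrics)
print(fold_auc)
```

Macro F1 averages the per-class F1 scores without weighting by class size, which is why it is often reported alongside sensitivity and specificity when the two classes are unbalanced.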

Conclusions:

Our study developed machine learning algorithms for evaluating the suitability of health information for children, an important readership that increasingly relies on online health information to develop health literacy. User-adaptive automatic assessment of online health content holds much promise for distance and remote health education among young readers. Our study leveraged the precision and adaptability of machine learning algorithms, together with insights from health linguistics, to help advance this significant yet understudied area of research.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.