Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Oct 21, 2022
Open Peer Review Period: Oct 19, 2022 - Oct 4, 2023
Date Accepted: Jul 22, 2023
The final, peer-reviewed published version of this preprint can be found here:

Large Language Models for Epidemiological Research via Automated Machine Learning: Case Study Using Data From the British National Child Development Study

Wibaek R, Andersen GS, Dahm CC, Witte DR, Hulman A

JMIR Med Inform 2023;11:e43638

DOI: 10.2196/43638

PMID: 37787655

PMCID: 10547934

Large language models for epidemiological research via automated machine learning: a case study and method comparison from the British National Child Development Study

  • Rasmus Wibaek
  • Gregers Stig Andersen
  • Christina C Dahm
  • Daniel R Witte
  • Adam Hulman

ABSTRACT

Background:

Large language models have had a huge impact on natural language processing (NLP) in recent years. However, their application in epidemiological research is still limited to analysis of electronic health records and social media data.

Objective:

To demonstrate the potential beyond these domains, we aimed to develop prediction models based on texts collected in an epidemiological cohort and to compare their performance with that of classical regression methods.

Methods:

We used data from the British National Child Development Study, where 10,567 11-year-old children wrote essays about how they imagined themselves as 25-year-olds. Fifteen percent of the dataset was set aside as a test set for performance evaluation. Pre-trained language models were fine-tuned using AutoTrain (by Hugging Face) to predict current reading comprehension score (0-35) and future body mass index (BMI) and physical activity (active vs. inactive) at the age of 33. We then compared their predictive performance (accuracy or discrimination) with linear and logistic regression models including demographic and lifestyle factors of the parents and the children between birth and age 11 as predictors.
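The 15% hold-out described above can be sketched as follows. This is only an illustration, not the authors' procedure: the function name `split_indices`, the fixed seed, and the use of a simple random shuffle are assumptions, and the abstract does not say whether the split was stratified or how missing outcomes were handled.

```python
import random


def split_indices(n_samples, test_fraction=0.15, seed=0):
    """Randomly split row indices into train and test sets.

    Hypothetical helper: the abstract only states that 15% of the
    data was set aside for testing, not how the split was drawn.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(n_samples * test_fraction)  # rounds down
    return indices[n_test:], indices[:n_test]


# 10,567 essays, as reported in the abstract
train_idx, test_idx = split_indices(10567)
print(len(train_idx), len(test_idx))
```

Keeping the test indices fixed across the NLP and regression models is what makes their performance estimates comparable.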

Results:

NLP clearly outperformed linear regression when predicting reading comprehension score (RMSE=3.89 [95% CI: 3.74, 4.05] for NLP vs. 4.14 [3.98, 4.30] and 5.41 [5.23, 5.58] for regression models with and without general ability score as a predictor). Predictive performance for physical activity was similarly poor for the two methods (AUC ROC=0.55 [0.52, 0.60] for both), although slightly better than random assignment. In contrast, linear regression clearly outperformed the NLP approach when predicting BMI (RMSE=4.38 [4.02, 4.74] for NLP vs. 3.85 [3.54, 4.16] for regression). The NLP approach did not perform better than simply assigning the mean BMI from the training set as the prediction.
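The comparison above rests on the root mean squared error (RMSE) on the test set, its 95% CI, and a baseline that predicts the training-set mean for everyone. A minimal sketch of these three pieces is given below. The abstract does not state how the confidence intervals were derived, so the percentile bootstrap used here is an assumption, and all function names are hypothetical.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mean_baseline(y_train, n_test):
    """Predict the training-set mean for every test observation."""
    mean = sum(y_train) / len(y_train)
    return [mean] * n_test


def bootstrap_rmse_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for RMSE (assumed method, not from the paper)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample test observations with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(rmse([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

A model is only informative if its RMSE falls below that of `mean_baseline` on the same test set, which is the check the BMI result above fails for the NLP approach.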

Conclusions:

Our study demonstrated the potential of using large language models to utilize text collected in epidemiological studies. The performance of the approach appeared to depend on how directly the topic of the text was related to the outcome. Open-ended questions specifically designed to capture certain health concepts and lived experiences, in combination with NLP methods, should receive more attention in future epidemiological studies.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.