JMIR Preprints #43638: Large language models for epidemiological research via automated machine learning: a case study and method comparison from the British National Child Development Study

Large language models for epidemiological research via automated machine learning: a case study and method comparison from the British National Child Development Study

Rasmus Wibaek;
Gregers Stig Andersen;
Christina C Dahm;
Daniel R Witte;
Adam Hulman

ABSTRACT

Background:

Large language models have had a huge impact on natural language processing (NLP) in recent years. However, their application in epidemiological research is still limited to analysis of electronic health records and social media data.

Objective:

To demonstrate the potential beyond these domains, we aimed to develop prediction models based on texts collected in an epidemiological cohort and to compare their performance with that of classical regression methods.

Methods:

We used data from the British National Child Development Study, in which 10,567 children aged 11 years wrote essays about how they imagined themselves as 25-year-olds. Fifteen percent of the dataset was set aside as a test set for performance evaluation. Pre-trained language models were fine-tuned using AutoTrain (by Hugging Face) to predict the current reading comprehension score (range 0-35) and, at the age of 33, body mass index (BMI) and physical activity (active vs. inactive). We then compared their predictive performance (accuracy or discrimination) with that of linear and logistic regression models that included demographic and lifestyle factors of the parents and the children between birth and age 11 as predictors.
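The evaluation setup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names (`train_test_split`, `rmse`, `auc_roc`), the random seed, and the rounding of the test-set size are all hypothetical choices, and the AUC is computed with the generic pairwise (Mann-Whitney) definition rather than any specific library routine.

```python
import math
import random

def train_test_split(n_items, test_fraction=0.15, seed=0):
    """Return (train_idx, test_idx) for a random hold-out split.

    The seed and the use of round() are assumptions for illustration only.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = round(n_items * test_fraction)
    return indices[n_test:], indices[:n_test]

def rmse(y_true, y_pred):
    """Root mean squared error, for continuous outcomes such as a score or BMI."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def auc_roc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Split 10,567 essay indices into training and test sets (15% held out).
train_idx, test_idx = train_test_split(10567)
```

RMSE would apply to the two continuous outcomes (reading comprehension score and BMI), and AUC to the binary physical activity outcome; an AUC of 0.5 corresponds to random assignment.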

Results:

NLP clearly outperformed linear regression when predicting reading comprehension score (RMSE=3.89 [95% CI: 3.74, 4.05] for NLP vs. 4.14 [3.98, 4.30] and 5.41 [5.23, 5.58] for regression models with and without general ability score as predictor). Predictive performance for physical activity was similarly poor for the two methods (AUC ROC=0.55 [0.52, 0.60] for both), although slightly better than random assignment. Linear regression clearly outperformed the NLP approach when predicting BMI (RMSE=4.38 [4.02, 4.74] for NLP vs. 3.85 [3.54, 4.16] for regression). The NLP approach did not perform better than simply assigning the mean BMI from the training set as the prediction for every participant.
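The mean-assignment comparison mentioned for BMI can be illustrated as follows. This is a hedged sketch with made-up toy values (not study data); `mean_baseline_rmse` is a hypothetical helper name, and the idea is simply that a useful model should reach a lower RMSE than a constant prediction equal to the training-set mean.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mean_baseline_rmse(y_train, y_test):
    """RMSE obtained by predicting the training-set mean for every test case."""
    train_mean = sum(y_train) / len(y_train)
    return rmse(y_test, [train_mean] * len(y_test))

# Toy values for illustration only (not from the study).
y_train = [22.0, 24.0, 26.0, 28.0]   # training mean is 25.0
y_test = [23.0, 27.0]
model_pred = [24.0, 26.0]            # predictions from some hypothetical model

baseline_error = mean_baseline_rmse(y_train, y_test)  # 2.0
model_error = rmse(y_test, model_pred)                # 1.0
model_beats_baseline = model_error < baseline_error   # True in this toy case
```

In the toy case the model beats the baseline; a model whose RMSE is no lower than the baseline's adds no predictive information beyond the average outcome.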

Conclusions:

Our study demonstrated the potential of using large language models to utilize text collected in epidemiological studies. The performance of the approach appeared to depend on how directly the topic of the text was related to the outcome. Open-ended questions specifically designed to capture certain health concepts and lived experiences, in combination with NLP methods, should receive more attention in future epidemiological studies.

Citation

Please cite as:

Wibaek R, Andersen GS, Dahm CC, Witte DR, Hulman A

Large Language Models for Epidemiological Research via Automated Machine Learning: Case Study Using Data From the British National Child Development Study

JMIR Med Inform 2023;11:e43638

DOI: 10.2196/43638

PMID: 37787655

PMCID: 10547934

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Oct 21, 2022

Open Peer Review Period: Oct 19, 2022 - Oct 4, 2023

Date Accepted: Jul 22, 2023
