
Accepted for/Published in: JMIR Medical Education

Date Submitted: Jul 31, 2023
Date Accepted: Aug 15, 2024

The final, peer-reviewed published version of this preprint can be found here:

Ehrett C, Hegde S, Andre K, Liu D, Wilson T

Leveraging Open-Source Large Language Models for Data Augmentation in Hospital Staff Surveys: Mixed Methods Study

JMIR Med Educ 2024;10:e51433

DOI: 10.2196/51433

PMID: 39560937

PMCID: 11590755

Warning: This is an author submission that has not been peer reviewed or edited. Preprints, unless they show as "accepted," should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Leveraging Open-Source Large Language Models for Data Augmentation to Improve Text Classification in Surveys of Medical Staff

  • Carl Ehrett; 
  • Sudeep Hegde; 
  • Kwame Andre; 
  • Dixizi Liu; 
  • Timothy Wilson

ABSTRACT

Background:

Generative large language models (LLMs) have the potential to revolutionize medical education by generating tailored learning materials, enhancing teaching efficiency, and improving learner engagement. However, the application of LLMs in healthcare settings, particularly for augmenting small datasets in text classification tasks, remains underexplored, especially in cost- and privacy-conscious applications that do not permit the use of third-party services such as OpenAI's ChatGPT.

Objective:

This paper explores the use of open-source LLMs, such as Large Language Model Meta AI (LLaMA) and Alpaca models, for data augmentation in a specific text classification task related to hospital staff surveys.
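To make the idea concrete, the sketch below shows what this kind of augmentation can look like with an open-source model served through the Hugging Face transformers library. It is a minimal illustration only: the checkpoint name, prompt wording, and decoding settings are assumptions, not the configuration used in the study.

```python
# Minimal sketch of LLM-based survey-text augmentation. The checkpoint,
# prompt, and decoding settings below are illustrative assumptions,
# not the study's configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "huggyllama/llama-7b"  # assumed: any open-source LLaMA-family checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def augment(text: str, n: int = 5, temperature: float = 0.7) -> list[str]:
    """Generate n paraphrase-style augments of one survey response."""
    prompt = (
        "Rewrite the following hospital staff survey response "
        f"in different words:\n{text}\nRewrite:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sampling, so the temperature setting takes effect
        temperature=temperature,
        max_new_tokens=128,
        num_return_sequences=n,
    )
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```

Each generated rewrite would then be added to the training set with the same label as its source response, which is how text augmentation typically enlarges a small labeled dataset.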

Methods:

The surveys were designed to elicit narratives of everyday adaptation by frontline radiology staff during the initial phase of the COVID-19 pandemic. The study evaluates the effectiveness of various LLMs, temperature settings, and downstream classifiers in improving classifier performance.
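As a rough illustration of the downstream side of such an evaluation, the sketch below fine-tunes a RoBERTa classifier on an (original plus augmented) training set and scores it by AUC. The hyperparameters and helper names are assumptions for illustration, not the study's exact setup.

```python
# Sketch of one cell of the evaluation grid: fine-tune a RoBERTa classifier
# on (original + augmented) texts and score it by test-set AUC.
# Hyperparameters below are assumptions, not the study's settings.
import numpy as np
from datasets import Dataset
from sklearn.metrics import roc_auc_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=256)

def train_and_score(train_texts, train_labels, test_texts, test_labels):
    """Return the test-set AUC of a classifier fine-tuned on the given data."""
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=2
    )
    train_ds = Dataset.from_dict(
        {"text": train_texts, "label": train_labels}
    ).map(tokenize, batched=True)
    test_ds = Dataset.from_dict(
        {"text": test_texts, "label": test_labels}
    ).map(tokenize, batched=True)
    args = TrainingArguments(
        output_dir="clf_out",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        report_to=[],
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    trainer.train()
    logits = trainer.predict(test_ds).predictions
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    return roc_auc_score(test_labels, probs[:, 1])
```

Sweeping this routine over candidate LLMs, temperatures, classifiers, and augment counts would produce the kind of comparison grid the study reports on.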

Results:

The overall best-performing combination of LLM, temperature, classifier, and number of augments is LLaMA 7B at temperature 0.7 with the Robustly Optimized BERT Pretraining Approach (RoBERTa) classifier and 100 augments, yielding a mean area under the receiver operating characteristic curve (AUC) of 0.87 (SD 0.02). The results demonstrate that open-source LLMs can enhance text classifiers' performance for small datasets in healthcare contexts, providing promising pathways for improving medical education processes and patient care practices.
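A summary statistic of this form, a mean AUC with a standard deviation, is typically computed across repeated train/test splits or cross-validation folds. A minimal sketch, assuming scikit-learn's roc_auc_score and an unspecified resampling scheme:

```python
# How a mean AUC with a standard deviation could be summarized across
# repeated runs; the resampling scheme itself is an assumption.
import numpy as np
from sklearn.metrics import roc_auc_score

def summarize_auc(runs):
    """runs: iterable of (y_true, y_score) pairs from repeated splits."""
    aucs = [roc_auc_score(y_true, y_score) for y_true, y_score in runs]
    return float(np.mean(aucs)), float(np.std(aucs))  # e.g. (0.87, 0.02)
```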

Conclusions:

The study demonstrates the value of data augmentation with open-source LLMs, highlights the importance of privacy and ethical considerations when using LLMs, and suggests future directions for research in this field.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.