Sample size considerations for fine-tuning Large Language Models for Named Entity Recognition Tasks: A methodological study
ABSTRACT
Background:
Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.
Objective:
To evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named-entity recognition (NER) for a custom dataset of conflict of interest (COI) disclosure statements.
Methods:
A random sample of 200 disclosure statements was prepared for annotation. All PERSON and ORG entities were identified by each of two raters, and once appropriate inter-rater agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random subsamples spanning a range of sizes were drawn. These 2500 training subsamples were used to fine-tune RoBERTa models for NER, and multiple linear regression was used to assess the relationship between sample size (in sentences), entity density (entities per sentence, or EPS), and fine-tuned model performance (F1). Additionally, single-predictor threshold regression models were used to evaluate whether increased sample size or entity density yields diminishing marginal returns.
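The two-predictor regression described above can be illustrated with a minimal sketch: an ordinary least squares fit of F1 on sentence count and EPS via the normal equations. This is illustrative pure Python, not the study's actual analysis code; the input triples stand in for the per-subsample measurements.

```python
def fit_ols(rows):
    """Fit F1 ~ intercept + sentences + EPS by ordinary least squares.

    rows: list of (sentences, eps, f1) triples, one per training subsample.
    Returns [intercept, beta_sentences, beta_eps].
    """
    # Design matrix with an intercept column.
    X = [(1.0, s, e) for s, e, _ in rows]
    y = [f for _, _, f in rows]
    n = 3
    # Normal equations: (X^T X) beta = X^T y.
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n)]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, n))) / A[r][r]
    return beta
```

In practice such a fit would be run with a statistics package that also reports the F-statistic and confidence intervals; the sketch shows only the coefficient estimation.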
Results:
Fine-tuned models ranged in overall NER performance from F1 = 0.433 to F1 = 0.936, with an average performance of F1 = 0.836 (SD = 0.135). The two-predictor multiple linear regression model was statistically significant, F(2,2497) = 2034, P<.001; multiple R2 = 0.6197. The estimates for both independent variables were also statistically significant, with βEPS = 0.04 (95% CI: 0.02 to 0.06) and βsent = 0.0004 (95% CI: 0.00034 to 0.00036). The threshold model for total sentences estimates that diminishing marginal returns begin at 448 sentences (95% CI: 437 to 456), P<.001. The threshold model for EPS likewise indicates diminishing marginal returns at an entity density of 1.36 (95% CI: 1.35 to 1.37), P<.001.
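A single-predictor threshold model of the kind reported above can be sketched as a "broken-stick" regression: the slope is allowed to change at a breakpoint, and the breakpoint is chosen by grid search to minimize residual error. The abstract does not specify the exact estimator used, so the piecewise-linear formulation below is an assumption for illustration.

```python
def _solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    n = 3
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x


def fit_threshold(xs, ys, candidates):
    """Grid-search a breakpoint t for y ~ x with a slope change after t.

    Returns (best_threshold, [intercept, slope, slope_change]).
    """
    best = None
    for t in candidates:
        # Hinge term max(0, x - t) lets the slope change after the threshold.
        X = [(1.0, x, max(0.0, x - t)) for x in xs]
        A = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
        b = [sum(r[i] * yi for r, yi in zip(X, ys)) for i in range(3)]
        beta = _solve3(A, b)
        # Keep the breakpoint with the lowest residual sum of squares.
        rss = sum((yi - sum(c * v for c, v in zip(beta, xi))) ** 2
                  for xi, yi in zip(X, ys))
        if best is None or rss < best[0]:
            best = (rss, t, beta)
    return best[1], best[2]
```

A negative slope-change coefficient at the selected breakpoint corresponds to the diminishing marginal returns the threshold models detect.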
Conclusions:
Relatively modest sample sizes suffice to fine-tune LLMs for NER tasks on biomedical text, and the entity density of training data should representatively approximate the entity density expected in production data.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.