Sample size considerations for fine-tuning Large Language Models for Named Entity Recognition Tasks: A methodological study
ABSTRACT
Background:
Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.
Objective:
To evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named-entity recognition (NER) for a custom dataset of conflict of interest (COI) disclosure statements.
Methods:
A random sample of 200 disclosure statements was prepared for annotation. All PERSON and ORG entities were identified by each of two raters, and once appropriate inter-rater agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random subsamples spanning a range of sizes were drawn. These 2500 training subsamples were used to fine-tune RoBERTa models for NER, and multiple linear regression was used to assess the relationship between sample size (in sentences), entity density (entities per sentence, or EPS), and fine-tuned model performance (F1). Additionally, single-predictor threshold regression models were used to evaluate whether increased sample size or entity density yields diminishing marginal returns.
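The two-predictor regression described above can be illustrated with a minimal sketch: an ordinary least squares fit of F1 on sentence count and EPS via the normal equations. This is illustrative pure Python, not the study's actual analysis code; the input triples stand in for the per-subsample measurements.

```python
def fit_ols(rows):
    """Fit F1 ~ intercept + sentences + EPS by ordinary least squares.

    rows: list of (sentences, eps, f1) triples, one per training subsample.
    Returns [intercept, beta_sentences, beta_eps].
    """
    # Design matrix with an intercept column.
    X = [(1.0, s, e) for s, e, _ in rows]
    y = [f for _, _, f in rows]
    n = 3
    # Normal equations: (X^T X) beta = X^T y.
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n)]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, n))) / A[r][r]
    return beta
```

In practice such a fit would be run with a statistics package that also reports the F-statistic and confidence intervals; the sketch shows only the coefficient estimation.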
Results:
Fine-tuned models ranged in overall NER performance from F1 = 0.433 to F1 = 0.936, with an average performance of F1 = 0.836 (SD = 0.135). The two-predictor multiple linear regression model was statistically significant, F(2,2497) = 2034, P<.001; multiple R2 = 0.6197. The estimates for both independent variables were also statistically significant, with βEPS = 0.04 (95% CI: 0.02 to 0.06) and βsent = 0.0004 (95% CI: 0.00034 to 0.00036). The threshold model for total sentences estimates that diminishing marginal returns begin at 448 sentences (95% CI: 437 to 456), P<.001. The threshold model for EPS likewise indicates diminishing marginal returns at an entity density of 1.36 (95% CI: 1.35 to 1.37), P<.001.
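A single-predictor threshold model of the kind reported above can be sketched as a "broken-stick" regression: the slope is allowed to change at a breakpoint, and the breakpoint is chosen by grid search to minimize residual error. The abstract does not specify the exact estimator used, so the piecewise-linear formulation below is an assumption for illustration.

```python
def _solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    n = 3
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x


def fit_threshold(xs, ys, candidates):
    """Grid-search a breakpoint t for y ~ x with a slope change after t.

    Returns (best_threshold, [intercept, slope, slope_change]).
    """
    best = None
    for t in candidates:
        # Hinge term max(0, x - t) lets the slope change after the threshold.
        X = [(1.0, x, max(0.0, x - t)) for x in xs]
        A = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
        b = [sum(r[i] * yi for r, yi in zip(X, ys)) for i in range(3)]
        beta = _solve3(A, b)
        # Keep the breakpoint with the lowest residual sum of squares.
        rss = sum((yi - sum(c * v for c, v in zip(beta, xi))) ** 2
                  for xi, yi in zip(X, ys))
        if best is None or rss < best[0]:
            best = (rss, t, beta)
    return best[1], best[2]
```

A negative slope-change coefficient at the selected breakpoint corresponds to the diminishing marginal returns the threshold models detect.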
Conclusions:
Relatively modest sample sizes suffice to fine-tune LLMs for NER tasks on biomedical text, and the entity density of training data should representatively approximate the entity density expected in production data.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.