Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Oct 12, 2024
Date Accepted: Apr 28, 2025
Large Language Models in Randomized Controlled Trial Design: Observational Study
ABSTRACT
Background:
Randomized controlled trials (RCTs) face challenges such as limited generalizability, insufficient recruitment diversity, and high failure rates, often due to restrictive eligibility criteria and inefficient patient selection. Large language models (LLMs) have shown promise in various clinical tasks, but their potential role in RCT design remains underexplored.
Objective:
This study investigates the ability of LLMs, specifically GPT-4-Turbo-Preview, to assist in designing RCTs that enhance generalizability and recruitment diversity and reduce failure rates, while maintaining clinical safety and ethical standards.
Methods:
We conducted a non-interventional, observational study analyzing 20 parallel-arm RCTs, comprising 10 completed and 10 ongoing studies published after January 2024 to mitigate pretraining biases. The LLM was tasked with generating RCT designs based on input criteria, including eligibility, recruitment strategies, interventions, and outcomes. The accuracy of LLM-generated designs was quantitatively assessed by two independent clinical experts, who compared them to clinically validated ground truth data from ClinicalTrials.gov. We conducted statistical analyses using natural language processing (NLP)-based methods, including BLEU, ROUGE-L, and METEOR, to objectively score the corresponding LLM outputs. Qualitative assessments were performed using Likert scale ratings (1-3) for domains including safety, clinical accuracy, objectivity or bias, pragmatism, and inclusivity and diversity.
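The lexical-overlap metrics named above compare LLM-generated text against registry ground truth. As an illustrative sketch only (the study does not report which implementation or library it used), ROUGE-L can be computed as a longest-common-subsequence F-measure between a candidate text (the LLM output) and a reference text (the ClinicalTrials.gov record):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    via the standard dynamic-programming table."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure with simple whitespace tokenization.
    beta weights recall over precision, as in the original ROUGE definition."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)
```

Identical texts score 1.0 and fully disjoint texts score 0.0; BLEU and METEOR follow the same candidate-versus-reference pattern but use n-gram precision and stem/synonym matching, respectively.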
Results:
The LLM achieved an overall accuracy of 72% in replicating RCT designs. Recruitment and intervention designs demonstrated high agreement with the ground truth, achieving 88% and 93% accuracy, respectively. However, the LLM showed lower accuracy in designing eligibility criteria (55%) and outcomes measurement (53%). NLP statistical analysis yielded average objective scores of BLEU = 0.04, ROUGE-L = 0.20, and METEOR = 0.18 for the LLM outputs. Qualitative evaluations showed that LLM-generated designs scored above 2 points and closely matched the original designs' scores across all domains, indicating strong clinical alignment. Specifically, for published RCTs, both the original and LLM-generated designs ranked similarly highly in safety, clinical accuracy, and objectivity or bias, and for ongoing registered RCTs, LLM-generated designs were noninferior to the original designs across multiple domains. In particular, the LLM enhanced diversity and pragmatism, which are key factors in improving RCT generalizability and addressing failure rates.
Conclusions:
LLMs, such as GPT-4-Turbo-Preview, have demonstrated potential in improving RCT design, particularly in recruitment and intervention planning, while enhancing generalizability and addressing diversity. However, expert oversight and regulatory measures are essential to ensure patient safety and ethical standards. The findings support further integration of LLMs into clinical trial design, although continued refinement is necessary to address limitations in eligibility and outcomes measurement.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.