Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Oct 12, 2024
Date Accepted: Apr 28, 2025
Large Language Models in Randomized Controlled Trial Design: Observational Study
ABSTRACT
Background:
Randomized controlled trials (RCTs) face challenges such as limited generalizability, insufficient recruitment diversity, and high failure rates, often due to restrictive eligibility criteria and inefficient patient selection. Large language models (LLMs) have shown promise in various clinical tasks, but their potential role in RCT design remains underexplored.
Objective:
This study investigates the ability of LLMs, specifically GPT-4-Turbo-Preview, to assist in designing RCTs that enhance generalizability and recruitment diversity and reduce failure rates, while maintaining clinical safety and ethical standards.
Methods:
We conducted a non-interventional, observational study analyzing 20 parallel-arm RCTs, comprising 10 completed and 10 ongoing studies published after January 2024 to mitigate pretraining biases. The LLM was tasked with generating RCT designs based on input criteria, including eligibility, recruitment strategies, interventions, and outcomes. The accuracy of LLM-generated designs was quantitatively assessed by two independent clinical experts, who compared them to clinically validated ground truth data from ClinicalTrials.gov. We conducted statistical analyses using natural language processing (NLP)-based methods, including BLEU, ROUGE-L, and METEOR, to objectively score the corresponding LLM outputs. Qualitative assessments were performed using Likert scale ratings (1-3) for domains including safety, clinical accuracy, objectivity or bias, pragmatism, and inclusivity and diversity.
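The lexical-overlap metrics named above compare LLM-generated text against registry ground truth. As an illustrative sketch only (the study does not report which implementation or library it used), ROUGE-L can be computed as a longest-common-subsequence F-measure between a candidate text (the LLM output) and a reference text (the ClinicalTrials.gov record):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    via the standard dynamic-programming table."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure with simple whitespace tokenization.
    beta weights recall over precision, as in the original ROUGE definition."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)
```

Identical texts score 1.0 and fully disjoint texts score 0.0; BLEU and METEOR follow the same candidate-versus-reference pattern but use n-gram precision and stem/synonym matching, respectively.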
Results:
The LLM achieved an overall accuracy of 72% in replicating RCT designs. Recruitment and intervention designs demonstrated high agreement with the ground truth, achieving 88% and 93% accuracy, respectively. However, the LLM showed lower accuracy in designing eligibility criteria (55%) and outcomes measurement (53%). NLP statistical analysis yielded average objective scores of BLEU = 0.04, ROUGE-L = 0.20, and METEOR = 0.18 for the LLM outputs. Qualitative evaluations showed that LLM-generated designs scored above 2 points and closely matched the original designs' scores across all domains, indicating strong clinical alignment. Specifically, for published RCTs, both the original and LLM-generated designs ranked similarly highly in safety, clinical accuracy, and objectivity or bias, and for ongoing registered RCTs, LLM-generated designs were noninferior to the original designs across multiple domains. In particular, the LLM enhanced diversity and pragmatism, which are key factors in improving RCT generalizability and addressing failure rates.
Conclusions:
LLMs, such as GPT-4-Turbo-Preview, have demonstrated potential in improving RCT design, particularly in recruitment and intervention planning, while enhancing generalizability and addressing diversity. However, expert oversight and regulatory measures are essential to ensure patient safety and ethical standards. The findings support further integration of LLMs into clinical trial design, although continued refinement is necessary to address limitations in eligibility and outcomes measurement.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.