Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Apr 19, 2022
Open Peer Review Period: Apr 19, 2022 - Jun 14, 2022
Date Accepted: Nov 16, 2022
Predicting Publication of Clinical Trials Using Structured and Unstructured Data
ABSTRACT
Background:
Publication of registered clinical trials is a critical step in the timely dissemination of trial findings, which can improve healthcare and advance medical research. However, a significant proportion of completed clinical trials are never published, motivating the need to analyse the factors behind success or failure to publish. This could inform study design, help regulatory decision making, and improve resource allocation. It could also enhance our understanding of bias in publication of trials, and publication trends based on the research direction or strength of the findings. While publication of clinical trials has been addressed in several descriptive studies at an aggregate level, there is a lack of research on predictive analysis of a trial’s publishability given an individual (planned) clinical trial description.
Objective:
To carry out a study that combines structured and unstructured (textual) features relevant to clinical trial publication status in a single predictive approach. Established natural language processing (NLP) techniques, as well as recent advances in using pretrained language models as textual encoders, enable us to incorporate information from the textual descriptions of clinical trials into a machine learning approach. We are particularly interested in whether textual features can improve classification accuracy for the publication outcome and, if so, which ones.
Methods:
In this study, we used pre-recorded metadata from ClinicalTrials.gov (a registry of clinical trials) and MEDLINE (a bibliographic database of academic journal articles) to build a dataset of clinical trials (N=76,950) that contains the description of each registered trial and its publication outcome (36% published, 64% unpublished). This is the largest dataset of its kind, which we release as part of this work. The publication outcome in the dataset was identified from MEDLINE based on clinical trial identifiers. We carried out a detailed descriptive analysis and predicted the publication outcome using two approaches: a neural network that represents the text using a large domain-specific language model, and a random forest classifier using a weighted “bag-of-words” representation of text.
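The identifier-based labelling step can be illustrated with a minimal sketch. It assumes that a trial counts as published when its registry identifier appears anywhere in the text of a bibliographic record. The function names, the toy trial identifiers, and the toy article strings are hypothetical and are not taken from the released dataset or from the authors' pipeline; the only fixed convention is that ClinicalTrials.gov identifiers consist of "NCT" followed by 8 digits.

```python
import re

# ClinicalTrials.gov registry identifiers have the form "NCT" + 8 digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def extract_trial_ids(text):
    """Return the set of NCT identifiers mentioned in a piece of text."""
    # Upper-case first so that "nct01234567" is matched as well.
    return set(NCT_PATTERN.findall(text.upper()))


def label_trials(trial_ids, article_records):
    """Map each registered trial ID to True (published) or False (unpublished).

    Assumption: a trial is 'published' if any article record cites its ID.
    """
    cited = set()
    for record in article_records:
        cited |= extract_trial_ids(record)
    return {trial_id: trial_id in cited for trial_id in trial_ids}


# Hypothetical toy data, for illustration only.
trials = ["NCT00000001", "NCT00000002", "NCT00000003"]
articles = [
    "Results of a randomised trial (ClinicalTrials.gov: NCT00000001).",
    "A follow-up analysis, registered as nct00000003.",
]

labels = label_trials(trials, articles)
published_rate = sum(labels.values()) / len(labels)
print(labels)
print(f"published: {published_rate:.0%}")
```

A real pipeline would more likely read identifiers from dedicated fields of the bibliographic records than from free text; the regular-expression match here only stands in for that lookup.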
Results:
First, our analysis of the newly created dataset corroborates several findings from the existing literature about attributes associated with a higher publication rate (e.g., the phase of a clinical trial). Second, a crucial observation from our predictive modelling is that the addition of textual features (e.g., eligibility criteria) offers consistent improvements over approaches that only use structured data (F1=0.62–0.64 vs. F1=0.61 without textual features). Both pretrained language models and more basic word-based representations provide high-utility text representations, with no significant empirical difference between the two.
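For readers less familiar with the metric, the F1 score reported above is the harmonic mean of precision and recall. The sketch below computes it for a single positive class (here, "published" encoded as 1). The abstract does not state whether the reported values are per-class or averaged across classes, so that choice is an assumption, and the labels and predictions are toy values rather than results from the study.

```python
def f1_score(y_true, y_pred, positive=1):
    """Compute the F1 score for one class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 1 = published, 0 = unpublished (illustrative values only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

Because the classes are imbalanced (36% published vs. 64% unpublished), F1 is more informative here than plain accuracy, which a classifier could inflate by always predicting the majority class.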
Conclusions:
Different factors affect whether a registered clinical trial is published or not. Our approach to predictive modelling combines heterogeneous features, both structured and unstructured (textual). We show that methods from NLP can provide effective textual features to enable more accurate prediction of publication success, which has not been explored for this task in previous work.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.