JMIR Preprints #86425: An NLP Framework for Structuring and Visualizing Clinical Trial Eligibility Criteria at Scale

An NLP Framework for Structuring and Visualizing Clinical Trial Eligibility Criteria at Scale

Guannan Gong;
Justin Xie;
Jeet Parikh;
Jessica Liu;
Sameer Pandya

ABSTRACT

Background:

Eligibility criteria are essential to clinical trial design, guiding recruitment and ensuring patient safety and scientific rigor. However, criteria are often lengthy, heterogeneous, and inconsistently formatted, which hinders large-scale interpretation and slows patient-trial matching. Manual review is time-consuming and error-prone. Advances in natural language processing (NLP) and large language models (LLMs) offer opportunities to standardize and analyze eligibility text at scale.

Objective:

To develop and evaluate a scalable system that uses LLM-enabled NLP and unsupervised learning to identify, normalize, categorize, and visualize clinical trial eligibility criteria, with the goal of improving patient-trial matching and revealing domain-level trends.

Methods:

We designed a three-part pipeline: (1) representation of eligibility text using embeddings followed by clustering to group semantically similar criteria; (2) dual-layer, zero-shot LLM summarization for concept normalization, refinement, and de-duplication of cluster exemplars; and (3) an interactive, web-based visualization interface to explore criteria distributions and trends by disease domain and over time. The pipeline was applied to 53,872 oncology trials (breast, lung, and gastrointestinal) indexed on ClinicalTrials.gov. Outputs include cluster labels, normalized criterion summaries, and per-domain frequency profiles. Feasibility was assessed by successful end-to-end processing and inspection of face validity for cluster coherence and domain-specific patterns.
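Step (1) of the pipeline, together with the frequency profiles listed among the outputs, can be illustrated with a minimal sketch. The abstract does not name the embedding model or the clustering algorithm, so everything here is an assumption made for illustration: a bag-of-words count vector stands in for a learned embedding, a greedy cosine-similarity threshold stands in for the clustering step, and the function names (embed, cosine, cluster) and the threshold value 0.5 are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(criteria, threshold=0.5):
    """Greedy clustering (assumed, not the authors' method): each criterion
    joins the first cluster whose exemplar is similar enough, otherwise it
    starts a new cluster and becomes that cluster's exemplar."""
    clusters = []  # each entry: {"exemplar": vector, "members": [text, ...]}
    for text in criteria:
        vec = embed(text)
        for c in clusters:
            if cosine(vec, c["exemplar"]) >= threshold:
                c["members"].append(text)
                break
        else:
            clusters.append({"exemplar": vec, "members": [text]})
    return [c["members"] for c in clusters]


# Made-up example criteria, not taken from the study data.
criteria = [
    "ECOG performance status 0 or 1",
    "ECOG performance status of 0 to 1",
    "Pregnant or breastfeeding women",
    "Women who are pregnant or breastfeeding",
]
groups = cluster(criteria)

# Frequency profile: how many criteria fall under each cluster exemplar.
profile = {members[0]: len(members) for members in groups}
```

In a fuller implementation the exemplar of each group would be passed to the LLM summarization layers in step (2) to obtain a normalized label, and the profile would be computed separately per disease domain.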

Results:

The system successfully processed all 53,872 trials and generated stable clusters of inclusion/exclusion concepts. The LLM summarization layers produced concise, non-redundant labels that improved interpretability of clustered criteria. The visualization interface enabled rapid exploration of cross-trial patterns and temporal trends within breast, lung, and gastrointestinal oncology, facilitating identification of common inclusion requirements and potential barriers to enrollment. A public, open-source demonstration instance allows interactive exploration of these clusters and summaries.

Conclusions:

A combined embeddings–clustering–LLM pipeline can standardize heterogeneous eligibility text and surface domain-level patterns at scale. This framework provides a foundation for accelerating patient-trial matching and informing future trial design. While the current implementation was evaluated on ClinicalTrials.gov oncology trials, the approach is readily generalizable to additional diseases and alternative modeling configurations.

Clinical Trial:

Not applicable (methods and secondary analysis of publicly available trial records).

Citation

Please cite as:

Gong G, Xie J, Parikh J, Liu J, Pandya S

A Natural Language Processing Framework for Structuring and Visualizing Clinical Trial Eligibility Criteria at Scale: Protocol for a Quantitative Study

JMIR Res Protoc 2026;15:e86425

DOI: 10.2196/86425

PMID: 42133919

PMCID: 13175524

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Research Protocols

Date Submitted: Oct 23, 2025

Date Accepted: Mar 12, 2026
