Accepted for/Published in: JMIR Research Protocols
Date Submitted: Oct 23, 2025
Date Accepted: Mar 12, 2026
An NLP Framework for Structuring and Visualizing Clinical Trial Eligibility Criteria at Scale
ABSTRACT
Background:
Eligibility criteria are essential to clinical trial design, guiding recruitment and ensuring patient safety and scientific rigor. However, criteria are often lengthy, heterogeneous, and inconsistently formatted, which hinders large-scale interpretation and slows patient-trial matching. Manual review is time-consuming and error-prone. Advances in natural language processing (NLP) and large language models (LLMs) offer opportunities to standardize and analyze eligibility text at scale.
Objective:
To develop and evaluate a scalable system that uses LLM-enabled NLP and unsupervised learning to identify, normalize, categorize, and visualize clinical trial eligibility criteria, with the goal of improving patient-trial matching and revealing domain-level trends.
Methods:
We designed a three-part pipeline: (1) representation of eligibility text using embeddings followed by clustering to group semantically similar criteria; (2) dual-layer, zero-shot LLM summarization for concept normalization, refinement, and de-duplication of cluster exemplars; and (3) an interactive, web-based visualization interface to explore criteria distributions and trends by disease domain and over time. The pipeline was applied to 53,872 oncology trials (breast, lung, and gastrointestinal) indexed on ClinicalTrials.gov. Outputs include cluster labels, normalized criterion summaries, and per-domain frequency profiles. Feasibility was assessed by successful end-to-end processing and inspection of face validity for cluster coherence and domain-specific patterns.
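As a rough illustration of step (1), the sketch below groups near-duplicate criteria by vector similarity. It is not the authors' implementation. The abstract does not name the embedding model or the clustering algorithm, so a bag-of-words vector stands in for a learned embedding and a greedy, threshold-based pass stands in for the clustering step. The function names, the example criteria, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a unit-length bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def cluster(criteria, threshold=0.5):
    """Greedy single pass: a criterion joins the first cluster whose exemplar
    (the cluster's first member) is at least `threshold` similar, otherwise
    it starts a new cluster. The threshold is an illustrative assumption."""
    clusters = []  # each entry: {"exemplar": vector, "members": [text, ...]}
    for text in criteria:
        vec = embed(text)
        for c in clusters:
            if cosine(vec, c["exemplar"]) >= threshold:
                c["members"].append(text)
                break
        else:
            clusters.append({"exemplar": vec, "members": [text]})
    return [c["members"] for c in clusters]


# Hypothetical eligibility criteria, written for this example only.
criteria = [
    "ECOG performance status 0 or 1",
    "ECOG performance status of 0 to 1",
    "Pregnant or breastfeeding women",
    "Women who are pregnant or breastfeeding",
]
groups = cluster(criteria)
for members in groups:
    print(members)
```

In a pipeline like the one described, each resulting group would then be passed to the LLM summarization layers of step (2) to obtain a single normalized label.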
Results:
The system successfully processed all 53,872 trials and generated stable clusters of inclusion/exclusion concepts. The LLM summarization layers produced concise, non-redundant labels that improved interpretability of clustered criteria. The visualization interface enabled rapid exploration of cross-trial patterns and temporal trends within breast, lung, and gastrointestinal oncology, facilitating identification of common inclusion requirements and potential barriers to enrollment. A public, open-source demonstration instance allows interactive exploration of these clusters and summaries.
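The per-domain and temporal views mentioned above reduce to a simple aggregation, sketched here under stated assumptions. The record layout (`trial_id`, `domain`, `year`, `label`), the trial identifiers, and the labels are hypothetical and are not taken from the published system.

```python
from collections import defaultdict


def frequency_profiles(records):
    """Return {(domain, year): {label: share of trials mentioning that label}}.

    Only trials that appear in `records` count toward the denominator.
    """
    trials = defaultdict(set)    # (domain, year) -> trial ids
    mentions = defaultdict(set)  # (domain, year, label) -> trial ids
    for r in records:
        key = (r["domain"], r["year"])
        trials[key].add(r["trial_id"])
        mentions[key + (r["label"],)].add(r["trial_id"])

    profiles = defaultdict(dict)
    for (domain, year, label), ids in mentions.items():
        profiles[(domain, year)][label] = len(ids) / len(trials[(domain, year)])
    return dict(profiles)


# Hypothetical (trial, normalized label) records, for illustration only.
records = [
    {"trial_id": "T1", "domain": "breast", "year": 2020, "label": "ECOG 0-1"},
    {"trial_id": "T1", "domain": "breast", "year": 2020, "label": "Not pregnant"},
    {"trial_id": "T2", "domain": "breast", "year": 2020, "label": "ECOG 0-1"},
    {"trial_id": "T3", "domain": "lung", "year": 2021, "label": "ECOG 0-1"},
]
profiles = frequency_profiles(records)
```

Using sets of trial identifiers means a label repeated within one trial is counted once, so each value is a share of trials and not a share of criterion lines.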
Conclusions:
A combined embeddings–clustering–LLM pipeline can standardize heterogeneous eligibility text and surface domain-level patterns at scale. This framework provides a foundation for accelerating patient-trial matching and informing future trial design. While the current implementation was evaluated on ClinicalTrials.gov oncology trials, the approach is readily generalizable to additional diseases and alternative modeling configurations.
Clinical Trial:
Not applicable (methods and secondary analysis of publicly available trial records).
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.