Accepted for/Published in: JMIR Bioinformatics and Biotechnology

Date Submitted: Dec 30, 2024
Date Accepted: Apr 27, 2025

The final, peer-reviewed published version of this preprint can be found here:

Extracting Knowledge From Scientific Texts on Patient-Derived Cancer Models Using Large Language Models: Algorithm Development and Validation Study

Yao J, Perova Z, Mandloi T, Lewis E, Parkinson H, Savova G

JMIR Bioinform Biotech 2025;6:e70706

DOI: 10.2196/70706

PMID: 41342130

PMCID: 12232492

Extracting Knowledge From Scientific Texts on Patient-Derived Cancer Models Using Large Language Models: Algorithm Development and Validation Study

  • Jiarui Yao; 
  • Zinaida Perova; 
  • Tushar Mandloi; 
  • Elizabeth Lewis; 
  • Helen Parkinson; 
  • Guergana Savova

ABSTRACT

Background:

Patient-derived cancer models (PDCMs) have emerged as indispensable tools in both cancer research and preclinical studies, and the number of publications on PDCMs has increased substantially over the last decade. Developments in artificial intelligence (AI), particularly large language models (LLMs), hold promise for extracting knowledge from scientific texts at scale.

Objective:

The goal of this work is to develop and evaluate LLM-based systems that automatically extract PDCM-related entities from scientific texts.

Methods:

We explore direct prompting and soft prompting with LLMs. For direct prompting, we manually create prompts that guide the LLMs to output PDCM-related entities from texts; each prompt consists of an instruction, definitions of the entity types, gold examples, and a query. For soft prompting, a novel line of research in this domain, we automatically train prompts as continuous vectors using machine learning approaches. We experiment with state-of-the-art LLMs: the proprietary GPT-4o and a series of open LLaMA 3 family models.
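The direct-prompting setup described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual prompt: the instruction wording, the entity types, and the gold example are placeholders standing in for the 15 PDCM entity types used in the study.

```python
# Assemble a direct prompt from the four parts named in Methods:
# instruction, entity-type definitions, gold (few-shot) examples, and a query.

INSTRUCTION = (
    "Extract all entity mentions of the types defined below from the text. "
    "Return one 'type: mention' pair per line."
)

# Illustrative placeholder entity types, not the paper's full set of 15.
DEFINITIONS = {
    "model_type": "The kind of patient-derived cancer model, e.g. PDX or organoid.",
    "cancer_type": "The cancer diagnosis the model was derived from.",
}

# A hypothetical gold example: input text paired with the expected output.
GOLD_EXAMPLES = [
    ("We established PDX models from colorectal cancer biopsies.",
     "model_type: PDX\ncancer_type: colorectal cancer"),
]

def build_prompt(query_text: str) -> str:
    """Concatenate instruction, definitions, gold examples, and the query."""
    parts = [INSTRUCTION, "", "Entity types:"]
    for name, definition in DEFINITIONS.items():
        parts.append(f"- {name}: {definition}")
    parts.append("")
    for text, answer in GOLD_EXAMPLES:
        parts.append(f"Text: {text}\nEntities:\n{answer}\n")
    parts.append(f"Text: {query_text}\nEntities:")
    return "\n".join(parts)

prompt = build_prompt("Organoids were grown from breast cancer samples.")
```

The resulting string would be sent to the LLM as-is; soft prompting replaces the hand-written text above with learned continuous vectors prepended to the model's input embeddings.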

Results:

We annotated 100 abstracts of PDCM-relevant papers, focusing on papers about PDCMs for which metadata and data were deposited to the CancerModels.Org platform, resulting in 3,313 entity mentions across 15 entity types. We used 60 abstracts (2,089 entities) for training, 20 abstracts (542 entities) to refine the prompts, and 20 abstracts (682 entities) for the final evaluation. We evaluated the output with exact and overlapping span matching in two settings: (1) direct prompting, where the prompts are manually created, and (2) soft prompting, where the prompts are automatically learned continuous vectors. Results are reported as precision/positive predictive value, recall/sensitivity, and F1 (the harmonic mean of precision and recall). GPT-4o with direct prompting achieved F1 scores of 50.48 and 71.36 in the exact and overlapping match evaluation settings, respectively. In both evaluation settings, soft prompting improved the performance of the LLaMA 3 models: the F1 score of LLaMA 3.2 3B with soft prompting increased from 7.06 to 46.68 in the exact match setting and from 12.00 to 71.80 in the overlapping match setting, slightly higher than GPT-4o with direct prompting.
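A minimal sketch of the two span-matching evaluation settings named above: exact match requires identical offsets and entity type, while overlapping match accepts any span intersection with the same type. The spans below are invented for illustration and are not from the paper's data.

```python
# Each entity is a (start_offset, end_offset, type) triple.

def f1(gold, pred, match):
    """Precision/recall/F1 under a given span-matching predicate."""
    tp_pred = sum(1 for p in pred if any(match(g, p) for g in gold))
    tp_gold = sum(1 for g in gold if any(match(g, p) for p in pred))
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def exact(g, p):
    return g == p  # identical (start, end, type)

def overlap(g, p):
    (gs, ge, gt), (ps, pe, pt) = g, p
    return gt == pt and gs < pe and ps < ge  # same type, spans intersect

gold = [(0, 3, "model_type"), (10, 25, "cancer_type")]
pred = [(0, 3, "model_type"), (12, 25, "cancer_type")]

print(f1(gold, pred, exact))    # → 0.5 (one of two predictions matches exactly)
print(f1(gold, pred, overlap))  # → 1.0 (both predictions overlap a gold span)
```

This illustrates why overlapping-match F1 scores in the Results are consistently higher than exact-match ones: near-miss boundaries still count.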

Conclusions:

In this work, we applied recent advancements in LLMs to automatically extract PDCM-relevant entities from scientific texts. In our experiments, GPT-4o with direct prompts maintained competitive results, while soft prompting improved the performance of smaller open LLMs by a large margin. Our work shows that it is possible to match the performance of proprietary LLMs by training soft prompts with smaller open models. More broadly, our study contributes to the growing body of research on which tasks benefit from LLMs, as LLMs are unlikely to be the right technology for every single task.


Per the authors' request, the PDF is not available.

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.