Accepted for/Published in: JMIR Medical Informatics
Date Submitted: May 21, 2025
Date Accepted: Sep 3, 2025
Large Language Model Enhanced Drug Reposition Knowledge Extraction via Long Chain of Thought: Development and Evaluation Study
ABSTRACT
Background:
Drug repositioning is a pivotal strategy in pharmaceutical research, offering accelerated and cost-effective therapeutic discovery. However, biomedical information relevant to drug repositioning is often complex, dispersed, and underutilized due to limitations in traditional extraction methods, such as reliance on annotated data and poor generalizability. Large Language Models (LLMs) show promise but face challenges like hallucinations and interpretability issues.
Objective:
This study proposes Long Chain-of-Thought for Drug Repositioning Knowledge Extraction (LCoDR-KE), a lightweight and domain-specific framework to enhance LLMs’ accuracy and adaptability in extracting structured biomedical knowledge for drug repositioning.
Methods:
A domain-specific schema defined 11 entity types (e.g., drug, disease) and 18 relationship types (e.g., treats, is biomarker of). Following this schema, we automatically annotated 10,000 PubMed abstracts via chain-of-thought prompt engineering. Of these, 1,000 expert-validated abstracts were curated into DrugReC, a high-quality specialized corpus, while the remaining abstracts were allocated for model training. The proposed LCoDR-KE framework then combined supervised fine-tuning (SFT) of the Qwen2.5-7B-Instruct model with reinforcement learning and a dual-reward mechanism. Performance was evaluated against state-of-the-art models (e.g., CRF, BERT, BioBERT, Qwen2.5, DeepSeek-R1, and model variants) using precision, recall, and F1-score. Additionally, the convergence of the training method was assessed by analyzing performance progression across iteration steps.
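The precision, recall, and F1 evaluation described above can be sketched as exact set matching between predicted and gold (head, relation, tail) triplets. This is a minimal illustrative sketch; the function name and sample triplets are assumptions, not taken from the paper or its released code.

```python
# Minimal sketch of triplet-level evaluation: precision, recall, and F1
# computed by exact set matching of predicted vs. gold triplets.
# Entity-level F1 works the same way over (entity, type) pairs.

def prf1(predicted, gold):
    """Return (precision, recall, f1) for exact-match set comparison."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)                       # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: one correct triplet, one spurious prediction
gold = [("aspirin", "treats", "headache"),
        ("CRP", "is biomarker of", "inflammation")]
pred = [("aspirin", "treats", "headache"),
        ("aspirin", "treats", "inflammation")]
p, r, f = prf1(pred, gold)   # 0.5, 0.5, 0.5
```

Exact matching is strict: a triplet with a partially correct entity span counts as fully wrong, which is one reason triplet F1 trails entity F1 in the reported results.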
Results:
LCoDR-KE achieved an entity F1 of 81.46% (e.g., drug: 95.83%, disease: 90.52%) and a triplet F1 of 69.04%, outperforming traditional models and rivaling larger LLMs (DeepSeek-R1: entity F1=84.64%, triplet F1=69.02%). Ablation studies confirmed the contributions of SFT (entity and triplet F1 drops of 8.61% and 20.70% when removed) and reinforcement learning (drops of 6.09% and 14.09% when removed). The training process demonstrated stable convergence, validated through iterative performance monitoring. Error analysis identified four main error types and highlighted challenges for further improvement.
Conclusions:
LCoDR-KE enhances LLMs' domain-specific adaptability for drug repositioning by offering an open-source drug repositioning corpus (DrugReC) and a long chain-of-thought (LCoT) framework built on a lightweight LLM. The framework supports drug discovery and knowledge reasoning while providing scalable, interpretable solutions applicable to broader biomedical knowledge extraction tasks. The corpus and source code are available at: https://github.com/kang-hongyu/LCoDR-KE.