
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 21, 2025
Date Accepted: Sep 3, 2025

The final, peer-reviewed published version of this preprint can be found here:

Large Language Model–Enhanced Drug Repositioning Knowledge Extraction via Long Chain-of-Thought: Development and Evaluation Study

Kang H, Li J, Hou L, Xu X, Zheng S, Li Q

Large Language Model–Enhanced Drug Repositioning Knowledge Extraction via Long Chain-of-Thought: Development and Evaluation Study

JMIR Med Inform 2025;13:e77837

DOI: 10.2196/77837

PMID: 41056561

PMCID: 12503436

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Large Language Model–Enhanced Drug Repositioning Knowledge Extraction via Long Chain of Thought: Construction and Evaluation

  • Hongyu Kang; 
  • Jiao Li; 
  • Li Hou; 
  • Xiaowei Xu; 
  • Si Zheng; 
  • Qin Li

ABSTRACT

Background:

Drug repositioning is a pivotal strategy in pharmaceutical research, offering accelerated and cost-effective therapeutic discovery. However, biomedical information relevant to drug repositioning is often complex, dispersed, and underutilized due to limitations of traditional extraction methods, such as reliance on annotated data and poor generalizability. Large language models (LLMs) show promise but face challenges such as hallucination and limited interpretability.

Objective:

This study proposes Long Chain-of-Thought for Drug Repositioning Knowledge Extraction (LCoDR-KE), a lightweight and domain-specific framework to enhance LLMs’ accuracy and adaptability in extracting structured biomedical knowledge for drug repositioning.

Methods:

A domain-specific schema defined 11 entity types (e.g., drug, disease) and 18 relationship types (e.g., treats, is biomarker of). Following this schema, we automatically annotated 10,000 PubMed abstracts via chain-of-thought prompt engineering. Of these, 1,000 expert-validated abstracts were curated into DrugReC, a high-quality specialized corpus, while the remaining abstracts were used for model training. The proposed LCoDR-KE framework then combined supervised fine-tuning of the Qwen2.5-7B-Instruct model with reinforcement learning under a dual-reward mechanism. Performance was evaluated against state-of-the-art models (e.g., CRF, BERT, BioBERT, Qwen2.5, DeepSeek-R1, and model variants) using precision, recall, and F1-score. Additionally, convergence of the training method was assessed by tracking performance across iteration steps.
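The annotation step described above (schema-constrained chain-of-thought prompting followed by structured parsing of the model's answer) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the schema subset, prompt wording, and the `parse_triplets` helper are assumptions for demonstration only.

```python
import json

# Illustrative subset of the domain schema (the full schema defines
# 11 entity types and 18 relationship types).
ENTITY_TYPES = ["drug", "disease", "gene", "target"]
RELATION_TYPES = ["treats", "is_biomarker_of", "inhibits"]

def build_cot_prompt(abstract: str) -> str:
    """Compose a chain-of-thought prompt that asks the model to reason
    step by step, then emit schema-conformant triplets as JSON."""
    return (
        "You are a biomedical information extraction assistant.\n"
        f"Allowed entity types: {', '.join(ENTITY_TYPES)}.\n"
        f"Allowed relations: {', '.join(RELATION_TYPES)}.\n"
        "Think step by step: (1) list candidate entities, "
        "(2) check each against the schema, (3) link entity pairs.\n"
        "Finally output one JSON list of {head, relation, tail} objects "
        "after the marker ANSWER:\n\n"
        f"Abstract: {abstract}"
    )

def parse_triplets(model_output: str) -> list:
    """Keep only the JSON after the ANSWER marker and drop any triplet
    whose relation violates the schema (a simple consistency filter)."""
    _, _, answer = model_output.partition("ANSWER:")
    triplets = json.loads(answer.strip())
    return [t for t in triplets if t.get("relation") in RELATION_TYPES]

# Mock model reply standing in for an actual LLM call:
reply = (
    "Step 1: candidate entities: metformin (drug), glioma (disease)...\n"
    'ANSWER: [{"head": "metformin", "relation": "treats", "tail": "glioma"},'
    ' {"head": "metformin", "relation": "mentioned_with", "tail": "AMPK"}]'
)
print(parse_triplets(reply))
# The off-schema "mentioned_with" triplet is filtered out.
```

Abstracts annotated this way would then be split into an expert-validated evaluation corpus and training data for the fine-tuning and reinforcement-learning stages.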

Results:

LCoDR-KE achieved an entity F1-score of 81.46% (e.g., drug: 95.83%; disease: 90.52%) and a triplet F1-score of 69.04%, outperforming traditional models and rivaling larger LLMs (DeepSeek-R1: entity F1=84.64%, triplet F1=69.02%). Ablation studies confirmed the contributions of supervised fine-tuning (F1 drops of 8.61% and 20.70% when removed) and reinforcement learning (drops of 6.09% and 14.09%). The training process demonstrated stable convergence, validated through iterative performance monitoring. Error analysis revealed four main error types and remaining challenges for further improvement.

Conclusions:

LCoDR-KE enhances LLMs’ domain-specific adaptability for drug repositioning by offering an open-source drug repositioning corpus (DrugReC) and a long chain-of-thought framework built on a lightweight LLM. This framework supports drug discovery and knowledge reasoning while providing scalable, interpretable solutions applicable to broader biomedical knowledge extraction tasks. The corpus and source code are available at https://github.com/kang-hongyu/LCoDR-KE.



Per the author's request the PDF is not available.