Currently submitted to: JMIR Cancer

Date Submitted: Apr 27, 2026

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Metastasis Extraction from NSCLC Clinical Notes: A Retrospective Comparative Evaluation of Large Language Model-Based Classification

  • Sweta Balaji; 
  • Kiersten Campbell; 
  • Rou-Zhen Chen; 
  • Daniel Smith; 
  • Matthew Reyna; 
  • Abeed Sarker; 
  • Ravi Parikh; 
  • Selen Bozkurt

ABSTRACT

Background:

Identification of metastasis status in non-small cell lung cancer (NSCLC) is critical for understanding disease prognosis, treatment courses, trial eligibility, and population-level cancer surveillance. However, metastasis status is inconsistently recorded in structured cancer registry fields, because manual abstraction from clinical notes is resource-intensive and error-prone. This gap highlights an opportunity to leverage large language models (LLMs) for large-scale metastasis extraction from real-world clinical documentation.

Objective:

We conducted a retrospective, multi-cohort comparative evaluation of three distinct large language models (LLMs) on two independent classification tasks: presence of metastasis at any site and presence of brain/central nervous system (CNS) metastasis. We evaluated model performance on two independent NSCLC cohorts: (1) a registry-linked cohort used for model development and internal validation and (2) an independent cohort with manual note-level annotations for further validation. We further explored whether our methods could analyze clinical documentation and recover missing or outdated metastasis information in structured registry fields.

Methods:

Patient cohorts were derived from the Winship Cancer Institute. Cohort 1 (n=579 patients; 24,887 notes across 69 note types; 2023–2025) used registry-linked metastasis fields as the reference standard. Cohort 2 (n=22 patients; 644 radiology notes; 2010–2021) was drawn from two completed randomized trials and used dual-annotator manual labels (Cohen's κ: 0.93 for overall metastasis, 0.88 for CNS metastasis) as the reference standard. We fine-tuned the GatorTron-base encoder model separately for each binary classification task and evaluated MedGemma-27B-text and Llama 3.1-70B using zero-shot prompting. A separate cohort of 675 patients with missing or unknown registry labels was used for an exploratory missingness-recovery analysis, validated against manual annotations of a random subsample.
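The dual-annotator reference standard above is summarized by Cohen's κ. As an illustrative sketch (not the authors' code), κ for two raters assigning binary labels to the same notes can be computed as:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected inter-annotator agreement for two raters."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Values near 0.9, as reported for both tasks, indicate near-complete agreement beyond chance.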

Results:

More than half (54%) of initially identified Cohort 1 patients had missing or unknown registry metastasis labels. For overall metastasis, zero-shot MedGemma performed best (Cohort 1: F1=0.80; Cohort 2 patient-level: F1=1.0; Cohort 2 note-level: F1=0.93). For brain/CNS metastasis, Llama 3.1 performed best in both cohorts (Cohort 1: F1=0.79; Cohort 2 patient-level: F1=0.93; Cohort 2 note-level: F1=0.86). The fine-tuned GatorTron model showed strong performance for overall metastasis classification in Cohort 1 (F1=0.72). Error analysis indicated that most misclassifications reflected incomplete registry labels, ambiguous clinical language, or missing documentation rather than true model errors. In the exploratory recovery analysis, model predictions agreed with manual annotations at accuracy=0.90 and F1=0.89.
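The note-level versus patient-level F1 scores above imply rolling note predictions up to one label per patient. A minimal sketch, assuming a patient counts as positive when any of their notes is predicted positive (an aggregation rule not stated in the abstract), together with the F1 computation:

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def patient_level(note_preds):
    """Aggregate note-level predictions to patient-level labels.

    note_preds maps a patient ID to that patient's list of 0/1 note
    predictions; a patient is positive if ANY note is positive
    (assumed aggregation, for illustration only)."""
    return {pid: int(any(preds)) for pid, preds in note_preds.items()}
```

Under this rule, patient-level scores can exceed note-level scores, since a single correctly flagged note suffices to classify the patient.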

Conclusions:

All models demonstrated relatively high performance. The zero-shot generative models were more robust to nuanced documentation and context-dependent brain/CNS metastasis extraction. The fine-tuned encoder model demonstrated strong classification performance but may have been limited by inaccuracies in the registry reference standard used during training. This study also demonstrated the potential of LLMs to recover clinically plausible structured labels from narrative text, complementing cancer registries for metastasis ascertainment.


Citation

Please cite as:

Balaji S, Campbell K, Chen RZ, Smith D, Reyna M, Sarker A, Parikh R, Bozkurt S

Metastasis Extraction from NSCLC Clinical Notes: A Retrospective Comparative Evaluation of Large Language Model-Based Classification

JMIR Preprints. 27/04/2026:99573

DOI: 10.2196/preprints.99573

URL: https://preprints.jmir.org/preprint/99573


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.