
Accepted for/Published in: JMIR Cancer

Date Submitted: Jan 31, 2025
Open Peer Review Period: Jan 31, 2025 - Mar 28, 2025
Date Accepted: Sep 16, 2025

The final, peer-reviewed published version of this preprint can be found here:

Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study

Hashtarkhani S, Rashid R, Brett CL, Chinthala L, Kumsa FA, Zink JA, Davis RL, Schwartz DL, Shaban-Nejad A

JMIR Cancer 2025;11:e72005

DOI: 10.2196/72005

PMID: 41037674

PMCID: 12490771

Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study

  • Soheil Hashtarkhani; 
  • Rezaur Rashid; 
  • Christopher L Brett; 
  • Lokesh Chinthala; 
  • Fekede Asefa Kumsa; 
  • Janet A Zink; 
  • Robert L Davis; 
  • David L Schwartz; 
  • Arash Shaban-Nejad

ABSTRACT

Background:

Electronic Health Records (EHRs) contain inconsistently structured or free-text data, requiring efficient preprocessing to enable predictive healthcare models. While artificial intelligence-driven natural language processing tools show promise for automating diagnosis classification, their comparative performance and clinical reliability require systematic evaluation.

Objective:

To evaluate the performance of four large language models (GPT-3.5, GPT-4o, Llama 3.2, and Gemini 1.5) and BioBERT in classifying cancer diagnoses from structured and unstructured electronic health record data.

Methods:

We analyzed 762 unique diagnoses (326 ICD code descriptions and 436 free-text entries) from the records of 3,456 patients with cancer. Models were tested on their ability to categorize diagnoses into 14 predefined categories. Two oncology experts validated the classifications.
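The classification setup described above can be sketched as a zero-shot prompting loop. The category names, prompt wording, and `mock_llm` stand-in below are illustrative assumptions, not the paper's actual prompts or its 14 categories:

```python
CATEGORIES = [  # illustrative subset; the study uses 14 predefined categories
    "breast", "lung", "prostate", "colorectal", "melanoma", "other",
]

def build_prompt(diagnosis_text):
    """Construct a zero-shot classification prompt (hypothetical wording)."""
    return (
        "Classify the following cancer diagnosis into exactly one category "
        f"from this list: {', '.join(CATEGORIES)}.\n"
        f"Diagnosis: {diagnosis_text}\n"
        "Answer with the category name only."
    )

def parse_category(response_text):
    """Map a raw model response onto a known category, else 'other'."""
    answer = response_text.strip().lower()
    return answer if answer in CATEGORIES else "other"

def mock_llm(prompt):
    """Stand-in for a real LLM call (GPT-4o, Gemini, Llama, ...), so the
    sketch runs without API access."""
    return "breast" if "breast" in prompt.lower() else "other"

diagnosis = "Malignant neoplasm of upper-outer quadrant of left female breast"
print(parse_category(mock_llm(build_prompt(diagnosis))))  # prints breast
```

In practice, `mock_llm` would be replaced by a call to the model provider's API, and the parsed categories compared against the expert-validated labels.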

Results:

BioBERT achieved the highest accuracy (90.7%) and weighted accuracy (94.6%) for ICD code descriptions, but its performance dropped to 81.6% accuracy for free-text entries. GPT-4o matched BioBERT’s ICD code accuracy and slightly outperformed it on free-text entries (81.8% accuracy), while GPT-3.5, Gemini, and Llama showed lower overall performance. Common misclassification patterns included difficulty distinguishing metastatic cancers and interpreting ambiguous clinical terminology.
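The accuracy metrics above can be computed with a few lines of code. The abstract does not define "weighted accuracy"; the sketch below assumes one common weighting scheme, the mean of per-class recalls (balanced accuracy), which prevents frequent categories from dominating the score:

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of diagnoses assigned to the correct category."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so rare categories count equally.
    (Assumed definition; the paper's exact weighting is not stated
    in the abstract.)"""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy labels for illustration only (not study data).
y_true = ["breast", "breast", "lung", "lung", "melanoma"]
y_pred = ["breast", "lung", "lung", "lung", "melanoma"]
print(round(accuracy(y_true, y_pred), 2))           # prints 0.8
print(round(balanced_accuracy(y_true, y_pred), 2))  # prints 0.83
```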

Conclusions:

While current accuracy levels are sufficient for administrative tasks, success in clinical applications depends on standardized documentation combined with appropriate human oversight for critical decisions.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.