
Accepted for/Published in: JMIR Cancer

Date Submitted: Jan 31, 2025
Open Peer Review Period: Jan 31, 2025 - Mar 28, 2025
Date Accepted: Sep 16, 2025

The final, peer-reviewed published version of this preprint can be found here:

Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study

Hashtarkhani S, Rashid R, Brett CL, Chinthala L, Kumsa FA, Zink JA, Davis RL, Schwartz DL, Shaban-Nejad A

JMIR Cancer 2025;11:e72005

DOI: 10.2196/72005

PMID: 41037674

PMCID: 12490771

Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study

  • Soheil Hashtarkhani; 
  • Rezaur Rashid; 
  • Christopher L Brett; 
  • Lokesh Chinthala; 
  • Fekede Asefa Kumsa; 
  • Janet A Zink; 
  • Robert L Davis; 
  • David L Schwartz; 
  • Arash Shaban-Nejad

ABSTRACT

Background:

Electronic Health Records (EHRs) contain inconsistently structured or free-text data, requiring efficient preprocessing to enable predictive healthcare models. While artificial intelligence-driven natural language processing tools show promise for automating diagnosis classification, their comparative performance and clinical reliability require systematic evaluation.

Objective:

To evaluate the performance of four large language models (GPT-3.5, GPT-4o, Llama 3.2, and Gemini 1.5) and BioBERT in classifying cancer diagnoses from structured and unstructured electronic health record data.

Methods:

We analyzed 762 unique diagnoses (326 ICD code descriptions and 436 free-text entries) from the records of 3,456 patients with cancer. Models were tested on their ability to categorize diagnoses into 14 predefined categories. Two oncology experts validated the classifications.
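The classification setup described above can be sketched as a zero-shot prompting loop. The category names, prompt wording, and `mock_llm` stand-in below are illustrative assumptions, not the paper's actual prompts or its 14 categories:

```python
CATEGORIES = [  # illustrative subset; the study uses 14 predefined categories
    "breast", "lung", "prostate", "colorectal", "melanoma", "other",
]

def build_prompt(diagnosis_text):
    """Construct a zero-shot classification prompt (hypothetical wording)."""
    return (
        "Classify the following cancer diagnosis into exactly one category "
        f"from this list: {', '.join(CATEGORIES)}.\n"
        f"Diagnosis: {diagnosis_text}\n"
        "Answer with the category name only."
    )

def parse_category(response_text):
    """Map a raw model response onto a known category, else 'other'."""
    answer = response_text.strip().lower()
    return answer if answer in CATEGORIES else "other"

def mock_llm(prompt):
    """Stand-in for a real LLM call (GPT-4o, Gemini, Llama, ...), so the
    sketch runs without API access."""
    return "breast" if "breast" in prompt.lower() else "other"

diagnosis = "Malignant neoplasm of upper-outer quadrant of left female breast"
print(parse_category(mock_llm(build_prompt(diagnosis))))  # prints breast
```

In practice, `mock_llm` would be replaced by a call to the model provider's API, and the parsed categories compared against the expert-validated labels.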

Results:

BioBERT achieved the highest accuracy (90.7%) and weighted accuracy (94.6%) for ICD code descriptions, but its performance dropped to 81.6% accuracy for free-text entries. GPT-4o matched BioBERT’s ICD code accuracy and slightly outperformed it on free-text entries (81.8% accuracy), while GPT-3.5, Gemini, and Llama showed lower overall performance. Common misclassification patterns included difficulty distinguishing metastatic cancers and interpreting ambiguous clinical terminology.
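The accuracy metrics above can be computed with a few lines of code. The abstract does not define "weighted accuracy"; the sketch below assumes one common weighting scheme, the mean of per-class recalls (balanced accuracy), which prevents frequent categories from dominating the score:

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of diagnoses assigned to the correct category."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so rare categories count equally.
    (Assumed definition; the paper's exact weighting is not stated
    in the abstract.)"""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy labels for illustration only (not study data).
y_true = ["breast", "breast", "lung", "lung", "melanoma"]
y_pred = ["breast", "lung", "lung", "lung", "melanoma"]
print(round(accuracy(y_true, y_pred), 2))           # prints 0.8
print(round(balanced_accuracy(y_true, y_pred), 2))  # prints 0.83
```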

Conclusions:

While current accuracy levels are sufficient for administrative tasks, success in clinical applications depends on standardized documentation combined with appropriate human oversight for critical decisions.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.