JMIR Preprints #87802: Performance Evaluation of ChatGPT 5, Grok 4, and DeepSeek R1 in Interpreting Complete Blood Count Reports for Hematologic Diseases: A Comparative Study

Performance Evaluation of ChatGPT 5, Grok 4, and DeepSeek R1 in Interpreting Complete Blood Count Reports for Hematologic Diseases: A Comparative Study

Xianfei Ye;
Xinglun Qi;
Lina Fan;
Qian Yu;
Suming Zhou;
Chunyun Ren;
Dagan Yang

ABSTRACT

Background:

The interpretation of complete blood count (CBC) reports is a critical yet subjective task in the diagnosis of hematologic diseases. While large language models (LLMs) show promise for clinical decision support, their real-world performance and safety profiles remain insufficiently evaluated.

Objective:

To evaluate and compare three advanced LLMs—ChatGPT 5, Grok 4, and DeepSeek R1—in interpreting real-world CBC reports for hematologic diseases across multiple quality and task-specific dimensions.

Methods:

We retrospectively collected 100 CBC reports from patients with confirmed hematologic diseases at a tertiary hospital. The three LLMs interpreted each report across five sequential tasks: analyzer alert processing, abnormal item identification, correlation analysis of abnormal items, preliminary diagnosis, and clinical management. Outputs were evaluated in a blinded manner by two junior and two senior laboratory professionals across six quality dimensions using 5-point Likert scales. Inter-rater reliability was assessed via intraclass correlation coefficients (ICC). Model performance was compared using Friedman tests, and errors were classified as either hallucinations or reasoning errors.
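As an illustration of the Friedman comparison named above, the following is a minimal sketch and not the authors' analysis code. It assumes each report yields one Likert score per model, ranks the three scores within each report (ties share the average rank), and applies the usual tie correction. The function names and the input layout are hypothetical. Because three models give 2 degrees of freedom, the chi-square tail probability reduces to exp(-x/2), so no statistics library is needed.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_three_models(scores):
    """Friedman test for exactly three related samples.

    scores: list of (model_a, model_b, model_c) ratings, one tuple per report
    (hypothetical layout). Returns (statistic, p_value).
    """
    n, k = len(scores), 3
    rank_sums = [0.0] * k
    tie_term = 0
    for row in scores:
        if len(row) != k:
            raise ValueError("each row must hold exactly three scores")
        for j, r in enumerate(average_ranks(list(row))):
            rank_sums[j] += r
        # Tie correction term: sum of (t^3 - t) over tie groups in this row.
        tie_term += sum(t ** 3 - t for t in Counter(row).values())
    correction = 1 - tie_term / (n * k * (k ** 2 - 1))
    if correction == 0:
        raise ValueError("every row is fully tied; the statistic is undefined")
    raw = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    stat = raw / correction
    # Chi-square survival function with df = k - 1 = 2 is exp(-x / 2).
    p_value = math.exp(-stat / 2)
    return stat, p_value
```

Calling `friedman_three_models` on a list of per-report score triples returns the test statistic and an asymptotic p-value. The closed-form p-value holds only for three models; a comparison of more models would need a general chi-square tail function.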

Results:

Across 100 report interpretations, DeepSeek R1 achieved superior performance (median score 4.0 [IQR 4.0–5.0] for junior evaluators; 5.0 [IQR 4.0–5.0] for senior evaluators) with excellent inter-rater reliability (junior ICC 0.817 [95% CI 0.804–0.830]; senior ICC 0.766 [95% CI 0.749–0.782]). For all three LLMs, senior evaluators consistently assigned higher ratings than junior evaluators (p < 0.001). DeepSeek R1 outperformed the other models in five of six quality dimensions and across all clinical tasks (all p < 0.001). ChatGPT 5 demonstrated the highest concordance with gold-standard diagnoses (93%), whereas Grok 4 aligned most closely with initial clinical suspicions (96%) but demonstrated the lowest concordance with gold-standard diagnoses (89%). Notably, ChatGPT 5 exhibited 12 hallucination errors during analyzer alert processing; Grok 4 produced the highest proportion of unsafe outputs in clinical management (3.8%); and all models made unsupported inferences to varying degrees during the correlation analysis of abnormal items.
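For readers unfamiliar with how an ICC such as those reported above is obtained, here is a small sketch. The abstract does not state which ICC form the authors used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single rater) is an assumption, as are the function name and the input layout. The value is computed from the mean squares of a two-way ANOVA decomposition.

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per rated output and one column per rater
    (hypothetical layout). The ICC form itself is an assumption, since the
    abstract does not specify one.
    """
    n = len(ratings)        # number of rated items
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between items
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

With two raters per group, as in this study, `ratings` would hold one two-element row per rated output. Confidence intervals like those in the abstract require an additional F-distribution step that this sketch omits.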

Conclusions:

DeepSeek R1 achieved the highest ratings for CBC interpretation, particularly among senior evaluators, reflecting near-expert performance. ChatGPT 5 demonstrated the highest concordance with gold-standard diagnoses, highlighting strong reasoning capabilities. However, all models exhibited distinct error patterns and performance heterogeneity, underscoring the necessity for human oversight and providing an evidence-based framework for safe LLM deployment in laboratory medicine.

Citation

Please cite as:

Ye X, Qi X, Fan L, Yu Q, Zhou S, Ren C, Yang D

Performance Evaluation of GPT-5, Grok 4, and DeepSeek R1 in Interpreting Complete Blood Count Reports for Hematologic Diseases: Retrospective Comparative Study

J Med Internet Res 2026;28:e87802

DOI: 10.2196/87802

PMID: 42247415

PMCID: 13240632


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Nov 14, 2025

Date Accepted: May 6, 2026
