Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Mar 21, 2025
Date Accepted: Jun 10, 2025

The final, peer-reviewed published version of this preprint can be found here:

Naliyatthaliyazchayil P, Muthyala R, Gichoya JW, Purkayastha S

Evaluating the Reasoning Capabilities of Large Language Models for Medical Coding and Hospital Readmission Risk Stratification: Zero-Shot Prompting Approach

J Med Internet Res 2025;27:e74142

DOI: 10.2196/74142

PMID: 40737604

PMCID: 12310144

Evaluating the Reasoning Capabilities of Large Language Models for Medical Coding and Hospital Readmission Risk Stratification: A Zero-Shot Prompting Approach

  • Parvati Naliyatthaliyazchayil; 
  • Raajitha Muthyala; 
  • Judy Wawira Gichoya; 
  • Saptarshi Purkayastha

ABSTRACT

Background:

Large Language Models (LLMs) such as ChatGPT-4, LLaMA-3.1, Gemini-1.5, DeepSeek-R1, and OpenAI-O3 have shown promising potential in healthcare, particularly for complex clinical reasoning and decision support. However, their reliability across critical tasks such as diagnosis, medical coding, and risk prediction has been mixed, especially in real-world settings without task-specific training.

Objective:

This study aims to evaluate and compare the zero-shot performance of reasoning and non-reasoning LLMs in three essential clinical tasks: (1) primary diagnosis generation, (2) ICD-9 medical code prediction, and (3) hospital readmission risk stratification. The goal is to assess whether these models can serve as general-purpose clinical decision support tools and to identify performance gaps in current capabilities.

Methods:

Using the MIMIC-IV dataset, we selected a random cohort of 300 hospital discharge summaries. Prompts were engineered to include structured clinical content from five note sections: chief complaints, past medical history, surgical history, labs, and imaging. Prompts were standardized and zero-shot, with no model fine-tuning and no repetition across runs. All model interactions were conducted through publicly available web user interfaces, without APIs or backend access, to simulate real-world accessibility for non-technical users. We incorporated rationale elicitation into the prompts to evaluate model transparency, especially for reasoning models. Ground-truth labels were derived from the primary diagnosis documented in the clinical notes, the structured ICD-9 diagnosis codes, and hospital-recorded readmission frequencies for risk stratification. Performance was measured using F1 scores and correctness percentages, and comparative performance was analyzed statistically.
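
As a rough illustration of this setup, the Python sketch below assembles one standardized zero-shot prompt from the five note sections and appends a rationale-elicitation instruction. This is a minimal sketch under stated assumptions: the section keys, wording, and three-part task framing are illustrative, not the authors' exact prompt, and in the study the resulting text would be pasted into each model's web interface rather than sent through an API.

# Minimal sketch (assumed, not the authors' exact prompt) of assembling a
# standardized zero-shot prompt from structured discharge-summary sections.

NOTE_SECTIONS = [
    "chief complaints",
    "past medical history",
    "surgical history",
    "labs",
    "imaging",
]

def build_prompt(note: dict) -> str:
    """Assemble one zero-shot prompt to paste into a model's web UI."""
    # Render each of the five structured sections under a plain-text header.
    body = "\n\n".join(
        f"{section.title()}:\n{note.get(section, 'Not documented')}"
        for section in NOTE_SECTIONS
    )
    # Task instructions cover all three evaluated tasks and elicit rationales.
    instructions = (
        "Using only the clinical information above:\n"
        "1. State the most likely primary diagnosis.\n"
        "2. Predict the matching ICD-9 diagnosis code.\n"
        "3. Stratify this patient's hospital readmission risk.\n"
        "Briefly explain the reasoning behind each answer."
    )
    return body + "\n\n" + instructions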

Results:

Among non-reasoning models, LLaMA-3.1 achieved the highest primary diagnosis accuracy (85%), followed by ChatGPT-4 (84.7%) and Gemini-1.5 (79%). For ICD-9 prediction, correctness dropped sharply across all models: LLaMA-3.1 (42.6%), ChatGPT-4 (40.6%), and Gemini-1.5 (14.6%). Hospital readmission risk prediction was similarly weak among non-reasoning models: LLaMA-3.1 (41.3%), Gemini-1.5 (40.7%), and ChatGPT-4 (33%). Among reasoning models, OpenAI-O3 led in diagnosis (90%) and ICD-9 coding (45.3%), while DeepSeek-R1 performed slightly better in readmission risk prediction (72.66% vs OpenAI-O3's 70.66%). Despite improved explainability, reasoning models generated verbose responses, which may limit real-time usability. None of the models met clinical standards across all tasks, and medical coding remained the weakest area for every model.
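
For context on how correctness percentages and F1 scores like these can be produced, below is a minimal Python scoring sketch. It assumes exact-match scoring for diagnoses and codes and a three-level risk label; these matching rules and label names are our assumptions, and the paper's actual scoring procedure may differ.

# Minimal scoring sketch; exact-match rules and label names are assumptions.
from sklearn.metrics import f1_score

def correctness(predictions, ground_truth):
    """Percentage of predictions exactly matching the ground-truth labels."""
    hits = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truth)
    )
    return 100.0 * hits / len(ground_truth)

def risk_f1(predictions, ground_truth):
    """Macro F1 over an assumed low/medium/high readmission-risk scale."""
    return f1_score(
        ground_truth, predictions,
        labels=["low", "medium", "high"], average="macro",
    )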

Conclusions:

Current LLMs exhibit moderate success in zero-shot diagnosis and risk prediction but underperform in ICD-9 code generation, reinforcing findings from prior studies. Reasoning models offer marginally better performance and greater interpretability but remain limited in reliability. Overall, statistical analysis revealed that OpenAI-O3 outperformed the other models on aggregated performance across tasks. These results highlight the need for task-specific fine-tuning and human-in-the-loop training approaches. Future work will explore fine-tuning, stability across repeated trials, and evaluation on a different subset of de-identified real-world data with a larger sample size.






© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.