Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 8, 2024
Date Accepted: Jul 18, 2024

The final, peer-reviewed published version of this preprint can be found here:

Claude 3 Opus and ChatGPT With GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis

Liu X, Duan C, Kim Mk, Jee E, Maharjan B, Du D, Huang Y, Zhang L, Jiang X

JMIR Med Inform 2024;12:e59273

DOI: 10.2196/59273

PMID: 39106482

PMCID: 11336503

Comparative Performance of Claude-3 Opus and ChatGPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis

  • Xu Liu; 
  • Chaoli Duan; 
  • Min-kyu Kim; 
  • Eunjin Jee; 
  • Beenu Maharjan; 
  • Dan Du; 
  • Yuwei Huang; 
  • Lu Zhang; 
  • Xian Jiang

ABSTRACT

Background:

Recent advancements in artificial intelligence (AI) and large language models (LLMs) have shown promising potential in various medical fields, including dermatology. LLMs such as ChatGPT have demonstrated the ability to generate human-like responses to text-based prompts and to assist in clinical decision-making. With the introduction of image analysis capabilities in LLMs, such as ChatGPT Vision, the application of these models in dermatological diagnostics has garnered significant interest. However, the emergence of other LLMs, such as Claude 3 Opus, warrants investigation. Claude 3 Opus is an advanced conversational AI model that has shown promising performance in various natural language processing tasks. Its ability to engage in context-aware dialogue and provide coherent responses makes it a potential candidate for assisting in clinical decision-making, including dermatological diagnostics.

Objective:

We compared the diagnostic performance of Claude 3 Opus and ChatGPT-4 to provide insights into their strengths and weaknesses and guide the selection and optimization of AI-assisted diagnostic tools in dermatology.

Methods:

We randomly selected 100 histopathology-confirmed dermoscopic images (50 malignant, 50 benign) from the International Skin Imaging Collaboration (ISIC) Archive database. Each model was prompted to provide the top 3 differential diagnoses for each image, ranked by likelihood. The models' responses were recorded for further analysis. We assessed primary diagnosis accuracy, top 3 differential diagnoses accuracy, and malignancy discrimination ability.

Results:

McNemar's test determined statistical significance (α=0.05). For primary diagnosis accuracy, Claude 3 Opus achieved 54.90% sensitivity, 57.14% specificity, and 56.00% accuracy, while ChatGPT-4 demonstrated 56.86% sensitivity, 38.78% specificity, and 48.00% accuracy (p=0.170). For top 3 differential diagnoses accuracy, Claude 3 Opus and ChatGPT-4 included the correct diagnosis in 76.00% and 78.00% of cases, respectively (p=0.564). For malignancy discrimination, Claude 3 Opus outperformed ChatGPT-4 with 47.06% sensitivity, 81.63% specificity, and 64.00% accuracy compared to 45.10%, 42.86%, and 44.00%, respectively (p=0.001). To further quantify the difference in malignancy discrimination ability, we calculated odds ratios (ORs) and 95% confidence intervals (CIs). Claude 3 Opus had an OR of 3.951 (95% CI: 1.685-9.263), indicating a stronger association between its predictions and actual malignancy than ChatGPT-4's OR of 0.616 (95% CI: 0.297-1.278).
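As a sketch of how the malignancy-discrimination metrics fit together, the snippet below recomputes sensitivity, specificity, accuracy, and the odds ratio from a 2x2 confusion matrix. The counts are an illustrative assumption reconstructed from the reported Claude 3 Opus percentages (47.06% ≈ 24/51, 81.63% ≈ 40/49), not data taken from the study, and the CI uses Woolf's log method, so its bounds need not match the published interval exactly.

```python
import math

# Hypothetical 2x2 counts inferred from the reported percentages for
# Claude 3 Opus malignancy discrimination; illustrative only.
tp, fn = 24, 27   # malignant lesions classified malignant / benign
tn, fp = 40, 9    # benign lesions classified benign / malignant

# Standard diagnostic performance metrics.
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fn + tn + fp)

# Odds ratio with a Woolf (log) 95% CI; the paper's CI may come from a
# different method, so these bounds are approximate.
odds_ratio = (tp * tn) / (fn * fp)
se_log_or = math.sqrt(1 / tp + 1 / fn + 1 / fp + 1 / tn)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"sensitivity={sensitivity:.2%} specificity={specificity:.2%} "
      f"accuracy={accuracy:.2%}")
print(f"OR={odds_ratio:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

With these assumed counts the script reproduces the reported 64.00% accuracy and OR of 3.951, which is a useful sanity check on how the summary statistics relate to the underlying confusion matrix.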

Conclusions:

Our study highlights the potential of LLMs in assisting dermatologists but also reveals their limitations. Both models made errors in diagnosing melanoma and benign lesions. Claude 3 Opus misdiagnosed melanoma as benign lesions in several cases, while ChatGPT-4 made similar errors. Conversely, both models misclassified benign lesions as melanoma in some examples. These findings underscore the limitations of current AI models and emphasize that they may not replace clinical diagnosis and treatment. In the future, more research should focus on developing robust, transparent, and clinically validated models through collaborative efforts between AI researchers, dermatologists, and other healthcare professionals. While AI can provide valuable insights, it is crucial to recognize that these models are not yet capable of replacing the expertise and judgment of trained clinicians in diagnosing and managing skin lesions. Clinical Trial: None


Citation

Please cite as:

Liu X, Duan C, Kim Mk, Jee E, Maharjan B, Du D, Huang Y, Zhang L, Jiang X

Claude 3 Opus and ChatGPT With GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis

JMIR Med Inform 2024;12:e59273

DOI: 10.2196/59273

PMID: 39106482

PMCID: 11336503


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.