Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 20, 2025
Date Accepted: Aug 20, 2025

The final, peer-reviewed published version of this preprint can be found here:

Data Contamination in AI Evaluation


Comment on “Clinical Performance and Communication Skills of ChatGPT Versus Physicians in Emergency Medicine: Simulated Patient Study”

  • Alaeddin Acar

ABSTRACT

To the Editor: I write regarding the recent article "Clinical Performance and Communication Skills of ChatGPT Versus Physicians in Emergency Medicine: Simulated Patient Study" [1]. The study makes a significant contribution to the growing field of AI evaluation in medicine, and I congratulate the authors on their valuable work. However, I would like to highlight a potential methodological limitation in the written exam portion of the study.

The authors state that their exam questions were taken from a 2018 textbook, "100 Cases in Emergency Medicine and Critical Care" [2]. The AI model they tested, ChatGPT, was trained on vast amounts of publicly available internet text, which likely included this textbook, so the model may have encountered the exact questions and answers during training. This problem is known as "data contamination." If the AI has already seen the test items, its high scores may reflect memorization rather than genuine medical reasoning, which makes the comparison with the physicians, who were seeing the questions for the first time, unfair. The study found that ChatGPT performed much better than the physicians on this written test, but that result could be an artifact of contamination.

Other researchers in the field are aware of this problem and take steps to avoid it. For example, a radiology study by Busch et al. [3] used private, members-only cases that were unlikely to appear in the AI's training data, and a study by Noda et al. [4] on a Japanese medical exam used questions from an exam administered after the AI's training data cutoff date. These studies show the importance of testing AI on new, unseen questions. Because the study by Park et al. did not take such an approach, I believe the results of its written exam should be interpreted with caution. Future studies should use methods like those of Busch et al. and Noda et al. to ensure a fair and valid test of AI's abilities.
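As one concrete illustration of such a safeguard (a minimal sketch, not a method used in any of the cited studies), the Python code below screens exam items for heavy word n-gram overlap with a candidate source text before the items are used to benchmark a model. The function names, the 8-gram window, and the 0.3 threshold are all illustrative assumptions.

    # Minimal contamination screen (illustrative sketch).
    # Flags exam items whose word n-grams overlap heavily with a text
    # suspected to be in the model's training corpus, e.g., a textbook.

    def ngram_set(text, n=8):
        """Return the set of lowercase word n-grams in text."""
        tokens = text.lower().split()
        return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def contamination_score(item, source, n=8):
        """Fraction of the item's n-grams that also occur in the source."""
        item_grams = ngram_set(item, n)
        if not item_grams:
            return 0.0
        return len(item_grams & ngram_set(source, n)) / len(item_grams)

    # Hypothetical usage: keep only items below an overlap threshold.
    textbook_text = ("A 67-year-old man presents to the emergency department "
                     "with crushing central chest pain radiating to the left arm.")
    exam_items = [
        "A 67-year-old man presents to the emergency department with "
        "crushing central chest pain radiating to the left arm.",  # verbatim reuse
        "A 30-year-old woman reports sudden pleuritic chest pain and dyspnea.",
    ]
    clean_items = [q for q in exam_items
                   if contamination_score(q, textbook_text) < 0.3]
    print(len(clean_items))  # 1: only the non-overlapping item survives

A complementary safeguard, used by Noda et al. [4], is temporal filtering: restricting the benchmark to items written after the model's documented training cutoff, which rules out verbatim exposure regardless of textual overlap.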


Citation

Please cite as:

Acar A

Data Contamination in AI Evaluation

JMIR Med Inform 2025;13:e80987

DOI: 10.2196/80987

PMID: 41021280

PMCID: 12519028


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.