Accepted for/Published in: JMIR Formative Research
Date Submitted: Dec 15, 2024
Open Peer Review Period: Dec 16, 2024 - Feb 10, 2025
Date Accepted: Aug 13, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Large Language Models (Chat GPT, Claude, and Gemini) have potential to significantly transform data analysis and medical education in Assisted Reproductive Technology: Comparison Study.
ABSTRACT
Background:
Recent studies have demonstrated that large language models (LLMs) perform exceptionally well on medical examinations. However, few reports have assessed their capabilities in specific domains or their application to practical data analysis using code interpreters. Furthermore, comparative analyses across different LLMs have not been extensively conducted.
Objective:
This study evaluated whether advanced AI models can analyze data from template-based input and demonstrate basic knowledge of reproductive medicine. Three AI models (ChatGPT, Claude, and Gemini) were evaluated for their data analysis capabilities through numerical calculations and graph rendering. Their knowledge of infertility treatment was assessed using ten examination questions provided by experts.
Methods:
First, we uploaded data to the AI models and provided instruction templates through the chat interface. We investigated whether the models could analyze pregnancy rates and render graphs by blastocyst grade according to the Gardner criteria. Second, we assessed the models' diagnostic capabilities based on specialized knowledge. This evaluation used ten questions derived from the Japanese Fertility Specialist Examination and the Embryologist Certification Exam, along with chromosome imaging. These materials were curated under the supervision of certified embryologists and fertility specialists. All procedures were repeated ten times per AI model.
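The pregnancy rate analysis by blastocyst grade described above can be illustrated with a minimal Python sketch. It is not the code produced by any of the evaluated models and not the authors' pipeline. The function names, the record layout of (grade, outcome) pairs, and the sample records are all hypothetical, and the parser assumes every grade is written as a three-character Gardner code (expansion stage 1 to 6, inner cell mass A to C, trophectoderm A to C).

```python
from collections import defaultdict


def parse_gardner(grade):
    """Split a three-character Gardner grade such as '4AB' into
    (expansion stage, inner cell mass grade, trophectoderm grade)."""
    grade = grade.strip().upper()
    if (len(grade) != 3 or grade[0] not in "123456"
            or grade[1] not in "ABC" or grade[2] not in "ABC"):
        raise ValueError(f"not a three-character Gardner grade: {grade!r}")
    return int(grade[0]), grade[1], grade[2]


def pregnancy_rate_by_grade(records):
    """Group (grade, pregnant) pairs by grade and return
    {grade: (n_pregnant, n_transfers, rate)} sorted by grade."""
    counts = defaultdict(lambda: [0, 0])  # grade -> [pregnancies, transfers]
    for grade, pregnant in records:
        parse_gardner(grade)  # validation only; raises on malformed grades
        key = grade.strip().upper()
        counts[key][0] += int(bool(pregnant))
        counts[key][1] += 1
    return {g: (p, n, p / n) for g, (p, n) in sorted(counts.items())}


# Made-up illustrative records (hypothetical, NOT data from the study).
records = [("4AA", 1), ("4AA", 0), ("4AA", 1),
           ("4BB", 0), ("4BB", 1), ("3BC", 0)]

summary = pregnancy_rate_by_grade(records)
for grade, (p, n, rate) in summary.items():
    print(f"{grade}: {p}/{n} pregnancies ({rate:.0%})")
```

The resulting per-grade summary is the kind of table a bar chart could be drawn from. Plotting is left out here to keep the sketch free of third-party dependencies.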
Results:
GPT-4o and Gemini completed the analyses within 30 seconds, occasionally requiring minor corrections afterward. With Claude, however, the process did not reach the data analysis stage. GPT-4o, Claude, and Gemini achieved perfect scores on a set of nine knowledge-based questions derived from professional fertility specialist examinations. However, none of the AI models could accurately perform karyotype diagnostic tasks in reproductive medicine.
Conclusions:
The rapid processing by GPT-4o and Gemini demonstrates the potential of these AI models to significantly expedite data-intensive tasks in clinical settings. The perfect scores on the knowledge-based questions underscore the models' potential utility as educational tools or decision support systems in reproductive medicine. However, none of the models could accurately interpret medical images or make diagnoses from them.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.