Accepted for/Published in: JMIR Formative Research

Date Submitted: Dec 15, 2024
Open Peer Review Period: Dec 16, 2024 - Feb 10, 2025
Date Accepted: Aug 13, 2025

The final, peer-reviewed published version of this preprint can be found here:

Application of Large Language Models in Data Analysis and Medical Education for Assisted Reproductive Technology: Comparative Study

Okuyama N, Ishi M, Fukuoka Y, Hattori H, Kasahara Y, Toshihiro T, Yoshinaga K, Hashimoto T, Kyono K

JMIR Form Res 2025;9:e70107

DOI: 10.2196/70107

PMID: 41032884

PMCID: 12488165

Application of Large Language Models in Data Analysis and Medical Education for Assisted Reproductive Technology: Comparative Study

  • Noriyuki Okuyama
  • Mika Ishi
  • Yuriko Fukuoka
  • Hiromitsu Hattori
  • Yuta Kasahara
  • Tai Toshihiro
  • Koki Yoshinaga
  • Tomoko Hashimoto
  • Koichi Kyono

ABSTRACT

Background:

Recent studies have demonstrated that Large Language Models (LLMs) exhibit exceptional performance in medical examinations. However, there is a lack of reports assessing their capabilities in specific domains or their application in practical data analysis using code interpreters. Furthermore, comparative analyses across different LLMs have not been extensively conducted.

Objective:

The purpose of this study was to evaluate whether advanced AI models can analyze data from template-based input and demonstrate basic knowledge of reproductive medicine. The three AI models (ChatGPT, Claude, and Gemini) were evaluated for their data analytical capabilities through numerical calculations and graph rendering. Their knowledge of infertility treatment was assessed using ten examination questions developed by experts.

Methods:

First, we uploaded data to the AI models and furnished instruction templates using the chat interface. The study investigated whether the AI models could perform pregnancy rate analysis and graph rendering, based on blastocyst grades according to Gardner criteria. Second, we assessed model diagnostic capabilities based on specialized knowledge. This evaluation utilized ten questions derived from the Japanese Fertility Specialist Examination and the Embryologist Certification Exam, along with chromosome imaging. These materials were curated under the supervision of certified embryologists and fertility specialists. All procedures were repeated ten times per AI model.
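The pregnancy rate analysis described above can be sketched as a simple grouping by blastocyst grade. This is a minimal illustration only: the function name `pregnancy_rate_by_grade`, the record layout, and the toy records are assumptions made for this example, not the authors' data, template, or code.

```python
from collections import defaultdict


def pregnancy_rate_by_grade(records):
    """Group transfer records by blastocyst grade and compute the pregnancy rate.

    `records` is an iterable of (grade, pregnant) pairs, where `grade` is a
    Gardner-style label such as "4AA" and `pregnant` is a boolean.
    Returns {grade: (n_transfers, n_pregnant, rate)}, sorted by grade label.
    """
    counts = defaultdict(lambda: [0, 0])  # grade -> [transfers, pregnancies]
    for grade, pregnant in records:
        counts[grade][0] += 1
        counts[grade][1] += int(bool(pregnant))
    return {g: (n, p, p / n) for g, (n, p) in sorted(counts.items())}


# Hypothetical toy records, invented for illustration (not study data).
toy_records = [
    ("4AA", True), ("4AA", True), ("4AA", False),
    ("4AB", True), ("4AB", False),
    ("3BB", False), ("3BB", True), ("3BB", False), ("3BB", False),
]

rates = pregnancy_rate_by_grade(toy_records)
for grade, (n, p, rate) in rates.items():
    print(f"{grade}: {p}/{n} pregnancies ({rate:.1%})")
```

The resulting dictionary could then be passed to a plotting library to produce the kind of bar chart the models were asked to render; that step is omitted here to keep the sketch dependency-free.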

Results:

GPT-4o achieved Grade A output (defined as achieving the objective with a single output attempt) in 9 out of 10 trials, outperforming GPT-4, which achieved Grade A in 7 out of 10. The average processing times for data analysis were 26.8 seconds for GPT-4o and 36.7 seconds for GPT-4. Claude failed the data analysis task in all 10 attempts. Gemini achieved an average processing time of 23.0 seconds and received Grade A in 6 out of 10 trials, though occasional manual corrections were needed. Embryologists required an average of 358.3 seconds for the same tasks. In the knowledge-based assessment, GPT-4o, Claude, and Gemini achieved perfect scores (9/9) on multiple-choice questions, while GPT-4 showed a 60% success rate on one question. None of the AI models could reliably diagnose chromosomal abnormalities from karyotype images, with the highest image diagnostic accuracy being 70% for Claude and Gemini.
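To put the reported timings in proportion, the ratio of the embryologists' mean time to each model's mean time can be computed directly from the figures above. These ratios are simple arithmetic on the reported averages and are not values stated by the authors. They also ignore the Grade A success rates and any time spent on manual corrections.

```python
# Mean processing times in seconds, taken from the Results paragraph.
mean_seconds = {
    "GPT-4o": 26.8,
    "GPT-4": 36.7,
    "Gemini": 23.0,
}
embryologist_seconds = 358.3

# Ratio of human time to model time (derived here, not reported in the source).
speedup = {
    model: embryologist_seconds / seconds
    for model, seconds in mean_seconds.items()
}

for model, ratio in speedup.items():
    print(f"{model}: about {ratio:.1f} times faster than embryologists")
```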

Conclusions:

The rapid processing observed in the data analysis tasks demonstrates the potential for these AI models to significantly expedite data-intensive tasks in clinical settings. Their performance on the knowledge-based questions underscores their potential utility as educational tools or decision support systems in reproductive medicine. However, none of the models was able to accurately interpret medical images or make diagnoses from them.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.