
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 7, 2024
Open Peer Review Period: Apr 20, 2024 - Jun 20, 2024
Date Accepted: Jul 5, 2024

The final, peer-reviewed published version of this preprint can be found here:

Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study

Akyon SH, Akyon FC, Camyar AS, Hızlı F, Sarı T, Hızlı Ş


JMIR Med Inform 2024;12:e59258

DOI: 10.2196/59258

PMID: 39230947

PMCID: 11411230

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Evaluating the Medical Article Understanding Capabilities of Generative Artificial Intelligence Tools

  • Seyma Handan Akyon; 
  • Fatih Cagatay Akyon; 
  • Ahmet Sefa Camyar; 
  • Fatih Hızlı; 
  • Talha Sarı; 
  • Şamil Hızlı

ABSTRACT

Background:

Reading medical articles is a challenging and time-consuming task for doctors, especially when the articles are long and complex. There is a need for a tool that can help doctors process and understand medical articles more efficiently and accurately. Generative artificial intelligence (AI) tools can assist doctors in analyzing medical articles, but no research has yet evaluated the capabilities of new generative AI tools in understanding medical articles.

Objective:

This study aims to critically assess and compare how accurately and efficiently large language models (LLMs) comprehend medical research articles, using the STROBE checklist.

Methods:

This methodological study evaluates the ability of new generative AI tools to understand medical articles. We designed a novel benchmark pipeline that can process PubMed articles of any length using various generative AI tools. Using this pipeline, we compared the answers of several generative AI tools (GPT-3.5-turbo, GPT-4, PaLM 2, Claude v1, and Gemini Pro) against a gold standard for 50 medical research articles from PubMed. An experienced medical professor's answers to the questions served as the gold standard. Each LLM answered 15 questions from the STROBE checklist covering different sections of a scholarly article: title and abstract, methods, results, and discussion.
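As a rough illustration (not the authors' actual code), the scoring step of such a benchmark pipeline can be sketched as follows. The model names, toy answers, and simple exact-match scoring against the gold standard are hypothetical assumptions for the sketch:

```python
def accuracy(model_answers, gold_answers):
    """Fraction of checklist questions where the model's answer
    matches the gold-standard (expert) answer exactly."""
    assert len(model_answers) == len(gold_answers)
    correct = sum(m == g for m, g in zip(model_answers, gold_answers))
    return correct / len(gold_answers)

# Toy example: 5 hypothetical STROBE-style questions for one article.
gold = ["yes", "no", "yes", "yes", "no"]
models = {
    "gpt-3.5-turbo": ["yes", "no", "yes", "no", "no"],
    "claude-v1":     ["yes", "yes", "yes", "no", "no"],
}

# Per-model accuracy over the question set.
scores = {name: accuracy(answers, gold) for name, answers in models.items()}
```

In the actual study, each model's free-text answers would first be judged correct or incorrect against the professor's answers before aggregating per-model percentages across the 50 articles.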

Results:

Among the answers given by the LLMs, GPT-3.5 gave the most correct answers (66.9%), followed by GPT-4 (1106 version; 65.6%), PaLM 2 (62.1%), Claude v1 (58.3%), Gemini Pro (49.2%), and GPT-4 (0613 version; 44.1%). The LLMs showed distinct performance on each question across the different parts of a scholarly article, with certain models, such as PaLM 2 and GPT-3.5, showing remarkable versatility and depth of understanding.

Conclusions:

This is the first study to evaluate the ability of different LLMs to understand medical articles when the documents are supplied via the Retrieval-Augmented Generation (RAG) method.


 Citation

Please cite as:

Akyon SH, Akyon FC, Camyar AS, Hızlı F, Sarı T, Hızlı Ş

Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study

JMIR Med Inform 2024;12:e59258

DOI: 10.2196/59258

PMID: 39230947

PMCID: 11411230


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.