Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Apr 7, 2024
Open Peer Review Period: Apr 20, 2024 - Jun 20, 2024
Date Accepted: Jul 5, 2024
Evaluating the Medical Article Understanding Capabilities of Generative Artificial Intelligence Tools
ABSTRACT
Background:
Reading medical articles is a challenging and time-consuming task for doctors, especially when the articles are long and complex. There is a need for tools that can help doctors process and understand medical articles more efficiently and accurately. Generative artificial intelligence (AI) tools can assist doctors in analyzing medical articles, but no research has yet evaluated the medical article understanding capabilities of the new generative AI tools.
Objective:
This study aims to critically assess and compare the comprehension capabilities of large language models (LLMs) in accurately and efficiently understanding medical research articles, using the STROBE checklist.
Methods:
This is a methodological study evaluating the medical article understanding capabilities of new generative AI tools. We designed a novel benchmark pipeline that can process PubMed articles of any length using various generative AI tools. Using this pipeline, we compared the answers of several generative AI tools (GPT-3.5-turbo, GPT-4, PaLM 2, Claude v1, and Gemini Pro) against a gold standard for 50 medical research articles from PubMed; an experienced medical professor's answers to the questions served as the gold standard. Each LLM was evaluated on 15 questions from the STROBE checklist covering the main sections of a scholarly article: title and abstract, methods, results, and discussion.
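The abstract does not include the pipeline's implementation; the following is a minimal sketch of the evaluation loop it describes, in which the chunking size, the `ask_model` stub, the exact-match scorer, and the question wording are illustrative assumptions rather than the authors' actual code.

```python
# Hypothetical sketch of the benchmark pipeline described above. The chunking
# size, the `ask_model` stub, and the question wording are assumptions; they
# are not the authors' published implementation.

from dataclasses import dataclass

@dataclass
class Article:
    pmid: str
    full_text: str

# Illustrative subset of the 15 STROBE checklist questions.
STROBE_QUESTIONS = [
    "Does the title or abstract indicate the study design?",
    "Are the eligibility criteria for participants described in Methods?",
    "Are the key results summarized with reference to the study objectives?",
]

def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split long articles so each piece fits a model's context window."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def ask_model(model: str, context: str, question: str) -> str:
    """Stub for a call to one of the evaluated LLM APIs (GPT, PaLM, Claude, Gemini)."""
    return "yes"  # placeholder; a real pipeline would call the provider's API

def evaluate(models: list[str], articles: list[Article],
             gold: dict[tuple[str, str], str]) -> dict[str, float]:
    """Score each model's answers against the professor's gold-standard answers."""
    correct = {m: 0 for m in models}
    total = 0
    for article in articles:
        context = " ".join(chunk_text(article.full_text))
        for question in STROBE_QUESTIONS:
            total += 1
            for model in models:
                answer = ask_model(model, context, question)
                if answer.strip().lower() == gold[(article.pmid, question)].strip().lower():
                    correct[model] += 1
    return {m: correct[m] / total for m in models}
```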
Results:
Among the LLMs, GPT-3.5 gave the most correct answers (66.9%), followed by GPT-4 (1106 version; 65.6%), PaLM 2 (62.1%), Claude v1 (58.3%), Gemini Pro (49.2%), and GPT-4 (0613 version; 44.1%). The LLMs showed distinct performance on each question across the different parts of a scholarly article, with models such as PaLM 2 and GPT-3.5 showing notable versatility and depth of understanding.
Conclusions:
To our knowledge, this is the first study to evaluate the ability of different LLMs to understand medical articles when the documents are supplied via the retrieval-augmented generation (RAG) method.
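For readers unfamiliar with RAG, a minimal illustration of the idea follows: only the article chunks most relevant to a given STROBE question are retrieved and passed to the model. The toy embedding, the cosine-similarity retriever, and the prompt format are assumptions for the sketch; the abstract does not describe the study's actual retrieval stack.

```python
# Hypothetical RAG sketch: retrieve the article chunks most relevant to a
# STROBE question and pass only those to the model. The embedding and prompt
# format are assumptions, not the study's actual implementation.

import math

def embed(text: str) -> list[float]:
    """Toy bag-of-letters embedding so the sketch runs end to end;
    a real pipeline would call an embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(chunks: list[str], question: str, k: int = 3) -> list[str]:
    """Return the k article chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]

def build_prompt(chunks: list[str], question: str) -> str:
    """Assemble a prompt containing only the retrieved excerpts."""
    context = "\n\n".join(retrieve(chunks, question))
    return (f"Using only the excerpts below, answer the question.\n\n"
            f"{context}\n\nQuestion: {question}")
```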
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.