
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 15, 2024
Date Accepted: Jan 4, 2025
Date Submitted to PubMed: Jan 7, 2025

The final, peer-reviewed published version of this preprint can be found here:

Performance Assessment of Large Language Models in Medical Consultation: Comparative Study

Yang H, Seo S, Kim K

JMIR Med Inform 2025;13:e64318

DOI: 10.2196/64318

PMID: 39763114

PMCID: 11888074

Performance Assessment of Large Language Models in Medical Consultation: Comparative Study

  • Heyoung Yang
  • Sujeong Seo
  • Kyuli Kim

ABSTRACT

Background:

The COVID-19 pandemic has exacerbated depression, which is recognized as a significant social and medical concern. With the increasing interest in generative AI as an interactive consultant, there is a need to assess its applicability in medical discussions and consultations, particularly in the domain of depression.

Objective:

This study aims to evaluate the capability of large language models (LLMs) in generating responses to depression-related questions and compare the similarity between the generated and original answers.

Methods:

Depression-related questions and their corresponding answers were collected from the PubMedQA and QuoraQA datasets. Four LLMs (BioGPT, PMC-Llama, GPT-3.5, and Llama2) were used to generate responses to these questions. The quantity and quality of the generated responses were assessed, and the similarity between the high-quality generated answers and the original answers was measured using cosine similarity.
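The similarity step described above can be illustrated with a minimal sketch. Note the assumptions: the study's actual pipeline is not specified here beyond "cosine similarity," so this example uses simple term-frequency vectors built with the Python standard library rather than the dense sentence embeddings a real evaluation would likely use; the example texts are invented for illustration.

```python
# Minimal sketch of cosine similarity between a generated answer and an
# original (reference) answer, using term-frequency (bag-of-words) vectors.
# This illustrates the metric only; it is not the study's actual pipeline.
from collections import Counter
import math


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts via term-frequency vectors."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Dot product over terms that appear in both texts.
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example pair (not taken from the study's datasets).
original = "depression is associated with increased inflammation markers"
generated = "depression is linked with increased markers of inflammation"
score = cosine_similarity(original, generated)
```

A score near 1 indicates strong lexical overlap with the reference answer; scores such as the reported 0.632 and 0.590 would sit in the moderate-to-high range on this scale.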

Results:

The latest general LLMs, GPT-3.5 and Llama2, outperformed the biomedical domain LLMs (BioGPT and PMC-Llama) in generating answers to medical questions sourced from PubMedQA. GPT-3.5 and Llama2 generated answers with higher similarity to the original answers, with scores of 0.632 and 0.590, respectively. The textual validity of the generated answers was also deemed satisfactory.

Conclusions:

The rapid development of LLMs in recent years suggests that version upgrades of general-purpose models do more to improve the capacity to generate "knowledge text" in the biomedical domain than fine-tuning on biomedical data does. The findings highlight the potential for future advances in prompt engineering and interactive process modeling to further enhance the ability of general LLMs to answer biomedical questions.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.