Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 6, 2025
Date Accepted: Oct 6, 2025
Date Submitted to PubMed: Oct 7, 2025

The final, peer-reviewed published version of this preprint can be found here:

Human-Machine Agreement in Medical Ethics: Patient Autonomy Case-Based Evaluation of Large Language Models

Mugu V, Carr B, Khandelwal A, Olson M, Schupbach J, Zietlow J, Vu TD, Chan A, Collura C, Schmitz J

JMIR Med Inform 2025;13:e77061

DOI: 10.2196/77061

PMID: 41056099

PMCID: 12592888

Human-Machine Agreement in Medical Ethics: Patient Autonomy Case-Based Evaluation of Large Language Models

  • Vamshi Mugu; 
  • Brendan Carr; 
  • Ashish Khandelwal; 
  • Mike Olson; 
  • John Schupbach; 
  • John Zietlow; 
  • T.N. Diem Vu; 
  • Alex Chan; 
  • Christopher Collura; 
  • John Schmitz

ABSTRACT

Background:

Medical ethics provides a moral framework for the practice of clinical medicine. Four principles (beneficence, non-maleficence, patient autonomy, and justice) form the cornerstones of medical ethics as it is practiced today. Of these four principles, patient autonomy holds a pivotal position and often takes precedence in ethical dilemmas that arise from conflicts among them. Its importance serves as a constant reminder to the clinician that the “needs of the patient come first.” With their remarkable ability to process natural language, large language models (LLMs) have recently pervaded nearly every aspect of human life, including medicine and medical ethics. Reliance on tools such as LLMs, however, raises fundamental questions in medical ethics, where human-like reasoning, emotional intelligence, and an understanding of local context and values are of utmost importance.

Objective:

While emphasizing the central role of the human factor, we undertake a bold venture to establish some confidence in LLMs as they pertain to medical ethics, not only by evaluating the status quo of foundational LLMs but also by exploring ways to improve them, using hypothetical cases grounded in patient autonomy. The literature is currently lacking in such ventures, and we believe projects such as ours must be revisited frequently in a field that is evolving at a pace both rapid and unprecedented.

Methods:

We evaluated three foundational LLMs (ChatGPT, LLaMA, and Gemini) on hypothetical cases in patient autonomy. We used Cohen κ to compare LLM responses to the consensus of a physician panel. McNemar’s test was used during the improvement phase and to report the final significance of each LLM’s improved agreement with the physician consensus. A p value of <0.05 was considered significant. Agreement with κ<0 was designated poor; 0-0.20, slight; 0.21-0.40, fair; 0.41-0.60, moderate; 0.61-0.80, substantial; and 0.81-1.00, almost perfect.
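
To make the agreement analysis concrete, here is a minimal sketch in Python of computing Cohen κ and applying the interpretation bands above. The per-case verdicts and their binary coding are hypothetical illustrations, and the snippet uses standard scikit-learn routines rather than the study's actual pipeline.

    # Minimal sketch: Cohen kappa between a physician-panel consensus and
    # one LLM, scored on hypothetical binary case verdicts.
    from sklearn.metrics import cohen_kappa_score

    physician = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # hypothetical consensus verdicts
    llm       = [1, 0, 0, 1, 0, 1, 1, 1, 1, 0]  # hypothetical LLM verdicts

    kappa = cohen_kappa_score(physician, llm)

    def interpret(k):
        # Bands as defined in this abstract.
        if k < 0:
            return "poor"
        if k <= 0.20:
            return "slight"
        if k <= 0.40:
            return "fair"
        if k <= 0.60:
            return "moderate"
        if k <= 0.80:
            return "substantial"
        return "almost perfect"

    print(f"kappa = {kappa:.2f} ({interpret(kappa)})")  # kappa = 0.58 (moderate)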

Results:

There was slight to fair agreement between the foundational LLMs and the physician consensus. With iterative improvement techniques, this agreement rose to substantial or better (Cohen κ of 0.73-0.82). The degree of improvement was statistically significant (p=0.006 for ChatGPT, p<0.001 for Gemini, and p<0.001 for LLaMA).
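
As a companion sketch, McNemar’s test operates on paired agree/disagree outcomes per case before and after improvement; only the discordant counts drive the p value. The 2x2 counts below are hypothetical, chosen merely to illustrate a significant shift, and the call is the standard statsmodels routine, not the authors' code.

    # Minimal sketch: McNemar's exact test on paired agreement indicators.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    # Rows: agreed with the physician consensus before improvement (yes/no);
    # columns: agreed after improvement (yes/no). Counts are hypothetical.
    table = np.array([[12,  1],
                      [14,  8]])

    result = mcnemar(table, exact=True)
    print(f"p = {result.pvalue:.3f}")  # p = 0.001 for these counts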

Conclusions:

While LLMs hold great potential for use in medicine, an abundance of caution is warranted when using foundational LLMs in domains such as medical ethics. With adequate human oversight in testing, and by applying established techniques, LLM responses can be better aligned with human responses, even in the domain of medical ethics. Clinical Trial: N/A


Citation

Please cite as:

Mugu V, Carr B, Khandelwal A, Olson M, Schupbach J, Zietlow J, Vu TD, Chan A, Collura C, Schmitz J

Human-Machine Agreement in Medical Ethics: Patient Autonomy Case-Based Evaluation of Large Language Models

JMIR Med Inform 2025;13:e77061

DOI: 10.2196/77061

PMID: 41056099

PMCID: 12592888


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.