Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Oct 24, 2024
Open Peer Review Period: Oct 31, 2024 - Dec 26, 2024
Date Accepted: Feb 25, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Large Language Model Based Assessment of Clinical Reasoning Documentation in the Electronic Health Record Across Two Institutions
ABSTRACT
Background:
Clinical reasoning (CR) is an essential skill, yet physicians receive limited feedback on their CR documentation. Artificial intelligence holds promise to fill this gap.
Objective:
We report the development of both a named entity recognition (NER), logic-based model and large language model (LLM)-based models for assessment of CR documentation in the electronic health record (EHR) across two institutions.
Methods:
Two note sets were retrieved from the EHR at each institution (NYU Grossman School of Medicine (NYU) and University of Cincinnati College of Medicine (UC)): 1) a retrospective dataset comprising internal medicine resident admission notes from July 2020 to December 2021 (n=700 NYU notes, n=450 UC notes) and 2) a prospective validation dataset from July 2023 to December 2023 (n=155 NYU notes, n=92 UC notes). Using the R-DEA tool, a validated human gold standard for assessment of CR documentation, clinicians rated notes for D (differential diagnosis) and EA (explanation of reasoning) quality, each on a 3-point scale (D0, D1, D2 and EA0, EA1, EA2). Models were trained on the retrospective datasets as follows: 1) NYU development of the NER, logic-based model with validation at UC; 2) NYU fine-tuning of the LLM NYUTron (a BERT-like (Bidirectional Encoder Representations from Transformers) LLM with about 110 million parameters, pre-trained on 7.25 million clinical notes); 3) NYU fine-tuning of the LLM GatorTron (an open-source LLM with 345 million parameters, pre-trained on over 82 billion words of de-identified clinical text); 4) UC fine-tuning of the NYU fine-tuned GatorTron; and 5) UC fine-tuning of GatorTron. The best performing models were validated on the prospective datasets, with performance assessed by F1 scores for the NER, logic-based model and by the area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) for the LLMs.
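As a brief illustration of the discrimination metric used here, AUROC can be computed directly from pairwise comparisons of model scores: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. This is a minimal sketch, not the study's code, and the labels and scores are hypothetical:

```python
# Illustrative sketch (not the authors' implementation): AUROC via the
# Mann-Whitney U formulation. A "positive" note might be one rated EA2
# by clinicians; the scores are a model's predicted probabilities.

def auroc(labels, scores):
    """Fraction of positive/negative pairs where the positive scores higher;
    ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical ratings for six notes (1 = positive class) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(auroc(labels, scores))  # 8 of 9 pairs ranked correctly, ~0.889
```

A model that ranks notes no better than chance yields an AUROC near 0.5, which is why values such as 0.85-0.89 in the Results indicate strong discrimination.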
Results:
At NYU, the NYUTron models performed best: the D0 and D2 models achieved an AUROC of 0.87 (AUPRC 0.79) and an AUROC of 0.89 (AUPRC 0.86), respectively. The D1 model did not perform well enough for implementation. The EA0 and EA1 models also performed inadequately, so the approach pivoted to a binary EA2 model (i.e., EA2 vs not EA2), which performed well with an AUROC of 0.85 and AUPRC of 0.80. At UC, the NER, logic-based model was the best performing D model, with F1 scores on the UC dataset of 0.80, 0.74, and 0.80 for D0, D1, and D2, respectively. The EA2 model produced by UC fine-tuning of the NYU fine-tuned GatorTron had an AUROC of 0.75 and AUPRC of 0.69.
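The per-class F1 scores reported for the NER, logic-based D model combine precision and recall for each label. The following sketch shows how such a score is computed; the D0/D1/D2 label sequences are hypothetical examples, not study data:

```python
# Illustrative sketch (not the study's code): per-class F1 score, the
# metric reported for the NER, logic-based D model (classes D0/D1/D2).

def f1_per_class(true, pred, cls):
    """Harmonic mean of precision and recall for one class label."""
    tp = sum(t == cls and p == cls for t, p in zip(true, pred))
    fp = sum(t != cls and p == cls for t, p in zip(true, pred))
    fn = sum(t == cls and p != cls for t, p in zip(true, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical clinician ratings vs. model predictions for five notes.
true = ["D0", "D0", "D1", "D2", "D2"]
pred = ["D0", "D1", "D1", "D2", "D0"]
print(f1_per_class(true, pred, "D0"))  # precision 0.5, recall 0.5 -> F1 0.5
```

Reporting F1 separately for D0, D1, and D2, as the study does, avoids letting a frequent class mask poor performance on a rare one.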
Conclusions:
This is, to our knowledge, the first study to demonstrate the use of LLMs for assessment of CR documentation quality in the EHR across two institutions. Lessons learned can help promote implementation of these technologies across institutions with varying technical resources and enhance feedback on the essential skill of CR.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.