Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Oct 24, 2024
Open Peer Review Period: Oct 31, 2024 - Dec 26, 2024
Date Accepted: Feb 25, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Large Language Model Based Assessment of Clinical Reasoning Documentation in the Electronic Health Record Across Two Institutions
ABSTRACT
Background:
Clinical reasoning (CR) is an essential skill, yet physicians receive limited feedback on their CR documentation. Artificial intelligence holds promise to fill this gap.
Objective:
We report the development of both a named entity recognition (NER), logic-based model and large language model (LLM)-based models for assessment of CR documentation in the electronic health record (EHR) across two institutions.
Methods:
Two note sets were retrieved from the EHR at each institution (NYU Grossman School of Medicine (NYU) and University of Cincinnati College of Medicine (UC)): 1) a retrospective dataset comprising internal medicine resident admission notes from July 2020 to December 2021 (n=700 NYU notes, n=450 UC notes) and 2) a prospective validation dataset from July 2023 to December 2023 (n=155 NYU notes, n=92 UC notes). Using the R-DEA tool, a validated human gold standard for assessment of CR documentation, clinicians rated notes for D (differential diagnosis) and EA (explanation of reasoning) quality, each on a 3-point scale (D0, D1, D2 and EA0, EA1, EA2). Models were trained on the retrospective datasets as follows: 1) NYU development of the NER, logic-based model with validation at UC; 2) NYU fine-tuning of the LLM NYUTron (a BERT-like (Bidirectional Encoder Representations from Transformers) LLM with about 110 million parameters, pre-trained on 7.25 million clinical notes); 3) NYU fine-tuning of the LLM GatorTron (an open-source LLM with 345 million parameters, pre-trained on over 82 billion words of de-identified clinical text); 4) UC fine-tuning of the NYU fine-tuned GatorTron; and 5) UC fine-tuning of GatorTron. The best performing models were validated on the prospective datasets, with performance assessed by F1 scores for the NER, logic-based model and by the area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) for the LLMs.
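As a brief illustration of the discrimination metric used here, AUROC can be computed directly from pairwise comparisons of model scores: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. This is a minimal sketch, not the study's code, and the labels and scores are hypothetical:

```python
# Illustrative sketch (not the authors' implementation): AUROC via the
# Mann-Whitney U formulation. A "positive" note might be one rated EA2
# by clinicians; the scores are a model's predicted probabilities.

def auroc(labels, scores):
    """Fraction of positive/negative pairs where the positive scores higher;
    ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical ratings for six notes (1 = positive class) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(auroc(labels, scores))  # 8 of 9 pairs ranked correctly, ~0.889
```

A model that ranks notes no better than chance yields an AUROC near 0.5, which is why values such as 0.85-0.89 in the Results indicate strong discrimination.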
Results:
At NYU, the NYUTron models performed best: the D0 and D2 models achieved an AUROC of 0.87 (AUPRC 0.79) and an AUROC of 0.89 (AUPRC 0.86), respectively. The D1 model did not perform well enough for implementation. The EA0 and EA1 models also performed inadequately, so the approach pivoted to a binary EA2 model (i.e., EA2 vs not EA2), which performed well with an AUROC of 0.85 and AUPRC of 0.80. At UC, the NER, logic-based model was the best performing D model, with F1 scores on the UC dataset of 0.80, 0.74, and 0.80 for D0, D1, and D2, respectively. The EA2 model produced by UC fine-tuning of the NYU fine-tuned GatorTron had an AUROC of 0.75 and AUPRC of 0.69.
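The per-class F1 scores reported for the NER, logic-based D model combine precision and recall for each label. The following sketch shows how such a score is computed; the D0/D1/D2 label sequences are hypothetical examples, not study data:

```python
# Illustrative sketch (not the study's code): per-class F1 score, the
# metric reported for the NER, logic-based D model (classes D0/D1/D2).

def f1_per_class(true, pred, cls):
    """Harmonic mean of precision and recall for one class label."""
    tp = sum(t == cls and p == cls for t, p in zip(true, pred))
    fp = sum(t != cls and p == cls for t, p in zip(true, pred))
    fn = sum(t == cls and p != cls for t, p in zip(true, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical clinician ratings vs. model predictions for five notes.
true = ["D0", "D0", "D1", "D2", "D2"]
pred = ["D0", "D1", "D1", "D2", "D0"]
print(f1_per_class(true, pred, "D0"))  # precision 0.5, recall 0.5 -> F1 0.5
```

Reporting F1 separately for D0, D1, and D2, as the study does, avoids letting a frequent class mask poor performance on a rare one.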
Conclusions:
This is, to our knowledge, the first study to demonstrate the use of LLMs for assessment of CR documentation quality in the EHR across two institutions. Lessons learned can help promote implementation of these technologies across institutions with varying technical resources and enhance feedback on the essential skill of CR.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.