JMIR Preprints #71176: Language Models for Multi-label Document Classification of Surgical Concepts in Exploratory Laparotomy Operative Notes: Algorithm Development Study


Language Models for Multi-label Document Classification of Surgical Concepts in Exploratory Laparotomy Operative Notes: Algorithm Development Study

Jeremy A Balch;
Sasank S Desarju;
Victoria J Nolan;
Divya Vallanki;
Timothy R Buchanan;
Lindsey M Brinkley;
Yordan Penev;
Ahmet Bilgi;
Aashay Patel;
Corinne E Chatham;
David M Vanderbilt;
Rayon Uddin;
Azra Bihorac;
Philip Efron;
Tyler J Loftus;
Protiva Rahman;
Benjamin Shickel

ABSTRACT

Background:

Operative notes are mined for surgical concepts in patient care, research, performance improvement, and billing workflows, a process that can be framed as a multi-label document classification task.

Objective:

We developed and evaluated large language models (LLMs) to expedite data extraction from surgical notes.

Methods:

A total of 388 exploratory laparotomy notes from a single institution were annotated for 21 concepts related to intraoperative findings, intraoperative techniques, and closure techniques. Annotation consistency was measured using the Cohen’s kappa statistic. We compared conventional natural language processing (NLP) approaches, namely bag-of-words (BoW) and term frequency-inverse document frequency (tf-idf) representations with linear classifiers, against encoder-only (Clinical-Longformer, CL) and decoder-only (Llama 3.1 70B) LLMs. Multi-label classification performance was evaluated using 5-fold cross-validation with the F1 score and Hamming loss (HL). LLM prompting strategies were modified based on error analysis.
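For readers unfamiliar with the evaluation metrics named above, the following is a minimal sketch of how Cohen's kappa, micro-averaged F1, and Hamming loss can be computed for binary multi-label data. It is not the authors' code: the function names and the toy labels are hypothetical, and the study itself covered 21 concepts across 388 notes with 5-fold cross-validation.

```python
from typing import List, Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two annotators' binary labels on the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal rate of positives.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:  # both annotators gave a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Micro-averaged F1: pool TP, FP, FN over every note-label cell."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def hamming_loss(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of note-label cells predicted incorrectly (lower is better)."""
    wrong = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            wrong += int(t != p)
            total += 1
    return wrong / total


# Toy data (hypothetical): 4 notes, 3 concepts; rows are notes, columns are labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

print(round(micro_f1(y_true, y_pred), 3))      # 4 TP, 1 FP, 2 FN -> 8/11
print(round(hamming_loss(y_true, y_pred), 3))  # 3 wrong cells out of 12
print(round(cohen_kappa([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1]), 3))
```

Because micro-averaging pools all note-label cells, frequent labels dominate the score, while Hamming loss counts every incorrect cell equally; the two metrics can therefore rank models differently.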

Results:

Prevalence of labels ranged from 0.05 (colostomy, ileostomy, active bleed from named vessel) to 0.50 (running fascial closure). Llama 3.1 70B was the overall best-performing model (micro-F1 0.86 [5-fold range: 0.85, 0.87], HL 0.14 [0.13, 0.15]). The BoW model (micro-F1 0.68 [0.64, 0.71], HL 0.14 [0.13, 0.16]) and Clinical-Longformer (micro-F1 0.73 [0.70, 0.74], HL 0.11 [0.10, 0.12]) had overall similar performance, with tf-idf models trailing (micro-F1 0.57 [0.55, 0.59], HL 0.27 [0.25, 0.29]). F1 scores varied across concepts in the Llama model, ranging from 0.21 [0.11, 0.30] for partial skin closure to 0.92 [0.88, 0.96] for bowel resection. Error analysis demonstrated semantic nuances and edge cases within operative notes.

Conclusions:

Off-the-shelf autoregressive LLMs outperformed fine-tuned, encoder-only transformers and traditional NLP techniques in classifying operative notes.

Clinical Trial: n/a

Citation

Please cite as:

Balch JA, Desarju SS, Nolan VJ, Vallanki D, Buchanan TR, Brinkley LM, Penev Y, Bilgi A, Patel A, Chatham CE, Vanderbilt DM, Uddin R, Bihorac A, Efron P, Loftus TJ, Rahman P, Shickel B

Language Models for Multilabel Document Classification of Surgical Concepts in Exploratory Laparotomy Operative Notes: Algorithm Development Study

JMIR Med Inform 2025;13:e71176

DOI: 10.2196/71176

PMID: 40632815

PMCID: 12266303


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jan 11, 2025

Open Peer Review Period: Jan 11, 2025 - Mar 8, 2025

Date Accepted: May 15, 2025
