Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jan 11, 2025
Open Peer Review Period: Jan 11, 2025 - Mar 8, 2025
Date Accepted: May 15, 2025
Language Models for Multi-label Document Classification of Surgical Concepts in Exploratory Laparotomy Operative Notes: Algorithm Development Study
ABSTRACT
Background:
Operative notes are mined for surgical concepts in patient care, research, performance improvement, and billing workflows, a process that can be framed as a multi-label document classification task.
Objective:
We developed and evaluated large language models (LLMs) to expedite data extraction from surgical notes.
Methods:
A total of 388 exploratory laparotomy notes from a single institution were annotated for 21 concepts related to intraoperative findings, intraoperative techniques, and closure techniques. Annotation consistency was measured using Cohen’s kappa statistic. We compared conventional natural language processing (NLP) approaches, namely bag-of-words (BoW) and term frequency-inverse document frequency (tf-idf) representations with linear classifiers, against encoder-only (Clinical-Longformer, CL) and decoder-only (Llama 3.1 70B) LLMs. Multi-label classification performance was evaluated using 5-fold cross-validation, with F1 score and Hamming loss (HL) as metrics. LLM prompting strategies were modified based on error analysis.
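As an illustration of the two evaluation metrics named above, the following is a minimal sketch of how micro-averaged F1 and Hamming loss can be computed for a multi-label problem. It is not the authors' code: the function names and the toy label matrices are hypothetical, and the study may well have used a library implementation. Each row represents one note and each column one binary concept label.

```python
from typing import Sequence

Matrix = Sequence[Sequence[int]]  # rows = notes, columns = binary concept labels


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Micro-averaged F1: pool TP/FP/FN over every (note, label) cell."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of (note, label) cells where prediction and annotation differ."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total if total else 0.0


# Hypothetical toy data: 3 notes, 3 concept labels (not from the study).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

print(round(micro_f1(y_true, y_pred), 3))      # 4 TP, 1 FP, 1 FN
print(round(hamming_loss(y_true, y_pred), 3))  # 2 of 9 cells differ
```

Because Hamming loss counts correct negatives as agreement while micro-F1 ignores them, the two metrics can rank models differently when most labels are rare.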
Results:
Prevalence of labels ranged from 0.05 (colostomy, ileostomy, active bleed from named vessel) to 0.50 (running fascial closure). Llama 3.1 70B was the overall best-performing model (micro-F1 0.86 [5-fold range: 0.85, 0.87], HL 0.14 [0.13, 0.15]). The BoW model (micro-F1 0.68 [0.64, 0.71], HL 0.14 [0.13, 0.16]) and Clinical-Longformer (micro-F1 0.73 [0.70, 0.74], HL 0.11 [0.10, 0.12]) had overall similar performance, with tf-idf models trailing (micro-F1 0.57 [0.55, 0.59], HL 0.27 [0.25, 0.29]). F1 scores varied across concepts in the Llama model, ranging from 0.21 [0.11, 0.30] for partial skin closure to 0.92 [0.88, 0.96] for bowel resection. Error analysis demonstrated semantic nuances and edge cases within operative notes.
Conclusions:
Off-the-shelf autoregressive LLMs outperformed fine-tuned, encoder-only transformers and traditional NLP techniques in classifying operative notes.
Clinical Trial: n/a
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.