Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
ASA Physical Status Classification: A Hybrid Machine Learning-Large Language Model Ensemble Retrospective Validation Study
ABSTRACT
Background:
The American Society of Anesthesiologists Physical Status (ASA-PS) Classification System fundamentally shapes perioperative care delivery but suffers from poor inter-rater reliability (0.4-0.6). Machine learning (ML) models process structured data consistently but lack clinical reasoning, while large language models (LLMs) provide explanations but may miss subtle patterns in structured data.
Objective:
This study aimed to develop and evaluate a parallel ML-LLM ensemble that combines the complementary strengths of both approaches for automated ASA-PS classification.
Methods:
We retrospectively analyzed 2,500 adult surgical encounters from the University of Arkansas for Medical Sciences (UAMS) between August 2024 and May 2025. Cases were randomly allocated to training (n=2,000) and test sets (n=500). We developed multiple architectures, including traditional ML models (Extreme Gradient Boosting [XGBoost], Light Gradient Boosting Machine [LightGBM], and ExtraTrees), an LLM-only baseline (Generative Pre-trained Transformer-4o [GPT-4o]), and hybrid approaches. The parallel ensemble processed structured data through XGBoost and unstructured clinical notes through GPT-4o independently, with outputs combined via weighted averaging. Model performance was evaluated using macro-F1 score, exact match accuracy, and within-one-class accuracy.
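The weighted-averaging combination of the two components can be sketched as follows. This is a minimal illustration, not the study's implementation: the abstract specifies only "weighted averaging" with α=0.30, so the exact blending rule, the renormalization step, and the four-class example are assumptions.

```python
def ensemble_probs(ml_probs, llm_probs, alpha=0.30):
    """Blend per-class probabilities from the two parallel components.

    alpha weights the ML (XGBoost) output on structured data;
    (1 - alpha) weights the LLM (GPT-4o) output on clinical notes.
    alpha=0.30 matches the ensemble reported in the study; the
    specific combination rule here is an illustrative assumption.
    """
    blended = [alpha * m + (1 - alpha) * l
               for m, l in zip(ml_probs, llm_probs)]
    total = sum(blended)
    return [b / total for b in blended]  # renormalize to sum to 1


# Hypothetical encounter, four ASA-PS classes (I-IV) for brevity
ml = [0.10, 0.55, 0.30, 0.05]   # XGBoost on structured features
llm = [0.05, 0.25, 0.60, 0.10]  # GPT-4o on unstructured notes
probs = ensemble_probs(ml, llm, alpha=0.30)
pred_class = probs.index(max(probs)) + 1  # ASA-PS class as 1-based index
```

With these made-up probabilities the ML component favors class II while the LLM favors class III; at α=0.30 the LLM dominates and the blended prediction is class III.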
Results:
A theoretical performance ceiling of macro-F1=0.59 was determined through evaluation by an expert panel of three board-certified anesthesiologists who independently rated 50 patient charts for ASA-PS scores. The parallel ensemble (α=0.30) achieved the highest macro-F1 score of 0.58, with 67% exact match accuracy and 98.4% within-one-class accuracy. This outperformed traditional ML models (XGBoost: F1=0.34), the LLM-only baseline (F1=0.64 but with potential overfitting), and sequential hybrid approaches (F1=0.41-0.46). The LLM component generated explanations detailing comorbidities, severity descriptors, and functional status indicators.
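The three reported metrics can be computed as below. This is a standard-formula sketch, not the study's evaluation code; the label encoding (ASA-PS classes as integers 1-4) and the sample data are hypothetical.

```python
def asa_metrics(y_true, y_pred):
    """Macro-F1, exact match accuracy, and within-one-class accuracy.

    Within-one-class accuracy counts predictions at most one ASA-PS
    class away from the reference label, the safety-oriented metric
    reported as 98.4% for the parallel ensemble.
    """
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)  # per-class F1
    n = len(y_true)
    return {
        "macro_f1": sum(f1s) / len(f1s),  # unweighted mean over classes
        "exact_match": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        "within_one": sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / n,
    }


# Hypothetical labels, ASA-PS I-IV encoded as 1-4
truth = [1, 2, 3, 2, 4, 3]
preds = [1, 2, 2, 2, 3, 3]
metrics = asa_metrics(truth, preds)
```

Macro-F1 averages per-class F1 scores without weighting by class frequency, which penalizes poor performance on rare classes (such as ASA-PS IV) that exact match accuracy alone would mask.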
Conclusions:
The parallel ML-LLM ensemble achieved performance approaching the theoretical ceiling established by human inter-rater reliability while providing interpretable clinical explanations. The 98.4% within-one-class accuracy ensures operational safety by minimizing extreme misclassifications. This approach demonstrates how complementary AI architectures can enhance perioperative risk assessment, particularly valuable given current healthcare workforce shortages. Clinical Trial: N/A
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.