Currently submitted to: Journal of Medical Internet Research
Date Submitted: Mar 25, 2026
Open Peer Review Period: Mar 25, 2026 - May 20, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Standardized Performance Assessment Methodology and End-to-End Framework for Tactical Combat Casualty Care Autonomous Documentation Algorithms
ABSTRACT
Background:
The Military Healthcare System (MHS) mandates medical documentation at all echelons of care; however, care providers in high-intensity combat situations must prioritize lifesaving measures over record-keeping, leading to information gaps across the care continuum. Effective human-machine teaming (HMT) solutions designed to autonomously document care delivery will serve as future force multipliers in tactical combat casualty care (TCCC) environments.
Objective:
To address this challenge, the United States Army Institute of Surgical Research (USAISR) commenced an effort to prototype HMT systems designed to passively document care delivery within the TCCC environment. However, common artificial intelligence (AI) performance evaluation methods do not adequately represent the temporal, repetitive, and context-dependent nature of real-world TCCC delivery. It is therefore essential to conduct comprehensive assessments to verify that AI tools function in a timely, synchronized manner within operational workflows. During the initial prototyping phase, five algorithm developers were provided with annotated datasets from 75 TCCC simulations and given six months to develop their algorithms.
Methods:
To assess the algorithms, the research team performed evaluations on a reserved dataset. In the first phase of the assessment, a standardized, repeatable performance methodology and framework was used to evaluate individual algorithms that detect (1) injury location on a casualty, (2) medical objects visible in the scene, and (3) treatments administered by the care provider. Detection effectiveness was quantified with four metrics: modified accuracy, precision, recall, and F1 score. Algorithm processing efficiency was also evaluated by calculating lag-time scores. A final composite score was used to quantify performance differences among the algorithms within a given detection category. The second phase of the evaluation integrated multiple algorithms into a centralized orchestration framework to enable synchronized execution and consolidated outputs. System-level resource usage and throughput metrics were evaluated to characterize computational efficiency: memory consumption and central processing unit (CPU) and graphics processing unit (GPU) utilization were quantified, followed by benchmarking of the edge-compute orchestration framework.
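For illustration only, the sketch below shows one way the detection-effectiveness and composite metrics described above could be computed from per-simulation detection counts. This is not the authors' scoring implementation; the lag-time penalty, the 30-second cap, and the composite weights are hypothetical assumptions, since the abstract does not specify them.

from dataclasses import dataclass

@dataclass
class DetectionCounts:
    tp: int  # true positives: detections matching annotated ground truth
    fp: int  # false positives: detections with no matching ground truth
    fn: int  # false negatives: annotated events the algorithm missed

def precision(c: DetectionCounts) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0

def recall(c: DetectionCounts) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0

def f1(c: DetectionCounts) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def lag_time_score(detection_time_s: float, ground_truth_time_s: float,
                   max_lag_s: float = 30.0) -> float:
    # Map detection latency to a 0-1 score; the linear penalty and the
    # 30-second cap are illustrative assumptions only.
    lag = max(0.0, detection_time_s - ground_truth_time_s)
    return max(0.0, 1.0 - lag / max_lag_s)

def composite_score(c: DetectionCounts, lag_score: float,
                    w_eff: float = 0.7, w_lag: float = 0.3) -> float:
    # Hypothetical weighted combination of effectiveness (F1) and lag-time scores.
    return w_eff * f1(c) + w_lag * lag_score

# Example for a single simulation (placeholder counts, not study data):
counts = DetectionCounts(tp=14, fp=9, fn=11)
print(f"precision={precision(counts):.2f} recall={recall(counts):.2f} f1={f1(counts):.2f}")
print(f"composite={composite_score(counts, lag_time_score(12.0, 5.0)):.2f}")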
Results:
Results are presented for a representative algorithm in each category, with an aggregation sketch following this paragraph. Medical Object Detection achieved the highest performance (mean F1≈0.42, range 0–0.71). Injury Detection and Localization showed lower performance (mean F1≈0.27, range 0–0.60), with higher recall than precision. Medical Procedure Detection yielded procedure-level mean F1 scores from 0.00 to 0.31 and simulation-level means from 0.00 to 0.33, with stronger results for the Nasopharyngeal Airway (NPA) and Chest Seal Application procedures (mean F1≈0.28 and 0.31, respectively). The results to date are preliminary and serve as illustrative examples of the evaluation framework's outputs.
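As a minimal illustration of how the per-simulation scores are summarized into the mean and range statistics reported above (the values below are placeholders, not study data):

from statistics import mean

# Placeholder per-simulation F1 scores for one detection category (not study data)
per_simulation_f1 = [0.00, 0.12, 0.35, 0.55, 0.42]

print(f"mean F1 = {mean(per_simulation_f1):.2f}, "
      f"range = {min(per_simulation_f1):.2f}-{max(per_simulation_f1):.2f}")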
Conclusions:
The preliminary results illustrate the evaluation framework's ability to produce standardized, end-to-end assessments across core algorithm functions. While algorithm performance to date is modest, the framework captures both variability and recurring patterns across simulations, thereby highlighting strengths, limitations, and areas requiring refinement. It enables reproducible, cross-dataset comparisons, allowing evaluators to quantify algorithm performance. By combining simulation-level evaluations with detection-specific performance aggregated across simulations, the framework supports targeted identification of underperforming areas and iterative, strategic AI model refinement.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.