Currently submitted to: JMIR AI
Date Submitted: May 29, 2026
Open Peer Review Period: Jun 5, 2026 - Jul 31, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
A Governed Machine Learning Methodology for Clinical Screening in Latin American Health Systems: Development and Retrospective Evaluation
ABSTRACT
Background:
Clinical screening model development in low- and middle-income country (LMIC) health systems requires more than a well-performing algorithm. It requires reproducible cohort logic, leakage control, calibration, human-reviewed deployment decisions, and complete, auditable documentation aligned with TRIPOD+AI reporting standards. To our knowledge, few published methodological descriptions exist of an AutoML pipeline aligned with TRIPOD+AI reporting principles designed specifically for tabular electronic health record (EHR) data in Latin American settings.
Objective:
To describe the architecture, workflow, governance mechanisms, and operational evidence of Hippocrates, a governed AutoML methodology for supervised tabular clinical screening model development in Colombian and Latin American health systems.
Methods:
Hippocrates organizes clinical screening model development into 13 phases covering data ingestion, mandatory leakage gates, cohort definition, feature engineering, model selection, isotonic calibration, threshold selection, subgroup assessment, and TRIPOD+AI-conformant documentation. A mandatory acceptance gate requires the calibration slope to fall within [0.85, 1.15] before a model becomes eligible for deployment. Eighteen human-in-the-loop pause-points interrupt automated execution at governance decisions that cannot be reduced to a metric, including target definition, leakage handling, threshold selection, and deployment scope. All governance decisions are recorded in an append-only audit log. The methodology is encoded as reusable Markdown skill files and executed by a large language model agent (Claude Code, Anthropic). Functional testing used six synthetic edge-case datasets representing common clinical ML failure modes: data leakage, extreme class imbalance, impossible targets, and informative missingness.
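As an illustration of the calibration slope gate, the following is a minimal sketch and not the Hippocrates implementation. It assumes the slope is estimated in the usual way, as the coefficient of a logistic regression of the observed outcome on the logit of the predicted probability. The function names `calibration_slope` and `passes_slope_gate` are hypothetical; only the [0.85, 1.15] bounds come from the text above.

```python
import numpy as np


def calibration_slope(y_true, p_pred, n_iter=50, eps=1e-12):
    """Estimate the calibration slope (hypothetical helper).

    Fits logit(P(y=1)) = a + b * logit(p_pred) by Newton-Raphson
    and returns b. A slope near 1 suggests predictions are neither
    too extreme (b < 1) nor too conservative (b > 1).
    """
    y = np.asarray(y_true, dtype=float)
    # Clip to avoid infinite logits at exactly 0 or 1.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    x = np.log(p / (1.0 - p))
    X = np.column_stack([np.ones_like(x), x])  # intercept + logit
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = mu * (1.0 - mu)
        grad = X.T @ (y - mu)
        hess = X.T @ (X * w[:, None])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return float(beta[1])


def passes_slope_gate(slope, low=0.85, high=1.15):
    """Return True when the slope lies inside the acceptance range."""
    return low <= slope <= high


if __name__ == "__main__":
    # Synthetic data only: outcomes drawn from known probabilities.
    rng = np.random.default_rng(0)
    z = rng.normal(size=20000)
    p_calibrated = 1.0 / (1.0 + np.exp(-z))
    y = rng.binomial(1, p_calibrated)
    p_overconfident = 1.0 / (1.0 + np.exp(-2.0 * z))  # logits doubled
    for name, p in [("calibrated", p_calibrated),
                    ("overconfident", p_overconfident)]:
        s = calibration_slope(y, p)
        print(name, round(s, 3), "pass" if passes_slope_gate(s) else "fail")
```

In this synthetic example, the overconfident predictions have their logits doubled, so their slope should fall near 0.5 and fail the gate, while the calibrated predictions should pass.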
Results:
Applied across five real-world sessions spanning CKD screening, COPD screening, and workforce retention, the methodology produced fully documented, calibrated model artifacts with complete governance trails. In a retrospective evaluation across four health institutions (combined n>53,000), the CKD model developed under the current methodology showed consistent improvement over a previously deployed model that had near-chance discrimination (AUROC approximately 0.58) and a calibration slope of 0.04. The methodology identified non-obvious feature representations through metric-driven optimization, correcting miscalibration in a COPD screening case; the proposed feature-encoding changes were reviewed and approved within the human-in-the-loop governance workflow. It also detected temporal leakage in a workforce retention session that would have produced a misleadingly high-performing artifact under standard cross-validation. Across five random seeds on Health System Dataset A (CKD in type 2 diabetes), AUROC was 0.7479 ± 0.0013; inter-operator variability (~0.029 AUROC) was the dominant source of variation.
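To illustrate why temporal leakage can go unnoticed under standard cross-validation, the sketch below contrasts a time-ordered split with an interleaved one. It is an assumption-laden example and not the detection logic used by Hippocrates: the helpers `temporal_split` and `has_temporal_overlap`, the `snapshot_date` field, and the toy records are all hypothetical.

```python
from datetime import date


def temporal_split(records, time_key, cutoff):
    """Train on records dated strictly before cutoff; validate on the rest."""
    train = [r for r in records if r[time_key] < cutoff]
    valid = [r for r in records if r[time_key] >= cutoff]
    return train, valid


def has_temporal_overlap(train, valid, time_key):
    """True if any training record is dated at or after the earliest
    validation record, i.e. the model could see the 'future'."""
    if not train or not valid:
        return False
    latest_train = max(r[time_key] for r in train)
    earliest_valid = min(r[time_key] for r in valid)
    return latest_train >= earliest_valid


if __name__ == "__main__":
    # Toy records (hypothetical): one snapshot per month in 2024.
    records = [{"id": i, "snapshot_date": date(2024, m, 1)}
               for i, m in enumerate(range(1, 13))]

    # An interleaved split, as a shuffled k-fold might produce.
    fold_train = records[0::2]
    fold_valid = records[1::2]
    print("interleaved overlap:",
          has_temporal_overlap(fold_train, fold_valid, "snapshot_date"))

    # A time-ordered split keeps all training data before validation data.
    train, valid = temporal_split(records, "snapshot_date", date(2024, 10, 1))
    print("temporal overlap:",
          has_temporal_overlap(train, valid, "snapshot_date"))
```

A check of this kind only covers row-level ordering; features computed from post-outcome information would need a separate review.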
Conclusions:
Few published methodological descriptions exist of a governed AutoML pipeline aligned with TRIPOD+AI reporting principles for clinical screening in Latin America. The framework’s value lies in its governance layer: mandatory leakage gates, calibration enforcement, human pause-points, and audit trails, rather than in any single model architecture. Prospective studies are needed to establish reproducibility under controlled conditions, clinical utility, and implementation outcomes.
Clinical Trial: N/A
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.