JMIR Preprints #102826: A Governed Machine Learning Methodology for Clinical Screening in Latin American Health Systems: Development and Retrospective Evaluation


A Governed Machine Learning Methodology for Clinical Screening in Latin American Health Systems: Development and Retrospective Evaluation

Katherine Monsalve¹*;
Laura V. Bellon-Padilla;
Jose Zea;
Natalia Castaño-Villegas;
Laura Velásquez

ABSTRACT

Background:

Clinical screening model development in low- and middle-income country (LMIC) health systems requires more than a well-performing algorithm. It requires reproducible cohort logic, leakage control, calibration, human-reviewed deployment decisions, and complete, auditable documentation aligned with TRIPOD+AI reporting standards. To our knowledge, few published methodological descriptions exist of an AutoML pipeline aligned with TRIPOD+AI reporting principles designed specifically for tabular electronic health record (EHR) data in Latin American settings.

Objective:

To describe the architecture, workflow, governance mechanisms, and operational evidence of Hippocrates, a governed AutoML methodology for supervised tabular clinical screening model development in Colombian and Latin American health systems.

Methods:

Hippocrates organizes clinical screening model development into 13 phases covering data ingestion, mandatory leakage gates, cohort definition, feature engineering, model selection, isotonic calibration, threshold selection, subgroup assessment, and TRIPOD+AI-conformant documentation. A mandatory acceptance gate requires the calibration slope to fall within [0.85, 1.15] before a model becomes eligible for deployment. Eighteen human-in-the-loop pause-points interrupt automated execution at governance decisions that cannot be reduced to a metric, including target definition, leakage handling, threshold selection, and deployment scope. All governance decisions are recorded in an append-only audit log. The methodology is encoded as reusable Markdown skill files and executed by a large language model agent (Claude Code, Anthropic). Functional testing used six synthetic edge-case datasets representing common clinical ML failure modes: data leakage, extreme class imbalance, impossible targets, and informative missingness.
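
As an illustration of the calibration slope gate described above, the following is a minimal sketch and not the Hippocrates implementation. It assumes the common definition of calibration slope: the coefficient b obtained by regressing the observed outcome on the logit of the predicted probability, logit(P(y=1)) = a + b * logit(p). The function names `calibration_slope` and `passes_slope_gate` and the toy data are hypothetical; only the [0.85, 1.15] interval comes from the text.

```python
import math


def logit(p, eps=1e-12):
    """Log-odds of p, clamped away from 0 and 1 for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def calibration_slope(y_true, y_prob, n_iter=50, tol=1e-10):
    """Fit logit(P(y=1)) = a + b * logit(y_prob) by Newton-Raphson.

    Returns (a, b): calibration intercept and calibration slope.
    """
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0  # start from the "perfectly calibrated" solution
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu              # gradient w.r.t. intercept
            g1 += (yi - mu) * xi       # gradient w.r.t. slope
            h00 += w                   # entries of X^T W X
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ValueError("singular information matrix; cannot fit slope")
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


def passes_slope_gate(slope, low=0.85, high=1.15):
    """True if the slope lies inside the acceptance interval from the text."""
    return low <= slope <= high


# Toy data (hypothetical): 10 patients per predicted-risk level k/10,
# of whom exactly k have the outcome, so the predictions are well calibrated.
y, p_calibrated = [], []
for k in range(1, 10):
    y += [1] * k + [0] * (10 - k)
    p_calibrated += [k / 10] * 10

# An overconfident model: same ranking, but log-odds doubled.
p_overconfident = [1.0 / (1.0 + math.exp(-2.0 * logit(q))) for q in p_calibrated]

for name, probs in [("calibrated", p_calibrated), ("overconfident", p_overconfident)]:
    _, slope = calibration_slope(y, probs)
    print(name, round(slope, 3), "PASS" if passes_slope_gate(slope) else "FAIL")
```

In this toy setup the overconfident predictions have a slope well below 0.85 and fail the gate, even though their ranking of patients (and therefore AUROC) is identical to the calibrated predictions. That is the kind of defect a discrimination metric alone would not catch.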

Results:

Applied across five real-world sessions spanning CKD screening, COPD screening, and workforce retention, the methodology produced fully documented, calibrated model artifacts with complete governance trails. In a retrospective evaluation across four health institutions (combined n > 53,000), the CKD model developed under the current methodology showed consistent improvement over a previously deployed model that had near-chance discrimination (AUROC approximately 0.58) and a calibration slope of 0.04. The methodology identified non-obvious feature representations through metric-driven optimization, correcting miscalibration in a COPD screening case; the proposed feature-encoding changes were reviewed and approved within the human-in-the-loop governance workflow. It also detected temporal leakage in a workforce retention session that would have produced a misleadingly high-performing artifact under standard cross-validation. Across five random seeds on Health System Dataset A (CKD in type 2 diabetes), AUROC was 0.7479 ± 0.0013; inter-operator variability (~0.029 AUROC) was the dominant source of variability.
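
The temporal-leakage finding can be illustrated with a minimal sketch of a time-ordered split and a check that rejects splits in which training records postdate evaluation records. This is an assumption about how such a gate might look, not the authors' implementation; the function names, the `index_date` field, and the toy records are all hypothetical.

```python
from datetime import date


def temporal_split(records, cutoff, time_key="index_date"):
    """Train on records strictly before `cutoff`; evaluate on the rest."""
    train = [r for r in records if r[time_key] < cutoff]
    test = [r for r in records if r[time_key] >= cutoff]
    return train, test


def check_no_temporal_overlap(train, test, time_key="index_date"):
    """Raise ValueError if any training record is dated on or after
    the earliest evaluation record."""
    if not train or not test:
        return
    latest_train = max(r[time_key] for r in train)
    earliest_test = min(r[time_key] for r in test)
    if latest_train >= earliest_test:
        raise ValueError(
            f"temporal leakage: training data extends to {latest_train}, "
            f"evaluation data starts at {earliest_test}"
        )


# Toy records (hypothetical): one record per month, January to October 2021.
records = [{"id": i, "index_date": date(2021, 1 + i, 1)} for i in range(10)]

# A time-ordered split passes the check.
train, test = temporal_split(records, cutoff=date(2021, 8, 1))
check_no_temporal_overlap(train, test)

# An interleaved split, as shuffled cross-validation folds would produce
# on time-dependent data, is rejected.
try:
    check_no_temporal_overlap(records[::2], records[1::2])
except ValueError as err:
    print("rejected:", err)
```

Standard shuffled cross-validation mixes past and future records in every fold, so a check of this kind fails on each fold. An overly optimistic estimate produced this way can therefore be flagged before it reaches a deployment decision.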

Conclusions:

Few published methodological descriptions exist of a governed AutoML pipeline aligned with TRIPOD+AI reporting principles for clinical screening in Latin America. The framework’s value lies in its governance layer (mandatory leakage gates, calibration enforcement, human pause-points, and audit trails) rather than in any single model architecture. Prospective studies are needed to establish reproducibility under controlled conditions, clinical utility, and implementation outcomes.

Clinical Trial:

N/A

Citation

Please cite as:

Monsalve K, Bellon-Padilla LV, Zea J, Castaño-Villegas N, Velásquez L

A Governed Machine Learning Methodology for Clinical Screening in Latin American Health Systems: Development and Retrospective Evaluation

JMIR Preprints. 29/05/2026:102826

DOI: 10.2196/preprints.102826

URL: https://preprints.jmir.org/preprint/102826


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Currently submitted to: JMIR AI

Date Submitted: May 29, 2026

Open Peer Review Period: Jun 5, 2026 - Jul 31, 2026

(currently open for review)
