Accepted for/Published in: JMIR Bioinformatics and Biotechnology
Date Submitted: Jul 17, 2025
Open Peer Review Period: Jul 17, 2025 - Sep 11, 2025
Date Accepted: Sep 13, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Paired-Sample and Pathway-Anchored MLOps Framework for Robust Transcriptomic Machine Learning in Small Cohorts
ABSTRACT
Background:
Ninety percent of the 65,000 human diseases are infrequent, collectively affecting ~400 million people, and this low prevalence substantially limits cohort accrual. It also constrains the development of robust transcriptome-based machine learning (ML) classifiers. Standard data-driven classifiers typically require cohorts of over 100 subjects per group to achieve clinical accuracy while managing high-dimensional input (~25,000 transcripts). These requirements are infeasible for micro-cohorts of ~20 individuals, where overfitting becomes pervasive.
Objective:
To overcome these constraints, we developed a classification method that integrates three enabling strategies: (i) paired-sample transcriptome dynamics, (ii) N-of-1 pathway-based analytics, and (iii) reproducible machine learning operations (MLOps) for continuous model refinement.
Methods:
Unlike ML approaches relying on a single transcriptome per subject, within-subject paired-sample designs (such as pre- versus post-treatment or diseased versus adjacent-normal tissue) effectively control intra-individual variability under isogenic conditions and within-subject environmental exposures (e.g., smoking history or other medications), improve signal-to-noise ratios, and, when pre-processed as single-subject studies (N-of-1), can achieve statistical power comparable to that obtained in animal models. Pathway-level N-of-1 analytics further reduces each sample’s high-dimensional profile to ~4,000 biologically interpretable features, annotated with effect sizes, dispersion, and significance. Complementary MLOps practices (automated versioning, continuous monitoring, and adaptive hyperparameter tuning) improve model reproducibility and generalization.
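As an illustration of the paired-sample, pathway-level reduction described above, the following is a minimal sketch and not the authors' implementation. It assumes each subject contributes two expression profiles and summarizes every gene set by the mean log2 fold change (effect size), its standard deviation (dispersion), and an exact two-sided sign-test p-value (significance). The function names, the pseudocount, the minimum gene-set size, and the choice of a sign test are assumptions made for illustration; the N-of-1-pathways analytics referenced in this abstract may rely on different statistics.

```python
import math
from statistics import mean, stdev


def sign_test_p(diffs):
    """Exact two-sided sign test on paired differences (zeros are dropped)."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    k = sum(d > 0 for d in nonzero)  # number of positive differences
    # P(X <= min(k, n - k)) under Binomial(n, 0.5), doubled for two sides
    tail = sum(math.comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def pathway_features(sample_a, sample_b, gene_sets, pseudocount=1.0):
    """Summarize one subject's paired samples at the gene-set level.

    sample_a, sample_b: dict mapping gene -> expression
        (hypothetical layout, e.g. baseline vs. perturbed tissue).
    gene_sets: dict mapping pathway name -> list of gene symbols.
    Returns a dict mapping pathway name -> effect size, dispersion, p-value.
    """
    features = {}
    for name, genes in gene_sets.items():
        shared = [g for g in genes if g in sample_a and g in sample_b]
        if len(shared) < 2:  # too few genes to estimate dispersion
            continue
        diffs = [
            math.log2((sample_b[g] + pseudocount) / (sample_a[g] + pseudocount))
            for g in shared
        ]
        features[name] = {
            "effect_size": mean(diffs),
            "dispersion": stdev(diffs),
            "p_value": sign_test_p(diffs),
        }
    return features
```

Calling `pathway_features` once per subject yields one fixed-length feature record per individual, so the classifier sees a few thousand pathway summaries rather than ~25,000 transcript values.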
Results:
We evaluated two case studies: human rhinovirus (HRV) infection versus matched healthy controls (n=16 training; 3 test) and breast cancer (BC) tissues harboring TP53 or PIK3CA mutations versus adjacent normal tissue (n=27 training; 9 test). The approach achieved 90% precision and recall on an unseen BC test set and 92% precision with 90% recall in HRV fivefold cross-validation. Incorporating paired-sample dynamics boosted precision by up to 12% and recall by 13% in BC, and by 5% each in HRV. MLOps workflows yielded an additional ~14.5% accuracy improvement compared to traditional pipelines. Moreover, our method identified 42 critical gene-sets (pathways) for rhinovirus response and 21 for breast cancer mutation status, with retroactive ablation of top features reducing accuracy by ~25%.
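The feature-ablation check mentioned above can be sketched as follows. This is a hedged illustration only: the nearest-centroid classifier is a stand-in for whatever model the authors trained, the function names are hypothetical, and features are masked at prediction time rather than by refitting, which may differ from the retroactive ablation procedure used in the study.

```python
from statistics import mean


def fit_centroids(X, y):
    """Per-class mean of every feature; X is a list of dicts feature -> value."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = {f: mean([r[f] for r in rows]) for f in rows[0]}
    return centroids


def predict(centroids, x, dropped=frozenset()):
    """Assign x to the nearest centroid, ignoring any ablated features."""
    def dist(c):
        return sum((x[f] - c[f]) ** 2 for f in c if f not in dropped)
    # sorted() makes tie-breaking deterministic (alphabetical by label)
    return min(sorted(centroids), key=lambda label: dist(centroids[label]))


def accuracy(centroids, X, y, dropped=frozenset()):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(centroids, x, dropped) == lab for x, lab in zip(X, y))
    return hits / len(y)


def ablation_drop(centroids, X, y, features):
    """Accuracy lost when each listed feature is masked, one at a time."""
    base = accuracy(centroids, X, y)
    return {
        f: base - accuracy(centroids, X, y, frozenset([f])) for f in features
    }
```

A large value returned by `ablation_drop` for a pathway indicates that the classifier depends on it, which is the sense in which a gene set would be called critical in this kind of analysis.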
Conclusions:
These proof-of-concept results suggest that integrating intra-subject dynamics, pathway-level feature reduction grounded in prior biological knowledge (e.g., N-of-1-pathways analytics), and reproducible MLOps workflows can overcome cohort-size limitations in infrequent diseases, offering a scalable, interpretable solution for high-dimensional transcriptomic classification. Future work will extend these advances across various therapeutic and small-cohort designs.
Clinical Trial: not applicable
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.