Currently submitted to: JMIR Bioinformatics and Biotechnology
Date Submitted: Feb 2, 2026
Open Peer Review Period: Feb 18, 2026 - Apr 15, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
A Metadata-Driven Framework for Strengthening Pathogen Genomics: Lessons from SARS-CoV-2
ABSTRACT
During the COVID-19 pandemic, large-scale pathogen sequencing generated millions of SARS-CoV-2 genomes deposited in repositories such as GenBank and GISAID. However, most of these records lack detailed patient metadata, such as demographics and clinical outcomes, which limits their utility for large-scale pathogen genomics analyses. Records linked to a journal publication may contain such metadata, but systematically extracting it and linking it to sequence records requires substantial manual effort. In this work, we assess the completeness of metadata in GenBank and demonstrate the value of enriched clinical and demographic annotations for genomic epidemiology. We found that, on average, GenBank records contained only 21.6% of host metadata, and that during our study period ~0.02% of published articles provided accessible sequence-specific patient metadata. Additionally, using published SARS-CoV-2 genomes and their corresponding journal articles, we constructed an analytical use case in pathogen genomics in which stratifying hosts by clinical and demographic factors enables examination of evolutionary dynamics and clinical outcomes. Our results demonstrate how metadata enrichment enhances pathogen genomic studies and provide a framework applicable to other pathogens.
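To make the completeness figure concrete, the following is a minimal sketch of how a per-record host-metadata completeness score could be computed and averaged. It is not the authors' pipeline: the field list `HOST_FIELDS`, the set of placeholder values in `MISSING`, and the example records are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical host-metadata fields; the manuscript's actual field list may differ.
HOST_FIELDS = ("host", "age", "sex", "health_status", "collection_date")

# Placeholder strings treated as "no information" (an assumption for this sketch).
MISSING = {"", "missing", "unknown", "not provided", "not collected", "na", "n/a"}


def is_present(value):
    """Return True if a field holds an informative value."""
    if value is None:
        return False
    return str(value).strip().lower() not in MISSING


def record_completeness(record, fields=HOST_FIELDS):
    """Fraction of the expected host fields that are filled in one record."""
    filled = sum(1 for field in fields if is_present(record.get(field)))
    return filled / len(fields)


def mean_completeness(records, fields=HOST_FIELDS):
    """Average per-record completeness across a collection of records."""
    return mean(record_completeness(r, fields) for r in records)


# Illustrative records only; these are not data from the study.
records = [
    {"host": "Homo sapiens", "collection_date": "2021-03-04"},
    {"host": "Homo sapiens", "age": "54", "sex": "female", "health_status": "unknown"},
]
print(f"Mean host-metadata completeness: {mean_completeness(records):.1%}")
```

Treating placeholder strings such as "unknown" as missing is a design choice: a field that is present but uninformative would otherwise inflate the score.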
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.