To combat the current COVID-19 pandemic, scientists around the world are sequencing viral genomes at an accelerated pace.
These sequences are then deposited into a number of international databases, including the National Center for Biotechnology Information (NCBI; Figure 1). This approach has limitations: with sequences spread across multiple databases, it is challenging for a single researcher to consolidate data from different sources. Moreover, the data were generated and processed by different research groups at different institutions, which results in batch effects when the data are amalgamated, that is, differences in signal across groups of viral genomes processed together that represent technical noise rather than biological variation. The distributed nature of the data and the lack of uniformity in data generation and processing hinder the pace of scientific discovery. To accelerate discovery, we need to leverage the breadth of data available internationally, which requires consolidating data from multiple sources and “cleaning” the data to reduce technical artifacts introduced during data processing.