

Serratus is a free, open-source cloud-computing infrastructure optimized for petabase-scale sequence alignment against a set of query sequences. Using Serratus, we aligned more than one million short-read sequencing datasets per day for less than 1 US cent per dataset (Extended Data Fig.). We used a widely available commercial computing service to deploy up to 22,250 virtual CPUs simultaneously (see Methods), leveraging SRA data mirrored onto cloud platforms as part of the NIH STRIDES initiative 13. Our search space spans data deposited over 13 years from every continent and ocean, and all kingdoms of life (Fig.). We applied Serratus in two of many possible configurations. First, to identify libraries that contain known or closely related viruses, we searched 3,837,755 (around May 2020) public RNA sequencing (RNA-seq), meta-genome, meta-transcriptome and meta-virome datasets (termed sequencing runs 1) against a nucleotide pangenome of all coronavirus sequences and RefSeq vertebrate viruses. We then aligned 5,686,715 runs (January 2021) against all known viral RdRP amino acid sequences using a specially optimized version of DIAMOND v2 (ref.).
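As a rough back-of-envelope check on the scale described here, the quoted throughput and per-dataset cost can be turned into upper bounds for the RdRP search. This is only a sketch: it assumes the quoted rates ("more than one million datasets per day", "less than 1 US cent per dataset") held uniformly for all 5,686,715 runs, which the text does not state, and the function names are illustrative rather than part of Serratus.

```python
# Back-of-envelope bounds derived only from figures quoted in the text.
# Assumption: the quoted rates apply uniformly to the whole RdRP search.

RUNS_RDRP_SEARCH = 5_686_715     # runs aligned against viral RdRP (January 2021)
MAX_COST_PER_RUN_USD = 0.01      # "less than 1 US cent per dataset"
MIN_RUNS_PER_DAY = 1_000_000     # "more than one million ... datasets per day"


def upper_bound_cost_usd(runs: int, max_cost_per_run: float = MAX_COST_PER_RUN_USD) -> float:
    """Total cost is below runs * (maximum cost per run)."""
    return runs * max_cost_per_run


def upper_bound_days(runs: int, min_runs_per_day: int = MIN_RUNS_PER_DAY) -> float:
    """Elapsed time is below runs / (minimum sustained throughput)."""
    return runs / min_runs_per_day


cost = upper_bound_cost_usd(RUNS_RDRP_SEARCH)
days = upper_bound_days(RUNS_RDRP_SEARCH)
print(f"cost below ${cost:,.0f}; under {days:.1f} days at the quoted throughput")
```

Both numbers are ceilings rather than estimates, because the source gives the cost as an upper limit and the throughput as a lower limit.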
To catalyse global virus discovery, we developed the Serratus cloud computing infrastructure for ultra-high-throughput sequence alignment, screening 5.7 million ecologically diverse sequencing libraries or 10.2 petabases of data. Identification of Earth’s virome is a fundamental step in preparing for the next pandemic. We lay the foundations for future research by enabling direct access to 883,502 RNA-dependent RNA polymerase (RdRP)-containing sequences, which include the RdRP from 131,957 novel RNA viruses (sequences with greater than 10% divergence from a known RdRP), including 9 novel coronaviruses. Altogether this captures the collective efforts of over a decade of sequencing studies in a free repository.
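The novelty criterion quoted here (greater than 10% divergence from a known RdRP) can be illustrated with a minimal sketch. It assumes that divergence is computed as one minus the fractional identity to the closest known RdRP; the text does not specify how divergence is measured, and the function names and example identifiers below are hypothetical.

```python
# Illustrative only: classify RdRP sequences as novel using a 10% divergence cutoff.
# Assumption: divergence = 1 - identity to the closest known RdRP sequence.

NOVELTY_DIVERGENCE_THRESHOLD = 0.10  # "greater than 10% divergence from a known RdRP"


def is_novel(best_identity: float, threshold: float = NOVELTY_DIVERGENCE_THRESHOLD) -> bool:
    """Return True if the closest known RdRP is more than `threshold` divergent."""
    if not 0.0 <= best_identity <= 1.0:
        raise ValueError("identity must be a fraction between 0 and 1")
    divergence = 1.0 - best_identity
    return divergence > threshold


def count_novel(identities: dict[str, float]) -> int:
    """Count sequences whose best hit exceeds the divergence threshold."""
    return sum(is_novel(identity) for identity in identities.values())


# Hypothetical sequence IDs and identities, not data from the study.
example = {"seq_a": 0.97, "seq_b": 0.85, "seq_c": 0.62}
print(count_novel(example), "of", len(example), "sequences pass the novelty cutoff")
```

A strict inequality is used so that a sequence at exactly 90% identity is not counted as novel, matching the wording "greater than 10%".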

Sequence analysis remains computationally expensive, in particular the assembly of short reads into contigs, which limits the breadth of samples analysed. Here we propose an alternative alignment-based strategy that is considerably cheaper than assembly and enables processing of massive datasets. Petabases (1 × 10^15 bases) of sequencing data are freely available in public databases such as the Sequence Read Archive (SRA) 1, in which viral nucleic acids are often captured incidental to the goals of the original studies 12.

Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which (at the time of writing) exceeds 20 petabases and is growing exponentially 1. Here we developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase and identified well over 10^5 novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude. We characterized novel viruses related to coronaviruses, hepatitis delta virus and huge phages, respectively, and analysed their environmental reservoirs. To catalyse the ongoing revolution of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.

Viral zoonotic disease has had a major impact on human health over the past century, with notable examples including the 1918 Spanish influenza, AIDS, SARS, Ebola and COVID-19. There are an estimated 3 × 10^5 mammalian virus species from which infectious diseases in humans may arise 2, of which only a fraction are known at present. Global surveillance of virus diversity is required for improved prediction and prevention of future epidemics, and is the focus of international consortia and hundreds of research laboratories 3, 4. Pioneering works expanding the virome of the Earth have each uncovered thousands of novel viruses, with the rate of virus discovery increasing exponentially and driven largely by the increased availability of high-throughput sequencing 5, 6, 7, 8, 9, 10, 11.
