Why were the Qiagen spin columns contaminated with viral DNA? A search of the publicly available environmental metagenomic data sets revealed the presence of sequences highly related to this virus (87 to 99% nucleotide identity). The data sets containing these sequences were obtained from seawater collected off the Pacific coast of North America and coastal regions of Oregon and Chile. The source of contamination could be explained if the silica in the Qiagen spin columns was produced from ocean-dwelling diatoms that were infected with the virus.
In retrospect, it was easy to be fooled into believing that the novel virus might be a human pathogen because it was detected only in sick and not healthy patients. Why antibodies to the virus were detected in samples from both sick and healthy patients remains to be explained. However, the virus is not likely to be associated with any human illness: when non-Qiagen spin columns were used, the viral sequences were not found in any patient sample.
The lesson to be learned from this story is clear: high-throughput sequencing is a very powerful and sensitive method but must be applied with great care. Every step of the virus discovery process must be carefully controlled, from the water used to the plastic reagents. Most importantly, laboratories carrying out pathogen discovery must share their sequence data, something that took place during this study.
Naccache SN, Greninger AL, Lee D, Coffey LL, Phan T, Rein-Weston A, Aronsohn A, Hackett J, Jr, Delwart EL, Chiu CY. 2013. The perils of pathogen discovery: origin of a novel parvovirus-like hybrid genome traced to nucleic acid extraction spin columns. J Virol 87:11966–11977.
Xu B, Zhi N, Hu G, Wan Z, Zheng X, Liu X, Wong S, Kajigaya S, Zhao K, Mao Q, Young NS. 2013. Hybrid DNA virus in Chinese patients with seronegative hepatitis discovered by deep sequencing. Proc Natl Acad Sci U S A 110:10264–10269.
Computational biology. The generation of nucleotide sequences at an unprecedented rate has spawned a new branch of bioinformatics to develop algorithms for assembling sequence reads into continuous strings and to determine whether they are from a new or previously discovered virus. Storing, analyzing, and sharing massive quantities of data constitute an immense challenge: the number of bases in GenBank, an open-access, annotated collection of all publicly available nucleotide sequences produced and maintained by the National Center for Biotechnology Information, has doubled every 18 months since 1982. As of June 2019 GenBank held 329,835,282,370 bases.
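The consequence of an 18-month doubling time can be made concrete with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and the projection assumes that the doubling time stays constant, which is an extrapolation and not a reported figure.

```python
def projected_bases(current_bases, months_ahead, doubling_months=18):
    """Extrapolate a base count, assuming a constant doubling time."""
    return current_bases * 2 ** (months_ahead / doubling_months)


def doublings_between(start_year, end_year, doubling_months=18):
    """Number of doublings that fit between two years at the given doubling time."""
    return (end_year - start_year) * 12 / doubling_months


june_2019 = 329_835_282_370  # base count quoted in the text for June 2019

print(f"Doublings, 1982 to 2019: {doublings_between(1982, 2019):.1f}")
# Hypothetical projection: two further doublings over 36 months.
print(f"Projected bases after 36 more months: {projected_bases(june_2019, 36):,.0f}")
```

Roughly 25 doublings separate 1982 from 2019, which is why storage and analysis, and not sequencing itself, have become the limiting steps.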
Computational problems must be solved at multiple steps during the process of genome sequencing. The initial problem is that sequence reads are typically short, and there are very many of them (hence the term “high throughput”). These short sequences must be overlapped and, if possible, mapped to a genome. Many computer programs have been developed to address this problem. Some carry out alignment of sequence reads to a reference genome, while others perform this process de novo, i.e., in the absence of a reference genome.
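The idea of overlapping short reads in the absence of a reference can be illustrated with a toy greedy merger. This is a minimal sketch under strong assumptions (error-free reads, a single strand, no repeats); the function names and the `min_overlap` parameter are hypothetical, and real assemblers rely on far more sophisticated graph-based methods.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest = min(len(left), len(right))
    for length in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_length(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three made-up reads that tile a 13-base sequence.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

Reads that share no sufficiently long overlap are returned as separate contigs, which mirrors the fragmented assemblies obtained when coverage is low.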
When clinical or environmental samples are subjected to high-throughput sequencing for pathogen discovery, it is essential to identify viral sequences in what is typically a mix of host, bacterial, and fungal sequences. This task relies on alignment of sequences to reference viral databases. However, such databases are limited because most of the sequences retrieved in metagenomic studies are unknown (so-called “dark matter”) and therefore cannot be annotated. Consequently, computational pipelines have been designed to analyze high-throughput sequencing data to search for those likely to be of viral origin.
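The sorting of reads against reference sequences can be sketched as follows. Shared k-mers are used here as a crude stand-in for true alignment, a substitution made for brevity; the reference names, sequences, and thresholds are invented for illustration, and reverse-complement matches are ignored.

```python
def kmers(seq, k):
    """Set of all substrings of length k in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_reads(reads, references, k=5, min_fraction=0.5):
    """Assign each read to the reference sharing the largest fraction of its k-mers."""
    index = {name: kmers(seq, k) for name, seq in references.items()}
    calls = {}
    for read in reads:
        read_kmers = kmers(read, k)
        best_name, best_fraction = "unclassified", 0.0
        for name, ref_kmers in index.items():
            if not read_kmers:
                continue  # read shorter than k
            fraction = len(read_kmers & ref_kmers) / len(read_kmers)
            if fraction >= min_fraction and fraction > best_fraction:
                best_name, best_fraction = name, fraction
        calls[read] = best_name
    return calls


# Hypothetical references and reads, not real genome sequences.
references = {
    "virus_A": "ATGCGTACGTTAGCCGATAGCTAGG",
    "host": "TTTTTTGGGGGGCCCCCCAAAAAA",
}
reads = ["CGTACGTTAGC", "GGGGCCCCCC", "ACACACACACAC"]
for read, call in classify_reads(reads, references).items():
    print(read, call)
```

The third read matches nothing in the index and remains unclassified, which is the fate of the “dark matter” sequences in any approach that depends on known references.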
Some computational pipelines are designed to define the abundance and types of viruses in a sample, such as Viral Informatics Resource for Metagenome Exploration (VIROME), the Viral MetaGenome Annotation Project (VMGAP), and Basic Local Alignment Search Tool (BLAST). Other virus discovery programs (MePIC, READSCAN, CaPSID, VirusFinder, and SRSA) rely on nucleotide sequence alignment and will work only for the detection of viruses with high sequence similarity to known viruses. PathSeq, SURPI, VirFind, and VirusHunter identify viruses by amino acid searches, a computationally demanding exercise that is critical for new virus identification. VirusSeeker-Virome (VS-Virome) is a computational pipeline designed for defining both the type and abundance of known and novel viral sequences in metagenomic data sets (Fig. 2.17).
Genome sequences can provide considerable insight into the evolutionary relationships among viruses. Such information can be used to understand the origin of viruses and how selection pressures change viral genomes and to assist in epidemiological investigations of viral outbreaks. When few viral genome sequences were available, pairwise homologies were often displayed in simple tables. As sequence databases increased in size, tables of multiple alignments were created, but these were still based only on pairwise comparisons. Today, phylogenetic trees are used to illustrate the relationships among numerous viruses or viral proteins (Box 2.10). Not only are such trees important tools for understanding evolutionary relationships, but they may allow conclusions to be drawn about biological functions: examination of a phylogenetic tree may allow determination of how closely or distantly a sequence relates to one of known function. Software programs such as AdaPatch, AntiPatch, and AntigenicTree have been developed to produce phylogenetic trees. However, these approaches do not account for horizontal gene transfer, recombination, or the evolutionary relationships between viruses and their hosts, which will require unconventional computational methods to resolve.
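The step from pairwise comparisons to a tree can be illustrated with a small distance-based example. The sketch below computes the fraction of differing positions between pre-aligned sequences and then clusters them by average linkage (UPGMA). It is a deliberately simple method chosen for brevity, not the approach used by any program named above; the sequences are invented, and the sketch assumes equal-length, already aligned input.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(sequences):
    """Build a nested-tuple tree by average-linkage clustering of p-distances."""
    dist = {frozenset((m, n)): p_distance(sequences[m], sequences[n])
            for m, n in combinations(sequences, 2)}

    def average(ci, cj):
        # Mean distance over all member pairs drawn from the two clusters.
        pairs = [(m, n) for m in ci[1] for n in cj[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in sequences]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical aligned sequences from four isolates.
isolates = {
    "A": "ATGCATGC",
    "B": "ATGCATGA",
    "C": "TTGGATCA",
    "D": "TTGGATCC",
}
print(upgma(isolates))
```

Because the tree is built only from pairwise distances along a single alignment, it shares the limitation noted above: recombination and horizontal gene transfer, which give different genome regions different histories, cannot be represented.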
Algorithms have also been written to apply high-throughput sequencing methods to a variety of genome-wide analyses, including detection of single-nucleotide polymorphisms (SNPs), RNA-seq, ChIP-seq, CLIP, and more (see below).
Viral Reproduction: the Burst Concept
A fundamental and important principle is that viruses are reproduced via the assembly of preformed components into particles: the parts are first made in cells and then assembled into the final product. This simple build-and-assemble strategy is unique to viruses, but the details of how this process transpires are astonishingly diverse among members of different virus families. There are many ways to build a virus particle, and each one tells us something new about virus structure and assembly.
Modern investigations of viral reproduction strategies have their origins in the work of Max Delbrück and colleagues, who studied the T-even bacteriophages starting in 1937. Delbrück believed that these bacteriophages were perfect models for understanding the basis of heredity. He focused his attention on the fact that one bacterial cell usually makes hundreds of progeny virus particles. The yield from one cell is one viral generation; it was called the burst because the viruses that he studied literally burst from the infected cell. Under carefully controlled laboratory conditions, most cells make, on average, about the same number of bacteriophages per cell. For example, in one of Delbrück’s experiments, the average number of bacteriophage T4 particles produced from individual single-cell bursts from Escherichia coli cells was 150 particles per cell.
Another important implication of the burst is that a cell has a finite capacity to produce virus. Multiple parameters limit the number of particles produced per cell. These include metabolic resources, the number of sites for genome replication in the cell, the regulation of release of virus particles, and host defenses. In general, larger cells (e.g., eukaryotic cells) produce more virus particles per cell: yields of 1,000 to 10,000 virions per eukaryotic cell are not uncommon.
A burst occurs for viruses that kill the cell after infection, namely, cytopathic viruses. However, some viruses do not kill their host cells, and virus particles are produced as long as the cell is alive. Examples include filamentous bacteriophages, most retroviruses, and hepatitis viruses.
The One-Step Growth Cycle
The