Figure 3.10 Information retrieval from viral genomes. Different strategies for decoding the information in viral genomes are depicted. CBF, CCAAT-binding factor; USF, upstream stimulatory factor; IRES, internal ribosome entry site.
The utility of viral genome sequences extends well beyond building a catalog of viruses. These sequences are the primary basis for classification and also provide information on the origin and evolution of viruses. In outbreaks or epidemics of viral disease, even partial genome sequences can provide information about the identity of the infecting virus and its spread in different populations. New viral nucleic acid sequences can be associated with disease and characterized even in the absence of standard virological techniques (Volume II, Chapter 10). For example, human herpesvirus 8 was identified by comparing sequences present in diseased and nondiseased tissues, and a novel member of the parvovirus family was identified as the cause of unexpected deaths of laboratory mice in Australia and the United States.
Despite their utility, genome sequences cannot provide a complete understanding of how viruses reproduce. The genome sequence of a virus is at best a biological “parts list”: it provides some information about the intrinsic properties of a virus (for example, predicted sequences of viral proteins and particle composition), but says little or nothing about how the virus interacts with cells, hosts, and populations. This limitation is best illustrated by the results of environmental metagenomic analyses, which reveal that the number of viruses around us (especially in the sea) is astronomical. Most are uncharacterized and, because their hosts are also unknown, cannot be investigated. A reductionist study of individual components in isolation provides few answers. Although the reductionist approach is often the simplest experimentally, it is also important to understand how the genome behaves among others (population biology) and how the genome changes with time (evolution). Nevertheless, reductionism has provided much-needed detailed information for tractable virus-host systems. These systems allow genetic and biochemical analyses and provide models of infection in vivo and in cells in culture. Unfortunately, viruses and hosts that are difficult or impossible to manipulate in the laboratory remain understudied or ignored.
The “Big and Small” of Viral Genomes: Does Size Matter?
The question “does genome size matter” is difficult to answer considering the three orders of magnitude in genome length that separate the largest and the smallest viral genomes. The two largest viral genomes known are those of Pandoravirus salinus (2.4 million bases of dsDNA) and Pandoravirus dulcis (1.9 million bases of dsDNA), encoding 2,541 and 1,487 open reading frames, respectively. The largest RNA virus genomes are far behind (Box 3.4). At the other end are anelloviruses, with a 1,759-base ssDNA genome encoding two proteins (Fig. 3.3B), and viroids, circular, single-stranded RNA molecules of 246 to 401 nucleotides that encode no protein (Volume II, Chapter 13). Anelloviruses include agriculturally important pathogens of chickens and pigs and torque teno (TT) virus, which infects >90% of humans with no known consequence. Viroids cause economically important diseases of crop plants.
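The size range quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the genome lengths and open reading frame counts given in the text; the function names are illustrative and not part of any established tool.

```python
import math

# Genome lengths in nucleotides, as quoted in the text above.
GENOME_LENGTHS = {
    "Pandoravirus salinus": 2_400_000,
    "Pandoravirus dulcis": 1_900_000,
    "anellovirus": 1_759,
    "smallest viroid": 246,
}


def orders_of_magnitude(larger: int, smaller: int) -> float:
    """Return log10 of the ratio between two genome lengths."""
    return math.log10(larger / smaller)


def bases_per_orf(genome_length: int, orf_count: int) -> float:
    """Average number of genome bases per open reading frame."""
    return genome_length / orf_count


# Largest viral genome compared with the smallest viral genome and the smallest viroid.
span_virus = orders_of_magnitude(
    GENOME_LENGTHS["Pandoravirus salinus"], GENOME_LENGTHS["anellovirus"]
)
span_viroid = orders_of_magnitude(
    GENOME_LENGTHS["Pandoravirus salinus"], GENOME_LENGTHS["smallest viroid"]
)

# Rough gene density of the largest genome (ORF count from the text).
salinus_density = bases_per_orf(GENOME_LENGTHS["Pandoravirus salinus"], 2_541)

print(f"largest virus vs. anellovirus: {span_virus:.2f} orders of magnitude")
print(f"largest virus vs. viroid:      {span_viroid:.2f} orders of magnitude")
print(f"P. salinus: about {salinus_density:.0f} bases per ORF")
```

The ratio to the anellovirus genome is slightly more than three orders of magnitude; if viroids are included, the span approaches four.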
Viruses with genomes across the entire range, from the biggest to the smallest, are successful: they continue to reproduce and spread within their hosts. Despite detailed analyses, there is no evidence that one size is more advantageous than another. Yet all viral genomes have evolved under relentless selection, so extremes of size must provide particular advantages. One feature distinguishing large genomes from smaller ones is the presence of many genes that encode proteins for viral genome replication, nucleic acid metabolism, and countering host defense systems. When mimiviruses were first discovered, the surprise was that their genomes encoded components of the protein synthesis system, such as tRNAs and aminoacyl-tRNA synthetases. Tupanviruses, isolated from soda lakes in Brazil and deep ocean sediments, encode all 20 aminoacyl-tRNA synthetases, 70 tRNAs, multiple translation proteins, and more. Only the ribosome is lacking. Why would large viral genomes carry these genes when they are available in their cellular hosts? Perhaps production of a large part of the translational machinery allows viral mRNAs to be translated more efficiently. This explanation is consistent with the finding that the codon and amino acid usage of tupanvirus differ from those of the amoeba that it infects.
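A comparison of codon usage between a virus and its host can be sketched as follows. This is only an illustration of the idea, not the method used in the tupanvirus studies: the two coding sequences are hypothetical toy examples, and the distance measure (half the sum of absolute frequency differences) is an assumption chosen for simplicity.

```python
from collections import Counter


def codon_frequencies(cds: str) -> dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        raise ValueError("sequence contains no complete codon")
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(freq_a: dict[str, float], freq_b: dict[str, float]) -> float:
    """Half the summed absolute frequency difference (0 = identical, 1 = no codon shared)."""
    codons = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in codons)


# Hypothetical toy sequences encoding the same peptide (Met-Lys-Lys-Leu-Leu-Asp-stop)
# with different synonymous codons; they are NOT real viral or amoebal genes.
virus_cds = "".join(["ATG", "AAA", "AAA", "TTA", "TTA", "GAT", "TAA"])
host_cds = "".join(["ATG", "AAG", "AAG", "CTG", "CTG", "GAC", "TGA"])

distance = usage_distance(codon_frequencies(virus_cds), codon_frequencies(host_cds))
print(f"codon usage distance: {distance:.3f}")
```

In practice such a comparison would be made across all annotated coding sequences of each genome rather than a single gene, and a large distance would suggest that the virus is not well matched to the tRNA pool of its host.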
EXPERIMENTS
Planaria and mollusks yield the biggest RNA genomes
In the past 20 years, the development of high-throughput nucleic acid sequencing methods has rapidly increased the pace of virus discovery. Yet in that time, while the size of the largest known DNA virus genome has increased nearly tenfold, that of the largest known RNA virus genome has increased by only ten percent. This situation has now changed with the discovery of new RNA viruses of planarians and mollusks.
Until very recently, the biggest RNA virus genome known was 33.5 kb (ball python nidovirus), which is much larger than the average-sized RNA virus genome of 10 kb. The difference arises because RNA polymerases make errors and most lack proofreading capabilities, whereas nidovirus genomes encode a proofreading exoribonuclease that improves replication fidelity and presumably allows for larger genomes. Even with a proofreading enzyme, the biggest RNA virus genome is much smaller than the minimal cellular DNA genome, which is 200 kb. The results of two new studies show that larger viral RNA genomes exist, suggesting that we have not yet reached the size limit of RNA genomes.
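The link between replication fidelity and genome size can be made concrete with a simple calculation. The sketch below assumes that errors occur independently at a fixed rate per nucleotide copied. The two error rates are placeholder values chosen only for illustration, not measurements from the studies described here; the genome lengths are those quoted in the text.

```python
def expected_errors(genome_length: int, error_rate: float) -> float:
    """Mean number of errors introduced when one genome copy is made."""
    return genome_length * error_rate


def error_free_fraction(genome_length: int, error_rate: float) -> float:
    """Probability that a genome copy contains no error at all."""
    return (1.0 - error_rate) ** genome_length


# ASSUMED per-nucleotide error rates (illustrative placeholders only).
RATE_NO_PROOFREADING = 1e-4
RATE_WITH_PROOFREADING = 1e-5

# Genome lengths in nucleotides, from the text.
genomes = {
    "average RNA virus": 10_000,
    "ball python nidovirus": 33_500,
}

for name, length in genomes.items():
    for label, rate in (
        ("no proofreading", RATE_NO_PROOFREADING),
        ("with proofreading", RATE_WITH_PROOFREADING),
    ):
        print(
            f"{name:22s} {label:18s} "
            f"errors per copy: {expected_errors(length, rate):.2f}, "
            f"error-free copies: {error_free_fraction(length, rate):.1%}"
        )
```

Under these assumptions, lengthening the genome at a fixed error rate lowers the fraction of error-free copies, whereas a lower error rate raises it, which is consistent with the idea that proofreading permits larger RNA genomes.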
A close study of the transcriptome of a planarian revealed a new nidovirus, planarian secretory cell nidovirus, with an RNA genome of 41,103 nucleotides. This viral genome is unusual because it encodes a single, long open reading frame (ORF) of 13,556 codons, the longest viral ORF discovered so far. All the other known nidoviruses encode multiple ORFs. Phylogenetic analysis of known nidoviruses suggests that the planarian virus arose from viruses with multiple ORFs, after which its single ORF expanded in size.
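How an open reading frame is located, and how much of this genome the single ORF occupies, can be sketched as follows. The scanner is a deliberately simplified assumption (forward strand only, ATG start, standard stop codons) and is not the annotation method used in the study; the test sequence is a hypothetical toy example. Only the genome length and ORF length come from the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> tuple[int, int]:
    """Return (start, end) of the longest ATG-to-stop ORF on the forward strand.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    Returns (0, 0) if no complete ORF is found.
    """
    seq = seq.upper().replace("U", "T")
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this stretch
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None  # resume searching after the stop codon
    return best


# Hypothetical toy sequence: 2-nt leader, ATG, four GCT codons, stop, 1-nt trailer.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
orf_start, orf_end = longest_orf(toy)
print(f"toy ORF spans positions {orf_start}-{orf_end}")

# Fraction of the planarian nidovirus genome occupied by its single ORF.
GENOME_NT = 41_103
ORF_CODONS = 13_556
coding_fraction = ORF_CODONS * 3 / GENOME_NT
print(f"coding fraction: {coding_fraction:.1%}")
```

By this arithmetic, the single ORF accounts for about 99% of the genome, leaving only a few hundred nucleotides outside it.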
The other nidovirus with a large RNA genome was discovered by searching all the available RNA sequences of the mollusk Aplysia californica. With a simple nervous system of 20,000 neurons, this mollusk has been studied as a model system in many laboratories. Aplysia californica nido-like virus has an RNA genome of 35,906 nucleotides with ORFs that encode two polyproteins.
From the perspective of genome size, the discovery of these nidovirus genomes suggests that viruses with even larger RNAs remain to be discovered. Both viruses were identified from sequences that had been deposited in public databases, although in neither case were infectious viruses reported. Nevertheless, many organisms have not yet had their genomes sequenced, and it is likely that many RNA viruses remain to be discovered. Declaring an upper limit on RNA genome size does not seem reasonable if we have not sampled every species.
Saberi A, Gulyaeva AA, Brubacher JL, Newmark PA, Gorbalenya AE. 2018. A planarian nidovirus expands the limits of RNA genome size. PLoS Pathog 14:e1007314.
Debat HJ. 2018. Expanding the size limit of RNA viruses: