This chapter was written by Dr. Andreas D. Baxevanis in his private capacity. No official support or endorsement by the National Institutes of Health or the United States Department of Health and Human Services is intended or should be inferred.
2 Information Retrieval from Biological Databases
Andreas D. Baxevanis
Introduction
On April 14, 2003, the biological community celebrated the achievement of the Human Genome Project's major goal: the complete, accurate, and high-quality sequencing of the human genome (International Human Genome Sequencing Consortium 2001; Schmutz et al. 2004). The attainment of this goal, which many have compared to landing a person on the moon, has profoundly changed how biological and biomedical research is conducted and will undoubtedly continue to shape its direction in the future. The availability of not just human genome data, but also human sequence variation data, model organism sequence data, and information on gene structure and function provides fertile ground for biologists to better design and interpret their experiments in the laboratory, fulfilling the promise of bioinformatics in advancing and accelerating biological discovery.
One of the most important databases available to biologists is GenBank, the annotated collection of all publicly available DNA and protein sequences (Benson et al. 2017; see Chapter 1). This database, maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH), represents a collaborative effort between NCBI, the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ). At the time of this writing, GenBank contained over 200 million sequences and over 300 trillion nucleotide bases. The completion of human genome sequencing and the sequencing of an ever-expanding number of model organism genomes, as well as the existence of a gargantuan number of sequences in general, provide a golden opportunity for biological scientists, owing to the inherent value of these data. At the same time, however, the sheer magnitude of the data presents a conundrum to the inexperienced user. The difficulty results not just from the size of the “sequence information space” but from the fact that this space continues to grow by leaps and bounds, at a pace that will continue to accelerate, even though human genome sequencing has long been “completed.”
The effect of the Human Genome Project and other systematic sequencing projects on the continued accumulation of sequence data is illustrated by the growth of GenBank, as shown in Figure 2.1; the exponential growth rate illustrated in the figure is expected to continue for some time to come. The continued expansion of not just the sequence space but also the myriad biological data derived from it underscores the need for all biologists to learn how to navigate this information effectively in their work. In some cases, the data found within these virtual treasure troves may even allow investigators to avoid performing expensive experiments themselves.
GenBank (or any other biological database, for that matter) serves little purpose unless the data can be easily searched and entries retrieved in a useful, meaningful format. Otherwise, sequencing efforts such as those described above have no useful end: without effective search and retrieval tools, the biological community as a whole cannot make use of the information hidden within these millions of bases and amino acids, much less the structures they form or the mutations they harbor. Much effort has gone into making such data accessible to the biologist, and a selection of the programs and interfaces resulting from these efforts is the focus of this chapter. The discussion will center on querying databases maintained by NCBI, as these more “general” repositories are far and away the ones most often accessed by biologists, but attention will also be given to specialized databases that provide information not necessarily found through the use of Entrez, NCBI's integrated information retrieval system.
Figure 2.1 The exponential growth of GenBank in terms of number of nucleotides (squares, in millions) and number of sequences submitted (circles, in thousands). Source data for the figure have been obtained from the National Center for Biotechnology Information (NCBI) web site. Note that the period of accelerated growth after 1997 coincides with the completion of the Human Genome Project's genetic and physical mapping goals, setting the stage for high-accuracy, high-throughput sequencing, as well as the development of new sequencing technologies (Collins et al. 1998, 2003; Green et al. 2011).
Integrated Information Retrieval: The Entrez System
One of the most widely used interfaces for the retrieval of information from biological databases is the NCBI Entrez system. Entrez capitalizes on the fact that there are pre-existing, logical relationships between the individual entries found in numerous public databases. For example, a paper in PubMed may describe the sequencing of a gene whose sequence appears in GenBank. The nucleotide sequence, in turn, may code for a protein product whose sequence is stored in NCBI's Protein database. The three-dimensional structure of that protein may be known, and the coordinates for that structure may appear in NCBI's Structure database. Finally, there may be allelic or structural variants documented for the gene of interest, cataloged in databases such as the Single Nucleotide Polymorphism Database (called dbSNP) or the Database of Genomic Structural Variation (called dbVar), respectively. The existence of such natural connections,