Entrez, to be clear, is not a database itself. Rather, it is the interface through which its component databases can be accessed and traversed – an integrated information retrieval system. The Entrez information space includes PubMed records, nucleotide and protein sequence data, information on conserved protein domains, three-dimensional structure information, and genomic variation data with potential clinical relevance, a good number of which will be touched upon in this chapter. The strength of Entrez lies in the fact that all of this information, across a large number of component databases, can be accessed by issuing one – and only one – query. This very powerful, integrated approach is made possible through the use of two general types of connections between database entries: neighboring and hard links.
Relationships Between Database Entries: Neighboring
The concept of neighboring enables entries within a given database to be connected to one another. If a user is looking at a particular PubMed entry, the user can then “ask” Entrez to find all of the other papers in PubMed that are similar in subject matter to the original paper. Likewise, if a user is looking at a sequence entry, Entrez can return a list of all other sequences that bear similarity to the original sequence. The establishment of neighboring relationships within a database is based on statistical measures of similarity, some of which are described in more detail below. While the term “neighboring” has traditionally been used to describe these connections, the terminology on the NCBI web site denotes neighbors as “related data.”
BLAST. Biological sequence similarities are detected and sequence data are compared with one another using the Basic Local Alignment Search Tool, or BLAST (Altschul et al. 1990). This algorithm attempts to find high-scoring segment pairs – pairs of sequences that can be aligned with one another and, when aligned, meet certain scoring and statistical criteria. Chapter 3 discusses the family of BLAST algorithms and their application at length.
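To make the idea of a high-scoring segment pair concrete, the following is a minimal sketch that scans every ungapped diagonal between two sequences and reports the best-scoring matching segment. It is a toy illustration only, not the BLAST algorithm: the function name and the +1/-1 scoring values are assumptions chosen for the example, and real BLAST uses word seeding, substitution matrices, and statistical significance estimates.

```python
def best_ungapped_segment(a, b, match=1, mismatch=-1):
    """Return (score, start_a, start_b, length) of the best ungapped segment pair.

    Toy scoring: +1 per match, -1 per mismatch (illustrative values only).
    """
    best = (0, 0, 0, 0)
    # Each diagonal corresponds to a fixed offset between positions in a and b.
    for offset in range(-(len(b) - 1), len(a)):
        i, j = max(offset, 0), max(-offset, 0)
        run_score, run_start, k = 0, 0, 0
        while i + k < len(a) and j + k < len(b):
            run_score += match if a[i + k] == b[j + k] else mismatch
            if run_score <= 0:
                # A non-positive running score cannot help a later segment: restart.
                run_score, run_start = 0, k + 1
            elif run_score > best[0]:
                best = (run_score, i + run_start, j + run_start, k - run_start + 1)
            k += 1
    return best


seq_a = "TTGACCTGAA"
seq_b = "CCGACCTGTT"
print(best_ungapped_segment(seq_a, seq_b))  # (6, 2, 2, 6): the shared segment GACCTG
```

The restart rule is the same one used in maximum-subarray scans: once the running score drops to zero or below, any better segment must begin after that point.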
VAST. Molecular structure similarities are detected and sets of coordinate data are compared using a vector-based method known as VAST (the Vector Alignment Search Tool; Gibrat et al. 1996). This methodology uses geometric criteria to assess similarity between three-dimensional domains, and there are three major steps that take place in the course of a VAST comparison:
First, based on known three-dimensional coordinate data, the alpha helices and beta strands that constitute the structural core of each protein are identified. Straight-line vectors are then calculated based on the position of these secondary structural elements. VAST keeps track of how one vector is connected to the next (that is, how the C-terminal end of one vector connects to the N-terminal end of the next vector), as well as whether each vector represents an alpha helix or a beta strand. Subsequent comparison steps use only these vectors in assessing structural similarity to other proteins – so, in effect, most of the painstakingly deduced atomic coordinate data are discarded at this step. This apparent oversimplification is dictated by the scale of the problem at hand: with the 150 000 structures in the Molecular Modeling Database (MMDB; Madej et al. 2014) available at the time of this writing, an in-depth comparison of each structure with every other structure in MMDB would be computationally impractical.
Next, the algorithm attempts to optimally align these sets of vectors, looking for pairs of structural elements that are of the same type and relative orientation, with consistent connectivity between the individual elements. The object is to identify highly similar “core substructures,” pairs that represent a statistically significant match above that which would be obtained by comparing randomly chosen proteins with one another.
Finally, a refinement is done using Monte Carlo (random search) methods at each residue position to optimize the structural alignment. The resultant alignment need not be global, as matches may be between individual structural domains of the proteins being compared.
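The first step above, reducing each secondary structural element to a straight-line vector, can be sketched as follows. This is a simplified illustration under stated assumptions, not the VAST implementation: the function names, the centroid-based fitting rule, and the parameter k are all hypothetical choices made for the example.

```python
import math


def sse_vector(ca_coords, k=2):
    """Approximate a secondary structural element by a (start, end) line.

    Assumption for this sketch: the line runs from the centroid of the first
    k C-alpha atoms to the centroid of the last k C-alpha atoms.
    """
    def centroid(points):
        n = len(points)
        return tuple(sum(p[d] for p in points) / n for d in range(3))

    return centroid(ca_coords[:k]), centroid(ca_coords[-k:])


def angle_between(v1, v2):
    """Angle in degrees between the directions of two (start, end) vectors."""
    d1 = [e - s for s, e in zip(*v1)]
    d2 = [e - s for s, e in zip(*v2)]
    dot = sum(x * y for x, y in zip(d1, d2))
    norm = math.sqrt(sum(x * x for x in d1)) * math.sqrt(sum(y * y for y in d2))
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


# Hypothetical coordinates: one element running along x, one along y.
helix = sse_vector([(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)])
strand = sse_vector([(0, 0, 0), (0, 1, 0), (0, 2, 0), (0, 3, 0)])
print(angle_between(helix, strand))
```

Once every element is a vector tagged with its type, comparing two proteins reduces to comparing small sets of vectors (types, relative orientations, and connectivity) rather than thousands of atomic coordinates.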
In 2014, a significant improvement to VAST was introduced. This new approach, called VAST+ (Madej et al. 2014), moves beyond assessing structural similarity by comparing individual three-dimensional domains with one another; instead, it considers the entire set of three-dimensional domains within a macromolecular complex. This approach essentially moves the comparison from the tertiary structure to the quaternary structure level, enabling the identification of similar functional, multi-subunit assemblies. In the VAST+ parlance, a macromolecular complex is referred to as a “biological unit” and can include not just the proteins that constitute the complex, but also nucleotides and chemicals where such structural information is available. The VAST+ comparison begins as described above for VAST and then proceeds through three further steps: identification of biological units that can be superimposed; calculation of root-mean-square deviations (RMSDs) of the superimposed structures as a quantitative measure of the superposition (see Box 12.1); and, finally, a refinement step to improve the RMSD values for the superposition. The result of this process is a global structural alignment in which both the most and least similar parts of the aligned molecules can be identified. From a biological standpoint, this facilitates comparisons between similarly shaped proteins; it can also be used to examine conformational changes of a single complex under varying conditions. While VAST+ is now the default method for identifying structural neighbors within the Entrez system, keep in mind that the algorithm depends on biological units being explicitly identified within the source Protein Data Bank (PDB) coordinate data records that form the basis for MMDB records; if no such biological units are defined, the original VAST algorithm is then used for the comparisons.
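As a small illustration of the RMSD measure mentioned above, the sketch below computes the root-mean-square deviation between two equal-length lists of paired atomic coordinates. It assumes the two structures have already been superimposed and that atoms are paired by index; it does not perform the superposition or the refinement step, and the function name is chosen for this example only.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired (x, y, z) coordinates.

    Assumes both coordinate sets are already superimposed and that
    coords_a[i] corresponds to coords_b[i].
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Hypothetical coordinates: every atom displaced by exactly 1 unit along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(reference, shifted))  # 1.0
```

A lower value indicates a closer superposition; identical coordinate sets give an RMSD of zero.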
By using approaches such as VAST and VAST+, it is possible to find structural relationships between proteins in cases where simply looking at sequence similarity may not suggest relatedness – information that could, with additional data and insights, be used to help inform the question of functional similarity. More information on additional structure prediction methods based on X-ray or nuclear magnetic resonance (NMR) coordinate data can be found in Chapter 12.
Weighted Key Terms. The problem of comparing sequence or structure data somewhat pales next to that of comparing PubMed entries, which consist of free text whose rules of syntax are not necessarily fixed. Given that no two people's writing styles are exactly the same, finding a way to compare seemingly disparate blocks of text poses a substantial problem. Entrez employs a method known as the relevance pairs model of retrieval to make such comparisons, relying on weighted key terms (Wilbur and Coffee 1994; Wilbur and Yang 1996). This concept is best described by example. Consider two manuscripts with the following titles:
BRCA1 as a Genetic Marker for Breast Cancer
Genetic Factors in the Familial Transmission of the Breast Cancer BRCA1 Gene
Both titles contain the terms BRCA1, Breast, and Cancer, and the presence of these common terms may indicate that the manuscripts are similar in subject matter. The proximity between the words is also considered, so that words common to two records that are closer together are scored higher than common words that are farther apart. In the example, the terms Breast and Cancer are always next to each other, so they would score higher based on proximity than either of those words would when paired with BRCA1. Common words found in a title score higher than those found in an abstract, since title words are presumed to be “more important” than those found in the body of an abstract. Overall, weighting depends inversely on the frequency of a given word among all the entries in PubMed, with words that occur infrequently in the database assigned a higher weight while common words are down-weighted.
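The inverse-frequency weighting described above can be illustrated with a short sketch. This is not the relevance pairs model itself: it uses a generic logarithmic inverse document frequency as a stand-in, ignores proximity and the title-versus-abstract distinction, and the function names are chosen for this example. The third and fourth titles are hypothetical entries added only so that word frequencies differ across the toy collection.

```python
import math


def idf_weights(documents):
    """Weight each term by log(N / number of documents containing it)."""
    n = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in doc_freq.items()}


def shared_term_score(doc_a, doc_b, weights):
    """Sum the weights of the terms that two documents have in common."""
    return sum(weights.get(term, 0.0) for term in set(doc_a) & set(doc_b))


titles = [
    "BRCA1 as a Genetic Marker for Breast Cancer",
    "Genetic Factors in the Familial Transmission of the Breast Cancer BRCA1 Gene",
    "Cancer Screening in the Clinic",   # hypothetical title
    "The Gene for a Marker",            # hypothetical title
]
docs = [title.lower().split() for title in titles]
weights = idf_weights(docs)

# "brca1" appears in fewer titles than "cancer", so it carries more weight.
print(weights["brca1"] > weights["cancer"])
print(shared_term_score(docs[0], docs[1], weights))
```

In this toy collection, the two BRCA1 titles score higher against each other than either does against the screening title, because they share several relatively rare terms rather than a single common one.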
Hard Links
The hard link concept is simpler