Given the pace of scientific discovery, many more protein sequences were available in 1992 than in 1978, providing for a more robust base set of data from which to derive these new matrices. However, the most important distinction between the BLOSUM and PAM matrices is that the BLOSUM matrices are directly calculated across varying evolutionary distances and are not extrapolated, providing a more accurate view of substitution patterns (and, in turn, evolutionary forces) at those various distances. The fact that the BLOSUM matrices are calculated directly based only on conserved regions makes these matrices more sensitive to detecting structural or functional substitutions; therefore, the BLOSUM matrices perform demonstrably better than the PAM matrices for local similarity searches (Henikoff and Henikoff 1993).
Returning to the point of directly deriving the various matrices, each BLOSUM matrix is assigned a number (BLOSUMn), and that number represents the conservation level of the sequences that were used to derive that particular matrix. For example, the BLOSUM62 matrix is calculated from sequences sharing no more than 62% identity; sequences with more than 62% identity are clustered, and the combined contribution of each cluster is weighted to 1, so that the cluster counts as a single sequence. The clustering reduces the contribution of closely related sequences, meaning that there is less bias toward substitutions that occur (and may be over-represented) in the most closely related members of a family. Lower values of n therefore correspond to matrices derived from more distantly related sequences.
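The clustering and weighting step described above can be sketched in a few lines of Python. This is a minimal illustration and not the published BLOSUM procedure: the function names, the toy block of segments, and the use of single-linkage clustering over ungapped, equal-length segments are all assumptions made for the example. The strict "more than" comparison follows the wording of the text.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions between two equal-length, ungapped segments."""
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_weights(segments: list[str], threshold: float = 62.0) -> list[float]:
    """Return one weight per segment so that every cluster sums to 1.

    Assumption: segments are joined by single linkage whenever a pair
    shares more than `threshold` percent identity.
    """
    parent = list(range(len(segments)))

    def find(i: int) -> int:
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(segments)), 2):
        if percent_identity(segments[i], segments[j]) > threshold:
            parent[find(i)] = find(j)

    roots = [find(i) for i in range(len(segments))]
    sizes = {r: roots.count(r) for r in set(roots)}
    return [1.0 / sizes[r] for r in roots]


# Hypothetical block: three close relatives and one unrelated segment.
block = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHLKV", "MNPQRSTVWY"]
weights = cluster_weights(block, threshold=62.0)
```

In this toy block the three similar segments fall into one cluster and share a total weight of 1, whereas the unrelated segment keeps a full weight of 1 on its own, which is how clustering damps the influence of over-represented close relatives.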
Which Matrices Should Be Used When?
Although most bioinformatic software will provide users with a default choice of scoring matrix, the default is not necessarily the most appropriate choice for the biological question being asked. Table 3.1 is intended to provide some guidance on selecting a scoring matrix, based on studies that have examined the effectiveness of these matrices in detecting known biological relationships (Altschul 1991; Henikoff and Henikoff 1993; Wheeler 2003). Note that the numbering schemes for the two matrix families move in opposite directions: more divergent sequences are found using higher numbered PAM matrices and lower numbered BLOSUM matrices. The following equivalencies are useful in relating PAM matrices to BLOSUM matrices (Wheeler 2003):
PAM250 is equivalent to BLOSUM45
PAM160 is equivalent to BLOSUM62
PAM120 is equivalent to BLOSUM80.
In addition to the protein matrices discussed here, there are numerous specialized matrices that are either specific to a particular species, concentrate on particular classes of proteins (e.g. transmembrane proteins), focus on structural substitutions, or use hydrophobicity measures in attempting to assess similarity (see Wheeler 2003). Given this landscape, the most important take-home message for the reader is that no single matrix is the complete answer for all sequence comparisons. A thorough understanding of what each matrix represents is critical to performing proper sequence-based analyses.
Table 3.1 Selecting an appropriate scoring matrix.
Matrix | Best use | Similarity |
PAM40 | Short alignments that are highly similar | 70–90% |
PAM160 | Detecting members of a protein family | 50–60% |
PAM250 | Longer alignments of more divergent sequences | ∼30% |
BLOSUM90 | Short alignments that are highly similar | 70–90% |
BLOSUM80 | Detecting members of a protein family | 50–60% |
BLOSUM62 | Most effective in finding all potential similarities | 30–40% |
BLOSUM30 | Longer alignments of more divergent sequences | <30% |
The Similarity column gives the range of similarities that the matrix is able to best detect (Wheeler 2003).
Nucleotide Scoring Matrices
At the nucleotide level, the scoring landscape is much simpler. More often than not, the matrices used here simply count matches and mismatches. These matrices also assume that each of the four possible nucleotide bases occurs with equal frequency (25% of the time). In some cases, ambiguities or chemical similarities between the bases are also considered; this type of matrix is shown in Figure 3.2. The basic differences in the construction of nucleotide and protein scoring matrices make it clear that protein-based searches are generally more powerful than nucleotide-based searches of coding DNA sequences in determining similarity and inferring homology, given the inherently higher information content of the 20-letter amino acid alphabet versus the four-letter nucleotide alphabet.
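A simple match/mismatch matrix of the kind described above can be sketched as follows. The function names and the scores of +1 and -1 are illustrative assumptions, not values taken from any particular program, and the sketch ignores ambiguity codes and gaps.

```python
def make_nucleotide_matrix(match: int = 1, mismatch: int = -1) -> dict[tuple[str, str], int]:
    """Build a 4 x 4 scoring table that only distinguishes matches from mismatches."""
    bases = "ACGT"
    return {(a, b): (match if a == b else mismatch) for a in bases for b in bases}


def score_ungapped(seq1: str, seq2: str, matrix: dict[tuple[str, str], int]) -> int:
    """Sum the per-position scores of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq1.upper(), seq2.upper()))


# Illustrative values only: +1 for a match, -1 for a mismatch.
matrix = make_nucleotide_matrix(match=1, mismatch=-1)
score = score_ungapped("GATTACA", "GACTACA", matrix)  # 6 matches, 1 mismatch
```

Because every mismatch receives the same score, a table like this carries no information about which substitutions are more plausible, in contrast to the protein matrices discussed earlier.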
Gaps and Gap Penalties
Often, gaps are introduced to improve the alignment between two nucleotide or protein sequences. These gaps compensate for insertions and deletions between the sequences being studied; in essence, they represent biological events. As such, the number of gaps introduced into a pairwise sequence alignment needs to be kept reasonably small so as not to yield a biologically implausible scenario.
The scoring of gaps in pairwise sequence alignments is different from the scoring approaches discussed to this point, as no comparison between characters is possible: one sequence has a residue at some position and the other sequence has nothing. The most widely used method for scoring gaps involves a quantity known as the affine gap penalty. Here, a fixed deduction is made for introducing the gap; an additional deduction is made that is proportional to the length of the gap. The formula for the affine gap penalty is G + Ln, where G is the gap-opening penalty (the cost of creating the gap), L is the gap-extension penalty, and n is the length of the gap, with G > L. This last condition is important: because the gap-opening penalty is larger than the gap-extension penalty, lengthening an existing gap is favored over creating a new one. The values of G and L can be adjusted manually in most programs to make the insertion of gaps either more or less permissive, but most methods automatically adjust both G and L to the most appropriate values for the scoring matrix being used.
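The effect of the G > L condition can be seen in a short worked example. The sketch below implements the formula G + Ln directly; the function name and the values G = 10 and L = 1 are assumptions chosen for illustration and are not the defaults of any particular program.

```python
def affine_gap_penalty(n: int, gap_open: int, gap_extend: int) -> int:
    """Penalty for a single gap of length n, computed as G + L * n."""
    if n < 1:
        raise ValueError("gap length n must be at least 1")
    if gap_open <= gap_extend:
        raise ValueError("the gap-opening penalty G must exceed the gap-extension penalty L")
    return gap_open + gap_extend * n


G, L = 10, 1  # illustrative values only

# Four gapped positions arranged as one long gap ...
one_long_gap = affine_gap_penalty(4, G, L)

# ... versus the same four positions split into two separate gaps.
two_short_gaps = 2 * affine_gap_penalty(2, G, L)
```

With these values the single gap of length 4 costs 14, whereas two gaps of length 2 cost 24 in total, so an alignment that extends an existing gap is penalized less than one that opens a second gap.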
Figure 3.2 A nucleotide scoring table. The scoring for the four nucleotide bases is shown in the upper left of the figure, with the remaining one-letter codes specifying the IUPAC/IUBMB codes