Andreas D. Baxevanis
Introduction
One of the cornerstones of bioinformatics is the process of comparing nucleotide or protein sequences in order to deduce how the sequences are related to one another. Through this type of comparative analysis, one can draw inferences regarding whether two proteins have similar function, contain similar structural motifs, or have a discernible evolutionary relationship. This chapter focuses on pairwise alignments, where two sequences are directly compared, position by position, to deduce these relationships. Another approach, multiple sequence alignment, is used to identify important features common to three or more sequences; this approach, which is often used to predict secondary structure and functional motifs and to identify conserved positions and residues important to both structure and function, is discussed in Chapter 8.
Before entering into any discussion of how relatedness between nucleotide or protein sequences is assessed, two important terms need to be defined: similarity and homology. These terms tend to be used interchangeably when, in fact, they mean quite different things and imply quite different biological relationships.
Similarity is a quantitative measure of how related two sequences are to one another. Similarity is always based on an observable, usually a pairwise alignment of two sequences. When two sequences are aligned, one can simply count how many aligned positions hold identical residues, and this raw count can then be converted to the most commonly used measure of similarity: percent identity. Measures of similarity are used to quantify changes that occur as two sequences diverge over evolutionary time, considering the effect of substitutions, insertions, or deletions. They can also be used to identify residues that are crucial for maintaining a protein's structure or function. In short, a high percentage of sequence similarity may imply a common evolutionary history or a possible commonality in biological function.
In contrast, homology implies an evolutionary relationship and is the putative conclusion reached based on examining the optimal alignment between two sequences and assessing their similarity. Genes (and their protein products) either are or are not homologous – homology is not measured in degrees or percentages. The concept of homology and the term homolog may apply to two different types of relationships, as follows.
If genes are separated by the event of speciation, they are termed orthologous. Orthologs are direct descendants of a sequence in a common ancestor, and they may have similar domain structure, three-dimensional structure, and biological function. Put simply, orthologs can be thought of as the same gene (or protein) in different species.
If genes within the same species are separated by a genetic duplication event, they are termed paralogous. The examination of paralogs provides insight into how pre-existing genes may have been adapted or co-opted toward providing a new or modified function within a given species.
The concepts of homology, orthology, and paralogy and methods for determining the evolutionary relationships between sequences are covered in much greater detail in Chapter 9.
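As a concrete illustration of the percent identity measure introduced earlier in this section, the following minimal sketch counts identical columns in an alignment that has already been computed. The function name `percent_identity`, the `-` gap character, and the choice of denominator (every column that is not a gap in both sequences) are assumptions made for illustration; other conventions for the denominator are in use, and they give different percentages for the same alignment.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of two already-aligned sequences of equal length.

    Assumption: the denominator is the number of alignment columns,
    excluding columns where both sequences carry a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    identical = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # a column with no residue in either sequence
        compared += 1
        if a == b:
            identical += 1  # same residue in both sequences

    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


if __name__ == "__main__":
    # Four identical columns out of six aligned columns.
    print(round(percent_identity("ACGT-A", "ACCTGA"), 1))
```

Note that the result is a statement about similarity only. Whether the two sequences are homologous is a separate inference drawn from the alignment.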
Global Versus Local Sequence Alignments
The methods used to assess similarity (and, in turn, infer homology) can be grouped into two types: global sequence alignment and local sequence alignment. Global sequence alignment methods take two sequences and try to come up with the best alignment of the two sequences across their entire length. In general, global sequence alignment methods are most applicable to highly similar sequences of approximately the same length. Although these methods can be applied to any two sequences, as the degree of sequence similarity declines, they will tend to miss important biological relationships between sequences that may not be apparent when considering the sequences in their entirety.
Most biologists instead depend on the second class of alignment algorithm: local sequence alignments. In these methods, the sequence comparison is intended to find the most similar regions within the two sequences being aligned, rather than finding (or forcing) an alignment over the entire length of the two sequences being compared. Because these methods focus on subsequences of high similarity that are more easily alignable, determining putative biological relationships between the two sequences becomes a much easier proposition. This makes local alignment methods one of the approaches of choice for biological discovery. Often, these methods will return more than one result for the two sequences being compared, as there may be more than one domain or subsequence common to the sequences being analyzed. Local sequence alignment methods are best for sequences that share some degree of similarity or for sequences of different lengths, and the ensuing discussion will focus mostly on these methods.
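The difference between the two classes of method can be sketched with the classic dynamic-programming recurrences: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. The text above does not name specific algorithms, so these are brought in here purely for illustration. The function names and the scoring values (match +1, mismatch -1, linear gap penalty -2) are hypothetical choices, and only the optimal score is computed, not the alignment itself.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best end-to-end alignment score (Needleman-Wunsch recurrence)."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best score over all pairs of subsequences (Smith-Waterman recurrence)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at zero lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    motif = "GATTACA"
    longer = "CCCCGATTACACCCC"  # the motif embedded in unrelated flanking sequence
    print("global:", global_score(motif, longer))
    print("local: ", local_score(motif, longer))
```

With sequences of very different lengths, the global score is dominated by the gaps needed to span the longer sequence, while the local score reflects only the shared region. This is the behavior described above for sequences of different lengths.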
Scoring Matrices
Whether one uses a global or local alignment method, once the two sequences under consideration are aligned, how does one actually measure how good the alignment is between “sequence A” and “sequence B”? The first step toward answering that question involves numerical methods that consider not just the position-by-position overlap between two sequences but also the nature and characteristics of the residues or nucleotides being aligned.
Much effort has been devoted to the development of constructs called scoring matrices. These matrices are empirical weighting schemes that appear in all analyses involving the comparison of two or more sequences, so it is important to understand how these matrices are constructed and how to choose between matrices. The choice of matrix can (and does) strongly influence the results obtained with most sequence comparison methods.
The most commonly used protein scoring matrices consider the following three major biological factors.
1 Conservation. The matrices need to consider absolute conservation between protein sequences and also need to provide a way to assess conservative amino acid substitutions. The numbers within the scoring matrix provide a way of representing what amino acid residues are capable of substituting for other residues while not adversely affecting the function of the native protein. From a physicochemical standpoint, characteristics such as residue charge, size, and hydrophobicity (among others) need to be similar.
Figure 3.1 The BLOSUM62 scoring matrix (Henikoff and Henikoff 1992). BLOSUM62 is the most widely used scoring matrix for protein analysis and provides best coverage for general-use cases. Standard single-letter codes to the left of each row and at the top of each column specify each of the 20 amino acids. The ambiguity codes B (for asparagine or aspartic acid; Asx) and Z (for glutamine or glutamic acid; Glx) also appear, as well as an X (denoting any amino acid). Note that the matrix is a mirror image of itself with respect to the diagonal. See text for details.
2 Frequency. In the same way that amino acid residues cannot freely substitute for one another, the matrices also need to reflect how often particular residues occur among the entire constellation of proteins. Residues that are rare are given more weight than residues that are more common.
3 Evolution. By design, scoring matrices implicitly represent evolutionary patterns, and matrices can be adjusted to favor the detection of closely related or more distantly related proteins. The choice of matrices for different evolutionary distances is discussed below.
There are also subtle nuances that go into constructing a scoring matrix, and these are described in an excellent review by Henikoff and Henikoff (2000).
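The conservation and frequency factors listed above can be made concrete by scoring a short ungapped alignment with a handful of substitution scores. In the sketch below, the eight entries in `SUBSET` are intended to match the corresponding BLOSUM62 values and should be checked against the published matrix before being relied on. The names `SUBSET`, `pair_score`, and `ungapped_score` are hypothetical, and a real analysis would load the complete matrix rather than a hand-typed fragment.

```python
# A few substitution scores, assumed to match BLOSUM62 (verify before reuse).
SUBSET = {
    ("A", "A"): 4,   # identity, common residue
    ("L", "L"): 4,   # identity, common residue
    ("W", "W"): 11,  # identity, rare residue: weighted more heavily
    ("C", "C"): 9,   # identity, rare residue
    ("I", "L"): 2,   # conservative: both hydrophobic
    ("D", "E"): 2,   # conservative: both acidic
    ("K", "R"): 2,   # conservative: both basic
    ("D", "L"): -4,  # non-conservative: acidic vs. hydrophobic
}


def pair_score(x: str, y: str, table: dict = SUBSET) -> int:
    """Look up a score, treating the table as symmetric about its diagonal."""
    if (x, y) in table:
        return table[(x, y)]
    if (y, x) in table:
        return table[(y, x)]
    raise KeyError(f"no score for {x}/{y} in this illustrative subset")


def ungapped_score(a: str, b: str, table: dict = SUBSET) -> int:
    """Sum the substitution scores over two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y, table) for x, y in zip(a, b))


if __name__ == "__main__":
    # W/W identity plus two conservative substitutions (L/I and D/E).
    print(ungapped_score("WLD", "WIE"))
```

The symmetric lookup mirrors the observation that the matrix is a mirror image of itself with respect to the diagonal, so only one triangle needs to be stored.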
How these various factors are actually represented within a scoring matrix can be best demonstrated by deconstructing the most commonly used scoring matrix, called BLOSUM62 (Figure 3.1). Each of the 20 amino acids (as well as the standard ambiguity codes) is shown along the top and down the side of a matrix. The scores in the matrix actually represent the logarithm of an odds ratio (Box