2.6. Effect on the peptide backbone
In the ΔΔG calculation methods mentioned above, the main chain is fixed; only the side chains are allowed to move. The range of motion of each side chain is taken from rotamer libraries (Dunbrack 2002), and its degrees of freedom depend strongly on the conformation of the main chain. Taking the deformation of the main chain into account should therefore improve our understanding of the consequences of a substitution. Nevertheless, it is legitimate to ask whether the backbone is really disrupted by a single substitution, at least to an extent that we can observe, and whether this disruption is measurable with current techniques. Nowadays, the resolution of protein structures determined by X-ray crystallography is of the order of 1 Å, which means that the precision of the structures obtained is of the order of 0.1 Å (Luzzati 1952). The displacement of the backbone atoms caused by a substitution is typically also about 0.1 Å, which puts us at the limit of what can be detected.
In addition, structures of proteins with identical sequences can vary greatly because of different protein–protein interactions or interactions with different ligands or solvents (Kosloff and Kolodny 2008). However, provided that a sufficient number of structures are available, the effect of the mutation can be distinguished from “noise”, in other words, from statistical fluctuations due to other sources, such as solvent exposure or membership of a secondary structure element (Shanthirabalan et al. 2018). This requires measuring the variations between the native and mutated proteins while accounting for both the global variability and the local flexibility of the structure.
The objective is therefore to determine whether a given structure (generally the mutant) differs locally, for example at the site of the mutation, from another so-called “reference” structure (generally the native structure). The two proteins have almost identical sequences (they differ by a point mutation), so the alignment (correspondence) between their amino acids is easy to find. The two structures (in green and red, Figure 2.6(a)) are then superimposed so as to minimize the distances between corresponding pairs of Cα atoms. Next, the distances between the Cα atoms of the two structures are measured over small fragments (in blue, Figure 2.6(a)) of three consecutive Cα atoms by calculating the RMSD (root-mean-square deviation). A length of three residues was chosen because it is the length for which the measured local effect is the strongest (Shanthirabalan et al. 2018). This calculation is performed for all residues, giving a “profile” of RMSDs (Figure 2.6(b), graph) for a given protein pair.
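The procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the Cα coordinates of the two structures have already been extracted and paired by the sequence alignment. It uses the Kabsch algorithm for the global superimposition, since the text does not name the fitting method. The function names `superimpose` and `local_rmsd_profile` are hypothetical.

```python
import numpy as np

def superimpose(ref, mob):
    """Rigidly fit `mob` onto `ref` (two (n, 3) arrays of paired Cα
    coordinates) with the Kabsch algorithm; return the fitted copy of `mob`."""
    ref_center = ref.mean(axis=0)
    mob_center = mob.mean(axis=0)
    P = mob - mob_center                 # centered mobile coordinates
    Q = ref - ref_center                 # centered reference coordinates
    H = P.T @ Q                          # 3 x 3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return P @ R.T + ref_center

def local_rmsd_profile(ref, mob_fit, window=3):
    """RMSD over every fragment of `window` consecutive Cα atoms.
    Returns n - window + 1 values (n - 2 for fragments of three)."""
    sq_dist = np.sum((ref - mob_fit) ** 2, axis=1)   # squared Cα-Cα distances
    kernel = np.ones(window) / window                # mean over each fragment
    return np.sqrt(np.convolve(sq_dist, kernel, mode="valid"))
```

With `fit = superimpose(ref, mob)`, the call `local_rmsd_profile(ref, fit)` returns one value per fragment. The value at index `k` describes the fragment centered on residue `k + 1` (0-based), so the two terminal residues have no RMSD of their own.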
In order to determine whether RMSDs tend to be larger at mutation sites, it is necessary to compare several structures that differ from one another by only a single point mutation. In the PDB, there are 11 families with at least 20 mutants carrying a single substitution relative to a reference structure, for a total of 580 mutants and 11 reference structures (Shanthirabalan et al. 2018). When the profiles of proteins from the same family are compared, the regions with the largest RMSDs are often the same, regardless of the position of the mutation. Overall, some proteins “move” very little, while others “move” a lot (Figure 2.7). It is therefore difficult to know whether a large RMSD is due to the mutation, to the intrinsic flexibility of the protein or to an overall variability resulting from differing crystallization conditions. Indeed, when the distribution of RMSDs centered on mutated residues is compared with the distribution of RMSDs for all positions in the structures of the 11 families, the two distributions overlap substantially (Figure 2.8(a)). To better isolate the effect of the mutations, the two other sources of variability must be neutralized.
Figure 2.6. Procedure for calculating “local” RMSDs. For a color version of this figure, see www.iste.co.uk/grandcolas/systematics.zip
COMMENT ON FIGURE 2.6. – a) Superimposition of two lysozyme structures. b) Their difference is measured by the RMSD calculated for fragments of three successive residues, such as the fragment in blue. The RMSD is the square root of the mean of the squared distances D between the paired alpha carbons ai and b’i (i.e. bi after superimposition), that is, the sum of the N squared distances divided by the number N of Cα pairs (three in our case). This calculation is repeated along the whole protein. There are as many RMSDs as residues in the protein (except for the two Cα atoms at the extremities), which gives a profile (graph below). The position of the mutation is marked by the cross (protein 2hef chain A, mutation I89A).
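Written out, and assuming the standard definition of the RMSD in which the distances are squared before averaging, the quantity computed for each fragment is:

```latex
\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} D_i^{2}},
\qquad D_i = \lVert a_i - b'_i \rVert,
\qquad N = 3
```

Here a_i and b'_i are the positions of the paired Cα atoms of the fragment, b'_i being taken after superimposition.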
Figure 2.7. RMSD calculated for 78 mutants of a transferase, the reference being that of Pyrococcus horikoshii (PDB code 2dek chain A). For a color version of this figure, see www.iste.co.uk/grandcolas/systematics.zip
Figure 2.8. Distribution of RMSD, p-values and p-ranks. For a color version of this figure, see www.iste.co.uk/grandcolas/systematics.zip
COMMENT ON FIGURE 2.8. – a) Distribution of RMSDs for all positions (crosshatched) or for mutated residues alone (blue), for 580 mutations distributed among 11 families. b) Distribution of the empirical p-values (crosshatched) and p-ranks (red) of mutated residues. The black line represents the uniform distribution of p-values or p-ranks, in other words, the distribution expected if 580 RMSDs were randomly drawn from the crosshatched distribution of (a).
In order to take into account the global variability due to differences in experimental conditions, the raw RMSD should not be used directly, but rather a transformation of it. Replacing values by their ranks is a robust transformation used in many statistical tests. The RMSDs in each profile are first ranked in ascending order, and the ranks are then divided by the number of RMSDs in the profile (in other words, essentially the length of the chain). The results are dimensionless, normalized values, referred to as empirical p-values, which make the profiles of the different proteins in a family comparable. If the mutations had no particular effect on the RMSD, the distribution of these p-values at the mutated positions would be uniform, which is not what is observed (Figure 2.8(b)). This first transformation accounts for the experimental variability, but not for the intrinsic flexibility of the molecule. Indeed, in very flexible regions, since the RMSD is large, its rank, and thus the empirical p-value, will also always be large. It is therefore necessary to perform a second ranking, this time of the empirical p-values, for each position in each family. The resulting value is called the p-rank in order to distinguish it from the first empirical p-value.
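The two successive rankings can be illustrated with a short sketch. It rests on several assumptions that are not stated in the text. The RMSD profiles of one family are stored as a matrix with one row per mutant and one column per position of the reference structure. The second ranking is read as being made across the mutants of the family at each position. Ties are broken by order of appearance. The function names `fractional_ranks` and `p_values_and_p_ranks` are hypothetical.

```python
import numpy as np

def fractional_ranks(values, axis):
    """Ascending ranks (1 = smallest) along `axis`, divided by the number
    of values on that axis. Ties are broken by order of appearance,
    which is a simplifying assumption."""
    order = np.argsort(values, axis=axis, kind="stable")
    ranks = np.argsort(order, axis=axis, kind="stable") + 1
    return ranks / values.shape[axis]

def p_values_and_p_ranks(rmsd):
    """`rmsd` has shape (n_mutants, n_positions) for a single family."""
    # First ranking: within each profile, to absorb the global variability.
    p_values = fractional_ranks(rmsd, axis=1)
    # Second ranking: at each position across the family, to absorb
    # the intrinsic flexibility of that position.
    p_ranks = fractional_ranks(p_values, axis=0)
    return p_values, p_ranks
```

Under these assumptions, a position that is flexible in every structure receives a large empirical p-value in every profile, but its p-ranks are spread over the whole interval from 0 to 1. A large p-rank therefore marks a mutant in which that position is unusually displaced relative to the rest of the family.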
The site of the mutation is the one most likely to be disturbed, at least in intensity (that is, to show a high RMSD value). Among all the calculated RMSDs, 580 correspond to a mutated position. As with the empirical p-values, the p-rank distribution of RMSDs centered on mutated residues is not uniform (Figure 2.8(b)). Among the largest 5% of RMSDs, 12% are mutation-centered; among the top 5% of p-ranks, 15% are mutation-centered;