THE GENETIC CODE
Amino Acids and Proteins
Proteins are made up of individual amino acid building blocks. Amino acids contain both a carboxyl group, which readily gives an H+ to water and is therefore acidic, and a basic amino group (‐NH2), which readily accepts H+ to become ‐NH3+. Figure 3.7a shows two amino acids, leucine and γ‐aminobutyric acid (GABA), in the form in which they are found at normal pH: the carboxyl groups have each lost an H+ and the amino groups have each gained one, so that the molecules bear both a negative and a positive charge.
We name organic acids by labeling the carbon adjacent to the carboxyl group α, the next one β, and so on. When we add an amino group, making an amino acid, we state the letter of the carbon to which the amino group is attached. Hence leucine is an α‐amino acid, while GABA, γ‐aminobutyric acid, carries its amino group on the γ carbon. α‐Amino acids are the building blocks of proteins. They have the general structure shown in Figure 3.7b, where R is the side chain. Leucine has a simple side chain of carbon and hydrogen. Other amino acids have different side chains and so have different properties. It is the diversity of amino acid side chains that gives proteins their characteristic properties (page 104).
α‐Amino acids can link together to form long chains through the formation of a peptide bond between the carboxyl group of one amino acid and the amino group of the next. Figure 3.7c shows the generalized structure of such a chain of α‐amino acids. If there are fewer than about 50 amino acids in a polymer we tend to call it a peptide; if there are more, it is a polypeptide. Polypeptides that fold into a specific shape are proteins.
Reading the Genetic Code
It is the sequence of bases along the DNA strand that determines the sequence of the amino acids in proteins. There are four different bases in DNA (G, A, T, and C). Each amino acid is specified by a codon, a group of three bases. Because there are four bases in DNA, a three‐letter code gives 64 (4 × 4 × 4) possible codons. These 64 codons form the genetic code – the set of instructions that tells a cell the order in which amino acids are to be joined together to form a protein (Figure 3.8). Despite the fact that the linear sequence of codons in DNA determines the linear sequence of amino acids in proteins, the DNA helix does not itself play a role in protein synthesis. The translation of the sequence from codons into amino acids occurs through the intervention of members of a third class of molecule – mRNA. Messenger RNA acts as a template, guiding the assembly of amino acids into a polypeptide chain. Messenger RNA uses the same code as the one used in DNA with one difference: in mRNA the base uracil (U) is used in place of thymine (T). When we write the genetic code we usually use the RNA format, that is, we use U instead of T.
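The arithmetic behind the 64 codons, and the convention of writing the code in RNA format, can be sketched in a few lines of Python. This is an illustrative sketch only: the function name dna_to_rna is our own, and it assumes the DNA sequence supplied is the strand that carries the same sequence as the mRNA, so that the only change needed is to write U in place of T.

```python
from itertools import product

BASES = "UCAG"  # the four RNA bases; the ordering is an arbitrary choice for this sketch

# Every ordered triplet drawn from four bases: 4 x 4 x 4 = 64 codons.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]


def dna_to_rna(dna: str) -> str:
    """Rewrite a DNA sequence in RNA format by replacing T with U.

    Assumption: `dna` is the strand whose sequence matches the mRNA.
    """
    return dna.upper().replace("T", "U")


print(len(codons))           # 64
print(dna_to_rna("ATGCTT"))  # AUGCUU
```

The list codons contains each triplet exactly once, which is why a three‐letter code over four bases can specify at most 64 different meanings.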
The code is read in sequential groups of three, codon by codon. Adjacent codons do not overlap and each triplet of bases specifies one particular amino acid. This discovery was made by Sydney Brenner, Francis Crick, and their colleagues by studying the effect of various mutations (changes in the DNA sequence) on the bacteriophage T4, which infects the common bacterium E. coli. If a mutation caused either one or two nucleotides to be added to or deleted from one end of the T4 DNA, then a defective polypeptide was produced, with a completely different sequence of amino acids. However, if three bases were added or deleted, then the protein made often retained its normal function. These proteins were found to be identical to the original protein, except for the addition or loss of one amino acid.
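The effect of adding one base versus three can be illustrated with a short Python sketch. The sequence below is hypothetical, chosen only to make the shift easy to see, and read_codons is our own helper name; the sketch assumes that reading starts at the first base and that an incomplete final triplet is simply ignored.

```python
def read_codons(seq: str) -> list[str]:
    """Split a sequence into adjacent, non-overlapping triplets.

    Any incomplete triplet at the end is dropped.
    """
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


original = "AUGGCUCAUGGAUCC"                        # hypothetical sequence
one_added = original[:3] + "U" + original[3:]       # one extra base after the first codon
three_added = original[:3] + "UUU" + original[3:]   # three extra bases at the same point

print(read_codons(original))     # ['AUG', 'GCU', 'CAU', 'GGA', 'UCC']
print(read_codons(one_added))    # ['AUG', 'UGC', 'UCA', 'UGG', 'AUC']
print(read_codons(three_added))  # ['AUG', 'UUU', 'GCU', 'CAU', 'GGA', 'UCC']
```

With one base added, every codon after the insertion point is different, so the amino acid sequence downstream changes completely. With three bases added, one new codon appears and all the original codons are read as before.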
The identification of the triplets encoding each amino acid began in 1961. This was made possible by using a cell‐free protein synthesis system prepared by breaking open E. coli cells. Synthetic RNA polymers of known sequence were added to the cell‐free system together with the 20 amino acids. When the RNA template contained only uridine residues (poly‐U) the polypeptide produced contained only phenylalanine; the codon UUU must therefore specify phenylalanine. A poly‐A template produced a polypeptide of lysine and a poly‐C template one of proline: AAA and CCC must therefore specify lysine and proline, respectively. Synthetic RNA polymers containing all possible combinations of the bases G, A, U, and C were added to the cell‐free system to determine the codons for the other amino acids. A template made of the repeating unit CU gave a polypeptide with the alternating sequence leucine–serine. Because the first amino acid in the chain was found to be leucine, CUC must code for leucine and UCU must code for serine. Although much of the genetic code was read in this way, the amino acids defined by some codons were particularly hard to determine. Only when specific transfer RNA molecules (page 85) were used was it possible to demonstrate that GUU codes for valine. The genetic code was finally solved by the combined efforts of several research teams. The leaders of two of these, Marshall Nirenberg and Har Gobind Khorana, received the Nobel Prize in 1968 for their part in cracking the code.
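The logic of these template experiments can be mimicked in Python. The sketch below uses only the six codon assignments named in the paragraph above, so the table is deliberately incomplete. The names CODONS_FROM_TEXT and translate are our own, and the sketch assumes that reading begins at the first base of the template, which is a simplification of what happens in a real cell‐free system.

```python
# Partial table: only the assignments stated in the text.
CODONS_FROM_TEXT = {
    "UUU": "Phe",
    "AAA": "Lys",
    "CCC": "Pro",
    "CUC": "Leu",
    "UCU": "Ser",
    "GUU": "Val",
}


def translate(template: str, table: dict[str, str] = CODONS_FROM_TEXT) -> list[str]:
    """Read a template three bases at a time and look up each codon.

    Raises KeyError for any codon missing from the partial table.
    """
    usable = len(template) - len(template) % 3
    return [table[template[i:i + 3]] for i in range(0, usable, 3)]


print(translate("U" * 12))   # ['Phe', 'Phe', 'Phe', 'Phe']
print(translate("CU" * 6))   # ['Leu', 'Ser', 'Leu', 'Ser']
```

The repeating CU template shows why a two‐base repeat yields an alternating polypeptide: read in threes, CUCUCU... contains only the codons CUC and UCU, one after the other.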
Amino Acid Names Are Abbreviated
To save time we usually write an amino acid as either a three‐letter abbreviation, for example, glycine is written as Gly and leucine as Leu, or as a one‐letter code, for example, glycine is G and leucine is L. Figure 3.9 shows the full name and the three‐ and one‐letter abbreviations used for each of the 20 amino acids found in proteins.
The Code Is Degenerate but Unambiguous
To introduce the terms degenerate and ambiguous, consider the English language. English shows considerable degeneracy, meaning that the same concept can be indicated using a number of different words – think, for example, of lockup, cell, pen, pound, brig, and dungeon. English also shows ambiguity, so that it is only by context that one can tell whether cell means a lockup or a living aqueous droplet enclosed by a membrane. Like the English language the genetic code shows degeneracy but, unlike language, the genetic code is unambiguous.
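Both properties can be checked directly against a codon table. The Python sketch below builds the standard genetic code from a compact 64‐letter string of one‐letter amino acid codes; this is a common way of tabulating the code and is not copied from Figure 3.8. The symbol * marks codons that specify no amino acid (stop codons), and the variable names are our own.

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for the 64 codons, ordered by first, second,
# then third base, each running through U, C, A, G. "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Unambiguous: a dictionary maps each codon to exactly one meaning.
GENETIC_CODE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Degenerate: invert the table to see that one amino acid can have several codons.
codons_for = defaultdict(list)
for codon, aa in GENETIC_CODE.items():
    codons_for[aa].append(codon)

print(GENETIC_CODE["CUC"])  # L
print(codons_for["L"])      # ['UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG']
print(codons_for["W"])      # ['UGG']
```

Leucine is specified by six different codons, which is degeneracy, yet looking up any one codon always returns a single answer, which is what we mean by unambiguous.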