Then, the perplexity PP is
where n is the length of the total sequences.
A simple uni-gram (context-independent amino acid) model was trained by the 90 percent proteins from Borrelia_burgdorferi. The perplexity of the other 10 percent proteins and proteins from the other 19 organisms were calculated. Table 2-2 provides detailed results. Different organisms have different perplexities, which indicates that the different “dialects” in proteins may be embodied in the organisms. Another important phenomenon is that the perplexity is independent of the size of the testing proteins. To validate this, another experiment was carried out. The proteins of A_thaliana were used to train the uni-gram model, and then the human proteins were split into 10 shares randomly so as to calculate the perplexity. The results obtained were: 18.2049 18.2091 18.153 18.1905 18.2698 18.2101 18.1556 18.1495 18.3173 18.1925. These perplexities change at a small scale. So, the perplexity is related to the uni-gram model and the organism used to test it and has no relation to the size of testing proteins.
Table 2-2 The perplexities of different organisms.
2.5Conclusions
In this chapter, the n-gram and linguistic features of whole genome protein sequences have been analyzed. The results show that (1) the n-grams of whole genome protein sequences approximately follow Zipf’s law when n is larger than 4, (2) the Shannon n-gram entropy of the natural genome proteins is lower than that of artificial proteins, (3) a simple uni-gram model can distinguish different organisms, (4) there are organism-specific usages of “phrases” in protein sequences. Further work will aim at detailed identification of these “phrases” and the building of a “biological language” which has special words, phrases and syntaxes to map out the relationship of protein sequence, structure and function.
References
[1]Anfinsen C.B. Principles that govern the folding of protein chains. Science, 1973, 181(4096): 223–230.
[2]Mantegna R.N., Buldyrev S.V., Goldberger A.L., Havlin S., Peng C.K., Simons M., Stanley H.E. Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics, 1995, 52(3): 2939–2950.
[3]Chatzidimitriou-Dreismann C.A., Streffer R.M., Larhammar D. Lack of biological significance in the ‘linguistic features’ of noncoding DNA — A quantitative analysis. Nucleic Acids Res, 1996, 24(9): 1676–1681.
[4]Tsonis A.A., Elsner J.B., Tsonis P.A. Is DNA a language? J Theor Biol, 1997, 184(1): 25–29.
[5]Voss R.F. Comment on “Linguistic features of noncoding DNA sequences”. Phys Rev Lett, 1996, 76(11): 1978.
[6]Burge C., Campbell A.M., Karlin S. Over- and under-representation of short oligonucleotides in DNA sequences. Proc Natl Acad Sci USA, 1992, 89(4): 1358–1362.
[7]Ganapathiraju M., Weisser D., Rosenfeld R., Carbonell J., Reddy R., Klein-Seetharaman J. Comparative n-gram analysis of whole-genome protein sequences. In Proceedings of the Human Language Technologies Conference, San Diego 2002, pp. 1367–1375.
[8]Rigoutsos I., Huynh T., Floratos A., Parida L., Platt D. Dictionary-driven protein annotation. Nucleic Acids Res, 2002, 30(17): 3901–3916.
[9]Ganapathiraju M., Klein-Seetharaman J., Balakrishnan N., Reddy R. Characterization of protein secondary structure — Application of latent semantic analysis using different vocabularies. IEEE Signal Processing Magazine, 2004, 21: 78–87.
[10]Coin L., Bateman A., Durbin R. Enhanced protein domain discovery by using language modeling techniques from speech recognition. Proc Natl Acad Sci USA, 2003, 100(8): 4516–4520.
[11]Charniak E. Statistical Language Learning. 1996. Cambridge, MA: MIT Press, p. 192.
[12]Manber U., Myers G. Suffix arrays: A new method for on-line string searches. SIAM J Comput, 1993, 22(5): 935–948.
[13]Kasai T., Lee G., Arimura H., Arikawa S., Park K. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching. 2001, Jerusalem, Israel: Springer-Verlag, pp. 181–192.
[14]Boeckmann B., Bairoch A., Apweiler R., Blatter M.C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O’Donovan C., Phan I., Pilbout S., Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res, 2003, 31(1): 365–370.
[15]Kingsley Zipf G. Human Behavior and the Principle of Least Effort. SERBIULA (sistema Librum 2.0), 1948, II.
Chapter 3
Amino Acid Encoding for Protein Sequence
3.1Motivation and Basic Idea
The digital representation of amino acids is usually called feature extraction, the amino acid encoding scheme, the residue encoding scheme, etc. Here, we use amino acid encoding as the terminology of choice. It should be noted that amino acid encoding is different from protein sequence encoding. Protein sequence encoding represents the entire protein sequence by using an n-dimensional vector, such as the n-gram,1 pseudo amino acid composition,2,3 etc. Since the amino acid-specific information is lost, protein sequence encoding can be only used to predict sequence-level properties (i.e. protein fold recognition). Amino acid encoding represents each amino acid of a protein sequence by using different n-dimensional vectors; thus, its vector space for a protein sequence is n∗L (L denotes the length of the protein sequence). By combining with different machine learning methods, amino acid encoding can be used in protein property prediction both at the residue level and the sequence level (i.e. protein fold recognition, secondary structure prediction, etc). In the past decades, various amino acid encoding methods have been proposed from different perspectives.4–6 The most widely used encodings are one-hot encoding, position-specific scoring matrix (PSSM) encoding, and some physicochemical property encodings. In addition to those encodings, some other encodings have also been proposed, such as the encoding estimated from interresidue contact energies,7 the encoding learned from protein structure alignments8 and the encoding learned from sequence context.9 These encoding methods explore amino acid encoding from new perspectives, and can be the complement of the above encodings. Kawashima et al.10 have proposed a database of numerical indices of amino acids and amino acid pairs, and this contains information on the physicochemical and biochemical properties of amino acids.
3.2Related Work
3.2.1 Binary encoding
The binary encoding methods use multidimensional binary digits (0 and 1) to represent amino acids in protein sequences. The most commonly used binary encoding is one-hot encoding, which is also called orthogonal encoding.5 For one-hot encoding, each of the 20 amino acids is represented by a 20-dimensional binary vector. Specifically, the 20 standard amino acids are fixed in a specific order, and then the ith amino acid type is represented by 20 binary bits with the ith bit set to “1” and others to “0”. There is only one bit equal to “1” for each vector; hence, it is called “one-hot”. For example, the twenty standard amino acids are sorted as [A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y]; the one-hot code of “A” is 100000000000000000000,