For the Normal distribution the normalization is easiest to get via complex integration (so we will skip that derivation here). With mean zero and variance equal to one (Figure 2.4) we get:

N_x(0, 1) = (1/√(2π)) exp(−x²/2).
2.6.2.3 Significant Distributions That Are Not Gaussian or Geometric
Nongeometric duration distributions occur in many familiar areas, such as the lengths of spoken words in phone conversations, as well as other areas in voice recognition. Although the Gaussian distribution occurs in many scientific fields (an observed embodiment of the central limit theorem, among other things), there are also a huge number of significant (observed) skewed distributions, such as heavy-tailed (or long-tailed) distributions, multimodal distributions, etc.
Heavy-tailed distributions are widespread in descriptions of phenomena across the sciences. The log-normal and Pareto distributions are heavy-tailed distributions that are almost as common as the normal and geometric distributions in descriptions of physical or man-made phenomena. The Pareto distribution was originally used to describe the allocation of wealth in society, captured by the famous 80–20 rule: roughly 80% of the wealth is owned by a small fraction of the people, while "the tail," the large majority of the people, holds the remaining 20%. The Pareto distribution has since been extended to many other areas. For example, internet file-size traffic follows a long-tailed distribution, that is, there are a few very large files and many small files to be transferred. This distributional assumption is an important factor in designing a robust and reliable network, and the Pareto distribution can be a suitable choice for modeling such traffic. (Internet applications exhibit many other heavy-tailed distribution phenomena.) Pareto distributions can also be found in many other fields, such as economics.
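To make the 80–20 picture concrete, the short sketch below draws samples from a Pareto (Type I) distribution and reports what fraction of the total is held by the top 20% of the sample. The shape parameter alpha = 1.16 is an illustrative assumption (it is approximately the value for which the idealized Pareto model gives an exact 80–20 split), not something fixed by the text.

```python
import random

def pareto_sample(n, alpha=1.16, x_min=1.0):
    """Draw n samples from a Pareto (Type I) distribution by inverse
    transform sampling: X = x_min / U**(1/alpha), U uniform on (0, 1].
    (random.paretovariate(alpha) gives the same thing for x_min = 1.)"""
    return [x_min / (1.0 - random.random()) ** (1.0 / alpha) for _ in range(n)]

random.seed(0)
wealth = sorted(pareto_sample(100_000), reverse=True)
top_20 = wealth[: len(wealth) // 5]
share = sum(top_20) / sum(wealth)
# Typically near 0.8, though the heavy tail makes it fluctuate run to run.
print(f"share of total held by the top 20%: {share:.2f}")
```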
Figure 2.4 The Gaussian distribution, aka Normal, shown with mean zero and variance equal to one: N_x(μ, σ²) = N_x(0, 1).
Log-normal distributions are used in geology and mining, medicine, environmental and atmospheric science, and so on, where skewed distributions are very common. In geology, the concentrations of elements and their radioactivity in the Earth's crust are often found to be log-normally distributed. The infection latent period, the time from infection to the onset of disease symptoms, is often modeled as log-normal. In the environment, the distribution of particles, chemicals, and organisms is often log-normal. Many atmospheric physical and chemical properties obey the log-normal distribution. The density of bacterial populations often follows a log-normal law. In linguistics, the number of letters per word and the number of words per sentence fit log-normal distributions. The length distribution for introns, in particular, has very strong support in an extended heavy-tail region, as does the length distribution for exons or open reading frames (ORFs) in genomic deoxyribonucleic acid (DNA). The anomalously long-tailed aspect of the ORF-length distribution is the key distinguishing feature of this distribution, and it has been the key attribute used by biologists, via ORF finders, to identify likely protein-coding regions in genomic DNA since the early days of (manual) gene-structure identification.
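As a concrete illustration of where an ORF-length distribution comes from, the sketch below tabulates ORF lengths (ATG to the next in-frame stop codon) in the three forward reading frames of a sequence string. The random sequence used here is only a stand-in: it shows the short-ORF "background," whereas real genomic DNA containing protein-coding regions produces the anomalously long tail discussed above.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}

def orf_lengths(seq, frame=0):
    """Lengths (in bases, stop codon included) of ORFs in one forward
    reading frame: each ORF runs from an ATG to the next in-frame stop."""
    lengths = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            lengths.append(i + 3 - start)
            start = None
    return lengths

# Stand-in sequence (random, so no real coding regions are present).
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(30_000))
lengths = [n for frame in range(3) for n in orf_lengths(seq, frame)]
print("ORFs found:", len(lengths), " longest (bases):", max(lengths))
```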
2.6.3 Series
A series is a mathematical object consisting of an ordered sequence of numbers, variables, or observation values. When observations describe equilibrium or "steady-state" behavior, emergent phenomena familiar from physical reality, we often see series that are martingale. The martingale property can be seen in systems reaching equilibrium in both the physical setting and the algorithmic learning setting.
A discrete-time martingale is a stochastic process in which a sequence of r.v.s {X_1, …, X_n} has conditional expected value of the next observation equal to the last observation: E(X_{n+1} | X_1, …, X_n) = X_n, where E(|X_n|) < ∞. Similarly, one sequence, say {Y_1, …, Y_n}, is said to be martingale with respect to another, say {X_1, …, X_n}, if for all n: E(Y_{n+1} | X_1, …, X_n) = Y_n, where E(|Y_n|) < ∞. Examples of martingales are rife in gambling. For our purposes, the most critical example is likelihood-ratio testing in statistics, with the test statistic, the "likelihood ratio," given as Y_n = ∏_{i=1}^{n} g(X_i)/f(X_i), where f and g are the population densities considered for the data. If the better (actual) distribution is f, then Y_n is martingale with respect to X_n. This scenario arises throughout the hidden Markov model (HMM) Viterbi derivation if local "sensors" are used, such as with profile-HMMs or position-dependent Markov models in the vicinity of transitions between states. This scenario also arises in HMM Viterbi recognition of regions (versus transitions out of those regions), where length-martingale side information will be explicitly shown in Chapter 7, providing a pathway for incorporation of any martingale-series side information (this fits naturally with the clique-HMM generalizations described in Chapter 7). Given that the core ratio of cumulant probabilities that is employed is itself a martingale, this then provides a means for incorporating side information in general (further details in Appendix C).
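A quick way to see the likelihood-ratio martingale property is by simulation. The sketch below draws the X_i from the true density f and tracks Y_n = ∏ g(X_i)/f(X_i); the choice of f and g as unit-variance Normal densities with means 0 and 0.2 is purely an illustrative assumption. Since E(Y_{n+1} | history) = Y_n, the mean of Y_n stays at 1 for every n, which the Monte Carlo average should confirm, even though any individual path tends to drift toward zero.

```python
import math
import random

def likelihood_ratio_path(n, mu_f=0.0, mu_g=0.2, sigma=1.0):
    """Simulate Y_n = prod_{i=1..n} g(X_i)/f(X_i), with the X_i drawn
    i.i.d. from f (the true density); f and g are Normal densities here."""
    def normal_pdf(x, mu):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

    y, path = 1.0, []
    for _ in range(n):
        x = random.gauss(mu_f, sigma)      # data really come from f
        y *= normal_pdf(x, mu_g) / normal_pdf(x, mu_f)
        path.append(y)
    return path

# E(Y_n) = 1 for all n by the martingale property; check by averaging.
random.seed(1)
n, trials = 20, 20_000
estimate = sum(likelihood_ratio_path(n)[-1] for _ in range(trials)) / trials
print(f"Monte Carlo estimate of E(Y_n) at n={n}: {estimate:.3f}")  # near 1.0
```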
2.7 Exercises
2.1 Evaluate the Shannon entropy, by hand, for the fair-die probability distribution: (1/6, 1/6, 1/6, 1/6, 1/6, 1/6), for the probability of rolling a 1 through a 6 (all the same, 1/6, for the uniform probability distribution). Also evaluate for the loaded die: (1/10, 1/10, 1/10, 1/10, 1/10, 1/2).
2.2 Evaluate the Shannon entropy for the fair and loaded probability distributions in Exercise 2.1 computationally, by running the program described in Section 2.1.
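The Section 2.1 program is not reproduced in this excerpt; a minimal stand-in for checking the hand computation might look like the following (the base-2 logarithm is assumed here, giving entropy in bits).

```python
import math

def shannon_entropy(probs, base=2.0):
    """H = -sum p_i log(p_i); terms with p_i == 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair = [1 / 6] * 6
loaded = [1 / 10] * 5 + [1 / 2]
print("fair die:  ", round(shannon_entropy(fair), 4))    # uniform case: log2(6)
print("loaded die:", round(shannon_entropy(loaded), 4))
```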
2.3 Now consider that you have two dice, where each separately rolls "fair," but together they do not roll "fair," i.e. each specific pair of die rolls does not have probability 1/36, but instead has the probability shown below:

Die 1 roll   Die 2 roll   Probability
1            1            (1/6) * (0.001)
1            2            (1/6) * (0.125)
1            3            (1/6) * (0.125)
1            4            (1/6) * (0.125)
1            5            (1/6) * (0.124)
1            6            (1/6) * (0.5)
2            Any          (1/6) * (1/6)
3            Any          (1/6) * (1/6)
4            Any          (1/6) * (1/6)
5            Any          (1/6) * (1/6)
6            1            (1/6) * (0.5)
6            2            (1/6) * (0.125)
6            3            (1/6) * (0.125)
6            4            (1/6) * (0.125)
6            5            (1/6) * (0.124)
6            6            (1/6) * (0.001)

What is the Shannon entropy for the Die 1 outcomes (call it H(1))? What is the Shannon entropy of the Die 2 outcomes (refer to it as H(2))? What is the Shannon entropy for the two-dice outcomes with the probabilities shown in the table above (denote it H(1,2))? Compute the function MI(Die 1, Die 2) = H(1) + H(2) − H(1,2). Is it positive?
2.4 Go to GenBank (https://www.ncbi.nlm.nih.gov/genbank/) and select the genome of a small virus (~10 kb). Using the Python code shown in Section 2.1, determine the base frequencies for {a, c, g, t}. What is the Shannon entropy (if those frequencies are taken to be the probabilities of the associated outcomes)?
2.5 Go to GenBank (https://www.ncbi.nlm.nih.gov/genbank/) and select the genomes of three medium-sized viruses (~100 kb each). Using the Python code shown in Section 2.1, determine the trinucleotide frequencies.