This chapter reviews statistics and probability concepts and implements many of them in Python. Python scripts are then used for a preliminary examination of the randomness of genomic (virus) sequence data. A short review of Linux OS setup (with Python automatically installed) and Python syntax is given in Appendix A.
Numerous prior book, journal, and patent publications by the author [1–68] are drawn upon extensively throughout the text. Almost all of the journal publications are open access. These publications can typically be found online at either the author's personal website (www.meta‐ or with one of the following online publishers: www.m‐ or
2.1 Python Shell Scripting
A “fair” die has equal probability of rolling a 1, 2, 3, 4, 5, or 6, i.e. a probability of 1/6 for each of the outcomes. Notice that the discrete probabilities for the different outcomes sum to 1; this is always the case for probabilities describing a complete set of outcomes.
A “loaded” die has a non‐uniform distribution. For example, with a probability of 0.5 of rolling a “6” and a uniform probability on the other rolls, the loaded die has die_roll_probability = (1/10,1/10,1/10,1/10,1/10,1/2).
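As a quick check of the normalization property, the following minimal sketch represents both dice with exact fractions and confirms that each distribution sums to 1. The names `fair_die` and `loaded_die` are illustrative only and are not taken from the chapter's program.

```python
from fractions import Fraction

# Illustrative names (assumptions, not from the chapter's own script).
fair_die = [Fraction(1, 6)] * 6
loaded_die = [Fraction(1, 10)] * 5 + [Fraction(1, 2)]

# Probabilities over a complete set of outcomes must sum to exactly 1.
fair_total = sum(fair_die)
loaded_total = sum(loaded_die)
print(fair_total, loaded_total)
```

Exact fractions are used here only so that the sums can be compared with 1 without floating point rounding.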
The first program to be discussed is named and will introduce the notion of discrete probability distributions in the context of rolling the familiar six‐sided die. Comments in Python are the portion of a line to the right of any “#” symbol (except for the first line of code with “#!.....”, that is explained later).
The Shannon entropy of a discrete probability distribution is a measure of its amount of randomness, with the uniform probability distribution having the greatest randomness (i.e. it is most lacking in any statistical “structure” or “information”). Shannon entropy is the sum, over all outcomes, of each outcome probability times the log of that probability, with an overall negative sign placed in front so that the definition yields a positive value. Further details on the mathematical formalism will be given in Chapter 3, but for now we can implement this in our first Python program:
-------------------------- -------------------------
#!/usr/bin/python
import numpy as np
import math
import re
arr = np.array([1.0/10,1.0/10,1.0/10,1.0/10,1.0/10,1.0/2])
# print(arr[0])
shannon_entropy = 0
numterms = len(arr)
print(numterms)
index = 0
for index in range(0, numterms):
    shannon_entropy += arr[index]*math.log(arr[index])
shannon_entropy = -shannon_entropy
print(shannon_entropy)
----------------------- end ------------------------
The maximum Shannon entropy on a system with six outcomes, uniformly distributed (a fair die), is log(6). In the program above we evaluate the Shannon entropy for a loaded die: (1/10,1/10,1/10,1/10,1/10,1/2). Notice in the code, however, that I use “1.0” not “1”. This is because in Python 2, if an expression involves only integers, the mathematics is done as integer operations (returning an integer, so 1/10 truncates to 0). A mixed expression, with some integer terms and some floating point terms (with a decimal), is evaluated as a floating point number. So, to force recognition of the numbers as floating point, the “1” value in the terms is entered as “1.0”. (In Python 3 the “/” operator always returns a floating point result, so the “1.0” is not strictly needed there, but it is harmless.) Further tests are left to the Exercises (Section 2.7).
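The claim that the fair die attains the maximum value log(6) can be checked directly. The sketch below is a minimal illustration, not the chapter's own program; the helper name `entropy` and the variable names are assumptions introduced here.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution.

    Hypothetical helper for illustration; assumes every probability is > 0.
    """
    return -sum(p * math.log(p) for p in probs)

fair_die = [1.0 / 6] * 6
loaded_die = [1.0 / 10] * 5 + [1.0 / 2]

fair_entropy = entropy(fair_die)
loaded_entropy = entropy(loaded_die)

# The fair die should match log(6); the loaded die should come out lower.
print(fair_entropy, math.log(6))
print(loaded_entropy)
```

The loaded die's entropy, roughly 1.498, falls below log(6), roughly 1.792, which is consistent with the loaded die carrying more statistical structure than the fair one.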
A basic review of getting a Linux system running, with its standard Python installed, is described in the Appendix, along with a discussion of how to install added Python modules (added code blocks with very useful, pre‐built data structures and subroutines), particularly “numpy,” which is indicated as a module to be imported (accessed) by the program by the first Python command: “import numpy as np.” (We will see in the Appendix that the first line is not a Python command but a shell directive as to what program to use to process the commands that follow, and this is the mechanism whereby a system level call on the Python script can be done.)
Let us now move on to some basic statistical concepts. How do we know the probabilities for the outcomes of the die roll? In practice, you would observe numerous die rolls and get counts of how many times the various outcomes were observed. Once you have counts, you can divide by the total counts to have the frequency of occurrence of the different outcomes. If you have enough observational data, the frequencies then become better and better estimates of the true underlying probabilities for those outcomes for the system observed (a result due to the law of large numbers (LLN), which is rederived in Section 2.6.1). Let us proceed with adding more code in that begins with counts on the different die rolls:
------------------ addendum 1 -----------------------
rolls = np.array([3435.0,3566,3245,3600,3544,3427])
numterms = len(rolls)
total_count = 0
for index in range(0,numterms):
    total_count += rolls[index]
print(total_count)
probs = np.array([0.0,0,0,0,0,0])
for index in range(0,numterms):
    probs[index] = rolls[index]/total_count
print(probs)
-------------------- end addendum 1 -----------------
Some notes on syntax: “len” is a built‐in Python function that returns the length of (number of items in) a sequence, here an array from the numpy module. Notice how the probs array initialization has one entry as 0.0 and the others just 0. Again, this is an instance where the data structure must have components of the same type and, if presented with mixed types, will promote to a common type that (typically) loses the least information. In this instance, the “0.0” forces the array to be an array of floating point (decimal) numbers, with floating point arithmetic for the division in the “for loop” that evaluates the frequencies used as probability estimates.
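The type promotion described above can be observed by inspecting the `dtype` attribute of a numpy array. The following is a minimal sketch; the array names `all_ints` and `mixed` are illustrative and do not appear in the chapter's code.

```python
import numpy as np

# Illustrative arrays (names are assumptions, not from the chapter's script).
all_ints = np.array([0, 0, 0, 0, 0, 0])    # every entry is an integer
mixed = np.array([0.0, 0, 0, 0, 0, 0])     # one float entry promotes the rest

print(all_ints.dtype)   # an integer dtype (exact width is platform dependent)
print(mixed.dtype)      # a floating point dtype

# Storing a fraction in an integer array truncates it; the float array keeps it.
all_ints[0] = 0.25
mixed[0] = 0.25
print(all_ints[0], mixed[0])
```

This is why a frequency array that is initialized with integers only would silently store zeros instead of the intended fractions.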
At this point we can estimate a new probability distribution based on the rolls observed, for which we are interested in evaluating the Shannon entropy. To avoid repeatedly copying and pasting the above code for evaluating the Shannon entropy, let us create a subroutine, called “shannon”, that will do this standard computation. This is a core software engineering process, whereby tasks that are done repeatedly become recognized as such, are rewritten as subroutines, and then need no longer be rewritten. Subroutines also avoid clashes in variable usage by compartmentalizing their variables (whose scope is only in their subroutine), and they more clearly delineate what information is “fed in” and what information is returned (i.e. the application programming interface, or API).
----------------------- addendum 2 ------------------
def shannon( probs ):
    shannon_entropy = 0
    numterms = len(probs)
    print(numterms)
    for index in range(0, numterms):
        print(probs[index])
        shannon_entropy += probs[index]*math.log(probs[index])
    shannon_entropy = -shannon_entropy
    print(shannon_entropy)
    return shannon_entropy
shannon(probs)
value = shannon(probs)
print(value)
-------------------- end addendum 2 -----------------
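For comparison, the same computation can be written without an explicit loop by using numpy array operations. The sketch below is one possible variant, not the author's subroutine: the name `shannon_vectorized` is hypothetical, and it adopts the common convention that an outcome with zero probability contributes nothing to the sum (the loop version would instead fail on such an entry, since math.log(0) raises an error).

```python
import numpy as np

def shannon_vectorized(probs):
    """Hypothetical vectorized variant of a Shannon entropy subroutine."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]    # convention: zero-probability terms add 0
    # sum of p*log(1/p) is the same quantity as -sum of p*log(p)
    return float(np.sum(nonzero * np.log(1.0 / nonzero)))

loaded = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]
certain = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]   # die that always rolls a 6

loaded_entropy = shannon_vectorized(loaded)
certain_entropy = shannon_vectorized(certain)
print(loaded_entropy)
print(certain_entropy)
```

A die that always rolls the same face has zero entropy, the opposite extreme from the fair die.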
If we do another set of observations, getting counts on the different rolls, we then need to repeat the process of converting those counts to frequencies… so it is time to elevate the count‐to‐frequency computation to subroutine status as well, as is done next. The standard syntactical structure for defining a subroutine in Python is hopefully starting to become apparent (more detailed Python notes are in Appendix A).
------------------- addendum 3 ----------------------
def count_to_freq( counts ):
    numterms = len(counts)
    total_count=0
    for index in range(0,numterms):
        total_count+=counts[index]
    probs = np.zeros(numterms) # separate float array; "probs = counts" would alias and overwrite counts
    for index in range(0,numterms):
        probs[index] = counts[index]/total_count
    return probs
probs = count_to_freq(rolls)
print(probs)
----------------- end addendum 3 --------------------
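To see how count‐based frequencies approach the underlying probabilities as more rolls are observed, one can simulate a loaded die. The sketch below is an illustration under stated assumptions: the helper `simulate_counts`, the fixed seed, and the chosen numbers of rolls are all introduced here and are not part of the chapter's program.

```python
import random
import numpy as np

def simulate_counts(num_rolls, weights, seed=0):
    """Hypothetical helper: simulate die rolls and return counts per face."""
    rng = random.Random(seed)     # fixed seed so the run is repeatable
    faces = rng.choices(range(6), weights=weights, k=num_rolls)
    counts = np.zeros(6)
    for face in faces:
        counts[face] += 1
    return counts

loaded_weights = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]
for num_rolls in (100, 10000, 100000):
    counts = simulate_counts(num_rolls, loaded_weights)
    freqs = counts / counts.sum()     # counts converted to frequencies
    # Largest gap between an observed frequency and its true probability.
    max_error = np.max(np.abs(freqs - np.array(loaded_weights)))
    print(num_rolls, freqs, max_error)
```

With enough rolls the frequencies should settle close to the weights used in the simulation, which is the behavior the law of large numbers describes.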
Is genomic DNA random? Let us read through a DNA file, consisting of a sequence of a, c, g, and