      In this chapter, a review is given of statistics and probability concepts, with implementation of many of the concepts in Python. Python scripts are then used to do a preliminary examination of the randomness of genomic (virus) sequence data. A short review of Linux OS setup (with Python automatically installed) and Python syntax is given in Appendix A.

      Numerous prior book, journal, and patent publications by the author [1–68] are drawn upon extensively throughout the text. Almost all of the journal publications are open access. These publications can typically be found online at either the author's personal website (www.meta‐logos.com) or with one of the following online publishers: www.m‐hikari.com or bmcbioinformatics.biomedcentral.com.

      A “fair” die has equal probability of rolling a 1, 2, 3, 4, 5, or 6, i.e. a probability of 1/6 for each outcome. Notice that the discrete probabilities for the different outcomes sum to 1; this is always the case for probabilities describing a complete set of outcomes.

      A “loaded” die has a non‐uniform distribution. For example, with probability 0.5 of rolling a “6” and the remaining probability spread uniformly over the other faces, the distribution is die_roll_probability = (1/10, 1/10, 1/10, 1/10, 1/10, 1/2).
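      As a quick check, both distributions can be verified to be properly normalized. The following is a minimal sketch (separate from the prog1.py program introduced below; the variable names are mine):

----------------- sketch: normalization check ----------------
import numpy as np

fair = np.array([1.0/6]*6)                      # fair die: uniform over six faces
loaded = np.array([0.1,0.1,0.1,0.1,0.1,0.5])    # loaded die: "6" has probability 1/2

# both must sum to 1 to describe a complete set of outcomes
print(fair.sum())                               # 1.0 (up to floating-point rounding)
print(loaded.sum())                             # 1.0
--------------- end sketch: normalization check --------------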

      The first program to be discussed is named prog1.py and introduces the notion of discrete probability distributions in the context of rolling the familiar six‐sided die. Comments in Python are the portion of a line to the right of any “#” symbol (except for the first line of code, “#!.....”, which is explained later).

-------------------------- prog1.py -------------------------
#!/usr/bin/python
import numpy as np
import math
import re

arr = np.array([1.0/10,1.0/10,1.0/10,1.0/10,1.0/10,1.0/2])
# print(arr[0])
shannon_entropy = 0
numterms = len(arr)
print(numterms)
for index in range(0, numterms):
    shannon_entropy += arr[index]*math.log(arr[index])
shannon_entropy = -shannon_entropy
print(shannon_entropy)
----------------------- end prog1.py ------------------------

      The maximum Shannon entropy for a system with six outcomes, uniformly distributed (a fair die), is log(6). In the prog1.py program above we evaluate the Shannon entropy for a loaded die: (1/10, 1/10, 1/10, 1/10, 1/10, 1/2). Notice in the code, however, that I use “1.0” not “1”. In Python 2, an expression involving only integers is evaluated with integer arithmetic (returning an integer, so a result such as 1/10 is truncated to 0), while a mixed expression, with some integer terms and some floating point (with a decimal), is evaluated as floating point. In Python 3 the “/” operator always performs floating‐point division, but writing “1.0” makes the intent explicit and keeps the code correct under either version. Further tests are left to the Exercises (Section 2.7).
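      To see the log(6) bound concretely, the following sketch (separate from prog1.py; the helper name entropy is mine) compares the entropy of the fair and loaded distributions:

----------------- sketch: entropy comparison -----------------
import math

def entropy(probs):
    # Shannon entropy in nats: -sum_i p_i * log(p_i)
    return -sum(p * math.log(p) for p in probs)

fair = [1.0/6]*6
loaded = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

print(math.log(6))       # 1.7917..., the maximum for six outcomes
print(entropy(fair))     # log(6): the uniform distribution achieves the maximum
print(entropy(loaded))   # approx 1.4979, strictly below log(6)
--------------- end sketch: entropy comparison ---------------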

      Let us now move on to some basic statistical concepts. How do we know the probabilities for the outcomes of the die roll? In practice, you would observe numerous die rolls and count how many times the various outcomes were observed. Once you have counts, you can divide by the total count to obtain the frequency of occurrence of each outcome. With enough observational data, the frequencies become better and better estimates of the true underlying probabilities for the system observed (a result due to the law of large numbers (LLN), which is rederived in Section 2.6.1). Let us proceed by adding more code to prog1.py, beginning with counts on the different die rolls:

------------------ prog1.py addendum 1 -----------------------
rolls = np.array([3435.0,3566,3245,3600,3544,3427])
numterms = len(rolls)
total_count = 0
for index in range(0,numterms):
    total_count += rolls[index]
print(total_count)
probs = np.array([0.0,0,0,0,0,0])
for index in range(0,numterms):
    probs[index] = rolls[index]/total_count
print(probs)
-------------------- end prog1.py addendum 1 -----------------
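      The LLN behavior invoked above can also be seen directly by simulation. Here is a minimal sketch (separate from prog1.py; the generator seed and sample sizes are illustrative):

----------------- sketch: LLN by simulation ------------------
import numpy as np

rng = np.random.default_rng(0)                   # seeded for reproducibility
true_probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]      # the loaded die from above

for n in (100, 10000, 1000000):
    rolls = rng.choice(6, size=n, p=true_probs)  # n simulated rolls, faces coded 0..5
    freqs = np.bincount(rolls, minlength=6) / n  # observed frequencies
    print(n, freqs)                              # frequencies approach true_probs as n grows
--------------- end sketch: LLN by simulation ----------------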

      At this point we can estimate a new probability distribution based on the rolls observed, for which we are interested in evaluating the Shannon entropy. To avoid repeatedly copying and pasting the code for evaluating the Shannon entropy, let us create a subroutine, called “shannon”, that performs this standard computation. This is a core software engineering practice: tasks that are done repeatedly become recognized as such, are rewritten as subroutines, and then need never be rewritten. Subroutines also avoid clashes in variable usage by compartmentalizing their variables (whose scope is limited to the subroutine), and they more clearly delineate what information is “fed in” and what information is returned (i.e. the application programming interface, or API).

----------------------- prog1.py addendum 2 ------------------
def shannon( probs ):
    shannon_entropy = 0
    numterms = len(probs)
    print(numterms)
    for index in range(0, numterms):
        print(probs[index])
        shannon_entropy += probs[index]*math.log(probs[index])
    shannon_entropy = -shannon_entropy
    print(shannon_entropy)
    return shannon_entropy

shannon(probs)
value = shannon(probs)
print(value)
-------------------- end prog1.py addendum 2 -----------------
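      One caveat worth noting: math.log(0) raises an error, so shannon() as written fails if any outcome has probability zero. By the usual convention 0·log(0) = 0, a zero‐safe variant (my sketch, not part of the text's prog1.py; it assumes the math import already in prog1.py) simply skips the zero entries:

----------------- sketch: zero-safe entropy ------------------
def shannon_safe( probs ):
    # Shannon entropy with the convention 0*log(0) = 0, so zero-probability
    # outcomes (e.g. a die face never observed) do not raise a math error
    shannon_entropy = 0.0
    for p in probs:
        if p > 0:
            shannon_entropy += p*math.log(p)
    return -shannon_entropy

print(shannon_safe([0.5, 0.5, 0.0]))   # log(2); the zero term is skipped
--------------- end sketch: zero-safe entropy ----------------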

      Converting counts to frequencies is likewise a recurring computation, so we wrap it in a subroutine as well. Note that a NumPy array must be copied explicitly; plain assignment would only bind a second name to the same array:

------------------- prog1.py addendum 3 ----------------------
def count_to_freq( counts ):
    numterms = len(counts)
    total_count = 0
    for index in range(0,numterms):
        total_count += counts[index]
    probs = counts.copy() # an explicit copy, so the counts array is not overwritten
    for index in range(0,numterms):
        probs[index] = counts[index]/total_count
    return probs

probs = count_to_freq(rolls)
print(probs)
----------------- end prog1.py addendum 3 --------------------
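      The .copy() in count_to_freq() matters: with plain assignment, both names refer to the same NumPy array, so writing through probs would silently overwrite the counts. A small illustration:

----------------- sketch: array aliasing ---------------------
import numpy as np

counts = np.array([2.0, 3.0])
alias = counts                 # no new array: both names refer to the same memory
alias[0] = 99.0
print(counts)                  # [99.  3.] -- the counts were clobbered

counts = np.array([2.0, 3.0])
probs = counts.copy()          # an independent array
probs[0] = 99.0
print(counts)                  # [2. 3.] -- the original is preserved
--------------- end sketch: array aliasing -------------------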

      Is genomic DNA random? Let us read through a DNA file, consisting of a sequence of a, c, g, and t, and examine whether its base frequencies look random.
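      As a preview of the kind of code involved, here is a minimal sketch (not the text's listing; the filename dna.txt is hypothetical, and shannon() is the subroutine from addendum 2):

----------------- sketch: base frequencies -------------------
import re

# tally nucleotide occurrences in a plain-text dna sequence file
counts = {'a': 0, 'c': 0, 'g': 0, 't': 0}
with open('dna.txt') as seqfile:
    for line in seqfile:
        for base in re.findall(r'[acgt]', line.lower()):
            counts[base] += 1

total = sum(counts.values())
freqs = [counts[base]/total for base in 'acgt']
print(freqs)            # observed base frequencies
print(shannon(freqs))   # entropy near log(4) would suggest near-uniform base composition
--------------- end sketch: base frequencies -----------------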