3.4.1 Markov Chains
A Markov chain is a sequence of random variables S1; S2; S3; … with the Markov property of limited memory, where a first‐order Markov assumption on the probability for observing a sequence “s1 s2 s3 s4…sn” is:
In the Markov chain model, the states are also the observables. For a HMM we generalize to where the states are no longer directly observable (but still first‐order Markov), and for each state, say S1, we have a statistical linkage to a random variable, O1, that has an observable base emission, with the standard (0th‐order) Markov assumption on prior emissions.
The key “short‐term memory” property of a Markov chain is: P(xi ∣ xi − 1, …, x1) = P(xi ∣ xi − 1) = ai − 1, i, where ai − 1, i are sometimes referred to as “transition probabilities”, and we have: P(x) = P(xL, xL − 1, …, x1) = P(x1)∏i = 2…L ai − 1, i. If we denote Cy for the count of events y, Cxy for the count of simultaneous events x and y, Ty for the count of strings of length one, and Txy for the count of strings of length two: ai − 1, i = P(x ∣ y) = P(x, y)/P(y) = [Cxy/Txy]/[ Cy/Ty]. Note: since Txy + 1 = Ty → Txy ≅ Ty (sequential data sample property if one long training block), ai − 1, i ≅ Cxy/Cy = Cxy/∑x Cxy, so counts Cxy is complete information for determining transition probabilities.
For prokaryotic gene prediction much of the problem with obtaining high‐confidence training data can be circumvented by using a bootstrap gene‐prediction approach. This is possible in prokaryotes because of their simpler and more compact genomic structure: simpler in that long ORFs are usually long genes, and compact in that motif searches upstream usually range over hundreds of bases rather than thousands (as in human).
Interpolated Markov Model (IMM): the order of the MM can be interpolated according to a globally imposed cutoff criterion (see Figure 3.5), such as a minimum sub‐sequence count: fourth‐order passes if Counts (x0; x−1; x−2; x−3; x−4)>cutoff for all x−4…x0 sub‐sequences (100, for example), the utility of this becomes apparent with the following reexpression:
TotalCounts(length5)/TotalCounts(length4)
Figure 3.5 Hash interpolated Markov model (hIMM) and gap/hash interpolated Markov model (ghIMM): now no longer employ a global cutoff criterion – count cutoff criterion applied at the sub‐sequence level. Given the current state and the emitted sequence as x1, …, xL; compute: P(xL ∣ x1, …, xL − 1) ≈ Count(x1, …, xL)/Count(x1, …, xL − 1). If Count(x1, …, xL − 1) ≥ 400 i.e. only if the parental sequence shows statistical significance (consider 400 = 4 × 100, or requiring >100 counts on observations assuming uniform distribution for count cutoff determination), store P(xL ∣ x1, …, xL − 1) in the hash. (If gene finding, have at least five states, e.g. need to maintain a separate hash for each of the following states – Junk, Intron, and Exon0, 1, 2.)
Suppose Counts (x0; x−1; x−2; x−3; x−4; x−5) <cutoff for some x−5…x0 sub‐sequence, then the interpolation would halt (globally), and the order of MM used would be fourth‐order.
Gap Interpolated Markov Model (gIMM): like IMM with its count cutoff, but when going to higher order in the interpolation there is no constraint to contiguous sequence elements – i.e. “gaps” are allowed. The resolution of what gap‐size to choose when going to the next higher order is resolved by evaluating the Mutual Information. i.e. when going to 3rd order in the Markov context, P(x0 ∣ x−5; x−2; x−1) is chosen over P(x0 ∣ x−3; x−2; x−1) if