Source: Based on Winters‐Hilt [1–3].
In adopting any model with “more parameters,” such as a HMMBD over a HMM, there is a potential problem with having sufficient data to support the additional modeling. This is generally not a problem for any HMM application that already requires thousands of samples of non‐self transitions for sensor modeling, such as the gene finding described in what follows: knowing the boundary positions allows the regions of self‐transitions (the durations) to be extracted with a similar sample count, which is typically sufficient for effective modeling of the duration distributions in a HMMD.
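As a minimal sketch of how such duration distributions might be extracted once boundary positions are known (the function and the toy state path below are illustrative, not the book's implementation):

```python
from collections import Counter

def duration_histogram(state_path, state):
    """Empirical duration distribution for `state`: measure the length of
    each run of self-transitions in a labeled (annotated) state path."""
    counts = Counter()
    run = 0
    for s in state_path:
        if s == state:
            run += 1
        elif run:
            counts[run] += 1
            run = 0
    if run:
        counts[run] += 1
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

# toy annotated path: E = exon-like state, I = intron-like state
path = ["E", "E", "E", "I", "I", "E", "E", "I", "I", "I", "I"]
print(duration_histogram(path, "E"))  # {3: 0.5, 2: 0.5}
```

With many annotated boundaries, the same pass yields duration histograms for every state, which is the raw material for the HMMD duration model.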
Improvement to the overall HMM application rests not only with the aforementioned improvements to the HMM/HMMBD, but also with improvements to the hidden state model and the emission model. This is because standard HMMs are at low Markov order in transitions (first order) and in emissions (zeroth order), and transitions are decoupled from emissions (which can miss critical structure in the model, such as state transition probabilities that are sequence dependent). This weakness is eliminated if we generalize to the largest state‐emission clique possible, fully interpolated on the data set, as is done with the generalized‐clique HMM, where gene finding is performed on the Caenorhabditis elegans genome. The clique generalization improves the modeling of the critical signal information at the transitions between exon regions and noncoding regions, e.g., intron and junk regions. In doing this we arrive at a HMM structure identification platform that is novel, and robust in performance, in a number of ways.
Prior HMM‐based systems for SSA had undesirable limitations and disadvantages. For example, the speed of operation made such systems difficult, if not impossible, to use for real‐time analysis of information. In the SSA Protocol described here, distributed generalized HMM processing together with the use of the SVM‐based Classification and Clustering Methods (described next) permit the general use of the SSA Protocol free of the usual limitations. After the HMM and SSA methods are described, their synergistic union is used to convey a new approach to signal analysis with HMM methods, including a new form of stochastic‐carrier wave (SCW) communication.
1.6 Theoretical Foundations for Learning
Before moving on to classification and clustering (Chapter 10), a brief description is given of some of the theoretical foundations for learning, starting with the foundation for the choice of information measures used in Chapters 2–4, which is presented in Chapter 8. In Chapter 9 we then describe the theory of NNs. The Chapter 9 background is not meant to be a complete exposition on NN learning (quite the opposite), but merely works through a few specific analyses in the area of Loss Bounds analysis to give a sense of what makes a good classification method.
1.7 Classification and Clustering
SVMs can be used for classification and clustering (to be described in detail in Chapter 10), as well as for aiding with signal analysis and pattern recognition on stochastic sequential data. The signal processing material described next, and in detail later, mainly draws from prior journal publications [159–189]. Analysis tools for stochastic sequential data have broad‐ranging application by making any device producing a sequence of measurements more sensitive, or “smarter,” through efficient learning of measured signal/pattern characteristics. The SVM and HMM/SVM application areas described in this book include cheminformatics, biophysics, and bioinformatics. The cheminformatics application examples pertain to channel current analysis on the alpha‐hemolysin nanopore detector (Chapter 14).
The biophysics and “information flows” associated with the nanopore transduction detector (NTD) in Chapter 14 are analyzed using a generalized set of HMM‐ and SVM‐based tools, as well as ad hoc FSA‐based methods, and a collection of distributed genetic algorithm methods for tuning and selection. Used with a nanopore detector, the channel current cheminformatics (CCC) for the stationary‐signal channel blockades (with “stationary statistics”) enables highly sensitive single‐molecule biophysical analysis.
The SVM implementations described involve SVM algorithmic variants, kernel variants, and chunking variants, as well as SVM classification tuning metaheuristics and SVM clustering metaheuristics. The SVM tuning metaheuristics typically use the SVM's confidence parameter to bootstrap from a strong classification engine to a strong clustering engine via label changes and repeated SVM training with the new label information obtained.
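The label‐change bootstrap can be sketched roughly as follows, here using scikit‐learn's SVC as a stand‐in for the SVM implementations described in the book; the flip fraction, stopping rule, and toy data are illustrative assumptions, not the actual metaheuristic:

```python
import numpy as np
from sklearn.svm import SVC

def svm_cluster(X, flip_frac=0.1, n_iter=20, seed=0):
    """Bootstrap an SVM classifier toward a 2-cluster engine: start from
    random labels, train, then flip the labels of the least-confident
    points (smallest |decision value|) and retrain."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=len(X))
    for _ in range(n_iter):
        clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
        margins = np.abs(clf.decision_function(X))
        k = max(1, int(flip_frac * len(X)))
        weakest = np.argsort(margins)[:k]    # least-confident points
        new_y = y.copy()
        new_y[weakest] = 1 - new_y[weakest]  # relabel the weak data
        if len(set(new_y)) < 2:              # degenerate labeling; stop
            break
        y = new_y
    return y

# toy data: two separated 2D blobs (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
labels = svm_cluster(X)
```

The point of the sketch is the loop itself: each retraining pass lets the SVM's own confidence measure drive the label reassignment, which is what turns a classification engine into a clustering engine.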
SVM Methods and Systems are given in Chapter 10 for classification, clustering, and SSA in general, with a broad range of applications:
sequential‐structure identification
pattern recognition
knowledge discovery
bioinformatics
nanopore detector cheminformatics
computational engineering with information flows
“SSA” Architectures favoring Deep Learning (see next section)
SVM binary discrimination outperforms other classification methods with or without dropping weak data (while many other methods cannot even identify weak data).
1.8 Search
All of the core methods described thus far (FSA, HMM, SVM) require some amount of parameter “tuning” for good performance. In essence, tuning is a search through the method's parameter space for best performance (according to a variety of metrics). The tuning of acquisition parameters in an FSA, or the choice of states in a HMM, or of SVM kernels and kernel parameters, is often not terribly complicated, allowing for a “brute‐force” search over a set of parameters, choosing the best from that set. On occasion, however, a more elaborate, fully automated, search optimization is sought (or just a search problem in general). For more complex search tasks it is good to know the modern search methodologies and what they are capable of, so these are described in Chapter 11.
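A brute‐force tuning search of the kind just described can be sketched, for example, over an SVM's C and gamma kernel parameters on a standard dataset (the grid values here are arbitrary choices for illustration):

```python
from itertools import product

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# brute-force tuning: score every (C, gamma) pair, keep the best
grid = product([0.1, 1, 10, 100], [0.001, 0.01, 0.1, 1])
score, C, g = max(
    (np.mean(cross_val_score(SVC(C=C, gamma=g), X, y, cv=5)), C, g)
    for C, g in grid
)
print(f"best 5-fold accuracy {score:.3f} at C={C}, gamma={g}")
```

For small grids this exhaustive enumeration is perfectly adequate; the search methods of Chapter 11 become relevant when the parameter space is too large or too structured for enumeration.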
1.9 Stochastic Sequential Analysis (SSA) Protocol (Deep Learning Without NNs)
The SSA protocol is shown in Figure 1.5 (from prior publications and patent work, see [1–3]): a general signal‐processing flow topology and database schema (left panel), with specialized variants for CCC (center panel) and for kinetic feature extraction based on blockade‐level duration observations (right panel). The SSA Protocol allows for the discovery, characterization, and classification of localizable, approximately stationary, statistical signal structures in channel current data, genomic data, or sequential data in general. The core signal processing stage in Figure 1.5 is usually the feature extraction stage, where a generalized HMM is central to the signal processing protocol. The SSA Protocol also has a built‐in recovery protocol for weak signal handling, outlined next, in which the HMM methods are complemented by the strengths of other ML methods.
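A much‐simplified stand‐in for the blockade‐level kinetic feature extraction mentioned above can be sketched as follows, with nearest‐level quantization in place of HMM segmentation; the function name, toy trace, and level values are all hypothetical:

```python
import numpy as np

def dwell_times(trace, levels, sample_rate):
    """Quantize a current trace to its nearest blockade level, then
    report the dwell time (in seconds) of each visit to each level."""
    levels = np.asarray(levels, dtype=float)
    idx = np.abs(trace[:, None] - levels[None, :]).argmin(axis=1)
    dwells = {i: [] for i in range(len(levels))}
    start = 0
    for t in range(1, len(idx) + 1):
        # close out a run at a level change or at the end of the trace
        if t == len(idx) or idx[t] != idx[start]:
            dwells[int(idx[start])].append((t - start) / sample_rate)
            start = t
    return dwells

# toy trace: 3 samples near one level, 2 near another, 2 back, at 1 Hz
trace = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0])
d = dwell_times(trace, levels=[0.0, 1.0], sample_rate=1.0)
print(d)  # {0: [2.0], 1: [3.0, 2.0]}
```

The resulting dwell‐time lists per level are exactly the blockade‐level duration observations that the right‐panel variant of the protocol would pass on as kinetic features.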