Acquisition is often all that is needed in a signal analysis problem: a basic means to acquire the signals is sought, followed by a basic statistical analysis of those signals and their occurrences. Various methods for signal acquisition using FSA constructs are described in what follows, with a focus on using statistical anomalies to identify the presence of a signal and “lock on” [1, 3]. The signal acquisition is initially guided only by statistical measures that recognize anomalies. Informatics methods and information theory measures are central to the design of a good FSA acquisition method, however, and will be reviewed in the signal acquisition context [1, 3], along with HMMs.
Thus, FSA processes allow signal regions to be identified, or “acquired,” in O(L) time. Furthermore, within that same order of time complexity, an entire panoply of statistical moments can also be computed on the signals (and used in a bootstrap learning process). The O(L) feature extraction of statistical moments on the acquired signal region may suffice for localized events and structures. For sequential information or events, however, there is often a non-local, or extended structural, aspect to the signal sought. In these situations we need a general, powerful way to analyze sequential signal data that is stochastic (random, but with statistics, such as the average, that may be unchanging over time if the process is “stationary”). The general method for performing stochastic sequential analysis (SSA) is via HMMs, as will be extensively described in Chapters 6 and 7, and briefly summarized in Section 1.5 that follows. HMM approaches require an identification of “states” in the signal analysis. If an identification of states is difficult, such as in situations where meaning can change according to context, e.g. language, then HMMs may not be useful. Text and language analytics are described in Chapters 5 and 13, and briefly outlined in the next section.
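To make the O(L) acquisition step concrete, the following is a minimal sketch (in Python; the names windowed_moments and acquire_anomalies are illustrative, not drawn from the text). Windowed means and variances are obtained from cumulative sums in a single pass, and windows whose mean deviates anomalously from the global baseline are flagged as candidate signal regions, a simple statistical “lock on” criterion.

import numpy as np

def windowed_moments(signal, w):
    """O(L) windowed mean and variance via cumulative sums.

    signal : 1-D numeric array of length L
    w      : window width
    Returns arrays of length L - w + 1.
    """
    c1 = np.cumsum(np.insert(signal, 0, 0.0))      # running sum
    c2 = np.cumsum(np.insert(signal**2, 0, 0.0))   # running sum of squares
    s1 = c1[w:] - c1[:-w]                          # per-window sums
    s2 = c2[w:] - c2[:-w]
    mean = s1 / w
    var = s2 / w - mean**2                         # E[x^2] - E[x]^2
    return mean, var

def acquire_anomalies(signal, w=100, z=3.0):
    """Flag window positions whose mean deviates from the global
    baseline by more than z standard errors (a minimal anomaly-based
    acquisition criterion)."""
    mean, _ = windowed_moments(signal, w)
    base_mu, base_sd = signal.mean(), signal.std()
    score = (mean - base_mu) / (base_sd / np.sqrt(w))
    return np.flatnonzero(np.abs(score) > z)

Higher moments (skew, kurtosis) follow the same pattern with cumulative sums of higher powers, so the full set of windowed moments is still obtained in a single O(L) pass.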
1.4 Feature Extraction and Language Analytics
The FSA sequential-data signal processing, and extraction of statistical moments on windowed data, will be shown in Chapter 2 to be O(L), with L the size of the data (double the data and you double the processing time). If HMMs can be used, with their introduction of states (the sequential data is described as a sequence of “hidden” states), then the computational cost goes as O(LN²), where N is the number of states. If N = 10, this could be 100 times the computational time of an FSA-based O(L) computation, so HMMs can generally be a lot more expensive in terms of computational time. Even so, if you can benefit from an HMM it is generally possible to do so, even if hardware specialization (CPU farm utilization, etc.) is required. The problem is if you do not have a strong basis for an HMM application, e.g. when there is no strong basis for delineating the states of the system of communication under study. This is the problem encountered in the study of natural languages (where there is significant context dependency). In Chapter 5 we look into FSA analysis for language by doing some basic text analytics.
Chapter 5 shows some (very) basic extensions to an FSA analysis in applications to text. This begins with a simple frequency analysis on words, which for some classics (in their original languages) reveals important word-frequency results with meanings implied by the author (Machiavelli’s polysemous word usage, for example). The frequency of word groupings in a given text can be studied as well, with some useful results from texts of sufficient size with clear stylistic conventions by the author. Authors who structure their lines of text according to iambic pentameter (Shakespeare, for example) can also be identified according to the profile (histogram) of syllables used on each line (i.e. lines of 10 syllables will dominate for iambic pentameter).
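As a minimal sketch of such a frequency analysis (in Python; the function name and the file prince.txt are illustrative assumptions, not from the text), a single O(L) pass over the tokenized text tabulates word frequencies, and the same routine handles word groupings (n-grams):

import re
from collections import Counter

def word_frequencies(text, n=1):
    """Frequency table of words (n=1) or n-word groupings (n>1),
    built in a single O(L) pass over the tokenized text."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = zip(*(words[i:] for i in range(n))) if n > 1 else words
    return Counter(grams)

# Usage (hypothetical file):
# text = open("prince.txt").read()
# print(word_frequencies(text).most_common(10))        # top words
# print(word_frequencies(text, n=2).most_common(10))   # top word pairs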
Text analytics can also, with what is still O(L) processing, map the mood or sentiment of text samples by use of word-scored sentiment tables. The generation and use of such sentiment tables is its own craft, usually proprietary, so only minimal examples are given. Thus Chapter 5 shows an elaboration of FSA-based analysis that might be done when there is no clear definition of state, such as in language. NLP in general encompasses a much more complete grammatical knowledge of the language, but in the end both NLP and the FSA-based “add-on” still suffer from not being able to manage word context easily (the states cannot simply be words, since words can have different meanings according to context). The inability to use HMMs was long a blockade to a “universal translator,” one since overcome with the use of Deep Learning using NNs (Chapter 13), where immense amounts of translation data, such as the massive corpus of dual-language Canadian Government proceedings, are sufficient to train a translator (English–French). Most of the remaining chapters focus on situations where a clear delineation of signal state can be given, and thus benefit from the use of HMMs.
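A toy illustration of sentiment scoring along these lines (the table below is a stand-in; real sentiment tables are far larger and, as noted, usually proprietary) keeps the O(L) character of the processing, since each word costs one table lookup:

import re

# Toy sentiment table -- real tables are far larger and proprietary.
SENTIMENT = {"good": 1.0, "great": 2.0, "happy": 1.5,
             "bad": -1.0, "terrible": -2.0, "sad": -1.5}

def sentiment_score(text):
    """Sum of per-word sentiment scores, normalized by word count.
    A single O(L) pass: one table lookup per word."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(SENTIMENT.get(w, 0.0) for w in words) / len(words)

# sentiment_score("a great day, not a terrible one")  ->  0.0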
1.5 Feature Extraction and Gene Structure Identification
HMMs offer a more sophisticated signal recognition process than FSAs, but with greater computational space and time complexity [125, 126]. Like electrical engineering signal processing, HMMs usually involve preprocessing that assumes linear system properties, or assumes the observation is frequency band limited and not time limited, and thereby inherit the time-frequency uncertainty relations, Gabor limit, and Nyquist sampling relations. FSA methods can be used to recover (or extract) signal features missed by HMM or classical electrical engineering signal processing. Even if the signal sought is well understood, and a purely HMM-based approach is possible, this is often needlessly computationally intensive (slow), especially in regions where there is no signal. To address this there are numerous hybrid FSA/HMM approaches (such as BLAST [127]) that benefit from the O(L) complexity of FSA processing on a length L signal, followed by more targeted processing at O(LN²) complexity with HMM processing (where there are N states in the HMM model).
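The hybrid structure can be sketched as follows (illustrative Python; fsa_filter and hmm_decode are hypothetical callables standing in for the O(L) prefilter and the HMM decoder, respectively):

def hybrid_scan(signal, fsa_filter, hmm_decode, w=100):
    """Hybrid FSA/HMM sketch: an O(L) FSA-style prefilter flags
    candidate regions, and only those regions pay the O(L'N^2)
    HMM decoding cost (L' << L when signal is sparse).

    fsa_filter : callable returning candidate (start, end) regions
    hmm_decode : callable returning a state path for a region
    """
    results = []
    for start, end in fsa_filter(signal, w):    # cheap O(L) pass
        path = hmm_decode(signal[start:end])    # expensive, but targeted
        results.append((start, end, path))
    return results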
Figure 1.2 The Viterbi path. (Left) The Viterbi path is recursively defined, and thus tabulatable, with each column dependent only on the prior column. (Right) A related recursive algorithm used to perform sequence alignment extensions with gaps (the Smith–Waterman algorithm) is given by the neighbor-cell recursive relation shown.
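A minimal log-space implementation of the Viterbi recursion of Figure 1.2 (left), written in Python for illustration (not code from the text), makes the column-by-column dependence and the O(LN²) cost explicit:

import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """Viterbi path by dynamic programming: column t depends only on
    column t-1, as in Figure 1.2 (left).

    obs       : sequence of T integer observation indices (T = L here)
    log_init  : (N,)   log initial state probabilities
    log_trans : (N, N) log transition probabilities
    log_emit  : (N, M) log emission probabilities
    Runs in O(T N^2) time, with O(T N) space for the backpointers.
    """
    T, N = len(obs), len(log_init)
    col = log_init + log_emit[:, obs[0]]      # current column of scores
    back = np.zeros((T, N), dtype=int)        # backpointers
    for t in range(1, T):
        cand = col[:, None] + log_trans       # cand[i, j]: score of i -> j
        back[t] = cand.argmax(axis=0)         # best predecessor of each j
        col = cand.max(axis=0) + log_emit[:, obs[t]]
    path = [int(col.argmax())]                # trace back the optimal path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]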
HMMs, unlike tFSAs, have a straightforward mathematical and computational foundation at the nexus where Bayesian probability and Markov models meet dynamic programming. To properly define or choose the HMM model in an ML context, however, further generalization is usually required. This is because the “bare-bones” HMM description has critical weaknesses in most applications, which are described in Chapter 7, along with their “fixes.” Fortunately, each of the standard HMM weaknesses can be addressed in computationally efficient ways. The generalized