Library of Congress Cataloging‐in‐Publication Data applied for:
ISBN: 9781119716747
Cover Design: Wiley
Cover Image: © agsandrew/Shutterstock
This book is dedicated to my family: Cindy, Nathaniel, Zachary, Sybil, Teresa, Eric, and Josh.
Preface
The material in this book draws from undergraduate and graduate coursework taken while I was a student at Caltech, and from further graduate coursework and studies at Oxford, the University of Wisconsin, and the University of California, Santa Cruz. The material also draws upon my teaching experience and research efforts while a tenured professor of computer science at the University of New Orleans and jointly appointed as principal investigator and director of a protein channel biosensor lab at the Research Institute for Children at Children’s Hospital in New Orleans.
1 Introduction
Informatics provides new avenues of understanding and inquiry in any medium that can be captured in digital form. Areas as diverse as text analysis, signal analysis, and genome analysis, to name a few, can be studied with informatics tools. Computationally powered informatics tools are having a phenomenal impact in many fields, including engineering, nanotechnology, and the biological sciences (Figure 1.1).
In this text I provide a background on various methods from Informatics and Machine Learning (ML) that together comprise a “complete toolset” for doing data analytics work at all levels – from a first-year undergraduate introductory level to advanced topics in subsections suitable for graduate students seeking a deeper understanding (or a more detailed example). Numerous prior book, journal, and patent publications by the author are drawn upon extensively throughout the text [1–68]. Part of the objective of this book is to bring these examples together and demonstrate their combined use in typical signal processing situations. Numerous other journal and patent publications by the author [69–100] provide related material, but are not directly drawn upon in this text. The application domain is practically everything in the digital domain, as mentioned above, but in this text the focus will be on core methodologies with specific application in informatics, bioinformatics, and cheminformatics (nanopore detection, in particular). Other disciplines can also be analyzed with informatics tools. Basic questions about human origins (anthrogenomics) and behavior (econometrics) can also be explored with informatics-based pattern recognition methods, with a huge impact on new research directions in anthropology, sociology, political science, economics, and psychology. The complete toolset of statistical learning tools can be used in any of these domains.
In the chapter that follows, an overview is given of the various information processing stages to be discussed in the text, with some highlights to help explain the order and connectivity of topics, as well as to motivate their presentation in further detail in what is to come.
Figure 1.1 A Penrose tiling: a non-repeating tiling built from two shapes of tiles, with fivefold local symmetry and the golden ratio appearing both locally and globally (emergently).
1.1 Data Science: Statistics, Probability, Calculus … Python (or Perl) and Linux
Knowledge construction using statistical and computational methods is at the heart of data science and informatics. Counts on data features (or events) are typically gathered as a starting point in many analyses [101, 102]. Computer hardware is very well suited to such counting tasks. Basic operating system commands and a popular scripting language (Python) will be taught so that these tasks can be done easily. Computer software methods will also be shown that allow easy implementation and understanding of basic statistical methods, whereby the counts, for example, can be used to determine event frequencies, from which statistical anomalies can subsequently be identified. The computational implementation of basic statistics methods then provides the framework to perform more sophisticated knowledge construction and discovery by use of information theory and basic ML methods. ML can be thought of as a specialized branch of statistics where there is minimal assumption of a statistical “model” based on prior human learning. This book shows how to use computational, statistical, and informatics/algorithmic methods to analyze any data that is captured in digital form, whether it be text, sequential data in general (such as experimental observations over time, or stock market/econometric histories), symbolic data (genomes), or image data. Along the way there will be a brief introduction to probability and statistics concepts (Chapter 2) and basic Python/Linux system programming methods (Chapter 2 and Appendix A).
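As a first illustration of this counting-then-statistics workflow, consider the following minimal Python sketch (the function names, toy data, and simple log-odds threshold are illustrative assumptions, not code from later chapters): symbol counts are converted to event frequencies, and events whose frequency deviates strongly from a uniform background are flagged as statistical anomalies.

import math
from collections import Counter

def event_frequencies(sequence):
    # Count each distinct event (symbol) and convert counts to frequencies.
    counts = Counter(sequence)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}

def anomalous_events(sequence, log_odds_threshold=0.5):
    # Flag events whose log-odds score against a uniform background is large.
    freqs = event_frequencies(sequence)
    uniform = 1.0 / len(freqs)            # uniform background over observed events
    flagged = {}
    for event, f in freqs.items():
        score = math.log(f / uniform)     # log-odds of observed vs. background frequency
        if abs(score) > log_odds_threshold:
            flagged[event] = score
    return flagged

if __name__ == "__main__":
    toy_sequence = "atatatatatatgcgatatatatata"   # hypothetical genomic fragment
    print(event_frequencies(toy_sequence))
    print(anomalous_events(toy_sequence))

In later chapters the uniform background used here is replaced by frequencies estimated from the data itself, and the simple log-odds score is generalized to the information measures introduced in Chapter 3.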
1.2 Informatics and Data Analytics
It is common to need to acquire a signal where the signal properties are not known, or the signal is only suspected and not yet discovered, or the signal properties are known but would be too much trouble to fully enumerate. There is no common solution, however, to the acquisition task. For this reason the initial phases of acquisition methods unavoidably tend to be ad hoc. As with data dependency in non-evolutionary search metaheuristics (where there is no optimal search method that is guaranteed to always work well), here there is no optimal signal acquisition method known in advance. In what follows, methods are described for bootstrap optimization in signal acquisition to enable the most general-use, almost “common,” solution possible. The bootstrap algorithmic method involves repeated passes over the data sequence, with improved priors and trained filters, among other things, so that signal acquisition improves on subsequent passes. The signal acquisition is guided by statistical measures to recognize anomalies. Informatics methods and information theory measures are central to the design of a good finite state automaton (FSA) acquisition method, and will be reviewed in the signal acquisition context in Chapters 2–4. Code examples are given in Python and C (with introductory Python described in Chapter 2 and Appendix A). Bootstrap acquisition methods may not automatically provide a common solution, but appear to offer a process whereby a solution can be improved to some desirable level of general-data applicability.
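The following is a minimal Python sketch of the bootstrap idea just described, not the FSA-based implementation detailed in Chapters 2–4; the z-score detector, the convergence test, and all names are illustrative assumptions. Each pass re-estimates the background statistics (the “priors”) from data not currently flagged as signal, so that acquisition sharpens on the next pass.

import statistics

def detect_signal(samples, mean, stdev, z_cut=3.0):
    # Flag indices whose z-score against the current background exceeds z_cut.
    return {i for i, x in enumerate(samples) if abs(x - mean) > z_cut * stdev}

def bootstrap_acquisition(samples, max_passes=10):
    flagged = set()
    for _ in range(max_passes):
        # Re-estimate the background from everything not yet flagged as signal.
        background = [x for i, x in enumerate(samples) if i not in flagged]
        if len(background) < 2:            # nothing left to estimate a background from
            break
        mean = statistics.mean(background)
        stdev = statistics.pstdev(background) or 1e-9
        new_flagged = detect_signal(samples, mean, stdev)
        if new_flagged == flagged:         # converged: no change between passes
            break
        flagged = new_flagged
    return flagged

The same loop structure carries over when the simple z-score detector is replaced by a trained filter or an FSA-based recognizer: the essential point is that each pass feeds an improved background model to the next.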
The signal analysis and pattern recognition methods described in this book are mainly applied to problems involving stochastic sequential data: power signals and genomic sequences in particular. The information modeling, feature selection/extraction, and feature‐vector discrimination, however, were each developed separately in a general‐use context. Details on the theoretical underpinnings are given in Chapter 3, including a collection of ab initio information theory tools to help “find your way around in the dark.” One of the main ab initio approaches is to search for statistical anomalies using information measures, so various information measures