1.3 How to Use This Book
This book is intended for students, engineers, and researchers who are interested in using modern statistical methods for variation modeling, diagnosis, and prediction in industrial systems.
This book can be used as a textbook for graduate-level or advanced undergraduate-level courses on industrial data analytics. The book is fairly self-contained, although some background in basic probability and statistics (random variables, probability distributions, and moments) and in linear algebra (matrix operations and matrix decompositions) would be useful. The appendix at the end of the book summarizes the necessary concepts and results in linear space and matrix theory. The chapters in Part II of the book are relatively independent of one another, so an instructor can combine selected chapters in Part II with Part I as the basic material for different courses. For example, the topics in Part I can be used for an advanced undergraduate-level course that introduces industrial data analytics. Part I and selected chapters in Part II (e.g., Chapters 7, 8, and 9) can be used in a master’s-level statistical quality control course. Similarly, Part I and selected later chapters in Part II (e.g., Chapters 10, 11, and 12) can be used in a master’s-level course with an emphasis on prognosis and reliability applications. Finally, Part II alone can be used as the textbook for an advanced graduate-level course on diagnosis and prognosis.
One important feature of this book is that we provide detailed descriptions of software implementation for most of the methods and algorithms. We adopt the statistical programming language R in this book. R is versatile and has a very large number of up-to-date packages implementing various statistical methods [R Core Team, 2020]. This feature makes the book well suited to practitioners in engineering fields who want to self-study and implement the statistical modeling and analysis methods. All the R code and data sets used in this book can be found at the book companion website.
Bibliographic Notes
Some examples of good books on system diagnosis and prognosis in engineering are Lewis et al. [2011], Niu [2017], Wu et al. [2006], Talebi et al. [2009], Gertler [1998], Chen and Patton [2012], Witczak [2007], Isermann [2011], Ding [2008], and Si et al. [2017]. Many good textbooks are also available on industrial statistics. For example, Montgomery [2009], DeVor et al. [2007], Colosimo and Del Castillo [2006], and Wu and Hamada [2011] cover statistical monitoring and design. On failure event analysis and prognosis, Meeker and Escobar [2014], Rausand et al. [2004], and Elsayed [2012] are commonly cited references.
2 Introduction to Data Visualization and Characterization
Before making a move, an experienced chess player first scans the board for noticeable patterns, such as the opponent’s threats, special relationships between pieces, and the strengths and weaknesses of both sides, and only then digs into in-depth calculation of move sequences to find the best move. Similarly, a data scientist should start by exploring the data set for noticeable patterns before conducting any in-depth analysis, such as building a sophisticated mathematical model or running a computationally intensive algorithm. Simple data exploration methods help us understand the basic data structure, such as the dimension and the types of variables; discover initial patterns, such as relationships among variables; and identify missing values, outliers, and skewed distributions that call for data pre-processing and transformation. This chapter focuses on basic graphical and numerical methods for data description and exploration. We first look at a data set in the following example.
Example 2.1 (auto_spec data) The data set in auto_spec.csv, which is from the UCI Machine Learning Repository [Dua and Graff, 2017], contains the specifications of a sample of cars. The following R code can be used to read the data file and obtain information on the basic characteristics and structure of the data set.
# load data
auto.spec.df <- read.csv("auto_spec.csv", header = TRUE)
# show basic information of data set
dim(auto.spec.df)
names(auto.spec.df)
head(auto.spec.df)
summary(auto.spec.df)

> dim(auto.spec.df)
[1] 205  23
> names(auto.spec.df)
 [1] "make"              "fuel.type"         "aspiration"
 [4] "num.of.doors"      "body.style"        "drive.wheels"
 [7] "engine.location"   "wheel.base"        "length"
[10] "width"             "height"            "curb.weight"
[13] "engine.type"       "num.of.cylinders"  "engine.size"
[16] "fuel.system"       "bore"              "stroke"
[19] "compression.ratio" "horsepower"        "peak.rpm"
[22] "city.mpg"          "highway.mpg"
> head(auto.spec.df)
        make fuel.type aspiration num.of.doors  body.style drive.wheels
1 Alfa-Romeo       Gas        Std          Two Convertible          Rwd
2 Alfa-Romeo       Gas        Std          Two Convertible          Rwd
3 Alfa-Romeo       Gas        Std          Two   Hatchback          Rwd
4       Audi       Gas        Std         Four       Sedan          Fwd
5       Audi       Gas        Std         Four       Sedan          Fwd
6       Audi       Gas        Std          Two       Sedan          Fwd
....
  horsepower peak.rpm city.mpg highway.mpg
1        111     5000       21          27
2        111     5000       21          27
3        154     5000       19          26
4        102     5500       24          30
5        115     5500       18          22
6        110     5500       19          25
> summary(auto.spec.df)
         make       fuel.type    aspiration   num.of.doors       body.style
 Toyota    : 32   Diesel: 20    Std  :168     Four:114      Convertible: 6
 Nissan    : 18   Gas   :185    Turbo: 37     two : 89      Hardtop    : 8
 Mazda     : 17                               NA's:  2      Hatchback  :70
 Honda     : 13                                             Sedan      :96
 Mitsubishi: 13                                             Wagon      :25
 Subaru    : 12
 (Other)   :100
....
    city.mpg        highway.mpg
 Min.   :13.00    Min.   :16.00
 1st Qu.:19.00    1st Qu.:25.00
 Median :24.00    Median :30.00
 Mean   :25.22    Mean   :30.75
 3rd Qu.:30.00    3rd Qu.:34.00
 Max.   :49.00    Max.   :54.00
From the R outputs, we see that this data set contains 205 observations on 23 variables, including the manufacturer, fuel type, body style, dimensions, horsepower, miles per gallon, and other specifications of a car. In the statistics and data mining literature, an observation is also called a record, a data point, a case, a sample, an entity, an instance, or a subject. The variables associated with an observation are also called attributes, fields, characteristics, or features. The summary() function shows basic summary information for each variable, such as the mean, median, and range of values. From the summary information, it is clear that there are two types of variables. A variable such as fuel.type or body.style has a finite number of possible values, and there is no numerical relationship among the values. Such a variable is referred to as a categorical variable. On the other hand, a variable such as highway.mpg or horsepower has continuous numerical values and is referred to as a numerical variable. Beyond the basic data summary, graphical methods can be used to show more patterns of both types of variables, as discussed in the following subsection.
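As a minimal sketch of how the two types of variables can be separated programmatically, the following R code rebuilds a small data frame from the six rows printed by head() above (only five of the 23 columns are reproduced) and splits the variable names by type. The names toy.df, is.num, num.vars, and cat.vars are introduced here for illustration only. Note that in R 4.0.0 and later, data.frame() and read.csv() keep text columns as character vectors unless stringsAsFactors = TRUE is supplied, so the sketch sets that argument explicitly to obtain factors.

```r
# Toy data frame copied from the head() output above
# (five of the 23 columns; names below are illustrative only)
toy.df <- data.frame(
  make        = c("Alfa-Romeo", "Alfa-Romeo", "Alfa-Romeo", "Audi", "Audi", "Audi"),
  fuel.type   = c("Gas", "Gas", "Gas", "Gas", "Gas", "Gas"),
  body.style  = c("Convertible", "Convertible", "Hatchback", "Sedan", "Sedan", "Sedan"),
  horsepower  = c(111, 111, 154, 102, 115, 110),
  highway.mpg = c(27, 27, 26, 30, 22, 25),
  stringsAsFactors = TRUE   # store text columns as factors (categorical)
)

# TRUE for numerical columns, FALSE for categorical (factor) columns
is.num <- sapply(toy.df, is.numeric)

num.vars <- names(toy.df)[is.num]    # numerical variables
cat.vars <- names(toy.df)[!is.num]   # categorical variables

print(num.vars)
print(cat.vars)

# possible values (levels) of each categorical variable
print(lapply(toy.df[cat.vars], levels))
```

The same idea should carry over to the full data set by replacing toy.df with auto.spec.df, provided the text columns have been read as factors.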
Note from the results of summary() that several variables in the auto_spec data set have missing values, which are represented by NA. Missing values are a common occurrence in real-world data sets, and there are various ways to handle them. If the number of observations with missing values is small, those observations might simply be omitted. To do this, we can use the R function na.omit(). From the following R code we can see that there are 205 − 197 = 8 observations with missing values in this data set. So simply removing these observations is a reasonable way to handle the missing values for this data set.
> dim(na.omit(auto.spec.df))
[1] 197 23
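Before dropping rows, it can help to see where the missing values are. The following sketch uses a small hypothetical data frame (toy.df and its values are made up for illustration and are not taken from the auto_spec data) to count the missing values in each variable with is.na() and colSums(), and to count the incomplete observations with complete.cases(). The number of rows removed by na.omit() equals the number of incomplete observations, which can be smaller than the total number of NA entries when a single row has several missing values.

```r
# Hypothetical toy data with a few missing values (NA);
# these values are illustrative and not from auto_spec.csv
toy.df <- data.frame(
  num.of.doors = c("Four", "Two", NA, "Four", "Two"),
  bore         = c(3.47, NA, 3.19, 3.19, NA),
  horsepower   = c(111, 154, 102, 115, 110)
)

# number of missing values in each variable
na.per.var <- colSums(is.na(toy.df))
print(na.per.var)

# number of observations (rows) with at least one missing value
n.incomplete <- sum(!complete.cases(toy.df))
print(n.incomplete)

# number of rows dropped by na.omit(); equals n.incomplete
n.removed <- nrow(toy.df) - nrow(na.omit(toy.df))
print(n.removed)
```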
If a significant number of observations in a data set have missing values, an alternative to simply removing observations with missing values is imputation, which is a process of replacing missing values with substituted values. A simple method of imputation is to replace missing values with a mean or median of the