Exploratory analysis is an important way of getting to know your data informally. It yields initial insights that can then be put to use throughout the analytical modeling stage. Various plots are useful here, such as bar charts, pie charts, and scatter plots. A next step is to summarize the data with descriptive statistics, each of which provides information about one particular characteristic of the data. Hence, they should be assessed together (i.e., in support and completion of each other). Basic descriptive statistics are the mean and median values of continuous variables; the median is less sensitive to extreme values but, by the same token, provides less information about the full distribution. Complementary to the mean, the variance or the standard deviation indicates how widely the data are spread around the mean. Likewise, percentile values such as the 10th, 25th, 75th, and 90th percentiles complement the median with further information about the distribution. For categorical variables, other measures are needed, such as the mode (the most frequently occurring value).
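The descriptive statistics discussed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names (`describe_continuous`, `describe_categorical`) and the income and housing data are hypothetical and chosen purely for the example.

```python
import statistics
from collections import Counter


def describe_continuous(values):
    """Summarize a continuous variable: location, spread, and percentiles."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99,
    # so the k-th percentile sits at index k - 1.
    pct = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "p10": pct[9],
        "p25": pct[24],
        "p75": pct[74],
        "p90": pct[89],
    }


def describe_categorical(values):
    """Return the mode and the share of observations that take it."""
    value, count = Counter(values).most_common(1)[0]
    return {"mode": value, "share": count / len(values)}


# Hypothetical data: monthly incomes with one extreme value.
incomes = [1800, 2100, 2300, 2500, 2700, 3000, 3200, 3600, 4100, 50000]
summary = describe_continuous(incomes)

# Hypothetical categorical variable.
housing = ["owner", "renter", "renter", "parents", "renter"]
housing_summary = describe_categorical(housing)

print(summary)
print(housing_summary)
```

In this made-up sample, the single extreme income pulls the mean (7530) well above the median (2850), which is why the two measures are best read together with the percentiles.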
Missing values (see Table 2.1) can occur for various reasons. The information can be nonapplicable: for example, when modeling the amount of fraud, this information is only available for the fraudulent accounts and not for the nonfraudulent accounts, since it is not applicable there (Baesens et al. 2015). The information can also be undisclosed, for example, when a customer decides not to disclose his or her income for privacy reasons. Missing data can also originate from an error during merging (e.g., typos in a name or ID). Missing values can be very meaningful from an analytical perspective since they may indicate a particular pattern. As an example, a missing value for income could imply unemployment, which may in turn be related to default or churn. Some analytical techniques (e.g., decision trees) can deal with missing values directly. Other techniques need some additional preprocessing. Popular missing value handling schemes are removal of the observation or variable, and replacement (e.g., by the mean or median for continuous variables and by the mode for categorical variables).
Table 2.1 Missing Values in a Dataset
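As a rough sketch of the replacement scheme described above, the following Python snippet imputes the median for a continuous variable and the mode for a categorical one. The helper names (`impute`, `missing_indicator`) and the sample values are hypothetical. The extra indicator column is not part of the schemes listed in the text; it is an assumption added here as one possible way to keep the potentially meaningful fact that a value was missing.

```python
import statistics
from collections import Counter


def impute(values, strategy):
    """Replace None entries by the median or the mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def missing_indicator(values):
    """Flag which entries were missing (1) before imputation (assumed extra step)."""
    return [1 if v is None else 0 for v in values]


# Hypothetical columns with missing entries encoded as None.
income = [1800, None, 2400, 3000, None]
marital_status = ["single", "married", None, "married", "married"]

income_missing = missing_indicator(income)
income_imputed = impute(income, "median")
marital_imputed = impute(marital_status, "mode")

print(income_imputed, income_missing)
print(marital_imputed)
```

Removing observations instead would amount to filtering out every row that contains a `None`, at the cost of discarding the remaining information in that row.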
Outliers are extreme observations that are very dissimilar to the rest of the population. Two types of outliers can be considered: valid observations (e.g., the salary of the boss is €1,000,000) and invalid observations (e.g., age is 300 years). Two important steps in dealing with outliers are detection and treatment. A first obvious check for outliers is to calculate the minimum and maximum values for each of the data elements. Various graphical tools can also be used to detect outliers, such as histograms, box plots, and scatter plots. Some analytical techniques (e.g., decision trees) are fairly robust with respect to outliers. Others (e.g., linear/logistic regression) are more sensitive to them. Various schemes exist to deal with outliers; which one is appropriate depends heavily on whether the outlier represents a valid or an invalid observation. For invalid observations (e.g., age is 300 years), one could treat the outlier as a missing value by using any of the schemes (i.e., removal or replacement) mentioned in the previous section. For valid observations (e.g., income is €1,000,000), other schemes are needed, such as capping, whereby lower and upper limits are defined for each data element and values beyond them are set to the nearest limit.
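The detection and capping steps can be illustrated with a short Python sketch. The text does not prescribe how the lower and upper limits are chosen; the snippet below assumes box-plot style fences (first and third quartile minus or plus 1.5 times the interquartile range), which is only one of several possible choices. The function name `cap` and the income values are hypothetical.

```python
import statistics


def cap(values, k=1.5):
    """Cap values at Q1 - k*IQR and Q3 + k*IQR (an assumed choice of limits)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values outside [lower, upper] are moved to the nearest limit.
    capped = [min(max(v, lower), upper) for v in values]
    return capped, lower, upper


# Hypothetical incomes containing one valid but extreme observation.
incomes = [1800, 2100, 2300, 2500, 2700, 3000, 3200, 3600, 4100, 1_000_000]

# Detection: a first check on the minimum and maximum values.
print("min:", min(incomes), "max:", max(incomes))

# Treatment: cap the variable at the computed limits.
capped, lower, upper = cap(incomes)
print("limits:", lower, upper)
print("capped:", capped)
```

Only the extreme income is changed here; all other observations already lie within the limits and are left as they are.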
A popular technique for reducing dimensionality, studying linear correlations, and visualizing complex datasets is principal component analysis (PCA). This technique has been known since the beginning of the last century (Jolliffe 2002), and it is based on the concept of constructing an uncorrelated, orthogonal basis of the original dataset.
Throughout this section, we will assume that the observation matrix $X$ is normalized to zero mean, so that $E(X) = 0$. We do this so that the covariance matrix of $X$ is equal to $X^T X$ up to the constant factor $1/(n-1)$, where $n$ is the number of observations. If the matrix is not centered, the only consequence is that the calculations carry an extra (constant) term, so assuming a centered dataset simplifies the analyses.
The idea behind PCA is simple: is it possible to enclose our data in an ellipsoid? If so, what would that ellipsoid look like? We would like four properties to hold:
1. Each principal component should capture as much variance as possible.
2. The variance that each principal component captures should decrease in each step.
3. The transformation should respect the distances between the observations and the angles that they form (i.e., should be orthogonal).
4. The coordinates should not be correlated with each other.
The answer to these questions lies in the eigenvectors and eigenvalues of the data matrix or, more precisely, of its covariance matrix $X^T X$. Because $X^T X$ is symmetric, its eigenvectors form an orthogonal basis: each one is orthogonal to the others and, from a statistical point of view, the coordinates they define are uncorrelated with each other. The order of the components comes from a property of the covariance matrix $X^T X$: if the eigenvectors are ordered by decreasing eigenvalue of $X^T X$, then the highest eigenvalue is associated with the coordinate that represents the most variance. Another useful property, shown below, is that the eigenvalues of $X^T X$ are equal to the squares of the singular values of $X$ (the counterpart of eigenvalues for a matrix that need not be square), and that the eigenvectors of $X^T X$ are the right singular vectors of $X$. This simplifies our analyses, as finding the orthogonal basis of $X$ is the same as finding the orthogonal basis of $X^T X$.
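These properties can be checked numerically. The sketch below assumes NumPy is available and uses synthetic data generated purely for illustration. It computes the eigen-decomposition of $X^T X$ for a centered matrix, orders the eigenvectors by decreasing eigenvalue, and compares the eigenvalues with the squared singular values of $X$ returned by `np.linalg.svd`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (hypothetical) data: 200 observations of 3 correlated variables.
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 1.0, 0.5],
                   [0.0, 1.0, -1.0]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

# Center each column so that X has zero mean.
X = X - X.mean(axis=0)

# Eigen-decomposition of the symmetric matrix X^T X.
# eigh returns eigenvalues in ascending order, so reorder to descending.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], eigvecs[:, order]

# Coordinates of the observations in the eigenvector basis.
P = X @ V

# Singular values of X, for comparison with the eigenvalues of X^T X.
singular_values = np.linalg.svd(X, compute_uv=False)

# Share of the total variance captured by each component.
explained = eigvals / eigvals.sum()

print("eigenvalues of X^T X:   ", eigvals)
print("squared singular values:", singular_values**2)
print("share of variance:      ", explained)
```

In this sketch, `P.T @ P` is diagonal up to rounding error, which reflects the requirement that the new coordinates be uncorrelated, and `V.T @ V` is the identity, which reflects the orthogonality of the basis.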
The principal component transformation of $X$ then calculates a new matrix $P$ from the eigenvectors of $X^T X$ (the right singular vectors of $X$). If $V$ is the matrix whose columns are these eigenvectors, then the transformation calculates the new matrix $P = XV$. The question is how to calculate this orthogonal basis in an efficient way.
The singular value decomposition (SVD) of the original dataset $X$ is the most efficient method of obtaining its principal components. The idea of the SVD is to decompose the dataset (matrix) $X$ into a set of three matrices, $U$, $D$, and $V$, such that $X = U D V^T$, where $V^T$ is the transpose of the matrix $V$, and $U$ and $V$ are unitary matrices, so $U^T U = V^T V = I$. The matrix $D$ is a diagonal matrix whose elements $d_i$ are the singular values of $X$.
Now we can calculate the principal component transformation $P$ of $X$. If $X = U D V^T$, then we can calculate the expression $X^T X = V D U^T U D V^T = V D^2 V^T$, and identifying terms we can see that the matrix $V$ is composed of the eigenvectors of $X^T X$, and that the singular values of $X$ in $D$ are equal to the square roots of the eigenvalues of $X^T X$, which form $D^2$, as we previously stated. Thus,