Predictive features should be independent of each other.
Logistic regression requires a greater number of observations than linear regression to produce a reliable result. The rule of thumb is that you should have at least 50 observations per predictive feature if you expect to generate reliable results.
Predicting survivors on the Titanic is the classic practice problem for newcomers to learn logistic regression. You can practice it and see lots of examples of this problem worked out over on Kaggle (www.kaggle.com/c/titanic).
Ordinary least squares (OLS) regression methods
Ordinary least squares (OLS) is a statistical method that fits a linear regression line to a dataset. With OLS, you do this by squaring the vertical distance values that describe the distances between the data points and the best-fit line, adding up those squared distances, and then adjusting the placement of the best-fit line so that the summed squared distance value is minimized. Use OLS if you want to construct a function that’s a close approximation to your data.
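To make the squared-distance idea concrete, here is a minimal sketch of OLS for a single independent variable, written in plain Python. The function names and the five data points are made up for illustration; they are not from any particular library or dataset.

```python
def fit_ols_line(xs, ys):
    """Return (slope, intercept) of the line that minimizes the
    sum of squared vertical distances to the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form OLS solution for one independent variable.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_distances(xs, ys, slope, intercept):
    """Add up the squared vertical distances between points and line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data, not taken from a real dataset.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_ols_line(xs, ys)
best = sum_squared_distances(xs, ys, slope, intercept)

# Nudging the line in any direction should not lower the summed squared distance.
nudged = sum_squared_distances(xs, ys, slope + 0.1, intercept)
print(slope, intercept, best, nudged)
```

Any other placement of the line, such as the nudged one in the last lines, should yield a summed squared distance at least as large as the fitted line's.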
As always, don’t expect the actual value to be identical to the value predicted by the regression. Predicted values are simply estimates that lie as close as possible, in the least-squares sense, to the actual values in the dataset.
OLS is particularly useful for fitting a regression line to models containing more than one independent variable. In this way, you can use OLS to estimate the target from dataset features.
When using OLS regression methods to fit a regression line that has more than one independent variable, two or more of the variables may be interrelated. When two or more independent variables are strongly correlated with each other, this is called multicollinearity. Multicollinearity tends to adversely affect the reliability of the variables as predictors when they’re examined apart from one another. Luckily, however, multicollinearity doesn’t decrease the overall predictive reliability of the model when it’s considered collectively.
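One simple way to check for strongly correlated independent variables is to compute the Pearson correlation for every pair of features. The sketch below does that in plain Python. The 0.8 cutoff, the function names, and the tiny feature table are all assumptions chosen for illustration; pick a cutoff that suits your own data.

```python
from itertools import combinations
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for each feature pair with |r| >= threshold.
    The default threshold of 0.8 is an illustrative assumption."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson_r(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Made-up illustrative features: the first two measure the same thing
# in different units, so they are almost perfectly correlated.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "unrelated": [9, 6, 10, 7, 8],
}

flagged = correlated_pairs(features)
print(flagged)
```

Pairwise correlation catches only two-variable relationships. A feature that is a combination of several others can slip past this check.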
Detecting Outliers
Many statistical and machine learning approaches assume that your data has no outliers. Outlier removal is an important part of preparing your data for analysis. In this section, you see a variety of methods you can use to discover outliers in your data.
Analyzing extreme values
Outliers are data points with values that are significantly different from the majority of data points comprising a variable. It’s important to find and remove outliers because, left untreated, they skew variable distribution, make variance appear falsely high, and cause a misrepresentation of intervariable correlations.
You can use outlier detection to spot anomalies that represent fraud, equipment failure, or cybersecurity attacks. In other words, outlier detection is a data preparation method and an analytical method in its own right.
Outliers fall into the following three categories:
Point: Point outliers are data points with anomalous values compared to the normal range of values in a feature.
Contextual: Contextual outliers are data points that are anomalous only within a specific context. To illustrate, if you’re inspecting weather station data from January in Orlando, Florida, and you see a temperature reading of 23 degrees F, this would be quite anomalous because the average temperature there is 70 degrees F in January. But consider if you were looking at data from January at a weather station in Anchorage, Alaska — a temperature reading of 23 degrees F in this context isn’t anomalous at all.
Collective: Collective outliers are data points that appear near one another, all having similar values that are anomalous compared to the majority of values in the feature.
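The contextual case can be sketched by judging a reading against the history of its own context rather than against all readings pooled together. The example below is a rough illustration in plain Python: the station histories are made-up numbers, and the z-score cutoff of 3 is an assumption, not a rule from this chapter.

```python
from statistics import mean, stdev

# Made-up January temperature histories (degrees F) for two stations.
history = {
    "Orlando": [68, 72, 70, 71, 69, 70],
    "Anchorage": [20, 25, 23, 18, 27, 25],
}


def is_contextual_outlier(station, reading, cutoff=3.0):
    """Flag a reading whose z-score, measured against that station's
    own history, exceeds the cutoff (3.0 is an illustrative assumption)."""
    past = history[station]
    z = (reading - mean(past)) / stdev(past)
    return abs(z) > cutoff


# The same reading is anomalous in one context and ordinary in the other.
print(is_contextual_outlier("Orlando", 23))
print(is_contextual_outlier("Anchorage", 23))
```

The identical 23-degree reading is flagged for the Orlando history and passes for the Anchorage history, which is what makes it a contextual outlier and not a point outlier.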
You can detect outliers using either a univariate or multivariate approach, as spelled out in the next two sections.
Detecting outliers with univariate analysis
Univariate outlier detection is where you look at features in your dataset and inspect them individually for anomalous values. You can choose from two simple methods for doing this:
Tukey outlier labeling
Tukey boxplotting
Tukey boxplotting is an exploratory data analysis technique that shows the distribution of data within a numeric variable in terms of its quartiles. As you might guess, the Tukey boxplot was named after its inventor, John Tukey, an American mathematician who did most of his work back in the 1960s and 70s. Tukey outlier labeling refers to labeling as outliers the data points that lie beyond the lower and upper whiskers of a boxplot.
It is cumbersome to use the Tukey method to manually calculate, identify, and label outliers, but if you want to do it, the trick is to look at how far the minimum and maximum values are from the 25th and 75th percentiles. The distance between the 1st quartile (Q1, at 25 percent) and the 3rd quartile (Q3, at 75 percent) is called the interquartile range (IQR), and it describes the data’s spread. When you look at a variable, consider its spread, its Q1 and Q3 values, and its minimum and maximum values to decide whether the variable is suspect for outliers.
Here’s a good rule of thumb:
a = Q1 - 1.5*IQR
and
b = Q3 + 1.5*IQR.
If your minimum value is less than a, or your maximum value is greater than b, the variable probably has outliers.
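The rule of thumb above can be sketched in a few lines of Python using only the standard library. The function names and the sample readings are made up for illustration. Note that quartiles can be computed in several slightly different ways; this sketch assumes the "inclusive" method of statistics.quantiles, so other tools may give slightly different values of a and b.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (a, b) where a = Q1 - k*IQR and b = Q3 + k*IQR."""
    # n=4 splits the data into quartiles; the middle cut point is the median.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def label_outliers(values, k=1.5):
    """Return the values that fall below a or above b."""
    a, b = tukey_fences(values, k)
    return [v for v in values if v < a or v > b]


# Made-up illustrative readings with one suspiciously large value.
readings = [12, 13, 14, 15, 15, 16, 17, 18, 45]

a, b = tukey_fences(readings)
outliers = label_outliers(readings)
print(a, b, outliers)
```

For these readings, Q1 is 14 and Q3 is 17, so the IQR is 3, a is 9.5, and b is 21.5. The maximum value of 45 is greater than b, so it is labeled as an outlier.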
On the other hand, it is quite easy to generate a Tukey boxplot and spot outliers using Python or R. Each boxplot has whiskers that reach out to the most extreme data points lying within 1.5*IQR of the box (that is, no lower than Q1 - 1.5*IQR and no higher than Q3 + 1.5*IQR). Any values that lie beyond these whiskers are outliers. Figure 4-7 shows outliers as they appear within a Tukey boxplot that was generated in Python.
Credit: Python for Data Science Essential Training Part 1, LinkedIn.com
FIGURE 4-7: Spotting outliers with a Tukey boxplot.
Detecting outliers with multivariate analysis
Sometimes outliers show up only within combinations of data points from disparate variables. These outliers wreak havoc on machine learning algorithms, so it’s important to detect and remove them. You can use multivariate analysis of outliers to do this. A multivariate approach to outlier detection involves considering two or more variables at a time and inspecting them together for outliers. You can use one of several methods, including:
A scatter-plot matrix
Boxplotting
Density-based spatial clustering of applications with noise (DBSCAN), as discussed in Chapter 5
Principal component analysis (PCA, as shown in Figure 4-8)