“The goal is to turn data into information and information into insight.”
— Carly Fiorina
Data preparation involves four phases (also see Figure 1.4):
1 Data collection or data selection. The performance of a model depends heavily on the data it learns from, whether the task is classification, anomaly detection, or recommendation. This step involves gathering a subset of all available data from multiple sources within your organization. It is not always a good idea to include all the data that is available; the key is to select quality data. People tend to assume that more is better, but while selecting data you need to keep the problem you are working on firmly in mind. For example, suppose you need a model to predict travel-time information for a vehicle on a given expressway by predicting the expressway's future average speed.

FIGURE 1.4 Data preparation process.

This problem definition does not specify which inputs should be taken into consideration, so it is important to create a list of feasible input and output variables. To estimate the future average speed of the expressway, you need to map past data into the model. Past data should include theoretical knowledge of the transportation system, user behavior, user statistics, traffic on business days and on weekends, traffic during the day and at night, and so on. To select quality data, make some assumptions and be clear about the solution you want from your model. The process of collecting data is a genuinely important step in the ML lifecycle, and the decisions you make while collecting it can have a large effect on your model's results.
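Before any data is collected, the expressway example above can be framed as an explicit list of candidate inputs and a target. A minimal sketch in Python follows; the variable names are illustrative assumptions, not a fixed schema.

```python
# Candidate input variables for predicting future expressway speed,
# each paired with the reason it might matter (all names hypothetical).
candidate_inputs = {
    "hour_of_day": "traffic differs between daytime and nighttime",
    "is_business_day": "weekday vs. weekend traffic patterns",
    "historical_avg_speed": "past average speed on this expressway",
    "vehicle_count": "current traffic volume",
}
target = "future_avg_speed"

def frame_problem(inputs, target):
    """Return a simple problem definition mapping inputs to the target."""
    return {"inputs": sorted(inputs), "target": target}

problem = frame_problem(candidate_inputs, target)
print(problem["target"])       # future_avg_speed
print(len(problem["inputs"]))  # 4
```

Writing the candidate list down first makes the later assessment step concrete: each input can be checked for availability and quality before it is collected.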
2 Data cleansing. To cleanse your data, you first need to understand it. To gain valuable insights you need quality data, but raw data often contains incorrect or missing values, so relevant data must be sourced and cleansed. There are two key stages in selecting relevant data from raw data:

Data assessment. This task evaluates the feasibility of the available data and how well it aligns with your business problem. To understand the data, you label your examples with the appropriate category so that the machine learning model can learn to predict it. There should be sufficient data to build a model, because how much data you need depends on the algorithms you are using and the complexity of the problem.

Data exploration. This is the phase in which you test your assumptions and create meaningful summaries of your data, searching for missing values, outliers, and unbalanced data.

Data cleansing itself then deals with the issues found in the raw data, such as missing values, outliers, unbalanced data, typos, duplicate records, and so on. Let's see how data is cleaned at this stage:

Missing values. If your data has missing values, explore the reason behind them to decide whether the affected data points can be dropped entirely. If dropping them is not an option, impute substitute values in place of the missing data.

Outliers. Outliers are values that differ markedly from the other observations in the data. To deal with them, you can impute alternative values, or use algorithms that are robust to outliers, such as random forests or gradient-boosted trees.

Unbalanced data. Unbalanced data refers to classes that lack a similar number of examples. For instance, machine learning learns from examples, so to detect fraud you need enough fraud cases relative to normal cases; if your data lacks fraud cases, it will be difficult for the ML model to identify them. It is therefore necessary to balance the data, or otherwise account for the imbalance.
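The cleansing steps above can be sketched with plain Python on a small, hypothetical set of speed readings: impute missing values with the median, replace IQR outliers, and count class labels to spot imbalance. The data values and the 1.5 × IQR rule are illustrative choices, not prescribed by the text.

```python
import statistics

# Hypothetical expressway speed readings with missing values and an outlier.
raw = [12.1, 11.8, None, 12.4, 98.0, 11.9, None, 12.2]

# Missing values: impute the median of the observed readings.
observed = [x for x in raw if x is not None]
median = statistics.median(observed)
filled = [median if x is None else x for x in raw]

# Outliers: flag values outside the 1.5 * IQR fences and substitute the median.
q1, _, q3 = statistics.quantiles(filled, n=4)
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [median if not (lo <= x <= hi) else x for x in filled]

# Unbalanced data: count examples per class to check for imbalance.
labels = ["normal"] * 95 + ["fraud"] * 5
counts = {c: labels.count(c) for c in set(labels)}
print(cleaned)  # the 98.0 reading and both gaps are replaced by the median
print(counts)   # {'normal': 95, 'fraud': 5} -> heavily imbalanced
```

In practice the imputation strategy (median, mean, model-based) and the outlier rule should follow from the exploration step, not be applied blindly.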
3 Data transformation. Data transformation plays an important role in making data constructive so that it can be used in the model. It is performed to increase the probability of algorithms making precise and meaningful predictions. There are several data transformation techniques:

Categorical encoding. Machine learning models are mathematical models that work on numeric representations, but categorical data has label values, and some ML algorithms require input and output variables to be numeric. For this reason, categorical data is converted to numerical data using one of these two approaches:

Integer encoding. Also known as label encoding, this assigns an integer value to every unique category, and machine learning algorithms can make use of the resulting relationship between integer and category. For example, a qualification can be high school, college, or postgraduate. If we assign an integer to each of these categories, such as high school = 1, college = 2, postgraduate = 3, the data becomes machine-readable.

One-hot encoding. One-hot encoding is used when the categorical variable has no ordinal relationship. A binary value is assigned to each unique value, with each value separated into its own column. For example, Facebook has a relationship-status feature: engaged, married, separated, divorced, or widowed. Each status gets its own column, and an "engaged" status is assigned the value one in the "engaged" column and zero in the remaining four columns.

Dealing with skewed data. If your data is not symmetric, meaning one half of the distribution is not the mirror image of the other half, the data is considered asymmetric or skewed. To discover patterns in skewed data, you can apply a log transformation, a reciprocal transformation (positive or negative), or a Box-Cox transformation over the whole set of values, so the data can be used in a statistical model.

Bias mitigation. Bias mitigation can be done by altering current values and labels to obtain a less biased model. Some algorithms that can help in this process are reweighing, optimized preprocessing, learning fair representations, and disparate impact remover.

Scaling. When you use regression algorithms or algorithms based on Euclidean distances, you need to transform your data into a particular range. This is done by altering the values of each numerical feature to set them on a common scale, using either normalization (min-max scaling) or z-score standardization.
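The encoding, skew, and scaling transforms above can be sketched in a few lines of plain Python, using the qualification and relationship-status examples from the text (the income values are an invented illustration of a right-skewed feature):

```python
import math

# Integer (label) encoding of an ordinal feature.
qualification_order = {"high school": 1, "college": 2, "postgraduate": 3}
encoded = [qualification_order[q] for q in ["college", "high school", "postgraduate"]]

# One-hot encoding of a non-ordinal feature: one column per unique value.
statuses = ["engaged", "married", "separated", "divorced", "widowed"]
def one_hot(value):
    return [1 if s == value else 0 for s in statuses]

# Log transform to reduce right skew in hypothetical income values.
incomes = [20_000, 25_000, 30_000, 1_000_000]
log_incomes = [math.log(x) for x in incomes]

# Min-max scaling (normalization) to the common range [0, 1].
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(encoded)                # [2, 1, 3]
print(one_hot("engaged"))     # [1, 0, 0, 0, 0]
print(min_max([10, 20, 30]))  # [0.0, 0.5, 1.0]
```

Libraries such as scikit-learn provide equivalent transformers, but the underlying arithmetic is no more than what is shown here.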
4 Feature engineering. Feature engineering is the process of constructing explanatory variables and features from raw data to turn our inputs into a machine-readable format. These variables and features are used to train the model, so this step requires a clear understanding of the data. Feature engineering consists of two activities:

Feature extraction. This is done to reduce the processing resources required without losing relevant information. It combines variables into features, reducing the amount of data to process while still accurately describing the original dataset.

Capturing feature relationships. With a good understanding of your data, you can find relationships between features and thereby help your algorithm focus on what you know is important.
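Both activities can be illustrated on hypothetical trip records for the expressway example: two raw columns are combined into one extracted feature (average speed), and a relationship between two features (weekend and rush hour) is captured as a new indicator. The field names and the Monday = 0 weekday convention are assumptions for illustration.

```python
# Hypothetical raw trip records (weekday: 0 = Monday ... 6 = Sunday).
trips = [
    {"distance_km": 30.0, "minutes": 25.0, "weekday": 5, "hour": 8},
    {"distance_km": 12.0, "minutes": 40.0, "weekday": 1, "hour": 17},
]

def engineer(trip):
    return {
        # Feature extraction: two raw variables combined into one feature.
        "avg_speed_kmh": trip["distance_km"] / trip["minutes"] * 60.0,
        # Feature relationship: weekend morning traffic behaves differently,
        # so encode the interaction of weekday and hour as one indicator.
        "weekend_morning": int(trip["weekday"] >= 5 and 7 <= trip["hour"] <= 10),
    }

features = [engineer(t) for t in trips]
print(features[0]["avg_speed_kmh"])   # 72.0
print(features[0]["weekend_morning"]) # 1
```

The model now sees two informative columns instead of four raw ones, which is exactly the resource reduction feature extraction aims for.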
Select Algorithm and Model (Modeling)
With the tough part complete, that is, data selection and data pre-processing, we now move to the interesting part: modeling.
Modeling is an iterative process of creating a smart model by continuously training and testing the model until you discover the one with high accuracy and performance.
To train an ML model, you need to provide an ML algorithm with a clean training dataset to learn from. Choosing a learning algorithm depends on the problem at hand. The training data you plan to feed to the ML algorithm must contain the target attribute. ML algorithms find patterns in the training data and learn from them. The resulting model is then tested on new data to make predictions for the unknown target attribute. Let's understand this with an example.
Suppose you want to train your model to separate spam from your regular email. To do so, you provide your learning algorithm with training data containing a whitelist and a blacklist. The whitelist contains the email addresses of people you tend to receive email from; the blacklist contains the addresses of senders you want to avoid receiving email from. The ML algorithm learns from this training data and predicts whether a new mail belongs with the blacklist or the whitelist. If it belongs with the blacklist, the mail is automatically labeled as spam.
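The whitelist/blacklist idea above can be sketched as a lookup. Note this is a deliberate simplification for illustration: a real spam filter learns patterns from labeled mail and generalizes to unseen senders, whereas this sketch only labels addresses it has already seen. All addresses are invented.

```python
# "Training data": known senders sorted into two lists (hypothetical).
whitelist = {"friend@example.com", "boss@example.com"}
blacklist = {"promo@spam.example", "lottery@spam.example"}

def label_mail(sender):
    """Label a new mail by its sender; a learned model would generalize
    to unknown senders instead of returning 'unknown'."""
    if sender in blacklist:
        return "spam"
    if sender in whitelist:
        return "inbox"
    return "unknown"

print(label_mail("promo@spam.example"))  # spam
print(label_mail("friend@example.com"))  # inbox
print(label_mail("new@example.org"))     # unknown
```

The gap exposed by the "unknown" case is precisely what motivates learning from examples rather than hard-coding rules.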
To create an effective model, it is important to select an accurate algorithm that can find predictable, repeatable patterns. On the one hand, some problems that need ML are very specific and require a unique approach; on the other hand, some problems need a trial-and-error approach.
Machine learning algorithms are divided into four main types (see Figure