To find a weak rule, we apply a base learning (ML) algorithm repeatedly, each time with a different distribution over the training data. Each application generates a new weak prediction rule. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.
The distribution is chosen as follows: (i) The base learner starts from the full training set and assigns equal weight, or attention, to each observation. (ii) If the first base learner makes prediction errors, we pay greater attention to the observations it predicted incorrectly and then apply the next base learner. (iii) Repeat step (ii) until the limit on the number of base learners is reached or the desired accuracy is achieved.
Finally, boosting combines the outputs from weak learners and creates a strong learner, which eventually improves the prediction power of the model. Boosting pays greater attention to examples that are misclassified or have higher errors generated by preceding weak learners.
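The reweighting steps described above can be sketched in a few lines of Python. This is a minimal illustration that assumes an AdaBoost-style weight update with one-dimensional decision stumps as the weak learners; the text does not name a specific algorithm, and the function names (`fit_stump`, `boost`, `predict`) and the toy data are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak rule: predict `polarity` when x <= threshold, otherwise -polarity."""
    return polarity if x <= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def boost(xs, ys, n_rounds):
    """Combine n_rounds weak stumps into one weighted-vote classifier."""
    n = len(xs)
    weights = [1.0 / n] * n                          # step (i): equal attention
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # vote of this weak rule
        ensemble.append((alpha, threshold, polarity))
        # step (ii): misclassified points (y * prediction == -1) gain weight
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Strong rule: sign of the weighted vote of all weak rules."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data (hypothetical): no single threshold separates the two classes
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
one_round = [predict(boost(xs, ys, 1), x) for x in xs]
three_rounds = [predict(boost(xs, ys, 3), x) for x in xs]
```

On this toy set a single stump misclassifies at least two points, whereas the weighted vote of three reweighted stumps classifies all six correctly.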
There are many boosting algorithms that enhance a model’s accuracy. Next, we will present more details about the two most commonly used algorithms: Gradient Boosting (GBM) and XGBoost.
GBM versus XGBoost:
The standard GBM implementation has no regularization, whereas XGBoost does, which helps it reduce overfitting.
XGBoost is also known as a “regularized boosting” technique.
XGBoost implements parallel processing and is much faster than GBM.
XGBoost also supports implementation on Hadoop.
XGBoost allows users to define custom optimization objectives and evaluation criteria, which makes the model considerably more flexible.
XGBoost has an in‐built routine to handle missing values.
The user is required to supply a value that is different from other observations and pass that as a parameter. When XGBoost encounters a missing value at a node, it tries each branch and learns which path to take for missing values in the future.
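The idea of learning a default path for missing values can be illustrated as follows. This is only a sketch under stated assumptions: it scores a split by the reduction in squared error, which stands in for XGBoost's own gain formula, and the names `sse`, `split_gain`, and `best_default_direction`, as well as the data, are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_gain(left, right):
    """Reduction in squared error obtained by splitting a node in two."""
    return sse(left + right) - (sse(left) + sse(right))

def best_default_direction(xs, ys, threshold):
    """Send rows with a missing feature (None) left, then right,
    and return the direction with the larger gain together with that gain."""
    left = [y for x, y in zip(xs, ys) if x is not None and x < threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x >= threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]
    gain_left = split_gain(left + missing, right)
    gain_right = split_gain(left, right + missing)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right

# Toy data (hypothetical): the rows with a missing feature have targets
# that resemble those of the right-hand branch
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [1.0, 1.2, 5.1, 5.0, 5.2, 4.9]
direction, gain = best_default_direction(xs, ys, threshold=5.0)
```

Here the missing rows are routed to the right branch, because grouping them with the similar targets there leaves less error than sending them left.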
A GBM stops splitting a node as soon as a split yields a negative gain. Thus, it is more of a greedy algorithm.
XGBoost, on the other hand, makes splits up to the maximum depth specified and then prunes the tree backward, removing splits beyond which there is no positive gain. The advantage is that a split with a negative gain, say −2, may be followed by a split with a positive gain, say +10. GBM would stop as soon as it encounters the −2. XGBoost, however, goes deeper, sees the combined effect of +8 for the two splits, and keeps both.
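The −2/+10 example can be reproduced with a toy tree. The sketch below is a simplified illustration and not the code of either library: each node is a hypothetical dictionary holding the gain of its split and its children, with `None` standing for a leaf.

```python
def greedy_gain(node):
    """Stop early: refuse any split whose own gain is not positive."""
    if node is None or node["gain"] <= 0:
        return 0
    return node["gain"] + sum(greedy_gain(child) for child in node["children"])

def pruned_gain(node):
    """Grow first, prune backward: keep a split only if it and its
    surviving descendants together give a positive total gain."""
    if node is None:
        return 0
    total = node["gain"] + sum(pruned_gain(child) for child in node["children"])
    return max(total, 0)

# A split with gain -2 whose right child can be split again with gain +10
tree = {
    "gain": -2,
    "children": [None, {"gain": 10, "children": [None, None]}],
}

early_stop_total = greedy_gain(tree)   # 0: the -2 split is never made
prune_total = pruned_gain(tree)        # 8: both splits are kept
```

The early-stopping rule never sees the +10 split because it rejects its parent, whereas backward pruning evaluates the pair and keeps the net gain of +8.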
XGBoost allows the user to run cross‐validation at each iteration of the boosting process, so the optimum number of boosting iterations can be obtained in a single run. This is unlike GBM, where we have to run a grid search and only a limited set of values can be tested.
The user can start training an XGBoost model from the last iteration of a previous run. This can be a significant advantage in certain applications. The GBM implementation in sklearn also has this feature, so the two are evenly matched in this respect.
GBM in R and Python: We start with the overall pseudocode of the GBM algorithm for two classes:
1 Initialize the outcome.
2 Iterate from 1 to total number of trees.
2.1 Update the weights for targets based on previous run (higher for the ones misclassified).
2.2 Fit the model on selected subsample of data.
2.3 Make predictions on the full set of observations.
2.4 Update the output with current results taking into account the learning rate.
3 Return the final output.
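The pseudocode above can be turned into a small runnable sketch. It makes several assumptions that are not in the text: the loss is the logistic loss, the role of the "weights" is played by the residuals y − p (largest for the observations predicted worst), the weak learner is a one-dimensional regression stump, and the subsampling of step 2.2 is omitted so that the result is deterministic. All function names and the data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Least-squares regression stump: (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:            # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def stump_value(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def gbm_fit(xs, ys, n_trees=100, learning_rate=0.5):
    base_rate = sum(ys) / len(ys)
    initial = math.log(base_rate / (1 - base_rate))   # step 1: initial log-odds
    scores = [initial] * len(xs)
    stumps = []
    for _ in range(n_trees):                          # step 2
        # 2.1: residuals are largest for the observations predicted worst
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals)              # 2.2 (no subsampling here)
        # 2.3 and 2.4: predict on all observations, scale by the learning rate
        scores = [s + learning_rate * stump_value(stump, x)
                  for x, s in zip(xs, scores)]
        stumps.append(stump)
    return initial, learning_rate, stumps

def gbm_predict_proba(model, x):
    initial, learning_rate, stumps = model
    score = initial + learning_rate * sum(stump_value(s, x) for s in stumps)
    return sigmoid(score)                             # step 3: final output

# Toy two-class data (hypothetical): class 1 sits between two groups of class 0
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 1, 0, 0]
model = gbm_fit(xs, ys)
labels = [1 if gbm_predict_proba(model, x) >= 0.5 else 0 for x in xs]
```

A smaller learning rate makes each tree contribute less, so more trees are needed to reach the same fit; this is the trade-off behind the `shrinkage` and `n.trees` settings of the gbm package.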
GBM in R:
library(caret)
# 10-fold cross-validation
fitControl <- trainControl(method = "cv", number = 10)
# Tuning grid with a single parameter combination
tune_Grid <- expand.grid(interaction.depth = 2, n.trees = 500, shrinkage = 0.1, n.minobsinnode = 10)
set.seed(825)
# y_train is assumed to be the outcome column of the data frame train
fit <- train(y_train ~ ., data = train, method = "gbm", trControl = fitControl, verbose = FALSE, tuneGrid = tune_Grid)
# Predicted probability of the second class
predicted <- predict(fit, test, type = "prob")[, 2]
For GBM and XGBoost in Python, see [2].
2.1.7 Support Vector Machine
Support vector machine (SVM) is a supervised ML algorithm that can be used for both classification and regression challenges [48, 49]. However, it is mostly used in classification problems. In the SVM algorithm, we plot each data item as a point in n‐dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that differentiates the two classes very well, as illustrated in Figure 2.5.
Support vectors are the observations that lie closest to the separating hyperplane and therefore determine its position. The SVM classifier is the frontier that best segregates the two classes (the hyperplane in the upper part of Figure 2.5 or the line in the lower part). The next question is how to identify the right hyperplane. As a rule of thumb, we select the hyperplane that best segregates the two classes.
If multiple choices are available, choose the one that maximizes the distance between the hyperplane and the nearest data point of either class. This distance is called the margin.
Figure 2.5 Data classification.
Figure 2.6 Classification with outliers.
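The margin of a candidate hyperplane can be computed directly from its definition. The sketch below assumes a hyperplane written as w · x + b = 0; the points and the two candidate hyperplanes are hypothetical, and the hyperplanes are picked by hand rather than learned by an SVM.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points):
    """Distance from the hyperplane to the nearest point of either class."""
    return min(distance_to_hyperplane(w, b, x) for x in points)

# Two classes in the plane (hypothetical data)
class_a = [(1.0, 1.0), (2.0, 1.0)]
class_b = [(4.0, 4.0), (5.0, 3.0)]
points = class_a + class_b

# Both candidates separate the classes; the first leaves a wider margin
margin_diagonal = margin((1.0, 1.0), -5.5, points)   # line x1 + x2 = 5.5
margin_vertical = margin((1.0, 0.0), -2.5, points)   # line x1 = 2.5
```

Under the rule of thumb above, the diagonal line is preferred, since its nearest point is farther away than the nearest point of the vertical line.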
If the two classes cannot be separated by a straight line because an element of one class lies in the territory of the other class as an outlier (see Figure 2.6), the SVM algorithm can ignore such outliers and still find the hyperplane that has the maximum margin. In this sense, SVM classification is robust to outliers.
In some scenarios, no linear hyperplane can separate the two classes, because a linear classifier is not expressive enough. In such cases, SVM maps the data into a richer feature space that includes nonlinear features and then constructs a hyperplane in that space, so that all other equations remain the same (Figure 2.7). Formally, it preprocesses the data with x → Φ(x) and then learns the map from Φ(x) to y as follows: f(x) = w · Φ(x) + b.
Figure 2.7 Classifiers with nonlinear transformations.
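A small example shows why mapping into a richer feature space helps. The sketch assumes the feature map Φ(x) = (x, x²) and a hand-picked hyperplane in the feature space; neither comes from the text, no SVM is trained, and all names and data are hypothetical.

```python
def phi(x):
    """Hypothetical feature map: add the nonlinear feature x^2."""
    return (x, x * x)

def linear_classify(w, b, features):
    """Sign of w . features + b, i.e. the side of the hyperplane."""
    score = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1 if score >= 0 else -1

def separable_by_threshold(xs, ys):
    """True if a single cut on the original 1-D axis classifies every point."""
    ordered = sorted(xs)
    cuts = ([ordered[0] - 1]
            + [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
            + [ordered[-1] + 1])
    for cut in cuts:
        for sign in (1, -1):
            if all((sign if x > cut else -sign) == y for x, y in zip(xs, ys)):
                return True
    return False

# 1-D data (hypothetical): the positive class lies on both sides of the negative one
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [1, 1, -1, -1, -1, 1, 1]

linear_in_input_space = separable_by_threshold(xs, ys)
# In the feature space the hyperplane x^2 = 2.5 separates the classes
mapped_predictions = [linear_classify((0.0, 1.0), -2.5, phi(x)) for x in xs]
```

No threshold on x alone works for these points, but after the mapping a single linear rule on (x, x²) classifies all of them.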
The e1071 package in R is used to create SVMs.
It has helper functions as well as code for the naive Bayes classifier. The creation of an SVM in R and Python follows similar approaches:
# Import library
library(e1071)  # contains the svm() function
# Read the training and test sets (a file chooser opens for each)
Train <- read.csv(file.choose())
Test <- read.csv(file.choose())
# Various options are available for SVM training, such as changing the kernel, gamma, and C (cost) value.
# Create model (Target is assumed to be a factor; gamma is not used by the linear kernel)
model <- svm(Target ~ Predictor1 + Predictor2 + Predictor3, data = Train, kernel = 'linear', gamma = 0.2, cost = 100)
# Predict output
preds <- predict(model, Test)
table(preds)
2.1.8