1.6.1 Introduction to Machine Learning
Machine learning models can support disease prediction and diagnosis. Disease diagnosis applications have been developed and used extensively, especially with supervised machine learning techniques. These techniques allow models to be built from historical data, and such models have in some cases been used in diagnosis and treatment. Developing a machine learning system is not only a matter of developing learning algorithms; it involves working on the data step by step from start to finish, much as in the data mining process. For example, determining which variables are important to the solution of a problem and which are not directly affects the quality of the solution. This process, called feature selection, determines which variables will be used in the system to be built. Feature selection is often carried out by identifying the relationships between the target variable and the predictor variables.
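As an illustration of relating predictor variables to a target variable, the following minimal sketch ranks features by their Pearson correlation with the target. The correlation criterion, the threshold of 0.5, the function names, and the toy data are all assumptions made for this example; the text does not prescribe a particular selection method.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / sqrt(var_x * var_y)

def select_features(columns, target, threshold=0.5):
    """Keep the columns whose absolute correlation with the target meets the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Hypothetical toy data: two predictors and a binary target (1 = disease present).
columns = {
    "glucose": [85, 90, 150, 160, 170, 95],
    "shoe_size": [42, 38, 40, 39, 41, 40],
}
target = [0, 0, 1, 1, 1, 0]
selected = select_features(columns, target)  # only "glucose" passes the threshold
```

Correlation only captures linear relationships with a single variable at a time, so a sketch like this is a starting point rather than a complete selection procedure.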
The feature selection phase is followed by feature transformation. Data transformation, a method for improving data quality, has more recently become known as feature engineering, which covers the work done on features to increase prediction success. Like feature selection, feature engineering affects the quality of the result. Both feature selection and feature engineering also help to address the problem of high dimensionality in data. Missing data and the methods for handling it are also important. Missing values are sometimes handled by estimating the missing value and sometimes by replacing it with another value.
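One simple way to estimate missing values is to replace each one with the mean of the observed values in the same column. The sketch below assumes that missing entries are represented by None; the function name and the sample readings are hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Hypothetical blood pressure readings with two missing entries.
blood_pressure = [120, None, 130, 110, None]
filled = impute_mean(blood_pressure)  # both gaps become 120.0
```

Replacing a missing value with a fixed constant follows the same pattern, with the constant taking the place of the computed mean.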
The choice of learning method is also important when building machine learning systems. Machine learning comprises four basic learning methods: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Of the latter two, semi-supervised learning combines supervised and unsupervised techniques, while reinforcement learning is an agent-based technique in which decisions are made according to a reward mechanism; it is used to find the best possible solution to a problem about which we have no prior knowledge.
The most prominent techniques – supervised and unsupervised learning – will be explained in the following paragraphs.
1.6.2 Machine Learning Algorithms
Many machine learning algorithms have been developed to date. To understand them, the two learning methods of supervised learning and unsupervised learning (see Figure 1.2) need to be understood.
Figure 1.2 Supervised and unsupervised machine learning.
Supervised learning algorithms are widely used in machine learning and perform two basic tasks: classification and regression. The output of classification is a nominal (categorical) value, while the output of regression is usually a continuous value. Although supervised learning algorithms are often used to diagnose diseases, the field of machine learning is, as we have seen, wider and includes the additional methods mentioned above, which are used for tasks where supervised learning is insufficient.
1.6.2.1 Supervised Learning
Supervised learning consists of two basic steps: creating a model with labeled data and testing it with unlabeled data. Classification, one of the two prominent techniques in the category of supervised learning algorithms, is used when the target variable is categorical, while regression, the other prominent technique, is used when the target variable is numerical. Both operate on a model that estimates the target variable from the predictor variables. The purpose of classification is to assign records seen for the first time to one of the predefined categories. The categories are identified and modeled with the help of training data. Training data and machine learning algorithms come together to form machine learning models, which then match each record to the class that suits it best.
The most important feature that distinguishes supervised learning from unsupervised learning is the label information in the data: it is the class label that provides the supervision. Although the outputs of classification and regression differ in type, the goal in both is to estimate the value of the output variable from the input variables.
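The two steps of supervised learning can be sketched with a deliberately simple classifier. The example below assumes a single numerical feature and uses a nearest-centroid rule, which is chosen here only for brevity and is not a method described in the text; the data and names are hypothetical.

```python
def fit_centroids(features, labels):
    """Step 1: build a model from labeled data (one mean value per class)."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Step 2: assign an unseen record to the class with the closest centroid."""
    return min(model, key=lambda y: abs(x - model[y]))

# Labeled training data (hypothetical glucose readings).
train_x = [85, 90, 95, 150, 160, 170]
train_y = ["healthy", "healthy", "healthy", "diabetic", "diabetic", "diabetic"]
model = fit_centroids(train_x, train_y)

# Held-out records: the labels are hidden from the model and used only for scoring.
test_x = [88, 165]
test_y = ["healthy", "diabetic"]
predictions = [predict(model, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
```

Keeping the test records out of the fitting step is what makes the resulting accuracy a meaningful estimate of how the model behaves on records seen for the first time.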
1.6.2.1.1 Decision Trees
One of the prominent algorithms for the classification task is the decision tree. The model, represented as a tree data structure, is learned directly from the data. Through the tree induction process, the characteristics of the training data are encoded in the tree. A commonly used and familiar decision tree algorithm is C4.5 [29]. Because C4.5 can work with both numerical and categorical input variables, it can be applied to many data sets.
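Tree induction repeatedly chooses the attribute that best splits the training data. C4.5 is generally described as ranking candidate splits by gain ratio, that is, information gain divided by split information. The sketch below computes this quantity for categorical attributes only; it is not a full C4.5 implementation (no numerical thresholds, no pruning), and the attribute names and records are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of the value distribution in a sequence."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(attribute_values, labels):
    """Information gain of splitting on the attribute, divided by split information."""
    n = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted entropy of the class labels that remains after the split.
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical records: "fever" separates the classes perfectly, "gender" not at all.
labels = ["sick", "sick", "well", "well"]
fever = ["yes", "yes", "no", "no"]
gender = ["m", "f", "m", "f"]
best = max({"fever": fever, "gender": gender}.items(),
           key=lambda item: gain_ratio(item[1], labels))[0]
```

In a full tree learner the chosen attribute becomes a node, and the same selection is repeated on each partition of the data.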
1.6.2.1.2 Naive Bayes
The Naive Bayes classifier is a probability-based classifier. It estimates the posterior probabilities P(Cj | A) of the test data from the class priors P(Cj) and the conditional probabilities P(A | Cj) learned from the training data. The algorithm is based on Bayes’ theorem [30], which relates the conditional probabilities P(A | C) and P(C | A) through P(A) and P(C). To calculate P(C | A), we therefore use the equation P(C | A) = (P(A | C) P(C)) / P(A). When A consists of many attributes, estimating P(A | C) directly from the data is difficult, because many attribute combinations never occur in the training data and would receive zero probability. The naive approach addresses this by assuming that the attributes are independent of one another given the class, so that P(A | C) becomes a product of per-attribute probabilities and the computation is shortened. This makes the method relatively robust to sparsity in the data.
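The calculation can be made concrete with a small count-based sketch for categorical attributes. It estimates P(C) and each per-attribute probability from frequencies and normalises by P(A). No smoothing is applied, so an attribute value never seen with a class still yields a zero for that class; the records, labels, and function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Count class frequencies and per-attribute value frequencies within each class."""
    class_counts = Counter(labels)
    # value_counts[(attribute index, class)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    for record, label in zip(records, labels):
        for i, value in enumerate(record):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def posterior(record, class_counts, value_counts):
    """Return P(C | A) for every class under the naive independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(C)
        for i, value in enumerate(record):
            score *= value_counts[(i, label)][value] / count  # P(a_i | C)
        scores[label] = score  # P(A | C) P(C)
    evidence = sum(scores.values())  # P(A), summed over all classes
    if evidence == 0:
        raise ValueError("record has zero probability under every class")
    return {label: s / evidence for label, s in scores.items()}

# Hypothetical training records: (fever, cough) -> diagnosis.
records = [("fever", "cough"), ("fever", "cough"), ("fever", "no_cough"),
           ("no_fever", "cough"), ("no_fever", "no_cough"), ("fever", "no_cough")]
labels = ["flu", "flu", "flu", "healthy", "healthy", "healthy"]
class_counts, value_counts = train_naive_bayes(records, labels)
result = posterior(("fever", "cough"), class_counts, value_counts)
```

For the query record, the unnormalised scores are 1/2 · 1 · 2/3 = 1/3 for flu and 1/2 · 1/3 · 1/3 = 1/18 for healthy, so the posterior for flu is (1/3) / (1/3 + 1/18) = 6/7.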
1.6.2.1.3 Support Vector Machines
Support vector machines (SVMs) were first introduced by Vapnik [31]. The technique uses so-called support vectors to separate data points belonging to different classes. The method aims to find the hyperplane that best separates the classes from each other by maximizing the margin. In its simplest form, it separates two classes with the help of the two equations w^T x + b = +1 and w^T x + b = -1, which define the margin boundaries on either side of the separating hyperplane w^T x + b = 0. SVMs were first developed for linear classification; kernel functions for nonlinear problems were developed later. A kernel function implicitly maps the data into a space in which a linear separation becomes possible. Common kernel types include linear, polynomial, radial basis function, and sigmoid. Which kernel performs best depends on the nature of the data used.
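The role of the two margin equations can be illustrated without training a model. The sketch below assumes that the weight vector w and bias b are already given (the values are hypothetical, not learned), classifies points by the sign of w^T x + b, and computes the margin width 2 / ||w||. It also includes a radial basis function kernel in its common form exp(-gamma ||x - z||^2), where the default gamma is an arbitrary choice for illustration.

```python
from math import exp, sqrt

def decision_value(w, b, x):
    """Value of w^T x + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the planes w^T x + b = +1 and w^T x + b = -1."""
    return 2 / sqrt(sum(wi * wi for wi in w))

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel between two points."""
    return exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

# Hypothetical hyperplane; a real SVM would learn w and b from training data.
w, b = [1.0, 1.0], -3.0
on_positive_margin = decision_value(w, b, [2.0, 2.0])  # equals +1
on_negative_margin = decision_value(w, b, [1.0, 1.0])  # equals -1
width = margin_width(w)                                # equals sqrt(2)
```

Points whose decision value is exactly +1 or -1 lie on the margin boundaries; in a trained SVM these are the support vectors.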
1.6.2.1.4 K-Nearest Neighbor
The k-nearest neighbor (k-NN) algorithm is a distance-based classifier that looks at the neighbors of a data point to classify a data object whose class is unknown. The classification decision is made by majority vote among these neighbors. The two main parameters of the algorithm are the number of neighbors (k) and the distance function. There is no exact method for determining the number of neighbors, so a suitable k value is usually found by trial. Cosine similarity or the Manhattan, Euclidean, or Chebyshev distance can be used as the distance function. One of the problems with the k-NN algorithm is sensitivity to scale: because the method operates in geometric space, features with large numeric ranges dominate the distance. This problem is addressed through feature engineering, for example by rescaling the features.
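A minimal sketch of the procedure follows, using the Euclidean distance and min-max rescaling to reduce the scale problem. The choice of k = 3, the scaling method, and the sample records are assumptions made for this example.

```python
from collections import Counter
from math import sqrt

def min_max_scale(rows):
    """Rescale every column to [0, 1]; also return the function that scales new rows."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]

    def scale(row):
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, lows, highs)]

    return [scale(r) for r in rows], scale

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k training rows closest to the query."""
    nearest = sorted(zip(train_rows, train_labels),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]  # ties go to the label encountered first

# Hypothetical records: (age in years, cholesterol in mg/dL) -> risk group.
train_rows = [(25, 180), (30, 190), (35, 185), (60, 260), (65, 270), (70, 250)]
train_labels = ["low", "low", "low", "high", "high", "high"]
scaled_train, scale = min_max_scale(train_rows)
prediction = knn_predict(scaled_train, train_labels, scale((62, 255)), k=3)
```

The query is scaled with the minimum and maximum taken from the training data, so that training and query points are measured in the same rescaled space.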
1.6.2.1.5 Neural Nets
An artificial neural network (ANN) is a machine learning method that emulates human learning. ANNs,