Bioinformatics and Medical Applications

Author: Group of authors
Publisher: John Wiley & Sons Limited
isbn: 9781119792659
a product may be regarded as an apple if it is red in color, round in shape, and about 10 cm wide.

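      Let D be the training dataset, y the class variable, and X the attributes. Then, according to Bayes' theorem,

$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}$$

      where

$$X = (x_1, x_2, \ldots, x_n)$$

      Substituting for X and applying the chain rule, together with the assumption that the attributes are conditionally independent given the class, we get

$$P(y \mid x_1, \ldots, x_n) = \frac{P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_n \mid y)\, P(y)}{P(x_1, \ldots, x_n)}$$

      Since the denominator is the same for every class, it can be dropped, leaving a proportionality:

$$P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

      Therefore, to find the class y with the highest probability, we use the following function:

$$\hat{y} = \arg\max_{y}\, P(y) \prod_{i=1}^{n} P(x_i \mid y)$$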
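
      The decision rule described above (choose the class y that maximizes P(y) multiplied by the product of the P(x_i | y) terms) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes categorical attributes, estimates probabilities by simple counting without smoothing, and uses a hypothetical toy dataset of (color, shape) pairs. The function names are chosen for this sketch only.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Estimate P(y) and P(x_i | y) by counting (categorical features only)."""
    class_counts = Counter(labels)
    # feature_counts[(i, y)][v] = number of class-y rows whose i-th feature is v
    feature_counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, y)][value] += 1
    return class_counts, feature_counts


def predict_naive_bayes(model, row):
    """Return the class y that maximizes P(y) * prod_i P(x_i | y)."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for y, n_y in class_counts.items():
        score = n_y / total  # prior P(y)
        for i, value in enumerate(row):
            # Likelihood P(x_i | y); no smoothing, so unseen values give 0.
            score *= feature_counts[(i, y)][value] / n_y
        if score > best_score:
            best_label, best_score = y, score
    return best_label


# Hypothetical toy data: (color, shape) -> fruit
rows = [("red", "round"), ("red", "round"), ("yellow", "long"), ("yellow", "round")]
labels = ["apple", "apple", "banana", "banana"]
model = train_naive_bayes(rows, labels)
prediction = predict_naive_bayes(model, ("red", "round"))
```

      In practice a smoothing term (for example, Laplace smoothing) is usually added so that a single unseen attribute value does not force a class score to zero; it is omitted here to keep the sketch close to the formula.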

      Some of the advantages of the Naive Bayes algorithm are as follows:

       • Easy to implement.

       • Requires only a limited amount of training data to estimate its parameters.

       • High computational efficiency.

      However, there are some disadvantages too, as follows:

       • It assumes that all attributes are independent and equally important, which rarely holds in real applications.

       • It tends toward bias as the number of training sets increases.

      K-means, an unsupervised algorithm, iteratively partitions the dataset into K predefined, non-overlapping clusters such that each data point belongs to exactly one cluster. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters so that the sum of the squared distances between the data points and their cluster's centroid is minimized. The less variation there is within clusters, the more homogeneous the data points in the same cluster are.
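
      The iterative procedure can be sketched as follows. This is a minimal illustration in plain Python, assuming the standard alternation of an assignment step and a centroid-update step, with initial centroids picked at random from the data. The function names and the toy points are hypothetical and not taken from the source.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=100, seed=0):
    """Partition points into k clusters; return (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random points
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to exactly one (the nearest) centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the procedure has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


def within_cluster_sum_of_squares(points, centroids, assignment):
    """The quantity K-means tries to minimize."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))


# Hypothetical toy data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, assignment = k_means(points, k=2)
wcss = within_cluster_sum_of_squares(points, centroids, assignment)
```

      A lower within-cluster sum of squares corresponds to more homogeneous clusters, which is the sense in which the text speaks of less variation inside a cluster.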

      1.3.7 Ensemble Method

Schematic illustration of the ensemble methods.

      1.3.7.1 Bagging

      Bagging, or bootstrap aggregation, assigns equal weight to each model in the ensemble. It trains each model of the ensemble separately on a random subset of the training data in order to promote diversity among the models. The subsets are drawn with replacement, so they differ from one another. Random Forest is a classical example of the bagging technique, in which multiple randomized decision trees are combined to achieve high accuracy.
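
      The sampling step of bagging can be sketched as follows. This is a minimal illustration, assuming each bootstrap sample has the same size as the original data. The names bootstrap_samples and fit_bagged_models, and the fit callable, are hypothetical and used here only for illustration.

```python
import random


def bootstrap_samples(data, n_models, seed=0):
    """Draw one bootstrap sample (same size as data, with replacement) per model."""
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(n_models)]


def fit_bagged_models(data, fit, n_models, seed=0):
    """Train each ensemble member separately on its own bootstrap sample.

    `fit` is a hypothetical callable that takes a list of records and returns
    a trained model; any learner with that shape would do.
    """
    return [fit(sample) for sample in bootstrap_samples(data, n_models, seed)]


records = list(range(10))  # stand-in for training records
samples = bootstrap_samples(records, n_models=3)
```

      Because sampling is done with replacement, a record may appear several times in one sample and not at all in another, which is what makes the trained models differ.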

      1.3.7.2 Boosting

      The term “Boosting” refers to a family of algorithms that convert weak learners into a strong learner. It is an ensemble technique for improving the predictions of any given learning algorithm. It trains weak learners sequentially, each attempting to correct its predecessor. Three kinds of boosting are in common use: AdaBoost, which assigns more weight to the incorrectly classified data that is passed on to the next model; Gradient Boosting, which fits each new predictor to the residual errors made by the previous predictor; and Extreme Gradient Boosting, which overcomes drawbacks of Gradient Boosting by using parallelization, distributed computing, out-of-core computing, and cache optimization.
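
      The reweighting idea behind AdaBoost can be sketched as a single round of weight updates. The source gives no formulas, so this sketch assumes the classic binary AdaBoost update, in which the learner's weight is alpha = 0.5 * ln((1 - error) / error) and the sample weights sum to 1. The function name and the toy predictions are hypothetical.

```python
import math


def adaboost_reweight(weights, predictions, labels):
    """One AdaBoost round: return (alpha, new_weights) for a weak learner.

    Assumes `weights` sums to 1 and uses the classic binary AdaBoost formulas.
    """
    # Weighted error: total weight of the misclassified samples.
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples are scaled up, correctly classified ones down.
    new_weights = [
        w * math.exp(alpha if p != y else -alpha)
        for w, p, y in zip(weights, predictions, labels)
    ]
    total = sum(new_weights)
    return alpha, [w / total for w in new_weights]


# Hypothetical round: four equally weighted samples, the last one misclassified
weights = [0.25, 0.25, 0.25, 0.25]
predictions = [1, 1, -1, -1]
labels = [1, 1, -1, 1]
alpha, new_weights = adaboost_reweight(weights, predictions, labels)
```

      After this round the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.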

      1.3.7.3 Stacking

      Stacking uses a meta-learning algorithm to learn how best to combine the predictions of two or more base algorithms. A stacking model is a two-level architecture: the Level 0 models are referred to as base models, and the Level 1 model is referred to as the meta-model. The meta-model is trained on predictions made by the base models on out-of-sample data. The outputs of the base models that serve as input to the meta-model may be real values in the case of regression and probability values in the case of classification. A standard way to prepare the training dataset for the meta-model is k-fold cross-validation of the base models.
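
      The k-fold preparation of the meta-model's training data can be sketched as follows. This is a minimal illustration, assuming contiguous folds and a hypothetical interface in which each base learner is a function fit(X_train, y_train) that returns a predict(x) function. The two toy base learners (a mean predictor and a one-dimensional nearest-neighbor predictor) are stand-ins chosen for this sketch.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predictions(X, y, base_fitters, k=3):
    """Build meta-model inputs from base-model predictions on held-out data."""
    n = len(X)
    meta_features = [[None] * len(base_fitters) for _ in range(n)]
    for fold in k_fold_indices(n, k):
        held_out = set(fold)
        X_train = [X[i] for i in range(n) if i not in held_out]
        y_train = [y[i] for i in range(n) if i not in held_out]
        for j, fit in enumerate(base_fitters):
            predict = fit(X_train, y_train)  # base model never sees the fold
            for i in fold:
                meta_features[i][j] = predict(X[i])
    return meta_features


def fit_mean(X_train, y_train):
    """Toy base model: always predicts the mean training target."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


def fit_nearest_neighbor(X_train, y_train):
    """Toy base model: predicts the target of the nearest training point (1-D)."""
    def predict(x):
        nearest = min(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
        return y_train[nearest]
    return predict


# Hypothetical regression data
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
meta_X = out_of_fold_predictions(X, y, [fit_mean, fit_nearest_neighbor], k=3)
```

      Each row of meta_X holds one prediction per base model for a record that those models did not see during training; the Level 1 meta-model would then be fitted on meta_X against y.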

      1.3.7.4 Majority Vote

      Each model makes a prediction (casts a vote) for each test instance, and the final output prediction is the one that receives more than half of the votes. Suppose that for a specific classification problem we are given three different classification rules, c1(X), c2(X), and c3(X). We combine these rules by majority voting as

$$C(X) = \operatorname{mode}\{c_1(X),\, c_2(X),\, c_3(X)\}$$
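
      Majority voting over three classification rules can be sketched as follows. This is a minimal illustration: the three threshold rules c1, c2, and c3 are hypothetical stand-ins for trained classifiers, and the function names are chosen for this sketch only.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(classifiers, x):
    """Let every classifier vote on x and return the winning label."""
    return majority_vote([c(x) for c in classifiers])


# Hypothetical binary rules standing in for c1(X), c2(X), c3(X)
c1 = lambda x: int(x > 2)
c2 = lambda x: int(x > 4)
c3 = lambda x: int(x > 6)

label_for_5 = ensemble_predict([c1, c2, c3], 5)  # votes: 1, 1, 0
label_for_3 = ensemble_predict([c1, c2, c3], 3)  # votes: 1, 0, 0
```

      With an odd number of binary classifiers a tie cannot occur; with more classes or an even number of voters a tie-breaking rule would have to be chosen.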

      1.4.1 Experiment and Analysis

      Our proposed method, the Naive Bayes multi-model decision-making system, uses a majority-voting ensemble that combines Naive Bayes, Decision Tree, and Random Forest for analytics on a database of heart disease patients, and it attains an accuracy that outperforms any of the individual methods. Additionally, it uses K-means together with the above methods to further increase the accuracy.

      The data come from the Kaggle dataset for cardiovascular disease, which contains 12 attributes. The presence or absence of cardiovascular disease is recorded in the target column, a binary variable whose values 0 and 1 indicate absence and presence, respectively. There are 70,000 records in total, with attributes for age, height, weight, gender, systolic and diastolic blood pressure, cholesterol, glucose, smoking, alcohol intake, and physical activity.

      The data are divided into training and testing sets in the ratio 70:30. We also took the data in chunks of 1,000, 5,000, 10,000, 50,000, and 70,000 records, respectively, and observed the change in patterns. During training and testing, we tried the following combinations to see their effect on the accuracy of predictions:

       • NB: Only the Naive Bayes algorithm is applied.

       • DT: Only the Decision Tree algorithm is applied.

       • RF: Only the Random Forest algorithm is applied.

       • Serial: