1.5 Data Warehouse
A data warehouse collects data from multiple heterogeneous sources into a single repository. It supports analytical data processing and helps in decision-making. Because the data come from various sources, steps such as data cleaning, data integration, and data consolidation must be performed before the data are stored in the warehouse, as represented in Figure 1.3 [18]. Table 1.1 compares a data warehouse with OLTP. The properties of a data warehouse are as follows:
Table 1.1 Comparison of a data warehouse and OLTP.
Figure 1.3 Data warehouse.
Subject-oriented: organized around one or more specific subjects.
Integrated: combines data from multiple sources into a consistent form.
Non-volatile: data, once stored, remain stable and are not changed.
Time-variant: data are kept with a time dimension, so changes over time can be analyzed.
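The cleaning, integration, and consolidation steps mentioned above can be illustrated with a minimal sketch. The two sources, their field names, and the helper functions below are hypothetical and chosen only for illustration; a real warehouse load would involve far more checks.

```python
# Hypothetical records from two heterogeneous sources (names and fields are illustrative).
crm_rows = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "bob", "amount": "80"},
]
shop_rows = [
    {"cust_name": "ALICE", "total": 30.0},
    {"cust_name": "Carol", "total": 55.25},
]

def clean_name(name):
    """Data cleaning: trim whitespace and normalize letter case."""
    return name.strip().title()

def integrate(crm, shop):
    """Data integration: map both source schemas onto one common schema."""
    unified = []
    for row in crm:
        unified.append({"customer": clean_name(row["customer"]),
                        "amount": float(row["amount"])})
    for row in shop:
        unified.append({"customer": clean_name(row["cust_name"]),
                        "amount": float(row["total"])})
    return unified

def consolidate(rows):
    """Data consolidation: aggregate to one total per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals

warehouse = consolidate(integrate(crm_rows, shop_rows))
print(warehouse)
```

After cleaning, " Alice " and "ALICE" are recognized as the same customer, so their amounts are consolidated into a single entry.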
A data warehouse can be compared with OLTP as shown in Table 1.1.
1.6 Data Mining Techniques
Decision trees: A decision tree is a tree-like structure that helps identify possible outcomes or consequences. It is commonly used in decision support systems, for both classification and prediction. Internal nodes test attributes, and leaf nodes represent the outcomes, as shown in Figure 1.4. Classification or prediction starts at the root node and follows one branch at each node until it reaches a leaf node. Its benefit is that making a prediction needs little computation [1–6].
Because a prediction follows a single path from the root to a leaf, the desired outcome can be found quickly: in a reasonably balanced tree with ‘n’ nodes, the path length grows roughly with log n rather than with n.
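The root-to-leaf traversal described above can be sketched as follows. The tree is built by hand for a hypothetical loan decision; the feature names, thresholds, and labels are assumptions for illustration and are not learned from data.

```python
# A hand-built decision tree (hypothetical loan example, not learned from data).
# Internal nodes test one feature against a threshold; leaves hold the outcome.
tree = {
    "feature": "income", "threshold": 50000,
    "left": {"label": "reject"},              # income <= 50000
    "right": {
        "feature": "debt", "threshold": 10000,
        "left": {"label": "approve"},         # debt <= 10000
        "right": {"label": "review"},         # debt > 10000
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, making one comparison per level."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(predict(tree, {"income": 72000, "debt": 4000}))
```

Each prediction visits at most one node per level, which is why the cost depends on the depth of the tree and not on the total number of nodes.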
Figure 1.4 Decision tree.
Genetic algorithms (GAs): GAs search for good solutions to an optimization problem. The solutions found can be categorized as optimal or near-optimal. A GA repeats its computations over many generations, improving a set of candidate solutions each time, and is therefore known as an evolutionary approach. For NP-hard problems, GAs have often been reported to find usable near-optimal solutions, although they do not guarantee the optimum. The terminology is borrowed from biology, i.e., chromosomes, genes, and populations. In the computations, these terms mean the following:
Chromosome: one possible solution
Population: a set of candidate solutions, i.e., a subset of all possible solutions
Gene: one element of a chromosome
A GA typically involves the following steps:
Population initialization
Fitness function calculation
Crossover (combining parts of two parent solutions to produce offspring, applied with a chosen probability)
Mutation (a small random change that produces a new solution)
Survivor selection (keeping the fitter solutions and removing the rest)
Return the best solution.
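The steps listed above can be sketched on a toy problem. The example below assumes the "OneMax" objective (maximize the number of 1-genes in a bit string), and all parameter values are illustrative choices. Parent choice by a two-way tournament is an added assumption, since the list above does not name a parent selection step.

```python
import random

GENES = 20             # chromosome length (assumed)
POP_SIZE = 30          # population size (assumed)
GENERATIONS = 60       # number of iterations (assumed)
CROSSOVER_RATE = 0.9   # probability of applying crossover (assumed)
MUTATION_RATE = 0.02   # per-gene flip probability (assumed)

def fitness(chromosome):
    """Fitness function: number of 1-genes (the toy OneMax objective)."""
    return sum(chromosome)

def select_parent(population, rng):
    """Two-way tournament: the fitter of two random chromosomes (an assumption)."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(a, b, rng):
    """Single-point crossover, applied with probability CROSSOVER_RATE."""
    if rng.random() < CROSSOVER_RATE:
        point = rng.randint(1, GENES - 1)
        return a[:point] + b[point:]
    return a[:]

def mutate(chromosome, rng):
    """Flip each gene independently with probability MUTATION_RATE."""
    return [1 - g if rng.random() < MUTATION_RATE else g for g in chromosome]

def run_ga(seed=0):
    rng = random.Random(seed)
    # Population initialization: random bit strings.
    population = [[rng.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        offspring = [
            mutate(crossover(select_parent(population, rng),
                             select_parent(population, rng), rng), rng)
            for _ in range(POP_SIZE)
        ]
        # Survivor selection: keep the fittest POP_SIZE of parents plus offspring.
        population = sorted(population + offspring, key=fitness, reverse=True)[:POP_SIZE]
    # Return the best solution found.
    return max(population, key=fitness)

best = run_ga()
print(fitness(best), best)
```

Because the survivor selection here always keeps the fittest chromosomes, the best fitness in the population never decreases from one generation to the next.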
Nearest neighbor method: As its name suggests, the nearest neighbor method classifies or predicts a new data point based on its similarity to known data. The proximity between objects is calculated, and the objects closest to the new one are selected. An example is the ‘k’ nearest neighbor (KNN) algorithm, in which the value of ‘k’ decides how many neighbors take part in the decision. With k = 1, the outcome is unstable because it depends on a single neighbor, which may be noisy. Increasing ‘k’ lets more objects vote and usually gives more stable predictions, but a very large ‘k’ blurs the boundaries between classes. Such algorithms can be used in banking and financial systems, e.g., to assess the creditworthiness of users.
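The effect of ‘k’ can be seen in a small sketch. The applicants, their two features, and the labels below are hypothetical; Euclidean distance and a simple majority vote are assumed as the proximity measure and decision rule.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query and keep the first k.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical applicants: (income, existing debt), both in thousands, with a credit label.
train = [
    ((25, 20), "bad"), ((30, 18), "bad"), ((28, 25), "bad"),
    ((60, 5), "good"), ((65, 8), "good"), ((70, 3), "good"),
    ((58, 6), "bad"),   # a noisy point inside the "good" cluster
]

query = (59, 6)
print(knn_predict(train, query, k=1))   # follows the single noisy neighbor
print(knn_predict(train, query, k=5))   # the noisy neighbor is outvoted
```

With k = 1 the query inherits the label of the noisy point, while k = 5 returns the label of the surrounding cluster. Setting k = 7 uses every training point and simply returns the overall majority class, which shows why ‘k’ should not be made too large either.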
1.7 Data Mining Tools
Various data mining tools are available for researchers and organizations. We will discuss the hands-on process of installing three major tools, namely Python, KNIME, and RapidMiner [19–25].
1.7.1 Python for Data Mining
This section discusses Python for data mining, together with various techniques. Regression estimates the relationship that may exist between variables, typically by choosing the model that minimizes the prediction error. It is also possible to form clusters in Python. Regression can be implemented in Python as described below.
A user can develop a regression model for the given variables, which helps researchers and students estimate the relationship that exists between them. Python's tools also help in classifying given objects, analyzing the clusters formed, etc. [24]. The main libraries are:
pandas: a library supported by Python that helps to clean and process the input data.
NumPy: a package supported by Python to perform numerical computations.
Matplotlib: a package supported by Python to visualize the data once they are processed.
Scikit-learn: a library supported by Python to model the data.
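As a sketch of the regression idea described above, the following fits a straight line by ordinary least squares. It deliberately uses only plain Python so the calculation is visible; in practice, a library such as Scikit-learn would normally be used instead. The data (hours studied versus exam score) and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
print(f"score ~ {slope:.2f} * hours + {intercept:.2f}")
```

The fitted slope estimates how much the score changes per additional hour, which is the "relationship between variables" that regression is meant to capture.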
Data mining and machine learning in Python typically follow these steps:
1 Import the required libraries
2 Load (import) the dataset
3 Handle missing data, if the dataset contains any
4 Encode or otherwise handle categorical data
5 Divide the dataset into training and testing sets
6 Feature scaling (a transformation of the variables onto a comparable scale).
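The six steps above can be sketched end to end on a tiny in-memory table. This is a minimal illustration that assumes mean imputation, one-hot encoding, an 80/20 split, and standardization as the concrete choices for each step; the dataset and column names are hypothetical. Only the standard library is used here, whereas pandas and Scikit-learn offer ready-made equivalents for these operations.

```python
# Step 1: import the required libraries (standard library only in this sketch).
import random
import statistics

# Step 2: load the dataset (a small hypothetical in-memory table instead of a file).
rows = [
    {"age": 25, "city": "Pune", "bought": 0},
    {"age": None, "city": "Delhi", "bought": 1},
    {"age": 40, "city": "Pune", "bought": 1},
    {"age": 31, "city": "Mumbai", "bought": 0},
    {"age": 52, "city": "Delhi", "bought": 1},
]

# Step 3: handle missing data by filling gaps with the column mean.
known_ages = [r["age"] for r in rows if r["age"] is not None]
mean_age = statistics.mean(known_ages)
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age

# Step 4: encode the categorical column as one-hot indicator features.
cities = sorted({r["city"] for r in rows})
features = [[r["age"]] + [1 if r["city"] == c else 0 for c in cities] for r in rows]
labels = [r["bought"] for r in rows]

# Step 5: split into training and test sets (80/20, fixed seed for repeatability).
indices = list(range(len(rows)))
random.Random(0).shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

# Step 6: feature scaling; standardize age using training-set statistics only.
train_ages = [features[i][0] for i in train_idx]
mu, sigma = statistics.mean(train_ages), statistics.pstdev(train_ages)
for row in features:
    row[0] = (row[0] - mu) / sigma

X_train = [features[i] for i in train_idx]
X_test = [features[i] for i in test_idx]
y_train = [labels[i] for i in train_idx]
y_test = [labels[i] for i in test_idx]
print(len(X_train), len(X_test))
```

The scaling statistics are computed from the training rows only, so that no information from the test rows leaks into the preprocessing.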
Installation and Setup of Python
1) Click on the link below and select OS: https://www.anaconda.com/download/ [24]
2) Download Python 3.7 version (around 500 MB)
3) Once installed, launch the Anaconda Navigator (search for it by clicking the Windows button)
4) Run the required Application (Jupyter, Spyder, etc.)
Keep the entire Anaconda distribution up to date, as this takes care of updating all the modules and dependencies inside it (for more on installation on Windows, go to https://docs.anaconda.com/anaconda/install/windows/).
1.7.2