◦ Educational Purpose: Using data mining, one can identify students’ interests in different fields. It also helps improve teaching methodology in line with new trends [13].
◦ Crime Investigation: Data mining helps identify patterns that recur across different crimes. Crimes, criminals, and the characteristics of their crimes are analyzed under this category. A large volume of stored data can be processed to identify relationships among criminals. Face recognition, fingerprint recognition, and similar techniques are also used in the investigation [14].
1.2.3 Databases
A database is a collection of records. The structure of a database and of its records may vary with the application. The following types of databases are used in many applications [15].
Transactional Database: It is a popular type of database that consists of rows and columns, in which each record is known as a transaction. A transaction has the following parameters:
Transaction id
Timestamp
List of items
Item description
The transaction id is a unique identifier generated by the system. Transactional databases are mostly related to financial matters such as banking transactions, booking a movie ticket, booking a flight, etc. [16].
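As an illustration, the four parameters of a transaction can be sketched as a small record type. This is only a minimal sketch: the class name `Transaction`, the field names, and the counter that stands in for the system-generated id are assumptions made for this example, not part of any particular database product.

```python
from dataclasses import dataclass, field
from datetime import datetime
from itertools import count

# Hypothetical counter standing in for the system-generated identifier.
_next_id = count(1)


@dataclass
class Transaction:
    items: list[str]      # list of items
    description: str      # item description
    timestamp: datetime = field(default_factory=datetime.now)
    transaction_id: int = field(default_factory=lambda: next(_next_id))


# Two made-up transactions; each one receives its own unique id.
t1 = Transaction(["movie ticket"], "Evening show, two seats")
t2 = Transaction(["flight ticket", "extra baggage"], "One-way booking")
```

Each new `Transaction` draws the next value from the counter, so no two records share a transaction id.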
Multimedia Database: The data integration phase of the KDD process integrates data from multiple sources, and that data could be in the form of text, documents, video, images, audio, etc. Storing these different data types (multimedia data) requires high-dimensional space, which is a characteristic of a multimedia database [17]. Examples are:
Video-on-demand
Digital libraries
Animations
Images.
Spatial Database: In addition to multimedia and transactional databases, there is the spatial database, which stores geographical information. This information includes maps, the positioning of objects, etc. Geographic coordinates are handy in determining the topographic data [17].
Figure 1.2 Time series database.
Time-Series Database: As its name suggests, a time-series database holds information related to a specific item with respect to time, e.g., weekly, monthly, or yearly values. Such patterns help predict the trends and movements of an item over a particular time period and are represented in Figure 1.2.
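One simple way to expose a trend in time-series data is a moving average. The sketch below is an assumption chosen for illustration (the text does not prescribe a method), and the weekly sales figures are made up.

```python
def moving_average(values, window):
    """Return the simple moving average over each full window of `values`."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Made-up weekly sales of one item.
weekly_sales = [10, 12, 11, 15, 18, 17, 21]
trend = moving_average(weekly_sales, 3)
```

Averaging over three weeks smooths out the week-to-week fluctuations, so the upward movement of the item becomes easier to see.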
1.3 Issues in Data Mining
Data mining raises issues in areas such as user interfacing, mining methodology, security, performance, and data sources. The following is a discussion of these issues [3–5, 14].
◦ User interface design: As discussed in the KDD process, discovered knowledge needs to be represented using good, accurate visualization. The user interface design issue addresses the interaction required between users and the system, as well as information rendering. This issue requires analysts and programmers to work at different conceptual levels.
◦ Mining methodology issues: This issue addresses the following sub-points:
Algorithms to be used
Error-free data
Less time complexity
Metadata processing.
◦ Security issues: Security is a very important issue in data mining. Data collection and data processing require maintaining the integrity and confidentiality of the data. Data mining systems deal with private and sensitive user information, and hence securing this data is a primary objective.
◦ Performance issues: Many data mining applications on the market are used in different sectors. These applications process a large volume of data, and hence data mining algorithms and applications must process this data without compromising the performance of the system.
◦ Data source issues: Data is collected from different sources, and collection is an incremental process. The number of data mining applications is increasing, which produces a large volume of data. Storing, processing, and categorizing this large volume of data has become a necessary task.
1.4 Data Mining Algorithms
AdaBoost, KNN, PageRank, Naïve Bayes, Support Vector Machine (SVM), Apriori, and C4.5 are some data mining algorithms. Data mining algorithms are primarily used for predictive modeling, which includes clustering and classification problems. Let us discuss each of these two tasks in detail [1–6].
Classification
Classification is a data mining task in which data is modeled and separated into classes. In other words, it is a process in which given objects are assigned to categories. Initially, a training set is identified, and new observations are then classified using what was learned from it. Hence, this task has two phases: the learning/training phase and the classification of the given objects. For example, a bank manager may wish to classify the loans borrowed by customers into categories such as risky, less risky, and trustworthy. To carry out classification on the given objects, one or more classifiers are used: rules are applied, training is given, and the given data is classified into the desired classes. The following classification algorithms can be used in data mining:
Logistic regression
Naïve Bayes
K-nearest neighbors
Decision tree
Random forest
Support vector machine.
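The two phases described above can be sketched with a k-nearest-neighbors classifier, one of the algorithms in the list. This is a minimal illustration under stated assumptions: the function name `knn_classify`, the two features (income in thousands and debt ratio), and the loan data are all made up for this example.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(training, key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Learning phase: labelled examples (income in thousands, debt ratio).
training = [
    ((20, 0.90), "risky"),
    ((25, 0.80), "risky"),
    ((30, 0.85), "risky"),
    ((60, 0.30), "less risky"),
    ((70, 0.20), "less risky"),
    ((65, 0.25), "less risky"),
]

# Classification phase: assign a class to a new, unlabelled loan.
prediction = knn_classify(training, (62, 0.28))
```

For KNN the "training" step simply stores the labelled examples, and the work happens at classification time. In practice the features would usually be scaled first, since income dominates the distance here.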
Clustering
Clustering is a grouping of objects based on similarity. A threshold is applied, and an object is added to a specific cluster when it satisfies that cluster’s criteria. This technique is helpful in various applications such as:
Market basket analysis
Pattern recognition
Image processing
Financial analysis.
Clustering is categorized as unsupervised learning, where the given data is compared against a threshold (a predefined value). Clustering can be viewed from two perspectives: intra-cluster (objects within the same cluster) and inter-cluster (objects in different clusters).
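The threshold idea can be sketched as follows. This is a minimal, assumed scheme for illustration only: each value joins the first cluster whose first member (its "leader") lies within the threshold, and otherwise starts a new cluster. The function name and the sample values are hypothetical.

```python
def threshold_cluster(values, threshold):
    """Group 1-D values: join a cluster if within `threshold` of its leader."""
    clusters = []  # each cluster is a list; its first element is the leader
    for v in values:
        for cluster in clusters:
            if abs(v - cluster[0]) <= threshold:
                cluster.append(v)
                break
        else:
            # No existing cluster satisfied the criterion, so start a new one.
            clusters.append([v])
    return clusters


groups = threshold_cluster([1.0, 1.2, 5.0, 5.3, 0.9, 9.8], threshold=0.5)
# groups == [[1.0, 1.2, 0.9], [5.0, 5.3], [9.8]]
```

No labels are supplied at any point, which is what makes the approach unsupervised. The result of this simple scheme depends on the order in which the values arrive.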
Types of Clustering
Clustering is a grouping of elements based on similarity, and it is an unsupervised learning technique. One can apply partition clustering, which is also known as non-hierarchical clustering, to divide the data/records/values into ‘k’ groups/clusters. This is an iterative process and works until the last element is processed. Users can also use the support vector machine (SVM) model, where ‘n’ features are identified in the initial phase and then processed to identify the relevant results.
◦ K-means: The K-means clustering algorithm can be used to train the samples. With this method, the nearest cluster for each sample is identified by training, which amounts to finding the distance between the samples and the nearest clusters. Distance is calculated between the samples, and the sample with a larger distance is likely to be selected as a center point (the Euclidean distance metric can be used in this case). K-means stores ‘k’ centroids that it uses to define the clusters to be formed. An object/value is considered to be in a specific cluster if it is closer to that cluster’s centroid than to any other centroid.
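The assignment of samples to the nearest centroid can be sketched as below. This is a minimal sketch of the standard assign-then-recompute iteration, not a definitive implementation; the function name `k_means`, the caller-supplied initial centroids, and the sample points are assumptions made for this example.

```python
import math


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Euclidean distance decides which centroid is nearest.
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if empty.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so stop early
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
final_centroids, final_clusters = k_means(points, [(1, 1), (8, 8)])
```

With these made-up points the three samples near the origin end up in one cluster and the three samples near (8.5, 8.5) in the other.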
◦ Hierarchical: It is one of the popular algorithms used in data mining and machine learning. The idea is to find the two clusters that are closest to each other and merge them to form a single cluster. This process is repeated until the desired clusters have been merged. Hierarchical clustering is categorized into bottom-up and top-down approaches, known as agglomerative and divisive approaches, respectively; the merging procedure just described is the agglomerative one. We can define this type as a nesting of clusters that together form a tree (merged cluster).
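The bottom-up (agglomerative) procedure can be sketched as follows. This is a minimal sketch under stated assumptions: the distance between two clusters is taken as the distance between their closest members (single linkage, which the text does not specify), and the function name, stopping count, and sample points are hypothetical.

```python
import math


def agglomerative(points, target_clusters):
    """Merge the two closest clusters until `target_clusters` remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > max(target_clusters, 1):
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(
                    math.dist(a, b)
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into i
        del clusters[j]
    return clusters


result = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], 3)
# result == [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```

Recording the order of the merges, rather than only the final clusters, would give the tree of nested clusters mentioned above.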
◦ Fuzzy: Clusters