1.3.1. Techniques based on decision trees
Decision trees are powerful and widely used nonparametric learning tools for classification and prediction problems. Their purpose is to build a model that predicts the value of a target variable from sequences of decision rules inferred from the training data. Rai et al. (2016) developed an algorithm based on the C4.5 decision tree approach. The most relevant features are selected by means of information gain, and the split value is chosen so that the classifier is not biased toward the most frequent values. Sahu and Babu (2015) used a dataset known as “Kyoto 2006+” for their experiments. In Kyoto 2006+, each instance is labeled as “normal” (no attack), “attack” (known attack) or “unknown attack”. The decision tree algorithm J48 is used to classify the packets. The experiments indicate that the generated rules achieve 97.2% accuracy. Moon et al. (2017) proposed an intrusion detection system based on decision trees that uses packet behavior analysis to detect attacks. Peng et al. (2018) proposed a technique in which the data are first digitized and then normalized in a preprocessing step, in order to improve detection efficiency, before a decision tree-based method is applied.
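Feature selection by information gain, as mentioned above, can be illustrated with a minimal sketch. It is not the algorithm of Rai et al. (2016): the records, the feature names (`protocol`, `flag`) and the labels are hypothetical toy data, and the code computes plain information gain for categorical features only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the records on one categorical feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy connection records (assumption, not taken from any dataset).
rows = [
    {"protocol": "tcp", "flag": "SF"},
    {"protocol": "tcp", "flag": "S0"},
    {"protocol": "udp", "flag": "SF"},
    {"protocol": "udp", "flag": "S0"},
]
labels = ["normal", "attack", "normal", "attack"]

gains = {f: information_gain(rows, labels, f) for f in ("protocol", "flag")}
best_feature = max(gains, key=gains.get)
```

In this toy set, `flag` separates the two classes perfectly while `protocol` carries no information, so a tree builder that ranks features by information gain would split on `flag` first.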
1.3.2. Techniques based on data exploration
Data exploration aims to eliminate the manual elements used for the design of intrusion detection systems. Various data exploration techniques have been developed and widely used. The main data exploration techniques are presented in the following sections.
1.3.2.1. Fuzzy logic
Fuzzy logic has been used in the field of computer network security, particularly for intrusion detection (Idris and Shanmugam 2005; Shanmugavadivu and Nagarajan 2011; Balan et al. 2015; Kudłacik et al. 2016; Sai Satyanarayana Reddy et al. 2019), for two main reasons. First, several quantitative parameters used in intrusion detection, for example processor usage time and connection interval, can be treated as fuzzy variables. Second, the concept of security is itself fuzzy. In other words, fuzzy logic avoids imposing a sharp boundary between normal and abnormal behaviors. Kudłacik et al. (2016) applied fuzzy logic to intrusion detection. The proposed solution analyzes user activity over a relatively short period of time in order to create a local user profile. A more in-depth analysis builds a more general structure from a defined number of local user profiles, known as a “fuzzy profile”. The fuzzy profile represents the behavior of the computer system user. Fuzzy profiles are used directly to detect anomalies in user behavior, and therefore potential intrusions. Idris and Shanmugam (2005) proposed a modified FIRE system. It is a mechanism that uses AI techniques to automate the generation of fuzzy rules and reduce human intervention.
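The idea of treating processor usage time and connection interval as fuzzy variables can be sketched as follows. This is only an illustration and does not reproduce any of the cited systems: the membership shape (a linear ramp), the thresholds and the function names are assumptions chosen for readability.

```python
def ramp_membership(x, low, high):
    """Degree in [0, 1] to which x is 'high': 0 below low, 1 above high, linear in between."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def anomaly_degree(cpu_seconds, interval_seconds):
    """Degree to which 'CPU time is high AND the connection interval is short'."""
    # Hypothetical thresholds: CPU time becomes 'high' between 2 s and 10 s.
    cpu_high = ramp_membership(cpu_seconds, 2.0, 10.0)
    # Hypothetical thresholds: intervals under 1 s are fully 'short', over 5 s not at all.
    interval_short = 1.0 - ramp_membership(interval_seconds, 1.0, 5.0)
    # The fuzzy AND is taken here as the minimum of the two degrees.
    return min(cpu_high, interval_short)


degree = anomaly_degree(cpu_seconds=6.0, interval_seconds=3.0)
```

Instead of a yes/no decision, the result is a degree between 0 and 1, which reflects the absence of a sharp boundary between normal and abnormal behavior.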
1.3.2.2. Genetic algorithms
Genetic algorithms are techniques inspired by genetics and natural evolution, which are used to find approximate solutions to optimization and search problems. Their main advantages are their flexibility and robustness as a global search method. Their main drawback is their computational cost, as they handle several candidate solutions simultaneously. Genetic algorithms have been used in various ways in the field of intrusion detection (Hoque et al. 2012; Aslahi-Shahri et al. 2016; Hamamoto et al. 2018). Hoque et al. (2012) presented an intrusion detection system that uses a genetic algorithm to effectively detect anomalies in the network. Aslahi-Shahri et al. (2016) proposed a hybrid method that combines support vector machines and genetic algorithms for intrusion detection. The results indicate that this algorithm can reach a 97.3% true positive rate and a 1.7% false positive rate.
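The general mechanism (a population of candidate solutions improved by selection, crossover and mutation) can be sketched as below. This is a generic toy example, not the method of any of the cited works: the bit-mask encoding, the fitness function `toy_fitness`, the hidden `TARGET` mask and all parameter values are assumptions made for illustration.

```python
import random


def evolve(fitness, n_bits, pop_size=20, generations=40, mutation_rate=0.05, seed=0):
    """Maximize `fitness` over bit lists of length n_bits with a simple genetic algorithm."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        elite = population[:2]                 # elitism: keep the two best unchanged
        parents = population[: pop_size // 2]  # truncation selection: best half breeds
        children = []
        while len(children) < pop_size - len(elite):
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)     # one-point crossover
            child = mother[:cut] + father[cut:]
            # Flip each bit independently with a small probability (mutation).
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = elite + children
    return max(population, key=fitness), history


# Hypothetical objective: number of positions that agree with a hidden target mask.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]


def toy_fitness(mask):
    return sum(a == b for a, b in zip(mask, TARGET))


best_mask, best_per_generation = evolve(toy_fitness, n_bits=len(TARGET))
```

The cost mentioned above is visible in the loop: every generation evaluates the fitness of the whole population, which becomes expensive when each evaluation requires training or testing a classifier.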
1.3.3. Rule-based techniques
Rule-based techniques (Li et al. 2010; Yang et al. 2013) generally involve the application of a set of association rules for data classification. In this context, a rule states that if event X occurs, then event Y is likely to occur, where events X and Y can be described as sets of (variable, value) pairs. The advantage of rules is that they tend to be simple and intuitive, unstructured and less rigid. Nevertheless, a drawback is that rules are difficult to maintain and, in certain cases, inadequate for representing various types of information.
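A rule of the form "if X, then Y is likely", with X and Y given as sets of (variable, value) pairs, is commonly evaluated by its support and confidence. The sketch below shows this computation on hypothetical records; the field names and values are assumptions and are not taken from the cited works.

```python
def matches(record, itemset):
    """True if the record contains every (variable, value) pair of the itemset."""
    return all(record.get(var) == val for var, val in itemset)


def support_and_confidence(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    n_x = sum(matches(r, antecedent) for r in records)
    n_xy = sum(matches(r, antecedent) and matches(r, consequent) for r in records)
    support = n_xy / len(records)               # fraction of records where X and Y hold
    confidence = n_xy / n_x if n_x else 0.0     # fraction of X-records where Y also holds
    return support, confidence


# Hypothetical labeled connection records (assumption).
records = [
    {"service": "http", "flag": "S0", "label": "attack"},
    {"service": "http", "flag": "S0", "label": "attack"},
    {"service": "http", "flag": "SF", "label": "normal"},
    {"service": "ftp", "flag": "S0", "label": "attack"},
    {"service": "ftp", "flag": "SF", "label": "normal"},
]

X = {("service", "http")}
Y = {("label", "attack")}
support, confidence = support_and_confidence(records, X, Y)
```

A rule miner would typically keep only the rules whose support and confidence exceed chosen thresholds; the thresholds themselves depend on the application.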
Turner et al. (2016) developed an algorithm for monitoring the enabled/disabled state of the rules of a signature-based intrusion detection system. The algorithm is implemented in Python and runs on Snort (Roesch 1999). Agarwal and Joshi (2000) proposed a general two-stage framework (PNrule) for learning rule-based classifier models from a dataset. They made extensive use of various class distributions in the training data. The KDD Cup dataset was used for training and testing their system.
1.3.4. Machine learning-based techniques
Machine learning can be defined as the capacity of a program to learn and to improve its performance on a set of tasks over time. Machine learning techniques focus on building a system model that improves its performance on the basis of previous results. Furthermore, machine learning-based systems are able to adapt their execution strategy to new inputs. The main machine learning techniques are presented in the following sections.
1.3.4.1. Artificial neural networks
Artificial neural networks learn to predict the behavior of various system users. If correctly designed and implemented, neural networks can potentially solve several problems encountered by rule-based approaches. Their main advantages are their tolerance to inaccurate data and uncertain information and their capacity to infer solutions without prior knowledge of regularities in the data. Cunningham and Lippmann (2000) of MIT Lincoln Laboratory conducted a number of tests using neural networks. The system searched for attack-specific keywords in the network traffic. Ponkarthika and Saraswathy (2018) explored an intrusion detection model based on deep learning. A long short-term memory (LSTM) architecture was applied to a recurrent neural network in order to train an intrusion detection system on the KDD Cup 1999 dataset.
1.3.4.2. Bayesian networks
A Bayesian network is a probabilistic graphical model that represents a set of random variables in the form of a directed acyclic graph. This technique is generally used for intrusion detection in combination with statistical schemes. It has several advantages, notably the capacity to encode the interdependencies between variables and to