Error‐based pruning (ebp): This is an evolution of pessimistic pruning. As in pessimistic pruning‚ the error rate is estimated using the upper bound of the statistical confidence interval for proportions:
(2.25)
where ε(T, S) denotes the misclassification rate of the tree T on the training set S. Z is the inverse of the standard normal cumulative distribution, and α is the desired significance level. Let subtree (T, t) denote the subtree rooted by the node t. Let maxchild (T, t) denote the most frequent child node of t (namely, most of the instances in S reach this particular child), and let St denote all instances in S that reach the node. The procedure performs bottom‐up traversal over all nodes and compares the following values:
According to the lowest value, the procedure either leaves the tree as is, prunes the node, or replaces the node t with the subtree rooted by maxchild (T, t).
Optimal pruning (opt): Bohanec and Bratko [17] introduced an algorithm guaranteeing optimality called optimal pruning (opt). This algorithm finds the optimal pruning based on dynamic programming, with a complexity of θ(∣l(T)|2), where T is the initial decision tree. Almuallim [18] introduced an improvement of opt called opt‐2, which also performs optimal pruning using dynamic programming. However, the time and space complexities of opt‐2 are both Θ( ∣ l(T*) ∣ ∣internal (T) ∣ ), where T * is the target (pruned) decision tree, and T is the initial decision tree.
Since the pruned tree is usually much smaller than the initial tree and the number of internal nodes is smaller than the number of leaves, opt‐2 is usually more efficient than opt in terms of computational complexity.
Minimum description length (MDL) pruning: Rissanen [19], Quinlan and Rivest [20], and Mehta et al. [21] used the MDL to evaluate the generalized accuracy of a node. This method measures the size of a decision tree by means of the number of bits required to encode the tree. The MDL method prefers decision trees that can be encoded with fewer bits. Mehta et al. [21] indicate that the cost of a split at a leaf t can be estimated as
(2.26)
where ∣St∣ denote the number of instances that have reached the node.
The splitting cost of an internal node is calculated based on the cost aggregation of its children.
2.2.3 Dimensionality Reduction Techniques
In this section, we provide an overview of the mathematical properties and foundations of the various dimensionality reduction techniques [22–24]
There are several dimensionality reduction techniques specifically designed for time series. These methods specifically exploit the frequential content of the signal and its usual sparseness in the frequency space. The most popular methods are those based on wavelets [25, 26], and a distant second is empirical mode decomposition [27, 28] (the reader is referred to the references above for further details). We do not cover these techniques here since they are not usually applied for the general‐purpose dimensionality reduction of data. From a general point of view, we may say that wavelets project the input time series onto a fixed dictionary (see Section 2.3). This dictionary has the property of making the projection sparse (only a few coefficients are sufficiently large), and the dimensionality reduction is obtained by setting most coefficients (the small ones) to zero. Empirical mode decomposition instead constructs a dictionary specially adapted to each input signal.
To maintain the consistency of this review, we do not cover those dimensionality reduction techniques that take into account the class of observations; that is, there are observations from a class A of objects, observations from a class B, … and the dimensionality reduction technique should maintain, to the extent possible, the separability of the original classes. Fisher’s Linear Discriminant Analysis ) was one of the first techniques to address this issue [29, 30]. Many other works have followed since then; for the most recent works and for a bibliographical review, see [31, 35]. Next, we will focus on vector quantization and mixture models and PCA, which was already introduced to some extent in the previous section.
In the following, we will refer to the observations as input vectors x, whose dimension is M. We will assume that we have N observations, and we will refer to the nth observation as xn. The whole dataset of observations will be X, whereas X will be a M × N matrix with all the observations as columns. Note that non‐bold small letters represent vectors (x), whereas capital, non‐bold letters (X) represent matrices.
The goal of the dimensionality reduction is to find another representation χ of a smaller dimension m such that as much information as possible is retained from the original set of observations. This involves some transformation operators from the original vectors onto the new vectors, χ = T(x). These projected vectors are sometimes called feature vectors, and the projection of xn will be denoted as χ n. There might not be an inverse for this projection, but there must be a way of recovering an approximate value of the original