Design Example 2.3
The example in Figure 2.16 shows a decision tree that predicts whether or not a potential customer will respond to a direct mailing.
Decision tree induction is closely related to rule induction. Each path from the root of a decision tree to one of its leaves can be transformed into a rule simply by conjoining the tests along the path to form the antecedent and taking the leaf's class prediction as the consequent. For example, one of the paths in Figure 2.16 can be transformed into the rule: "If the customer's age is under 30 and the customer's gender is male, then the customer will respond to the mailing."
Figure 2.16 Decision tree presenting response to direct mailing.
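To make the path-to-rule transformation concrete, the following short Python sketch walks every root-to-leaf path of a toy tree and conjoins the tests into a rule. The node representation, attribute names, and thresholds are illustrative assumptions loosely modeled on the mailing example; they are not the exact structure of Figure 2.16.

# Minimal sketch: turning every root-to-leaf path of a decision tree into an
# "IF <conjunction of tests> THEN <class>" rule. The node layout (dicts with
# "test" and "branches" keys) and the attribute/threshold values are
# illustrative assumptions, not the exact tree of Figure 2.16.

def extract_rules(node, conditions=None):
    """Recursively collect one rule per leaf of the tree."""
    conditions = conditions or []
    if "class" in node:                      # leaf: emit the accumulated rule
        antecedent = " AND ".join(conditions) or "TRUE"
        return [f"IF {antecedent} THEN respond = {node['class']}"]
    rules = []
    for outcome, child in node["branches"].items():
        rules += extract_rules(child, conditions + [f"{node['test']} {outcome}"])
    return rules

# A toy tree loosely modeled on the direct-mailing example.
tree = {
    "test": "age",
    "branches": {
        "< 30": {
            "test": "gender",
            "branches": {
                "= male": {"class": "yes"},
                "= female": {"class": "no"},
            },
        },
        ">= 30": {"class": "no"},
    },
}

for rule in extract_rules(tree):
    print(rule)
# e.g. IF age < 30 AND gender = male THEN respond = yes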
The goal was to predict whether an email message is spam (junk email) or good (legitimate) email.
Input features: Relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks in all the training email messages.
For this problem, not all errors are equal: we want to avoid filtering out good email, whereas letting spam get through is undesirable but less serious in its consequences.
Spam is coded as 1 and good email as 0.
A system like this would be trained for each user separately (e.g., the word lists would differ from user to user).
Forty‐eight quantitative predictors – the percentage of words in the email that match a given word. Examples include business, address, Internet, free, and George. The idea was that these could be customized for individual users.
Six quantitative predictors – the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.
The average length of uninterrupted sequences of capital letters: CAPAVE.
The length of the longest uninterrupted sequence of capital letters: CAPMAX.
The sum of the length of uninterrupted sequences of capital letters: CAPTOT.
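As a rough illustration of how predictors of this kind can be derived from raw text, the sketch below computes a few word percentages, character percentages, and the three capital-letter statistics for a single message. The abbreviated word list, the tokenization, and the function name are assumptions for illustration and need not match the preprocessing used in [12].

import re

# Sketch of computing spam predictors of the kind described above. The word
# list is abbreviated (48 keywords in the real data) and the tokenization is
# an assumption; only the three groups of features are illustrated.

WORDS = ["business", "address", "internet", "free", "george"]
CHARS = [";", "(", "[", "!", "$", "#"]

def spam_features(message: str) -> dict:
    tokens = re.findall(r"[A-Za-z]+", message)
    n_words = max(len(tokens), 1)
    n_chars = max(len(message), 1)

    feats = {}
    # Percentage of words in the message matching each keyword.
    for w in WORDS:
        feats[f"word_{w}"] = 100.0 * sum(t.lower() == w for t in tokens) / n_words
    # Percentage of characters in the message matching each special character.
    for c in CHARS:
        feats[f"char_{c}"] = 100.0 * message.count(c) / n_chars
    # Uninterrupted runs of capital letters: average, maximum, and total length.
    runs = [len(r) for r in re.findall(r"[A-Z]+", message)]
    feats["CAPAVE"] = sum(runs) / len(runs) if runs else 0.0
    feats["CAPMAX"] = max(runs) if runs else 0
    feats["CAPTOT"] = sum(runs)
    return feats

print(spam_features("FREE business offer!!! Visit NOW, George"))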
A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set.
A full tree was grown on the training set, with splitting continuing until a minimum bucket size of 5 was reached.
Algorithmic framework: Decision tree inducers are algorithms that automatically construct a decision tree from a given dataset. Typically, the goal is to find the decision tree that minimizes the generalization error, although other target functions can also be defined, for instance, minimizing the number of nodes or the average depth. Inducing an optimal decision tree from a given dataset is NP-hard [13–16] and is tractable only for small problems. Consequently, heuristic methods are required. Roughly speaking, these methods can be divided into two groups, top-down and bottom-up, with a clear preference in the literature for the former. Figure 2.18 presents a typical algorithmic framework for top-down decision tree induction [8].
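As a rough illustration of this top-down framework (and not the pseudocode of [8] verbatim), the following Python sketch grows a tree by greedily choosing, at each node, the attribute whose partition minimizes a weighted Gini impurity, and stops when a node is pure or falls below a minimum size. The function names, the choice of the Gini criterion, and the purely discrete-attribute handling are illustrative assumptions.

# Rough sketch of a top-down inducer in the spirit of Figure 2.18: grow the
# tree greedily, splitting on the attribute that most reduces impurity, and
# stop when the node is pure or too small. Names such as `grow_tree` and
# `min_size` are illustrative assumptions.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, labels, attributes, min_size=5):
    # Stopping criteria: pure node, too few examples, or no attributes left
    # -> return a leaf predicting the majority class.
    if len(set(labels)) == 1 or len(rows) <= min_size or not attributes:
        return {"class": Counter(labels).most_common(1)[0][0]}

    # Select the attribute whose partition yields the lowest weighted impurity.
    def weighted_impurity(a):
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[a], []).append(y)
        return sum(len(p) / len(labels) * gini(p) for p in parts.values())

    best = min(attributes, key=weighted_impurity)

    # Recurse on each value of the chosen attribute.
    node = {"test": best, "branches": {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["branches"][value] = grow_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attributes if a != best], min_size)
    return node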
Design Example 2.4
A slightly more sophisticated tree for predicting email spam is shown in Figure 2.17. The results for data from 4601 email messages have been presented in [12].
Figure 2.17 Predicting email spam.
Source: Trevor Hastie [12].
Splitting criteria: In most cases, the discrete splitting functions are univariate, meaning that an internal node is split according to the value of a single attribute. Consequently, the inducer searches for the best attribute upon which to split. There are various univariate criteria, which can be characterized in different ways, for example by the origin of the measure (information theory, dependence, distance) or by the measure's structure (impurity-based criteria, normalized impurity-based criteria, and binary criteria). Next, we describe the most common criteria in the literature.
Figure 2.18 Top‐down algorithmic framework for decision tree induction. The inputs are S (training set), A (input feature set), and y (target feature) [8].
Impurity-based criteria: Given a random variable x with k discrete values, distributed according to P = (p1, p2, ..., pk), an impurity measure is a function φ: [0, 1]^k → ℝ that satisfies the following conditions: φ(P) ≥ 0; φ(P) is minimal if there exists an i such that pi = 1; φ(P) is maximal if pi = 1/k for all i; φ(P) is symmetric with respect to the components of P; and φ(P) is differentiable everywhere in its range. Given a training set S, the probability vector of the target attribute y is

P_y(S) = \left( \frac{|\sigma_{y=c_1} S|}{|S|}, \ldots, \frac{|\sigma_{y=c_{|\mathrm{dom}(y)|}} S|}{|S|} \right)   (2.12)

where σ_{a=v} S denotes the subset of instances of S for which attribute a takes the value v.
The goodness of split due to the discrete attribute ai is defined as the reduction in impurity of the target attribute after partitioning S according to the values vi,j ∈ dom(ai):

\Delta\Phi(a_i, S) = \phi\big(P_y(S)\big) - \sum_{j=1}^{|\mathrm{dom}(a_i)|} \frac{|\sigma_{a_i = v_{i,j}} S|}{|S|}\, \phi\big(P_y(\sigma_{a_i = v_{i,j}} S)\big)   (2.13)
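As a minimal numerical illustration of Eq. (2.13), the following sketch computes the goodness of split for a discrete attribute using either Gini impurity or entropy as the impurity measure φ. The toy data and function names are assumptions for illustration only.

# Minimal sketch of Eq. (2.13): the impurity reduction obtained by splitting a
# sample S on a discrete attribute a_i. The impurity measure phi (here Gini or
# entropy) and the toy data are illustrative choices.
from collections import Counter
from math import log2

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def goodness_of_split(attr_values, labels, phi=gini):
    """Delta-phi(a_i, S): impurity of y on S minus the weighted impurity of y
    on the subsets induced by each value v_{i,j} of a_i."""
    subsets = {}
    for v, y in zip(attr_values, labels):
        subsets.setdefault(v, []).append(y)
    weighted = sum(len(ys) / len(labels) * phi(ys) for ys in subsets.values())
    return phi(labels) - weighted

# Toy example: split the target "respond" on the attribute "gender".
gender = ["male", "male", "female", "female", "male"]
respond = ["yes", "yes", "no", "no", "no"]
print(goodness_of_split(gender, respond, phi=gini))      # ~0.21
print(goodness_of_split(gender, respond, phi=entropy))   # ~0.42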