CPSC 340 Decision Trees

Posted by Luna on January 31, 2022

Decision trees are simple programs consisting of:

  • A nested sequence of “if-else” decisions (splitting rules)
  • A class label as the return value


Decision Stump

Decision Stump – a simple decision tree with one splitting rule based on a single feature.

To learn a decision stump, we need to find three things (put together in the sketch after this list):

  • which feature to split on
  • what value to use as the threshold
  • which classes to use for the leaves
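
Putting these three pieces together, a fitted stump is just a one-level if-else. Here is a minimal sketch of prediction for a single example x; the field names (j_best, t_best, y_hat_yes, y_hat_no) match the implementation later in this post:

def stump_predict(x, j_best, t_best, y_hat_yes, y_hat_no):
    # One splitting rule on one feature, then return a class label
    if x[j_best] > t_best:
        return y_hat_yes
    else:
        return y_hat_no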

Accuracy Score

To find the best rule, we need to define a score for each rule.

The most intuitive score is classification accuracy: if we use this rule, how many examples do we label correctly?

| Milk | Fish | Egg | Sick? |
|------|------|-----|-------|
| 0.7  | 0    | 1   | 1     |
| 0.7  | 0    | 2   | 1     |
| 0    | 0    | 0   | 0     |
| 0.7  | 1.2  | 0   | 0     |
| 0    | 1.2  | 2   | 1     |
| 0    | 0    | 0   | 0     |

If we use the decision stump “if egg > 1, predict sick; else predict not sick”:

  • egg > 1, predict sick: 2/2 correct
  • egg <= 1, predict not sick: 3/4 correct

Accuracy: 5/6
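
To make the 5/6 concrete, here is a quick check with numpy (the arrays below encode the table above; this snippet is illustrative, not part of the course code):

import numpy as np

X = np.array([[0.7, 0,   1],
              [0.7, 0,   2],
              [0,   0,   0],
              [0.7, 1.2, 0],
              [0,   1.2, 2],
              [0,   0,   0]])
y = np.array([1, 1, 0, 0, 1, 0])  # Sick?

y_pred = (X[:, 2] > 1).astype(int)  # the "egg > 1 -> sick" stump
print(np.mean(y_pred == y))         # 0.8333... = 5/6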

We “learn” a decision stump by finding the rule with the best score.

Decision Stump Learning Pseudo-Code
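
The pseudo-code image from the slides is not reproduced here. The sketch below is a rough Python rendering of the same brute-force search, scored by classification accuracy; the inline `mode` helper and the function name are illustrative, not from the course repo:

import numpy as np

def fit_stump_accuracy(X, y):
    # Brute-force search over (feature, threshold) pairs, scored by accuracy.
    # Assumes y contains non-negative integer class labels.
    n, d = X.shape

    def mode(a):
        return np.bincount(a).argmax()

    # Baseline: no split, always predict the most common label
    best_j, best_t = None, None
    best_yes = best_no = mode(y)
    best_score = np.mean(y == mode(y))

    for j in range(d):                    # which feature to split on
        for t in np.unique(X[:, j]):      # which value to use as the threshold
            yes = X[:, j] > t
            if yes.all() or not yes.any():
                continue                  # this split leaves one side empty
            label_yes = mode(y[yes])      # which class to use for each leaf
            label_no = mode(y[~yes])
            y_pred = np.where(yes, label_yes, label_no)
            score = np.mean(y_pred == y)  # classification accuracy of this rule
            if score > best_score:
                best_score = score
                best_j, best_t = j, t
                best_yes, best_no = label_yes, label_no

    return best_j, best_t, best_yes, best_no

On the toy table above, this search finds the rule egg > 0, which classifies all six examples correctly (a better stump than the egg > 1 rule used in the accuracy example).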

Implementation

Github decision stump

Cost

  • n examples
  • d features
  • k thresholds (> 0, > 1, > 2, …) for each feature

We have $O(dk)$ rules. For each rule, we go through the n examples to find the most common label on each side of the split, then go through them again to compute the accuracy, so the total cost is $O(ndk)$.

If the features are binary, then k = 1 and the cost is $O(nd)$. If the features are numerical, then k = n and the cost is $O(n^2d)$.

Decision Trees

Decision Stumps vs. Decision Trees

Decision stumps have only one rule based on one feature; decision trees allow a sequence of splits based on multiple features.

Decision Tree Learning - Greedy Recursive Splitting

  1. Start with the full dataset, find the decision stump with the best score, and split the data into two smaller datasets based on that stump.

  2. Now we have a decision stump and two sub-datasets. Fit a decision stump to each leaf’s data and add these stumps to the tree.

    After adding the stumps, we have a decision tree of depth 2, which divides the original dataset into the corresponding subsets.

We might continue splitting until:

  • each leaf contains only one label, or
  • we reach the defined maximum depth.

Score Function

Can we still use the accuracy score?

  • For leaves: yes, just maximize accuracy.
  • For internal nodes: not necessarily – maybe no simple rule like (egg > 0.5) improves accuracy, but we should not stop splitting there.

(An example where accuracy fails can be found in the slides.)

Instead, we can use the “information gain” as the score function: choose the split that decreases the entropy of the labels the most:

$information \space gain = entropy(y) - \frac{n_{yes}}{n}entropy(y_{yes}) - \frac{n_{no}}{n}entropy(y_{no})$

The information gain is large if the labels become “more predictable” in the next layer.

Entropy Function
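
The entropy plot from the slides is not reproduced here. For a label distribution with class proportions $p_1, \dots, p_k$, the entropy used above (and by the `entropy` helper in the code below) is the standard definition; the base of the logarithm only rescales the score, and base 2 gives bits:

$entropy(p) = -\sum_{c=1}^{k} p_c \log_2 p_c$

For example, for the egg > 1 split on the table above: $entropy(y) = 1$ bit (3 sick vs. 3 not sick), $entropy(y_{yes}) = 0$ (both “yes” examples are sick), and $entropy(y_{no}) \approx 0.81$ bits (1 sick vs. 3 not sick), so the information gain is $1 - \frac{2}{6}\cdot 0 - \frac{4}{6}\cdot 0.81 \approx 0.46$ bits.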

Code Implementation

We now use the information gain as the score function.

import numpy as np

# `utils.mode`, `entropy`, and the `DecisionStumpErrorRate` base class come from
# the accompanying repo linked above.

class DecisionStumpInfoGain(DecisionStumpErrorRate):
    y_hat_yes = None
    y_hat_no = None
    j_best = None
    t_best = None

    def fit(self, X, y):
        n, d = X.shape
        class_count = np.unique(y).size

        # Baseline prediction: the most common label overall
        self.y_hat_yes = utils.mode(y)

        # If all labels are the same, there is nothing to split
        if class_count == 1:
            return

        # Entropy of the labels before any split
        p = np.bincount(y, minlength=class_count) / n
        prev_entropy = entropy(p)
        max_info = 0

        for j in range(d):
            thresholds = np.unique(X[:, j])
            for threshold in thresholds:
                yes = X[:, j] > threshold
                y_yes = y[yes]
                y_no = y[~yes]

                # Skip degenerate splits that leave one side empty
                if len(y_yes) == 0 or len(y_no) == 0:
                    continue

                # Label distributions on each side of the split
                p_yes = np.bincount(y_yes, minlength=class_count) / len(y_yes)
                p_no = np.bincount(y_no, minlength=class_count) / len(y_no)

                # Information gain: previous entropy minus weighted entropy after the split
                new_entropy = len(y_yes) / n * entropy(p_yes) + len(y_no) / n * entropy(p_no)
                info_gain = prev_entropy - new_entropy

                if info_gain > max_info:
                    max_info = info_gain
                    self.j_best = j
                    self.t_best = threshold
                    self.y_hat_yes = utils.mode(y_yes)
                    self.y_hat_no = utils.mode(y_no)
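
As a quick check, the stump can be fit on the toy X and y from the numpy snippet earlier (this assumes `utils.mode` and `entropy` from the linked repo are importable):

stump = DecisionStumpInfoGain()
stump.fit(X, y)
print(stump.j_best, stump.t_best)  # expected: feature 2 (egg) with threshold 0,
                                   # since egg > 0 separates sick from not sick perfectly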

Then we split recursively until we reach the specified maximum depth or the best stump no longer splits the data (j_best is None).

    def fit(self, X, y):
        # Fits a decision tree using greedy recursive splitting

        # Learn a decision stump
        stump_model = self.stump_class()
        stump_model.fit(X, y)

        if self.max_depth <= 1 or stump_model.j_best is None:
            # If we have reached the maximum depth or the decision stump does
            # nothing, use the decision stump

            self.stump_model = stump_model
            self.submodel_yes = None
            self.submodel_no = None
            return

        # Fit a decision tree to each split, decreasing maximum depth by 1
        j = stump_model.j_best
        value = stump_model.t_best

        # Find indices of examples in each split
        yes = X[:, j] > value
        no = X[:, j] <= value

        # Fit decision tree to each split
        self.stump_model = stump_model
        self.submodel_yes = DecisionTree(
            self.max_depth - 1, stump_class=self.stump_class
        )
        self.submodel_yes.fit(X[yes], y[yes])
        self.submodel_no = DecisionTree(
            self.max_depth - 1, stump_class=self.stump_class
        )
        self.submodel_no.fit(X[no], y[no])
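
The fit method above only builds the tree; prediction follows the learned stumps down to a leaf. Below is a minimal matching predict written against the fields set in fit – a sketch, not necessarily the exact version in the linked repo (it assumes the stump classes define their own predict):

    def predict(self, X):
        n, d = X.shape

        # Leaf: no submodels were fit, so the stump alone makes the prediction
        if self.submodel_yes is None:
            return self.stump_model.predict(X)

        j = self.stump_model.j_best
        value = self.stump_model.t_best

        # Route each example to the matching subtree and recurse
        yes = X[:, j] > value
        no = X[:, j] <= value

        y_hat = np.zeros(n)
        y_hat[yes] = self.submodel_yes.predict(X[yes])
        y_hat[no] = self.submodel_no.predict(X[no])
        return y_hat

Example usage on the toy data from earlier (the constructor signature is inferred from the recursive calls inside fit):

model = DecisionTree(2, stump_class=DecisionStumpInfoGain)
model.fit(X, y)
print(model.predict(X))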