Decision Trees

How complex decisions can be divided & conquered

A decision tree is a classification technique. The recursive nature of a decision tree makes it easy to reason about a problem and about how the technique solves it.

Let's look at the structure of a decision tree.

                 ----------------------
                |          Node        |
                |        Feature 0     |      
                 ----------------------
                 /          |          \
               Outcome 0   ...    Outcome n
               /            |            \
         ---------                    -----------
        |   Leaf  |                  |    Node   | 
        | Class A |                  | Feature m |
         ---------                    -----------

  • Each node represents a feature to be evaluated.
  • Each edge from a node represents one outcome of evaluating that feature.
  • Each leaf represents a class label (see the code sketch below).
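
To make this concrete, here is a minimal sketch of such a tree in Python (hypothetical names, not from any particular library): an internal node holds a feature and one child per outcome, and a leaf holds a class label.

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class Node:
        feature: Optional[str] = None    # feature evaluated at an internal node
        children: Dict[str, "Node"] = field(default_factory=dict)  # outcome -> child node
        label: Optional[str] = None      # class label, set only on leaves

    def predict(node: Node, instance: Dict[str, str]) -> str:
        # Walk from the root: follow the edge matching each feature's
        # outcome on this instance until a leaf is reached.
        while node.label is None:
            node = node.children[instance[node.feature]]
        return node.label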

Decision trees can also be used for regression and for multi-class classification, but here we consider only binary classification.

Training a Decision Tree Classifier

The training set consists of instances represented by feature vectors, each labelled as either class A or class B.

At each node, we select the feature that splits the training set so that the Information Gain is maximised. When we reach a node that contains a homogeneous set of instances (i.e. all the instances that reached it belong to class A, or all belong to class B) we stop splitting; such a node is said to be pure.

Information Gain = Entropy Before Split - Entropy After Split

Entropy measures the impurity of a set of instances and, for binary classes A and B, is quantified as:

$Entropy = - P(A)log_2 P(A) - P(B)log_2 P(B) \\ Where \\ P(A) = Probability\ of\ an\ instance\ being\ class\ A \\ P(B) = Probability\ of\ an\ instance\ being\ class\ B \\ $
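
As a minimal sketch, this entropy can be computed directly from the two class counts, using the usual convention that 0 log2 0 = 0:

    from math import log2

    def entropy(count_a: int, count_b: int) -> float:
        # Binary entropy of a node holding count_a class-A and
        # count_b class-B training instances.
        total = count_a + count_b
        h = 0.0
        for count in (count_a, count_b):
            p = count / total
            if p > 0:              # convention: 0 * log2(0) = 0
                h -= p * log2(p)
        return h

    print(entropy(1, 1))   # 1.0 -> maximally impure
    print(entropy(2, 0))   # 0.0 -> pure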

So in general at any node

                     Node n
                 (C^n_A, C^n_B)
          ____________|____________
         |            |            |
       Node l       Node r      Node ..
  (C^l_A, C^l_B) (C^r_A, C^r_B)    ..

$(C^n_A,C^n_B)$ are the training set instance counts of classes A and B respectively at node $n$. Similarly $(C^l_A,C^l_B)$ at node $l$ and $(C^r_A,C^r_B)$ at node $r$. Considering a binary split:

$EntropyBefore = - P(A|n)log_2P(A|n) - P(B|n)log_2P(B|n)\\ Where\\ P(A|n) = C^n_A/(C^n_A+C^n_B)\\ P(B|n) = C^n_B/(C^n_A+C^n_B)\\ $

Since we recursively split the training set, $C^n_A = C^l_A + C^r_A$ and $C^n_B = C^l_B + C^r_B$

$ Entropy^{l} = - P(A|l)log_2P(A|l) - P(B|l)log_2P(B|l)\\ Where\quad P(A|l) = C^l_A/(C^l_A+C^l_B)\\ \qquad\qquad P(B|l) = C^l_B/(C^l_A+C^l_B)\\ $
$ Entropy^{r} = - P(A|r)log_2P(A|r) - P(B|r)log_2P(B|r)\\ Where\quad P(A|r) = C^r_A/(C^r_A+C^r_B)\\ \qquad\qquad P(B|r) = C^r_B/(C^r_A+C^r_B)\\ $
$ EntropyAfter = W^l * Entropy^{l} + W^r * Entropy^{r}\\ Where\quad W^l = (C^l_A+C^l_B) / (C^n_A+C^n_B)\\ \qquad\qquad W^r = (C^r_A+C^r_B) / (C^n_A+C^n_B)\\ $

The entropies are weighted by the fraction of instances that goes to each side.
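
Continuing the sketch, the after-split entropy and the Information Gain of a binary split follow directly from these formulas (counts are (C_A, C_B) pairs, and entropy is the helper sketched earlier):

    def entropy_after(left, right):
        # Weighted entropy of a binary split; left and right are
        # (count_a, count_b) pairs for the two child nodes.
        n_left, n_right = sum(left), sum(right)
        n = n_left + n_right
        return (n_left / n) * entropy(*left) + (n_right / n) * entropy(*right)

    def information_gain(parent, left, right):
        # parent is the (count_a, count_b) pair at the node being split.
        return entropy(*parent) - entropy_after(left, right)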

This process is repeated for each of the available features, yielding one Information Gain per candidate feature.

In general, for k features indexed 0 to k-1:

Best split feature index = ArgMax(InformationGain_F0, ..., InformationGain_F(k-1))
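
In code this is just an argmax over the per-feature gains; a tiny sketch with made-up gain values:

    def best_split_feature(gains):
        # gains maps feature name -> InformationGain
        return max(gains, key=gains.get)

    print(best_split_feature({'F0': 0.20, 'F1': 0.55, 'F2': 0.10}))   # 'F1'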

For example, here's a training set:

Name     EyeColor  PlayCricket  Education  Class
Apu      green     true         UG         A
Bernice  blue      false        PG         B
Carl     green     true         UG         A
Doris    brown     false        PG         B
Edna     blue      false        PG         B
Frink    brown     true         PG         B
Gil      brown     true         UG         B
Homer    blue      true         UG         A
Itchy    blue      true         PG         A
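
To follow the computations below in code, the same training set can be held as a list of plain records (values exactly as in the table):

    training_set = [
        {'Name': 'Apu',     'EyeColor': 'green', 'PlayCricket': 'true',  'Education': 'UG', 'Class': 'A'},
        {'Name': 'Bernice', 'EyeColor': 'blue',  'PlayCricket': 'false', 'Education': 'PG', 'Class': 'B'},
        {'Name': 'Carl',    'EyeColor': 'green', 'PlayCricket': 'true',  'Education': 'UG', 'Class': 'A'},
        {'Name': 'Doris',   'EyeColor': 'brown', 'PlayCricket': 'false', 'Education': 'PG', 'Class': 'B'},
        {'Name': 'Edna',    'EyeColor': 'blue',  'PlayCricket': 'false', 'Education': 'PG', 'Class': 'B'},
        {'Name': 'Frink',   'EyeColor': 'brown', 'PlayCricket': 'true',  'Education': 'PG', 'Class': 'B'},
        {'Name': 'Gil',     'EyeColor': 'brown', 'PlayCricket': 'true',  'Education': 'UG', 'Class': 'B'},
        {'Name': 'Homer',   'EyeColor': 'blue',  'PlayCricket': 'true',  'Education': 'UG', 'Class': 'A'},
        {'Name': 'Itchy',   'EyeColor': 'blue',  'PlayCricket': 'true',  'Education': 'PG', 'Class': 'A'},
    ]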

Let's evaluate the training set on the feature 'EyeColor'

$EntropyBefore = - P(A|n)log_2P(A|n) - P(B|n)log_2P(B|n)\\ Where\\ P(A|n) = C^n_A/(C^n_A+C^n_B)\\ P(B|n) = C^n_B/(C^n_A+C^n_B)\\ C^n_A = 4\ C^n_B = 5\\ Therefore\\ P(A|n) = \frac{4}{(4+5)} = \frac{4}{9}\\ P(B|n) = \frac{5}{(4+5)} = \frac{5}{9}\\ EntropyBefore = -\frac{4}{9}log_2 \frac{4}{9} - \frac{5}{9}log_2 \frac{5}{9}\\ \qquad = 0.991 $
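
The entropy helper sketched earlier reproduces this value:

    # 4 class-A and 5 class-B instances at the root
    print(round(entropy(4, 5), 3))   # 0.991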

The EyeColor feature splits the training set into three subsets, one for each outcome in {green, blue, brown}, resulting in three potential nodes.

        Feature (EyeColor)
     ____________|____________
    |            |            |
    Green = 2    Blue = 4    Brown = 3
    (A=2,B=0)    (A=2,B=2)   (A=0,B=3)

$ P(A|green)\ = \frac{2}{2},\qquad P(B|green)\ \ = \frac{0}{2}\\ P(A|blue)\quad= \frac{2}{4},\qquad P(B|blue)\quad= \frac{2}{4}\\ P(A|brown) = \frac{0}{3},\qquad P(B|brown)\ = \frac{3}{3}\\ $

Entropies for each outcome

$ Entropy^{green} = - P(A|green) log_2 P(A|green) - P(B|green) log_2 P(B|green)\\ \qquad\qquad\quad\ = - \frac{2}{2}log_2 \frac{2}{2} - \frac{0}{2} log_2 \frac{0}{2} \\ \qquad\qquad\quad\ = - \frac{2}{2}log_2 \frac{2}{2} = 0 $
$ Entropy^{blue} = - P(A|blue) log_2 P(A|blue) - P(B|blue) log_2 P(B|blue)\\ \qquad\qquad\quad = - \frac{2}{4} log_2 \frac{2}{4} - \frac{2}{4} log_2 \frac{2}{4} = 1 \\ $
$ Entropy^{brown} = - P(A|brown) log_2 P(A|brown) - P(B|brown) log_2 P(B|brown)\\ \qquad\qquad\quad\ = - \frac{0}{3} log_2 \frac{0}{3} - \frac{3}{3} log_2 \frac{3}{3} \\ \qquad\qquad\quad\ = 0 - \frac{3}{3} log_2 \frac{3}{3} = 0 \\ $

Therefore EntropyAfter can be computed as below

$ Entropy^{EyeColor} = W^{green} Entropy^{green} + W^{blue} Entropy^{blue} + W^{brown} Entropy^{brown}\\ Where\\ W^{green} = \frac{2}{9},\ W^{blue} = \frac{4}{9},\ W^{brown} = \frac{3}{9}\\ Entropy^{EyeColor} = \frac{2}{9} (-\frac{2}{2} log_2 \frac{2}{2}) + \frac{4}{9} (-\frac{2}{4} log_2 \frac{2}{4} - \frac{2}{4} log_2 \frac{2}{4}) + \frac{3}{9} (- 0 - \frac{3}{3} log_2 \frac{3}{3})\\ \qquad\qquad = \frac{2}{9} \times 0 + \frac{4}{9} \times 1 + \frac{3}{9} \times 0\\ \qquad\qquad = 0.444 $
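
Because this is a three-way split, the weighting generalises the binary EntropyAfter formula to one weighted term per outcome; a quick check with the sketch functions:

    # (count_A, count_B) per outcome of EyeColor
    outcomes = {'green': (2, 0), 'blue': (2, 2), 'brown': (0, 3)}
    n = sum(a + b for a, b in outcomes.values())   # 9 instances in total
    after = sum(((a + b) / n) * entropy(a, b) for (a, b) in outcomes.values())
    print(round(after, 3))   # 0.444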

Now let's evaluate the training set on the feature 'PlayCricket'. (Remember, we are still evaluating features and haven't committed to a split on a specific feature yet.) As before, we can compute the relevant entropies:

        Feature (PlayCricket)
      ___________|___________
     |                       |
     Play = 6            NoPlay = 3
     (A=4,B=2)           (A=0,B=3)

$ P(A|Play) = \frac{4}{6},\qquad P(B|Play) = \frac{2}{6}\\ P(A|NoPlay) = \frac{0}{3},\qquad P(B|NoPlay) = \frac{3}{3}\\ $
$ Entropy^{Play} = -\frac{4}{6} log_2 \frac{4}{6} - \frac{2}{6} log_2 \frac{2}{6}\\ \qquad\qquad\qquad = 0.918\\ Entropy^{NoPlay} =-\frac{0}{3} log_2 \frac{0}{3} - \frac{3}{3} log_2 \frac{3}{3}\\ \qquad\qquad\qquad = 0.0\\ $
$ W^{Play}\ = \frac{6}{9},\qquad W^{NoPlay}\ = \frac{3}{9}\\ $
$ Entropy^{PlayCricket} = W^{Play} \times Entropy^{Play} + W^{NoPlay} \times Entropy^{NoPlay}\\ \qquad\qquad\qquad = \frac{6}{9} \times 0.918 + \frac{3}{9} \times 0.0\\ \qquad\qquad\qquad = 0.612\\ $

Now let's evaluate the training set on the 'Education' feature.

        Feature (Education)
      ___________|___________
     |                       |
     UG = 4              PG = 5
     (A=3,B=1)           (A=1,B=4)

Unlike the other features, this split would yield two impure nodes. Therefore it's unlikely that this feature would give us a better Information Gain.

$ Entropy^{UG} = - P(A|UG) log_2 P(A|UG) - P(B|UG) log_2 P(B|UG)\\ \qquad = -\frac{3}{4} log_2 \frac{3}{4} - \frac{1}{4} log_2 \frac{1}{4}\\ \qquad = 0.811\\ Entropy^{PG} = - P(A|PG) log_2 P(A|PG) - P(B|PG) log_2 P(B|PG)\\ \qquad = -\frac{1}{5} log_2 \frac{1}{5} - \frac{4}{5} log_2 \frac{4}{5}\\ \qquad = 0.722\\ Entropy^{Education} = \frac{4}{9} \times 0.811 + \frac{5}{9} \times 0.722\\ \qquad = 0.76 $

Now that we have evaluated all the available features, let's compute which feature yields the maximum Information Gain.

InformationGain = EntropyBefore - EntropyAfter

Feature      InformationGain
EyeColor     0.99 - 0.44 = 0.55
PlayCricket  0.99 - 0.61 = 0.38
Education    0.99 - 0.76 = 0.23
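
All three rows can be reproduced from the training_set records with a small helper that counts classes per outcome (built on the entropy sketch above):

    from collections import Counter

    def gain(records, feature):
        # Information Gain of splitting `records` on `feature`.
        parent = Counter(r['Class'] for r in records)
        before = entropy(parent['A'], parent['B'])
        n = len(records)
        after = 0.0
        for outcome in {r[feature] for r in records}:
            subset = Counter(r['Class'] for r in records if r[feature] == outcome)
            after += (sum(subset.values()) / n) * entropy(subset['A'], subset['B'])
        return before - after

    for f in ('EyeColor', 'PlayCricket', 'Education'):
        print(f, round(gain(training_set, f), 2))
    # EyeColor 0.55, PlayCricket 0.38, Education 0.23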

Therefore, the best split would be made by choosing EyeColor as the feature on this node.

So far, the decision tree looks like this:

        Feature (EyeColor)
     ____________|____________
    |            |            |
    Green = 2    Blue = 4    Brown = 3
    (A=2,B=0)    (A=2,B=2)   (A=0,B=3)


The Green and Brown nodes each have a pure split and cannot be split any further, so they form leaves. The Blue node is impure, so we evaluate the features again to find its best split.

Let's draw the potential decision tree for each remaining feature and decide which one to split on.

With the Education feature, we get the following tree. Choosing it would yield one leaf and one impure node, which needs further splitting.

        Feature (EyeColor)
     ____________|____________
    |            |            |
    Green = 2    Blue = 4    Brown = 3
    (A=2,B=0)    (A=2,B=2)   (A=0,B=3)
                |
       Feature (Education)
        -----------------
       |                 |
      UG = 1          PG = 3
      (A=1,B=0)       (A=1,B=2)


With the PlayCricket feature, we get the following tree. Choosing it would yield two leaves, as both resulting splits are pure. Therefore the best choice is to split this node on the PlayCricket feature.

        Feature (EyeColor)
     ____________|____________
    |            |            |
    Green = 2    Blue = 4    Brown = 3
    (A=2,B=0)    (A=2,B=2)   (A=0,B=3)
                |
       Feature(PlayCricket)
        -----------------
       |                 |
      Play = 2          NoPlay = 2
      (A=2,B=0)       (A=0,B=2)



$ Entropy^{Play} = -P(A|Play) log_2 P(A|Play) - P(B|Play) log_2 P(B|Play)\\ \qquad = -\frac{2}{2} log_2 \frac{2}{2} - \frac{0}{2} log_2 \frac{0}{2}\\ \qquad = -\frac{2}{2} log_2 \frac{2}{2} - 0\\ Entropy^{NoPlay} = -P(A|NoPlay) log_2 P(A|NoPlay) - P(B|NoPlay) log_2 P(B|NoPlay)\\ \qquad = - \frac{0}{2} log_2 \frac{0}{2} - \frac{2}{2} log_2 \frac{2}{2}\\ \qquad = -0 - \frac{2}{2} log_2 \frac{2}{2}\\ \\ Entropy^{PlayCricket} = \frac{2}{4} \times Entropy^{Play} + \frac{2}{4} \times Entropy^{NoPlay}\\ \qquad\quad = \frac{2}{4} \times (- \frac{2}{2} log_2 \frac{2}{2} - 0) + \frac{2}{4} \times (-0 - \frac{2}{2} log_2 \frac{2}{2})\\ \qquad\quad = 0.0 $

Choosing the PlayCricket feature at this node yields the maximum possible Information Gain, so there is no need to evaluate the remaining features (i.e. Education). As we have now partitioned the whole training set into pure nodes by choosing an appropriate feature at each node, training stops here.
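
Putting the pieces together, a recursive build that stops at pure nodes reproduces exactly this tree. This is a sketch built on the earlier Node, gain and predict helpers; it does no pruning and assumes, as holds in this example, that recursion always reaches pure nodes before running out of features.

    def build(records, features):
        classes = {r['Class'] for r in records}
        if len(classes) == 1:            # pure node -> leaf
            return Node(label=classes.pop())
        feature = max(features, key=lambda f: gain(records, f))  # best split
        node = Node(feature=feature)
        remaining = [f for f in features if f != feature]
        for outcome in {r[feature] for r in records}:
            subset = [r for r in records if r[feature] == outcome]
            node.children[outcome] = build(subset, remaining)
        return node

    tree = build(training_set, ['EyeColor', 'PlayCricket', 'Education'])
    print(tree.feature)   # EyeColor at the root, as derived above
    print(predict(tree, {'EyeColor': 'blue', 'PlayCricket': 'true'}))   # 'A'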

Testing

[To Do]
