Decision Trees

How complex decisions can be divided & conquered

A decision tree is a classification technique. The recursive nature of a decision tree makes it easy to reason about a problem and about how the technique solves it.

Let's look at the structure of a decision tree.

                 ----------------------
                |          Node        |
                |        Feature 0     |      
                 ----------------------
                 /          |          \
               Outcome 0   ...    Outcome n
               /            |            \
         ---------                    -----------
        |   Leaf  |                  |    Node   | 
        | Class A |                  | Feature m |
         ---------                    -----------

  • Each node represents a feature to be evaluated.
  • Each edge from a node represents one outcome of evaluating that feature.
  • Each leaf represents a class label (see the code sketch below).
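
To make this concrete, here is a minimal sketch of such a tree in Python (hypothetical names, not from any particular library): an internal node holds a feature and one child per outcome, and a leaf holds a class label.

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class Node:
        feature: Optional[str] = None    # feature evaluated at an internal node
        children: Dict[str, "Node"] = field(default_factory=dict)  # outcome -> child node
        label: Optional[str] = None      # class label, set only on leaves

    def predict(node: Node, instance: Dict[str, str]) -> str:
        # Walk from the root: follow the edge matching each feature's
        # outcome on this instance until a leaf is reached.
        while node.label is None:
            node = node.children[instance[node.feature]]
        return node.label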

Decision trees can also be used for regression and for multi-class classification, but here we consider only binary classification.

Training a Decision Tree Classifier

The training set consists of instances represented by feature vectors, each labelled as either class A or class B.

At each node, we select the feature that splits the training set so that the Information Gain is maximised. When we reach a node that contains a homogeneous set of instances (i.e. all the instances that reached it belong to class A, or all belong to class B) we stop splitting; such a node is said to be pure.

Information Gain = Entropy Before Split - Entropy After Split

Entropy measures the impurity of a set of instances and, for binary classes A and B, is quantified as:

$Entropy = - P(A)log_2 P(A) - P(B)log_2 P(B) \\ Where \\ P(A) = Probability\ of\ an\ instance\ being\ class\ A \\ P(B) = Probability\ of\ an\ instance\ being\ class\ B \\ $
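
As a minimal sketch, this entropy can be computed directly from the two class counts, using the usual convention that 0 log2 0 = 0:

    from math import log2

    def entropy(count_a: int, count_b: int) -> float:
        # Binary entropy of a node holding count_a class-A and
        # count_b class-B training instances.
        total = count_a + count_b
        h = 0.0
        for count in (count_a, count_b):
            p = count / total
            if p > 0:              # convention: 0 * log2(0) = 0
                h -= p * log2(p)
        return h

    print(entropy(1, 1))   # 1.0 -> maximally impure
    print(entropy(2, 0))   # 0.0 -> pure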

So in general at any node

                     Node n
                 (C^n_A, C^n_B)
          ____________|____________
         |            |            |
       Node l       Node r      Node ..
  (C^l_A, C^l_B) (C^r_A, C^r_B)    ..

$(C^n_A,C^n_B)$ are the training set instance counts of classes A and B respectively at node $n$. Similarly $(C^l_A,C^l_B)$ at node $l$ and $(C^r_A,C^r_B)$ at node $r$. Considering a binary split:

$EntropyBefore = - P(A|n)log_2P(A|n) - P(B|n)log_2P(B|n)\\ Where\\ P(A|n) = C^n_A/(C^n_A+C^n_B)\\ P(B|n) = C^n_B/(C^n_A+C^n_B)\\ $

Since we recursively split the training set, $C^n_A = C^l_A + C^r_A$ and $C^n_B = C^l_B + C^r_B$

$ Entropy^{l} = - P(A|l)log_2P(A|l) - P(B|l)log_2P(B|l)\\ Where\quad P(A|l) = C^l_A/(C^l_A+C^l_B)\\ \qquad\qquad P(B|l) = C^l_B/(C^l_A+C^l_B)\\ $
$ Entropy^{r} = - P(A|r)log_2P(A|r) - P(B|r)log_2P(B|r)\\ Where\quad P(A|r) = C^r_A/(C^r_A+C^r_B)\\ \qquad\qquad P(B|r) = C^r_B/(C^r_A+C^r_B)\\ $
$ EntropyAfter = W^l * Entropy^{l} + W^r * Entropy^{r}\\ Where\quad W^l = (C^l_A+C^l_B) / (C^n_A+C^n_B)\\ \qquad\qquad W^r = (C^r_A+C^r_B) / (C^n_A+C^n_B)\\ $

The entropies are weighted by the fraction of instances that goes to each side.
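
Continuing the sketch, the after-split entropy and the Information Gain of a binary split follow directly from these formulas (counts are (C_A, C_B) pairs, and entropy is the helper sketched earlier):

    def entropy_after(left, right):
        # Weighted entropy of a binary split; left and right are
        # (count_a, count_b) pairs for the two child nodes.
        n_left, n_right = sum(left), sum(right)
        n = n_left + n_right
        return (n_left / n) * entropy(*left) + (n_right / n) * entropy(*right)

    def information_gain(parent, left, right):
        # parent is the (count_a, count_b) pair at the node being split.
        return entropy(*parent) - entropy_after(left, right)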

This process is repeated for each of the available features, yielding one Information Gain per candidate feature.

In general, for k features indexed 0 to k-1:

Best split feature index = ArgMax(InformationGain_F0, ..., InformationGain_F(k-1))
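
In code this is just an argmax over the per-feature gains; a tiny sketch with made-up gain values:

    def best_split_feature(gains):
        # gains maps feature name -> InformationGain
        return max(gains, key=gains.get)

    print(best_split_feature({'F0': 0.20, 'F1': 0.55, 'F2': 0.10}))   # 'F1'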

For example, here's a training set:

Name     EyeColor  PlayCricket  Education  Class
Apu      green     true         UG         A
Bernice  blue      false        PG         B
Carl     green     true         UG         A
Doris    brown     false        PG         B
Edna     blue      false        PG         B
Frink    brown     true         PG         B
Gil      brown     true         UG         B
Homer    blue      true         UG         A
Itchy    blue      true         PG         A
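
To follow the computations below in code, the same training set can be held as a list of plain records (values exactly as in the table):

    training_set = [
        {'Name': 'Apu',     'EyeColor': 'green', 'PlayCricket': 'true',  'Education': 'UG', 'Class': 'A'},
        {'Name': 'Bernice', 'EyeColor': 'blue',  'PlayCricket': 'false', 'Education': 'PG', 'Class': 'B'},
        {'Name': 'Carl',    'EyeColor': 'green', 'PlayCricket': 'true',  'Education': 'UG', 'Class': 'A'},
        {'Name': 'Doris',   'EyeColor': 'brown', 'PlayCricket': 'false', 'Education': 'PG', 'Class': 'B'},
        {'Name': 'Edna',    'EyeColor': 'blue',  'PlayCricket': 'false', 'Education': 'PG', 'Class': 'B'},
        {'Name': 'Frink',   'EyeColor': 'brown', 'PlayCricket': 'true',  'Education': 'PG', 'Class': 'B'},
        {'Name': 'Gil',     'EyeColor': 'brown', 'PlayCricket': 'true',  'Education': 'UG', 'Class': 'B'},
        {'Name': 'Homer',   'EyeColor': 'blue',  'PlayCricket': 'true',  'Education': 'UG', 'Class': 'A'},
        {'Name': 'Itchy',   'EyeColor': 'blue',  'PlayCricket': 'true',  'Education': 'PG', 'Class': 'A'},
    ]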

Let's evaluate the training set on the feature 'EyeColor'

$EntropyBefore = - P(A|n)log_2P(A|n) - P(B|n)log_2P(B|n)\\ Where\\ P(A|n) = C^n_A/(C^n_A+C^n_B)\\ P(B|n) = C^n_B/(C^n_A+C^n_B)\\ C^n_A = 4\ C^n_B = 5\\ Therefore\\ P(A|n) = \frac{4}{(4+5)} = \frac{4}{9}\\ P(B|n) = \frac{5}{(4+5)} = \frac{5}{9}\\ EntropyBefore = -\frac{4}{9}log_2 \frac{4}{9} - \frac{5}{9}log_2 \frac{5}{9}\\ \qquad = 0.991 $
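
The entropy helper sketched earlier reproduces this value:

    # 4 class-A and 5 class-B instances at the root
    print(round(entropy(4, 5), 3))   # 0.991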

The EyeColor feature splits the training set into three subsets, one for each outcome in {green, blue, brown}, resulting in three potential nodes.

        Feature (EyeColor)
     ____________|____________
    |            |            |
    Green = 2    Blue = 4    Brown = 3
    (A=2,B=0)    (A=2,B=2)   (A=0,B=3)

$ P(A|green)\ = \frac{2}{2},\qquad P(B|green)\ \ = \frac{0}{2}\\ P(A|blue)\quad= \frac{2}{4},\qquad P(B|blue)\quad= \frac{2}{4}\\ P(A|brown) = \frac{0}{3},\qquad P(B|brown)\ = \frac{3}{3}\\ $

Entropies for each outcome

$ Entropy^{green} = - P(A|green) log_2 P(A|green) - P(B|green) log_2 P(B|green)\\ \qquad\qquad\quad\ = - \frac{2}{2}log_2 \frac{2}{2} - \frac{0}{2} log_2 \frac{0}{2} \\ \qquad\qquad\quad\ = - \frac{2}{2}log_2 \frac{2}{2} = 0 $
$ Entropy^{blue} = - P(A|blue) log_2 P(A|blue) - P(B|blue) log_2 P(B|blue)\\ \qquad\qquad\quad = - \frac{2}{4} log_2 \frac{2}{4} - \frac{2}{4} log_2 \frac{2}{4} = 1 \\ $
$ Entropy^{brown} = - P(A|brown) log_2 P(A|brown) - P(B|brown) log_2 P(B|brown)\\ \qquad\qquad\quad\ = - \frac{0}{3} log_2 \frac{0}{3} - \frac{3}{3} log_2 \frac{3}{3} \\ \qquad\qquad\quad\ = 0 - \frac{3}{3} log_2 \frac{3}{3} = 0 \\ $

Therefore EntropyAfter can be computed as below

$ Entropy^{EyeColor} = W^{green} Entropy^{green} + W^{blue} Entropy^{blue} + W^{brown} Entropy^{brown}\\ Where\\ W^{green} = \frac{2}{9},\ W^{blue} = \frac{4}{9},\ W^{brown} = \frac{3}{9}\\ Entropy^{EyeColor} = \frac{2}{9} (-\frac{2}{2} log_2 \frac{2}{2}) + \frac{4}{9} (-\frac{2}{4} log_2 \frac{2}{4} - \frac{2}{4} log_2 \frac{2}{4}) + \frac{3}{9} (- 0 - \frac{3}{3} log_2 \frac{3}{3})\\ \qquad\qquad = \frac{2}{9} \times 0 + \frac{4}{9} \times 1 + \frac{3}{9} \times 0\\ \qquad\qquad = 0.444 $
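
Because this is a three-way split, the weighting generalises the binary EntropyAfter formula to one weighted term per outcome; a quick check with the sketch functions:

    # (count_A, count_B) per outcome of EyeColor
    outcomes = {'green': (2, 0), 'blue': (2, 2), 'brown': (0, 3)}
    n = sum(a + b for a, b in outcomes.values())   # 9 instances in total
    after = sum(((a + b) / n) * entropy(a, b) for (a, b) in outcomes.values())
    print(round(after, 3))   # 0.444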

Now let's evaluate the training set on the feature 'PlayCricket'. (Remember, we are still evaluating features and haven't committed to a split on a specific feature yet.) As before, we can compute the relevant entropies:

        Feature (PlayCricket)
      ___________|___________
     |                       |
     Play = 6            NoPlay = 3
     (A=4,B=2)           (A=0,B=3)

$ P(A|Play) = \frac{4}{6},\qquad P(B|Play) = \frac{2}{6}\\ P(A|NoPlay) = \frac{0}{3},\qquad P(B|NoPlay) = \frac{3}{3}\\ $
$ Entropy^{Play} = -\frac{4}{6} log_2 \frac{4}{6} - \frac{2}{6} log_2 \frac{2}{6}\\ \qquad\qquad\qquad = 0.918\\ Entropy^{NoPlay} =-\frac{0}{3} log_2 \frac{0}{3} - \frac{3}{3} log_2 \frac{3}{3}\\ \qquad\qquad\qquad = 0.0\\ $
$ W^{Play}\ = \frac{6}{9},\qquad W^{NoPlay}\ = \frac{3}{9}\\ $
$ Entropy^{PlayCricket} = W^{Play} \times Entropy^{Play} + W^{NoPlay} \times Entropy^{NoPlay}\\ \qquad\qquad\qquad = \frac{6}{9} \times 0.918 + \frac{3}{9} \times 0.0\\ \qquad\qquad\qquad = 0.612\\ $

Now let's evaluate the training set on the 'Education' feature.

        Feature (Education)
      ___________|___________
     |                       |
     UG = 4              PG = 5
     (A=3,B=1)           (A=1,B=4)

Unlike the other features, this split would yield two impure nodes. Therefore it's unlikely that this feature would give us a better Information Gain.

$ Entropy^{UG} = - P(A|UG) log_2 P(A|UG) - P(B|UG) log_2 P(B|UG)\\ \qquad = -\frac{3}{4} log_2 \frac{3}{4} - \frac{1}{4} log_2 \frac{1}{4}\\ \qquad = 0.811\\ Entropy^{PG} = - P(A|PG) log_2 P(A|PG) - P(B|PG) log_2 P(B|PG)\\ \qquad = -\frac{1}{5} log_2 \frac{1}{5} - \frac{4}{5} log_2 \frac{4}{5}\\ \qquad = 0.722\\ Entropy^{Education} = \frac{4}{9} \times 0.811 + \frac{5}{9} \times 0.722\\ \qquad = 0.76 $

Now that we have evaluated all the available features, let's compute which feature yields the maximum Information Gain.

InformationGain = EntropyBefore - EntropyAfter

Feature      InformationGain
EyeColor     0.99 - 0.44 = 0.55
PlayCricket  0.99 - 0.61 = 0.38
Education    0.99 - 0.76 = 0.23
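
All three rows can be reproduced from the training_set records with a small helper that counts classes per outcome (built on the entropy sketch above):

    from collections import Counter

    def gain(records, feature):
        # Information Gain of splitting `records` on `feature`.
        parent = Counter(r['Class'] for r in records)
        before = entropy(parent['A'], parent['B'])
        n = len(records)
        after = 0.0
        for outcome in {r[feature] for r in records}:
            subset = Counter(r['Class'] for r in records if r[feature] == outcome)
            after += (sum(subset.values()) / n) * entropy(subset['A'], subset['B'])
        return before - after

    for f in ('EyeColor', 'PlayCricket', 'Education'):
        print(f, round(gain(training_set, f), 2))
    # EyeColor 0.55, PlayCricket 0.38, Education 0.23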

Therefore, the best split would be made by choosing EyeColor as the feature on this node.

So far, the decision tree looks like this:

        Feature (EyeColor)
     ____________|____________
    |            |            |
    Green = 2    Blue = 4    Brown = 3
    (A=2,B=0)    (A=2,B=2)   (A=0,B=3)


The Green and Brown nodes each have a pure split and cannot be split any further, so they form leaves. The Blue node is impure, so we evaluate the features again to find its best split.

Let's draw the potential decision tree for each remaining feature and decide which one to split on.

With the Education feature, we get the following tree. Choosing it would yield one leaf and one impure node, which needs further splitting.

        Feature (EyeColor)
     ____________|____________
    |            |            |
    Green = 2    Blue = 4    Brown = 3
    (A=2,B=0)    (A=2,B=2)   (A=0,B=3)
                |
       Feature (Education)
        -----------------
       |                 |
      UG = 1          PG = 3
      (A=1,B=0)       (A=1,B=2)


With the PlayCricket feature, we get the following tree. Choosing it would yield two leaves, as both resulting splits are pure. Therefore the best choice is to split this node on the PlayCricket feature.

        Feature (EyeColor)
     ____________|____________
    |            |            |
    Green = 2    Blue = 4    Brown = 3
    (A=2,B=0)    (A=2,B=2)   (A=0,B=3)
                |
       Feature(PlayCricket)
        -----------------
       |                 |
      Play = 2          NoPlay = 2
      (A=2,B=0)       (A=0,B=2)



$ Entropy^{Play} = -P(A|Play) log_2 P(A|Play) - P(B|Play) log_2 P(B|Play)\\ \qquad = -\frac{2}{2} log_2 \frac{2}{2} - \frac{0}{2} log_2 \frac{0}{2}\\ \qquad = -\frac{2}{2} log_2 \frac{2}{2} - 0\\ Entropy^{NoPlay} = -P(A|NoPlay) log_2 P(A|NoPlay) - P(B|NoPlay) log_2 P(B|NoPlay)\\ \qquad = - \frac{0}{2} log_2 \frac{0}{2} - \frac{2}{2} log_2 \frac{2}{2}\\ \qquad = -0 - \frac{2}{2} log_2 \frac{2}{2}\\ \\ Entropy^{PlayCricket} = \frac{2}{4} \times Entropy^{Play} + \frac{2}{4} \times Entropy^{NoPlay}\\ \qquad\quad = \frac{2}{4} \times (- \frac{2}{2} log_2 \frac{2}{2} - 0) + \frac{2}{4} \times (-0 - \frac{2}{2} log_2 \frac{2}{2})\\ \qquad\quad = 0.0 $

Choosing the PlayCricket feature at this node yields the maximum possible Information Gain, so there is no need to evaluate the remaining features (i.e. Education). As we have now partitioned the whole training set into pure nodes by choosing an appropriate feature at each node, training stops here.
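
Putting the pieces together, a recursive build that stops at pure nodes reproduces exactly this tree. This is a sketch built on the earlier Node, gain and predict helpers; it does no pruning and assumes, as holds in this example, that recursion always reaches pure nodes before running out of features.

    def build(records, features):
        classes = {r['Class'] for r in records}
        if len(classes) == 1:            # pure node -> leaf
            return Node(label=classes.pop())
        feature = max(features, key=lambda f: gain(records, f))  # best split
        node = Node(feature=feature)
        remaining = [f for f in features if f != feature]
        for outcome in {r[feature] for r in records}:
            subset = [r for r in records if r[feature] == outcome]
            node.children[outcome] = build(subset, remaining)
        return node

    tree = build(training_set, ['EyeColor', 'PlayCricket', 'Education'])
    print(tree.feature)   # EyeColor at the root, as derived above
    print(predict(tree, {'EyeColor': 'blue', 'PlayCricket': 'true'}))   # 'A'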

Testing

[To Do]
