I am studying machine learning and writing Python code to check whether I understand the theory of decision trees correctly. For reference, see [MLT].
First, Entropy and Gain should be calculated, and here is the code snippet. Entropy can be used like Entropy( [9, 5] ), which returns 0.9402859586706309, and Gain( [9,5], [[6,2],[3,3]] ) returns 0.04812703040826927.
import numpy as NP
import itertools
import functools
import math
def Entropy( S ):
    # S is a list of class counts, e.g. [9, 5]
    N = sum( S )
    if N == 0.0 : return 0.0
    P = [ Si/N for Si in S ]
    return sum( [ (-pi * math.log(pi,2) if pi != 0.0 else 0.0) for pi in P ] )

def Gain( S, A ):
    # S is the class-count list of the whole set, A is one class-count list per attribute value
    NS = sum( S )
    NSv = [ sum(Ai) for Ai in A ]
    return Entropy(S) - sum( [ (NSv[i]/NS)*Entropy(A[i]) for i in range(len(A)) ] )
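Both follow the textbook definitions in [MLT]: Entropy(S) = -sum_i( p_i * log2(p_i) ) and Gain(S, A) = Entropy(S) - sum_v( (|Sv|/|S|) * Entropy(Sv) ). As a quick sanity check, the two values quoted above can be verified like this:

# compare against the values quoted above for Entropy([9,5]) and Gain([9,5], [[6,2],[3,3]])
assert abs( Entropy( [9, 5] ) - 0.9402859586706309 ) < 1e-12
assert abs( Gain( [9, 5], [[6, 2], [3, 3]] ) - 0.04812703040826927 ) < 1e-12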
As a test example, I make up the variable O below (these are the familiar play-tennis examples from [MLT]). It is a NumPy array so that column slicing like O[:,-1] can be used.
O = NP.array([
['Sunny', 'Hot', 'High', 'Weak', 'No'],
['Sunny', 'Hot', 'High', 'Strong', 'No'],
['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
['Rain', 'Mild', 'High', 'Weak', 'Yes'],
['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
['Rain', 'Cool', 'Normal', 'Strong', 'No'],
['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
['Sunny', 'Mild', 'High', 'Weak', 'No'],
['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
['Rain', 'Mild', 'High', 'Strong', 'No'],
])
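Just to show the column slicing mentioned above, O[:,-1] picks out the class column:

print( O[:,-1] )
# ['No' 'No' 'Yes' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes' 'Yes' 'Yes' 'Yes' 'Yes' 'No']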
Then there should be a holder that keeps track of which distinct values each column can take.
def DistinctAttributeMap( O ) :
    DA = []
    for i in range(len(O[0])) :
        m = {}
        A = list(set(O[:,i]))      # distinct values in column i
        for j in range(0, len(A)) :
            m[ A[j] ] = j          # value -> index
        m[ 'Reverse' ] = A         # index -> value
        m[ 'Count' ] = len(A)
        DA.append( m )
    return DA
DistinctAttributeMap goes through each column and figures out which distinct values appear there. With the examples above, it generates output like below.
[
  { 'Rain': 0, 'Sunny': 1, 'Overcast': 2, 'Reverse': ['Rain', 'Sunny', 'Overcast'], 'Count': 3 },
  { 'Cool': 0, 'Hot': 1, 'Mild': 2, 'Reverse': ['Cool', 'Hot', 'Mild'], 'Count': 3 },
  { 'High': 0, 'Normal': 1, 'Reverse': ['High', 'Normal'], 'Count': 2 },
  { 'Weak': 0, 'Strong': 1, 'Reverse': ['Weak', 'Strong'], 'Count': 2 },
  { 'No': 0, 'Yes': 1, 'Reverse': ['No', 'Yes'], 'Count': 2 }
]
And to generate the A matrix for Gain(), here comes Attributes().
def Attributes( O, DA, i ) :
    # rows: values of attribute i, columns: class labels, cells: co-occurrence counts
    A = NP.zeros( [ DA[i]['Count'], DA[-1]['Count'] ] )
    for j in range(len(O)) :
        a = DA[i][ O[j,i] ]        # row index for this example's attribute value
        c = DA[-1][ O[j,-1] ]      # column index for this example's class
        A[ a, c ] = A[ a, c ] + 1
    return A
Attributes( O, DA, 0 ) will generate a result like below: each row is a value of column 0 (Rain, Sunny, Overcast) and each column is a class (No, Yes).
array([[ 2., 3.],
[ 3., 2.],
[ 0., 4.]])
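As a quick cross-check (assuming DA = DistinctAttributeMap( O ), as in the run further below), feeding this matrix into Gain() together with the overall class counts reproduces the gain of column 0 that shows up at the root of the decision tree later:

S = [ (O[:,-1] == c).sum() for c in DA[-1]['Reverse'] ]   # overall class counts, e.g. [5, 9] for ['No', 'Yes']
print( Gain( S, Attributes( O, DA, 0 ) ) )                # ~0.2467, matching the root 'Gain' below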
Armed with the functions above, ID3 looks like below.
def ID3( Examples, AttributeMap, Availables, Tree ) :
    # Attributes() on the class column itself puts the class counts on its diagonal
    C = Attributes( Examples, AttributeMap, -1 )
    S = [ C[i,i] for i in range(len(C)) ]
    print( 'Examples = ', Examples )
    print( 'Availables = ', Availables )
    print( 'S = ', S )
    # all examples share one class : make a leaf
    for i in range(len(S)) :
        if S[i] == len(Examples) :
            Tree[ 'Leaf' ] = AttributeMap[-1][ 'Reverse' ][i]
            return
    # no attribute left : label with the majority class
    if len(Availables) == 0 :
        maxS = max( S )
        maxIndex = S.index( maxS )
        Tree[ 'Probably' ] = AttributeMap[-1][ 'Reverse' ][maxIndex]
        Tree[ 'Odd' ] = maxS / sum( S )
        return
    # pick the attribute with the largest information gain
    maxAttr = -1
    maxGain = -1
    for i in Availables :
        Gi = Gain( S, Attributes( Examples, AttributeMap, i ))
        if Gi > maxGain :
            maxAttr = i
            maxGain = Gi
    Tree[ 'Best' ] = maxAttr
    Tree[ 'Gain' ] = maxGain
    leftAttr = Availables.copy()
    leftAttr.remove( maxAttr )
    # split on each value of the best attribute
    for a in AttributeMap[maxAttr][ 'Reverse' ] :
        leftExamples = NP.array( list(filter( lambda r : r[maxAttr] == a, Examples )))
        if len( leftExamples ) == 0 :
            # no example carries this value: fall back to the parent's majority class, as [MLT] suggests
            Tree[ a ] = AttributeMap[-1][ 'Reverse' ][ S.index( max(S) ) ]
            continue
        leftClass = set(leftExamples[:,-1])
        if len( leftClass ) == 1 :
            Tree[ a ] = leftClass.pop()
        else :
            Tree[ a ] = {}
            ID3( leftExamples, AttributeMap, leftAttr, Tree[a] )
    return
The result of this is like below, with some formatting.
>DA = DistinctAttributeMap( O )
>DT = {}
>ID3( O, DA, [0,1,2,3], DT )
>print( 'DecisionTree = ', DT )
DecisionTree = {
    'Best': 0,
    'Gain': 0.24674981977443911,
    'Rain': {
        'Best': 3,
        'Gain': 0.97095059445466858,
        'Weak': 'Yes',
        'Strong': 'No' },
    'Sunny': {
        'Best': 2,
        'Gain': 0.97095059445466858,
        'High': 'No',
        'Normal': 'Yes' },
    'Overcast': 'Yes',
}
I used a dictionary to represent the tree structure. I guess I will try to show the result as a drawing, but that will be another blog post.
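Since the tree is just a nested dictionary, classifying a new row is a matter of walking it. Below is a minimal sketch of such a walker; the name Classify is my own, and it only handles the 'Leaf', 'Probably', and branch keys that ID3() above actually produces.

def Classify( Tree, row ) :
    if not isinstance( Tree, dict ) :
        return Tree                      # a branch that collapsed directly to a class label
    if 'Leaf' in Tree :
        return Tree[ 'Leaf' ]            # pure node
    if 'Probably' in Tree :
        return Tree[ 'Probably' ]        # no attributes were left, majority class
    return Classify( Tree[ row[ Tree['Best'] ] ], row )

print( Classify( DT, ['Sunny', 'Cool', 'High', 'Weak'] ) )   # 'No' with the tree above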
How about using the Gini index rather than information gain? Will it make any difference? (A small sketch of the impurity function follows these questions.)
What do I do to make it noise resilient, that is, to prevent overfitting?
How do I let the user do fine tuning? For example, a decision tree with at most 3 levels of depth and at least 95% classification accuracy?
How about presenting the decision tree as a diagram and letting the user verify it? An interactive diagram that shows and updates the decision tree?
How about making this an API so that the user provides data and retrieves a decision tree?
How about dynamically updating the decision tree at run time?
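For the Gini index question, here is a minimal sketch of the impurity function (my own addition, not from [MLT]); replacing Entropy() with it inside Gain() turns the split criterion into the impurity decrease that CART-style trees use.

def GiniImpurity( S ) :
    # class counts in, impurity out: 1 - sum(p_i^2); 0 for a pure node, largest for an even split
    N = sum( S )
    if N == 0.0 : return 0.0
    return 1.0 - sum( [ (Si/N)**2 for Si in S ] )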
Reference:
[MLT] Tom Mitchell, Machine Learning, McGraw-Hill, 1997. Chapter 3.