essay on programming languages, computer science, information technologies and all.

Sunday, May 8, 2016

Decision Tree with Python

I am studying machine learning and writing Python code to check whether I understand the theory of decision trees correctly. For reference, see [MLT].

First, entropy and information gain need to be calculated, and here is the code snippet. Entrophy can be called like Entrophy( [9, 5] ) and returns 0.9402859586706309, and Gain( [9,5], [[6,2],[3,3]] ) returns 0.04812703040826927.
import numpy as NP
import math

def Entrophy( S ):
    # S is a list of class counts, e.g. Entrophy( [9, 5] )
    N = sum( S )
    if N == 0.0 : return 0.0

    P = [Si/N for Si in S]
    return sum( [ (-pi * math.log(pi,2) if pi != 0.0 else 0.0) for pi in P] )

def Gain( S, A ):
    # S : class counts before the split, A : class counts per attribute value after the split
    NS = sum( S )
    NSv = [ sum(Ai) for Ai in A ]
    return Entrophy(S) - sum( [ (NSv[i]/NS)*Entrophy(A[i]) for i in range(len(A))] )
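As a quick sanity check, the calls mentioned above can be run directly; the split [[6,2],[3,3]] is simply the 14 examples divided into one group with counts [6,2] and another with counts [3,3].

# minimal sanity check of the two functions
print( Entrophy( [9, 5] ) )                  # -> 0.9402859586706309
# Entrophy([9,5]) - (8/14)*Entrophy([6,2]) - (6/14)*Entrophy([3,3])
print( Gain( [9, 5], [[6, 2], [3, 3]] ) )    # -> 0.04812703040826927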
As a test example, I make up the observations O below. It is a NumPy array so that column indexing like O[:,-1] can be used.

O = NP.array([
    ['Sunny', 'Hot', 'High', 'Weak', 'No'],
    ['Sunny', 'Hot', 'High', 'Strong', 'No'],
    ['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Strong', 'No'],
    ['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
    ['Sunny', 'Mild', 'High', 'Weak', 'No'],
    ['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
    ['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
    ['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
    ['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Strong', 'No'],    
    ])
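The last column holds the class labels, so the slicing works like this:

print( O[:, -1] )    # the class column : ['No' 'No' 'Yes' ... 'No']
print( O[0, :-1] )   # attributes of the first example : ['Sunny' 'Hot' 'High' 'Weak']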
Then there needs to be a holder that records which distinct values each attribute can take.
def DistinctAttributeMap( O ) :
    # for each column : map value -> index, plus 'Reverse' (index -> value) and 'Count'
    DA = []
    for i in range(len(O[0])) :
        m = {}
        A = list(set(O[:,i]))
        for j in range(0,len(A)) :
            m[ A[j] ] = j
        m[ 'Reverse' ] = A
        m[ 'Count' ] = len(A)
        DA.append( m )
    return DA
DistinctAttributeMap goes through each column and figures out which distinct values appear there. With the examples above, it generates output like below (the 'Count' entries are omitted here).
[
    {'Rain': 0, 'Sunny': 1, 'Overcast': 2, 'Reverse': ['Rain', 'Sunny', 'Overcast']},
    {'Cool': 0, 'Hot': 1, 'Mild': 2,       'Reverse': ['Cool', 'Hot', 'Mild']},
    {'High': 0, 'Normal': 1,               'Reverse': ['High', 'Normal']},
    {'Weak': 0, 'Strong': 1,               'Reverse': ['Weak', 'Strong']},
    {'No': 0, 'Yes': 1,                    'Reverse': ['No', 'Yes']}
]
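A small usage sketch; note that the exact indices can differ between runs, because set() does not guarantee an ordering.

DA = DistinctAttributeMap( O )
print( DA[0]['Sunny'] )         # e.g. 1 : value -> index
print( DA[0]['Reverse'][1] )    # e.g. 'Sunny' : index -> value
print( DA[-1]['Count'] )        # 2 : number of target classes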
And to generate the A matrix that Gain() expects, here comes Attributes().
def Attributes( O, DA, i ) :
    # count examples per ( value of attribute i, class ) pair
    A = NP.zeros( [ DA[i]['Count'], DA[-1]['Count'] ] )

    for j in range(len(O)) :
        a = DA[i][ O[j,i] ]
        c = DA[-1][ O[j,-1] ]
        A[ a, c ] = A[ a, c ] + 1

    return A
Attributes( O, DA, 0 ) generates a result like below.
array([[ 2.,  3.],
       [ 3.,  2.],
       [ 0.,  4.]])
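Feeding this matrix into Gain() by hand gives the information gain of splitting on the first attribute; this is just a sanity-check sketch, and the value matches the 'Gain' of the root split shown further below.

DA = DistinctAttributeMap( O )
A0 = Attributes( O, DA, 0 )        # counts per (first-attribute value, class)
S  = list( A0.sum( axis=0 ) )      # overall class counts, e.g. [5., 9.]
print( Gain( S, A0 ) )             # -> about 0.2467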
Armed with the functions above, ID3 looks like below.
def ID3( Examples, AttributeMap, Availables, Tree ) :

    # class counts of the current examples (the diagonal of the class-vs-class matrix)
    C = Attributes( Examples, AttributeMap, -1 )
    S = [ C[i,i] for i in range(len(C)) ]

    print( 'Examples = ', Examples )
    print( 'Availables = ', Availables )
    print( 'S = ', S )

    # all examples belong to one class : this node is a leaf
    for i in range(len(S)) :
        if S[i] == len(Examples) :
            Tree[ 'Leaf' ] = AttributeMap[-1][ 'Reverse' ][i]
            return

    # no attributes left : label with the most common class
    if len(Availables) == 0 :
        maxS = max( S )
        maxIndex = S.index( maxS )
        Tree[ 'Probably' ] = AttributeMap[-1][ 'Reverse' ][maxIndex]
        Tree[ 'Odd' ] = maxS / sum( S )
        return

    # pick the attribute with the highest information gain
    maxAttr = -1
    maxGain = -1
    for i in Availables :
        Gi = Gain( S, Attributes( Examples, AttributeMap, i ))
        if Gi > maxGain :
            maxAttr = i
            maxGain = Gi

    Tree[ 'Best' ] = maxAttr
    Tree[ 'Gain' ] = maxGain

    leftAttr = Availables.copy()
    leftAttr.remove( maxAttr )

    # split the examples on each value of the best attribute and recurse
    for a in AttributeMap[maxAttr][ 'Reverse' ] :
        leftExamples = NP.array( list(filter( lambda r : r[maxAttr] == a, Examples )))

        if len( leftExamples ) == 0 :
            # no examples with this value : fall back to the most common class here
            Tree[ a ] = AttributeMap[-1][ 'Reverse' ][ S.index( max(S) ) ]
            continue

        leftClass = set(leftExamples[:,-1])
        if len( leftClass ) == 1 :
            Tree[ a ] = leftClass.pop()
        else :
            Tree[ a ] = {}
            ID3( leftExamples, AttributeMap, leftAttr, Tree[a] )

    return
The result of this is like below, with some formatting.

>DA = DistinctAttributeMap( O )
>DT = {}
>ID3( O, DA, [0,1,2,3], DT )
>print( 'DecisionTree = ', DT )
DecisionTree = {
    'Best': 0,
    'Gain': 0.24674981977443911,
    'Rain': {
        'Best': 3,
        'Gain': 0.97095059445466858,
        'Weak': 'Yes',
        'Strong': 'No'},
    'Sunny': {
        'Best': 2,
        'Gain': 0.97095059445466858,
        'High': 'No',
        'Normal': 'Yes'},
    'Overcast': 'Yes',
}
I used a dictionary to represent the tree structure. I may try to show the result as a drawing, but that will be another blog page.
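To show how the dictionary tree can be used, here is a minimal classification sketch; Classify() is my own illustration and not part of the code above, and it simply follows the 'Best' attribute at each node until it reaches a class label string.

def Classify( row, Tree ) :
    # hypothetical helper : walk the dictionary tree until a class label is reached
    if 'Leaf' in Tree : return Tree['Leaf']
    if 'Probably' in Tree : return Tree['Probably']
    sub = Tree[ row[ Tree['Best'] ] ]
    return sub if isinstance( sub, str ) else Classify( row, sub )

print( Classify( ['Sunny', 'Cool', 'High', 'Weak'], DT ) )   # -> 'No'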
How about using the Gini index rather than information gain? Will it make any difference? (A small sketch follows after these questions.)
How do I make it noise resilient and prevent overfitting?
How do I let a user do fine tuning? For example, a decision tree with at most 3 levels of depth and at least 95% classification accuracy?
How about presenting the decision tree as a diagram and letting the user verify it? An interactive diagram that shows and updates the decision tree?
How about making this an API, so that a user provides data and retrieves a decision tree?
How about dynamically updating the decision tree at run time?
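Regarding the first question, a Gini counterpart of Entrophy() could look like the sketch below; this is my own illustration rather than code from the post, and it could be plugged into Gain() in place of Entrophy() to compare the resulting splits.

def Gini( S ) :
    # Gini impurity of a list of class counts, e.g. Gini( [9, 5] )
    N = sum( S )
    if N == 0.0 : return 0.0
    return 1.0 - sum( [ (Si/N)**2 for Si in S ] )

print( Gini( [9, 5] ) )   # -> 0.4591836734693877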

Reference:
[MLT] Tom Mitchell, Machine Learning, Chapter 3.
