Essays on programming languages, computer science, information technologies, and more.
Monday, June 13, 2016
Handling continuous valued attribute by finding splitting threshold and its information gain for decision tree
A decision tree needs to compute the information gain of each attribute under some splitting condition. When an attribute takes discrete values, each value can be a branch of the decision, but when the attribute is continuous there are too many candidate conditions to consider. Say the attribute ranges from 0 to 10; how should a threshold be chosen? A simple answer may be 5, but 5 is only a good choice if the attribute values are uniformly distributed, or if the class label happens to be evenly divided at 5.
An algorithm for this is introduced in [IML]. First, it sorts the examples by attribute value, then it finds every point where the class label changes and calculates the information gain of splitting there. The threshold with the best gain can then be used as a binary split.
Here I will show an implementation of the algorithm in Python. Entrophy() and Gain() are the helper functions that compute entropy and information gain, as below.
import sys
import math

def Entrophy( S ):
    # S is a list of class counts; epsilon avoids division by zero for an empty list
    N = sum( S ) if len(S) != 0 else sys.float_info.epsilon
    P = [ Si/N for Si in S ]
    return sum( [ (-pi * math.log(pi, 2) if pi != 0.0 else 0.0) for pi in P ] )

def Gain( S, A ):
    # S : class counts before the split, A : class counts of each partition after the split
    NS = sum( S )
    NSv = [ sum(Ai) for Ai in A ]
    return Entrophy(S) - sum( [ (NSv[i]/NS) * Entrophy(A[i]) for i in range(len(A)) ] )
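As a quick sanity check, these reproduce the textbook numbers that the unit tests at the end of this post verify.
# 9 positive and 5 negative examples ([ML] P58)
print( Entrophy([9,5]) )              # ~0.94028595
print( Gain([9,5], [[6,2],[3,3]]) )   # ~0.04812703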
The decision tree algorithm looks for the best attribute to split on at the current node, which means computing the information gain of every attribute. The attribute values and the class labels therefore usually arrive in separate variables, hence the method definition below.
def SplitGains( A, C ):
    # A : list of attribute values, C : list of matching class labels
The examples should be sorted by attribute value. Note that there can be examples with the same attribute value, or the same attribute value with different class labels. That is why I don't use a set or a dict here: they would silently drop duplicate keys or overwrite a value with the last one (see the small example after the next snippet). Instead, I use a plain sort with (attribute, label) as the key, so the implementation copes well with these cases.
Example = collections.namedtuple( 'Example', ['att','label'] )
E = [ Example(ai,ci) for ai,ci in zip(A,C) ]
E.sort( key = lambda x : (x.att,x.label) )
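To see why a dict would not work here, consider the duplicated attribute value taken from one of the unit tests below: building a dict from it silently keeps only the last label for the duplicate key.
# Data from test_SplitGainsSameAttDifferentLabel: attribute 0 appears twice with labels 0 and 1
print( dict( zip([0, 0, 5, 10], [0, 1, 1, 1]) ) )   # {0: 1, 5: 1, 10: 1} -- one example lost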
Next, we need the list of distinct labels, which is a perfect case for a set comprehension. Then we count how many examples of each class label there are in the training set, just summing them up as below.
L = { ei.label for ei in E }
S = [ sum( [ 1 for ei in E if ei.label == li ] ) for li in L ]
Then we go through the examples (E) and find every splitting threshold where the class label changes; each threshold is the midpoint between the two neighbouring attribute values.
Threshold = collections.namedtuple( 'Threshold', ['att', 'label', 'begin', 'end'] )
Thresholds = []
label = E[0].label
begin = 0
for (i, ei) in enumerate(E):
    if ei.label == label: continue
    # the label changed: the candidate threshold is the midpoint of the two neighbouring values
    middle = (ei.att + E[i-1].att)/2.0
    Thresholds.append( Threshold(middle, label, begin, i-1) )
    label = ei.label
    begin = i
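For example, with the attribute and label sequence used in test_SplitGains below, the loop finds three candidate thresholds.
# att   : 1 2 3 4 5 | 6 7 8 9 10 | 11 12 | 13
# label : 1 1 1 1 1 | 0 0 0 0 0  |  1  1 |  0
# The label changes three times, so the candidate thresholds are the
# midpoints 5.5, 10.5 and 12.5.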
Then comes the actual counting of class labels on the left and right of each threshold. This code is not very efficient, since there are loops within loops: roughly O(len(Thresholds) x len(E) x len(L)). The begin/end information stored in Threshold could be used to avoid rescanning E, but then the code would not be as concise (a rough sketch of that variant follows the snippet below). I assume len(Thresholds) and len(L) are small compared to len(E), so in practice it behaves more like O(len(E)).
AttGains = []
for ti in Thresholds:
    # count each class label on the left and on the right of the threshold
    left = [
        sum( [ 1 for ei in E[:ti.end+1] if ei.label == li ] )
        for li in L ]
    right = [
        sum( [ 1 for ei in E[ti.end+1:] if ei.label == li ] )
        for li in L ]
    AttGains.append( (ti.att, Gain( S, [left, right] )) )
return AttGains
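For reference, here is a rough sketch, not part of the original code, of how the begin/end fields in Threshold could be used to keep running counts in a single pass instead of rescanning E for every threshold. The names Lidx, running and t are purely illustrative, and it assumes E, L, S and Thresholds were built exactly as above inside SplitGains.
Lidx = { li : k for k, li in enumerate(L) }   # label -> position in the count lists
running = [0] * len(L)                        # counts of each label seen so far
AttGains = []
t = 0                                         # index of the next threshold to emit
for i, ei in enumerate(E):
    running[ Lidx[ei.label] ] += 1
    if t < len(Thresholds) and i == Thresholds[t].end:
        left = list( running )
        right = [ S[k] - left[k] for k in range(len(L)) ]
        AttGains.append( (Thresholds[t].att, Gain( S, [left, right] )) )
        t += 1
This keeps the counting at roughly O(len(E) + len(Thresholds) x len(L)), at the cost of a longer, less readable loop.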
The whole method is shown below.
import collections

def SplitGains( A, C ):
    # A : list of attribute values, C : list of matching class labels
    Example = collections.namedtuple( 'Example', ['att','label'] )
    E = [ Example(ai,ci) for ai,ci in zip(A,C) ]
    E.sort( key = lambda x : (x.att, x.label) )
    # distinct labels and the count of each label over all examples
    L = { ei.label for ei in E }
    S = [ sum( [ 1 for ei in E if ei.label == li ] ) for li in L ]
    # candidate thresholds: midpoints where the class label changes
    Threshold = collections.namedtuple( 'Threshold', ['att', 'label', 'begin', 'end'] )
    Thresholds = []
    label = E[0].label
    begin = 0
    for (i, ei) in enumerate(E):
        if ei.label == label: continue
        middle = (ei.att + E[i-1].att)/2.0
        Thresholds.append( Threshold(middle, label, begin, i-1) )
        label = ei.label
        begin = i
    # information gain of splitting at each candidate threshold
    AttGains = []
    for ti in Thresholds:
        left = [
            sum( [ 1 for ei in E[:ti.end+1] if ei.label == li ] )
            for li in L ]
        right = [
            sum( [ 1 for ei in E[ti.end+1:] if ei.label == li ] )
            for li in L ]
        AttGains.append( (ti.att, Gain( S, [left, right] )) )
    return AttGains
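A caller would then pick the threshold with the largest gain as the binary split for this attribute. A minimal usage sketch, using the same data as the first unit test below:
gains = SplitGains(
    [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 ],
    [ 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,  1,  1,  0 ] )
best = max( gains, key = lambda g : g[1] )
print( best )   # (5.5, 0.49647937549469) -- the most informative split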
And the unit tests are as below.
import unittest

class TestDecisionTreeID3(unittest.TestCase):

    # [ML] P58
    def test_EntrophySimple(self):
        self.assertAlmostEqual( Entrophy([9,5]), 0.94028595 )

    def test_MultipleLabel(self):
        print( 'Entrophy([9,5,4])=', Entrophy([9,5,4]) )
        print( 'Gain([9,5,4],[[6,2,3],[3,3,1]])=', Gain([9,5,4],[[6,2,3],[3,3,1]]) )

    # [ML] P58
    def test_GainSimple(self):
        self.assertAlmostEqual( Gain( [9,5], [[6,2],[3,3]] ), 0.04812703 )

    def test_EntrophyKnown(self):
        self.assertEqual( Entrophy([100]), 0.0 )
        self.assertEqual( Entrophy([]), 0.0 )
        self.assertEqual( Entrophy([1,1]), 1.0 )

    # [IML] P125
    def test_SplitGains(self):
        sg = SplitGains(
            [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 ],
            [ 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,  1,  1,  0 ] )
        self.assertEqual( sg,
            [ (5.5, 0.49647937549469),
              (10.5, 0.014582259610812609),
              (12.5, 0.09123321517616922) ] )

    def test_SplitGainsAnyOrder(self):
        sg = SplitGains(
            [ 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 11, 12, 13 ],
            [ 0, 0, 0, 0, 0,  1, 1, 1, 1, 1, 1,  1,  0 ] )
        self.assertEqual( sg,
            [ (5.5, 0.49647937549469),
              (10.5, 0.014582259610812609),
              (12.5, 0.09123321517616922) ] )

    def test_SplitGainsSameAtt(self):
        sg = SplitGains(
            [ 0, 0, 5, 10 ],
            [ 0, 0, 1, 1 ] )
        self.assertEqual( sg, [ (2.5, 1.0) ] )

    def test_SplitGainsSameAttDifferentLabel(self):
        sg = SplitGains(
            [ 0, 0, 5, 10 ],
            [ 0, 1, 1, 1 ] )
        self.assertEqual( sg, [ (0.0, 0.8112781244591328) ] )

    def test_SplitGainsSameAttDifferentLabelMixed(self):
        sg = SplitGains(
            [ 0, 0, 0, 5, 10 ],
            [ 0, 1, 0, 1, 1 ] )
        self.assertEqual( sg, [ (0.0, 0.9709505944546686) ] )
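To run the tests as a script, the usual unittest entry point can be added at the bottom of the file (assuming the functions and the test class live in the same module):
if __name__ == '__main__':
    unittest.main()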
References
- [ML] : Machine Learning, Tom M. Mitchell
- [IML] : An Introduction to Machine Learning, Miroslav Kubat