essay on programming languages, computer science, information technologies and all.

Sunday, May 15, 2016

k-Nearest-Neighbor classifier and cross-validation

The next subject is the k-Nearest-Neighbor classifier. When it gets a new sample, it searches through the training examples to find the most similar one (or the k most similar ones) and predicts that the new sample belongs to the same class as those past examples. If you have a big set of training samples that covers most of the cases, it should find something similar.
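
For the k = 1 case the whole idea fits in a few lines of plain Python. This is only a sketch to show the concept - the names here are made up, and the rest of this post builds the real version with numpy and scipy.
import math

def ClassifyByNearest( training, x ) :
    # training: list of ( attribute_list, class ) pairs, x: attribute list of the new sample
    # return the class of the single closest training example ( k = 1 )
    def distance( a, b ) :
        return math.sqrt( sum( (ai - bi) ** 2 for ai, bi in zip( a, b ) ) )
    attributes, cls = min( training, key = lambda t : distance( t[0], x ) )
    return cls
>>> ClassifyByNearest( [ ([0, 0], 'up'), ([5, 5], 'down') ], [1, 1] )
'up'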

Compared with a decision tree, you may wonder about the pros and cons. Here is an analogy. When you drive to a city you have never visited before, a GPS navigator can guide you to your hotel. It tells you where the hotel is from its address. As long as you are going to some well-known hotel, the navigator is the first choice. Following the navigator is like using a decision tree: you just take a turn whenever the decision tree asks you to. But if you have the wrong street name, then no matter how good the building number is, you may end up at the wrong address. And if you have no address or name of the place, just a description, the navigator leads nowhere.

Let's say you want somewhere near a hilltop where you can watch the sunset, have a beer at a pub, and then walk to a cosy B&B. Then your best bet is to park the car, take a cab, and explain yourself. An experienced cab driver can take you to a place where all of the descriptions meet in the middle. A dumb navigator is no match here. Of course, you have to tip the cab driver dearly, as she has to go through every location in her memory and match it against your description.

The analogy is quite crude, but the point is that the kNN classifier is something to try when the classification has to weigh a wide range of choices with uncertainty everywhere.

There is an open database for this kind of test: the UCI Machine Learning Repository. Here I am using the Steel Plates Faults dataset.

First, the database is a tab-separated text file, which can be loaded as below.
def LoadTrainingExamples( file, separator ) :
    # read every line and convert the separated fields into a list of floats
    examples = []
    for line in open( file ) :
        examples.append( [ float(e) for e in line.rstrip().split(separator) ] )
    return examples
>>> LoadTrainingExamples( 'faultsshort.txt', '\t' )
...

When there are a large number of attributes - this database has 27 variables - we have to choose which variables to play with. The chosen variables are, broadly speaking, a model. The selection can come from an expert or from an extensive automatic search, like hill climbing. However the selection is made, it has to be verified that the chosen model and the training set are a good match. One verification method is cross-validation: hold out some of the examples, train the model on the remaining ones, and then classify the held-out examples with the trained model to see whether the predicted classification matches the recorded one.
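
As a minimal sketch of that idea - the HoldOutValidate name and the classify argument are hypothetical, just to show the shape of the procedure, and each row is assumed to keep its recorded class in the last column:
def HoldOutValidate( examples, classify, n_out ) :
    # hold out the first n_out rows, use the rest as training data,
    # then count how many held-out rows get a wrong classification
    held_out, training = examples[:n_out], examples[n_out:]
    errors = 0
    for row in held_out :
        attributes, recorded = row[:-1], row[-1]
        if classify( training, attributes ) != recorded :
            errors += 1
    return errors / float( n_out )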

So we need a function that chooses attributes to form a model. Here the examples come with each row as a record and each column as an attribute. The function below picks columns and forms a model.
import numpy as np

def PickColumns( A, I ) :
    # pick the columns listed in I and return them as a matrix, one column per picked index
    P = [ A[:,i] for i in I ]
    return np.matrix( P ).getT()
>>> PickColumns( np.array( [[1,2,3],[4,5,6] ] ), [0,2] )
matrix([[1, 3],
        [4, 6]])

Before training, each attribute should be scaled up or down so that each contributes to the result equally. Here, the normalization is done using the max and min of each attribute.
def Normalize( A, minmax ) :
    # rescale each column of A into [0,1]; row 0 of minmax holds the minimums, row 1 the maximums
    N = []
    for i in range(A.shape[1]) :
        minv = minmax[0,i]
        maxv = minmax[1,i]
        N.append( [ ((float(a)-minv)/(maxv-minv)) for a in A[:,i] ] )
    return np.matrix( N ).getT()
>>> Normalize( np.matrix('1,2;0,1'), np.matrix('0,0;1,2'))
matrix([[ 1. ,  1. ],
        [ 0. ,  0.5]])

Finding the k nearest neighbors is relatively simple, as we can use scipy's spatial.KDTree. A KDTree is like a binary tree over n-dimensional attributes and should answer a query in roughly O(log N) time. Here is the code snippet.
    KD = spatial.KDTree( A )
    distance, index = KD.query( X, k )
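
A self-contained toy run (the points below are made up purely for illustration) shows the shapes that come back from the query:
from scipy import spatial
import numpy as np

A = np.array( [ [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0] ] )  # four 2-D training points
KD = spatial.KDTree( A )
distance, index = KD.query( np.array( [[0.1, 0.2]] ), 3 )   # 3 nearest neighbors of one query point
print( index )      # row numbers in A of the neighbors, nearest first: [[0 1 2]]
print( distance )   # the matching Euclidean distances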

With the above KDTree, we can find the k nearest neighbors at index. Then, from those neighbors, we have to choose a single classification result. Here is a function that selects the majority classification among the neighbors by rounding the average of their 0/1 class labels.
def MaxArg( index, C ) :
    # average the neighbors' class labels and round: a majority vote for 0/1 classes
    if len(index.shape) == 1 :                    # k = 1 : index is a flat array
        S = sum( [ C[i] for i in index ] )
        NS = 1
    else :                                        # k > 1 : index holds one row of neighbor indices
        S = sum( [ C[i] for i in index[0] ] )
        NS = len( index[0] )
    return [ int(si/NS+0.5) for si in S ]
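
As a quick illustration with made-up numbers (the class column is laid out the same way PickColumns returns it):
>>> C = np.matrix( '1;1;0;0;1' )           # recorded classes of five training rows
>>> MaxArg( np.array( [[0, 1, 2]] ), C )   # neighbors 0,1,2 carry classes 1,1,0 -> majority is 1
[1]
>>> MaxArg( np.array( [3] ), C )           # k = 1 : the single neighbor decides
[0]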

Integrating all the code above, here is a script that does the cross-validation. Refer to [Moore and Lee] for LOOCV (Leave-One-Out Cross-Validation).

from scipy import spatial

Examples = np.array( LoadTrainingExamples( 'faults.txt', '\t' ) )
AI = range( 4,8 ); CI = range( 27,28 )  # the model uses columns 4,5,6,7 as attributes and column 27 as the classification
error_sum = 0

k = 1  # k can be 1 or 3 
E = PickColumns( Examples, AI );  
minmax = np.vstack( (E.min(0),E.max(0)) ); 
E = Normalize( E, minmax );
EC = PickColumns( Examples, CI )

# LOOCV ( Leave-One-Out Cross Validation )
for out_one in range(len(Examples)): 
    A = np.delete( E, (out_one), axis=0)    # training attribute
    C = np.delete( EC, (out_one), axis=0)   # training classification
 
    X = E[out_one];                         # validation point
    CX = EC[out_one];                       # recorded classification

    KD = spatial.KDTree( A )
    distance, index = KD.query( X, k )
    predict = MaxArg( index, C )
    error_sum += sum( [ abs(p-CX[0,i]) for i,p in enumerate(predict)])
    
    print( str.format( "Leave-One {0} is classifed by {1}{2} as {3} while sampled as {4}",
           out_one, index, distance, predict, CX[0] ))  

print( str.format( "Error = {0} ({1}/{2})",
    error_sum/(len(Examples)*len(CI)), error_sum, len(Examples)*len(CI)))  

Running the above script with k=1, the cross-validation error is 10.8%.
Leave-One 0 is classified by [1299][ 0.00052379] as [0] while sampled as [[ 1.]]
Leave-One 1 is classified by [115][ 0.00033883] as [1] while sampled as [[ 1.]]
...
Leave-One 1940 is classified by [225][ 0.00013041] as [0] while sampled as [[ 0.]]
Error = 0.10819165378670788 (210.0/1941)

With k=3, the error is slightly lower, at 8.6%.
Leave-One 0 is classified by [[1299 1314 1004]][[ 0.00052379  0.00063176  0.00063623]] as [0] while sampled as [[ 1.]]
Leave-One 1 is classified by [[ 115 1595 1554]][[ 0.00033883  0.00036469  0.00039562]] as [0] while sampled as [[ 1.]]
...
Leave-One 1940 is classified by [[ 225 1927  857]][[ 0.00013041  0.00022388  0.00024139]] as [0] while sampled as [[ 0.]]
Error = 0.0865533230293663 (168.0/1941)

The chosen model did not come from an expert, yet the result is relatively impressive to me; it is not easy to be correct in 9 out of 10 cases.
    There is a lot to explore from here.
  • Choose the model automatically using hill climbing and see whether there really is 'no free lunch'. My guess - AI = range( 4,8 ) - should be a really bad choice compared to an extensive and thorough search; see the sketch after this list.
  • Every validation builds a KDTree from an almost identical training set. If the previous KDTree could be reused - by removing and adding a single row - the speed could improve dramatically.
  • To speed up classification at run time, the training set needs to be trimmed by removing redundant examples. The 'Tomek link' test can be tried.
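
For the first item, here is a rough sketch of what such a search could look like. Everything in it is hypothetical: the error argument stands for the LOOCV loop above wrapped into a function that takes a list of attribute columns and returns the error rate.
def HillClimbModel( n_attributes, error ) :
    # greedy forward selection: keep adding the single column that lowers
    # the cross-validation error the most, stop when no column helps
    chosen, best = [], 1.0
    improved = True
    while improved :
        improved = False
        for i in range( n_attributes ) :
            if i in chosen :
                continue
            e = error( chosen + [i] )
            if e < best :
                best, best_column, improved = e, i, True
        if improved :
            chosen.append( best_column )
    return chosen, best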


    References
  • Miroslav Kubat, An Introduction to Machine Learning, p. 51
  • Andrew W. Moore and Mary S. Lee, Efficient Algorithms for Minimizing Cross Validation Error
  • Dataset provided by Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy. www.semeion.it
