essays on programming languages, computer science, information technologies and all.

Monday, July 25, 2016

Install SQLite on Windows for Python

Installing SQLite is rather straightforward, but it still isn't as simple as running an installer.

Python first. Download WinPython 3.4.4.2 and install it.

Add the following to the system PATH:


C:\WinPython-32bit-3.4.4.2\python-3.4.4;

Go to the SQLite download page and download sqlite-dll-win32-x86-3130000.zip. ( Or another version/architecture file )

Unzip the downloaded file to c:\sqlite, then add that path to the system PATH.

Python ships with an sqlite interface, so no further action is needed. You should be able to type the commands below and then find an example.db created in the working directory. Refer to sqlite3

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (In
tel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> conn = sqlite3.connect( 'example.db' )
>>> c = conn.cursor()
>>> conn.commit()
>>> conn.close()
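To go one step further, here is a minimal sketch of creating a table, inserting a row, and querying it back with the same sqlite3 module. The table name and values are made up for illustration, and an in-memory database is used so no file is left behind.

```python
import sqlite3

# ':memory:' keeps the example self-contained; use 'example.db' to persist to disk.
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE stocks (date text, symbol text, price real)')
c.execute("INSERT INTO stocks VALUES ('2016-07-25', 'ABC', 35.14)")
conn.commit()
row = c.execute('SELECT symbol, price FROM stocks').fetchone()
conn.close()
```

The same cursor pattern as the session above, just with an actual table to look at.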

Thursday, July 21, 2016

Run Apache web server on top of IIS7

I am paying around 15,000 won per month to run a Windows Server 2008 machine remotely. This server has a static IP address - uvans. I installed Redmine and also Bonobo Git Server on it.

In addition to IIS7 and these programs, I wanted to run Django. I tried running it on IIS7, but after all the effort I found that Django no longer supports FastCGI, so that is not a recommended approach. The supported way is to run it with the Apache web server.

Thankfully, the Apache server can run on a Windows machine. First download the Windows build from here. I unzipped httpd-2.4.23-win32.zip to c:\apache24 and installed vcredist_x86.exe.

Then modify C:\Apache24\conf\httpd.conf as below. This makes Apache respond on port 81. I have only one IP and no choice but to use another port number. Is 81 safe enough ?

Listen 0.0.0.0:81

ServerAdmin the.sungwoo@gmail.com
...
ServerName uvans.vps.phps.kr:81
...
AllowOverride All
...
DocumentRoot "c:/WebPages"
<Directory "c:/WebPages">

Then install and start the httpd service from the command line as below.
C:\Apache24\bin>httpd -t
Syntax OK

C:\Apache24\bin>httpd -k install
Installing the 'Apache2.4' service
The 'Apache2.4' service is successfully installed.
Testing httpd.conf....
Errors reported here must be corrected before the service can be started.

C:\Apache24\bin>httpd -k start

C:\Apache24\bin>

Check that the Apache service is running.
Then you can go to localhost:81 and see the index.html at c:\WebPages come up.
Also make sure the firewall is taken care of - allow Apache's httpd.exe.
Then the Apache-served web page can be reached from the outside world.

Wednesday, July 6, 2016

JsTestDriver for JavaScript unit test

When starting any serious work in a programming language, it is always a good idea to set up unit tests. Otherwise, you can't make progress without breaking previous work.

For JavaScript, there are a number of unit test frameworks, but I chose JsTestDriver because it can be integrated into the build process. I found a number of issues along the journey, though.

First, JsTestDriver can be downloaded from js-test-driver at Google Code. At first I tried 1.3.5, but later changed to 1.3.3d, as 1.3.5 has a bug with relative paths. Refer to the stackoverflow discussion

JsTestDriver needs a process running in the background that behaves like a server. On Windows, I wrote a batch script that can be double-clicked to launch the process.
@echo off
set path=%path%;C:\Program Files (x86)\Java\jre7\bin;
set SRC_BUILD=%~dp0
SET JSTESTDRIVER_HOME=%SRC_BUILD%\..\3rdParty\JsTestDriver
cmd /k java.exe -jar %JSTESTDRIVER_HOME%/JsTestDriver-1.3.5.jar --port 4224 --browser "C:\Program Files (x86)\Legpat\Application\chrome.exe"

When double-clicked, it shows a command window and Chrome opens a dedicated page. Both should be kept open for it to function correctly. Annoying as it is, it would be better to find a way to run the command without a window and to keep the page in a background Chrome tab. Let me know if you have an idea.
Then write jsTestDriver.conf like below. It should be located in a base directory containing a UnitTest folder, and that folder should hold the js files to test.
server: http://localhost:4224

load:
 - UnitTest/*.js
 
Then the unit test files can be executed as part of an MSBUILD process like below.
<?xml version="1.0" encoding="utf-8"?> 
<Project 
    ToolsVersion="4.0" 
    xmlns="http://schemas.microsoft.com/developer/msbuild/2003" 
    DefaultTarget="FullBuild">
...

    <Target Name="CoreBuild">    
        <Exec 
            Command="java -jar %JSTESTDRIVER_HOME%\JsTestDriver-1.3.5.jar --tests all --reset --verbose"
            WorkingDirectory="$(SRC_ROOT)\Diagram"
            CustomErrorRegularExpression="\[FAILED\]"
        />
    </Target>
...


Make sure you put the '--reset' option; otherwise you see an error magically disappear, only to come right back when the source file gets saved.

Here is an example of running the above build project as a batch script.

Looking closely at the window above, you can see js-test-driver logging like below. Note that it loaded '/test/UnitTest/Sample.js'. I didn't use a 'test' folder, but it somehow magically prefixed it. This gave me a lot of confusion the first time. In combination with '--reset', it dragged me down for quite a while.
  Chrome: Reset
  Chrome 50.25.2661.78 Windows loaded /test/UnitTest/Sample.js
...

Thursday, June 30, 2016

JointJS Diagram - Google Drive link

Monday, June 13, 2016

Handling continuous valued attribute by finding splitting threshold and its information gain for decision tree

A decision tree needs to compute the information gain of each attribute under a certain condition. When an attribute is discrete, each discrete value can be the decision; but when the attribute value is continuous, there can be too many conditions to consider. Say an attribute ranges from 0 to 10 - how do we choose a threshold ? A simple answer may be 5. But 5 is only good if the attribute values are uniformly distributed, or if the class labels happen to divide evenly at 5.

An algorithm is introduced in [IML]. First, it sorts the examples by attribute value, then finds every point where the class label changes and calculates the information gain there. The best gain then determines the binary split.

Here I will show an implementation of the algorithm in Python. Entrophy() and Gain() are the functions that calculate information gain, as below.
import math
import sys

def Entrophy( S ):
    N = sum( S ) if len(S) !=0 else sys.float_info.epsilon
    P = [Si/N for Si in S]
    return sum( [ (-pi * math.log(pi,2) if pi != 0.0 else 0.0) for pi in P] ) 

def Gain( S, A ):
    NS = sum( S )
    NSv = [ sum(Ai) for Ai in A ]
    return Entrophy(S) - sum( [ (NSv[i]/NS)*Entrophy(A[i]) for i in range(len(A))] ) 
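As a quick sanity check, the two functions reproduce the textbook numbers from [ML] P58 (9 positive and 5 negative examples, split 6/2 vs 3/3). This is a standalone sketch with the definitions repeated so it can run on its own.

```python
import math
import sys

def Entrophy( S ):
    N = sum( S ) if len(S) != 0 else sys.float_info.epsilon
    P = [Si/N for Si in S]
    return sum( (-pi * math.log(pi,2) if pi != 0.0 else 0.0) for pi in P )

def Gain( S, A ):
    NS = sum( S )
    return Entrophy(S) - sum( (sum(Ai)/NS)*Entrophy(Ai) for Ai in A )

e = Entrophy( [9,5] )             # entropy of a 9/5 label split, about 0.940
g = Gain( [9,5], [[6,2],[3,3]] )  # gain of splitting it into 6/2 and 3/3, about 0.048
```

These are the same values the unit tests at the end of this post assert.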


The decision tree algorithm looks for the best attribute to choose in the current state, which means calculating the information gain for each attribute. Usually the attribute values and class labels come in separate variables, hence the method definition below.
def SplitGains( A, C ):


The examples should be sorted by attribute value. Note that there can be examples with the same attribute value, or the same attribute value with different class labels. That is why I don't use a set or dict: those can silently throw away a duplicated key or overwrite a value with the last one. Instead, I use a plain sort with (attribute, label) as the key. This way the implementation copes well with the various cases.
    Example = collections.namedtuple( 'Example', ['att','label'] )
    E = [ Example(ai,ci) for ai,ci in zip(A,C) ]
    E.sort( key = lambda x : (x.att,x.label) )


Then we need the list of distinct labels - a perfect case for a set comprehension. Figuring out how many of each class label there are in the training examples is then just a matter of summing up, as below.
    L = set( [ ei.label for ei in E ] )
    S = [ sum( [ 1 for ei in E if ei.label == li ] ) for li in L ]


Then go through the examples ( E ) and find all the splitting thresholds where the class label changes.
    Threshold = collections.namedtuple( 'Threshold', ['att', 'label', 'begin', 'end'])

    Thresholds = []    
    label = E[0].label
    begin = 0
    for (i,ei) in enumerate(E):
        if ei.label == label: continue
        
        middle = (ei.att + E[i-1].att)/2.0 
        Thresholds.append( Threshold(middle, label, begin, i-1))
        
        label = ei.label
        begin = i



Then comes the actual counting of class labels to the left and right of each threshold. This code is not very efficient, as there are loops within loops - roughly O( len(Thresholds) x len(E) x len(L) ). The begin/end information in Threshold could be used here, but then the code would not be concise. I presumed len(Thresholds) and len(L) are reasonably small compared to len(E), so it ends up closer to O(len(E)).
    AttGains = []
    for ti in Thresholds:
        left = [ 
            sum( [ 1 for ei in E[:ti.end+1] if ei.label == li ] ) 
            for li in L ]
        right = [ 
            sum( [ 1 for ei in E[ti.end+1:] if ei.label == li ] ) 
            for li in L ]
        AttGains.append( (ti.att, Gain( S, [left,right]) ) )        

    return AttGains 


The whole method is below.
import collections

def SplitGains( A, C ):
    Example = collections.namedtuple( 'Example', ['att','label'] )
    E = [ Example(ai,ci) for ai,ci in zip(A,C) ]
    E.sort( key = lambda x : (x.att,x.label) )
    
    L = set( [ ei.label for ei in E ] )
    S = [ sum( [ 1 for ei in E if ei.label == li ] ) for li in L ]
    
    Threshold = collections.namedtuple( 'Threshold', ['att', 'label', 'begin', 'end'])

    Thresholds = []    
    label = E[0].label
    begin = 0
    for (i,ei) in enumerate(E):
        if ei.label == label: continue
        
        middle = (ei.att + E[i-1].att)/2.0 
        Thresholds.append( Threshold(middle, label, begin, i-1))
        
        label = ei.label
        begin = i

    AttGains = []
    for ti in Thresholds:
        left = [ 
            sum( [ 1 for ei in E[:ti.end+1] if ei.label == li ] ) 
            for li in L ]
        right = [ 
            sum( [ 1 for ei in E[ti.end+1:] if ei.label == li ] ) 
            for li in L ]
        AttGains.append( (ti.att, Gain( S, [left,right]) ) )        

    return AttGains       
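With SplitGains in hand, choosing the split is just a max() over the gains. Below is a self-contained usage sketch on the [IML] P125 example data from the unit tests; the definitions from this post are repeated so the snippet runs on its own.

```python
import collections
import math
import sys

def Entrophy( S ):
    N = sum( S ) if len(S) != 0 else sys.float_info.epsilon
    P = [Si/N for Si in S]
    return sum( (-pi * math.log(pi,2) if pi != 0.0 else 0.0) for pi in P )

def Gain( S, A ):
    NS = sum( S )
    return Entrophy(S) - sum( (sum(Ai)/NS)*Entrophy(Ai) for Ai in A )

def SplitGains( A, C ):
    Example = collections.namedtuple( 'Example', ['att','label'] )
    E = sorted( (Example(ai,ci) for ai,ci in zip(A,C)), key=lambda x: (x.att, x.label) )

    L = { ei.label for ei in E }                                   # distinct labels
    S = [ sum( 1 for ei in E if ei.label == li ) for li in L ]     # count per label

    Threshold = collections.namedtuple( 'Threshold', ['att','label','begin','end'] )
    Thresholds = []
    label, begin = E[0].label, 0
    for i, ei in enumerate( E ):
        if ei.label == label: continue
        Thresholds.append( Threshold( (ei.att + E[i-1].att)/2.0, label, begin, i-1 ) )
        label, begin = ei.label, i

    AttGains = []
    for ti in Thresholds:
        left  = [ sum( 1 for ei in E[:ti.end+1] if ei.label == li ) for li in L ]
        right = [ sum( 1 for ei in E[ti.end+1:] if ei.label == li ) for li in L ]
        AttGains.append( (ti.att, Gain( S, [left, right] )) )
    return AttGains

# [IML] P125 example: the label flips after attribute values 5, 10 and 12
gains = SplitGains(
    [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 ],
    [ 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,  1,  1,  0 ] )
best = max( gains, key=lambda tg: tg[1] )   # threshold with the highest gain
```

The best split here is the threshold 5.5, matching the first entry in test_SplitGains below.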


And the unit tests are as below.
class TestDecisionTreeID3(unittest.TestCase):

    # [ML] P58
    def test_EntrophySimple(self):
        self.assertAlmostEqual( Entrophy([9,5]), 0.94028595)
        
    def test_MultipleLabel(self):
        print( 'Entrophy([9,5,4])=', Entrophy([9,5,4]) )        
        print( 'Gain([9,5,4],[[6,2,3],[3,3,1]])=', Gain([9,5,4],[[6,2,3],[3,3,1]]) )        
        
    # [ML] P58
    def test_GainSimple(self):
        self.assertAlmostEqual( Gain( [9,5], [[6,2],[3,3]]), 0.04812703)

    def test_EntrophyKnown(self):
        self.assertEqual( Entrophy([100]), 0.0)
        self.assertEqual( Entrophy([]), 0.0)
        self.assertEqual( Entrophy([1,1]), 1.0)
        
    # [IML] P125
    def test_SplitGains(self):
        sg = SplitGains( 
            [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11,12,13],
            [ 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0 ])
        self.assertEqual( sg, 
            [ (5.5, 0.49647937549469),
              (10.5, 0.014582259610812609),
              (12.5, 0.09123321517616922) ])
   
    def test_SplitGainsAnyOrder(self):
        sg = SplitGains( 
            [ 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 11,12,13],
            [ 0, 0, 0, 0, 0,  1, 1, 1, 1, 1, 1, 1, 0 ])
        self.assertEqual( sg, 
            [ (5.5, 0.49647937549469),
              (10.5, 0.014582259610812609),
              (12.5, 0.09123321517616922) ])
        
    def test_SplitGainsSameAtt(self):
        sg = SplitGains( 
            [ 0, 0, 5, 10],
            [ 0, 0, 1, 1 ])
        self.assertEqual( sg, [(2.5, 1.0)])

    def test_SplitGainsSameAttDifferentLabel(self):
        sg = SplitGains( 
            [ 0, 0, 5, 10],
            [ 0, 1, 1, 1 ])
        self.assertEqual( sg, [(0.0, 0.8112781244591328)])
        
    def test_SplitGainsSameAttDifferentLabelMixed(self):
        sg = SplitGains( 
            [ 0, 0, 0, 5, 10],
            [ 0, 1, 0, 1, 1 ])
        self.assertEqual( sg, [(0.0, 0.9709505944546686)])
      


    References
  • [ML] : Machine Learning, Tom M. Mitchell
  • [IML] : An Introduction to Machine Learning, Miroslav Kubat

Saturday, May 28, 2016

Fluent for educated

Recently I came across a book named 'Fluent Python' and was inspired to make my Python code fluent and concise.

First, there is no need to read each line, parse it, and append to a list like below.
def LoadTrainingExamples( file, separator ) :
    examples = []
    for line in open( file ) :
        examples.append( [ float(e) for e in line.rstrip().split(separator) ] )
    return examples

>>> Examples = LoadTrainingExamples( 'faults.txt', '\t' )
You just need to call loadtxt() like below.
>>> Examples = np.loadtxt( 'faults.txt' )
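For instance, loadtxt handles the parsing in one call. A file path works the same as the in-memory stream used here for illustration (the numbers are made up).

```python
import io

import numpy as np

# Two rows, three tab-separated columns; np.loadtxt splits on any whitespace by default.
data = io.StringIO( "1.0\t2.0\t3.0\n4.0\t5.0\t6.0\n" )
Examples = np.loadtxt( data )
```

One call replaces the whole LoadTrainingExamples function above.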


Next, I used array indices as implicit indexes for storing min and max. It is an eyesore, but I have to show it again.
>>> minmax = np.vstack( (A.min(0),A.max(0)) ); 
>>> A = Normalize( A, minmax,[-1,1] );
There is 'namedtuple' to make a C-struct-like class in one line.
>>> import collections
>>> Limit = collections.namedtuple( 'Limit', ['Upper','Lower']) 
>>> limits = [Limit(min(aj),max(aj)) for c in A.T]   # A.T because min, max across column vector. Also named aj instead of ai 
>>> A = Normalize( A, limits, Limit(-1,1)) 


Then the Normalize function, which is another pain to read.
def Normalize( A, minmax, interval ) :
    N = []
    scale = math.fabs( interval[1] - interval[0] )
    offset = interval[0]
    for i in range(A.shape[1]) :
        minv = minmax[0,i]
        maxv = minmax[1,i]
        N.append( [ (((float(a)-minv)/(maxv-minv))*scale + offset)
                    for a in A[:,i] ] )
    return np.matrix( N ).getT()
With the 'Limit' namedtuple and numpy.matrix's operator overloading of + and *, the code can be written as below.
def Normalize( A, Limits, interval ) :
    N = np.matrix([ (aj-li.Lower)/(li.Upper-li.Lower) 
        for aj,li in zip(A.T, Limits) ])                         # convert to [0:1] 
    N = N * (interval.Upper - interval.Lower) + interval.Lower   # scale and offset
    return N.getT() 
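The same idea also reads fluently with plain ndarrays, which numpy recommends over np.matrix for new code. This is a rough equivalent sketch rather than the exact function above; the sample matrix and the NormalizeArray name are made up for illustration.

```python
import collections

import numpy as np

Limit = collections.namedtuple( 'Limit', ['Lower','Upper'] )

def NormalizeArray( A, Limits, interval ):
    N = np.array( [ (aj - li.Lower)/(li.Upper - li.Lower)              # map to [0:1]
                    for aj, li in zip( A.T, Limits ) ] )
    return (N * (interval.Upper - interval.Lower) + interval.Lower).T  # scale and offset

A = np.array( [ [1.0, 10.0], [2.0, 20.0], [3.0, 30.0] ] )
limits = [ Limit( aj.min(), aj.max() ) for aj in A.T ]
N = NormalizeArray( A, limits, Limit(-1.0, 1.0) )
```

Each column ends up spanning exactly [-1, 1], with no .getT() needed.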

    References
  • Fluent Python, Luciano Ramalho

Wednesday, May 25, 2016

Artificial Neural Network with two layers and weight update

ANN is a popular machine learning subject. Here I describe a basic two-layer neural network with backpropagation. It is the simplest form of neural network that can update itself to reduce classification error.
First, the input ranges from -1 to 1 instead of 0 to 1. Below is the code that normalizes input values to this range.
def Normalize( A, minmax, interval ) :
    N = []
    scale = math.fabs( interval[1] - interval[0] )
    offset = interval[0]
    for i in range(A.shape[1]) :
        minv = minmax[0,i]
        maxv = minmax[1,i]
        N.append( [ (((float(a)-minv)/(maxv-minv))*scale + offset)
                    for a in A[:,i] ] )
    return np.matrix( N ).getT()
Forward propagation with sigmoid neurons can be coded as below. Here X is the input vector, HN is the weight matrix of the hidden neurons - the neurons after the input - and ON is the weight matrix of the output neurons.
def ForwardPropagation( X, HN, ON ) :
    Z = np.dot( HN, X )
    H = [ (1/(1+math.exp(-zi))) for zi in Z ]
    Z = np.dot( ON, H )
    O = [ (1/(1+math.exp(-zi))) for zi in Z ]
    return H, O
Backward propagation can be expressed in matrix form as below. H and O are the outputs from ForwardPropagation above, T is the observed output, and nu is the training rate.
def BackwardPropagation( X, HN, ON, H, O, T, nu ) :
    D1 = [ yi*(1-yi)*(ti-yi) for yi,ti in zip(O,T) ]
    D2 = [ hi*(1-hi)*pi for hi, pi in zip(H, np.dot(ON.T,D1))]
    ON += nu * np.dot(np.matrix(D1).T, np.matrix(H))
    HN += nu * np.dot(np.matrix(D2).T, np.matrix(X).T )
    return HN, ON
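To see the two steps reduce error in practice, here is a standalone sketch using a plain-ndarray variant (1-D vectors and np.outer instead of the matrix juggling above) trained on a tiny made-up AND-like dataset. The shapes, data, seed, and learning rate are all illustrative assumptions, not values from this post.

```python
import numpy as np

def sigmoid( z ):
    return 1.0/(1.0 + np.exp(-z))

def Forward( X, HN, ON ):
    H = sigmoid( HN @ X )            # hidden layer activations
    O = sigmoid( ON @ H )            # output layer activations
    return H, O

def Backward( X, HN, ON, H, O, T, nu ):
    D1 = O*(1 - O)*(T - O)           # output deltas
    D2 = H*(1 - H)*(ON.T @ D1)       # hidden deltas, error propagated back
    ON += nu * np.outer( D1, H )
    HN += nu * np.outer( D2, X )
    return HN, ON

rng = np.random.RandomState( 0 )
X = np.array( [ [0.,0.], [0.,1.], [1.,0.], [1.,1.] ] )   # inputs
T = np.array( [ [0.], [0.], [0.], [1.] ] )               # AND-like targets
HN = rng.uniform( -0.5, 0.5, (2,2) )                     # small random initial weights
ON = rng.uniform( -0.5, 0.5, (1,2) )

def mse( X, T, HN, ON ):
    return np.mean( [ (Forward(xi, HN, ON)[1] - ti)**2 for xi, ti in zip(X, T) ] )

before = mse( X, T, HN, ON )
for epoch in range( 2000 ):
    for xi, ti in zip( X, T ):
        H, O = Forward( xi, HN, ON )
        HN, ON = Backward( xi, HN, ON, H, O, ti, 0.5 )
after = mse( X, T, HN, ON )
```

After the epochs the mean squared error drops below its initial value, which is the whole point of the weight update.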
With the steel plates dataset - as in the previous post - below is the code to run an epoch ( running through all examples to train ).
Examples = np.array(LoadTrainingExamples( 'faults.txt', '\t' ))
Examples = Examples[ range(0,346) ]  # 0-345 rows is for classification 27,28 column
A = Examples[:, range(4,27)]  # attributes - exclude 0-3 column as it is x,y position
minmax = np.vstack( (A.min(0),A.max(0)) );
A = Normalize( A, minmax, [-1,1] )
N = A.shape[1]                # number of attributes - input vector length
C = Examples[:, [27,28] ]
M = C.shape[1]                # number of output - output vector length
HN = 0.01 * np.ones( [N,N] )  # 0.01 for initial weight
ON = 0.01 * np.ones( [M,N] ) 

for Ai,Ci in zip( A,C)  :
    H, O = ForwardPropagation( Ai.T, HN, ON )
    HN, ON = BackwardPropagation( Ai.T, HN, ON, H, O, Ci, 0.1)             

When running the above for 200 epochs, the error rate goes below 1%. Below is a graph showing how the error rate changes as the ANN gets trained.


When running with all 7 attributes, though, it doesn't get trained well, as the graph below shows.


    References
  • An Introduction to Machine Learning, Miroslav Kubat. Chapter 5
  • dataset provided by Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy. www.semeion.it