essay on programming languages, computer science, information technologies and all.

Thursday, July 28, 2016

Following Django tutorial on Windows using Apache server

In this post, I will try to follow the Django 1.9 tutorial on Windows Server 2008 32-bit with Apache 2.4. The Python version is 3.4.4.

Get updated pip


C:\download>python get-pip.py
Collecting pip
  Downloading pip-8.1.2-py2.py3-none-any.whl (1.2MB)
    100% |################################| 1.2MB 177kB/s
Installing collected packages: pip
  Found existing installation: pip 8.1.1
    Uninstalling pip-8.1.1:
      Successfully uninstalled pip-8.1.1
Successfully installed pip-8.1.2

Install virtualenv - though I didn't use it here and it isn't strictly necessary

C:\download>python -m pip install virtualenv
Collecting virtualenv
  Downloading virtualenv-15.0.2-py2.py3-none-any.whl (1.8MB)
    100% |################################| 1.8MB 84kB/s
Installing collected packages: virtualenv
Successfully installed virtualenv-15.0.2

Install Django

C:\download>python -m pip install Django
Collecting Django
  Downloading Django-1.9.7-py2.py3-none-any.whl (6.6MB)
    100% |################################| 6.6MB 27kB/s
Installing collected packages: Django
Successfully installed Django-1.9.7
Check the Django version
>>> import django
>>> print(django.get_version())
1.9.7
>>> 

Run django-admin to create the project


C:\>python C:\WinPython-32bit-3.4.4.2\python-3.4.4\Lib\site-packages\django\bin\django-admin.py startproject mysite
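For reference, the startproject command lays out a small skeleton roughly like below ( as it appeared on my machine ).

C:\mysite
    manage.py
    mysite\
        __init__.py
        settings.py
        urls.py
        wsgi.py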

Modify C:\mysite\mysite\wsgi.py.


import os, sys
from django.core.wsgi import get_wsgi_application
sys.path.append('C:/mysite')   # forward slashes avoid backslash-escape surprises
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")
application = get_wsgi_application()
Modify C:\Apache24\conf\httpd.conf.

WSGIScriptAlias /mysite "C:/mysite/mysite/wsgi.py"
WSGIPythonPath "C:/mysite"
<Directory "C:/mysite/mysite">
<Files wsgi.py>
Require all granted
</Files>
</Directory>
You should see the page below when accessing mysite. Refer to the previous post about connecting on port 81.
Start the polls app.
C:\mysite>python manage.py startapp polls
Modify C:\mysite\polls\views.py.
from django.shortcuts import render
from django.http import HttpResponse

def index(request):
    return HttpResponse("Hello, world. You're at the polls index.")
Create C:\mysite\polls\urls.py.
from django.conf.urls import url
from . import views
urlpatterns = [
    url(r'^$', views.index, name='index'),
]    
Modify C:\mysite\mysite\urls.py.
from django.conf.urls import include, url
from django.contrib import admin
urlpatterns = [
    url(r'^polls/', include('polls.urls')),
    url(r'^admin/', admin.site.urls),
]
Now you should see the polls index page.
Run migrate
C:\mysite>python manage.py migrate
Operations to perform:
  Apply all migrations: sessions, admin, auth, contenttypes
Running migrations:
  Rendering model states... DONE
  Applying contenttypes.0001_initial... OK
  Applying auth.0001_initial... OK
...
Modify C:\mysite\polls\models.py.
from django.db import models

class Question(models.Model):
    question_text = models.CharField(max_length=200)
    pub_date = models.DateTimeField('date published')

class Choice(models.Model):
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    choice_text = models.CharField(max_length=200)
    votes = models.IntegerField(default=0)
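The official tutorial also adds __str__() methods at this point so that objects display readably in the admin and the shell; I skipped them in my steps, but they would look roughly like below.
class Question(models.Model):
    ...
    def __str__(self):
        return self.question_text

class Choice(models.Model):
    ...
    def __str__(self):
        return self.choice_text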
Modify C:\mysite\mysite\settings.py.
INSTALLED_APPS = [
    'polls.apps.PollsConfig',
    'django.contrib.admin',
...
]    
Run the following commands to generate migrations and prepare the database.
C:\mysite>python manage.py makemigrations polls
Migrations for 'polls':
  0001_initial.py:
    - Create model Choice
    - Create model Question
    - Add field question to choice

C:\mysite>python manage.py sqlmigrate polls 0001
BEGIN;
--
-- Create model Choice
--
CREATE TABLE "polls_choice" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT, "c
hoice_text" varchar(200) NOT NULL, "votes" integer NOT NULL);
....

COMMIT;

C:\mysite>python manage.py migrate
Operations to perform:
  Apply all migrations: polls, admin, contenttypes, auth, sessions
Running migrations:
  Rendering model states... DONE
  Applying polls.0001_initial... OK
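As a quick sanity check - this is my own extra step, not one from the sequence above - the new model can be tried from the Django shell.
C:\mysite>python manage.py shell
>>> from django.utils import timezone
>>> from polls.models import Question
>>> q = Question(question_text="What's new?", pub_date=timezone.now())
>>> q.save()
>>> Question.objects.count()
1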
Create an admin account
C:\mysite>python manage.py createsuperuser
Username (leave blank to use 'administrator'): sungwoo
Email address: the.sungwoo@gmail.com
Password:
Password (again):
Superuser created successfully.
Now you should see the admin page, but the static files - CSS and JavaScript - are not yet in place.
Modify C:\mysite\mysite\settings.py.
STATIC_URL = '/static/'
STATIC_ROOT = "c:/mysite/static"
Run collectstatic
C:\mysite>python manage.py collectstatic

You have requested to collect static files at the destination
location as specified in your settings:

    c:\mysite\static

This will overwrite existing files!
Are you sure you want to do this?

Type 'yes' to continue, or 'no' to cancel: yes
Copying 'C:\WinPython-32bit-3.4.4.2\python-3.4.4\lib\site-packages\django\contri
b\admin\static\admin\css\base.css'
...
Copying 'C:\WinPython-32bit-3.4.4.2\python-3.4.4\lib\site-packages\django\contri
b\admin\static\admin\js\vendor\xregexp\xregexp.min.js'

56 static files copied to 'c:\mysite\static'.
Modify C:\Apache24\conf\httpd.conf.
Alias /media/ "C:/mysite/media/"
Alias /static/ "C:/mysite/static/"

<Directory "C:/mysite/static">
Require all granted
</Directory>

<Directory "C:/mysite/media">
Require all granted
</Directory>

WSGIScriptAlias /mysite "C:/mysite/mysite/wsgi.py"
WSGIPythonPath "C:/mysite"
<Directory "C:/mysite/mysite">
<Files wsgi.py>
Require all granted
</Files>
</Directory>
Then the nicely styled admin page should come up.

Monday, July 25, 2016

Install mod_wsgi on Windows

There are a number of ways to make the Apache web server serve Python code: mod_python, mod_wsgi and FastCGI. I couldn't tame mod_python or FastCGI on Windows Server 2008, but mod_wsgi can be configured to work, and here is the installation procedure.

First, get the mod_wsgi pre-built binary from 'Unofficial Windows Binaries for Python Extension Packages'. You need to select the compiler, Python version and architecture. I installed Apache from Apache Lounge built with VC10, so I chose mod_wsgi-4.4.23+ap24vc10-cp34-cp34m-win32.whl.

Copy the file to the c:\download folder and install it using pip.

C:\minute\orange>python -m pip install "c:\download\mod_wsgi-4.4.23+ap24vc10-cp3
4-cp34m-win32.whl"
Processing c:\download\mod_wsgi-4.4.23+ap24vc10-cp34-cp34m-win32.whl
Installing collected packages: mod-wsgi
Successfully installed mod-wsgi-4.4.23+ap24vc10
C:\minute\orange>    
The installed 'mod_wsgi.so' file is located at 'C:\WinPython-32bit-3.4.4.2\python-3.4.4\mod_wsgi.so' and I copied it to 'C:\Apache24\modules'.

Then open 'C:\Apache24\conf\httpd.conf' and edit the content as below.

LoadModule wsgi_module modules/mod_wsgi.so
...
WSGIScriptAlias /wsgi "C:/WebPages/wsgi/wsgi_app.py"
<Directory "C:/WebPages/wsgi">
AllowOverride None
Options None
Order deny,allow
Allow from all
</Directory>

Now restart the Apache service using the Windows Services console.
If the service can't be restarted, check the error at C:\Apache24\logs\error.log. If the error looks like the one below, complaining about encodings, it is because PYTHONPATH is not set.
Fatal Python error: Py_Initialize: unable to load the file system codec
ImportError: No module named 'encodings'
Add an environment variable 'PYTHONPATH' like below.
At this point, you should be able to start the Apache service. Then write the first Hello-World Python code that will be served by Apache, at 'C:\WebPages\wsgi\wsgi_app.py'. This code originates from the Brugbart blog, modified to work based on Sumit's question at Stack Overflow.
def application(environ, start_response):
    status = '200 OK'
    output = b'Hello World!'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)

    return [output]
Then finally, a tiny Python step into the World Wide Web.

Install SQLite on Windows for Python

Installation of SQLite is rather straightforward, but it still isn't like using an installer.

Python first. Download WinPython 3.4.4.2 and install it.

Add the following to the system PATH:


C:\WinPython-32bit-3.4.4.2\python-3.4.4;

Go to the SQLite download page and download sqlite-dll-win32-x86-3130000.zip ( or another version/architecture ).

Unzip the downloaded file to c:\sqlite, then add that path to the system PATH.

Python comes with an sqlite interface, so no further action is needed. You should be able to type the lines below and then find an example.db created in the working directory. Refer to the sqlite3 module documentation.

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (In
tel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> conn = sqlite3.connect( 'example.db' )
>>> c = conn.cursor()
>>> conn.commit()
>>> conn.close()
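Going one small step further - a sketch of my own, with made-up table and column names - a table can be created, filled and queried in the same session.
>>> import sqlite3
>>> conn = sqlite3.connect( 'example.db' )
>>> c = conn.cursor()
>>> c.execute( "CREATE TABLE IF NOT EXISTS stocks (date text, symbol text, price real)" )
>>> c.execute( "INSERT INTO stocks VALUES ('2016-07-21', 'ABC', 35.14)" )
>>> conn.commit()
>>> c.execute( "SELECT * FROM stocks" ).fetchall()
[('2016-07-21', 'ABC', 35.14)]
>>> conn.close()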

Thursday, July 21, 2016

Run Apache web server on top of IIS7

I am paying around 15,000 won per month to run a Windows Server 2008 machine remotely. This server has a static IP address - uvans. I installed Redmine and also the Bonobo Git server.

In addition to IIS7 and these programs, I want to run Django. I tried to run it on IIS7, but after all the effort I found that Django no longer supports FastCGI, so that is not a recommended approach. The supported way is to run it with the Apache web server.

Thankfully, the Apache server can run on a Windows machine. First download the Windows build from here. I unzipped httpd-2.4.23-win32.zip to c:\apache24 and installed vcredist_x86.exe.

Then modify C:\Apache24\conf\httpd.conf as below. This makes Apache respond on port 81. I have only one IP address and no choice but to use another port number. Is 81 safe enough?

Listen 0.0.0.0:81

ServerAdmin the.sungwoo@gmail.com
...
ServerName uvans.vps.phps.kr:81
...
AllowOverride All
...
DocumentRoot "c:/WebPages"
<Directory "c:/WebPages">

Then install and start the httpd service from the command line as below.
C:\Apache24\bin>httpd -t
Syntax OK

C:\Apache24\bin>httpd -k install
Installing the 'Apache2.4' service
The 'Apache2.4' service is successfully installed.
Testing httpd.conf....
Errors reported here must be corrected before the service can be started.

C:\Apache24\bin>httpd -k start

C:\Apache24\bin>

Check that the Apache web service is running.
Then you can go to localhost:81 and see the index.html at c:\WebPages come up.
Also make sure the firewall is taken care of - allow Apache httpd.exe.
Then the Apache-served web page can be reached from the outside world.

Wednesday, July 6, 2016

JsTestDriver for JavaScript unit test

When starting any serious work in a programming language, it is always a good idea to set up unit tests. Otherwise, you can't make progress without breaking previous work.

For JavaScript, there are a number of unit test frameworks, but I chose JsTestDriver because it can be integrated into the build process. I found a number of issues along the way, though.

First, JsTestDriver can be downloaded from js-test-driver at Google Code. At first I tried 1.3.5 but later changed to 1.3.3d, as 1.3.5 has a bug with relative paths. Refer to the Stack Overflow discussion.

JsTestDriver needs a process running in the background that behaves like a server. On Windows, I wrote a batch script that can be double-clicked to launch the process.
@echo off
set path=%path%;C:\Program Files (x86)\Java\jre7\bin;
set SRC_BUILD=%~dp0
SET JSTESTDRIVER_HOME=%SRC_BUILD%\..\3rdParty\JsTestDriver
cmd /k java.exe -jar %JSTESTDRIVER_HOME%/JsTestDriver-1.3.5.jar --port 4224 --browser "C:\Program Files (x86)\Legpat\Application\chrome.exe"

When double-clicked, it will show a command window and Chrome will open a dedicated page. These two must stay open for things to function correctly. Annoying as it is, it would be better to find a way to run the command line without a window and run the page as a background page of Chrome. Let me know if you have an idea.
Then create jsTestDriver.conf like below. It should be located in a base directory containing the UnitTest folder, and that folder should hold the js files to test.
server: http://localhost:4224

load:
 - UnitTest/*.js
 
Then the unit test files can be executed as part of the MSBuild process like below.
<?xml version="1.0" encoding="utf-8"?> 
<Project 
    ToolsVersion="4.0" 
    xmlns="http://schemas.microsoft.com/developer/msbuild/2003" 
    DefaultTargets="FullBuild">
...

    <Target Name="CoreBuild">    
        <Exec 
            Command="java -jar %JSTESTDRIVER_HOME%\JsTestDriver-1.3.5.jar --tests all --reset --verbose"
            WorkingDirectory="$(SRC_ROOT)\Diagram"
            CustomErrorRegularExpression="\[FAILED\]"
        />
    </Target>
...


Make sure you add the '--reset' option; otherwise you will see an error magically disappear but come right back when the source file is saved.

Here is an example of running the above build project as a batch script.

Looking closely at the window above, you can see js-test-driver logging like below. Note that it loaded '/test/UnitTest/Sample.js'. I didn't use a 'test' folder, but it somehow magically prefixed it. This caused me a lot of confusion at first; combined with '--reset', it dragged me down for quite a while.
  Chrome: Reset
  Chrome 50.25.2661.78 Windows loaded /test/UnitTest/Sample.js
...

Thursday, June 30, 2016

JointJS Diagram - Google Drive link

Monday, June 13, 2016

Handling a continuous-valued attribute by finding a splitting threshold and its information gain for a decision tree

A decision tree needs to find the information gain of each attribute under a certain condition. When an attribute takes discrete values, each value can be a decision, but when the attribute value is continuous, there can be too many conditions to consider. Say an attribute ranges from 0 to 10; how do we choose a threshold? A simple answer may be 5, but 5 is only good if the attribute values are uniformly distributed, or if the class label happens to be evenly divided at 5.

An algorithm is introduced in [IML]. First, it sorts the examples by attribute value, then finds every point where the class label changes and calculates the information gain there. Then it picks the best gain, which can be used as a binary split.

Here I will try to show an implementation of the algorithm in Python code. Entrophy() and Gain() are the functions that calculate information gain, as below.
import collections, math, sys

def Entrophy( S ):
    N = sum( S ) if len(S) !=0 else sys.float_info.epsilon
    P = [Si/N for Si in S]
    return sum( [ (-pi * math.log(pi,2) if pi != 0.0 else 0.0) for pi in P] ) 

def Gain( S, A ):
    NS = sum( S )
    NSv = [ sum(Ai) for Ai in A ]
    return Entrophy(S) - sum( [ (NSv[i]/NS)*Entrophy(A[i]) for i in range(len(A))] ) 
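For reference, these implement the usual definitions: Entrophy(S) = -sum_i p_i * log2(p_i) with p_i = S_i / sum(S), and Gain(S, A) = Entrophy(S) - sum_v ( |S_v| / |S| ) * Entrophy(S_v), where each S_v holds the label counts of one subset.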


The decision tree algorithm looks for the best attribute to choose in the current state, which requires calculating the information gain for each attribute. So usually the attribute values and class labels come in separate variables, hence the method definition below.
def SplitGains( A, C ):


Examples should be sorted by attribute value. Note that there can be examples with the same attribute value, or the same attribute value with different class labels. That is why I don't use a set or dict here, since those can silently throw away a duplicated key or overwrite a value with the last one. Instead, I used a plain sort with attribute and label as the key. This way the implementation copes well with various cases.
    Example = collections.namedtuple( 'Example', ['att','label'] )
    E = [ Example(ai,ci) for ai,ci in zip(A,C) ]
    E.sort( key = lambda x : (x.att,x.label) )


Then there should be a list of distinct labels - a perfect case for a set. Then figure out how many of each class label there are in the training examples, just by summing up like below.
    L = set( [ ei.label for ei in E ] )
    S = [ sum( [ 1 for ei in E if ei.label == li ] ) for li in L ]


Then go through the examples ( E ) and find all the splitting thresholds where the class label changes.
    Threshold = collections.namedtuple( 'Threshold', ['att', 'label', 'begin', 'end'])

    Thresholds = []    
    label = E[0].label
    begin = 0
    for (i,ei) in enumerate(E):
        if ei.label == label: continue
        
        middle = (ei.att + E[i-1].att)/2.0 
        Thresholds.append( Threshold(middle, label, begin, i-1))
        
        label = ei.label
        begin = i



Then comes the actual counting of class labels on the left and right of each threshold. This code is not very efficient, as there are loops within loops - roughly O( len(Thresholds) x len(E) x len(L) ). The begin/end information in Threshold could be used here, but then the code would not be as concise. I presumed len(Thresholds) and len(L) are reasonably small compared to len(E), so it ends up closer to O(len(E)).
    AttGains = []
    for ti in Thresholds:
        left = [ 
            sum( [ 1 for ei in E[:ti.end+1] if ei.label == li ] ) 
            for li in L ]
        right = [ 
            sum( [ 1 for ei in E[ti.end+1:] if ei.label == li ] ) 
            for li in L ]
        AttGains.append( (ti.att, Gain( S, [left,right]) ) )        

    return AttGains 


The whole method is below.
def SplitGains( A, C ):
    Example = collections.namedtuple( 'Example', ['att','label'] )
    E = [ Example(ai,ci) for ai,ci in zip(A,C) ]
    E.sort( key = lambda x : (x.att,x.label) )
    
    L = set( [ ei.label for ei in E ] )
    S = [ sum( [ 1 for ei in E if ei.label == li ] ) for li in L ]
    
    Threshold = collections.namedtuple( 'Threshold', ['att', 'label', 'begin', 'end'])

    Thresholds = []    
    label = E[0].label
    begin = 0
    for (i,ei) in enumerate(E):
        if ei.label == label: continue
        
        middle = (ei.att + E[i-1].att)/2.0 
        Thresholds.append( Threshold(middle, label, begin, i-1))
        
        label = ei.label
        begin = i

    AttGains = []
    for ti in Thresholds:
        left = [ 
            sum( [ 1 for ei in E[:ti.end+1] if ei.label == li ] ) 
            for li in L ]
        right = [ 
            sum( [ 1 for ei in E[ti.end+1:] if ei.label == li ] ) 
            for li in L ]
        AttGains.append( (ti.att, Gain( S, [left,right]) ) )        

    return AttGains       


And the unit tests are as below.
class TestDecisionTreeID3(unittest.TestCase):

    # [ML] P58
    def test_EntrophySimple(self):
        self.assertAlmostEqual( Entrophy([9,5]), 0.94028595)
        
    def test_MultipleLabel(self):
        print( 'Entrophy([9,5,4])=', Entrophy([9,5,4]) )        
        print( 'Gain([9,5,4],[[6,2,3],[3,3,1]])=', Gain([9,5,4],[[6,2,3],[3,3,1]]) )        
        
    # [ML] P58
    def test_GainSimple(self):
        self.assertAlmostEqual( Gain( [9,5], [[6,2],[3,3]]), 0.04812703)

    def test_EntrophyKnown(self):
        self.assertEqual( Entrophy([100]), 0.0)
        self.assertEqual( Entrophy([]), 0.0)
        self.assertEqual( Entrophy([1,1]), 1.0)
        
    # [IML] P125
    def test_SplitGains(self):
        sg = SplitGains( 
            [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11,12,13],
            [ 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0 ])
        self.assertEqual( sg, 
            [ (5.5, 0.49647937549469),
              (10.5, 0.014582259610812609),
              (12.5, 0.09123321517616922) ])
   
    def test_SplitGainsAnyOrder(self):
        sg = SplitGains( 
            [ 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 11,12,13],
            [ 0, 0, 0, 0, 0,  1, 1, 1, 1, 1, 1, 1, 0 ])
        self.assertEqual( sg, 
            [ (5.5, 0.49647937549469),
              (10.5, 0.014582259610812609),
              (12.5, 0.09123321517616922) ])
        
    def test_SplitGainsSameAtt(self):
        sg = SplitGains( 
            [ 0, 0, 5, 10],
            [ 0, 0, 1, 1 ])
        self.assertEqual( sg, [(2.5, 1.0)])

    def test_SplitGainsSameAttDifferentLabel(self):
        sg = SplitGains( 
            [ 0, 0, 5, 10],
            [ 0, 1, 1, 1 ])
        self.assertEqual( sg, [(0.0, 0.8112781244591328)])
        
    def test_SplitGainsSameAttDifferentLabelMixed(self):
        sg = SplitGains( 
            [ 0, 0, 0, 5, 10],
            [ 0, 1, 0, 1, 1 ])
        self.assertEqual( sg, [(0.0, 0.9709505944546686)])
      


    References
  • [ML] : Machine Learning, Tom M.Mitchell
  • [IML] : An Introduction to Machine Learning, Miroslav Kubat

Saturday, May 28, 2016

Fluent for educated

Recently I came across a book named 'Fluent Python' and was inspired to make my Python code fluent and concise.

First, there is no need to read each line, parse it and append it to a list like below.
def LoadTrainingExamples( file, separator ) :
    examples = []
    for line in open( file ) :
        examples.append( [ float(e) for e in line.rstrip().split(separator) ] )
    return examples

>>> Examples = LoadTrainingExamples( 'faults.txt', '\t' )
You just need to call loadtxt() like below.
>>> Examples = np.loadtxt( 'faults.txt' )


Next, I used array indices as implicit positions for storing min and max. It is sore on the eyes, but I have to show it again.
>>> minmax = np.vstack( (A.min(0),A.max(0)) ); 
>>> A = Normalize( A, minmax,[-1,1] );
There is 'namedtuple' to make a C-struct-like class in one line.
>>> import collections
>>> Limit = collections.namedtuple( 'Limit', ['Lower','Upper'])
>>> limits = [Limit(min(aj),max(aj)) for aj in A.T]   # A.T because min, max run across column vectors; aj names a column rather than a row
>>> A = Normalize( A, limits, Limit(-1,1))


Then the Normalize function, which is another pain to read.
def Normalize( A, minmax, interval ) :
    N = []
    scale = math.fabs( interval[1] - interval[0] )
    offset = interval[0]
    for i in range(A.shape[1]) :
        minv = minmax[0,i]
        maxv = minmax[1,i]
        N.append( [ (((float(a)-minv)/(maxv-minv))*scale + offset)
                    for a in A[:,i] ] )
    return np.matrix( N ).getT()
With the 'Limit' namedtuple and numpy.matrix's operator overloading of + and *, the code can be written as below.
def Normalize( A, Limits, interval ) :
    N = np.matrix([ (aj-li.Lower)/(li.Upper-li.Lower) 
        for aj,li in zip(A.T, Limits) ])                         # convert to [0:1] 
    N = N * (interval.Upper - interval.Lower) + interval.Lower   # scale and offset
    return N.getT() 

    References
  • Fluent Python, Luciano Ramalho

Wednesday, May 25, 2016

Artificial Neural Network with two layers and weight update

ANN is a popular machine learning subject. Here I describe a basic two-layer neural network with backpropagation. It is the simplest form of neural network that can update itself to reduce classification error.
First, the inputs range from -1 to 1 instead of 0 to 1. Below is the code that normalizes input values to this range.
import math
import numpy as np

def Normalize( A, minmax, interval ) :
    N = []
    scale = math.fabs( interval[1] - interval[0] )
    offset = interval[0]
    for i in range(A.shape[1]) :
        minv = minmax[0,i]
        maxv = minmax[1,i]
        N.append( [ (((float(a)-minv)/(maxv-minv))*scale + offset)
                    for a in A[:,i] ] )
    return np.matrix( N ).getT()
Forward propagation with sigmoid neurons can be coded as below. Here X is the input vector, HN is the weight matrix for the hidden neurons - the neurons after the input - and ON is the weight matrix of the output neurons.
def ForwardPropagation( X, HN, ON ) :
    Z = np.dot( HN, X )
    H = [ (1/(1+math.exp(-zi))) for zi in Z ]
    Z = np.dot( ON, H )
    O = [ (1/(1+math.exp(-zi))) for zi in Z ]
    return H, O
Backward propagation can be expressed in matrix form as below. H and O are the outputs from ForwardPropagation above, T is the observed output, and nu is the learning rate.
def BackwardPropagation( X, HN, ON, H, O, T, nu ) :
    D1 = [ yi*(1-yi)*(ti-yi) for yi,ti in zip(O,T) ]
    D2 = [ hi*(1-hi)*pi for hi, pi in zip(H, np.dot(ON.T,D1))]
    ON += nu * np.dot(np.matrix(D1).T, np.matrix(H))
    HN += nu * np.dot(np.matrix(D2).T, np.matrix(X).T )
    return HN, ON
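In formula terms, these are the usual delta rules for sigmoid units: delta_o = o(1-o)(t-o) for each output unit, delta_h = h(1-h) * sum_o( w_oh * delta_o ) for each hidden unit, and each weight is then updated by w += nu * delta * input.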
With the steel plates database - as in the previous post - below is code to run one epoch ( running through all the examples once to train ).
Examples = np.array(LoadTrainingExamples( 'faults.txt', '\t' ))
Examples = Examples[ range(0,346) ]  # 0-345 rows is for classification 27,28 column
A = Examples[:, range(4,27)]  # attributes - exclude 0-3 column as it is x,y position
minmax = np.vstack( (A.min(0),A.max(0)) );
A = Normalize( A, minmax, [-1,1] )
N = A.shape[1]                # number of attributes - input vector length
C = Examples[:, [27,28] ]
M = C.shape[1]                # number of output - output vector length
HN = 0.01 * np.ones( [N,N] )  # 0.01 for initial weight
ON = 0.01 * np.ones( [M,N] ) 

for Ai,Ci in zip( A,C)  :
    H, O = ForwardPropagation( Ai.T, HN, ON )
    HN, ON = BackwardPropagation( Ai.T, HN, ON, H, O, Ci, 0.1)             
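The 200-epoch run mentioned below was driven by a loop roughly like this - my reconstruction rather than the exact script, counting a miss whenever the winning output neuron disagrees with the recorded class.
for epoch in range(200):
    miss = 0
    for Ai, Ci in zip( A, C ):
        H, O = ForwardPropagation( Ai.T, HN, ON )
        HN, ON = BackwardPropagation( Ai.T, HN, ON, H, O, Ci, 0.1)
        if np.argmax( O ) != np.argmax( Ci ):   # winning output neuron vs recorded class
            miss += 1
    print( epoch, miss / len(A) )               # error rate after this epoch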

When running the above for 200 epochs, the error rate goes below 1%. Below is a graph showing how the error rate changes as the ANN gets trained.


However, when running with all 7 attributes, it doesn't get trained as well, as the graph below shows.


    References
  • An Introduction to Machine Learning, Miroslav Kubat. Chapter5
  • dataset provided by Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy. www.semeion.it

Sunday, May 15, 2016

k-Nearest-Neighbor classifier and cross-validation

The next subject is the k-Nearest-Neighbor classifier. When it gets a new sample, it searches through the training examples to find the most similar one ( or k of them ) and says the new sample should be the same as that history. If you have a big training set that covers most cases, then it should pick something similar.

Compared with a decision tree, you may wonder what the pros and cons are. Here is an analogy. When you drive to a city you have never visited before, a GPS navigator can help you get to your hotel. It tells you where it is from the hotel address. As long as you are going to some well-known hotel, the navigator is the first choice. Following the navigator is like using a decision tree: you just take a turn when the decision tree asks you to. But if you have the wrong street name, then no matter how good the building number is, you may end up at the wrong address. And if you don't have an address or the name of the place, just a description, then the navigator leads nowhere.

Let's say you want somewhere near a hilltop where you can see the sunset, have a beer at a pub and then walk to a cosy B&B; then your best bet is to park the car, take a cab and explain yourself. An experienced cab driver can take you to a place where all the descriptions meet in the middle. A dumb navigator is no match here. Of course, you have to tip the cab driver dearly, as she has to go through every location in her memory and match it against your description.

The above analogy is quite crude, but the point is that the kNN classifier is something you should try when you know the classification has to go through a wide range of choices with uncertainty everywhere.

There is an open database for this kind of test - visit the UCI Machine Learning Repository. Here I am using the Steel Plates Faults dataset.

First, the database is a tab-separated text file, which can be loaded as below.
def LoadTrainingExamples( file, separator ) :
    examples = []
    for line in open( file ) :
        examples.append( [ float(e) for e in line.rstrip().split(separator) ] )
    return examples
>>> LoadTrainingExamples( 'faultsshort.txt', '\t' )
...

When there are a large number of attributes - like this database, which has 27 variables - we have to choose which variables to play with. The chosen variables are broadly called a model. The selection can come from an expert or from an extensive automatic search - like hill climbing. However the selection is made, it has to be verified that the chosen model and the training set are a good match. One verification method is cross-validation: hold out some of the examples, train the model with the rest, then classify the held-out examples with the trained model and see whether the resulting classification matches the records.

So we need a function to choose attributes to make a model. Here the examples come with each row as a record and each column as an attribute. The function below picks columns and forms a model.
def PickColumns( A, I ) :
    P = [ A[:,i] for i in I ]
    return np.matrix( P ).getT()
>>> PickColumns( np.array( [[1,2,3],[4,5,6] ] ), [0,2] )
matrix([[1, 3],
        [4, 6]])

Before applying training, each attribute should be scaled up or down so that each contributes to the result equally. Here, the normalization is done using the min and max of each attribute.
def Normalize( A, minmax ) :
    N = []
    for i in range(A.shape[1]) :
        minv = minmax[0,i]
        maxv = minmax[1,i]
        N.append( [ ((float(a)-minv)/(maxv-minv)) for a in A[:,i] ] )
    return np.matrix( N ).getT()
>>> Normalize( np.matrix('1,2;0,1'), np.matrix('0,0;1,2'))
matrix([[ 1. ,  1. ],
        [ 0. ,  0.5]])

Finding the k nearest neighbors is relatively simple, as we can use scipy's spatial.KDTree. A KDTree is like a binary tree over n-dimensional attributes and should return a result in O(log2(N)). Here is a code snippet.
    KD = spatial.KDTree( A )
    distance, index = KD.query( X, k )
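Here is a tiny standalone example - with made-up points of my own - just to show the shape of what query() returns.
>>> import numpy as np
>>> from scipy import spatial
>>> KD = spatial.KDTree( np.array([[0,0],[1,1],[2,2]]) )
>>> KD.query( np.array([[0.9, 1.2]]), 1 )   # nearest point is row 1, at distance ~0.224
(array([ 0.2236068]), array([1]))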

With the above KDTree, we can find the k nearest neighbors at index. Then, from those neighbors, we have to choose a single classification result. Here is a function that selects the most frequent neighbor class.
def MaxArg( index, C ) :
    if len(index.shape) == 1 :
        S = sum( [ C[i] for i in index ] )
        NS = 1
    else :
        S = sum( [ C[i] for i in index[0] ] )
        NS = len( index[0] )
    return [ int(si/NS+0.5) for si in S ]

Integrating all the above code, here is code that does cross-validation. Refer to [Moore and Lee] for LOOCV ( Leave-One-Out Cross Validation ).

import numpy as np
from scipy import spatial

Examples = np.array(LoadTrainingExamples( 'faults.txt', '\t' ))
AI = range( 4,8 ); CI = range( 27,28 )  # The model is to use column 4,5,6,7 as attribute and 27 as classification
error_sum = 0

k = 1  # k can be 1 or 3 
E = PickColumns( Examples, AI );  
minmax = np.vstack( (E.min(0),E.max(0)) ); 
E = Normalize( E, minmax );
EC = PickColumns( Examples, CI )

# LOOCV ( Leave-One-Out Cross Validation )
for out_one in range(len(Examples)): 
    A = np.delete( E, (out_one), axis=0)    # training attribute
    C = np.delete( EC, (out_one), axis=0)   # training classification
 
    X = E[out_one];                         # validation point
    CX = EC[out_one];                       # recorded classification

    KD = spatial.KDTree( A )
    distance, index = KD.query( X, k )
    predict = MaxArg( index, C )
    error_sum += sum( [ abs(p-CX[0,i]) for i,p in enumerate(predict)])
    
    print( str.format( "Leave-One {0} is classifed by {1}{2} as {3} while sampled as {4}",
           out_one, index, distance, predict, CX[0] ))  

print( str.format( "Error = {0} ({1}/{2})",
    error_sum/(len(Examples)*len(CI)), error_sum, len(Examples)*len(CI)))  

When running the above script with k=1, the cross-validation error is 10.8%.
Leave-One 0 is classifed by [1299][ 0.00052379] as [0] while sampled as [[ 1.]]
Leave-One 1 is classifed by [115][ 0.00033883] as [1] while sampled as [[ 1.]]
...
Leave-One 1940 is classifed by [225][ 0.00013041] as [0] while sampled as [[ 0.]]
Error = 0.10819165378670788 (210.0/1941)

With k=3, the error is slightly less, at 8.6%.
Leave-One 0 is classifed by [[1299 1314 1004]][[ 0.00052379  0.00063176  0.00063623]] as [0] while sampled as [[ 1.]]
Leave-One 1 is classifed by [[ 115 1595 1554]][[ 0.00033883  0.00036469  0.00039562]] as [0] while sampled as [[ 1.]]
...
Leave-One 1940 is classifed by [[ 225 1927  857]][[ 0.00013041  0.00022388  0.00024139]] as [0] while sampled as [[ 0.]]
Error = 0.0865533230293663 (168.0/1941)

The model chosen is not from an expert, yet the result is relatively impressive to me. I mean, it is not easy to be correct in 9 out of 10 cases.
    There is a lot to explore from here.
  • Choosing the model automatically using hill climbing and seeing whether there is 'no free lunch'. I mean my guess - AI = range( 4,8 ) - should be a really bad choice compared to an extensive and thorough search.
  • Every validation needs a KDTree built from an almost identical training set. If the previous KDTree could be reused - by removing and adding a row - the speed could be improved dramatically.
  • To speed up at run time, the training set needs to be trimmed by removing redundant examples. A 'Tomek link' test can be tried.


    References
  • An Introduction to Machine Learning, Miroslav Kubat. p51
  • Efficient Algorithms for Minimizing Cross Validation Error, Andrew W. Moore, Mary S.Lee
  • dataset provided by Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy. www.semeion.it

Sunday, May 8, 2016

Decision Tree with Python

I am studying machine learning and writing Python code to see whether I understand the theory of decision trees correctly. For reference, see [MLT].

First, Entrophy and Gain should be calculated; here is the code snippet. Entrophy can be used like Entrophy( [9, 5] ), which returns 0.9402859586706309, and Gain( [9,5], [[6,2],[3,3]] ) returns 0.04812703040826927.
import numpy as NP
import itertools
import functools
import math

def Entrophy( S ):
    N = sum( S )
    if N == 0.0 : return 0.0

    P = [Si/N for Si in S]
    return sum( [ (-pi * math.log(pi,2) if pi != 0.0 else 0.0) for pi in P] ) 

def Gain( S, A ):
    NS = sum( S )
    NSv = [ sum(Ai) for Ai in A ]
    return Entrophy(S) - sum( [ (NSv[i]/NS)*Entrophy(A[i]) for i in range(len(A))] ) 
For a test example, I make up the O variable like below. It is a numpy array so that range indexing like O[:,-1] can be used.

O = NP.array([
    ['Sunny', 'Hot', 'High', 'Weak', 'No'],
    ['Sunny', 'Hot', 'High', 'Strong', 'No'],
    ['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Strong', 'No'],
    ['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
    ['Sunny', 'Mild', 'High', 'Weak', 'No'],
    ['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
    ['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
    ['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
    ['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Strong', 'No'],    
    ])
Then there should be a holder that knows what values each attribute can take.
def DistinctAttributeMap( O ) :
    DA = []
    for i in range(len(O[0])) :
        m = {}
        A = list(set(O[:,i]))
        for j in range(0,len(A)) : 
             m[ A[j] ] = j
        m[ 'Reverse' ] = A
        m[ 'Count' ] = len(A)
        DA.append( m  )
    return DA
DistinctAttributeMap goes through each column and figures out how many different values there are. It will generate output like below for the examples above.
[
{    'Rain': 0, 'Sunny': 1, 'Overcast': 2,      'Reverse': ['Rain', 'Sunny', 'Overcast']}, 
{    'Cool': 0, 'Hot': 1, 'Mild': 2,          'Reverse': ['Cool', 'Hot', 'Mild']}, 
{    'High': 0, 'Normal': 1,                 'Reverse': ['High', 'Normal']}, 
{    'Weak': 0, 'Strong': 1,                 'Reverse': ['Weak', 'Strong']}, 
{    'No': 0, 'Yes': 1,                         'Reverse': ['No', 'Yes']}
]
And to generate the A matrix for Gain(), here comes Attributes().
def Attributes( O, DA, i ) : 
    A = NP.zeros( [ DA[i]['Count'], DA[-1]['Count'] ] )

    for j in range(len(O)) :
        a = DA[i][ O[j,i] ]
        c = DA[-1][ O[j,-1] ]
        A[ a, c ] =  A[ a, c ] + 1

    return A
Attributes( O, DA, 0 ) will generate a result like below.
array([[ 2.,  3.],
       [ 3.,  2.],
       [ 0.,  4.]])
Armed with the above functions, ID3 looks like below.
def ID3( Examples, AttributeMap, Availables, Tree  ) :

    C = Attributes( Examples, AttributeMap, -1 )
    S = [ C[i,i] for i in range(len(C)) ]

    print( 'Examples = ', Examples )
    print( 'Availables = ', Availables )
    print( 'S = ', S )

    for i in range(len(S)) :
        if S[i] == len(Examples) :
            Tree[ 'Leaf' ] = AttributeMap[-1][ 'Reverse' ][i]
            return

    if len(Availables) == 0 :
        maxS = max( S )
        maxIndex = S.index( maxS )
        Tree[ 'Probably' ] = AttributeMap[-1][ 'Reverse' ][maxIndex]
        Tree[ 'Odd' ] = maxS / sum( S )
        return
    
    maxAttr = -1
    maxGain = -1
    for i in Availables :
        Gi = Gain( S, Attributes( Examples, AttributeMap, i ))
        if Gi > maxGain :
            maxAttr = i
            maxGain = Gi

    Tree[ 'Best' ] = maxAttr
    Tree[ 'Gain' ] = maxGain

    leftAttr = Availables.copy()
    leftAttr.remove( maxAttr )

    for a in AttributeMap[maxAttr][ 'Reverse' ] :
        leftExamples = NP.array( list(filter( lambda r : r[maxAttr] == a, Examples )))

        leftClass = set(leftExamples[:,-1])
        if len( leftClass ) == 1 :
            Tree[ a ] = leftClass.pop()
        else :
            Tree[ a ] = {}  
            ID3( leftExamples, AttributeMap, leftAttr, Tree[a] )
            
    return
The result is like below, with some formatting.

>DA = DistinctAttributeMap( O )
>DT = {}
>ID3( O, DA, [0,1,2,3], DT )
>print( 'DecisionTree = ', DT )
DecisionTree ={
'Best': 0,
'Gain': 0.24674981977443911,
'Rain': {
    'Best': 3,
    'Gain': 0.97095059445466858,
    'Weak': 'Yes',
    'Strong': 'No'},
'Sunny': {
   'Best': 2,
   'Gain': 0.97095059445466858,
   'High': 'No',
   'Normal': 'Yes'},
'Overcast': 'Yes', 
}
I used a dictionary to represent the tree structure. I guess I will try to show the result as a drawing, but that will be another blog post.
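Before that, here is a small sketch - my own addition, assuming the dictionary layout produced by ID3 above - of how the tree could be used to classify a new example.
def Classify( Tree, x ) :
    if 'Leaf' in Tree : return Tree[ 'Leaf' ]              # pure node
    if 'Probably' in Tree : return Tree[ 'Probably' ]      # ran out of attributes
    branch = Tree[ x[ Tree[ 'Best' ] ] ]                   # follow the value of the best attribute
    if isinstance( branch, dict ) :
        return Classify( branch, x )
    return branch                                          # a plain label such as 'Yes'
For example, Classify( DT, ['Sunny', 'Cool', 'High', 'Weak'] ) returns 'No' with the tree above.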
How about using the 'Gini index' rather than 'Gain'? Will it make any difference?
How do I make it noise-resilient - that is, prevent overfitting?
How do I let the user fine-tune it? For example, a decision tree with at most 3 levels of depth and at least 95% classification accuracy?
How about presenting the decision tree as a diagram and letting the user verify it? An interactive diagram that shows and updates the decision tree?
How about making this an API, so that the user provides data and retrieves a decision tree?
How about dynamically updating the decision tree at run time?

Reference :
[MLT] Machine Learning by Tom Mitchell. Chapter 3