Essays on programming languages, computer science, information technologies and all.

Sunday, December 22, 2013

Using Boost io_service to synchronize with a user thread

When writing a GUI application with MFC, you can run into situations where you need to call a GUI method from a non-GUI thread. This commonly happens when more than one thread is running, which is typical nowadays. When the application calls a method like AfxGetMainWnd() from a non-GUI thread, it crashes: AfxGetMainWnd() stores information in thread-local storage, which can't be accessed from the non-GUI thread. Refer to the code below, where Foo runs a non-GUI thread that calls a GUI-thread-only method and crashes the application.

To get around this problem, the GUI thread has to call the method on behalf of the non-GUI thread. If we simply put the method on the GUI class that owns the GUI thread - CMainFrame above - the GUI class becomes dependent on a completely separate class, which is bad design. It is also impossible to know in advance all the methods that will need to be called on the GUI thread; if we add a method to the GUI class whenever something new comes up, the code is constantly changing. Refer to the code snippet below, which demonstrates this scenario. The exact synchronization details, e.g. SignalGUI(), WaitForGUIToFinish(), IsGUISignaled(), are not shown; they can be implemented with a semaphore and a list structure.

With boost's io_service, this situation can be solved elegantly, with a clean separation between the GUI and non-GUI classes. A Controller class owns the io_service and acts as a bridge between the two. The non-GUI class can Post() its own method that has to run on the GUI thread, and the GUI class simply calls the Controller's Poll() method, which in turn invokes the posted method. This Controller provides good isolation between the two entities and is reusable, since it does not depend on how the GUI and non-GUI classes are built.

Thursday, December 19, 2013

Using MSBUILD

Using MSBUILD to drive the build process has an advantage over other build tools - e.g. NAnt - and that is the IDE. There is no difference between the project file loaded in the Visual Studio IDE and the same project compiled in the build process, so writing and testing code in the IDE is enough to run the build process.

But to keep the build results consistent across a number of projects, there should be a common configuration that applies to all of them. This can be done easily by factoring out the common part of the projects and importing it in each project configuration. Refer to the code snippet below: an Import asks for BuildUnitTest.Default.props, a file that can contain all the common settings across every unit test based on the Boost unit test framework.

Now BuildUnitTest.Default.props can look like the snippet below. Note settings like OutDir and IntDir: by using this common project file, every project emits its generated output to a consistently named OutDir, and all intermediate files go to a consistently named IntDir. Also note how the Debug and Release conditions apply consistent settings to every project that imports this file.
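The referenced snippet is not shown above; a minimal sketch of such a props file might look like the following. OutDir, IntDir and ItemDefinitionGroup are standard MSBuild/vcxproj elements, but the exact paths and settings here are assumptions.

```xml
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <!-- Common output locations for every unit-test project that imports this file -->
  <PropertyGroup>
    <OutDir>$(SolutionDir)..\Output\$(Configuration)\</OutDir>
    <IntDir>$(SolutionDir)..\Intermediate\$(Configuration)\$(ProjectName)\</IntDir>
  </PropertyGroup>
  <!-- Configuration-specific settings applied consistently to all projects -->
  <ItemDefinitionGroup Condition="'$(Configuration)'=='Debug'">
    <ClCompile>
      <RuntimeLibrary>MultiThreadedDebugDLL</RuntimeLibrary>
    </ClCompile>
  </ItemDefinitionGroup>
  <ItemDefinitionGroup Condition="'$(Configuration)'=='Release'">
    <ClCompile>
      <RuntimeLibrary>MultiThreadedDLL</RuntimeLibrary>
    </ClCompile>
  </ItemDefinitionGroup>
</Project>
```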

After building up a number of projects of this kind, it is time to weave them together into one project file that can build all the projects. Refer to BuildAll.proj below: all solution files are listed under SolutionToBuild, and every runnable unit test can be listed under AfterBuild.
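BuildAll.proj itself is not shown above; a minimal sketch could look like the following. The item name SolutionToBuild and target name AfterBuild follow the text, while the solution paths and test executable name are placeholders.

```xml
<?xml version="1.0" encoding="utf-8"?>
<Project DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <!-- Every solution that should be part of the full build -->
  <ItemGroup>
    <SolutionToBuild Include="..\Source\Core\Core.sln" />
    <SolutionToBuild Include="..\Source\UnitTest\UnitTest.sln" />
  </ItemGroup>
  <Target Name="Build">
    <MSBuild Projects="@(SolutionToBuild)" Properties="Configuration=Release" />
    <CallTarget Targets="AfterBuild" />
  </Target>
  <!-- Run every unit-test executable after a successful build -->
  <Target Name="AfterBuild">
    <Exec Command="..\Output\Release\CoreUnitTest.exe" />
  </Target>
</Project>
```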

And now a simple batch file can kick off the build process. Refer to BuildAll.bat below.
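The batch file is not shown above; a minimal sketch might be the following, where the vcvars32.bat path assumes a default Visual Studio 2010 installation.

```bat
@echo off
rem Set up the VC++ build environment (path assumes a default VS 2010 install)
call "%ProgramFiles(x86)%\Microsoft Visual Studio 10.0\VC\bin\vcvars32.bat"
rem Kick off the master project that builds every solution and runs the tests
msbuild BuildAll.proj /t:Build /p:Configuration=Release
```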

Note that BuildAll.bat can also be used from a command line where vcvars32.bat has been run; the command window can be set up as below. This way of making a build process doesn't require any other tools: Visual Studio alone - demonstrated here with 2010 - is enough, and it can be extended in various ways without much effort or learning.

Saturday, December 14, 2013

Using Git - setup SSH server

People are starting to have more than one computer to play with. In my case, I have two PCs, a laptop and a tablet: one PC for work, the other for private matters, and the laptop for both while mobile. I like to go to a cafe and work there with my laptop, and I see lots of people come to cafes with theirs. Nowadays most popular cafes have power sockets at the tables or on a nearby wall. And of course WiFi.

This convenience isn't so convenient when it comes to moving files between the PCs and the laptop. As a software developer, I need to share source code between these computers, whether the files are for work, private projects or public source code.

I could put all the files on some server and download, modify and upload them. Or the server could be a source-control server. In that case you need a server with a fixed IP, which usually costs money.

There is another way: Git. Git is a distributed version control system. With Git, you can write code on any of your computers and share it between them easily.

Roughly speaking, each computer has its own source repository. After you write code, you 'commit' it to the local repository, which keeps the changes on its own computer. Then you 'push' the committed changes to a global repository, which can reside on any of your computers; this global repository is where all the changes made by every computer reside. It doesn't have to be on the web. Then you can use another PC or the laptop to 'pull' the changes from the global repository, make changes, 'commit' and 'push' again.
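The commit/push/pull round trip above can be demonstrated with a throw-away script. The bare repository "global.git" stands in for the shared repository that would normally sit behind the SSH server; all paths are illustrative.

```shell
set -e
workdir=$(mktemp -d)
# The "global repository": a bare repo, i.e. no working copy of its own
git init --bare -q "$workdir/global.git"

# First computer: clone, commit locally, push to the global repository
git clone -q "$workdir/global.git" "$workdir/pc1"
git -C "$workdir/pc1" symbolic-ref HEAD refs/heads/master   # pin the branch name
git -C "$workdir/pc1" config user.email "me@example.com"    # placeholder identity
git -C "$workdir/pc1" config user.name  "me"
echo "int main() { return 0; }" > "$workdir/pc1/main.cpp"
git -C "$workdir/pc1" add main.cpp
git -C "$workdir/pc1" commit -q -m "first change"
git -C "$workdir/pc1" push -q origin master
git -C "$workdir/global.git" symbolic-ref HEAD refs/heads/master

# Second computer: cloning (or pulling) retrieves the pushed change
git clone -q "$workdir/global.git" "$workdir/laptop"
```

After the final clone, the "laptop" has both the file and the full history made on "pc1", without either machine talking to the other directly.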

The distributed part of Git comes in handy when you want to see the change history. The local repository has all the history the global repository has. When you 'push', all the committed changes go up to the global repository; when you 'pull', the local repository gets the history of all the changes made by the other computers through the global repository.

The global repository should be accessible from every computer you have. One way to achieve this is to run an SSH server. SSH is a gateway that allows other computers to access the machine; through the SSH server, they can read and write the files in the global repository.

The SSH server asks for a public key for each user. A key pair can be generated with puttygen, as in the screenshot below. You send the saved public key to the SSH server administrator but keep the private key to yourself. The administrator then knows your name and your public key, and later, when you try to access the server with your name, the SSH server asks for the matching private key to let you in.
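puttygen is the Windows GUI route; on systems with OpenSSH installed, the same key pair can be generated on the command line with ssh-keygen (shown here as an aside; the paths are throw-away).

```shell
set -e
keydir=$(mktemp -d)
# Generate an RSA key pair with no passphrase (illustrative only;
# a real key should normally be passphrase-protected)
ssh-keygen -t rsa -b 2048 -N "" -f "$keydir/id_rsa" -q
# id_rsa.pub is what you hand to the SSH server administrator;
# id_rsa stays private on your machine
cat "$keydir/id_rsa.pub"
```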

Then, as the SSH server administrator, you register yourself as a user. Refer to the screenshot below with Bitvise SSH Server: you add a user account and then add the public key generated with puttygen.

Now, when you use Git, you need the matching private key to log in to the SSH server. Refer to the screenshot below using TortoiseGit: you can select your private key for a specific SSH server. Here I used my own computer as the SSH server, hence 'localhost'.

After a successful log-in, TortoiseGit executes the Git commands that reside in the Git installation folder, e.g. "C:\Program Files (x86)\Git\libexec\git-core". You need to put this path in the PATH environment variable; otherwise the SSH server will terminate right after executing any git command, e.g. cmd.exe /c git-upload-pack "F:\depo\my_repo.git" - if git-upload-pack is not in PATH, cmd.exe just returns without any error.

Thursday, August 1, 2013

Lightweight Component Object Model

COM (Component Object Model) is an ambitious and grand paradigm that aims to cover almost everything. First of all, it is compiler neutral and, furthermore, programming language neutral. It can create objects within a process, or communicate with a service (a component in a separate process) via marshaling. And it allows multiple inheritance.

This grandness is what made COM dominant in many software projects, though it is heavyweight and has a steep learning curve. Compiler and language neutrality bring in IDL (Interface Definition Language), VARIANT, BSTR and HRESULT, and no exception may cross an object boundary. The object creation model brings in various registry schemes. The apartment model comes with lots of concepts, e.g. MTA, STA, marshaling, server lifetime and so on. Installation requires registering the DLLs in a specific way.

There are lots of situations where we don't need that much freedom. For example, equipment control software is a tightly controlled project that doesn't need all that neutrality. Any change in the software has to be heavily tested and is usually deployed as a whole, and only one programming language is used to develop the whole control code. Engineers spend their time making the machine run, not mixing two programming languages for the sake of architectural grandness.

Now imagine a project that doesn't need language or compiler neutrality. What happens if we don't need marshaling? What if we want to throw an exception from a sub-module and catch it in the host module? What if we can just use string or wstring? What if we just want to drop a DLL into a specific folder?

Here I will code a C++ component object model - named PlugIn - that is written in C++, compiled with Visual Studio, and nothing else. Objects can be created only within the host process, and only single inheritance is allowed.

// IPlugIn.h 
#include <string>
#include <vector>
class IPlugIn
{
public:
  virtual ~IPlugIn() {};
  virtual const char * GetID() const = 0;
};

typedef std::vector< std::string > PlugInNameArray;

class IPlugInFactory : public IPlugIn
{
public:
  virtual PlugInNameArray GetPlugInIDs() const = 0;
  virtual IPlugIn* Create( const char *ID ) = 0;
};
First, IPlugIn is the equivalent of IUnknown in COM: the base interface of every exposed object. Unlike IUnknown, it doesn't need QueryInterface, because a polymorphic cast, i.e. dynamic_cast, retrieves the desired interface in a type-safe way; and since we don't support multiple inheritance, a complex QueryInterface() isn't necessary. AddRef() and Release() aren't necessary either, as a Boost smart pointer, i.e. shared_ptr, is used to manage the allocated memory. Note that classes in the DLL must be allocated with new, since shared_ptr will destroy them with delete; this is also why the destructor is virtual.

IPlugInFactory is a factory that delegates the creation of IPlugIn objects based on a string id. This string id plays the role of a ProgID - Programmatic ID. Again, we don't use GUIDs for interfaces, since we rely on dynamic_cast.

GetPlugInIDs() retrieves all the supported classes, which allows dynamic discovery: the user enumerates the IDs, creates each object and checks whether it can be cast to the desired interface.

IPlugIn.h declares what the basic interface looks like. The interface is exported through C functions, as below.
#ifdef PLUGIN_EXPORTS
#define LIBSPECS extern "C" __declspec(dllexport)
#else
#define LIBSPECS extern "C" __declspec(dllimport)
#endif

LIBSPECS HRESULT DllGetVersion( DLLVERSIONINFO *pvdi );
LIBSPECS IPlugInFactory* CPI_CreateFactory();
DllGetVersion lets the user know the version of the loaded DLL. According to Johnson M. Hart (Windows System Programming, 4th ed.: 175-177), it is a common practice at Microsoft. As it is not a good idea to reinvent the wheel, I use the same method.

CPI_CreateFactory() is the only entrance to the DLL; every plugin in the DLL is created through this interface.

Now the implementation of the plugin.

// IFoo.h
class IFoo : public IPlugIn
{
public:
 virtual int Foo( int i ) = 0;
};

typedef boost::shared_ptr< IFoo > IFooPtr;


// PlugInFoo.cpp
class PlugInFoo : public IFoo
{
public:
  const char * GetID() const { return s_PlugInClassName; }
  int Foo( int i ) { return i * i; }

  static const char *GetClassName() { return s_PlugInClassName; }

private:
  static const char *s_PlugInClassName;
};

const char * PlugInFoo::s_PlugInClassName = "CPI.PlugInFoo.1";


class FooPlugInFactory : public IPlugInFactory
{
public:
  const char * GetID() const { return "CPI.PlugInFooFactory.1"; }

  PlugInNameArray GetPlugInIDs() const 
  {
    PlugInNameArray names;
    names.push_back( PlugInFoo::GetClassName() );
    return names;
  }

  IPlugIn* Create( const char *plugInName ) 
  {
    std::string strPlugInName( plugInName );
    if( strPlugInName == PlugInFoo::GetClassName() )
    {
      return new PlugInFoo();
    }
    else
      return NULL;
  }
};

IPlugInFactory* CPI_CreateFactory()
{
  return new FooPlugInFactory();
}

IFoo.h is what is shared with the host process; it is the interface the host casts the created object to.

The implementations of IFoo and IPlugInFactory are almost trivial. PlugInFoo returns its class name as its ID, and FooPlugInFactory returns this class name as its sole plugin; of course, it could return more than one ID. Create() just allocates the object and returns it to the user.

Now for the host process. It should be able to retrieve a known interface from a DLL based on a known id. I put these functions in an hpp file so they can be used just by including the header.
// PlugInModule.hpp

class PlugInModule
{
public:
  PlugInModule() : m_hDLL( NULL ) {}

  ~PlugInModule()
  {
    m_PlugInFactory.reset();

    if( m_hDLL ) {
      FreeLibrary( m_hDLL );
    }
  }

  void Load( const _TCHAR * filename )
  {
    m_hDLL = LoadLibrary( filename );
    if( m_hDLL == NULL ) {
      throw Exception( StringFormat::As( _T("%s is not a valid DLL"), filename ) );
    }

    FN_DllGetVersion DllGetVersion = 
      (FN_DllGetVersion) GetProcAddress( m_hDLL, "DllGetVersion" );
    if( DllGetVersion == NULL ) {
      throw Exception( StringFormat::As( _T("%s is not a supported PlugIn DLL"), filename ) );
    }

    if( S_OK != DllGetVersion( &m_VerInfo ) ) {
      throw Exception( StringFormat::As( _T("Failed to get version info of the PlugIn DLL %s"), filename ) );
    }


    FN_CPI_CreateFactory CPI_CreateFactory = 
      (FN_CPI_CreateFactory) GetProcAddress( m_hDLL, "CPI_CreateFactory" );

    if( CPI_CreateFactory == NULL ) {
      throw Exception( StringFormat::As( _T("%s is not a supported PlugIn DLL"), filename ) );
    }

    m_PlugInFactory = IPlugInFactoryPtr( CPI_CreateFactory() );
  }

  template< typename TPlugIn >
  boost::shared_ptr< TPlugIn > CreatePlugIn( const char *name )
  {
    if( m_PlugInFactory == NULL ) { throw Exception( StringFormat::As( _T("No plugin DLL loaded") ) ); }

    IPlugIn* plugIn = m_PlugInFactory->Create( name );
    if( plugIn == NULL ) {
      throw Exception( StringFormat::As( _T("No plugin found with given name %s"), name ) );
    }

    TPlugIn* targetPlugIn = dynamic_cast< TPlugIn* > ( plugIn );
    if( targetPlugIn == NULL ) {
      delete plugIn;
      throw Exception( StringFormat::As( _T("Can't convert to target plugin interface with %s"), name ) );
    }

    return boost::shared_ptr< TPlugIn >( targetPlugIn );
  }

  const DLLVERSIONINFO& GetVersionInfo() const 
  { 
    if( m_hDLL == NULL ) { throw Exception( StringFormat::As( _T("No DLL has been loaded yet" ) ) ); }

    return m_VerInfo; 
  }

  PlugInNameArray GetPlugInIDs() const
  {
    if( m_hDLL == NULL ) { throw Exception( StringFormat::As( _T("No DLL has been loaded yet" ) ) ); }
    return m_PlugInFactory->GetPlugInIDs();
  }
  
private:
  typedef boost::shared_ptr< IPlugInFactory > IPlugInFactoryPtr;
  typedef IPlugInFactory* (* FN_CPI_CreateFactory)();
  typedef HRESULT (* FN_DllGetVersion)(DLLVERSIONINFO *);

  HMODULE m_hDLL;
  IPlugInFactoryPtr m_PlugInFactory;
  DLLVERSIONINFO m_VerInfo;
};

typedef boost::shared_ptr< PlugInModule > PlugInModulePtr;

The hpp above can be used as in the code below: the user just loads a DLL and creates an interface with a known ID.
#include "PlugInModule.hpp"

  PlugInModulePtr plugInFoo( new PlugInModule() );

  plugInFoo->Load( _T("PlugInFoo.DLL") );
  IFooPtr foo( plugInFoo->CreatePlugIn< IFoo >( "CPI.PlugInFoo.1" ) );
  foo->Foo( 10 );
This lightweight component object model is very restricted, but it can be an easy alternative to the heavyweight COM.



Wednesday, July 24, 2013

Properties of a chain code, i.e. perimeter, area and center of mass

A chain code is compact and efficient to store and transfer, though it is not so straightforward to calculate some properties of the shape from it, i.e. area, perimeter and center of mass.

The perimeter is relatively easy; following Bernd Jähne's book (Digital Image Processing, 5th ed.: 508-511), it can be calculated as in the code below.
  double CalculatePerimeter( const ChainCode& chains )
  {
    double p = 0;
    double sqrt2 = sqrt( 2.0 );

    for( size_t i=0; i < chains.size(); ++i )
    {
      if( chains[i]%2 == 0 ) 
      {
        p += 1.0;
      }
      else
      {
        p += sqrt2;
      }
    }

    return p;
  }
Bernd Jähne also shows how to get the area from a chain code, but it doesn't work well when the chain code is 1 pixel wide or high: e.g. (0 0 4 4) returns 2 instead of 3. Refer to the image below, which shows how the area ends up 1 pixel short.


Cris Luengo (Cris's Image Analysis Blog: More chain code measures) shows MATLAB code that works well with 1-pixel-wide chain codes. In the 3-pixel case above, it makes two more additions when the direction changes, which add up to the missing pixel. Refer to the image below.
Here is a C/C++ code snippet converted from Cris Luengo's MATLAB code.

// making a circular chain code by inserting the last code to the front 
ChainCode cc;
cc.push_back( chains[ chains.size()-1 ] );
for( size_t i=0; i < chains.size(); ++i )
{
  cc.push_back( chains[i] );
}
m_Area = CalculateArea( cc );


int CalculateArea( const ChainCode& cc ) // a circular chain
{
  int (&M)[8][8]( TableAreaIncrement );
  int B = 10, A = 0;

  for( size_t i=1; i < cc.size(); ++i )
  {
    uint8_t c_i = cc[i];
    uint8_t c_im1 = cc[i-1];

    switch( M[c_im1][c_i] )
    {
    case 'X' :
      throw Exception( StringFormat::As( _T("Invalid chain code found as %d->%d"), 
        c_im1, c_i ));
    case 'A': A += -B+1;  break;
    case 'B': A += -B;    break;
    case 'C': A += B;    break;
    case 'D': A += B-1;    break;
    }

    switch( c_i )
    {
    case 0: A+=B;      break;
    case 1: A+=(++B);    break;
    case 2: B++;      break;
    case 3: A+=-(++B)+1;  break;
    case 4: A+=-B+1;    break;
    case 5: A+=-(--B)+1;  break;
    case 6: B--;      break;
    case 7: A+=(--B);    break;
    }
  }

  return A;
}
  
int ChainCodeProperties::TableAreaIncrement[8][8] = 
{
  {'0','1','B','X','A','A','6','7'},
  {'0','1','B','B','X','A','6','7'},
  {'C','C','2','3','4','X','C','C'},
  {'C','C','2','3','4','5','X','C'},
  {'C','C','2','3','4','5','D','X'},
  {'X','C','2','3','4','5','D','D'},
  {'0','X','A','A','A','A','6','7'},
  {'0','1','X','A','A','A','6','7'},
};

The center of mass can be calculated by summing up the x and y coordinates. But simply adding them up does not handle a 1-pixel-wide chain code, as in the image below: the (1,1) pixel is added twice instead of once. Compensation can be done by adding more pixels when the direction reverses, following the idea of Cris Luengo's area calculation.
Below is a C/C++ code snippet that calculates the centroid.
void CalculateCentroid( const ChainCode& cc, double* cx, double *cy )
{
  int (&O)[8][2]( TableOffsetIncrement );
  int vpc = 0;      // visited pixel count
  int sx = 0, sy = 0;    // sum of x and y
  int px = 0, py = 0;
  
  for( size_t i=1; i < cc.size(); ++i )
  {
    uint8_t c_i = cc[i];
    uint8_t c_im1 = cc[i-1];

    if( c_i != c_im1 && (c_i & 0x3 ) == ( c_im1 & 0x3 ) )
    {
      // when direction changes in opposite direction 
      // e.g. 0 -> 4, 4 -> 0, 1 -> 5, ...
      vpc++; sx += px;  sy+=py;
    }

    px += O[ c_i ][0];
    py += O[ c_i ][1];

    vpc++; sx += px;  sy+=py;
  }

  *cx = (double)sx / (double)vpc;
  *cy = (double)sy / (double)vpc;
}

int ChainCodeProperties::TableOffsetIncrement[8][2] =
{
  { 1,  0  },
  { 1,  -1  },
  { 0,  -1  },
  { -1,   -1  },
  { -1,  0  },
  { -1,  1 },
  { 0,  1 },
  { 1,  1 },
};  
It works well in most cases, but it fails when a pixel is visited more than twice, as in the image below: the (1,1) pixel is counted 3 times when it should be counted 2 times.


Monday, July 8, 2013

Contour tracing by Moore Neighbor Algorithm

Having a binary image with defects, it is time to find out what the defects look like: their location and size are what we want to figure out. But first we need to extract the defects from the image. This is called segmentation, and there are two methods: contour detection and region coloring.

Region coloring puts a unique label on each separated component. It first labels all the pixels in one pass, and in a second phase merges labels that are connected. The result is still an area image, with each component assigned a unique number. If you want to count the defects, the area has to be scanned for unique numbers; and if you want to transmit the defect information, the whole labeled image has to be sent out - though it can be much smaller than the original image if run-length encoding is used.

Contour detection traces the defect pixels and extracts the contour only: it starts on a pixel and keeps following nearby defect pixels until it comes back to the start pixel. The result is just the contour pixels, which can be chain-coded at 3 bits per pixel - quite small and compact to transmit. The contour information can then be used to calculate area or moments for analysis.

There are a number of contour-tracing algorithms; refer to [CTA]. Square tracing doesn't work with an 8-neighborhood. Moore-Neighbor tracing and the radial sweep algorithm are almost identical. Theo Pavlidis's algorithm is not that easy to implement.

Here I show an implementation of Moore-Neighbor tracing based on the above web site, though the stop condition there is not so simple, so I use the approach from another book - p. 192 of [IPA] - as in the excerpt below.
  If the current boundary element Pn is equal to the second border element P1, and
  if the previous border element Pn-1 is equal to P0, stop. Otherwise repeat step 2.
First the image is traversed to find a start point. When a pixel is found, the contour is traced and the traced object is cleared out so that it does not get picked up again. Here is the code snippet.
void Build( BinaryImageBase& image, const Region& roi, PointsContainer* contours ) 
{
  for( int sy = roi.Top; sy<= roi.Bottom; ++sy )
  {
    for( int sx = roi.Left; sx<=roi.Right; ++sx )
    {
      if( image.GetPixel( sx, sy ) == false )
        continue;

      Points contour;
      m_Tracer->Trace( image, roi, Point((uint16_t)sx,(uint16_t)sy), &contour ); // MooreNeighbor Tracing
      m_Filler->Fill( image, contour[0], false );  // FloodFill
    
      contours->push_back( contour );
    }
  }
}
The Moore-Neighbor tracing can be implemented as below.
struct WalkTableItem
{
  Point Step;
  int Next;
} WalkTable[8] =
{
  { Point(1,0),   6, },  // from North to North-East, when found pixel, start from index 6 of this table
  { Point(0,1),  0, },   // from North-East to East
  { Point(0,1),  0, },   // from East to South-East
  { Point(-1,0),  2, },  // from South-East to South
  { Point(-1,0),  2, },  // from South to South-West
  { Point(0,-1),  4, },  // from South-West to West
  { Point(0,-1),  4, },  // from West to North-West
  { Point(1,0),  6, },   // from North-West to North
};

void Trace( const BinaryImageBase& image, const Region& roi, const Point& start, Points* trace )
{
  const int SizeOfT = sizeof(WalkTable)/sizeof(WalkTable[0]);
  WalkTableItem (&T)[SizeOfT]( WalkTable );   // access WalkTable with variable name T for short

  Points B; B.push_back( start );             // B is the container of points
  int sk = 0, k = 0;                          // sk is the start index, k is current index
  Point c = start - T[k].Step;                // c is current point
  k = T[k].Next;
  sk = ( SizeOfT + k - 1 ) % SizeOfT;

  for( ; ; ) 
  {
    Point n = c + T[k].Step;                  // n is next point
    if( image.GetPixel( n.As.XY.X, n.As.XY.Y ) == true ) 
    {
      size_t bc = B.size();

      if( bc > 2 && n == B[1] && B[bc-1] == B[0] )  
      {
        break;                                 // end condition by [IPA]
      }
      else
      {
        B.push_back( n );                      // found contour pixel
        n = c;                                 // back-track
        k = T[k].Next;                         // next index to search 
        sk = ( SizeOfT + k - 1 ) % SizeOfT;    // stop index if no pixel at moore-neighbor
      }
    }
    else
    {
      k = (k+1) % SizeOfT;                      // move to next index 
    }

    c = n;
    if( sk == k )                               // stop condition with 1 pixel defect
    {
      break;
    }
  }

  if( B.size() == 1 )                           
  {
    trace->swap( B );                           // 1 pixel defect
  }
  else
  {
    trace->assign( B.begin(), B.end() - 1 );    // The last pixel is the start pixel so exclude it.
  }
}
Here is the code of a stack-based FloodFill based on [WFF].
void Fill( BinaryImageBase& image, const Point& start, bool fillValue )
{
  std::stack< Point > Q;
  Q.push( start );

  while( Q.size() != 0 )
  {
    Point p = Q.top();
    Q.pop();

    if( p.As.XY.X < 0 || p.As.XY.X >= image.GetWidth() ||
      p.As.XY.Y < 0 || p.As.XY.Y >= image.GetHeight() )
      continue;

    if( image.GetPixel( p.As.XY.X, p.As.XY.Y ) == fillValue ) 
      continue;

    image.SetPixel( p.As.XY.X, p.As.XY.Y, fillValue );

    Q.push( Point( p.As.XY.X + 1, p.As.XY.Y ) );
    Q.push( Point( p.As.XY.X - 1, p.As.XY.Y ) );

    Q.push( Point( p.As.XY.X + 1, p.As.XY.Y - 1 ) );
    Q.push( Point( p.As.XY.X,     p.As.XY.Y - 1 ) );
    Q.push( Point( p.As.XY.X - 1, p.As.XY.Y - 1 ) );

    Q.push( Point( p.As.XY.X + 1, p.As.XY.Y + 1 ) );
    Q.push( Point( p.As.XY.X,     p.As.XY.Y + 1 ) );
    Q.push( Point( p.As.XY.X - 1, p.As.XY.Y + 1 ) );
  }
}

The result of contour tracing is the boundary pixels, and one or more pixels can be repeated, as the trace needs to return to the start pixel. Refer to the image below and its result.
Contour pixels : (1,1),(2,2),(3,1),(2,2),(3,3),(2,2)


Reference :
[CTA] Contour Tracing WEB Site by Abeer George Ghuneim
[IPA] Image Processing, Analysis and Machine Vision by Milan Sonka, Vaclav Hlavac, Roger Boyle.
[WFF] FloodFill from Wikipedia

Wednesday, June 19, 2013

OpenCL - Connecting pixels

Sometimes a number of 1-pixel defects are scattered around closely. When the pixels are close enough, it is better to consider them one defect rather than separate defects.

To merge a number of close pixels, we can use closing - dilation followed by erosion. This adds pixels between horizontally or vertically separated pixels, but it doesn't fill the gap between diagonally separated ones. Refer to the two images below, which show the original image and the image after closing.


This is because the dilation spreads to the 8 neighbor pixels, as below.

To fill the gap between diagonally separated pixels, the dilation should cover more, as below.

When this 2-pixel-distance dilation is used, the diagonal gap is filled, as below.
The 2-pixel dilation code is as below.
__kernel
void Dilation2PixelDistance(
  __global uchar *d_src, 
  __global uchar *d_dst,
  int width, int height,
  int roiLeft, int roiTop, int roiRight, int roiBottom )
{ 
  int x = (roiLeft & (~0x07)) + ( get_global_id(0) << 3 );
  int y = roiTop + get_global_id(1);
  int stride = width >> 3;

  if( x <= roiRight ) 
  {
    int idx = x + (y-2) * width;
    int bidx = idx >> 3;

    uchar dilated = 0, C;

    if( y > roiTop+1 ) 
    {
      C = d_src[bidx];
      dilated |= C;
    }

    bidx += stride;
    if( y > roiTop ) 
    {
      C = d_src[bidx];
      dilated |= C;
      dilated |= (C >> 1) | ( d_src[bidx-1] << 7);
      dilated |= (C << 1) | ( d_src[bidx+1] >> 7);
    }

    bidx += stride;
    if( y <= roiBottom )
    {
      C = d_src[ bidx ];
      dilated |= C;
      dilated |= (C >> 2) | (d_src[ bidx - 1] << 6);
      dilated |= (C >> 1) | (d_src[ bidx - 1] << 7);
      dilated |= (C << 1) | (d_src[ bidx + 1] >> 7);
      dilated |= (C << 2) | (d_src[ bidx + 1] >> 6);
    }

    bidx += stride;
    if( y < roiBottom )
    {
      C = d_src[ bidx ];
      dilated |= d_src[ bidx ];
      dilated |= (C >> 1) | (d_src[ bidx - 1] << 7);
      dilated |= (C << 1) | (d_src[ bidx + 1] >> 7);
    }

    bidx += stride;
    if( y < roiBottom -1 )
    {
      C = d_src[bidx];
      dilated |= C;
    }

    bidx -= (stride<<1);
    if( y <= roiBottom )
    {
      d_dst[ bidx ] = dilated;
    }
  }
}   
The closing operation - 2-pixel dilation and erosion - runs at 1.84 GB/s. Refer to the profiling result below.
Step 1 : start 0 ns, end 103840 ns, duration 103840 ns, 38243.76 MB/s              : copying source image to device
Step 2 : start 323584 ns, end 837440 ns, duration 513856 ns, 7728.30 MB/s          : 2 pixel dilation
Step 3 : start 1730560 ns, end 2041888 ns, duration 311328 ns, 12755.78 MB/s       : erosion
Step 4 : start 2072288 ns, end 2153088 ns, duration 80800 ns, 49148.92 MB/s        : copying result image from device
Total : duration 2153088 ns, 1844.44 MB/s

Saturday, June 15, 2013

OpenCL - Dilation and Erosion in packed bit

With packed binary pixels - 1 bit per pixel - morphology can operate on a number of bits together, i.e. 8 pixels (1 byte) can be read, dilated and eroded in one go. When morphology works in units of bytes, the number of threads needed is the pixel count divided by 8. Below is an OpenCL kernel that does dilation over the 8-neighborhood.
__kernel void Dilation( 
  __global uchar *d_src, 
  __global uchar *d_dst,
  int width, int height,
  int roiLeft, int roiTop, int roiRight, int roiBottom )
{ 
  int x = (roiLeft & (~0x07)) + ( get_global_id(0) << 3 );
  int y = roiTop + get_global_id(1);
  int stride = width >> 3;

  if( x <= roiRight ) 
  {
    int idx = x + (y-1) * width;
    int bidx = idx >> 3;

    uchar dilated = 0, C;

    if( y > roiTop ) 
    {
      C = d_src[bidx];
      dilated |= C;                                   // North
      dilated |= (C >> 1) | ( d_src[bidx-1] << 7);    // North West
      dilated |= (C << 1) | ( d_src[bidx+1] >> 7);    // North East
    }

    bidx += stride;
    if( y <= roiBottom )
    {
      C = d_src[ bidx ];
      dilated |= C;                                    // Center
      dilated |= (C >> 1) | (d_src[ bidx - 1] << 7);   // West
      dilated |= (C << 1) | (d_src[ bidx + 1] >> 7);   // East
    }

    bidx += stride;
    if( y < roiBottom )
    {
      C = d_src[ bidx ];
      dilated |= d_src[ bidx ];                        // South
      dilated |= (C >> 1) | (d_src[ bidx - 1] << 7);   // South West
      dilated |= (C << 1) | (d_src[ bidx + 1] >> 7);   // South East
    }

    bidx -= stride;
    if( y <= roiBottom )
    {
      d_dst[ bidx ] = dilated;
    }
  }
}   
Below is an OpenCL kernel that does erosion over the 8-neighborhood.
__kernel
void Erosion(
  __global uchar *d_src, 
  __global uchar *d_dst,
  int width, int height,
  int roiLeft, int roiTop, int roiRight, int roiBottom )
{ 
  int x = (roiLeft & (~0x07)) + ( get_global_id(0) << 3 );
  int y = roiTop + get_global_id(1);
  int stride = width >> 3;

  if( x <= roiRight ) 
  {
    int idx = x + (y-1) * width;
    int bidx = idx >> 3;

    uchar eroded = 0xFF, C;

    if( y > roiTop ) 
    {
      C = d_src[bidx];
      eroded &= C;
      eroded &= (C >> 1) | ( d_src[bidx-1] << 7);
      eroded &= (C << 1) | ( d_src[bidx+1] >> 7);
    }

    bidx += stride;
    if( y <= roiBottom )
    {
      C = d_src[ bidx ];
      eroded &= C;                                     // Center
      eroded &= (C >> 1) | (d_src[ bidx - 1] << 7);    // West
      eroded &= (C << 1) | (d_src[ bidx + 1] >> 7);    // East
    }

    bidx += stride;
    if( y < roiBottom )
    {
      C = d_src[ bidx ];
      eroded &= C;                                     // South
      eroded &= (C >> 1) | (d_src[ bidx - 1] << 7);    // South West
      eroded &= (C << 1) | (d_src[ bidx + 1] >> 7);    // South East
    }

    bidx -= stride;
    if( y <= roiBottom )
    {
      d_dst[ bidx ] = eroded;
    }
  }
}   

Thursday, May 9, 2013

LibPng with png++

To debug an inspection algorithm, the most straightforward way is to look at the image. But to display an image, there are a lot of things to consider: window, panning, zooming, display coordinates, displaying pixel values, bits per pixel, pixel aliasing, speed, ... . One simple way is to save the image to a file and then open it with an off-the-shelf image viewer.

There are lots of image viewers; one is MeeSoft Image Analyzer. It supports many image formats, and displayed pixels are not anti-aliased.

There are also lots of image formats. Here I want a widespread, lossless, royalty-free format that supports binary packed pixels, so I chose the PNG file format.

To manipulate the PNG file format, there is LibPng for Windows, along with a C++ wrapper project, png++. To use LibPng, zlib is also needed for compression/decompression.

To set up, you need to download the binaries and developer files from the links above and extract them somewhere. I extracted them into a 3rdParty folder.

Then modify the project's 'Additional Include Directories' with the directories below. The "../" prefix depends on the relative depth from the project to the 3rdParty folder.
  "../../../3rdParty/libpng-1.2.37/include";
  "../../../3rdParty/zlib-1.2.3/include";
  "../../../3rdParty/png++-0.2.5";

Then modify the Library Directories with the path below.
"../../../3rdParty/libpng-1.2.37/lib"

Then add '__Win32' to the Preprocessor Definitions. This is for png++; without it, you will see the error below.
  /3rdparty/png++-0.2.5/config.hpp(57) : fatal error C1189: #error :  Byte-order could not be detected.

There is one more step that needs attention for zlib. zlib tries to include unistd.h on Windows, and there seems to be no way around it but to comment the line out. Maybe I missed a step in the setup. Anyway, with this line commented out, the project compiles and links, though I don't know the implications. Refer below for the change.
zconf.h at line 289

#if 1           /* HAVE_UNISTD_H -- this line is updated by ./configure */
#  include <sys/types.h> /* for off_t */
//#  include <unistd.h>    /* for SEEK_* and off_t */ // commented out 
...
#endif

When the project is executed, it will look for the DLLs for libpng and zlib. These could go into the system32 folder, but here I copied the files to the output folder using a 'Post-build event' as below.
  xcopy "../../../3rdParty/libpng-1.2.37/bin/libpng12.dll" "$(TargetDir)" /Y
  xcopy "../../../3rdParty/zlib-1.2.3/bin/zlib1.dll" "$(TargetDir)" /Y
Now everything is set, and code like the below will save an image to a PNG file.
#pragma warning( push, 3 )
#pragma warning( disable : 4996 4267 4355 )  // there are a number of warnings that may not be a problem at all ...
#include <png.hpp>
#pragma warning( pop )
#include "Image.hpp"     // Image<> and BinaryImage<> are defined in here
#include <fstream>       // for std::ofstream used in Write()

#pragma comment(lib, "libpng.lib" ) 

// Most of this template class comes from png++ example. 
template < typename TImage, typename PngPixelType >
class ImagePngWriter
    : public png::generator < PngPixelType, ImagePngWriter < TImage,PngPixelType > >
{
public:
    ImagePngWriter( TImage& image )
        : png::generator < PngPixelType, ImagePngWriter > ( image.GetWidth(), image.GetHeight() ), 
          m_Image( image )
    {}

    png::byte* get_next_row(size_t pos)
    {
        return reinterpret_cast < png::byte* > ( m_Image.GetPixelPtr( 0, (int)pos ) );
    }

    void Write( const char *filename )
    {
        std::ofstream file( filename, std::ios::binary);
        write(file);
    }

private:
    TImage& m_Image;
};

Usage:
  Image< boost::uint8_t > image( 640, 480 );
  ...
  ImagePngWriter< Image< boost::uint8_t >, png::gray_pixel > pngWriter( image );
  pngWriter.Write( "Gray.png" );


  BinaryImage<> image( 640, 480 );
  ...
  ImagePngWriter< BinaryImage<>, png::gray_pixel_1 > pngWriter( image );
  pngWriter.Write( "Binary.png" );

Tuesday, April 16, 2013

OpenCL - GPU, Threshold and binary image

After the pitch comparison, the intensity difference is compared with a threshold: if the pixel difference is above a certain value, the pixel is treated as a defect. What matters is whether the pixel is foreground or background, so the result is a binary value. As a binary value can be represented by 1 bit, and as it is expensive to read/write pixels through memory, the pixels are packed so as not to waste space.

To pack 8 binary pixels into a byte, a work item needs to read its neighbors' binary pixel values. These could be written to global memory and read back, but that would mean expensive write and read operations. Instead, we can write the binary results to local memory, read 8 pixels back, pack them into a byte, and write that byte to global memory.

Another thing to be careful about is the bit order: which bit to turn on for pixel index 0. Is it the most significant bit or the least significant bit? Here I use the most significant bit for pixel index 0 and decrease the bit position as the pixel index increases. This way it is easy to read a byte value and guess the location of each pixel.
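As a plain C++ sketch of this convention (hypothetical, mirroring the shift the kernel applies before the local-memory OR):

```cpp
#include <cstdint>

// MSB-first packing: pixel index 0 goes into the most significant bit,
// pixel index 7 into the least significant bit.
uint8_t PackBinaryPixels( const uint8_t binary[8] )   // each element is 0 or 1
{
    uint8_t packed = 0;
    for( int i = 0; i < 8; ++i )
        packed |= (uint8_t)( binary[i] << ( 7 - ( i & 0x07 ) ) );  // same shift as the kernel
    return packed;
}
```

With this convention, a byte value of 0x81 immediately reads as "first and last pixel of the group are set".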

Below is the kernel code

float PitchedSubPixel( __global uchar* src, int idx, float weightLeft, float weightRight )
{
  return (float)src[ idx ] * weightLeft + (float)src[ idx+1 ] * weightRight;
}

float PixelDifference( float center, float east, float west )
{
  float diff = ( center * 2.0f  - ( east + west ) ) / 2.0f;
  return diff >= 0 ? diff : -diff;
}

__kernel
void PitchThreshold(
  __global uchar *d_src, 
  __global uchar *d_dst,
  int width, int height,
  int roiLeft, int roiTop, int roiRight, int roiBottom,
  int integerPitch, float toCeiling, float toFloor,
  __local uchar *localDst,                                         // this is the local memory where temporary binary pixel is stored
  uchar threshold )
{ 
  int x = (roiLeft & (~0x07)) + get_global_id(0);                  // x always starts with multiple of 8 - byte aligned
  int y = roiTop + get_global_id(1);
  int idx = x + y * width;

  bool isInBound = x >= roiLeft && x <= roiRight && y <= roiBottom;

  float east = 0.0f, west = 0.0f, diff = 0.0f;
  if( isInBound ) 
  {
    east = PitchedSubPixel( d_src, idx-integerPitch-1, toCeiling, toFloor );
    west = PitchedSubPixel( d_src, idx+integerPitch, toFloor, toCeiling );
    diff = PixelDifference( (float)d_src[idx], east, west );
  }

  uchar binary = ((uchar)( diff ) <  threshold) ? 0 : 1;           // fixed threshold, if outside of ROI, then 0

  int localX = get_local_id(0);                                    // localX is 0 to 63. Refer host code
  int localIdx = get_local_id(1) * get_local_size(0) + localX;     // get_local_id(1) is 0 to 3. get_local_size(0) is 64 as width. Refer host code
  binary = binary << (7-(localX & 0x07));                          // i.e, if localX is 10, then it is 1 << 5,
  localDst[localIdx] = binary;                                     // putting the shifted binary value to local memory

  isInBound = x <= roiRight && y <= roiBottom;                     // ROI start at byte aligned
  
  barrier( CLK_LOCAL_MEM_FENCE );                                  // wait for the neighbor's local memory is written

  if( isInBound && (localX & 0x07) == 0 )
  {
    uchar packed =                                                 // read 8 pixels and pack as a byte 
      localDst[ localIdx   ] | localDst[ localIdx+1 ] | localDst[ localIdx+2 ] | localDst[ localIdx+3 ] | 
      localDst[ localIdx+4 ] | localDst[ localIdx+5 ] | localDst[ localIdx+6 ] | localDst[ localIdx+7 ] ; 

    d_dst[ idx >> 3 ] = packed;                                    // write to global memory
  }   
}
Below is the host code snippet
  const int LOCAL_WIDTH = 64, LOCAL_HEIGHT = 4;
  cl::NDRange localws( LOCAL_WIDTH, LOCAL_HEIGHT );
  cl::NDRange globalws( 
    LOCAL_WIDTH * (int)ceil( (double)(roiRight-roiLeft+1) / (double)LOCAL_WIDTH ), 
    LOCAL_HEIGHT * (int)ceil( (double)(roiBottom-roiTop+1) / (double)LOCAL_HEIGHT )
  );

  ...
  kernelPitch0.setArg( 11, LOCAL_WIDTH*LOCAL_HEIGHT*sizeof(uint8_t), NULL );  // request to allocate local memory
  kernelPitch0.setArg( 12, threshold );

  vector<cl::Event> ve1; ve1.push_back( steps[1] );
  queue.enqueueNDRangeKernel( kernelPitch0, cl::NullRange, globalws, localws, &ve1, &steps[2] );
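The globalws computation above just rounds the ROI extent up to the next multiple of the work-group dimension, so every pixel is covered by a full work group. The same arithmetic in integer form (a hypothetical helper, not part of the original host code):

```cpp
// Round n up to the next multiple of m; equivalent to m * ceil( n / m )
// as used for globalws above.
int RoundUpToMultiple( int n, int m )
{
    return ( ( n + m - 1 ) / m ) * m;
}
```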

With just the pitch comparison it was 1.4 GB/s, but now it is around 1.3 GB/s. Refer to the time profiling below.

Step 1 : start 0 ns, end 646304 ns, duration 646304 ns, 6144.53 MB/s
Step 2 : start 1086688 ns, end 2980640 ns, duration 1893952 ns, 2096.80 MB/s    <- was 3085 MB/s
Step 3 : start 3009344 ns, end 3090080 ns, duration 80736 ns, 49187.88 MB/s     <- was 6330 MB/s
Total : duration 3090080 ns, 1285.16 MB/s                                       <- was 1406 MB/s


Wednesday, April 3, 2013

OpenCL - one work group per CPU core

In the last post, I got around 1 GB/s throughput, but that was utilizing only one core of a dual-core CPU.

Digging into correct multi-core usage with a kernel from the Intel OpenCL SDK, I found that the number of work groups should equal the number of cores. That makes each work group sit on a single core. On this, the OpenCL 1.1 spec says:
Compute Unit: An OpenCL device has one or more compute units. A work-group executes on a single compute unit. A compute unit is composed of one or more processing elements and local memory. A compute unit may also include dedicated texture filter units that can be accessed by its processing elements.
When writing the host code for the GPU, I carefully assigned the work group size and local work item size. But in the CPU case, I didn't pay much attention to it, as there are just 2 cores, and also because of the Intel recommendation in "Writing Optimal OpenCL* Code with the Intel® OpenCL SDK":
2.7 Work-Group Size Considerations
We always recommend letting the OpenCL implementation to automatically determine the optimal work-group size (sometimes referred as “local work size”) for a given kernel. Simply pass NULL for a pointer to the local work size when calling clEnqueueNDRangeKernel.
But it turns out that the work group size and local work item size should be set according to the number of cores. The required change is below.
  // host code

  cl::NDRange globalws( 2 );  // 2 work items in total
  cl::NDRange localws( 1 );   // 1 work item per work group - makes up 2 work groups
  ...
  queue.enqueueNDRangeKernel( kernelPitch0, cl::NullRange, globalws, localws, NULL, &ev );
  ...

  // kernel code
  __kernel 
  __attribute__((reqd_work_group_size(1,1,1)))   // this can be specified but makes no difference for me.
  void Pitch( 
    __global const uchar *src, __global uchar *dst, 
    ...
This change produces around 2.1 GB/s throughput, as below.
...
Total : duration 1961850 ns, 2022.74 MB/s
Total : duration 1787610 ns, 2219.90 MB/s
Total : duration 1796190 ns, 2209.29 MB/s
...

Friday, March 29, 2013

OpenCL - CPU Explicit Vectorization


In the last post, I got a disappointing result with the CPU OpenCL code. It is time to find the right way to write code for the CPU side.

First of all, I tried a float version of the pitch comparison. This is to check whether implicit vectorization can make any difference with a 4-byte type. At least, the Intel OpenCL optimization book recommends using int and float.

Refer to the code below. It scores a meager 39.3 MB/s, so implicit vectorization with float still isn't good. It also tells us that float operations are still slightly slower than integer operations. If precision doesn't matter, it is better to stick to fixed-point arithmetic with an integer type than to use float, though the code is cleaner and easier to read in the float version. So if code maintenance matters, float can be helpful.
__kernel 
void Pitch( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int nPitch, 
  __const float toCeiling, __const float toFloor )
{
  int x = get_global_id(0);
  int y = get_global_id(1);
  int idx = y * width + x;

  float eastLeft = src[ idx + nPitch ];      // uchar pixel value is converted to float - 4 bytes
  float eastRight = src[ idx + nPitch + 1];
  float westLeft = src[ idx - nPitch - 1];
  float westRight = src[ idx - nPitch ];
  float center = src[ idx ];

  float east = toFloor * eastLeft + toCeiling * eastRight;
  float west = toFloor * westRight + toCeiling * westLeft;

  float diff = fabs( (2*center - (east+west)) / 2.0f );
 
  dst[ idx ] = (uchar)diff;                  // float difference value is converted back to uchar
}


The second try is the bold step of using just 2 work items. I expect each work item to run on its own core (my PC has a dual-core CPU). Then each work item has to loop through its share of the pixels in the image.

Refer to the code below. It scores 144 MB/s. This big jump in performance says this is the right way: the CPU side has to be coded quite differently from the GPU side. On a GPU, you launch lots of work items; on a CPU, you launch just as many work items as there are cores.

// Each kernel processes half of full image in column.  The blockWidth is ( roiRight - roiLeft ) / 2. 
__kernel 
void Pitch( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int roiLeft, __const int roiTop, __const int roiRight, __const int roiBottom,
  __const int blockWidth,   
  __const int nPitch, 
  __const short toCeiling, __const short toFloor )
{
  int i = get_global_id(0);

  for( int y = roiTop; y <= roiBottom; ++y )
  {
    for( int x = roiLeft + blockWidth*i; x < min( roiLeft + blockWidth*(i+1), roiRight + 1 ); ++x )
    {
      int idx = y * width + x;

      short eastLeft = src[ idx + nPitch ];
      short eastRight = src[ idx + nPitch + 1];
      short westLeft = src[ idx - nPitch - 1];
      short westRight = src[ idx - nPitch ];
      short center = src[ idx ];

      short east = toFloor * eastLeft + toCeiling * eastRight;
      short west = toFloor * westRight + toCeiling * westLeft;

      east = east >> 1;
      west = west >> 1;

      short sum = east + west;
      short c2 = center << 7;
      short diff = abs_diff( c2, sum );

      diff = diff >> 7;

      if ( x <= roiRight ) 
      {
        dst[ idx ] = (uchar)diff; 
      }

    }
  }
}
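The column partitioning in that kernel can be sketched separately. Below is a hypothetical helper (not in the post) that reproduces the inner-loop bounds: work item i covers blockWidth columns starting at roiLeft + blockWidth*i, with the last block clamped by the min() to roiRight + 1.

```cpp
#include <algorithm>

struct ColumnBlock { int first; int lastExclusive; };

// Reproduces the kernel's inner-loop bounds for work item i:
// x runs from roiLeft + blockWidth*i up to (exclusive)
// min( roiLeft + blockWidth*(i+1), roiRight + 1 ).
ColumnBlock BlockForWorkItem( int i, int blockWidth, int roiLeft, int roiRight )
{
    ColumnBlock b;
    b.first = roiLeft + blockWidth * i;
    b.lastExclusive = std::min( roiLeft + blockWidth * ( i + 1 ), roiRight + 1 );
    return b;
}
```

With a 100-column ROI split into blocks of 50, the two work items cover columns [0,50) and [50,100) with no overlap.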


Now it is time to combine the explicit loop with vector types. Here short8 is used as the primary vector type, considering that it fits a 128-bit xmm register with no loss. Also, the built-in function abs_diff() is used here with signed types, which is why I chose short8 over ushort8, though this is traded off against precision.

This code scores 986 MB/s: another big jump. But the best result with SSE instructions on one core is 940 MB/s. As this OpenCL code is supposed to use both cores fully, it could be a two-fold increase, but disappointingly it is just above the one-core best score. There is still wide room to improve.

__kernel 
void Pitch( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int roiLeft, __const int roiTop, __const int roiRight, __const int roiBottom,
  __const int blockWidth,
  __const int nPitch, 
  __const short toCeiling, __const short toFloor )
{
  int i = get_global_id(0);
  short8 toCeiling8 = toCeiling;  // toCeiling8 is filled with 8 of toCeiling.
  short8 toFloor8 = toFloor;

  for( int y = roiTop; y <= roiBottom; ++y )
  {
    int left = roiLeft + blockWidth*i;
    int right = left + blockWidth;

    for( int x = left; x < right; x+=8 )
    {
      int idx = y * width + x;

      short8 eastLeft = convert_short8( vload8( 0, src + idx + nPitch ) );      // uchar8 filled with 8 pixels is converted to ushort8.
      short8 eastRight = convert_short8( vload8( 0, src + idx + nPitch + 1 ) );
      short8 westLeft = convert_short8( vload8( 0, src + idx - nPitch - 1 ) );
      short8 westRight = convert_short8( vload8( 0, src + idx - nPitch ) );
      short8 center = convert_short8( vload8( 0, src + idx ) );

      short8 east = toFloor * eastLeft + toCeiling * eastRight;
      short8 west = toFloor * westRight + toCeiling * westLeft;
  
      east = east >> 1;
      west = west >> 1;
      center = center << 7;

      ushort8 diff = abs_diff( center, east+west );    // abs_diff is only available with signed type.

      diff = diff >> 7;
    
      if ( x <= roiRight ) 
      {
        vstore8( convert_uchar8( diff ), 0, dst + idx );  // convert short8 to uchar8 and stores 8 pixels at a time.
      }
    }

  }
}


In this analysis, I didn't include the memory copy time, because it isn't needed: on the CPU side you don't have to copy the buffer to device memory. The CPU's device memory is the host memory. So all we need to do is map the device memory, get a pointer, use it to manipulate the data, and unmap it.
  cl::Buffer devSrc( task.GetContext(), CL_MEM_READ_ONLY, width*height );
  cl::Buffer devDst( task.GetContext(), CL_MEM_WRITE_ONLY, width*height );

  ...
  // manipulate source image buffer. Note that we map to write
  uint8_t* hostSrcPtr = (uint8_t*)queue.enqueueMapBuffer( devSrc, CL_TRUE, CL_MAP_WRITE, 0, width*height );

  AttachedByteImage src( width, height, hostSrcPtr );
  ForEachPixels::Fill( src, SineWaveColumn( 20.2f ) );
  src.SetPixel( 50, 50, 50 );

  queue.enqueueUnmapMemObject( devSrc, hostSrcPtr );

  ...
  // run 2 kernels 
  cl::NDRange globalws( 2 );
  queue.enqueueNDRangeKernel( kernelPitch0, cl::NullRange, globalws, cl::NullRange, &ve0, &steps[1] );
  ... 

  // manipulate destination buffer. Note that we map to read
  uint8_t* hostDstPtr = (uint8_t*)queue.enqueueMapBuffer( devDst, CL_TRUE, CL_MAP_READ, 0, width*height );
  
  AttachedByteImage dst( width, height, hostDstPtr );
  ThresholdAboveCounter counter( 5 );
  ForEachPixels::Visit( dst, counter, Region( roiLeft, roiTop, roiRight, roiBottom), RowAfterRow  );
  BOOST_CHECK_EQUAL( counter.Count(), 5 );

  queue.enqueueUnmapMemObject( devDst, hostDstPtr );




Saturday, March 23, 2013

OpenCL - CPU Implicit Vectorization

Intel's OpenCL SDK boasts that its clever implicit vectorization (or autovectorization) can magically apply SIMD instructions across kernel work item boundaries. I had my doubts, but tried it.

Here is a naive translation of the pitch comparison to an OpenCL CPU kernel. I still use fixed-point arithmetic with short.

__kernel 
__attribute__((vec_type_hint(short8)))  // try to hint the compiler to pack 8 shorts to 128 bits xmm register
void Pitch0( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int nPitch, 
  __const short toCeiling, __const short toFloor )
{
  int x = get_global_id(0);
  int y = get_global_id(1);
  int idx = y * width + x;

  short eastLeft = src[ idx + nPitch ];
  short eastRight = src[ idx + nPitch + 1];
  short westLeft = src[ idx - nPitch - 1];
  short westRight = src[ idx - nPitch ];
  short center = src[ idx ];

  short east = toFloor * eastLeft + toCeiling * eastRight;
  short west = toFloor * westRight + toCeiling * westLeft;
  
  east = east >> 1;
  west = west >> 1;

  short sum = east + west;
  short c2 = center << 7;
  short diff = abs_diff( c2, sum );

  diff = diff >> 7;
  dst[ idx ] = (uchar)diff; 
}


// HOST code
  ...
  int integerPitch = (int)(pitch);
  short toFloor = (short)( ((pitch - integerPitch) * 0x7F)+0.5f);
  short toCeiling = (short)( 0x7F - toFloor );

  kernelPitch0.setArg( 0, devSrc );
  kernelPitch0.setArg( 1, devDst );
  kernelPitch0.setArg( 2, width );
  kernelPitch0.setArg( 3, height );
  kernelPitch0.setArg( 4, integerPitch );
  kernelPitch0.setArg( 5, toCeiling );
  kernelPitch0.setArg( 6, toFloor );

  cl::NDRange globalws( roiRight-roiLeft+1, roiBottom-roiTop+1 );
  cl::NDRange offset( roiLeft, roiTop );

  ...
  queue.enqueueNDRangeKernel( kernelPitch0, offset, globalws, cl::NullRange, &ve1, &e2 );
  ...
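The fixed-point weight setup above can be checked against plain float interpolation. Below is a hypothetical sketch (the names are mine, and the weight direction is simplified compared to the kernel): weights are scaled by 0x7F and the result is shifted back down by 7 bits, so the fixed-point value tracks the float value to within a couple of gray levels.

```cpp
#include <cstdint>

// Fixed-point linear interpolation in the style of the kernel: weights are
// scaled by 0x7F and the product is scaled back with >> 7. Note the slight
// approximation: the weights sum to 127 but >> 7 divides by 128.
short FixedLerp( uint8_t a, uint8_t b, float frac )   // frac in [0,1): weight of b
{
    short wB = (short)( frac * 0x7F + 0.5f );
    short wA = (short)( 0x7F - wB );
    return (short)( ( wA * (short)a + wB * (short)b ) >> 7 );
}
```

For example, FixedLerp(100, 200, 0.3f) gives 128, where exact float interpolation gives 130.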




This code gives me only 42 MB/s on an Intel G2120 CPU. As the SIMD version gave me nearly 1 GB/s, it is quite discouraging. I listed a number of points I am not sure about ...

  • Will there be too much context switching? Intel says it wants 10,000 to 100,000 instructions per kernel and warns that there will be too many context switches otherwise. Clearly the code above won't reach that many instructions.
  • Should I read the assembler code from the Intel Offline Compiler? I checked, but the assembler output is very long and seems to include framework code as well. Not an easy job, and there is not much information on the web either.
  • Can it be optimized with short as well? Intel mentions that it works well with 4-byte types like int or float, and that it prefers signed to unsigned.
  • I used events to synchronize kernel execution (refer to the OpenCL for NVidia code), though Intel says it is better to use blocking calls rather than non-blocking ones; Intel calls this explicit synchronization.
  • Intel says it does not want to be restricted by a local work size; it only wants the global work size specified. Then how can I be sure its cache access is optimized? Remember that a 64x4 local size gave a good result compared to 16x16.
  • Should I use '-cl-fast-relaxed-math'?


A number of points to try:

  • Remove the fixed-point arithmetic and use float instead of short
  • Process multiple pixels per work item. Maybe make just two work items so that each sits on one core.
  • Get your hands dirty: give up on the implicit magic and try explicit vectorization using vector data types


Reference
  • Writing Optimal OpenCL* Code with the Intel® OpenCL SDK, Intel
  • OpenCL in Action, Matthew Scarpino


Tuesday, March 12, 2013

CUDA - PTX code analysis

Remember that the OpenCL version was much faster than the CUDA version. With almost identical code, I was puzzled to see CUDA two times slower, and I thought a PTX assembly comparison could enlighten me.

Here is the OpenCL C code and its PTX assembly. The comments were added to make it easier to follow.

__kernel
void Pitch0(
  __global uchar *d_src, 
  __global uchar *d_dst,
  int width, int height,
  int roiLeft, int roiTop, int roiRight, int roiBottom,
  int integerPitch, float toCeiling, float toFloor )
{
  int x = roiLeft + get_global_id(0);
  int y = roiTop + get_global_id(1);
  int idx = x + y * width;

  bool isInBound = x <= roiRight && y <= roiBottom;

  float east = 
    (float)d_src[ idx - integerPitch - 1 ] * toCeiling +
    (float)d_src[ idx - integerPitch ] * toFloor;
  float west = 
    (float)d_src[ idx + integerPitch ] * toFloor + 
    (float)d_src[ idx + integerPitch + 1 ] * toCeiling;

  float diff = ( (float)d_src[idx] * 2.0f  - ( east + west ) ) / 2.0f;
  diff = diff >= 0 ? diff : -diff;

  if( isInBound )
  {
    d_dst[ idx ] = (uchar)( diff  );
  }
}

.entry Pitch0(
  .param .u32 .ptr .global .align 1 Pitch0_param_0,
  .param .u32 .ptr .global .align 1 Pitch0_param_1,
  .param .u32 Pitch0_param_2,
  .param .u32 Pitch0_param_3,
  .param .u32 Pitch0_param_4,
  .param .u32 Pitch0_param_5,
  .param .u32 Pitch0_param_6,
  .param .u32 Pitch0_param_7,
  .param .u32 Pitch0_param_8,
  .param .f32 Pitch0_param_9,
  .param .f32 Pitch0_param_10
)
{
  .reg .f32   %f< 18 >;
  .reg .pred   %p< 5 >;
  .reg .s32   %r< 32 >;
  .reg .s16   %rc< 7 >;


  ld.param.u32   %r11, [Pitch0_param_0];  // d_src
  ld.param.u32   %r12, [Pitch0_param_2];  // width
  ld.param.u32   %r13, [Pitch0_param_4];  // roiLeft
  ld.param.u32   %r14, [Pitch0_param_5];  // roiTop
  ld.param.u32   %r15, [Pitch0_param_6];  // roiRight
  ld.param.u32   %r16, [Pitch0_param_7];  // roiBottom
  ld.param.u32   %r17, [Pitch0_param_8];  // integerPitch
  ld.param.f32   %f2, [Pitch0_param_9];   // toCeiling
  ld.param.f32   %f3, [Pitch0_param_10];  // toFloor

  mov.u32   %r3, %envreg3;
  mov.u32   %r4, %ntid.x;
  mov.u32   %r5, %ctaid.x;
  mov.u32   %r6, %tid.x;
  add.s32   %r18, %r3, %r13;           
  mad.lo.s32   %r19, %r5, %r4, %r18;   // r19 = ctaid.x * ntid.x + roiLeft
  add.s32   %r20, %r19, %r6;           // r20 = x = roiLeft + get_global_id(0) = tid.x + ctaid.x * ntid.x + roiLeft 

  mov.u32   %r7, %envreg4;
  mov.u32   %r8, %ntid.y;
  mov.u32   %r9, %ctaid.y;
  mov.u32   %r10, %tid.y;
  add.s32   %r21, %r7, %r14;           
  mad.lo.s32   %r22, %r9, %r8, %r21;   // r22 = ctaid.x * ntid.y + roiTop
  add.s32   %r23, %r22, %r10;          // r23 = y = roiTop + get_global_id(1) = tid.y + ctaid.y * ntid.y + roiTop
  
  mad.lo.s32   %r2, %r23, %r12, %r20;  // r2 = idx = width * y + x
  
  setp.le.s32   %p1, %r23, %r16;       // p1 = y <= roiBottom
  setp.le.s32   %p2, %r20, %r15;       // p2 = x <= roiRight
  and.pred    %p3, %p1, %p2;           // p3 = isInBound = p1 & p2
  
  not.b32   %r24, %r17;                
  add.s32   %r25, %r2, %r24;           
  add.s32   %r26, %r11, %r25;           
  ld.global.u8   %rc1, [%r26];          
  cvt.rn.f32.u8   %f4, %rc1;           // f4 = float( d_src[ idx - integerPitch - 1] )
  ld.global.u8   %rc2, [%r26+1];       
  cvt.rn.f32.u8   %f5, %rc2;           // f5 = float( d_src[ idx - integerPitch] )
  mul.f32   %f6, %f5, %f3;             // f6 = f5 * toFloor
  fma.rn.f32   %f7, %f4, %f2, %f6;     // f7 = east = f4 * toCeiling + f5 * toFloor
  
  add.s32   %r27, %r2, %r17;           
  add.s32   %r28, %r11, %r27;    
  ld.global.u8   %rc3, [%r28];    
  cvt.rn.f32.u8   %f8, %rc3;           // f8 = float( d_src[ idx + integerPitch ] )
  ld.global.u8   %rc4, [%r28+1];    
  cvt.rn.f32.u8   %f9, %rc4;           // f9 = float( d_src[ idx + integerPitch + 1 ] )
  mul.f32   %f10, %f9, %f2;            // f10 = f9 * toCeiling
  fma.rn.f32   %f11, %f8, %f3, %f10;   // f11 = west = f8 * toFloor + f10 * toCeiling
  
  add.s32   %r29, %r11, %r2;           
  ld.global.u8   %rc5, [%r29];    
  cvt.rn.f32.u8   %f12, %rc5;          // f12 = d_src[ idx ]
  add.f32   %f13, %f12, %f12;          
  add.f32   %f14, %f7, %f11;           
  sub.f32   %f15, %f13, %f14;          // f15 = d_src[ idx ]*2 - (east+west)
  div.full.f32   %f16, %f15, 0f40000000; // f16 = diff = (d_src[ idx ]*2 - (east+west)) / 2.0
  setp.ltu.f32   %p4, %f16, 0f00000000;  // p4 = diff < 0
  neg.f32   %f17, %f16;                
  selp.f32   %f1, %f17, %f16, %p4;     // f1 = diff = (diff < 0) ? -diff : diff
  @%p3 bra   BB1_2;                    // if isInBound, jump BB1_2

  ret;

BB1_2:
  cvt.rzi.u16.f32   %rc6, %f1          // rc6 = ushort( diff )
  ld.param.u32   %r31, [Pitch0_param_1];  
  add.s32   %r30, %r31, %r2;           // r30 = d_dst + idx
  st.global.u8   [%r30], %rc6;         // d_dst[idx] = uchar( ushort(diff) )
  ret;
}



The CUDA version of the PTX code is similar. One noticeable difference is more 'ld.global' instructions, due to the structures used as kernel function arguments. Here is a snippet.

__global__ 
void KernelPitch0( 
  unsigned char* d_src, unsigned char* d_dst, 
  const Area* imgSize, const Region* roi, const PitchContext* pc )
{
  ...
  int idx = x + y * imgSize->Width;
  bool isInBound = x <= roi->Right && y <= roi->Bottom;
  if( isInBound )
  {
    float east = 
      (float)d_src[ idx - pc->IntegerPitch - 1 ] * pc->ToCeiling +
      (float)d_src[ idx - pc->IntegerPitch ] * pc->ToFloor;
    ...
}

.visible .entry _Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext(
  .param .u32 _Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_0,
  .param .u32 _Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_1,
  .param .u32 _Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_2,
  .param .u32 _Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_3,
  .param .u32 _Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_4
)
{
  ...
  ld.param.u32   %r9, [_Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_2];  // imgSize
  ld.param.u32   %r10, [_Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_3]; // roi
  ld.param.u32   %r11, [_Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_4]; // pc

  ...
  ld.global.u32   %r16, [%r4];   // r16 = roi.Left
  add.s32   %r17, %r15, %r16   // r17 = roi.Left + tid.x

  ... 
  ld.global.u32   %r22, [%r4+4];  // r22 = roi.Top
  add.s32   %r23, %r21, %r22;
  mad.lo.s32   %r5, %r19, %r20, %r23;         // r5 = y = ntid.y * ctaid.y + roi.Top + tid.y

  ...
  ld.global.u32   %r24, [%r12];   // r24 = imgSize.x
  mad.lo.s32   %r6, %r5, %r24, %r18;         // r6 = idx = imgSize.x * y + x 

  ...
  ld.global.u32   %r25, [%r4+8];  // r25 = roi.Right
  setp.gt.s32   %p1, %r18, %r25;  // p1 = x > roi.Right
  @%p1 bra   BB0_3;
  ...



With this difference in mind, I replaced all the structure-based arguments with value types, as in the OpenCL C version. Then I got 1 GB/s; before, it was a meager 428 MB/s.
Entering test case "TestProcess2"
Total time taken : 3955.87 micro sec 
Size of image : 3.98 MB 
Processing time : 1005.97 MB/s 
Leaving test case "TestProcess2"; testing time: 540ms


Using structures as kernel function arguments makes the structures reside in device global memory, and that inevitably requires the device to copy the structure members into thread registers. This eats up valuable memory traffic.

Without the structures, the function just needs a number of ld.param.u32 instructions. The ".param" state space is said to be defined per grid, so the physical location is likely global memory; NVidia says it is implementation-specific and that some implementations use global memory. But if it comes from global memory, then a structure argument shouldn't be two times slower than value-type arguments: either way, the data has to be loaded from global memory.

Another possible explanation may come from the locality of ld.param.u32. These instructions are issued at the beginning of the function to load all the argument values regardless of their usage in the code. I guess those arguments stay in close proximity, which can be good for the cache.

Another thing to note is the kernel time of 1.7 ms. With OpenCL, I got 1.2 ms, so there is still about 0.5 ms of slowness in CUDA.

shared memory version
Total time taken : 4406.42 micro sec 
Size of image : 3.98 MB 
Processing time : 903.11 MB/s 

texture memory version
Total time taken : 4372.73 micro sec 
Size of image : 3.98 MB 
Processing time : 910.07 MB/s 


A shared memory version and a texture memory version were tried, with the results above, but they are not as fast as the plain version. Another thing to figure out.


Friday, March 8, 2013

OpenCL - PTX code generation


With an NVidia GPU, you can dump the PTX assembly code using CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES. The binary you get back is actually PTX assembly, as demonstrated below.

The code below is from lecture 4 of Supercomputing on Graphics Cards - An Introduction to OpenCL and the C++ Bindings. For your convenience, here it is, copied with slight modifications.


const std::vector< size_t > binSizes = _Program.getInfo< CL_PROGRAM_BINARY_SIZES >();
std::vector< unsigned char > binData( 
  std::accumulate( binSizes.begin(), binSizes.end(), 0 )
);

unsigned char* binChunk = &binData[0];

std::vector< unsigned char* > binaries;
for( size_t i=0; i < binSizes.size(); ++i) {
  binaries.push_back( binChunk );
  binChunk += binSizes[i];
}

_Program.getInfo( CL_PROGRAM_BINARIES, &binaries[0] );

std::ofstream binaryFile( filename.c_str(), std::ios::binary );
if( ! binaryFile.good() )
  throw std::runtime_error( "Failed to open binary file for kernel" );

for( size_t i=0; i < binaries.size(); ++i)
  binaryFile << binaries[i];



Just to compare the result with the previous post, the same kernel code, modified to OpenCL style, is used as below.
__kernel
void KernelFourInstructions1( float a, float b, float c, __global float *d_a ) {
  a = a * b + c;
  d_a[ get_global_id(0) ] = a;
}

And finally, the generated PTX code is as below, with comments added.
//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Sat Sep 29 23:04:42 2012 (1348927482)
// Driver 
//

.version 3.0
.target sm_30, texmode_independent
.address_size 32

.entry KernelFourInstructions1(
  .param .f32 KernelFourInstructions1_param_0,  // float a
  .param .f32 KernelFourInstructions1_param_1,  // float b
  .param .f32 KernelFourInstructions1_param_2,  // float c
  .param .u32 .ptr .global .align 4 KernelFourInstructions1_param_3 // float *d_a
)
{
  .reg .f32   %f< 5 >;
  .reg .s32   %r< 10 >;

  ld.param.f32   %f1, [KernelFourInstructions1_param_0];
  ld.param.f32   %f2, [KernelFourInstructions1_param_1];
  ld.param.f32   %f3, [KernelFourInstructions1_param_2];
  ld.param.u32   %r5, [KernelFourInstructions1_param_3];
  fma.rn.f32   %f4, %f1, %f2, %f3;   // f4 <- a * b + c
  mov.u32   %r1, %envreg3;           // What is envreg3 ? Must be zero.
  mov.u32   %r2, %ntid.x;            // ntid.x is 1024 on sm_30
  mov.u32   %r3, %ctaid.x;           // CTA identifier within a grid
  mov.u32   %r4, %tid.x;             // thread identifier within a CTA
  add.s32   %r6, %r4, %r1;           // r6 <- tid.x + envreg3
  mad.lo.s32   %r7, %r3, %r2, %r6;   // r7 <- ctaid.x * ntid.x + r6 : equivalent of get_global_id(0)
  shl.b32   %r8, %r7, 2;           
  add.s32   %r9, %r5, %r8;           // r9 <- r5 + r8 : &d_a[ get_global_id(0) ]
  st.global.f32   [%r9], %f4;        // [ r9 ] <- f4
  ret;
}


This PTX code is slightly bloated compared to the previous CUDA PTX code. It is due to get_global_id(0) being used instead of threadIdx.x: get_global_id(0) has to account for the block size, so it must refer to ntid and ctaid.

There is one alien register, "envreg3". The PTX documentation says these are driver-defined read-only registers, described in the driver documentation - which I can't find anywhere. The value should be zero in the above context. I guess OpenCL run-time kernel compilation allows NVidia to use very hardware-specific variables, and maybe that is the real strength of OpenCL run-time compilation versus CUDA static compilation.
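As a sanity check, the index arithmetic in the PTX above can be mirrored in plain C++. This is a minimal sketch, not an OpenCL API; GlobalId is a hypothetical helper name.

```cpp
#include <cassert>

// Sketch of the get_global_id(0) computation seen in the PTX above:
// mad.lo.s32 %r7, %ctaid.x, %ntid.x, (%tid.x + %envreg3)
// ctaid = work-group id, ntid = work-group size, tid = local id,
// envreg3 = driver-supplied global offset (zero in this context).
inline unsigned GlobalId( unsigned ctaid, unsigned ntid, unsigned tid,
                          unsigned envreg3 = 0 )
{
    return ctaid * ntid + ( tid + envreg3 );
}
```

With envreg3 at zero this reduces to the familiar ctaid.x * ntid.x + tid.x of the CUDA version.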

Tuesday, March 5, 2013

CUDA - PTX assembly 1

One way to figure out why the OpenCL output is two times faster than CUDA is to look at the machine code. Reading assembly code takes some training, though. From now on, there will be a number of posts on CUDA PTX assembly code.

One of the CUDA NVCC compiler options is -ptx, which generates PTX assembly code; the output can be found at xxx.cu.obj.

Below is a .cu code and its PTX assembly code with comments added.


__global__ 
void KernelFourInstructions1( float a, float b, float c, float *d_a )
{
  a = a * b + c;
  d_a[ threadIdx.x ] = a;
}


.visible .entry _Z23KernelFourInstructions1fffPf(
  .param .f32 _Z23KernelFourInstructions1fffPf_param_0,  // float a
  .param .f32 _Z23KernelFourInstructions1fffPf_param_1,  // float b
  .param .f32 _Z23KernelFourInstructions1fffPf_param_2,  // float c
  .param .u32 _Z23KernelFourInstructions1fffPf_param_3   // float *d_a
)
{
  .reg .s32   %r<6>;  // local variable r0 to r5 as 32 bit signed integer
  .reg .f32   %f<5>;  // local variable f0 to f4 as 32 bit floating point 


  ld.param.f32   %f1, [_Z23KernelFourInstructions1fffPf_param_0]; 
  ld.param.f32   %f2, [_Z23KernelFourInstructions1fffPf_param_1];  
  ld.param.f32   %f3, [_Z23KernelFourInstructions1fffPf_param_2];
  ld.param.u32   %r1, [_Z23KernelFourInstructions1fffPf_param_3];
  cvta.to.global.u32   %r2, %r1;    // r2 <- d_a address as global memory
  .loc 2 31 1                       // .loc is debugging directive pointing source file location
  fma.rn.f32   %f4, %f1, %f2, %f3;  // f4 <- a * b + c
  .loc 2 33 1
  mov.u32   %r3, %tid.x;
  shl.b32   %r4, %r3, 2;
  add.s32   %r5, %r2, %r4;          // r5 <- &d_a[ threadIdx.x ]
  st.global.f32   [%r5], %f4;       // d_a[ threadIdx.x ] <- f4
  .loc 2 34 2
  ret;
}


It is a bit surprising to see that it needs 6 integer registers and 5 floating point registers. For example, r3 could be reused as the destination of shl.b32 and add.s32, or f1 as the destination of fma.rn.f32. I am wondering whether the code below does exactly the same thing with fewer local variables. (PTX registers are virtual, though; the real register allocation happens later in ptxas, so this count doesn't directly dictate hardware register usage.)

  ...
  cvta.to.global.u32   %r2, %r1;
  fma.rn.f32   %f1, %f1, %f2, %f3;  // f1 <- a * b + c, f1 instead of f4
  mov.u32   %r3, %tid.x;
  shl.b32   %r3, %r3, 2;            // r3 is reused as destination
  add.s32   %r3, %r2, %r3;          // again r3 is reused
  st.global.f32   [%r3], %f1;       
  ...



Reference : Parallel Thread Execution ISA Version 3.1, NVIDIA

Thursday, February 28, 2013

OpenCL - pinned memory

Handling pageable memory must require constant intervention from the CPU side, or maybe DMA transfers can't be employed in full. I don't know the exact cause, but NVidia asks you to use pinned memory for critical operations and says you may get 5 GB/s. Remember the last post, which showed a copy throughput of around 3.3 GB/s.

Below is the modified code snippet for pinned memory. Host memory is allocated with the flag CL_MEM_ALLOC_HOST_PTR, and the raw pointer is obtained with the mapping function. These mapped raw pointers are then used for the buffer copy operations. It is a bit odd to use the raw pointer rather than the host buffer handle, but I guess it's for maintaining the function interface.

...
cl::Buffer hostSrc( task.GetContext(), CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, width*height );
cl::Buffer hostDst( task.GetContext(), CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, width*height );

uint8_t *hostSrcPtr = 
  (uint8_t*) 
  queue.enqueueMapBuffer( hostSrc, CL_TRUE, CL_MAP_WRITE, 0, width*height );
uint8_t *hostDstPtr = 
  (uint8_t*) 
  queue.enqueueMapBuffer( hostDst, CL_TRUE, CL_MAP_READ, 0, width*height );

...
queue.enqueueWriteBuffer( devSrc, CL_FALSE, 0, width*height, hostSrcPtr, &ve0, &steps[1] );
...
queue.enqueueNDRangeKernel( kernelPitch0, cl::NullRange, globalws, localws, &ve1, &steps[2] );
...
queue.enqueueReadBuffer( devDst, CL_FALSE, 0, width*height, hostDstPtr, &ve2, &steps[3]  );


The result is as expected. The achieved memory copy throughput is around 6.1 GB/s, and it brings the overall throughput to 1.4 GB/s. First time over a gigabyte per second. Hurray!

Entering test case "TestPitch0Pinned"
Step 1 : start 0 ns, end 647168 ns, duration 647168 ns, 6136.32 MB/s
Step 2 : start 879072 ns, end 2166112 ns, duration 1287040 ns, 3085.55 MB/s
Step 3 : start 2196800 ns, end 2824128 ns, duration 627328 ns, 6330.39 MB/s
Total : duration 2824128 ns, 1406.18 MB/s


Reference : OpenCL Best Practice Guide, Chapter 3.1.1. Pinned Memory

Wednesday, February 27, 2013

OpenCL - first pitch comparison

Here, the pitch comparison will be tested with OpenCL from the NVidia CUDA toolkit.

First, to simplify the platform, device, context and program creation, I made a new class, Workbench, which does all the chores. Refer to the code below. Note that this class relies on the reference counting of the OpenCL C++ wrapper.


//
// usage :
//   Workbench task;
//   task.ConnectDevice( "GT 640" )
//       .BuildContext()
//       .BuildProgram( "OpenCLKernels.cl" );
//
class Workbench
{
public:
  Workbench() {}

  Workbench& ConnectDevice( const string& deviceName )
  {
    vector< cl::Platform > platforms;  
    cl::Platform::get( &platforms );

    for( vector< cl::Platform >::iterator I=platforms.begin(); I!=platforms.end(); ++I )
    {
      cl::Platform& platform( *I );
      
      vector< cl::Device > devices;
      platform.getDevices( CL_DEVICE_TYPE_ALL, &devices );

      for( vector< cl::Device >::iterator J=devices.begin(); J!=devices.end(); ++J )
      {
        cl::Device& device( *J );
        string discoveredName( device.getInfo< CL_DEVICE_NAME >() );

        if( discoveredName.find( deviceName ) != discoveredName.npos )
        {
          _Devices.push_back( device );
          return *this;
        }
      }
    }

    throw std::runtime_error( string("Cannot find the device with the name ") + deviceName );
  }

  Workbench& BuildContext()
  {
    _Context = cl::Context( GetConnectedDevices() );
    return *this;
  }

  Workbench& BuildProgram( const string& sourceFile )
  {
    std::ifstream programFile( sourceFile.c_str() );
    std::string programString( 
      std::istreambuf_iterator< char >(programFile), 
      (std::istreambuf_iterator< char >()));
    cl::Program::Sources source( 
      1, 
      std::make_pair( programString.c_str(), programString.length()+1 )
      );
    _Program = cl::Program( _Context, source );
    _Program.build( GetConnectedDevices() );
    return *this;
  }

  vector< cl::Device >& GetConnectedDevices() { return _Devices; }
  cl::Context& GetContext() { return _Context; }
  cl::Program& GetProgram() { return _Program; }
    

private:
  vector< cl::Device > _Devices;
  cl::Context _Context;
  cl::Program _Program;
};

Then the host code snippet is as below
...
ByteImage src( width, height ), dst( width, height );
...
Workbench task;
task.ConnectDevice( "GT 640" )
  .BuildContext()
  .BuildProgram( "OpenCLKernels.cl" );

cl::Kernel kernelPitch0( task.GetProgram(), "Pitch0" );
cl::CommandQueue queue( task.GetContext(), task.GetConnectedDevices()[0], CL_QUEUE_PROFILING_ENABLE );
cl::Buffer devSrc( task.GetContext(), CL_MEM_READ_ONLY, width*height );
cl::Buffer devDst( task.GetContext(), CL_MEM_WRITE_ONLY, width*height );
...

vector< cl::Event > steps;
steps.push_back( cl::UserEvent( task.GetContext() ) );
steps.push_back( cl::Event() );
steps.push_back( cl::Event() );
steps.push_back( cl::Event() );

vector< cl::Event > ve0; ve0.push_back( steps[0] );
queue.enqueueWriteBuffer( devSrc, CL_FALSE, 0, width*height, src.GetPixelPtr(), &ve0, &steps[1] );

kernelPitch0.setArg( 0, devSrc );
kernelPitch0.setArg( 1, devDst );
...

const int LOCAL_WIDTH = 64, LOCAL_HEIGHT = 4;
cl::NDRange localws( LOCAL_WIDTH, LOCAL_HEIGHT );
cl::NDRange globalws( 
  LOCAL_WIDTH * (int)ceil( (double)(roiRight-roiLeft+1) / (double)LOCAL_WIDTH ), 
  LOCAL_HEIGHT * (int)ceil( (double)(roiBottom-roiTop+1) / (double)LOCAL_HEIGHT )
);

vector< cl::Event > ve1; ve1.push_back( steps[1] );
queue.enqueueNDRangeKernel( kernelPitch0, cl::NullRange, globalws, localws, &ve1, &steps[2] );

vector< cl::Event > ve2; ve2.push_back( steps[2] );
queue.enqueueReadBuffer( devDst, CL_FALSE, 0, width*height, dst.GetPixelPtr(), &ve2, &steps[3]  );

static_cast< cl::UserEvent* >( &steps[0] )->setStatus( CL_COMPLETE );
steps[3].wait();
queue.finish();

...
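The globalws computation above rounds the ROI dimensions up to a multiple of the local work size using ceil(). The same arithmetic can be done purely in integers; this is a minimal sketch, and RoundUpToMultiple is a hypothetical helper, not part of the OpenCL API.

```cpp
#include <cassert>

// Round value up to the next multiple of 'multiple', as the ceil()-based
// globalws computation does for the work-group dimensions.
inline int RoundUpToMultiple( int value, int multiple )
{
    return multiple * ( ( value + multiple - 1 ) / multiple );
}
```

For example, an ROI width of 1000 with a local width of 64 yields a global width of 1024; out-of-range work items are masked off by the isInBound check in the kernel.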
The device code is as below
__kernel
void Pitch0(
  __global uchar *d_src, 
  __global uchar *d_dst,
  int width, int height,
  int roiLeft, int roiTop, int roiRight, int roiBottom,
  int integerPitch, float toCeiling, float toFloor )
{
  int x = roiLeft + get_global_id(0);
  int y = roiTop + get_global_id(1);
  int idx = x + y * width;

  bool isInBound = x <= roiRight && y <= roiBottom;

  float east = 
    (float)d_src[ idx - integerPitch - 1 ] * toCeiling +
    (float)d_src[ idx - integerPitch ] * toFloor;
  float west = 
    (float)d_src[ idx + integerPitch ] * toFloor + 
    (float)d_src[ idx + integerPitch + 1 ] * toCeiling;

  float diff = ( (float)d_src[idx] * 2.0f  - ( east + west ) ) / 2.0f;
  diff = diff >= 0 ? diff : -diff;

  if( isInBound )
  {
    d_dst[ idx ] = (uchar)( diff  );
  }
}

The result is as below
Step 1 : start 0 ns, end 1165280 ns, duration 1165280 ns, 3407.96 MB/s
Step 2 : start 1388384 ns, end 2675712 ns, duration 1287328 ns, 3084.86 MB/s
Step 3 : start 2854816 ns, end 4050816 ns, duration 1196000 ns, 3320.43 MB/s
Total : duration 4050816 ns, 980.35 MB/s


The memory copy from host to device and from device to host is around 1.2 milliseconds (ms). This is similar to the non-pinned memory copy time, and it means I have to find an equivalent of pinned memory in OpenCL.

But the somewhat confusing result is the kernel time, which is just around 1.3 ms. This is almost two times faster than the CUDA kernel time - it took 2.1 ms at best with the texture access version.

I don't expect that NVidia spent more time optimizing the OpenCL compiler than the CUDA compiler. Or is Apple such a big customer that it can't be disappointed?


Saturday, February 23, 2013

OpenCL - Multiple SDKs

As multiple vendors support OpenCL, it should be possible to load up multiple devices in one application using any single vendor's OpenCL SDK.

Here I have the NVidia CUDA SDK and the Intel OpenCL SDK installed on my machine. And here is code to enumerate all the devices in all the platforms.

  vector< cl::Platform > platforms;  

  try 
  {
    cl::Platform::get( &platforms );

    for( cl_uint n=0; n < platforms.size(); ++n )
    {
      string platformName = platforms[n].getInfo< CL_PLATFORM_NAME >();
      cout << "Platform : @ " << platformName.c_str() << endl;
      
      vector< cl::Device > devices, ctxDevices;

      platforms[n].getDevices( CL_DEVICE_TYPE_ALL, &devices );
      
      cl::Context context( devices );
      ctxDevices = context.getInfo< CL_CONTEXT_DEVICES >();
      for( cl_uint i=0; i < ctxDevices.size(); ++i )
      {
        string deviceName = ctxDevices[i].getInfo< CL_DEVICE_NAME >();
        cout << "Device: @ " << deviceName.c_str() << endl;
      }
    }
  }
  catch( cl::Error e ) {
    cout << e.what() << ": Error Code " << e.err() << endl;
  }
 
The above code is linked against the OpenCL.lib that comes with the NVidia CUDA SDK. It prints the output below, enumerating all the OpenCL-enabled devices, including the Intel OpenCL device.
 
Platform : @ NVIDIA CUDA
Device: @ GeForce GT 640
Platform : @ Intel(R) OpenCL
Device: @         Intel(R) Pentium(R) CPU G2120 @ 3.10GHz

This collaboration is based on the registry key \HKLM\SOFTWARE\Khronos\OpenCL\Vendors . I guess OpenCL.lib must load up all the DLLs listed there and ask each one to expose the devices under its control.

Sunday, February 17, 2013

Next Move

CUDA
  • Using GT 640. PTX instructions. VLIW aware kernel programming

SSE
  • Any benefit using store without cache ?
  • Calculate throughput.
    • EB = ( Br + Bw ) / T , EB : effective bandwidth, Br : read bytes, Bw : written bytes, T : time

HW
  • Study hardware - cache, bank and so on

OpenCL
  • Using NVidia OpenCL with GT 640.  Compare result with CUDA. Using buffer. Using Image.
  • Using NVidia OpenCL with GT 640.  vector type 
  • Using Intel OpenCL. Compare result with SSE code
  • Run kernels on GPU GT 640 and CPU G2120 at the same time.
  • Buy 3rd generation processor i3 - 3220 - to access GPU HD2500.

Inspection algorithm
  • Image preprocessing
    • Geometric Distortion correction - radial in area camera
    • Shading correction - Area, line scan. How to define the compensation ?
  • periodic pattern inspection - 2 points ( hor, ver ), 4 points, 8 points, many points in horizontal and vertical
  • threshold - constant, adaptive threshold, ...
  • binary image handling
  • morphology - open, close, kernel size, iteration count, iteration order
  • segmentation - contour based, region based, bug follower, area, bounding box calculation

Distributed system
  • Group communication to manage multiple inspection processor
  • Join/Leave the group
  • Send packet that goes out to all group member fast and reliable using broadcast or multi-cast. 
  • Designated member can get the burst of REP from all the group member. 
  • A processor can join more than 1 group
  • Streaming images to a member - or designated member
  • REQ, REP, ACK, AYA( Are you alive ? ), IAA ( I am alive ), TA( Try Again ), AU ( Address unknown )
  • Make API that is flexible and easy to use ?  Allow to have different type of REQ possible ?  Refer other API for network ?
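The effective-bandwidth formula listed under SSE above can be written directly as a small helper. This is a sketch; EffectiveBandwidthMBps is a hypothetical name, and MB here means 1024*1024 bytes to match the throughput figures reported in these posts.

```cpp
#include <cassert>

// Effective bandwidth: EB = (Br + Bw) / T.
// Returns MB/s given bytes read, bytes written, and elapsed seconds.
inline double EffectiveBandwidthMBps( double readBytes, double writtenBytes,
                                      double seconds )
{
    return ( readBytes + writtenBytes ) / seconds / ( 1024.0 * 1024.0 );
}
```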

Wednesday, February 13, 2013

VC++ - intrinsic optimization

VC++ does not support the inline assembler when compiling for the x64 target. But it says you don't have to use the inline assembler, as there are intrinsics: functions that map to assembler instructions.

void SSE_Pitch0( 
  const uint8_t *h_src, uint8_t *h_dst, 
  int width, int height, 
  int roiLeft, int roiTop, int roiRight, int roiBottom,
  float horPitch, float verPitch )
{
  int sizeOfXMMInWords = SSE::SizeOfDataType / 2;
  int integerPitch = (int)horPitch;

  uint16_t toFloor = (uint16_t)( (horPitch - (int)horPitch) * 0xFF);
  uint16_t toCeiling = 0xFF - toFloor;

  std::vector< uint16_t > toFloors( sizeOfXMMInWords, toFloor );
  std::vector< uint16_t > toCeilings( sizeOfXMMInWords, toCeiling );

  __m128i xmmFloor, xmmCeiling;
  xmmFloor = _mm_loadu_si128( (const __m128i*)( &toFloors.front() ) );
  xmmCeiling = _mm_loadu_si128( (const __m128i*)( &toCeilings.front() ) );

  for ( int y=roiTop; y<=roiBottom; y++ ) 
  {
    int x = roiLeft;
    const uint8_t *source = h_src + width*y+x;
    uint8_t *target = h_dst + width*y+x;

    for ( ; x<=roiRight; 
      x+=sizeOfXMMInWords, source+=sizeOfXMMInWords, target+=sizeOfXMMInWords ) 
    {
      __m128i xmmTemp, xmmLeft, xmmRight, xmmEast, xmmWest, xmmCenter;

      // _mm_cvtepu8_epi16 can't load value from memory directly unlike PMOVZXBW
      xmmTemp = _mm_loadu_si128( (const __m128i*)(source-integerPitch-1) );
      xmmEast = _mm_cvtepu8_epi16( xmmTemp );
      xmmTemp = _mm_srli_si128( xmmTemp, 1 );
      xmmWest = _mm_cvtepu8_epi16( xmmTemp );

      xmmEast = _mm_mullo_epi16( xmmEast, xmmFloor );
      xmmWest = _mm_mullo_epi16( xmmWest, xmmCeiling );
      xmmLeft = _mm_adds_epu16( xmmEast, xmmWest );
      xmmLeft = _mm_srli_epi16( xmmLeft, 1 );

      xmmTemp = _mm_loadu_si128( (const __m128i*)(source+integerPitch) );
      xmmEast = _mm_cvtepu8_epi16( xmmTemp );
      xmmTemp = _mm_srli_si128( xmmTemp, 1 );
      xmmWest = _mm_cvtepu8_epi16( xmmTemp );

      xmmEast = _mm_mullo_epi16( xmmEast, xmmCeiling );
      xmmWest = _mm_mullo_epi16( xmmWest, xmmFloor );
      xmmRight = _mm_adds_epu16( xmmEast, xmmWest );
      xmmRight = _mm_srli_epi16( xmmRight, 1 );

      xmmTemp = _mm_loadu_si128( (const __m128i*)(source) );
      xmmCenter = _mm_cvtepu8_epi16( xmmTemp );
      xmmCenter = _mm_slli_epi16( xmmCenter, 8 );
      xmmTemp = xmmCenter;

      xmmCenter = _mm_subs_epu16( xmmCenter, xmmLeft );
      xmmCenter = _mm_subs_epu16( xmmCenter, xmmRight );

      xmmLeft = _mm_adds_epu16( xmmLeft, xmmRight);
      xmmLeft = _mm_subs_epu16( xmmLeft, xmmTemp );

      xmmCenter = _mm_adds_epu16( xmmCenter, xmmLeft );
      xmmCenter = _mm_srli_epi16( xmmCenter, 8 );
      xmmTemp = _mm_xor_si128( xmmTemp, xmmTemp);
      xmmTemp = _mm_packus_epi16( xmmCenter, xmmTemp );

      _mm_storel_epi64( (__m128i*)(target), xmmTemp );
    }
  }
}
 
Surprisingly, this achieves just 160 MB/s of throughput. The reason is obvious when we look at the disassembled code: the compiler doesn't utilize the 16 XMM registers. It seems to think there is only one XMM register, constantly pushing/popping XMM register values to the stack. This doesn't improve with the /arch:SSE2 option or other optimization flags.

__m128i xmmTemp, xmmLeft, xmmRight, xmmEast, xmmWest, xmmCenter;
   xmmTemp = _mm_loadu_si128( (const __m128i*)(source-integerPitch-1) );
00403616  mov         edx,dword ptr [source] 
0040361C  sub         edx,dword ptr [ebp-1Ch] 
0040361F  sub         edx,1 
00403622  movdqu      xmm0,xmmword ptr [edx] 
00403626  movdqa      xmmword ptr [ebp-0D0h],xmm0 
0040362E  movdqa      xmm0,xmmword ptr [ebp-0D0h] 
00403636  movdqa      xmmword ptr [xmmTemp],xmm0 
   xmmEast = _mm_cvtepu8_epi16( xmmTemp );
0040363E  pmovzxbw    xmm0,mmword ptr [xmmTemp] 
00403647  movdqa      xmmword ptr [ebp-0F0h],xmm0 
0040364F  movdqa      xmm0,xmmword ptr [ebp-0F0h] 
00403657  movdqa      xmmword ptr [xmmEast],xmm0 
   xmmTemp = _mm_srli_si128( xmmTemp, 1 );
0040365F  movdqa      xmm0,xmmword ptr [xmmTemp] 
00403667  psrldq      xmm0,1 
0040366C  movdqa      xmmword ptr [ebp-110h],xmm0 
00403674  movdqa      xmm0,xmmword ptr [ebp-110h] 
0040367C  movdqa      xmmword ptr [xmmTemp],xmm0 

The compiler used here is VC++ 2008 SP1. On the internet, some say VC++ 2010 is much better than the old version. But as the Intel compiler and GCC still allow inline assembler, is there still merit in pursuing intrinsics? Would the compiler produce better code than hand-written assembly in the end? I guess the answer depends. Still, the fact stands that the VC++ 2008 intrinsics support is incompetent for SSE.

Tuesday, February 12, 2013

SSE - Pitch comparison

How fast can it be processed on the CPU side? We have seen that the GPU can achieve around 700 MB/s of throughput. But what if we let the CPU process it with all its strength?

The CPU has SIMD instructions, and they suit this algorithm quite well. The pitch comparison algorithm asks for the difference at each pixel. Each output pixel can be computed independently; it doesn't depend on any other pixel's intermediate processing result.

Also, the pitch comparison doesn't need high precision. Who cares whether the difference is 10.1 or 10.2? Pixel values are 0-255, and we know each pixel can vary by 1 or 2 levels anyway, so all we need is at most one digit after the decimal point. This allows us to employ fixed-point arithmetic.

16 bits is more than enough for the pitch comparison. SSE provides 128-bit wide XMM registers, and each XMM register can hold 8 words - eight 16-bit pixels - so each instruction can process 8 pixels at a time.
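As a hedged scalar sketch of that fixed-point scheme (Interpolate is a hypothetical helper, not the actual SSE code): the sub-pixel weights are scaled by 0xFF so the products stay within 16 bits, and a shift by 8 recovers an 8-bit result.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the fixed-point interpolation the SSE code vectorizes:
// weights scaled by 0xFF, products summed in 16 bits, shifted back to 8 bits.
inline uint8_t Interpolate( uint8_t atFloor, uint8_t atCeiling, float fraction )
{
    uint16_t toFloor   = static_cast< uint16_t >( fraction * 0xFF );
    uint16_t toCeiling = 0xFF - toFloor;
    return static_cast< uint8_t >(
        ( atFloor * toCeiling + atCeiling * toFloor ) >> 8 );
}
```

A fraction of 0 weights the floor pixel almost fully, and a fraction of 1 the ceiling pixel; the off-by-one from using 0xFF instead of 0x100 is well within the precision argued above.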

VC++ supports inline assembler on the x86 target. Here is the code.

void SSE_Pitch0( 
  const uint8_t *h_src, uint8_t *h_dst, 
  int width, int height, 
  int roiLeft, int roiTop, int roiRight, int roiBottom,
  float horPitch, float verPitch )
{
  int sizeOfDataTypeInWords = SSE::SizeOfDataType / 2;
  int integerPitch = (int)horPitch;

  uint16_t toFloor = (uint16_t)( (horPitch - (int)horPitch) * 0xFF);
  uint16_t toCeiling = 0xFF - toFloor;

  std::vector< uint16_t > toFloors( sizeOfDataTypeInWords, toFloor );
  std::vector< uint16_t > toCeilings( sizeOfDataTypeInWords, toCeiling );
  const uint16_t* toFloorsPtr = &toFloors.front();
  const uint16_t* toCeilingsPtr = &toCeilings.front();

  _asm {
    MOV     ECX, toFloorsPtr  
    MOVDQU  XMM6, [ECX]
    MOV     EDX, toCeilingsPtr
    MOVDQU  XMM7, [EDX]
  }

  for ( int y=roiTop; y<=roiBottom; y++ ) 
  {
    for ( int x=roiLeft; x<=roiRight; x+=sizeOfDataTypeInWords ) 
    {
      const uint8_t *source = h_src + width*y+x;
      uint8_t *target = h_dst + width*y+x;
      
      _asm {

        MOV      ECX, source  
        SUB      ECX, integerPitch  // east pitch pixel
        PMOVZXBW XMM1, [ECX-1]  // left
        PMOVZXBW XMM2, [ECX]    // right

        PMULLW   XMM1, XMM6  
        PMULLW   XMM2, XMM7
        PADDUSW  XMM1, XMM2  
        PSRLW    XMM1, 1

        MOV      ECX, source  
        ADD      ECX, integerPitch  // west pitch pixel

        PMOVZXBW XMM2, [ECX]    // left
        PMOVZXBW XMM3, [ECX+1]  // right

        PMULLW   XMM2, XMM7  
        PMULLW   XMM3, XMM6
        PADDUSW  XMM2, XMM3
        PSRLW    XMM2, 1

        MOV      ECX, source;  
        PMOVZXBW XMM0, [ECX]    // XMM0 = I(n) | ... | I(n+7)
        PSLLW    XMM0, 8
        MOVDQU   XMM4, XMM0

        PSUBUSW  XMM0, XMM1    // 2C - (L+R)
        PSUBUSW  XMM0, XMM2    

        PADDUSW  XMM1, XMM2    // (L+R) - 2C
        PSUBUSW  XMM1, XMM4

        PADDUSW  XMM0, XMM1

        PSRLW    XMM0, 8
        PXOR     XMM1, XMM1
        PACKUSWB XMM1, XMM0

        MOV      EDX, target
        MOVHPS   [EDX], XMM1
      }
    }
  }
}
 
The above code runs at around 940 MB/s.