essay on programming languages, computer science, information technologies and all.

Thursday, January 31, 2013

CUDA Study - shared memory

This time I am trying to use shared memory. The idea is to load source pixels collaboratively by multiple threads. If the bottleneck is the memory loading time, then this should give me some boost. The shortfall of this idea is that the last column of threads in a block has to load an additional pixel. This extra load can't be removed unless the whole algorithm is radically modified.

Still pursuing this idea to see how it goes.

Defining shared memory requires a constant value for the size, which means blockDim.x and blockDim.y can't be used for it. However, integer template parameters can conveniently declare the width and height of the shared memory.

template< int W, int H >
__global__ void KernelPitch1( ... ) 
{
  ...
  // One extra column per row so the right-most thread also has its neighbor cached.
  __shared__ uint8_t sharedEast[ (W+1) * H ];
  __shared__ uint8_t sharedWest[ (W+1) * H ];

  // Each thread loads one east and one west source pixel collaboratively.
  int sharedIndex = threadIdx.y * (W+1) + threadIdx.x;
  sharedEast[ sharedIndex ] = d_src[ idx - pitch - 1 ];
  sharedWest[ sharedIndex ] = d_src[ idx + pitch ];

  if( threadIdx.x == (W-1) )  // one more column for each block
  {
    sharedEast[ sharedIndex + 1 ] = d_src[ idx - pitch ];
    sharedWest[ sharedIndex + 1 ] = d_src[ idx + pitch + 1 ];
  }

  __syncthreads();  // all loads must finish before any thread reads shared memory

  float east = 
    (float)(sharedEast[ sharedIndex   ]) * pc->ToCeiling +
    (float)(sharedEast[ sharedIndex+1 ]) * pc->ToFloor;
  float west = 
    (float)(sharedWest[ sharedIndex   ]) * pc->ToFloor + 
    (float)(sharedWest[ sharedIndex+1 ]) * pc->ToCeiling;

  float diff = ( (float)d_src[idx] * 2.0f  - ( east + west ) ) / 2.0f;
  
  if( isInBound )
  {
    d_dst[ idx ] = (unsigned char)( diff >= 0 ? diff : -diff  );
  }
}

void CUDA_Pitch1( ... )
{
  ...
  KernelPitch1< 16, 16 > <<< grids, threads >>>( ... );
  ...
}
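
One thing to keep in mind is that the launch configuration has to match the template arguments, since the shared memory tiles are sized from W and H at compile time. The elided setup inside CUDA_Pitch1 is assumed to look roughly like the non-shared-memory version, e.g.:

  // The block dimensions must equal the template arguments < 16, 16 >,
  // because the shared memory tiles are sized from them at compile time.
  dim3 threads( 16, 16 );
  dim3 grids( width / 16, height / 16 );  // assuming width and height are multiples of 16

  KernelPitch1< 16, 16 > <<< grids, threads >>>( ... );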


Here is the Visual Profiler result, which shows a slight improvement in performance.

Though memory loading time might not be a big bottleneck in this case, as the profiler points to the store operation as the culprit.

One thing to notice is that CUDA seems to get confused by two-level 'if' blocks. It is not a good idea to have multiple levels of branching anyway, since the core has to keep track of all the branches, but I was still surprised that it compiles and generates strange results. Here is the pattern that has to be avoided.

template< int W, int H >
__global__ void KernelPitch1( ... ) 
{
  ...
  
  if( isInBound )
  {
    ...

    if( threadIdx.x == (W-1) )  
    {
      // This second-level if block causes trouble.
      // It seems the block is sometimes not executed even when the condition is valid.
      sharedEast[ sharedIndex + 1 ] = d_src[ idx - pitch ];
      sharedWest[ sharedIndex + 1 ] = d_src[ idx + pitch + 1 ];
    }

    ...
  }
}
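
The restructuring that avoids this, and is essentially what the working kernel at the top of the post does, is to keep the collaborative loads and __syncthreads() at the top level and guard only the final store with the bound check. A rough skeleton:

template< int W, int H >
__global__ void KernelPitch1( ... ) 
{
  ...
  // every thread of the block participates in the loads and in __syncthreads()
  sharedEast[ sharedIndex ] = ...;
  sharedWest[ sharedIndex ] = ...;

  if( threadIdx.x == (W-1) )  // only a single level of branching
  {
    ...
  }

  __syncthreads();

  ...

  if( isInBound )  // the bound check guards only the final store
  {
    d_dst[ idx ] = ...;
  }
}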

Wednesday, January 30, 2013

CUDA Study - Pinned Memory

Since CUDA advocates pinned memory, only the memory allocation has been modified to see the effect. Refer to the code below.

struct AllocPinnedMemory {
  // Allocate page-locked (pinned) host memory so DMA transfers can run at full speed.
  static uint8_t* Alloc( int width, int height ) { 
    uint8_t *p = NULL;
    cudaHostAlloc( (void**)&p, width*height, cudaHostAllocDefault );
    return p; 
  }
  // Memory from cudaHostAlloc must be released with cudaFreeHost.
  static void Free( uint8_t* p ) { cudaFreeHost( p ); }
};

typedef Image < uint8_t , AllocPinnedMemory > PinnedImage;

BOOST_AUTO_TEST_CASE( TestProcess2 )
{  
  // ...
  PinnedImage src( width, height );
  PinnedImage dst( width, height );
  // ...
}
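
For reference, the Image template used above is just a width/height pixel buffer parameterized by an allocation policy. Its definition is not shown in this post; a minimal sketch of what it is assumed to look like (copy control omitted for brevity):

template< typename T, typename AllocPolicy >
class Image
{
public:
  Image( int width, int height )
    : m_width( width ), m_height( height ),
      m_pixels( AllocPolicy::Alloc( width, height ) )
  {}

  ~Image() { AllocPolicy::Free( m_pixels ); }

  T*  GetPixelPtr() { return m_pixels; }
  int Width() const  { return m_width; }
  int Height() const { return m_height; }

private:
  int m_width;
  int m_height;
  T*  m_pixels;
};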

This makes the memory operation almost exactly two times faster than the previous host memory allocated with 'new'. Refer to the Visual Profiler screenshot below. The non-pinned memory copy takes 1.24 ms (HtoD) and 1.27 ms (DtoH). The pinned memory copy takes 628 us (both HtoD and DtoH). Of course there is no difference in the kernel time.

Tuesday, January 29, 2013

CUDA Study - Pattern Inspection Algorithm #1

There is an image processing algorithm that I want to see how fast it can be implemented. It is the first step of a common pattern inspection algorithm used on LCDs or other regular patterns.

The algorithm simply calculates the difference between pixels that are a pitch distance apart. Here is a horizontal pitch comparison version in plain C with fixed-point arithmetic.

#include <stdint.h>

// max() is not part of plain C; define it if it is not already available.
#ifndef max
#define max( a, b )  ( (a) > (b) ? (a) : (b) )
#endif

uint8_t PlainC_HorizontalPitch( 
 uint8_t center,
 uint8_t eastLeft, uint8_t eastRight, uint8_t westLeft, uint8_t westRight, 
 float pitch )
{
 uint16_t toFloor = (uint16_t)( ((pitch - (int)pitch) * 0xFF)+0.5f);
 uint16_t toCeiling = 0xFF - toFloor;

 uint16_t east = toFloor * eastLeft + toCeiling * eastRight;
 uint16_t west = toFloor * westRight + toCeiling * westLeft;

 east = east >> 1;   // fx9.16
 west = west >> 1; 

 uint16_t diff1, diff2, diff;
 uint16_t c2 = center << 8;   // fx9.16   : this is 2c

 diff1 = max( 0,  c2 - east - west );
 diff2 = max( 0,  east + west - c2 );
 diff = ( diff1 + diff2 ) >> 8;   // fx8.8   : 

 return (uint8_t)diff;
} 
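
As a quick sanity check of the fixed-point weights: for a hypothetical pitch of 10.25, toFloor = (uint16_t)(0.25f * 0xFF + 0.5f) = 64 and toCeiling = 255 - 64 = 191, so the east and west pixel pairs are blended with roughly 1/4 and 3/4 weights. A small test call with made-up pixel values:

#include <stdio.h>
#include <stdint.h>

int main( void )
{
  // center, eastLeft, eastRight, westLeft, westRight, pitch
  uint8_t diff = PlainC_HorizontalPitch( 128, 120, 124, 130, 134, 10.25f );
  printf( "diff = %d\n", (int)diff );
  return 0;
}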


Now this is modified to run as a CUDA kernel, as shown below. I really don't expect it to perform well. This version is just a very crude one; I want to see where the bottleneck is and make a plan for optimization.

__global__ 
void KernelPitch0( 
 unsigned char* d_src, unsigned char* d_dst, 
 const Area* imgSize, 
 const Region* roi,
 const PitchContext* pc )
{
 int x = roi->Left + blockIdx.x * blockDim.x + threadIdx.x;
 int y = roi->Top + blockIdx.y * blockDim.y + threadIdx.y;
 int idx = x + y * imgSize->Width;

 bool isInBound = x < roi->Right && y < roi->Bottom;

 if( isInBound )
 {
  float east = 
   (float)d_src[ idx - pc->IntegerPitch - 1 ] * pc->ToCeiling +
   (float)d_src[ idx - pc->IntegerPitch ] * pc->ToFloor;
  float west = 
   (float)d_src[ idx + pc->IntegerPitch ] * pc->ToFloor + 
   (float)d_src[ idx + pc->IntegerPitch + 1 ] * pc->ToCeiling;

  float diff = ( (float)d_src[idx] * 2.0f  - ( east + west ) ) / 2.0f;

  d_dst[ idx ] = (unsigned char)( diff >= 0 ? diff : -diff  );
 }
}

The kernel is launched as in the code snippet below.

void CUDA_Pitch0( 
 const unsigned char *h_src, unsigned char *h_dst, 
 int width, int height, 
 int roiLeft, int roiTop, int roiRight, int roiBottom,
 float horPitch, float verPitch )
{
 using namespace boost;

 int imgSize = width*height;

 DeviceByteImagePtr devSrc( new DeviceByteImage( width, height ) );
 DeviceByteImagePtr devDst( new DeviceByteImage( width, height ) );

 dim3 grids( width / 32, height / 32 );
 dim3 threads( 32, 32 );

 HANDLE_ERROR( cudaMemcpy( devSrc->GetPixelPtr(), h_src, imgSize, cudaMemcpyHostToDevice ) );

 shared_ptr<Area> devImageDim = CreateSmartDeviceMemoryPointer<Area>();
 shared_ptr<Region> devROI = CreateSmartDeviceMemoryPointer<Region>();
 shared_ptr<PitchContext> devHorPitch = CreateSmartDeviceMemoryPointer<PitchContext>();
 
 {
  Area imageDim( width, height );
  HANDLE_ERROR( cudaMemcpy( devImageDim.get(), &imageDim, sizeof(Area), cudaMemcpyHostToDevice ) );

  Region roi( roiLeft, roiTop, roiRight, roiBottom );
  HANDLE_ERROR( cudaMemcpy( devROI.get(), &roi, sizeof(Region), cudaMemcpyHostToDevice ) );

  PitchContext pc;

  pc.Pitch = horPitch;
  pc.IntegerPitch = (int)horPitch;
  pc.ToFloor = horPitch - pc.IntegerPitch;
  pc.ToCeiling = 1.0f - pc.ToFloor;
  HANDLE_ERROR( cudaMemcpy( devHorPitch.get(), &pc, sizeof(PitchContext), cudaMemcpyHostToDevice ) );
 }

 KernelPitch0<<< grids, threads >>>( 
   devSrc->GetPixelPtr(), devDst->GetPixelPtr(), 
   devImageDim.get(), devROI.get(), 
   devHorPitch.get()
  );
 if ( cudaGetLastError() != cudaSuccess ) 
 {
  cerr << "Launch Error" << endl;
  return;
 }

 HANDLE_ERROR( cudaMemcpy( h_dst, devDst->GetPixelPtr(), imgSize, cudaMemcpyDeviceToHost ) );
}
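
HANDLE_ERROR above is the usual check-and-abort wrapper around CUDA API calls; its definition is not shown in this post, but it is assumed to be something along these lines:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with the CUDA error string if an API call fails.
static void HandleError( cudaError_t err, const char *file, int line )
{
 if( err != cudaSuccess )
 {
  fprintf( stderr, "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
  exit( EXIT_FAILURE );
 }
}
#define HANDLE_ERROR( err ) ( HandleError( (err), __FILE__, __LINE__ ) )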

With an 8192x512 (4 MB) image, it takes around 9291 usec, which gives me 428.3 MB/sec. The environment is as below.
  • Geforce GT 640, CUDA 5.0
  • Windows 7 64-bit, Pentium G2120 @ 3.10GHz, 4GB RAM
  • VS 2008

Looking into the profiling, the memory copy from host to device for the source takes 1384 usec, the actual kernel takes 4081 usec, and the memory copy from device to host takes another 1384 usec. Refer to the image below. The first cudaMemcpy is the copy from host to device; the second cudaMemcpy actually covers both the kernel time and the copy from device to host.