essay on programming languages, computer science, information techonlogies and all.

Friday, March 29, 2013

OpenCL - CPU Explicit Vectorization


Last blog, I got a disappointing result with CPU OpenCL code. And it is time to find out the right way to code in CPU side.

First of all, I tried to use float version of pitch comparison. This is to check whether the implicit vectorization can make any difference with 4 bytes type. At least, Intel OpenCL optimization book recommends to use int and float.

Refer below code. This code scores meager 39.3 MB/s. So still implicit vectorization with float isn't good. And it also tells that float operation is still slightly slower than integer operation. If precision doesn't matter, it is better stick to fixed floating point operation using integer type than float type. Though the code is clean and easy to read with float version. So if code maintenance matters, then float can be a helpful.
_kernel 
void Pitch( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int nPitch, 
  __const float toCeiling, __const float toFloor )
{
  int x = get_global_id(0);
  int y = get_global_id(1);
  int idx = y * width + x;

  float eastLeft = src[ idx + nPitch ];      // uchar pixel value is converted to float - 4 bytes
  float eastRight = src[ idx + nPitch + 1];
  float westLeft = src[ idx - nPitch - 1];
  float westRight = src[ idx - nPitch ];
  float center = src[ idx ];

  float east = toFloor * eastLeft + toCeiling * eastRight;
  float west = toFloor * westRight + toCeiling * westLeft;

  float diff = fabs( (2*center - (east+west)) / 2.0f );
 
  dst[ idx ] = (uchar)diff;                  // float difference value is converted back to uchar
}


Second try is a bold step of using just 2 kernels. And I expect that each kernel runs on each core nicely - my PC has dual core CPU. Then the kernel has to loop through each pixel in image.

Refer below code. This scores 144MB/s. This big jump on performance is saying that this is the right way. CPU side has to be coded quite differently to the GPU side. In GPU, you have to make lots of kernels. But in the CPU side, you have to make just as many kernels as number of cores.

// Each kernel processes half of full image in column.  The blockWidth is ( roiRight - roiLeft ) / 2. 
__kernel 
void Pitch( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int roiLeft, __const int roiTop, __const int roiRight, __const int roiBottom,
  __const int blockWidth,   
  __const int nPitch, 
  __const short toCeiling, __const short toFloor )
{
  int i = get_global_id(0);

  for( int y = roiTop; y <= roiBottom; ++y )
  {
    for( int x = roiLeft + blockWidth*i; x < min( roiLeft + blockWidth*(i+1), roiRight + 1 ); ++x )
    {
      int idx = y * width + x;

      short eastLeft = src[ idx + nPitch ];
      short eastRight = src[ idx + nPitch + 1];
      short westLeft = src[ idx - nPitch - 1];
      short westRight = src[ idx - nPitch ];
      short center = src[ idx ];

      short east = toFloor * eastLeft + toCeiling * eastRight;
      short west = toFloor * westRight + toCeiling * westLeft;

      east = east >> 1;
      west = west >> 1;

      short sum = east + west;
      short c2 = center << 7;
      short diff = abs_diff( c2, sum );

      diff = diff >> 7;

      if ( x <= roiRight ) 
      {
        dst[ idx ] = (uchar)diff; 
      }

    }
  }
}


Now time to combine the explicit loop with vector type. Here short8 is used as primary vector type considering that it fits with no loss at the 128bits xmm register. Also the built-in function abs_diff() supports only signed type. That's the reason why I choose short8 over ushort8. Though this is traded off with precision.

This code scores 986MB/s. This is another big jump. But best result with SSE instructions with 1 core is 940 MB/s. As this OpenCL code supposes to use dual core in full, it can be two fold increase but disappointingly, it is just above 1 core best score. There is still a wide room to improve.

__kernel 
void Pitch( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int roiLeft, __const int roiTop, __const int roiRight, __const int roiBottom,
  __const int blockWidth,
  __const int nPitch, 
  __const short toCeiling, __const short toFloor )
{
  int i = get_global_id(0);
  short8 toCeiling8 = toCeiling;  // toCeiling8 is filled with 8 of toCeiling.
  short8 toFloor8 = toFloor;

  for( int y = roiTop; y <= roiBottom; ++y )
  {
    int left = roiLeft + blockWidth*i;
    int right = left + blockWidth;

    for( int x = left; x < right; x+=8 )
    {
      int idx = y * width + x;

      short8 eastLeft = convert_short8( vload8( 0, src + idx + nPitch ) );      // uchar8 filled with 8 pixels is converted to ushort8.
      short8 eastRight = convert_short8( vload8( 0, src + idx + nPitch + 1 ) );
      short8 westLeft = convert_short8( vload8( 0, src + idx - nPitch - 1 ) );
      short8 westRight = convert_short8( vload8( 0, src + idx - nPitch ) );
      short8 center = convert_short8( vload8( 0, src + idx ) );

      short8 east = toFloor * eastLeft + toCeiling * eastRight;
      short8 west = toFloor * westRight + toCeiling * westLeft;
  
      east = east >> 1;
      west = west >> 1;
      center = center << 7;

      ushort8 diff = abs_diff( center, east+west );    // abs_diff is only available with signed type.

      diff = diff >> 7;
    
      if ( x <= roiRight ) 
      {
        vstore8( convert_uchar8( diff ), 0, dst + idx );  // convert short8 to uchar8 and stores 8 pixels at a time.
      }
    }

  }
}


In this analysis, I didn't put the memory copy time. This is because it doesn't need. In CPU side, you don't have to copy buffer to a device memory. The CPU's device memory is the host memory. So what we need is just map the device memory and get a pointer and use it to manipulate and un-map it.
  cl::Buffer devSrc( task.GetContext(), CL_MEM_READ_ONLY, width*height );
  cl::Buffer devDst( task.GetContext(), CL_MEM_WRITE_ONLY, width*height );

  ...
  // manipulate source image buffer. Note that we map to write
  uint8_t* hostSrcPtr = (uint8_t*)queue.enqueueMapBuffer( devSrc, CL_TRUE, CL_MAP_WRITE, 0, width*height );

  AttachedByteImage src( width, height, hostSrcPtr );
  ForEachPixels::Fill( src, SineWaveColumn( 20.2f ) );
  src.SetPixel( 50, 50, 50 );

  queue.enqueueUnmapMemObject( devSrc, hostSrcPtr );

  ...
  // run 2 kernels 
  cl::NDRange globalws( 2 );
  queue.enqueueNDRangeKernel( kernelPitch0, cl::NullRange, globalws, cl::NullRange, &ve0, &steps[1] );
  ... 

  // manipulate destination buffer. Note that we map to read
  uint8_t* hostDstPtr = (uint8_t*)queue.enqueueMapBuffer( devDst, CL_TRUE, CL_MAP_READ, 0, width*height );
  
  AttachedByteImage dst( width, height, hostDstPtr );
  ThresholdAboveCounter counter( 5 );
  ForEachPixels::Visit( dst, counter, Region( roiLeft, roiTop, roiRight, roiBottom), RowAfterRow  );
  BOOST_CHECK_EQUAL( counter.Count(), 5 );

  queue.enqueueUnmapMemObject( devDst, hostDstPtr );




No comments: