essay on programming languages, computer science, information technologies and all.

Saturday, March 23, 2013

OpenCL - CPU Implicit Vectorization

The Intel OpenCL SDK boasts that its clever implicit vectorization - or autovectorization - can magically apply SIMD instructions across work-item boundaries. I had my doubts about it, but tried it anyway.

Here is a naive translation of the pitch comparison to an OpenCL CPU kernel function. Here, I still use fixed-point arithmetic with short.

__kernel 
__attribute__((vec_type_hint(short8)))  // try to hint the compiler to pack 8 shorts to 128 bits xmm register
void Pitch0( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int nPitch, 
  __const short toCeiling, __const short toFloor )
{
  int x = get_global_id(0);
  int y = get_global_id(1);
  int idx = y * width + x;

  short eastLeft = src[ idx + nPitch ];
  short eastRight = src[ idx + nPitch + 1];
  short westLeft = src[ idx - nPitch - 1];
  short westRight = src[ idx - nPitch ];
  short center = src[ idx ];

  short east = toFloor * eastLeft + toCeiling * eastRight;
  short west = toFloor * westRight + toCeiling * westLeft;
  
  east = east >> 1;
  west = west >> 1;

  short sum = east + west;
  short c2 = center << 7;
  ushort diff = abs_diff( c2, sum );  // abs_diff on shorts returns ushort

  diff = diff >> 7;
  dst[ idx ] = (uchar)diff; 
}
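The kernel's fixed-point scheme can be modelled on the host with plain C to sanity-check the arithmetic. The pixel values and Q7 weights below are hypothetical, chosen so that toFloor + toCeiling == 0x7F:

```c
#include <assert.h>
#include <stdlib.h>

/* Host-side model of the kernel's fixed-point arithmetic.
   Weights are Q7 fractions: toFloor + toCeiling == 0x7F. */
static unsigned char pitch_diff(short eastLeft, short eastRight,
                                short westLeft, short westRight,
                                short center,
                                short toCeiling, short toFloor)
{
    short east = (short)(toFloor * eastLeft  + toCeiling * eastRight);
    short west = (short)(toFloor * westRight + toCeiling * westLeft);
    east >>= 1;                          /* halve to keep the sum in range */
    west >>= 1;
    short sum  = (short)(east + west);
    short c2   = (short)(center << 7);   /* scale center to the same Q7 range */
    int   diff = abs(c2 - sum);          /* abs_diff() in the kernel */
    return (unsigned char)(diff >> 7);   /* back to pixel range */
}
```

With all neighbors near 100 and the center at 110, the interpolated sum is 12700 against a scaled center of 14080, so the reported difference is 1380 >> 7 = 10.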


// HOST code
  ...
  int integerPitch = (int)(pitch);
  short toFloor = (short)( ((pitch - integerPitch) * 0x7F)+0.5f);
  short toCeiling = (short)( 0x7F - toFloor );

  kernelPitch0.setArg( 0, devSrc );
  kernelPitch0.setArg( 1, devDst );
  kernelPitch0.setArg( 2, width );
  kernelPitch0.setArg( 3, height );
  kernelPitch0.setArg( 4, integerPitch );
  kernelPitch0.setArg( 5, toCeiling );
  kernelPitch0.setArg( 6, toFloor );

  cl::NDRange globalws( roiRight-roiLeft+1, roiBottom-roiTop+1 );
  cl::NDRange offset( roiLeft, roiTop );

  ...
  queue.enqueueNDRangeKernel( kernelPitch0, offset, globalws, cl::NullRange, &ve1, &e2 );
  ...
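The host-side weight computation can be checked in isolation. A small C sketch (the pitch value 2.3 is hypothetical) confirms the two Q7 weights always sum to 0x7F:

```c
#include <assert.h>

/* Split a sub-pixel pitch into an integer part and a pair of Q7
   interpolation weights that always sum to 0x7F. */
static void make_weights(float pitch, int *integerPitch,
                         short *toCeiling, short *toFloor)
{
    *integerPitch = (int)pitch;
    *toFloor   = (short)(((pitch - *integerPitch) * 0x7F) + 0.5f);
    *toCeiling = (short)(0x7F - *toFloor);
}
```

For pitch = 2.3f this yields integerPitch = 2, toFloor = 38, toCeiling = 89.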

This code gives me only 42 MB/s on an Intel G2120 CPU. As the hand-written SIMD version gave me nearly 1 GB/s, this is quite discouraging. Here are a number of points I am not sure about:

  • Will there be too much context switching? Intel recommends 10,000 - 100,000 instructions per kernel run and warns of excessive context switching otherwise. Clearly the code above executes far fewer instructions than that.
  • Should I read the assembly generated by the Intel offline compiler? I checked it, but the output is very long and seems to include framework code as well. Not an easy job, and there is not much information on the web either.
  • Can it be optimized with short at all? Intel mentions that autovectorization works well with 4-byte types like int or float, and that it prefers signed types to unsigned.
  • I used events to synchronize kernel execution (following the OpenCL for NVidia code). Intel, however, recommends blocking calls over non-blocking ones - what it calls explicit synchronization.
  • Intel says it does not want to be restricted by a local work size and prefers that only the global work size be specified. Then how can I be sure its cache access is optimized? Remember that a 64x4 local size gave a noticeably better result than 16x16.
  • Should I use '-cl-fast-relaxed-math'?


A number of points to try:

  • Remove the fixed-point arithmetic - use float instead of short.
  • Process multiple points per kernel work item. Maybe try just two work items so that each sits on one of the two cores.
  • Get my hands dirty: give up on the implicit magic and try explicit vectorization using vector data types.
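The second and third points can be sketched in plain C first: process eight horizontally adjacent pixels per work item. In OpenCL C the loop body would collapse into one set of short8 operations (vload8/vstore8 on src and dst); the scalar model below uses the same Q7 weights as the original kernel and hypothetical test data:

```c
#include <assert.h>
#include <string.h>

/* Scalar model of one "wide" work item: eight adjacent pixels per call.
   Weights are Q7 as in the original kernel: toFloor + toCeiling == 0x7F. */
static void pitch_diff8(const unsigned char *src, unsigned char *dst,
                        int idx, int nPitch,
                        short toCeiling, short toFloor)
{
    for (int i = 0; i < 8; ++i) {
        int j = idx + i;
        short east = (short)((toFloor * src[j + nPitch] +
                              toCeiling * src[j + nPitch + 1]) >> 1);
        short west = (short)((toFloor * src[j - nPitch] +
                              toCeiling * src[j - nPitch - 1]) >> 1);
        short sum  = (short)(east + west);
        short c2   = (short)(src[j] << 7);
        int   diff = c2 > sum ? c2 - sum : sum - c2;   /* abs_diff */
        dst[j] = (unsigned char)(diff >> 7);
    }
}
```

On a flat region (all pixels equal) the interpolated sum matches the scaled center to within the Q7 rounding, so every output is zero.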


Reference
  • Writing Optimal OpenCL* Code with the Intel® OpenCL SDK, Intel
  • OpenCL in Action, Matthew Scarpino

