Computing Memoir: OpenCL - CPU Implicit Vectorization

Intel OpenCL SDK boasts that the it's clever implicit vectorization - or Autovectorization - can magically put SIMD instructions across kernel work item boundary. I doubts about it but tried.

Here is a naive translation of pitch comparison to OpenCL CPU kernel function. In here, I still use the fixed float arithmetic using short.

__kernel 
__attribute__((vec_type_hint(short8)))  // try to hint the compiler to pack 8 shorts to 128 bits xmm register
void Pitch0( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int nPitch, 
  __const short toCeiling, __const short toFloor )
{
  int x = get_global_id(0);
  int y = get_global_id(1);
  int idx = y * width + x;

  short eastLeft = src[ idx + nPitch ];
  short eastRight = src[ idx + nPitch + 1];
  short westLeft = src[ idx - nPitch - 1];
  short westRight = src[ idx - nPitch ];
  short center = src[ idx ];

  short east = toFloor * eastLeft + toCeiling * eastRight;
  short west = toFloor * westRight + toCeiling * westLeft;
  
  east = east >> 1;
  west = west >> 1;

  short sum = east + west;
  short c2 = center << 7;
  short diff = abs_diff( c2, sum );

  diff = diff >> 7;
  dst[ idx ] = (uchar)diff; 
}


// HOST code
  ...
  int integerPitch = (int)(pitch);
  short toFloor = (short)( ((pitch - integerPitch) * 0x7F)+0.5f);
  short toCeiling = (short)( 0x7F - toFloor );

  kernelPitch0.setArg( 0, devSrc );
  kernelPitch0.setArg( 1, devDst );
  kernelPitch0.setArg( 2, width );
  kernelPitch0.setArg( 3, height );
  kernelPitch0.setArg( 4, integerPitch );
  kernelPitch0.setArg( 5, toCeiling );
  kernelPitch0.setArg( 6, toFloor );

  cl::NDRange globalws( roiRight-roiLeft+1, roiBottom-roiTop+1 );
  cl::NDRange offset( roiLeft, roiTop );

  ...
  queue.enqueueNDRangeKernel( kernelPitch0, offset, globalws, cl::NullRange, &ve1, &e2 );
  ...

This code only gives me 42MB/s with Intel G2120 CPU. As the SIMD version gave me nearly 1GB/s, it is quite discouraging. Now I listed down number of points that I am not sure ...

Will there be too many context switching ? Intel says it want 10,000 - 100,000 instructions per kernel and warned that there will be too many context switch if not. Clearly above code won't make that many instructions.
Should I go read the assembler code using Intel Offline compiler ? I checked it but the assembler output is very long and it seems there are framework code also. Not an easy job at there. Not much information on the web also.
Can it be optimized with short also ? Intel mentions that it works well with 4 bytes type like int or float. Also mentioned that it prefer signed to unsigned.
I used events to synchronize kernel execution. Refer OpenCL for NVidia code. Though Intel says it is better to write blocking function call over non-blocking - Intel says it is an explicit synchronization.
Intel says it does not want to be restricted by local work size. Only want to be specified with global work size. Then how can I be sure that it's cache access is optimized ? Remember that 64x4 local size gave out good result compared to 16x16.
Should I use 'cl-fast-relazed-math' ?

Number of point to try are

Remove fixed float arithmetic - using float instead of short
Process multiple points at a kernel work item. Try to make two work items maybe so that these sit on the two processors.
Get your hand dirty. Give up implicit magic and try explicit vectorization using vector data types

Reference

Writing Optimal OpenCL* Code with the Intel® OpenCL SDK, Intel
OpenCL in Action, Matthew Scarpino

Computing Memoir

Saturday, March 23, 2013

OpenCL - CPU Implicit Vectorization

No comments:

Blog Archive

About Me