Here is a naive translation of pitch comparison to OpenCL CPU kernel function. In here, I still use the fixed float arithmetic using short.
__kernel __attribute__((vec_type_hint(short8))) // try to hint the compiler to pack 8 shorts to 128 bits xmm register void Pitch0( __global const uchar *src, __global uchar *dst, __const int width, __const int height, __const int nPitch, __const short toCeiling, __const short toFloor ) { int x = get_global_id(0); int y = get_global_id(1); int idx = y * width + x; short eastLeft = src[ idx + nPitch ]; short eastRight = src[ idx + nPitch + 1]; short westLeft = src[ idx - nPitch - 1]; short westRight = src[ idx - nPitch ]; short center = src[ idx ]; short east = toFloor * eastLeft + toCeiling * eastRight; short west = toFloor * westRight + toCeiling * westLeft; east = east >> 1; west = west >> 1; short sum = east + west; short c2 = center << 7; short diff = abs_diff( c2, sum ); diff = diff >> 7; dst[ idx ] = (uchar)diff; } // HOST code ... int integerPitch = (int)(pitch); short toFloor = (short)( ((pitch - integerPitch) * 0x7F)+0.5f); short toCeiling = (short)( 0x7F - toFloor ); kernelPitch0.setArg( 0, devSrc ); kernelPitch0.setArg( 1, devDst ); kernelPitch0.setArg( 2, width ); kernelPitch0.setArg( 3, height ); kernelPitch0.setArg( 4, integerPitch ); kernelPitch0.setArg( 5, toCeiling ); kernelPitch0.setArg( 6, toFloor ); cl::NDRange globalws( roiRight-roiLeft+1, roiBottom-roiTop+1 ); cl::NDRange offset( roiLeft, roiTop ); ... queue.enqueueNDRangeKernel( kernelPitch0, offset, globalws, cl::NullRange, &ve1, &e2 ); ...
This code only gives me 42MB/s with Intel G2120 CPU. As the SIMD version gave me nearly 1GB/s, it is quite discouraging. Now I listed down number of points that I am not sure ...
- Will there be too many context switching ? Intel says it want 10,000 - 100,000 instructions per kernel and warned that there will be too many context switch if not. Clearly above code won't make that many instructions.
- Should I go read the assembler code using Intel Offline compiler ? I checked it but the assembler output is very long and it seems there are framework code also. Not an easy job at there. Not much information on the web also.
- Can it be optimized with short also ? Intel mentions that it works well with 4 bytes type like int or float. Also mentioned that it prefer signed to unsigned.
- I used events to synchronize kernel execution. Refer OpenCL for NVidia code. Though Intel says it is better to write blocking function call over non-blocking - Intel says it is an explicit synchronization.
- Intel says it does not want to be restricted by local work size. Only want to be specified with global work size. Then how can I be sure that it's cache access is optimized ? Remember that 64x4 local size gave out good result compared to 16x16.
- Should I use 'cl-fast-relazed-math' ?
Number of point to try are
- Remove fixed float arithmetic - using float instead of short
- Process multiple points at a kernel work item. Try to make two work items maybe so that these sit on the two processors.
- Get your hand dirty. Give up implicit magic and try explicit vectorization using vector data types
Reference
- Writing Optimal OpenCL* Code with the Intel® OpenCL SDK, Intel
- OpenCL in Action, Matthew Scarpino
No comments:
Post a Comment