essay on programming languages, computer science, information technologies and all.

Wednesday, April 3, 2013

OpenCL - one work group per CPU core

In the last post, I got around 1 GB/s of throughput, but that was using only one core of a dual-core CPU.

Digging into how to use multiple cores correctly with a kernel in the Intel OpenCL SDK, I found that the number of work groups should be the same as the number of cores, so that each work group sits on a single core. On this point, the OpenCL 1.1 spec says:
Compute Unit: An OpenCL device has one or more compute units. A work-group executes on a single compute unit. A compute unit is composed of one or more processing elements and local memory. A compute unit may also include dedicated texture filter units that can be accessed by its processing elements.
When writing host code for the GPU, I carefully assigned the global work size and the work-group (local) size. In the CPU case I didn't pay much attention to them, since there are just 2 cores, and also because of Intel's recommendation in "Writing Optimal OpenCL* Code with the Intel® OpenCL SDK", quoted below.
2.7 Work-Group Size Considerations
We always recommend letting the OpenCL implementation to automatically determine the optimal work-group size (sometimes referred as “local work size”) for a given kernel. Simply pass NULL for a pointer to the local work size when calling clEnqueueNDRangeKernel.
But it turns out that the global work size and the work-group size should be set according to the number of cores, so that there is one work group per core. The required change is as below.
  // host code

  cl::NDRange globalws( 2 );  // 2 work items in total
  cl::NDRange localws( 1 );   // 1 work item per work group - makes up 2 work groups
  ...
  queue.enqueueNDRangeKernel( kernelPitch0, cl::NullRange, globalws, localws, NULL, &ev );
  ...

  // kernel code
  __kernel 
  __attribute__((reqd_work_group_size(1,1,1)))   // this can be specified but makes no difference for me.
  void Pitch( 
    __global const uchar *src, __global uchar *dst, 
    ...
This change raises the throughput to around 2.1 GB/s, as shown below.
...
Total : duration 1961850 ns, 2022.74 MB/s
Total : duration 1787610 ns, 2219.90 MB/s
Total : duration 1796190 ns, 2209.29 MB/s
...
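
The host code above hardcodes 2 and 1 for a dual-core CPU. As a minimal sketch of how the same sizes could be derived for other core counts, the helper below computes a 1-D global and local size that gives exactly one work group per compute unit. The names WorkSizes and oneGroupPerComputeUnit are hypothetical and not part of the original code. The compute unit count is assumed to come from the device, for example through a CL_DEVICE_MAX_COMPUTE_UNITS query; whether that value equals the number of physical cores depends on the OpenCL implementation.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper, not from the original post: global and local work
// sizes for a 1-D range that yields one work group per compute unit.
struct WorkSizes {
    std::size_t global;  // total number of work items
    std::size_t local;   // work items per work group

    // Number of work groups the runtime will create for this range.
    std::size_t groups() const { return global / local; }
};

// computeUnits would normally be queried from the device, e.g. with
// device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>() in the OpenCL C++ bindings.
// itemsPerGroup is 1 in the post; other values are an assumption here.
inline WorkSizes oneGroupPerComputeUnit(std::size_t computeUnits,
                                        std::size_t itemsPerGroup = 1) {
    if (computeUnits == 0 || itemsPerGroup == 0) {
        throw std::invalid_argument("sizes must be non-zero");
    }
    // global = groups * local, so global / local == computeUnits.
    return WorkSizes{computeUnits * itemsPerGroup, itemsPerGroup};
}
```

With computeUnits equal to 2 this gives global 2 and local 1, the same values passed to cl::NDRange globalws and localws above.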
