essay on programming languages, computer science, information technologies and all.

Friday, March 29, 2013

OpenCL - CPU Explicit Vectorization


In the last post, I got a disappointing result from the CPU OpenCL code, so it is time to find the right way to write this code for the CPU.

First of all, I tried a float version of the pitch comparison. This is to check whether implicit vectorization can make any difference with a 4-byte type; the Intel OpenCL optimization guide recommends using int and float.

Refer to the code below. It scores a meager 39.3 MB/s, so implicit vectorization with float still isn't good. It also shows that float operations are slightly slower than integer operations. If precision doesn't matter, it is better to stick to fixed-point arithmetic on integer types than to use float, although the float version is cleaner and easier to read. So if code maintenance matters, float can be helpful.
__kernel 
void Pitch( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int nPitch, 
  __const float toCeiling, __const float toFloor )
{
  int x = get_global_id(0);
  int y = get_global_id(1);
  int idx = y * width + x;

  float eastLeft = src[ idx + nPitch ];      // uchar pixel value is converted to float - 4 bytes
  float eastRight = src[ idx + nPitch + 1];
  float westLeft = src[ idx - nPitch - 1];
  float westRight = src[ idx - nPitch ];
  float center = src[ idx ];

  float east = toFloor * eastLeft + toCeiling * eastRight;
  float west = toFloor * westRight + toCeiling * westLeft;

  float diff = fabs( (2*center - (east+west)) / 2.0f );
 
  dst[ idx ] = (uchar)diff;                  // float difference value is converted back to uchar
}
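
For reference, the host side would feed this kernel the unscaled fractional weights. A minimal sketch, based on the fixed-point host code shown in the previous post (the kernel object name and argument indices here are my assumptions):

  // Sketch: float weights for the kernel above - just the fractional split of the
  // sub-pixel pitch, without the 0x7F scaling used by the fixed-point version.
  int   nPitch    = (int)pitch;
  float toFloor   = pitch - (float)nPitch;
  float toCeiling = 1.0f - toFloor;

  kernelPitch.setArg( 4, nPitch );
  kernelPitch.setArg( 5, toCeiling );
  kernelPitch.setArg( 6, toFloor );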


The second try is a bolder step: use just two work-items - and I expect each one to run nicely on its own core, since my PC has a dual-core CPU. Each work-item then has to loop through its share of the pixels in the image.

Refer to the code below. It scores 144 MB/s. This big jump in performance says this is the right way: the CPU side has to be coded quite differently from the GPU side. On the GPU you create lots of work-items; on the CPU you create only about as many work-items as there are cores.

// Each work-item processes half of the image columns.  blockWidth is ( roiRight - roiLeft ) / 2. 
__kernel 
void Pitch( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int roiLeft, __const int roiTop, __const int roiRight, __const int roiBottom,
  __const int blockWidth,   
  __const int nPitch, 
  __const short toCeiling, __const short toFloor )
{
  int i = get_global_id(0);

  for( int y = roiTop; y <= roiBottom; ++y )
  {
    for( int x = roiLeft + blockWidth*i; x < min( roiLeft + blockWidth*(i+1), roiRight + 1 ); ++x )
    {
      int idx = y * width + x;

      short eastLeft = src[ idx + nPitch ];
      short eastRight = src[ idx + nPitch + 1];
      short westLeft = src[ idx - nPitch - 1];
      short westRight = src[ idx - nPitch ];
      short center = src[ idx ];

      short east = toFloor * eastLeft + toCeiling * eastRight;
      short west = toFloor * westRight + toCeiling * westLeft;

      east = east >> 1;
      west = west >> 1;

      short sum = east + west;
      short c2 = center << 7;
      short diff = abs_diff( c2, sum );

      diff = diff >> 7;

      if ( x <= roiRight ) 
      {
        dst[ idx ] = (uchar)diff; 
      }

    }
  }
}
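
To see why the shifts are there (my own working, assuming toFloor + toCeiling = 0x7F as in the host code of the previous post): east and west come out scaled by 127, the two >> 1 shifts turn their sum into roughly 127 * (east + west) / 2, and center << 7 scales the center pixel by 128, which is close to the same factor. abs_diff therefore yields about 127 * |center - (east + west) / 2|, and the final >> 7 brings it back to pixel units - the same quantity the float kernel computes, at the cost of roughly 1% scale error (127 vs 128) plus truncation.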


Now it is time to combine the explicit loop with a vector type. Here short8 is used as the primary vector type, since eight shorts fit a 128-bit xmm register exactly. Also, the built-in abs_diff() is used with signed arguments here, which is why I chose short8 over ushort8, even though this trades away some precision.

This code scores 986 MB/s - another big jump. But the best single-core result with hand-written SSE instructions was 940 MB/s. As this OpenCL code is supposed to use both cores fully, it could have been a two-fold increase, but disappointingly it is only just above the one-core best. There is still plenty of room to improve.

__kernel 
void Pitch( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int roiLeft, __const int roiTop, __const int roiRight, __const int roiBottom,
  __const int blockWidth,
  __const int nPitch, 
  __const short toCeiling, __const short toFloor )
{
  int i = get_global_id(0);
  short8 toCeiling8 = toCeiling;  // toCeiling8 is filled with 8 of toCeiling.
  short8 toFloor8 = toFloor;

  for( int y = roiTop; y <= roiBottom; ++y )
  {
    int left = roiLeft + blockWidth*i;
    int right = left + blockWidth;

    for( int x = left; x < right; x+=8 )
    {
      int idx = y * width + x;

      short8 eastLeft = convert_short8( vload8( 0, src + idx + nPitch ) );      // uchar8 filled with 8 pixels is converted to short8.
      short8 eastRight = convert_short8( vload8( 0, src + idx + nPitch + 1 ) );
      short8 westLeft = convert_short8( vload8( 0, src + idx - nPitch - 1 ) );
      short8 westRight = convert_short8( vload8( 0, src + idx - nPitch ) );
      short8 center = convert_short8( vload8( 0, src + idx ) );

      short8 east = toFloor * eastLeft + toCeiling * eastRight;
      short8 west = toFloor * westRight + toCeiling * westLeft;
  
      east = east >> 1;
      west = west >> 1;
      center = center << 7;

      ushort8 diff = abs_diff( center, east+west );    // abs_diff of two short8 values; the result type is ushort8.

      diff = diff >> 7;
    
      if ( x <= roiRight ) 
      {
        vstore8( convert_uchar8( diff ), 0, dst + idx );  // convert short8 to uchar8 and stores 8 pixels at a time.
      }
    }

  }
}
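
One caveat with this version: vstore8 always writes eight pixels at once, so the scalar check if ( x <= roiRight ) no longer keeps the last store inside the ROI when the block width is not a multiple of 8. A possible guard - my own sketch, not part of the measured code:

      if ( x + 7 <= roiRight )
      {
        vstore8( convert_uchar8( diff ), 0, dst + idx );   // full 8-pixel store
      }
      else
      {
        uchar tail[8];
        vstore8( convert_uchar8( diff ), 0, tail );        // spill to a private array
        for( int k = 0; x + k <= roiRight; ++k )
          dst[ idx + k ] = tail[ k ];
      }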


In this analysis I didn't include the memory copy time, because there is no copy needed. On the CPU you don't have to copy the buffer to device memory: the CPU's device memory is the host memory. All we need to do is map the device buffer, get a pointer, use it to manipulate the data, and unmap it.
  cl::Buffer devSrc( task.GetContext(), CL_MEM_READ_ONLY, width*height );
  cl::Buffer devDst( task.GetContext(), CL_MEM_WRITE_ONLY, width*height );

  ...
  // manipulate source image buffer. Note that we map to write
  uint8_t* hostSrcPtr = (uint8_t*)queue.enqueueMapBuffer( devSrc, CL_TRUE, CL_MAP_WRITE, 0, width*height );

  AttachedByteImage src( width, height, hostSrcPtr );
  ForEachPixels::Fill( src, SineWaveColumn( 20.2f ) );
  src.SetPixel( 50, 50, 50 );

  queue.enqueueUnmapMemObject( devSrc, hostSrcPtr );

  ...
  // run 2 kernels 
  cl::NDRange globalws( 2 );
  queue.enqueueNDRangeKernel( kernelPitch0, cl::NullRange, globalws, cl::NullRange, &ve0, &steps[1] );
  ... 

  // manipulate destination buffer. Note that we map to read
  uint8_t* hostDstPtr = (uint8_t*)queue.enqueueMapBuffer( devDst, CL_TRUE, CL_MAP_READ, 0, width*height );
  
  AttachedByteImage dst( width, height, hostDstPtr );
  ThresholdAboveCounter counter( 5 );
  ForEachPixels::Visit( dst, counter, Region( roiLeft, roiTop, roiRight, roiBottom), RowAfterRow  );
  BOOST_CHECK_EQUAL( counter.Count(), 5 );

  queue.enqueueUnmapMemObject( devDst, hostDstPtr );
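
If you want to be explicit that no copy should ever be made, the buffers can also be created with a host-pointer allocation flag, which is the usual zero-copy recommendation for CPU devices. A minimal sketch, reusing the task and queue objects from above:

  // Sketch: ask the runtime to allocate the buffer in host-accessible memory so that
  // map/unmap on the CPU device is a true zero-copy operation.
  cl::Buffer devSrc( task.GetContext(),
                     CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                     width*height );

  void* p = queue.enqueueMapBuffer( devSrc, CL_TRUE, CL_MAP_WRITE, 0, width*height );
  // ... fill the source image through p, exactly as above ...
  queue.enqueueUnmapMemObject( devSrc, p );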




Saturday, March 23, 2013

OpenCL - CPU Implicit Vectorization

The Intel OpenCL SDK boasts that its clever implicit vectorization - or autovectorization - can magically apply SIMD instructions across kernel work-item boundaries. I had my doubts, but I tried it.

Here is a naive translation of the pitch comparison into an OpenCL CPU kernel. I still use fixed-point arithmetic with short here.

__kernel 
__attribute__((vec_type_hint(short8)))  // try to hint the compiler to pack 8 shorts to 128 bits xmm register
void Pitch0( 
  __global const uchar *src, __global uchar *dst, 
  __const int width, __const int height,
  __const int nPitch, 
  __const short toCeiling, __const short toFloor )
{
  int x = get_global_id(0);
  int y = get_global_id(1);
  int idx = y * width + x;

  short eastLeft = src[ idx + nPitch ];
  short eastRight = src[ idx + nPitch + 1];
  short westLeft = src[ idx - nPitch - 1];
  short westRight = src[ idx - nPitch ];
  short center = src[ idx ];

  short east = toFloor * eastLeft + toCeiling * eastRight;
  short west = toFloor * westRight + toCeiling * westLeft;
  
  east = east >> 1;
  west = west >> 1;

  short sum = east + west;
  short c2 = center << 7;
  short diff = abs_diff( c2, sum );

  diff = diff >> 7;
  dst[ idx ] = (uchar)diff; 
}


// HOST code
  ...
  int integerPitch = (int)(pitch);
  short toFloor = (short)( ((pitch - integerPitch) * 0x7F)+0.5f);
  short toCeiling = (short)( 0x7F - toFloor );

  kernelPitch0.setArg( 0, devSrc );
  kernelPitch0.setArg( 1, devDst );
  kernelPitch0.setArg( 2, width );
  kernelPitch0.setArg( 3, height );
  kernelPitch0.setArg( 4, integerPitch );
  kernelPitch0.setArg( 5, toCeiling );
  kernelPitch0.setArg( 6, toFloor );

  cl::NDRange globalws( roiRight-roiLeft+1, roiBottom-roiTop+1 );
  cl::NDRange offset( roiLeft, roiTop );

  ...
  queue.enqueueNDRangeKernel( kernelPitch0, offset, globalws, cl::NullRange, &ve1, &e2 );
  ...




This code gives me only 42 MB/s on an Intel G2120 CPU. As the hand-written SIMD version gave me nearly 1 GB/s, it is quite discouraging. Here are the points I am not sure about ...

  • Will there be too much context switching? Intel says it wants 10,000 - 100,000 instructions per kernel and warns that there will be too many context switches otherwise. Clearly the code above doesn't come close to that many instructions.
  • Should I read the assembler code from the Intel offline compiler? I checked it, but the assembler output is very long and seems to contain framework code as well. Not an easy job, and there isn't much information on the web either.
  • Can it be optimized with short at all? Intel mentions that autovectorization works well with 4-byte types like int or float, and that it prefers signed over unsigned types.
  • I used events to synchronize kernel execution (refer to the OpenCL for NVidia code), though Intel says it is better to write blocking calls rather than non-blocking ones - Intel calls this explicit synchronization.
  • Intel says it does not want to be restricted by a local work size and only wants the global work size specified. Then how can I be sure its cache access is optimized? Remember that a 64x4 local size gave a good result compared to 16x16.
  • Should I use '-cl-fast-relaxed-math'? (A sketch of passing the build option follows this list.)
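
For the last point, the relaxed-math switch is just a build option string. A minimal sketch using the C++ bindings (the program and devices variables are assumed to be the cl::Program and device list created elsewhere in the host code):

  // Sketch: pass -cl-fast-relaxed-math when building the OpenCL program and
  // print the build log if the compilation fails.
  if( program.build( devices, "-cl-fast-relaxed-math" ) != CL_SUCCESS )
    std::cout << program.getBuildInfo< CL_PROGRAM_BUILD_LOG >( devices[0] ) << std::endl;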


Points to try next are

  • Remove the fixed-point arithmetic - use float instead of short
  • Process multiple pixels per kernel work-item. Maybe make just two work-items so that they sit on the two cores.
  • Get your hands dirty. Give up on the implicit magic and try explicit vectorization using vector data types


Reference
  • Writing Optimal OpenCL* Code with the Intel® OpenCL SDK, Intel
  • OpenCL in Action, Matthew Scarpino


Tuesday, March 12, 2013

CUDA - PTX code analysis

Remember that the OpenCL version is much faster than the CUDA version. With almost identical code I was puzzled to see CUDA come out two times slower, and I thought that a PTX assembly comparison could enlighten me.

Here is the OpenCL C code and its PTX assembly. The comments were added to make it easier to follow.

__kernel
void Pitch0(
  __global uchar *d_src, 
  __global uchar *d_dst,
  int width, int height,
  int roiLeft, int roiTop, int roiRight, int roiBottom,
  int integerPitch, float toCeiling, float toFloor )
{
  int x = roiLeft + get_global_id(0);
  int y = roiTop + get_global_id(1);
  int idx = x + y * width;

  bool isInBound = x <= roiRight && y <= roiBottom;

  float east = 
    (float)d_src[ idx - integerPitch - 1 ] * toCeiling +
    (float)d_src[ idx - integerPitch ] * toFloor;
  float west = 
    (float)d_src[ idx + integerPitch ] * toFloor + 
    (float)d_src[ idx + integerPitch + 1 ] * toCeiling;

  float diff = ( (float)d_src[idx] * 2.0f  - ( east + west ) ) / 2.0f;
  diff = diff >= 0 ? diff : -diff;

  if( isInBound )
  {
    d_dst[ idx ] = (uchar)( diff  );
  }
}

.entry Pitch0(
  .param .u32 .ptr .global .align 1 Pitch0_param_0,
  .param .u32 .ptr .global .align 1 Pitch0_param_1,
  .param .u32 Pitch0_param_2,
  .param .u32 Pitch0_param_3,
  .param .u32 Pitch0_param_4,
  .param .u32 Pitch0_param_5,
  .param .u32 Pitch0_param_6,
  .param .u32 Pitch0_param_7,
  .param .u32 Pitch0_param_8,
  .param .f32 Pitch0_param_9,
  .param .f32 Pitch0_param_10
)
{
  .reg .f32   %f<18>;
  .reg .pred   %p<5>;
  .reg .s32   %r<32>;
  .reg .s16   %rc<7>;


  ld.param.u32   %r11, [Pitch0_param_0];  // d_src
  ld.param.u32   %r12, [Pitch0_param_2];  // width
  ld.param.u32   %r13, [Pitch0_param_4];  // roiLeft
  ld.param.u32   %r14, [Pitch0_param_5];  // roiTop
  ld.param.u32   %r15, [Pitch0_param_6];  // roiRight
  ld.param.u32   %r16, [Pitch0_param_7];  // roiBottom
  ld.param.u32   %r17, [Pitch0_param_8];  // integerPitch
  ld.param.f32   %f2, [Pitch0_param_9];   // toCeiling
  ld.param.f32   %f3, [Pitch0_param_10];  // toFloor

  mov.u32   %r3, %envreg3;
  mov.u32   %r4, %ntid.x;
  mov.u32   %r5, %ctaid.x;
  mov.u32   %r6, %tid.x;
  add.s32   %r18, %r3, %r13;           
  mad.lo.s32   %r19, %r5, %r4, %r18;   // r19 = ctaid.x * ntid.x + roiLeft
  add.s32   %r20, %r19, %r6;           // r20 = x = roiLeft + get_global_id(0) = tid.x + ctaid.x * ntid.x + roiLeft 

  mov.u32   %r7, %envreg4;
  mov.u32   %r8, %ntid.y;
  mov.u32   %r9, %ctaid.y;
  mov.u32   %r10, %tid.y;
  add.s32   %r21, %r7, %r14;           
  mad.lo.s32   %r22, %r9, %r8, %r21;   // r22 = ctaid.y * ntid.y + roiTop
  add.s32   %r23, %r22, %r10;          // r23 = y = roiTop + get_global_id(1) = tid.y + ctaid.y * ntid.y + roiTop
  
  mad.lo.s32   %r2, %r23, %r12, %r20;  // r2 = idx = width * y + x
  
  setp.le.s32   %p1, %r23, %r16;       // p1 = y <= roiBottom
  setp.le.s32   %p2, %r20, %r15;       // p2 = x <= roiRight
  and.pred    %p3, %p1, %p2;           // p3 = isInBound = p1 & p2
  
  not.b32   %r24, %r17;                
  add.s32   %r25, %r2, %r24;           
  add.s32   %r26, %r11, %r25;           
  ld.global.u8   %rc1, [%r26];          
  cvt.rn.f32.u8   %f4, %rc1;           // f4 = float( d_src[ idx - integerPitch - 1] )
  ld.global.u8   %rc2, [%r26+1];       
  cvt.rn.f32.u8   %f5, %rc2;           // f5 = float( d_src[ idx - integerPitch] )
  mul.f32   %f6, %f5, %f3;             // f6 = f5 * toFloor
  fma.rn.f32   %f7, %f4, %f2, %f6;     // f7 = east = f4 * toCeiling + f5 * toFloor
  
  add.s32   %r27, %r2, %r17;           
  add.s32   %r28, %r11, %r27;    
  ld.global.u8   %rc3, [%r28];    
  cvt.rn.f32.u8   %f8, %rc3;           // f8 = float( d_src[ idx + integerPitch ] )
  ld.global.u8   %rc4, [%r28+1];    
  cvt.rn.f32.u8   %f9, %rc4;           // f9 = float( d_src[ idx + integerPitch + 1 ] )
  mul.f32   %f10, %f9, %f2;            // f10 = f9 * toCeiling
  fma.rn.f32   %f11, %f8, %f3, %f10;   // f11 = west = f8 * toFloor + f9 * toCeiling
  
  add.s32   %r29, %r11, %r2;           
  ld.global.u8   %rc5, [%r29];    
  cvt.rn.f32.u8   %f12, %rc5;          // f12 = d_src[ idx ]
  add.f32   %f13, %f12, %f12;          
  add.f32   %f14, %f7, %f11;           
  sub.f32   %f15, %f13, %f14;          // f15 = d_src[ idx ]*2 - (east+west)
  div.full.f32   %f16, %f15, 0f40000000; // f16 = diff = (d_src[ idx ]*2 - (east+west)) / 2.0
  setp.ltu.f32   %p4, %f16, 0f00000000;  // p4 = diff < 0
  neg.f32   %f17, %f16;                // f17 = -diff
  selp.f32   %f1, %f17, %f16, %p4;     // f1 = diff = p4 ? -diff : diff  (i.e. fabs)
  @%p3 bra   BB1_2;                    // if isInBound, jump BB1_2

  ret;

BB1_2:
  cvt.rzi.u16.f32   %rc6, %f1;         // rc6 = ushort( diff )
  ld.param.u32   %r31, [Pitch0_param_1];  
  add.s32   %r30, %r31, %r2;           // r30 = d_dst + idx
  st.global.u8   [%r30], %rc6;         // d_dst[idx] = uchar( ushort(diff) )
  ret;
}



Now, the CUDA version of the PTX code is similar. One noticeable difference is the extra 'ld.global' instructions caused by the structures used as kernel function arguments. Here is a snippet.

__global__ 
void KernelPitch0( 
  unsigned char* d_src, unsigned char* d_dst, 
  const Area* imgSize, const Region* roi, const PitchContext* pc )
{
  ...
  int idx = x + y * imgSize->Width;
  bool isInBound = x <= roi->Right && y <= roi->Bottom;
  if( isInBound )
  {
    float east = 
      (float)d_src[ idx - pc->IntegerPitch - 1 ] * pc->ToCeiling +
      (float)d_src[ idx - pc->IntegerPitch ] * pc->ToFloor;
    ...
}

.visible .entry _Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext(
  .param .u32 _Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_0,
  .param .u32 _Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_1,
  .param .u32 _Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_2,
  .param .u32 _Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_3,
  .param .u32 _Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_4
)
{
  ...
  ld.param.u32   %r9, [_Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_2];  // imgSize
  ld.param.u32   %r10, [_Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_3]; // roi
  ld.param.u32   %r11, [_Z12KernelPitch0PhS_PKN4SSKL10Inspection4AreaEPKNS1_6RegionEPK12PitchContext_param_4]; // pc

  ...
  ld.global.u32   %r16, [%r4];   // r16 = roi.Left
  add.s32   %r17, %r15, %r16;  // r17 = roi.Left + tid.x

  ... 
  ld.global.u32   %r22, [%r4+4];  // r22 = roi.Top
  add.s32   %r23, %r21, %r22;
  mad.lo.s32   %r5, %r19, %r20, %r23;         // r5 = y = ntid.y * ctaid.y + roi.Top + tid.y

  ...
  ld.global.u32   %r24, [%r12];   // r24 = imgSize.x
  mad.lo.s32   %r6, %r5, %r24, %r18;         // r6 = idx = imgSize.x * y + x 

  ...
  ld.global.u32   %r25, [%r4+8];  // r25 = roi.Right
  setp.gt.s32   %p1, %r18, %r25;  // p1 = x > roi.Right
  @%p1 bra   BB0_3;
  ...



With this difference in mind, I replaced all the structure-based arguments with value types, like the OpenCL C version. Then I got 1 GB/s; before, it was a meager 428 MB/s.
Entering test case "TestProcess2"
Total time taken : 3955.87 micro sec 
Size of image : 3.98 MB 
Processing time : 1005.97 MB/s 
Leaving test case "TestProcess2"; testing time: 540ms
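
For reference, the refactored kernel takes plain values, mirroring the OpenCL C signature shown above. A sketch - the parameter order is mine and the body is unchanged apart from the field accesses:

__global__ 
void KernelPitch0( 
  unsigned char* d_src, unsigned char* d_dst, 
  int width, int height,
  int roiLeft, int roiTop, int roiRight, int roiBottom,
  int integerPitch, float toCeiling, float toFloor )
{
  int x = roiLeft + blockIdx.x * blockDim.x + threadIdx.x;
  int y = roiTop  + blockIdx.y * blockDim.y + threadIdx.y;
  int idx = x + y * width;
  // ... same arithmetic as before, reading toCeiling/toFloor directly instead of pc->ToCeiling etc.
}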


Using structures as kernel function arguments makes the structures reside in device global memory, which inevitably forces the device to copy the structure members into thread registers, and that eats up valuable memory traffic.

Now, without the structures, the function just needs a number of ld.param.u32 instructions. The ".param" state space is said to be defined per grid, so its physical location is probably global memory - NVidia says it is implementation specific and that some implementations use global memory. But if it really came from global memory, using a structure as an argument shouldn't be two times slower than value-type arguments; either way, the values have to be loaded from global memory.

Another possible explanation is the locality of ld.param.u32. These instructions appear at the beginning of the function and load all the argument values regardless of where they are used in the code. I guess those arguments stay in close proximity, which can be good for the cache.

Another thing to note is the kernel time of 1.7 ms. With OpenCL I got 1.2 ms, so CUDA is still about 0.5 ms slower.

shared memory version
Total time taken : 4406.42 micro sec 
Size of image : 3.98 MB 
Processing time : 903.11 MB/s 

texture memory version
Total time taken : 4372.73 micro sec 
Size of image : 3.98 MB 
Processing time : 910.07 MB/s 


The shared memory and texture memory versions were tried, with the results above, but they are not as fast as the plain version. Another thing to figure out.


Friday, March 8, 2013

OpenCL - PTX code generation


With an NVidia GPU, you can dump the PTX assembly code using CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES. The "binary" you get back is actually PTX assembly, as demonstrated below.

The code below is from lecture 4 of Supercomputing on Graphics Cards - An Introduction to OpenCL and the C++ Bindings. For your convenience, here it is, copied with slight modifications.


const std::vector< size_t > binSizes = _Program.getInfo< CL_PROGRAM_BINARY_SIZES >();
std::vector< unsigned char > binData( 
  std::accumulate( binSizes.begin(), binSizes.end(), size_t( 0 ) )  // total size of all binaries
);

unsigned char* binChunk = &binData[0];

std::vector< unsigned char* > binaries;
for( size_t i=0; i < binSizes.size(); ++i) {
  binaries.push_back( binChunk );
  binChunk += binSizes[i];
}

_Program.getInfo( CL_PROGRAM_BINARIES, &binaries[0] );

std::ofstream binaryFile( filename.c_str() );
if( ! binaryFile.good() )
  throw std::runtime_error( "Failed to open binary file for kernel" );

for( size_t i=0; i < binaries.size(); ++i)
  binaryFile << binaries[i];



Just to compare the result with the previous post, the same kernel, rewritten in OpenCL style, is used as below.
__kernel
void KernelFourInstructions1( float a, float b, float c, __global float *d_a ) {
  a = a * b + c;
  d_a[ get_global_id(0) ] = a;
}

And finally, the generated code is below, with comments added.
//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Sat Sep 29 23:04:42 2012 (1348927482)
// Driver 
//

.version 3.0
.target sm_30, texmode_independent
.address_size 32

.entry KernelFourInstructions1(
  .param .f32 KernelFourInstructions1_param_0,  // float a
  .param .f32 KernelFourInstructions1_param_1,  // float b
  .param .f32 KernelFourInstructions1_param_2,  // float c
  .param .u32 .ptr .global .align 4 KernelFourInstructions1_param_3 // float *d_a
)
{
  .reg .f32   %f<5>;
  .reg .s32   %r<10>;

  ld.param.f32   %f1, [KernelFourInstructions1_param_0];
  ld.param.f32   %f2, [KernelFourInstructions1_param_1];
  ld.param.f32   %f3, [KernelFourInstructions1_param_2];
  ld.param.u32   %r5, [KernelFourInstructions1_param_3];
  fma.rn.f32   %f4, %f1, %f2, %f3;   // f4 <- a * b + c
  mov.u32   %r1, %envreg3;           // What is envreg3 ? Must be zero.
  mov.u32   %r2, %ntid.x;            // ntid.x = threads per CTA in x (up to 1024 on sm_30)
  mov.u32   %r3, %ctaid.x;           // CTA identifier within a grid
  mov.u32   %r4, %tid.x;             // thread identifier within a CTA
  add.s32   %r6, %r4, %r1;           // r6 <- tid.x + envreg3
  mad.lo.s32   %r7, %r3, %r2, %r6;   // r7 <- ctaid.x * ntid.x + r6 : equivalent of get_global_id(0)
  shl.b32   %r8, %r7, 2;           
  add.s32   %r9, %r5, %r8;           // r9 <- r5 + r8 : &d_a[ get_global_id(0) ]
  st.global.f32   [%r9], %f4;        // [ r9 ] <- f4
  ret;
}


This PTX code is slightly bloated compared to the previous CUDA PTX code. It is due to get_global_id(0) being used instead of threadIdx.x: get_global_id(0) has to account for the block size (and any global offset), so it has to be computed from ntid, ctaid and tid.

There is one alien here: "envreg3". The PTX document says the %envreg registers are driver-defined read-only registers, described in the driver documentation - which I cannot find. The value should be zero in the context of the code above. I guess OpenCL's run-time kernel compilation allows NVidia to use very hardware-specific variables, and maybe that is the real strength of OpenCL's run-time compilation versus CUDA's static compilation.

Tuesday, March 5, 2013

CUDA - PTX assembly 1

One way to figure out why the OpenCL output is two times faster than CUDA is to look at the machine code. But to read assembly code you need to train yourself, so from now on there will be a number of posts on CUDA PTX assembly.

One of the CUDA NVCC compiler options is -ptx, which generates PTX assembly code; in my build the output can be found in xxx.cu.obj.
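
If you prefer a standalone file, the same flag can be run from the command line; a sketch (the file names are illustrative):

  nvcc -ptx KernelPitch.cu -o KernelPitch.ptx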

Below is the .cu code and its PTX assembly code with added comments.


__global__ 
void KernelFourInstructions1( float a, float b, float c, float *d_a )
{
  a = a * b + c;
  d_a[ threadIdx.x ] = a;
}


.visible .entry _Z23KernelFourInstructions1fffPf(
  .param .f32 _Z23KernelFourInstructions1fffPf_param_0,  // float a
  .param .f32 _Z23KernelFourInstructions1fffPf_param_1,  // float b
  .param .f32 _Z23KernelFourInstructions1fffPf_param_2,  // float c
  .param .u32 _Z23KernelFourInstructions1fffPf_param_3   // float *d_a
)
{
  .reg .s32   %r<6>;  // local variable r0 to r5 as 32 bit signed integer
  .reg .f32   %f<5>;  // local variable f0 to f4 as 32 bit floating point 


  ld.param.f32   %f1, [_Z23KernelFourInstructions1fffPf_param_0]; 
  ld.param.f32   %f2, [_Z23KernelFourInstructions1fffPf_param_1];  
  ld.param.f32   %f3, [_Z23KernelFourInstructions1fffPf_param_2];
  ld.param.u32   %r1, [_Z23KernelFourInstructions1fffPf_param_3];
  cvta.to.global.u32   %r2, %r1;    // r2 <- d_a address as global memory
  .loc 2 31 1                       // .loc is debugging directive pointing source file location
  fma.rn.f32   %f4, %f1, %f2, %f3;  // f4 <- a * b + c
  .loc 2 33 1
  mov.u32   %r3, %tid.x;
  shl.b32   %r4, %r3, 2;
  add.s32   %r5, %r2, %r4;          // r5 <- &d_a[ threadIdx.x ]
  st.global.f32   [%r5], %f4;       // d_a[ threadIdx.x ] <- f4
  .loc 2 34 2
  ret;
}


It is a bit surprising to see that it needs 6 integer registers and 5 floating-point registers. For example, r3 could be reused as the destination of shl.b32 and add.s32, and f1 could be the destination of fma.rn.f32. I wonder whether the code below does exactly the same thing with fewer local variables. (To be fair, PTX registers are virtual; the ptxas backend does its own register allocation, so the final machine code will reuse hardware registers regardless of how the PTX is written.)

  ...
  cvta.to.global.u32   %r2, %r1;
  fma.rn.f32   %f1, %f1, %f2, %f3;  // f1 <- a * b + c, f1 instead of f4
  mov.u32   %r3, %tid.x;
  shl.b32   %r3, %r3, 2;            // r3 is reused as destination
  add.s32   %r3, %r2, %r3;          // again r3 is reused
  st.global.f32   [%r3], %f1;       
  ...



Reference : Parallel Thread Execution ISA Version 3.1, NVIDIA