essay on programming languages, computer science, information technologies and all.

Friday, March 8, 2013

OpenCL - PTX code generation


With an NVidia GPU, you can dump the PTX assembly code of an OpenCL kernel by querying CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES. The "binary" you get back is actually PTX assembly text, as demonstrated below.
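At the raw C API level this boils down to two clGetProgramInfo calls. Here is a minimal single-device sketch (the helper name dumpBinary is mine, and program is assumed to be an already-built cl_program):

#include <CL/cl.h>
#include <vector>

// Query the size of the binary, then the binary itself, for one device.
std::vector< unsigned char > dumpBinary( cl_program program )
{
  size_t binSize = 0;
  clGetProgramInfo( program, CL_PROGRAM_BINARY_SIZES,
                    sizeof( binSize ), &binSize, NULL );

  std::vector< unsigned char > ptx( binSize );
  unsigned char* ptxPtr = &ptx[0];
  clGetProgramInfo( program, CL_PROGRAM_BINARIES,
                    sizeof( ptxPtr ), &ptxPtr, NULL );
  return ptx;   // on NVidia this is null-terminated PTX text
}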

The code below is from lecture 4 of Supercomputing on Graphics Cards - An Introduction to OpenCL and the C++ Bindings. For your convenience, here is a copy of the code with slight modifications.
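The snippet assumes a cl::Program named _Program that has already been built, plus a filename string to write to. A rough setup along those lines, leading directly into the snippet below (the kernel file name, output file name, and device choice are my assumptions, not part of the lecture code):

#include <CL/cl.hpp>     // OpenCL C++ bindings
#include <fstream>
#include <iterator>
#include <numeric>
#include <stdexcept>
#include <string>
#include <vector>

std::vector< cl::Platform > platforms;
cl::Platform::get( &platforms );

cl_context_properties props[] = {
  CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0](), 0
};
cl::Context context( CL_DEVICE_TYPE_GPU, props );
std::vector< cl::Device > devices = context.getInfo< CL_CONTEXT_DEVICES >();

// Load the kernel source and build it; the PTX is generated by build().
std::ifstream sourceFile( "FourInstructions.cl" );
std::string source( ( std::istreambuf_iterator< char >( sourceFile ) ),
                    std::istreambuf_iterator< char >() );
cl::Program::Sources sources( 1, std::make_pair( source.c_str(), source.size() ) );

cl::Program _Program( context, sources );
_Program.build( devices );

std::string filename( "KernelFourInstructions1.ptx" );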


const std::vector< size_t > binSizes = _Program.getInfo< CL_PROGRAM_BINARY_SIZES >();

// One contiguous buffer big enough to hold every device's binary.
std::vector< unsigned char > binData( 
  std::accumulate( binSizes.begin(), binSizes.end(), size_t( 0 ) )
);

unsigned char* binChunk = &binData[0];

// One pointer per device, each pointing into its slice of binData.
std::vector< unsigned char* > binaries;
for( size_t i=0; i < binSizes.size(); ++i) {
  binaries.push_back( binChunk );
  binChunk += binSizes[i];
}

_Program.getInfo( CL_PROGRAM_BINARIES, &binaries[0] );

std::ofstream binaryFile( filename.c_str() );
if( ! binaryFile.good() )
  throw std::runtime_error( "Failed to open binary file for kernel" );

// On NVidia each "binary" is null-terminated PTX text, so operator<< works.
for( size_t i=0; i < binaries.size(); ++i)
  binaryFile << binaries[i];
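As a side note, the dumped file is useful for more than inspection: the same bytes can be handed back through the cl::Program constructor that wraps clCreateProgramWithBinary, skipping the source-to-PTX compile on later runs. A hedged sketch, reusing the context and devices from the setup above:

// Reload the previously dumped PTX and rebuild a program from it.
std::ifstream ptxFile( filename.c_str() );
std::string ptx( ( std::istreambuf_iterator< char >( ptxFile ) ),
                 std::istreambuf_iterator< char >() );

cl::Program::Binaries ptxBinaries( 1, std::make_pair( ptx.c_str(), ptx.size() ) );
std::vector< cl_int > binaryStatus;
cl::Program reloaded( context, devices, ptxBinaries, &binaryStatus );
reloaded.build( devices );   // the PTX-to-SASS step still happens here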



To compare the result with the previous post, the same kernel code, modified to OpenCL style, is used as below.
__kernel
void KernelFourInstructions1( float a, float b, float c, __global float *d_a ) {
  a = a * b + c;
  d_a[ get_global_id(0) ] = a;
}
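For completeness, here is a rough sketch of how this kernel would be set up and launched from the host (the argument values, buffer size, and NDRange are my own choices); the local size picked at enqueue time is what %ntid.x reflects in the PTX that follows:

// Set the kernel arguments and enqueue it; get_global_id(0) inside the
// kernel is derived from the NDRange configured here.
cl::Kernel kernel( _Program, "KernelFourInstructions1" );
cl::Buffer d_a( context, CL_MEM_WRITE_ONLY, 1024 * sizeof( cl_float ) );

kernel.setArg( 0, 2.0f );                         // a
kernel.setArg( 1, 3.0f );                         // b
kernel.setArg( 2, 4.0f );                         // c
kernel.setArg( 3, d_a );                          // d_a

cl::CommandQueue queue( context, devices[0] );
queue.enqueueNDRangeKernel( kernel,
                            cl::NullRange,        // global work offset
                            cl::NDRange( 1024 ),  // global size
                            cl::NDRange( 256 ) ); // local size -> %ntid.x
queue.finish();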

And finally, here is the generated PTX code, with comments added.
//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Sat Sep 29 23:04:42 2012 (1348927482)
// Driver 
//

.version 3.0
.target sm_30, texmode_independent
.address_size 32

.entry KernelFourInstructions1(
  .param .f32 KernelFourInstructions1_param_0,  // float a
  .param .f32 KernelFourInstructions1_param_1,  // float b
  .param .f32 KernelFourInstructions1_param_2,  // float c
  .param .u32 .ptr .global .align 4 KernelFourInstructions1_param_3 // float *d_a
)
{
  .reg .f32   %f<5>;
  .reg .s32   %r<10>;

  ld.param.f32   %f1, [KernelFourInstructions1_param_0];
  ld.param.f32   %f2, [KernelFourInstructions1_param_1];
  ld.param.f32   %f3, [KernelFourInstructions1_param_2];
  ld.param.u32   %r5, [KernelFourInstructions1_param_3];
  fma.rn.f32   %f4, %f1, %f2, %f3;   // f4 <- a * b + c
  mov.u32   %r1, %envreg3;           // What is envreg3 ? Must be zero.
  mov.u32   %r2, %ntid.x;            // ntid.x: threads per CTA in x (up to 1024 on sm_30)
  mov.u32   %r3, %ctaid.x;           // CTA identifier within a grid
  mov.u32   %r4, %tid.x;             // thread identifier within a CTA
  add.s32   %r6, %r4, %r1;           // r6 <- tid.x + envreg3
  mad.lo.s32   %r7, %r3, %r2, %r6;   // r7 <- ctaid.x * ntid.x + r6 : equivalent of get_global_id(0)
  shl.b32   %r8, %r7, 2;           
  add.s32   %r9, %r5, %r8;           // r9 <- r5 + r8 : &d_a[ get_global_id(0) ]
  st.global.f32   [%r9], %f4;        // [ r9 ] <- f4
  ret;
}


This PTX code is slightly bloated compared to the previous CUDA PTX code. That is because of get_global_id(0) being used instead of threadIdx.x: get_global_id(0) has to account for the block size, so it has to refer to ntid and ctaid.
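Spelled out, the mad.lo.s32 / add.s32 sequence is just the usual global-ID formula. A hypothetical little helper that mirrors the PTX arithmetic:

#include <cstddef>

// Mirrors what the PTX computes for get_global_id(0):
//   %r7 = %ctaid.x * %ntid.x + ( %tid.x + %envreg3 )
// i.e. group id * local size + local id, plus whatever %envreg3 holds.
std::size_t globalId0( std::size_t groupId, std::size_t localSize,
                       std::size_t localId, std::size_t envreg3 )
{
  return groupId * localSize + ( localId + envreg3 );
}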

There is one alien: "envreg3". The PTX documentation says the %envreg registers are driver-defined read-only registers and that they are described in the driver documentation, which I cannot find. The value should be zero in the above code context. I guess OpenCL's run-time kernel compilation allows NVidia to use very hardware-specific variables, and maybe that is the real strength of OpenCL's run-time compilation versus CUDA's static compilation.
