essay on programming languages, computer science, information techonlogies and all.

Tuesday, March 5, 2013

CUDA - PTX assembly 1

One way to figure out why OpenCL output is two times faster than CUDA is to looking at the machine code. Though to look at the assembly code, you need to train yourself. From now on, there will be number of post on the CUDA PTX assembly code.

One of CUDA NVCC compiler option is -ptx which generate PTX assembly code and the code can be found at xxx.cu.obj.

Below is a cu code and it's PTX assembly code with added comment.


__global__ 
void KernelFourInstructions1( float a, float b, float c, float *d_a )
{
  a = a * b + c;
  d_a[ threadIdx.x ] = a;
}


.visible .entry _Z23KernelFourInstructions1fffPf(
  .param .f32 _Z23KernelFourInstructions1fffPf_param_0,  // float a
  .param .f32 _Z23KernelFourInstructions1fffPf_param_1,  // float b
  .param .f32 _Z23KernelFourInstructions1fffPf_param_2,  // float c
  .param .u32 _Z23KernelFourInstructions1fffPf_param_3   // float *d_a
)
{
  .reg .s32   %r<6>;  // local variable r0 to r5 as 32 bit signed integer
  .reg .f32   %f<5>;  // local variable f0 to f4 as 32 bit floating point 


  ld.param.f32   %f1, [_Z23KernelFourInstructions1fffPf_param_0]; 
  ld.param.f32   %f2, [_Z23KernelFourInstructions1fffPf_param_1];  
  ld.param.f32   %f3, [_Z23KernelFourInstructions1fffPf_param_2];
  ld.param.u32   %r1, [_Z23KernelFourInstructions1fffPf_param_3];
  cvta.to.global.u32   %r2, %r1;    // r2 <- d_a address as global memory
  .loc 2 31 1                       // .loc is debugging directive pointing source file location
  fma.rn.f32   %f4, %f1, %f2, %f3;  // f4 <- a * b + c
  .loc 2 33 1
  mov.u32   %r3, %tid.x;
  shl.b32   %r4, %r3, 2;
  add.s32   %r5, %r2, %r4;          // r5 <- &d_a[ threadIdx.x ]
  st.global.f32   [%r5], %f4;       // d_a[ threadIdx.x ] <- f4
  .loc 2 34 2
  ret;
}


It is a bit surprising to see that it needs 6 integer registers and 5 floating point registers. For example r3 can be reused as the destination of shl.b.32 and add.s32. Or f1 also can be as destination of fma.rn.f32. I am wondering below code is doing exactly same with less number of local variable.

  ...
  cvta.to.global.u32   %r2, %r1;
  fma.rn.f32   %f1, %f1, %f2, %f3;  // f1 <- a * b + c, f1 instead of f4
  mov.u32   %r3, %tid.x;
  shl.b32   %r3, %r3, 2;            // r3 is reused as destination
  add.s32   %r3, %r2, %r3;          // again r3 is reused
  st.global.f32   [%r3], %f1;       
  ...



Reference : Parallel Thread Execution ISA Version 3.1, NVIDIA

No comments: