One of CUDA NVCC compiler option is -ptx which generate PTX assembly code and the code can be found at xxx.cu.obj.
Below is a cu code and it's PTX assembly code with added comment.
__global__ void KernelFourInstructions1( float a, float b, float c, float *d_a ) { a = a * b + c; d_a[ threadIdx.x ] = a; }
.visible .entry _Z23KernelFourInstructions1fffPf( .param .f32 _Z23KernelFourInstructions1fffPf_param_0, // float a .param .f32 _Z23KernelFourInstructions1fffPf_param_1, // float b .param .f32 _Z23KernelFourInstructions1fffPf_param_2, // float c .param .u32 _Z23KernelFourInstructions1fffPf_param_3 // float *d_a ) { .reg .s32 %r<6>; // local variable r0 to r5 as 32 bit signed integer .reg .f32 %f<5>; // local variable f0 to f4 as 32 bit floating point ld.param.f32 %f1, [_Z23KernelFourInstructions1fffPf_param_0]; ld.param.f32 %f2, [_Z23KernelFourInstructions1fffPf_param_1]; ld.param.f32 %f3, [_Z23KernelFourInstructions1fffPf_param_2]; ld.param.u32 %r1, [_Z23KernelFourInstructions1fffPf_param_3]; cvta.to.global.u32 %r2, %r1; // r2 <- d_a address as global memory .loc 2 31 1 // .loc is debugging directive pointing source file location fma.rn.f32 %f4, %f1, %f2, %f3; // f4 <- a * b + c .loc 2 33 1 mov.u32 %r3, %tid.x; shl.b32 %r4, %r3, 2; add.s32 %r5, %r2, %r4; // r5 <- &d_a[ threadIdx.x ] st.global.f32 [%r5], %f4; // d_a[ threadIdx.x ] <- f4 .loc 2 34 2 ret; }
It is a bit surprising to see that it needs 6 integer registers and 5 floating point registers. For example r3 can be reused as the destination of shl.b.32 and add.s32. Or f1 also can be as destination of fma.rn.f32. I am wondering below code is doing exactly same with less number of local variable.
... cvta.to.global.u32 %r2, %r1; fma.rn.f32 %f1, %f1, %f2, %f3; // f1 <- a * b + c, f1 instead of f4 mov.u32 %r3, %tid.x; shl.b32 %r3, %r3, 2; // r3 is reused as destination add.s32 %r3, %r2, %r3; // again r3 is reused st.global.f32 [%r3], %f1; ...
Reference : Parallel Thread Execution ISA Version 3.1, NVIDIA
No comments:
Post a Comment