One of CUDA NVCC compiler option is -ptx which generate PTX assembly code and the code can be found at xxx.cu.obj.
Below is a cu code and it's PTX assembly code with added comment.
__global__
void KernelFourInstructions1( float a, float b, float c, float *d_a )
{
a = a * b + c;
d_a[ threadIdx.x ] = a;
}
.visible .entry _Z23KernelFourInstructions1fffPf(
.param .f32 _Z23KernelFourInstructions1fffPf_param_0, // float a
.param .f32 _Z23KernelFourInstructions1fffPf_param_1, // float b
.param .f32 _Z23KernelFourInstructions1fffPf_param_2, // float c
.param .u32 _Z23KernelFourInstructions1fffPf_param_3 // float *d_a
)
{
.reg .s32 %r<6>; // local variable r0 to r5 as 32 bit signed integer
.reg .f32 %f<5>; // local variable f0 to f4 as 32 bit floating point
ld.param.f32 %f1, [_Z23KernelFourInstructions1fffPf_param_0];
ld.param.f32 %f2, [_Z23KernelFourInstructions1fffPf_param_1];
ld.param.f32 %f3, [_Z23KernelFourInstructions1fffPf_param_2];
ld.param.u32 %r1, [_Z23KernelFourInstructions1fffPf_param_3];
cvta.to.global.u32 %r2, %r1; // r2 <- d_a address as global memory
.loc 2 31 1 // .loc is debugging directive pointing source file location
fma.rn.f32 %f4, %f1, %f2, %f3; // f4 <- a * b + c
.loc 2 33 1
mov.u32 %r3, %tid.x;
shl.b32 %r4, %r3, 2;
add.s32 %r5, %r2, %r4; // r5 <- &d_a[ threadIdx.x ]
st.global.f32 [%r5], %f4; // d_a[ threadIdx.x ] <- f4
.loc 2 34 2
ret;
}
It is a bit surprising to see that it needs 6 integer registers and 5 floating point registers. For example r3 can be reused as the destination of shl.b.32 and add.s32. Or f1 also can be as destination of fma.rn.f32. I am wondering below code is doing exactly same with less number of local variable.
... cvta.to.global.u32 %r2, %r1; fma.rn.f32 %f1, %f1, %f2, %f3; // f1 <- a * b + c, f1 instead of f4 mov.u32 %r3, %tid.x; shl.b32 %r3, %r3, 2; // r3 is reused as destination add.s32 %r3, %r2, %r3; // again r3 is reused st.global.f32 [%r3], %f1; ...
Reference : Parallel Thread Execution ISA Version 3.1, NVIDIA
No comments:
Post a Comment