With an NVidia GPU, you can dump the PTX assembly code of an OpenCL kernel by querying CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES. The "binary" you get back is actually PTX assembly text, as demonstrated below.
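Before the C++-bindings version from the lecture below, here is roughly what the same two queries look like with the plain C API. This is a hedged sketch of my own (not from the post): it assumes a single device and an already-built cl_program, and the function name dump_ptx is made up; error checking is omitted.

// Hedged sketch (not from the original post): the same two queries with the
// plain C API, assuming a single device and an already-built cl_program.
#include <CL/cl.h>
#include <fstream>
#include <vector>

void dump_ptx( cl_program program, const char* filename )
{
    size_t binarySize = 0;
    clGetProgramInfo( program, CL_PROGRAM_BINARY_SIZES,
                      sizeof( binarySize ), &binarySize, NULL );

    std::vector< unsigned char > binary( binarySize );
    unsigned char* binaryPtr = &binary[0];
    clGetProgramInfo( program, CL_PROGRAM_BINARIES,
                      sizeof( binaryPtr ), &binaryPtr, NULL );

    // On NVidia the "binary" is PTX text, so it can be written out as-is.
    std::ofstream out( filename );
    out.write( reinterpret_cast< const char* >( &binary[0] ),
               static_cast< std::streamsize >( binarySize ) );
}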
The C++-bindings code below is from lecture 4 of Supercomputing on Graphics Cards - An Introduction to OpenCL and the C++ Bindings. For your convenience, here it is with slight modifications.
const std::vector< size_t > binSizes = _Program.getInfo< CL_PROGRAM_BINARY_SIZES >();
std::vector< unsigned char > binData(
    std::accumulate( binSizes.begin(), binSizes.end(), size_t( 0 ) )
);
// Build a pointer table so each device's binary lands in its own chunk.
unsigned char* binChunk = &binData[0];
std::vector< unsigned char* > binaries;
for( size_t i = 0; i < binSizes.size(); ++i ) {
    binaries.push_back( binChunk );
    binChunk += binSizes[i];
}
_Program.getInfo( CL_PROGRAM_BINARIES, &binaries[0] );

std::ofstream binaryFile( filename.c_str() );
if( ! binaryFile.good() )
    throw std::runtime_error( "Failed to open binary file for kernel" );
// On NVidia the "binaries" are PTX text, so streaming them out gives a readable dump.
for( size_t i = 0; i < binaries.size(); ++i )
    binaryFile << binaries[i];
To compare the result with the previous post, the same kernel code, modified to OpenCL style, is used as below.
__kernel
void KernelFourInstructions1( float a, float b, float c, __global float *d_a ) {
    a = a * b + c;
    d_a[ get_global_id(0) ] = a;
}
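For completeness, here is a hedged sketch (not from the post or the lecture slides) of the host-side setup that would produce the _Program object used in the dump code above; the platform/device selection, variable names, and the main() wrapper are my own assumptions.

// Hedged sketch (not from the original post): builds the kernel above into a
// cl::Program that can be passed to the dump code as _Program.
#include <CL/cl.hpp>
#include <string>
#include <vector>

int main()
{
    std::vector< cl::Platform > platforms;
    cl::Platform::get( &platforms );

    std::vector< cl::Device > devices;
    platforms[0].getDevices( CL_DEVICE_TYPE_GPU, &devices );

    cl::Context context( devices );

    // Kernel source shown above.
    const std::string source =
        "__kernel void KernelFourInstructions1( float a, float b, float c, "
        "                                       __global float *d_a ) { "
        "    a = a * b + c; "
        "    d_a[ get_global_id(0) ] = a; "
        "}";

    cl::Program::Sources sources(
        1, std::make_pair( source.c_str(), source.size() ) );
    cl::Program program( context, sources );
    program.build( devices );   // the run-time compile happens here

    // 'program' is the _Program queried in the dump code above.
    return 0;
}

On NVidia, the build step is what invokes the NVVM run-time compiler, so the PTX written out by the dump code only exists after program.build() has succeeded.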
And finally, the dumped PTX code is shown below, with comments added.
//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Sat Sep 29 23:04:42 2012 (1348927482)
// Driver
//
.version 3.0
.target sm_30, texmode_independent
.address_size 32
.entry KernelFourInstructions1(
.param .f32 KernelFourInstructions1_param_0, // float a
.param .f32 KernelFourInstructions1_param_1, // float b
.param .f32 KernelFourInstructions1_param_2, // float c
.param .u32 .ptr .global .align 4 KernelFourInstructions1_param_3 // float *d_a
)
{
.reg .f32 %f< 5 >;
.reg .s32 %r< 10 >;
ld.param.f32 %f1, [KernelFourInstructions1_param_0];
ld.param.f32 %f2, [KernelFourInstructions1_param_1];
ld.param.f32 %f3, [KernelFourInstructions1_param_2];
ld.param.u32 %r5, [KernelFourInstructions1_param_3];
fma.rn.f32 %f4, %f1, %f2, %f3; // f4 <- a * b + c
mov.u32 %r1, %envreg3; // What is envreg3 ? Must be zero.
mov.u32 %r2, %ntid.x; // ntid.x: number of threads per CTA (block size), up to 1024 on sm_30
mov.u32 %r3, %ctaid.x; // CTA identifier within a grid
mov.u32 %r4, %tid.x; // thread identifier within a CTA
add.s32 %r6, %r4, %r1; // r6 <- tid.x + envreg3
mad.lo.s32 %r7, %r3, %r2, %r6; // r7 <- ctaid.x * ntid.x + r6 : equivalent of get_global_id(0)
shl.b32 %r8, %r7, 2; // r8 <- r7 * 4 : byte offset of a float element
add.s32 %r9, %r5, %r8; // r9 <- r5 + r8 : &d_a[ get_global_id(0) ]
st.global.f32 [%r9], %f4; // [ r9 ] <- f4
ret;
}
This PTX code is slightly bloated compared to the CUDA PTX code from the previous post. That is due to get_global_id(0) instead of threadIdx.x: get_global_id(0) has to account for the block size, so it must read %ntid and %ctaid.
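To make that mapping explicit, get_global_id(0) can be written out by hand with the OpenCL work-item functions. The kernel below is my own illustration (not from the post) of the same ctaid.x * ntid.x + (tid.x + offset) computation the PTX performs, with get_global_offset(0) (OpenCL 1.1) standing in for whatever %envreg3 supplies; the kernel name is made up.

// Illustration only (not from the post): computing the global id manually,
// mirroring mad.lo.s32 %r7, %ctaid.x, %ntid.x, (%tid.x + %envreg3) above.
__kernel
void KernelManualGlobalId( float a, float b, float c, __global float *d_a ) {
    size_t gid = get_group_id(0) * get_local_size(0)   // ctaid.x * ntid.x
               + get_local_id(0)                       // + tid.x
               + get_global_offset(0);                 // + offset (0 when none is given)
    d_a[ gid ] = a * b + c;        // same result as the original kernel
}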
There is one alien register, "envreg3". The PTX documentation says the %envreg registers are driver-defined read-only registers and that they are described in the driver documentation, which I cannot find anywhere. In the context of the code above, its value should be zero. My guess is that OpenCL's run-time kernel compilation allows NVidia to use very hardware-specific variables like this, and maybe that is the real strength of OpenCL's run-time compilation versus CUDA's static compilation.
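For reference, here is how the launch looks with the C++ bindings, continuing the setup sketch after the kernel above; the queue, buffer, argument values, and work sizes are my assumptions. Passing cl::NullRange as the global_work_offset means get_global_offset(0) is zero, which is consistent with reading %envreg3 as zero in this context.

// Hedged sketch (not from the original post): launching the kernel with no
// global work offset, continuing from the setup sketch above.
cl::CommandQueue queue( context, devices[0] );
cl::Buffer d_a( context, CL_MEM_WRITE_ONLY, 1024 * sizeof( float ) );

cl::Kernel kernel( program, "KernelFourInstructions1" );
kernel.setArg( 0, 2.0f );   // a
kernel.setArg( 1, 3.0f );   // b
kernel.setArg( 2, 4.0f );   // c
kernel.setArg( 3, d_a );

queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,           // global_work_offset: none, so get_global_offset(0) == 0
    cl::NDRange( 1024 ),     // global work size
    cl::NDRange( 256 ) );    // local work size: this becomes %ntid.x at run time
queue.finish();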