With an NVIDIA GPU, you can dump the PTX assembly code using CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES. The "binary" you get back is actually PTX assembly code, as demonstrated below.
The code below is from lecture 4 of Supercomputing on Graphics Cards - An Introduction to OpenCL and the C++ Bindings. For your convenience, here is a copy of the code with slight modifications.
#include <fstream>
#include <numeric>
#include <stdexcept>
#include <vector>

const std::vector< size_t > binSizes = _Program.getInfo< CL_PROGRAM_BINARY_SIZES >();
std::vector< unsigned char > binData( std::accumulate( binSizes.begin(), binSizes.end(), size_t( 0 ) ) );
unsigned char* binChunk = &binData[0];

std::vector< unsigned char* > binaries;
for( size_t i = 0; i < binSizes.size(); ++i )
{
    binaries.push_back( binChunk );
    binChunk += binSizes[i];
}

_Program.getInfo( CL_PROGRAM_BINARIES, &binaries[0] );

std::ofstream binaryFile( filename.c_str() );
if( !binaryFile.good() )
    throw std::runtime_error( "Failed to open binary file for kernel" );

// On NVIDIA, each "binary" is null-terminated PTX text, so streaming the
// pointer writes the whole listing.
for( size_t i = 0; i < binaries.size(); ++i )
    binaryFile << binaries[i];
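For context, the snippet assumes an already-built cl::Program named _Program and an output file name. Here is a minimal, hypothetical host-side sketch (error handling omitted; the kernel source string and the file name are placeholders of mine) showing where the dump code fits:

#include <CL/cl.hpp>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::vector< cl::Platform > platforms;
    cl::Platform::get( &platforms );

    std::vector< cl::Device > devices;
    platforms[0].getDevices( CL_DEVICE_TYPE_GPU, &devices );

    cl::Context context( devices );

    // Placeholder kernel source; any kernel string works here.
    const std::string source =
        "__kernel void KernelFourInstructions1( float a, float b, float c, __global float *d_a )\n"
        "{ a = a * b + c; d_a[ get_global_id(0) ] = a; }\n";

    cl::Program _Program( context,
        cl::Program::Sources( 1, std::make_pair( source.c_str(), source.size() ) ) );
    _Program.build( devices );

    const std::string filename = "KernelFourInstructions1.ptx";

    // ... the dump code from above goes here, using _Program and filename ...

    return 0;
}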
To compare the result with the previous post, the same kernel, modified to OpenCL style, is used as below.
__kernel void KernelFourInstructions1( float a, float b, float c, __global float *d_a )
{
    a = a * b + c;
    d_a[ get_global_id(0) ] = a;
}
And finally, the generated PTX code is shown below, with comments added.
//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Sat Sep 29 23:04:42 2012 (1348927482)
// Driver
//

.version 3.0
.target sm_30, texmode_independent
.address_size 32

.entry KernelFourInstructions1(
    .param .f32 KernelFourInstructions1_param_0,                         // float a
    .param .f32 KernelFourInstructions1_param_1,                         // float b
    .param .f32 KernelFourInstructions1_param_2,                         // float c
    .param .u32 .ptr .global .align 4 KernelFourInstructions1_param_3    // float *d_a
)
{
    .reg .f32 %f<5>;
    .reg .s32 %r<10>;

    ld.param.f32   %f1, [KernelFourInstructions1_param_0];
    ld.param.f32   %f2, [KernelFourInstructions1_param_1];
    ld.param.f32   %f3, [KernelFourInstructions1_param_2];
    ld.param.u32   %r5, [KernelFourInstructions1_param_3];
    fma.rn.f32     %f4, %f1, %f2, %f3;   // f4 <- a * b + c
    mov.u32        %r1, %envreg3;        // What is envreg3? Must be zero here.
    mov.u32        %r2, %ntid.x;         // ntid.x = threads per CTA in x (at most 1024 on sm_30)
    mov.u32        %r3, %ctaid.x;        // CTA identifier within a grid
    mov.u32        %r4, %tid.x;          // thread identifier within a CTA
    add.s32        %r6, %r4, %r1;        // r6 <- tid.x + envreg3
    mad.lo.s32     %r7, %r3, %r2, %r6;   // r7 <- ctaid.x * ntid.x + r6 : equivalent of get_global_id(0)
    shl.b32        %r8, %r7, 2;          // r8 <- r7 * sizeof(float)
    add.s32        %r9, %r5, %r8;        // r9 <- r5 + r8 : &d_a[ get_global_id(0) ]
    st.global.f32  [%r9], %f4;           // [ r9 ] <- f4
    ret;
}
This PTX code is slightly bloated compared to the CUDA PTX code in the previous post. The difference comes from using get_global_id(0) instead of threadIdx.x: get_global_id(0) has to account for the work-group size, so it must read ntid and ctaid as well.
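To make the mapping concrete, here is the same kernel with get_global_id(0) hand-expanded using the standard OpenCL work-item functions (a sketch for illustration only; the _manual suffix is mine). The expansion mirrors the ctaid.x * ntid.x + tid.x computation in the PTX, with get_global_offset(0) playing the role of the remaining additive term:

__kernel void KernelFourInstructions1_manual( float a, float b, float c, __global float *d_a )
{
    // Hand-expanded equivalent of get_global_id(0) for dimension 0:
    //   global offset + group id * local size + local id
    // which corresponds to envreg3 + ctaid.x * ntid.x + tid.x in the PTX above.
    size_t gid = get_global_offset(0) + get_group_id(0) * get_local_size(0) + get_local_id(0);
    a = a * b + c;
    d_a[ gid ] = a;
}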
There is one alien register, "envreg3". The PTX documentation says the %envreg registers are driver-defined read-only special registers and refers to the driver documentation for details, which I could not find. In this code it sits exactly where a global work offset would go (it is added to tid.x before the ctaid.x * ntid.x term), so it most likely carries the global_work_offset argument of clEnqueueNDRangeKernel, which is zero in the context above. I guess OpenCL's run-time kernel compilation allows NVIDIA to use such very hardware-specific registers, and maybe that is the real strength of OpenCL's run-time compilation versus CUDA's static compilation.
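If envreg3 really is the global work offset, launching the kernel with a non-zero offset should exercise it. Below is a hedged host-side sketch of such an experiment using the C++ bindings; it assumes the context, devices, and _Program objects from the earlier sketch, and the buffer size, offset, and work sizes are arbitrary placeholders:

// Hypothetical experiment: launch with a non-zero global work offset and
// check which output elements get written (sizes below are arbitrary).
cl::Kernel kernel( _Program, "KernelFourInstructions1" );
cl::Buffer d_a( context, CL_MEM_WRITE_ONLY, 1024 * sizeof( cl_float ) );

kernel.setArg( 0, 2.0f );   // a
kernel.setArg( 1, 3.0f );   // b
kernel.setArg( 2, 4.0f );   // c
kernel.setArg( 3, d_a );

cl::CommandQueue queue( context, devices[0] );
queue.enqueueNDRangeKernel( kernel,
                            cl::NDRange( 16 ),    // global work offset: get_global_id(0) starts at 16
                            cl::NDRange( 256 ),   // global work size
                            cl::NDRange( 64 ) );  // local work size
queue.finish();

With these numbers, get_global_id(0) runs from 16 to 271, so elements 0 to 15 of d_a should stay untouched; whatever the driver loads into envreg3 has to account for that offset.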