Still pursuing this idea to see how it goes.
To define shared memory, it asks contant value for the size. It means blockDim.x and y can't be used for this. Though there is integer template which can conveniently declare width and height of the shared memory.
template< int W, int H > __global__ void KernelPitch1( ... ) { ... __shared__ uint8_t sharedEast[ (W+1) * H ]; __shared__ uint8_t sharedWest[ (W+1) * H ]; int sharedIndex = threadIdx.y * (W+1) + threadIdx.x; sharedEast[ sharedIndex ] = d_src[ idx - pitch - 1 ]; sharedWest[ sharedIndex ] = d_src[ idx + pitch ]; if( threadIdx.x == (W-1) ) // one more column for each block { sharedEast[ sharedIndex + 1 ] = d_src[ idx - pitch ]; sharedWest[ sharedIndex + 1 ] = d_src[ idx + pitch + 1 ]; } __syncthreads(); float east = (float)(sharedEast[ sharedIndex ]) * pc->ToCeiling + (float)(sharedEast[ sharedIndex+1 ]) * pc->ToFloor; float west = (float)(sharedWest[ sharedIndex ]) * pc->ToFloor + (float)(sharedWest[ sharedIndex+1 ]) * pc->ToCeiling; float diff = ( (float)d_src[idx] * 2.0f - ( east + west ) ) / 2.0f; if( isInBound ) { d_dst[ idx ] = (unsigned char)( diff >= 0 ? diff : -diff ); } } void CUDA_Pitch1( ... ) { ... KernelPitch1< 16, 16 > <<< grids, threads >>>( ... ); ... }
Here is the Visual Profiler result that say slight improvement on the performance.
Though memory loading time might not be a big bottleneck in this case as the profiler points to the store operation as culprit.
One thing to notice is that the CUDA seems get confused with two level 'if' blocks. It is not a good idea to have multiple level of branches as the core has to keep all the branching but still suprised that it compiles and generate strange result. Here is the pattern that have to be avoided.
template< int W, int H > __global__ void KernelPitch1( ... ) { ... if( isInBound ) { ... if( threadIdx.x == (W-1) ) { // This second level if block causes trouble. // It seems that this block may not be executed with valid condition sharedEast[ sharedIndex + 1 ] = d_src[ idx - pitch ]; sharedWest[ sharedIndex + 1 ] = d_src[ idx + pitch + 1 ]; } ... } }