essay on programming languages, computer science, information technologies and all.

Thursday, January 31, 2013

CUDA Study - shared memory

This time I am trying to use shared memory. The idea is to load the source pixels collaboratively across multiple threads. If the bottleneck is memory load time, this should give some boost. The shortfall of the idea is that the last column of threads in each block has to load one additional pixel, and that extra load can't be removed without radically modifying the whole algorithm.

Still pursuing this idea to see how it goes.

Declaring shared memory requires a compile-time constant for the size, which means blockDim.x and blockDim.y can't be used. However, integer template parameters can conveniently declare the width and height of the shared memory at compile time.

template< int W, int H >
__global__ void KernelPitch1( ... ) 
{
  ...
  __shared__ uint8_t sharedEast[ (W+1) * H ];  // W+1: one extra halo column per block
  __shared__ uint8_t sharedWest[ (W+1) * H ];

  int sharedIndex = threadIdx.y * (W+1) + threadIdx.x;
  sharedEast[ sharedIndex ] = d_src[ idx - pitch - 1 ];
  sharedWest[ sharedIndex ] = d_src[ idx + pitch ];

  if( threadIdx.x == (W-1) )  // one more column for each block
  {
    sharedEast[ sharedIndex + 1 ] = d_src[ idx - pitch ];
    sharedWest[ sharedIndex + 1 ] = d_src[ idx + pitch + 1 ];
  }

  __syncthreads();  // every thread in the block reaches this barrier

  float east = 
    (float)(sharedEast[ sharedIndex   ]) * pc->ToCeiling +
    (float)(sharedEast[ sharedIndex+1 ]) * pc->ToFloor;
  float west = 
    (float)(sharedWest[ sharedIndex   ]) * pc->ToFloor + 
    (float)(sharedWest[ sharedIndex+1 ]) * pc->ToCeiling;

  float diff = ( (float)d_src[idx] * 2.0f  - ( east + west ) ) / 2.0f;
  
  if( isInBound )
  {
    d_dst[ idx ] = (unsigned char)( diff >= 0 ? diff : -diff  );
  }
}

void CUDA_Pitch1( ... )
{
  ...
  KernelPitch1< 16, 16 > <<< grids, threads >>>( ... );
  ...
}
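To make the compile-time-size trick concrete, here is a minimal, self-contained sketch. All names are hypothetical, and it assumes the image dimensions are exact multiples of the block size; the point is only that the template parameters size the shared tile, and the launch configuration must agree with them.

```cuda
#include <cstdint>

// Hypothetical minimal kernel: the shared tile is sized by the template
// parameters W and H, which must match the block dimensions at launch.
template< int W, int H >
__global__ void CopyViaShared( const uint8_t* d_src, uint8_t* d_dst, int width )
{
  __shared__ uint8_t tile[ W * H ];   // size must be a compile-time constant

  int x = blockIdx.x * W + threadIdx.x;
  int y = blockIdx.y * H + threadIdx.y;
  int idx = y * width + x;            // assumes width/height are multiples of W/H

  tile[ threadIdx.y * W + threadIdx.x ] = d_src[ idx ];
  __syncthreads();                    // every thread reaches the barrier
  d_dst[ idx ] = tile[ threadIdx.y * W + threadIdx.x ];
}

void LaunchCopyViaShared( const uint8_t* d_src, uint8_t* d_dst,
                          int width, int height )
{
  dim3 threads( 16, 16 );             // must agree with the template arguments
  dim3 grids( width / 16, height / 16 );
  CopyViaShared< 16, 16 > <<< grids, threads >>>( d_src, d_dst, width );
}
```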


Here is the Visual Profiler result, which shows a slight improvement in performance.

Memory load time might not be a big bottleneck in this case, though, as the profiler points to the store operation as the culprit.

One thing to notice is that CUDA seems to get confused by two levels of 'if' blocks. Multiple levels of branching are a bad idea anyway, since the core has to keep track of all the divergent paths, but I was still surprised that the code compiles and then generates strange results. Here is the pattern that has to be avoided.
template< int W, int H >
__global__ void KernelPitch1( ... ) 
{
  ...
  
  if( isInBound )
  {
    ...

    if( threadIdx.x == (W-1) )  
    {
      // This second-level if block causes trouble:
      // it may not execute even when its condition is true.
      sharedEast[ sharedIndex + 1 ] = d_src[ idx - pitch ];
      sharedWest[ sharedIndex + 1 ] = d_src[ idx + pitch + 1 ];
    }

    ...
  }
}
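A plausible explanation (not verified against the generated code) is __syncthreads() divergence: if the elided code inside the outer if contains the barrier, threads that fail isInBound never reach it, and in CUDA a __syncthreads() inside a divergent branch is undefined behavior. The safe structure keeps the shared-memory loads and the barrier flat at the top level and guards only the final store, sketched here with the post's variable names:

```cuda
// Sketch: keep shared-memory writes and the barrier un-nested so that
// every thread in the block executes __syncthreads() exactly once.
sharedEast[ sharedIndex ] = d_src[ idx - pitch - 1 ];
sharedWest[ sharedIndex ] = d_src[ idx + pitch ];

if( threadIdx.x == (W-1) )  // single-level predicate for the halo column
{
  sharedEast[ sharedIndex + 1 ] = d_src[ idx - pitch ];
  sharedWest[ sharedIndex + 1 ] = d_src[ idx + pitch + 1 ];
}

__syncthreads();            // uniform: no surrounding divergent branch

// ... compute east/west/diff ...

if( isInBound )             // divergence is fine here: no barrier after it
{
  d_dst[ idx ] = (unsigned char)( diff >= 0 ? diff : -diff );
}
```

This is exactly the shape of the working kernel earlier in the post: loads and barrier first, bounds check only around the store.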
