essay on programming languages, computer science, information technologies and all.

Thursday, February 28, 2013

OpenCL - pinned memory

Handling pageable memory must require constant intervention from the CPU side, or maybe DMA transfers can't be employed fully. I don't know the exact cause, but NVidia asks you to use pinned memory for critical operations and says you may get around 5 GB/s. Remember that the last post measured a copy throughput of around 3.3 GB/s.

Below is the modified code snippet for pinned memory. Host memory is allocated with the CL_MEM_ALLOC_HOST_PTR flag and a raw pointer is obtained with the mapping function. These mapped raw pointers are then used for the buffer copy operations. It is a bit odd to use the raw pointer rather than the host buffer handle, but I guess it keeps the function interface unchanged.

...
cl::Buffer hostSrc( task.GetContext(), CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, width*height );
cl::Buffer hostDst( task.GetContext(), CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, width*height );

uint8_t *hostSrcPtr = 
  (uint8_t*) 
  queue.enqueueMapBuffer( hostSrc, CL_TRUE, CL_MAP_WRITE, 0, width*height );
uint8_t *hostDstPtr = 
  (uint8_t*) 
  queue.enqueueMapBuffer( hostDst, CL_TRUE, CL_MAP_READ, 0, width*height );

...
queue.enqueueWriteBuffer( devSrc, CL_FALSE, 0, width*height, hostSrcPtr, &ve0, &steps[1] );
...
queue.enqueueNDRangeKernel( kernelPitch0, cl::NullRange, globalws, localws, &ve1, &steps[2] );
...
queue.enqueueReadBuffer( devDst, CL_FALSE, 0, width*height, hostDstPtr, &ve2, &steps[3]  );
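
Not shown above: once the copies have completed and the data has been consumed, the mapped pointers should be released back to the runtime. A minimal sketch, assuming the handles above:

// hand the mapped regions back to the OpenCL runtime
queue.enqueueUnmapMemObject( hostSrc, hostSrcPtr );
queue.enqueueUnmapMemObject( hostDst, hostDstPtr );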


The result is as expected. The achieved memory copy throughput is around 6.1 GB/s, which brings the overall throughput to 1.4 GB/s. First time over a gigabyte per second. Hurray!

Entering test case "TestPitch0Pinned"
Step 1 : start 0 ns, end 647168 ns, duration 647168 ns, 6136.32 MB/s
Step 2 : start 879072 ns, end 2166112 ns, duration 1287040 ns, 3085.55 MB/s
Step 3 : start 2196800 ns, end 2824128 ns, duration 627328 ns, 6330.39 MB/s
Total : duration 2824128 ns, 1406.18 MB/s


Reference : OpenCL Best Practice Guide, Chapter 3.1.1. Pinned Memory

Wednesday, February 27, 2013

OpenCL - first pitch comparison

Here, the pitch comparison will be tested with the OpenCL implementation that ships with the NVidia CUDA toolkit.

First, to simplify the platform, device, context and program creation, I made a new class Workbench which does all the chores. Refer to the code below. Note that this class relies on the reference counting of the OpenCL C++ wrapper.


//
// usage :
//   Workbench task;
//   task.ConnectDevice( "GT 640" )
//       .BuildContext()
//       .BuildProgram( "OpenCLKernels.cl" );
//
class Workbench
{
public:
  Workbench() {}

  Workbench& ConnectDevice( const string& deviceName )
  {
    vector< cl::Platform > platforms;  
    cl::Platform::get( &platforms );

    for( vector< cl::Platform >::iterator I=platforms.begin(); I!=platforms.end(); ++I )
    {
      cl::Platform& platform( *I );
      
      vector< cl::Device > devices;
      platform.getDevices( CL_DEVICE_TYPE_ALL, &devices );

      for( vector< cl::Device >::iterator J=devices.begin(); J!=devices.end(); ++J )
      {
        cl::Device& device( *J );
        string discoveredName( device.getInfo< CL_DEVICE_NAME >() );

        if( discoveredName.find( deviceName ) != discoveredName.npos )
        {
          _Devices.push_back( device );
          return *this;
        }
      }
    }

    throw std::runtime_error( string("Cannot find the device with the name ") + deviceName );
  }

  Workbench& BuildContext()
  {
    _Context = cl::Context( GetConnectedDevices() );
    return *this;
  }

  Workbench& BuildProgram( const string& sourceFile )
  {
    std::ifstream programFile( sourceFile.c_str() );
    std::string programString( 
      std::istreambuf_iterator< char >(programFile), 
      (std::istreambuf_iterator< char >()));
    cl::Program::Sources source( 
      1, 
      std::make_pair( programString.c_str(), programString.length()+1 )
      );
    _Program = cl::Program( _Context, source );
    _Program.build( GetConnectedDevices() );
    return *this;
  }

  vector< cl::Device >& GetConnectedDevices() { return _Devices; }
  cl::Context& GetContext() { return _Context; }
  cl::Program& GetProgram() { return _Program; }
    

private:
  vector< cl::Device > _Devices;
  cl::Context _Context;
  cl::Program _Program;
};
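
One thing Workbench does not do is report kernel compile errors. If the build fails, the compiler output can be fetched from the program object; a minimal sketch, assuming the members above and the exception mode of the C++ wrapper:

    try
    {
      _Program.build( GetConnectedDevices() );
    }
    catch( cl::Error& )
    {
      // dump the compiler output of the first connected device, then rethrow
      std::cout << _Program.getBuildInfo< CL_PROGRAM_BUILD_LOG >( _Devices[0] ) << std::endl;
      throw;
    }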

Then the host code snippet is as below
...
ByteImage src( width, height ), dst( width, height );
...
Workbench task;
task.ConnectDevice( "GT 640" )
  .BuildContext()
  .BuildProgram( "OpenCLKernels.cl" );

cl::Kernel kernelPitch0( task.GetProgram(), "Pitch0" );
cl::CommandQueue queue( task.GetContext(), task.GetConnectedDevices()[0], CL_QUEUE_PROFILING_ENABLE );
cl::Buffer devSrc( task.GetContext(), CL_MEM_READ_ONLY, width*height );
cl::Buffer devDst( task.GetContext(), CL_MEM_WRITE_ONLY, width*height );
...

vector< cl::Event > steps;
steps.push_back( cl::UserEvent( task.GetContext() ) );
steps.push_back( cl::Event() );
steps.push_back( cl::Event() );
steps.push_back( cl::Event() );

vector< cl::Event > ve0; ve0.push_back( steps[0] );
queue.enqueueWriteBuffer( devSrc, CL_FALSE, 0, width*height, src.GetPixelPtr(), &ve0, &steps[1] );

kernelPitch0.setArg( 0, devSrc );
kernelPitch0.setArg( 1, devDst );
...

const int LOCAL_WIDTH = 64, LOCAL_HEIGHT = 4;
cl::NDRange localws( LOCAL_WIDTH, LOCAL_HEIGHT );
cl::NDRange globalws( 
  LOCAL_WIDTH * (int)ceil( (double)(roiRight-roiLeft+1) / (double)LOCAL_WIDTH ), 
  LOCAL_HEIGHT * (int)ceil( (double)(roiBottom-roiTop+1) / (double)LOCAL_HEIGHT )
);

vector< cl::Event > ve1; ve1.push_back( steps[1] );
queue.enqueueNDRangeKernel( kernelPitch0, cl::NullRange, globalws, localws, &ve1, &steps[2] );

vector< cl::Event > ve2; ve2.push_back( steps[2] );
queue.enqueueReadBuffer( devDst, CL_FALSE, 0, width*height, dst.GetPixelPtr(), &ve2, &steps[3]  );

static_cast< cl::UserEvent* >( &steps[0] )->setStatus( CL_COMPLETE );
steps[3].wait();
queue.finish();

...
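
The per-step timings printed in the results below are read from the events' profiling info once the queue has drained. A minimal sketch, assuming the steps vector above and a transfer size of width*height bytes:

for( size_t i=1; i<steps.size(); ++i )   // steps[0] is the user event, which carries no profiling info
{
  cl_ulong start = steps[i].getProfilingInfo< CL_PROFILING_COMMAND_START >();
  cl_ulong end   = steps[i].getProfilingInfo< CL_PROFILING_COMMAND_END >();
  double seconds = (double)(end - start) * 1e-9;    // profiling counters are in nanoseconds
  std::cout << "Step " << i << " : duration " << (end - start) << " ns, "
            << (double)(width*height) / seconds / (1024.0*1024.0) << " MB/s" << std::endl;
}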
The device code is as below
__kernel
void Pitch0(
  __global uchar *d_src, 
  __global uchar *d_dst,
  int width, int height,
  int roiLeft, int roiTop, int roiRight, int roiBottom,
  int integerPitch, float toCeiling, float toFloor )
{
  int x = roiLeft + get_global_id(0);
  int y = roiTop + get_global_id(1);
  int idx = x + y * width;

  bool isInBound = x <= roiRight && y <= roiBottom;

  float east = 
    (float)d_src[ idx - integerPitch - 1 ] * toCeiling +
    (float)d_src[ idx - integerPitch ] * toFloor;
  float west = 
    (float)d_src[ idx + integerPitch ] * toFloor + 
    (float)d_src[ idx + integerPitch + 1 ] * toCeiling;

  float diff = ( (float)d_src[idx] * 2.0f  - ( east + west ) ) / 2.0f;
  diff = diff >= 0 ? diff : -diff;

  if( isInBound )
  {
    d_dst[ idx ] = (uchar)( diff  );
  }
}
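
The arguments elided with "..." in the host snippet map one-to-one onto the kernel parameters above. A minimal sketch of setting the rest, assuming host variables with the same names (toCeiling and toFloor passed as float to match the kernel signature):

kernelPitch0.setArg( 2, width );
kernelPitch0.setArg( 3, height );
kernelPitch0.setArg( 4, roiLeft );
kernelPitch0.setArg( 5, roiTop );
kernelPitch0.setArg( 6, roiRight );
kernelPitch0.setArg( 7, roiBottom );
kernelPitch0.setArg( 8, integerPitch );
kernelPitch0.setArg( 9, toCeiling );   // float
kernelPitch0.setArg( 10, toFloor );    // float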

The result is as below
Step 1 : start 0 ns, end 1165280 ns, duration 1165280 ns, 3407.96 MB/s
Step 2 : start 1388384 ns, end 2675712 ns, duration 1287328 ns, 3084.86 MB/s
Step 3 : start 2854816 ns, end 4050816 ns, duration 1196000 ns, 3320.43 MB/s
Total : duration 4050816 ns, 980.35 MB/s


The memory copy from host to device and from device to host takes around 1.2 milliseconds (ms) each. This is similar to the non-pinned memory copy time, which means I have to find an equivalent of pinned memory in OpenCL.

But a somewhat confusing result is the kernel time, which is just around 1.3 ms. This is almost two times faster than the CUDA kernel time - it took 2.1 ms at best, with the texture access version.

I don't expect that NVidia spent more time optimizing the OpenCL compiler than the CUDA compiler. Or is Apple such a big customer that it can't be disappointed?


Saturday, February 23, 2013

OpenCL - Multiple SDKs

As there are multiple vendors supporting OpenCL, it should be possible to load up multiple devices in an application built against one vendor's OpenCL SDK.

Here I have the NVidia CUDA SDK and the Intel OpenCL SDK installed on my machine. And here is code to enumerate all the devices in all the platforms.

  vector< cl::Platform > platforms;  

  try 
  {
    cl::Platform::get( &platforms );

    for( cl_uint n=0; n < platforms.size(); ++n )
    {
      string platformName = platforms[n].getInfo< CL_PLATFORM_NAME >();
      cout << "Platform : @ " << platformName.c_str() << endl;
      
      vector< cl::Device > devices, ctxDevices;

      platforms[n].getDevices( CL_DEVICE_TYPE_ALL, &devices );
      
      cl::Context context( devices );
      ctxDevices = context.getInfo< CL_CONTEXT_DEVICES >();
      for( cl_uint i=0; i < ctxDevices.size(); ++i )
      {
        string deviceName = ctxDevices[i].getInfo< CL_DEVICE_NAME >();
        cout << "Device: @ " << deviceName.c_str() << endl;
      }
    }
  }
  catch( cl::Error& e ) {
    cout << e.what() << ": Error Code " << e.err() << endl;
  }
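
Note that the try/catch with cl::Error above relies on the exception mode of the C++ wrapper, which has to be switched on before the header is included. A minimal sketch:

#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>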
 
The code above is linked against the OpenCL.lib that comes with the NVidia CUDA SDK. It prints the output below, which enumerates all the OpenCL enabled devices including the Intel OpenCL device.
 
Platform : @ NVIDIA CUDA
Device: @ GeForce GT 640
Platform : @ Intel(R) OpenCL
Device: @         Intel(R) Pentium(R) CPU G2120 @ 3.10GHz

This collaboration is based on the registry key \HKLM\SOFTWARE\Khronos\OpenCL\Vendors. Refer below. I guess OpenCL.lib must load up all of the DLLs registered there and ask each one to report the devices under its control.
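
Each value under that key names a vendor ICD DLL with a DWORD value of 0. The exact entries depend on the installed SDKs; on a machine like this one they would look roughly like this (the DLL names are illustrative):

[HKLM\SOFTWARE\Khronos\OpenCL\Vendors]
"nvopencl.dll" = dword:00000000
"IntelOpenCL64.dll" = dword:00000000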

Sunday, February 17, 2013

Next Move

CUDA
  • Using GT 640. PTX instructions. VLIW aware kernel programming

SSE
  • Any benefit using store without cache ?
  • Calculate throughput.
    • EB = ( Br + Bw ) / T , EB : effective bandwidth, Br : read bytes, Bw : written bytes, T : time

HW
  • Study hardware - cache, bank and so on

OpenCL
  • Using NVidia OpenCL with GT 640.  Compare result with CUDA. Using buffer. Using Image.
  • Using NVidia OpenCL with GT 640.  vector type 
  • Using Intel OpenCL. Compare result with SSE code
  • Run kernels on GPU GT 640 and CPU G2120 at the same time.
  • Buy 3rd generation processor i3 - 3220 - to access GPU HD2500.

Inspection algorithm
  • Image preprocessing
    • Geometric Distortion correction - radial in area camera
    • Shading correction - Area, line scan. How to define the compensation ?
  • periodic pattern inspection - 2 points ( hor, ver ), 4 points, 8 points, many points in horizontal and vertical
  • threshold - constant, adaptive threshold, ...
  • binary image handling
  • morphology - open, close, kernel size, iteration count, iteration order
  • segmentation - contour based, region based, bug follower, area, bounding box calculation

Distributed system
  • Group communication to manage multiple inspection processor
  • Join/Leave the group
  • Send packets that go out to all group members fast and reliably using broadcast or multicast. 
  • A designated member can get the burst of REPs from all the group members. 
  • A processor can join more than 1 group
  • Streaming images to a member - or designated member
  • REQ, REP, ACK, AYA( Are you alive ? ), IAA ( I am alive ), TA( Try Again ), AU ( Address unknown )
  • Make an API that is flexible and easy to use? Allow different types of REQ? Refer to other network APIs?

Wednesday, February 13, 2013

VC++ - intrinsic optimization

VC++ does not support the inline assembler when compiling for the x64 target. But it says you don't have to use the inline assembler since there are intrinsics. Intrinsics are functions that map to assembler instructions.

void SSE_Pitch0( 
  const uint8_t *h_src, uint8_t *h_dst, 
  int width, int height, 
  int roiLeft, int roiTop, int roiRight, int roiBottom,
  float horPitch, float verPitch )
{
  int sizeOfXMMInWords = SSE::SizeOfDataType / 2;
  int integerPitch = (int)horPitch;

  uint16_t toFloor = (uint16_t)( (horPitch - (int)horPitch) * 0xFF);
  uint16_t toCeiling = 0xFF - toFloor;

  std::vector< uint16_t > toFloors( sizeOfXMMInWords, toFloor );
  std::vector< uint16_t > toCeilings( sizeOfXMMInWords, toCeiling );

  __m128i xmmFloor, xmmCeiling;
  xmmFloor = _mm_loadu_si128( (const __m128i*)( &toFloors.front() ) );
  xmmCeiling = _mm_loadu_si128( (const __m128i*)( &toCeilings.front() ) );

  for ( int y=roiTop; y<=roiBottom; y++ ) 
  {
    int x = roiLeft;
    const uint8_t *source = h_src + width*y+x;
    uint8_t *target = h_dst + width*y+x;

    for ( ; x<=roiRight; 
      x+=sizeOfXMMInWords, source+=sizeOfXMMInWords, target+=sizeOfXMMInWords ) 
    {
      __m128i xmmTemp, xmmLeft, xmmRight, xmmEast, xmmWest, xmmCenter;

      // _mm_cvtepu8_epi16 can't load value from memory directly unlike PMOVZXBW
      xmmTemp = _mm_loadu_si128( (const __m128i*)(source-integerPitch-1) );
      xmmEast = _mm_cvtepu8_epi16( xmmTemp );
      xmmTemp = _mm_srli_si128( xmmTemp, 1 );
      xmmWest = _mm_cvtepu8_epi16( xmmTemp );

      xmmEast = _mm_mullo_epi16( xmmEast, xmmFloor );
      xmmWest = _mm_mullo_epi16( xmmWest, xmmCeiling );
      xmmLeft = _mm_adds_epu16( xmmEast, xmmWest );
      xmmLeft = _mm_srli_epi16( xmmLeft, 1 );

      xmmTemp = _mm_loadu_si128( (const __m128i*)(source+integerPitch) );
      xmmEast = _mm_cvtepu8_epi16( xmmTemp );
      xmmTemp = _mm_srli_si128( xmmTemp, 1 );
      xmmWest = _mm_cvtepu8_epi16( xmmTemp );

      xmmEast = _mm_mullo_epi16( xmmEast, xmmCeiling );
      xmmWest = _mm_mullo_epi16( xmmWest, xmmFloor );
      xmmRight = _mm_adds_epu16( xmmEast, xmmWest );
      xmmRight = _mm_srli_epi16( xmmRight, 1 );

      xmmTemp = _mm_loadu_si128( (const __m128i*)(source) );
      xmmCenter = _mm_cvtepu8_epi16( xmmTemp );
      xmmCenter = _mm_slli_epi16( xmmCenter, 8 );
      xmmTemp = xmmCenter;

      xmmCenter = _mm_subs_epu16( xmmCenter, xmmLeft );
      xmmCenter = _mm_subs_epu16( xmmCenter, xmmRight );

      xmmLeft = _mm_adds_epu16( xmmLeft, xmmRight);
      xmmLeft = _mm_subs_epu16( xmmLeft, xmmTemp );

      xmmCenter = _mm_adds_epu16( xmmCenter, xmmLeft );
      xmmCenter = _mm_srli_epi16( xmmCenter, 8 );
      xmmTemp = _mm_xor_si128( xmmTemp, xmmTemp);
      xmmTemp = _mm_packus_epi16( xmmCenter, xmmTemp );

      _mm_storel_epi64( (__m128i*)(target), xmmTemp );
    }
  }
}
 
Surprisingly, this only achieves around 160 MB/s throughput. The reason is obvious when we look at the disassembled code. The compiler doesn't utilize the 16 xmm registers; it seems to think there is only one xmm register, and it constantly pushes/pops xmm register values to the stack. This doesn't improve with the -msse2 option or other optimization flags.

__m128i xmmTemp, xmmLeft, xmmRight, xmmEast, xmmWest, xmmCenter;
   xmmTemp = _mm_loadu_si128( (const __m128i*)(source-integerPitch-1) );
00403616  mov         edx,dword ptr [source] 
0040361C  sub         edx,dword ptr [ebp-1Ch] 
0040361F  sub         edx,1 
00403622  movdqu      xmm0,xmmword ptr [edx] 
00403626  movdqa      xmmword ptr [ebp-0D0h],xmm0 
0040362E  movdqa      xmm0,xmmword ptr [ebp-0D0h] 
00403636  movdqa      xmmword ptr [xmmTemp],xmm0 
   xmmEast = _mm_cvtepu8_epi16( xmmTemp );
0040363E  pmovzxbw    xmm0,mmword ptr [xmmTemp] 
00403647  movdqa      xmmword ptr [ebp-0F0h],xmm0 
0040364F  movdqa      xmm0,xmmword ptr [ebp-0F0h] 
00403657  movdqa      xmmword ptr [xmmEast],xmm0 
   xmmTemp = _mm_srli_si128( xmmTemp, 1 );
0040365F  movdqa      xmm0,xmmword ptr [xmmTemp] 
00403667  psrldq      xmm0,1 
0040366C  movdqa      xmmword ptr [ebp-110h],xmm0 
00403674  movdqa      xmm0,xmmword ptr [ebp-110h] 
0040367C  movdqa      xmmword ptr [xmmTemp],xmm0 

The compiler used here is VC++ 2008 SP1. On the internet there are claims that VC++ 2010 is much better than the old version. But as there are the Intel compiler and GCC which allow the inline assembler, is there still a merit in pursuing intrinsics? Would the compiler make better code than hand-written code in the end? I guess the answer depends. Still, the fact stands that the VC++ 2008 intrinsics support is incompetent for SSE.

Tuesday, February 12, 2013

SSE - Pitch comparison

How fast can it be processed on the CPU side? We have seen that the GPU can reach around 700 MB/s throughput. But what if we let the CPU process it with all its strength?

The CPU has SIMD instructions, and these instructions are quite well suited to the algorithm. The pitch comparison algorithm asks for the difference at each pixel; it doesn't depend on its neighbors' results nor on any other pixel's intermediate processing result.

Also, the pitch comparison doesn't need high precision. Who cares whether the difference is 10.1 or 10.2? The pixel values are 0 - 255 and we know that each pixel can vary by at least 1 or 2 levels anyway. So all we need is at most one digit after the decimal point. This allows us to employ a fixed-point calculation.

16 bits is more than enough for the pitch comparison. SSE provides 128-bit wide xmm registers, and each xmm register can hold 8 words - 16-bit pixels. So each instruction can process 8 pixels at a time.
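
As a concrete example of the fixed-point weights used in the code below (the horPitch value of 4.3 is hypothetical, just for illustration):

// horPitch = 4.3 : the integer part addresses the neighbor, the fraction becomes the 16-bit weights
int      integerPitch = (int)4.3f;                        // 4
uint16_t toFloor      = (uint16_t)( (4.3f - 4) * 0xFF );  // 0.3 * 255 = 76
uint16_t toCeiling    = 0xFF - toFloor;                   // 179
// a neighbor is then interpolated as ( pixelA*toFloor + pixelB*toCeiling ),
// with the two weights summing to 255, so everything stays inside 16-bit words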

VC++ supports inline assembler on the x86 target. Here is the code.

void SSE_Pitch0( 
  const uint8_t *h_src, uint8_t *h_dst, 
  int width, int height, 
  int roiLeft, int roiTop, int roiRight, int roiBottom,
  float horPitch, float verPitch )
{
  int sizeOfDataTypeInWords = SSE::SizeOfDataType / 2;
  int integerPitch = (int)horPitch;

  uint16_t toFloor = (uint16_t)( (horPitch - (int)horPitch) * 0xFF);
  uint16_t toCeiling = 0xFF - toFloor;

  std::vector< uint16_t > toFloors( sizeOfDataTypeInWords, toFloor );
  std::vector< uint16_t > toCeilings( sizeOfDataTypeInWords, toCeiling );
  const uint16_t* toFloorsPtr = &toFloors.front();
  const uint16_t* toCeilingsPtr = &toCeilings.front();

  _asm {
    MOV     ECX, toFloorsPtr  
    MOVDQU  XMM6, [ECX]
    MOV     EDX, toCeilingsPtr
    MOVDQU  XMM7, [EDX]
  }

  for ( int y=roiTop; y<=roiBottom; y++ ) 
  {
    for ( int x=roiLeft; x<=roiRight; x+=sizeOfDataTypeInWords ) 
    {
      const uint8_t *source = h_src + width*y+x;
      uint8_t *target = h_dst + width*y+x;
      
      _asm {

        MOV      ECX, source  
        SUB      ECX, integerPitch  // east pitch pixel
        PMOVZXBW XMM1, [ECX-1]  // left
        PMOVZXBW XMM2, [ECX]    // right

        PMULLW   XMM1, XMM6  
        PMULLW   XMM2, XMM7
        PADDUSW  XMM1, XMM2  
        PSRLW    XMM1, 1

        MOV      ECX, source  
        ADD      ECX, integerPitch  // west pitch pixel

        PMOVZXBW XMM2, [ECX]    // left
        PMOVZXBW XMM3, [ECX+1]  // right

        PMULLW   XMM2, XMM7  
        PMULLW   XMM3, XMM6
        PADDUSW  XMM2, XMM3
        PSRLW    XMM2, 1

        MOV      ECX, source;  
        PMOVZXBW XMM0, [ECX]    // XMM0 = I(n) | ... | I(n+7)
        PSLLW    XMM0, 8
        MOVDQU   XMM4, XMM0

        PSUBUSW  XMM0, XMM1    // 2C - (L+R)
        PSUBUSW  XMM0, XMM2    

        PADDUSW  XMM1, XMM2    // (L+R) - 2C
        PSUBUSW  XMM1, XMM4

        PADDUSW  XMM0, XMM1

        PSRLW    XMM0, 8
        PXOR     XMM1, XMM1
        PACKUSWB XMM1, XMM0

        MOV      EDX, target
        MOVHPS   [EDX], XMM1
      }
    }
  }
}
 
The code above executes at around 940 MB/s.

Wednesday, February 6, 2013

CUDA Study - texture memory

CUDA provides a special memory type, the so-called texture memory. When an image is stored in this memory, its layout is not row-major but interleaved. This layout is supposed to give a better hit rate and boost load operations when spatially distant pixels are used. And it provides linear filtering in hardware, which eliminates the linear interpolation in the kernel.

As this memory does not have a common layout, it has to be created with a dedicated method - cudaMallocArray - and it is read-only and only accessible using special texture fetch functions. Well, in CUDA 5.0 they say it can be written using the Surface API, but that's not discussed here.

The CUDA programming guide says that there are two APIs for textures. One is the Texture Reference API and the other is the Texture Object API. The Texture Reference API is used here. It seems that the static texture variable is the only shortfall compared to the Texture Object API. And the Object API was introduced in 5.0, which means there is not much help available if you get stuck.

// texture<> is only allowed as file scope static variable
// cudaReadModeElementType can't be used with cudaFilterModeLinear with uint8_t texel 
texture< uint8_t, cudaTextureType2D, cudaReadModeNormalizedFloat > gPitch2Texture;

__global__ void KernelPitch2()
{
  ...
  // add 0.5f to all texture coordinates for center offset
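  // cudaReadModeNormalizedFloat returns texels normalized to [0,1], hence the * 255 on each fetch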
  float east, west, center;
  east = tex2D( gPitch2Texture, (float)( x - pc->Pitch + 0.5f), (float)(y+0.5f) ) * 255;
  west = tex2D( gPitch2Texture, (float)( x + pc->Pitch + 0.5f), (float)(y+0.5f) ) * 255;
  center = tex2D( gPitch2Texture, (float)(x+0.5f), (float)(y+0.5f) ) * 255;

  float diff = ( center * 2.0f  - ( east + west ) ) / 2.0f;
  ...
}

void CUDA_Pitch2( ... )
{
  ...
  cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc< uint8_t >();

  // texture memory should be created using cudaArray and bound to the texture reference
  struct cudaArray *devSrcArray;
  cudaMallocArray( &devSrcArray, &channelDesc, width, height );
  cudaMemcpyToArray( devSrcArray, 0, 0, h_src, imgSize, cudaMemcpyHostToDevice );

  gPitch2Texture.addressMode[0] = cudaAddressModeClamp;
  gPitch2Texture.addressMode[1] = cudaAddressModeClamp;
  gPitch2Texture.filterMode = cudaFilterModeLinear;
  gPitch2Texture.normalized = false;

  cudaBindTextureToArray( gPitch2Texture, devSrcArray, channelDesc );

  const int THREAD_WIDTH = 64, THREAD_HEIGHT = 4;
  dim3 blocks( THREAD_WIDTH, THREAD_HEIGHT );
  dim3 grids( 
    (int)ceil( (double)(roiRight-roiLeft+1) / (double)blocks.x ), 
    (int)ceil( (double)(roiBottom-roiTop+1) / (double)blocks.y )
  );
 
  ...
  KernelPitch2<<< grids, blocks >>>( ... )
  ...

  cudaFreeArray( devSrcArray );
}
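
One small cleanup the snippet above skips: with the Texture Reference API the texture can also be unbound once the kernel has finished. A minimal sketch:

  // release the binding before freeing the underlying array
  cudaUnbindTexture( gPitch2Texture );
  cudaFreeArray( devSrcArray );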

This makes KernelPitch2 finish in 2.117 ms and brings the total throughput to around 740 MB/s. The last score was 630 MB/s - the pinned host memory and shared load version.

Sunday, February 3, 2013

CUDA Study - L1 cache

Changed the thread dimensions from 16 x 16 to 64 x 4. Then the Global Memory Store Efficiency rises from 33% to 49.9%.

Actually it rises to 49.9% at 32 x 8 and stays there no matter how large the width is. In the opposite direction, at 8 x 32, it goes down to 20%.

This must be related to the L1 and L2 cache sizes. The L1 cache is 16 KB and the L2 cache 256 KB. The L1 cache line is 128 B. When a block has 16 x 16 threads, it needs 32 ( = 16 x 2 ) cache lines - assuming the pitch is sufficiently small, say less than 64 pixels ( = 128 B / 2 ). 32 cache lines correspond to 4 KB ( = 32 cache lines x 128 B ). When the maximum threads per processor is 2048, a processor can hold 8 blocks ( = 2048 threads / 256 threads ), and it needs 32 KB ( = 8 blocks x 4 KB ) of cache to serve all the read/write requests. This exceeds the 16 KB limit of the L1 cache. When a block is 32 x 8, it needs half of that and can fit in the 16 KB L1 cache.

Another test was to remove the store instruction, which makes it around two times faster. What is the store efficiency? How good can it be? How is it calculated?

The GT 640 has 0.891 GHz DDR3 memory with a 128-bit bus width, which means 28.5 GB/s ( = 0.891 x 2 x 128 / 8 ) of throughput. The horizontal pitch comparison algorithm needs 5 pixels read and 1 pixel written.

If there is no calculation and just pure memory transactions, the maximal throughput is 4.75 GB/s ( = 28.5 / 6 ). This also assumes that threads cooperate to fully utilize the cache - one read fetches 16 B ( = 128 bits / 8 ), which is stored in the cache and eventually consumed by 16 threads.

The current implementation only runs at around 0.4 GB/s, which is just about 10% of the maximum.