essay on programming languages, computer science, information technologies and all.

Tuesday, February 12, 2013

SSE - Pitch comparison

How fast can it be processed on the CPU side? We have seen that the GPU reaches around 700 MB/s of throughput. But what if we let the CPU process it at full strength?

The CPU has SIMD instructions, and they are quite well suited to this algorithm. The pitch comparison algorithm computes a difference for each pixel. The computation for one pixel does not depend on what was computed for its adjacent neighbor, nor on any other pixel's intermediate result.
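As a point of reference, the per-pixel operation can be sketched in plain C++. This is only a sketch: it assumes pitch comparison means comparing each pixel with the average of the two samples one horizontal pitch to its left and right, linearly interpolated when the pitch is fractional, which is what the SSE routine in this post computes. The name `pitchDiffScalar` is made up for illustration.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical scalar reference: |C - (W + E) / 2| for the pixel at row[x],
// where W and E are sampled one horizontal pitch to the left and right.
// Assumes x - pitch - 1 >= 0 and x + pitch + 1 is still inside the row.
inline uint8_t pitchDiffScalar(const uint8_t* row, int x, float horPitch)
{
    const int   ip   = static_cast<int>(horPitch);  // integer part of the pitch
    const float frac = horPitch - ip;               // fractional part of the pitch

    // Linear interpolation between the two pixels that straddle x - pitch and x + pitch.
    const float west = row[x - ip - 1] * frac + row[x - ip] * (1.0f - frac);
    const float east = row[x + ip] * (1.0f - frac) + row[x + ip + 1] * frac;

    const float diff = std::fabs(row[x] - 0.5f * (west + east));
    return static_cast<uint8_t>(diff > 255.0f ? 255.0f : diff);
}
```

Every output pixel is computed from input pixels only, so any number of them can be evaluated side by side, which is exactly what SIMD needs.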

Pitch comparison also doesn't need high precision. Who cares whether the difference is 10.1 or 10.2? Pixel values range from 0 to 255, and we know that each pixel can vary by at least 1 or 2 gray levels anyway. So all we need is at most one digit after the decimal point. This allows us to use fixed-point arithmetic instead of floating point.
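To make the fixed-point idea concrete, here is a small sketch of blending two pixels with 8-bit weights held in 16-bit words. The helper names `blendFixed` and `fromFixed` are hypothetical; the 0xFF weight scale is the one the SSE routine in this post uses.

```cpp
#include <cstdint>

// Hypothetical helper: blend two 8-bit pixels with fixed-point weights.
// The weights sum to 0xFF, so the result never exceeds 255 * 255 = 65025
// and always fits in an unsigned 16-bit word.
inline uint16_t blendFixed(uint8_t a, uint8_t b, float fracA)
{
    const uint16_t wA = static_cast<uint16_t>(fracA * 0xFF);  // weight of a, 0..255
    const uint16_t wB = static_cast<uint16_t>(0xFF - wA);     // weight of b
    return static_cast<uint16_t>(a * wA + b * wB);            // pixel value scaled by 255
}

// Convert a value scaled by 255 back to an ordinary 0..255 pixel.
inline uint8_t fromFixed(uint16_t scaled)
{
    return static_cast<uint8_t>((scaled + 127) / 0xFF);
}
```

Blending 100 and 200 with a weight of 0.5 gives 150 after conversion, the same as the floating-point result, while every intermediate value stays inside 16 bits.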

16 bits is more than enough for the pitch comparison. SSE provides 128-bit wide XMM registers. Each XMM register can hold 8 words, that is, eight 16-bit pixels, so each instruction can process 8 pixels at a time.

VC++ supports inline assembler on the x86 target. Note that PMOVZXBW, which widens bytes to words, requires SSE4.1. Here is the code.

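#include <cstdint>
#include <vector>

void SSE_Pitch0( 
  const uint8_t *h_src, uint8_t *h_dst, 
  int width, int height, 
  int roiLeft, int roiTop, int roiRight, int roiBottom,
  float horPitch, float verPitch )
{
  // SSE::SizeOfDataType is defined elsewhere; it is assumed here to be the
  // XMM register size in bytes (16), which gives 8 words per register.
  int sizeOfDataTypeInWords = SSE::SizeOfDataType / 2;
  int integerPitch = (int)horPitch;

  // Fixed-point interpolation weights: toFloor is the fractional part of
  // the pitch scaled to 0..255, toCeiling is its complement (sum is 0xFF).
  uint16_t toFloor = (uint16_t)( (horPitch - (int)horPitch) * 0xFF);
  uint16_t toCeiling = 0xFF - toFloor;

  std::vector< uint16_t > toFloors( sizeOfDataTypeInWords, toFloor );
  std::vector< uint16_t > toCeilings( sizeOfDataTypeInWords, toCeiling );
  const uint16_t* toFloorsPtr = &toFloors.front();
  const uint16_t* toCeilingsPtr = &toCeilings.front();

  // XMM6 = toFloor x 8, XMM7 = toCeiling x 8. The loop below relies on
  // these two registers staying untouched between the asm blocks.
  _asm {
    MOV     ECX, toFloorsPtr  
    MOVDQU  XMM6, [ECX]
    MOV     EDX, toCeilingsPtr
    MOVDQU  XMM7, [EDX]
  }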

  for ( int y=roiTop; y<=roiBottom; y++ ) 
  {
    for ( int x=roiLeft; x<=roiRight; x+=sizeOfDataTypeInWords ) 
    {
      const uint8_t *source = h_src + width*y+x;
      uint8_t *target = h_dst + width*y+x;
      
      _asm {

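        // Interpolated sample one pitch to the left (x - pitch)
        MOV      ECX, source  
        SUB      ECX, integerPitch
        PMOVZXBW XMM1, [ECX-1]  // left
        PMOVZXBW XMM2, [ECX]    // right

        PMULLW   XMM1, XMM6     // left  * toFloor
        PMULLW   XMM2, XMM7     // right * toCeiling
        PADDUSW  XMM1, XMM2     // weighted sum, scaled by 255
        PSRLW    XMM1, 1        // halve it: XMM1 = L

        // Interpolated sample one pitch to the right (x + pitch)
        MOV      ECX, source  
        ADD      ECX, integerPitch

        PMOVZXBW XMM2, [ECX]    // left
        PMOVZXBW XMM3, [ECX+1]  // right

        PMULLW   XMM2, XMM7     // left  * toCeiling
        PMULLW   XMM3, XMM6     // right * toFloor
        PADDUSW  XMM2, XMM3
        PSRLW    XMM2, 1        // XMM2 = R

        MOV      ECX, source
        PMOVZXBW XMM0, [ECX]    // XMM0 = I(n) | ... | I(n+7)
        PSLLW    XMM0, 8        // scale by 256: XMM0 = 2C
        MOVDQU   XMM4, XMM0     // keep a copy of 2C

        PSUBUSW  XMM0, XMM1    // 2C - (L+R), saturates to 0 if negative
        PSUBUSW  XMM0, XMM2    

        PADDUSW  XMM1, XMM2    // (L+R) - 2C, saturates to 0 if negative
        PSUBUSW  XMM1, XMM4

        PADDUSW  XMM0, XMM1    // one of the two is 0, so this is |2C - (L+R)|

        PSRLW    XMM0, 8       // back to the 0..255 range
        PXOR     XMM1, XMM1
        PACKUSWB XMM1, XMM0    // the 8 result bytes land in the high half of XMM1

        MOV      EDX, target
        MOVHPS   [EDX], XMM1   // store those 8 bytes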
      }
    }
  }
}
 
The code above runs at around 940 MB/s, compared with around 700 MB/s on the GPU.
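VC++ accepts inline assembler only on the x86 target, so for x64 builds or other compilers the same inner loop body can be written with intrinsics. The following is a sketch, not the code that was measured above: `load8AsWords` and `pitchDiff8` are names made up here, and the zero-extension uses the SSE2 unpack instruction instead of PMOVZXBW so that only `<emmintrin.h>` is needed.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Zero-extend 8 consecutive bytes to eight 16-bit words (SSE2 stand-in for PMOVZXBW).
static inline __m128i load8AsWords(const uint8_t* p)
{
    const __m128i bytes = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p));
    return _mm_unpacklo_epi8(bytes, _mm_setzero_si128());
}

// Hypothetical port of the inner loop body: reads 8 pixels starting at `source`
// and writes 8 result bytes to `target`.
static inline void pitchDiff8(const uint8_t* source, uint8_t* target,
                              int integerPitch, uint16_t toFloor, uint16_t toCeiling)
{
    const __m128i wFloor   = _mm_set1_epi16(static_cast<short>(toFloor));
    const __m128i wCeiling = _mm_set1_epi16(static_cast<short>(toCeiling));

    // Interpolated sample at x - pitch, scaled by 255 and then halved.
    const uint8_t* lo = source - integerPitch;
    __m128i left = _mm_adds_epu16(_mm_mullo_epi16(load8AsWords(lo - 1), wFloor),
                                  _mm_mullo_epi16(load8AsWords(lo), wCeiling));
    left = _mm_srli_epi16(left, 1);

    // Interpolated sample at x + pitch, scaled the same way.
    const uint8_t* hi = source + integerPitch;
    __m128i right = _mm_adds_epu16(_mm_mullo_epi16(load8AsWords(hi), wCeiling),
                                   _mm_mullo_epi16(load8AsWords(hi + 1), wFloor));
    right = _mm_srli_epi16(right, 1);

    // Centre pixel scaled by 256.
    const __m128i centre = _mm_slli_epi16(load8AsWords(source), 8);

    // Absolute difference from two saturating differences; one of them is always 0.
    const __m128i pos = _mm_subs_epu16(_mm_subs_epu16(centre, left), right);
    const __m128i neg = _mm_subs_epu16(_mm_adds_epu16(left, right), centre);
    __m128i diff = _mm_srli_epi16(_mm_adds_epu16(pos, neg), 8);

    // Narrow the 8 words to 8 bytes and store them.
    diff = _mm_packus_epi16(diff, diff);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(target), diff);
}
```

A caller would compute `toFloor` and `toCeiling` exactly as `SSE_Pitch0` does and call `pitchDiff8` once per 8-pixel step of the x loop.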
