CPUs have SIMD instructions, and these instructions are well suited to this algorithm. The pitch comparison algorithm computes a difference for each pixel independently. It doesn't need to know the value of its adjacent neighbor pixel, nor does it depend on any other pixel's intermediate processing result.
Also, the pitch comparison doesn't need high precision. Who cares whether the difference is 10.1 or 10.2? The pixel value is 0 - 255, and we know that each pixel can vary by at least 1 or 2 levels anyway. So all we need is at most one digit after the decimal point. This allows us to use fixed-point arithmetic instead of floating-point calculation.
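As a rough sketch of the fixed-point idea, the fractional part of the pitch can be turned into two 8-bit weights that sum to 0xFF, so that interpolating between two pixels needs only integer multiplies. The names `Weights`, `makeWeights` and `interpolate` are hypothetical, chosen for illustration; only the scaling by 0xFF is taken from the listing in this post.

```cpp
#include <cstdint>

// Hypothetical helper type: 8-bit fixed-point weights for a fractional pitch.
struct Weights {
    uint16_t toFloor;    // weight of the pixel one step further away
    uint16_t toCeiling;  // weight of the nearer pixel
};

// Split the pitch into its fractional part and scale it to 0..255.
Weights makeWeights(float pitch) {
    float frac = pitch - static_cast<int>(pitch);            // 0 <= frac < 1
    uint16_t toFloor = static_cast<uint16_t>(frac * 0xFF);   // truncates toward zero
    return Weights{ toFloor, static_cast<uint16_t>(0xFF - toFloor) };
}

// Weighted sum of two 8-bit pixels. The result is the interpolated
// pixel value scaled by 255, and it always fits in 16 bits.
uint16_t interpolate(uint8_t farPixel, uint8_t nearPixel, Weights w) {
    return static_cast<uint16_t>(farPixel * w.toFloor + nearPixel * w.toCeiling);
}
```

For a pitch of 10.5 this gives weights 127 and 128, so pixels 100 and 200 interpolate to 38300, which is about 150.2 after dividing by 255. The floating-point answer would be 150, well within the 1 or 2 levels of noise mentioned above.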
16 bits are more than enough for the pitch comparison. SSE provides 128-bit-wide XMM registers. Each XMM register can hold 8 words, that is, eight 16-bit pixels, so each instruction can process 8 pixels at a time.
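The 16-bit claim can be checked with a little worst-case arithmetic. The block below is only an illustrative compile-time check, assuming the two interpolation weights always sum to 0xFF; the function names are made up for this sketch.

```cpp
#include <cstdint>

// Largest weighted sum of two 8-bit pixels when the weights sum to 0xFF.
// Both pixels are taken at their maximum value of 255.
constexpr uint32_t worstCaseInterpolated(uint32_t toFloor) {
    return 255u * toFloor + 255u * (0xFFu - toFloor);
}

// Number of lanes of a given width that fit in one register.
constexpr int lanesPerRegister(int registerBits, int laneBits) {
    return registerBits / laneBits;
}

static_assert(worstCaseInterpolated(0) <= 0xFFFFu, "weighted sum fits in a 16-bit word");
static_assert(worstCaseInterpolated(200) <= 0xFFFFu, "independent of the weight split");
static_assert(lanesPerRegister(128, 16) == 8, "eight 16-bit words per XMM register");
```

The worst case is 255 * 255 = 65025 regardless of how the weights are split, which is just below the 16-bit limit of 65535.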
VC++ supports inline assembler on the x86 target. Here is the code.
void SSE_Pitch0( const uint8_t *h_src, uint8_t *h_dst, int width, int height,
                 int roiLeft, int roiTop, int roiRight, int roiBottom,
                 float horPitch, float verPitch )
{
    int sizeOfDataTypeInWords = SSE::SizeOfDataType / 2;
    int integerPitch = (int)horPitch;

    // Fractional part of the pitch as 8-bit fixed-point weights (they sum to 0xFF)
    uint16_t toFloor = (uint16_t)( (horPitch - (int)horPitch) * 0xFF);
    uint16_t toCeiling = 0xFF - toFloor;

    // Replicate each weight into all 8 word lanes
    std::vector< uint16_t > toFloors( sizeOfDataTypeInWords, toFloor );
    std::vector< uint16_t > toCeilings( sizeOfDataTypeInWords, toCeiling );
    const uint16_t* toFloorsPtr = &toFloors.front();
    const uint16_t* toCeilingsPtr = &toCeilings.front();

    // XMM6 and XMM7 hold the weights for the whole loop
    _asm {
        MOV     ECX, toFloorsPtr
        MOVDQU  XMM6, [ECX]
        MOV     EDX, toCeilingsPtr
        MOVDQU  XMM7, [EDX]
    }

    for ( int y=roiTop; y<=roiBottom; y++ )
    {
        for ( int x=roiLeft; x<=roiRight; x+=sizeOfDataTypeInWords )
        {
            const uint8_t *source = h_src + width*y+x;
            uint8_t *target = h_dst + width*y+x;
            _asm {
                MOV       ECX, source
                SUB       ECX, integerPitch  // east pitch pixel
                PMOVZXBW  XMM1, [ECX-1]      // left  (PMOVZXBW needs SSE4.1)
                PMOVZXBW  XMM2, [ECX]        // right
                PMULLW    XMM1, XMM6
                PMULLW    XMM2, XMM7
                PADDUSW   XMM1, XMM2
                PSRLW     XMM1, 1            // halve so that the two sides can be added

                MOV       ECX, source
                ADD       ECX, integerPitch  // west pitch pixel
                PMOVZXBW  XMM2, [ECX]        // left
                PMOVZXBW  XMM3, [ECX+1]      // right
                PMULLW    XMM2, XMM7
                PMULLW    XMM3, XMM6
                PADDUSW   XMM2, XMM3
                PSRLW     XMM2, 1

                MOV       ECX, source
                PMOVZXBW  XMM0, [ECX]        // XMM0 = I(n) | ... | I(n+7)
                PSLLW     XMM0, 8            // 2C in the same scale as L and R
                MOVDQU    XMM4, XMM0

                PSUBUSW   XMM0, XMM1         // 2C - (L+R)
                PSUBUSW   XMM0, XMM2
                PADDUSW   XMM1, XMM2         // (L+R) - 2C
                PSUBUSW   XMM1, XMM4
                PADDUSW   XMM0, XMM1         // one of the two is zero, so this is |2C - (L+R)|
                PSRLW     XMM0, 8            // back to the 0 - 255 range

                PXOR      XMM1, XMM1
                PACKUSWB  XMM1, XMM0         // result bytes end up in the high 64 bits
                MOV       EDX, target
                MOVHPS    [EDX], XMM1        // store 8 result pixels
            }
        }
    }
}
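To make the arithmetic easier to follow, here is a scalar sketch of what the assembly computes for a single output pixel, as far as I can read it from the listing. The function names `subSat` and `pitchDiffScalar` are hypothetical, and the caller is assumed to keep `x - integerPitch - 1` and `x + integerPitch + 1` inside the row. It is meant as a reference for checking results, not as a replacement for the SSE version.

```cpp
#include <cstdint>

// Saturating unsigned 16-bit subtract, the scalar analogue of PSUBUSW.
inline uint16_t subSat(uint16_t a, uint16_t b) {
    return a > b ? static_cast<uint16_t>(a - b) : 0;
}

// One output pixel of the horizontal pitch comparison.
// toFloor + toCeiling is assumed to be 0xFF, as in the listing.
uint8_t pitchDiffScalar(const uint8_t* row, int x, int integerPitch,
                        uint16_t toFloor, uint16_t toCeiling) {
    const uint8_t* e = row + x - integerPitch;   // one pitch away on one side
    const uint8_t* w = row + x + integerPitch;   // one pitch away on the other side

    // Interpolated side pixels, scaled by 255 and then halved (about value * 128)
    uint16_t east = static_cast<uint16_t>((e[-1] * toFloor + e[0] * toCeiling) >> 1);
    uint16_t west = static_cast<uint16_t>((w[0] * toCeiling + w[1] * toFloor) >> 1);

    // Center pixel times 256, i.e. 2C in the same scale as east and west
    uint16_t center2 = static_cast<uint16_t>(row[x] << 8);

    // |2C - (L+R)| from two saturating differences; at most one is non-zero
    uint16_t pos = subSat(subSat(center2, east), west);
    uint16_t neg = subSat(static_cast<uint16_t>(east + west), center2);

    return static_cast<uint8_t>((pos + neg) >> 8);   // back to 0 - 255
}
```

On a flat row the result is 0, and a single bright pixel of value 200 on a black background yields 200. Note that the saturating-subtract pair replaces an absolute-value instruction, which is why the center value has to be saved in a second register before the first subtraction.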