essay on programming languages, computer science, information technologies and all.

Sunday, February 3, 2013

CUDA Study - L1 cache

Changed the thread block dimensions from 16 x 16 to 64 x 4. The Global Memory Store Efficiency then rose from 33% to 49.9%.

Actually, it rises to 49.9% at 32 x 8 and stays there no matter how wide the block gets. In the opposite direction, at 8 x 32 it drops to 20%.
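For reference, a minimal sketch of what changes between runs; the kernel name, argument list, and image dimensions below are placeholders, not the actual test code.

    // Hypothetical launch: only the block shape differs between runs.
    dim3 block(32, 8);                       // was (16, 16); (8, 32) performs worst
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    compareKernel<<<grid, block>>>(dst, src, width, height, pitch);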

This must be related to the L1 and L2 cache sizes. The L1 cache is 16 KB, the L2 is 256 KB, and an L1 cache line is 128 B. When a block has 16 x 16 threads, it needs 32 (= 16 x 2) cache lines, as long as the pitch is sufficiently small, i.e., less than 64 pixels (= 128 B / 2 B per pixel). 32 cache lines correspond to 4 KB (= 32 cache lines x 128 B). With a maximum of 2048 threads per multiprocessor, it can hold 8 blocks (= 2048 threads / 256 threads), and it needs 32 KB (= 8 blocks x 4 KB) of cache to cover all the read/write requests. This exceeds the 16 KB limit of the L1 cache. When a block is 32 x 8, it needs half that and can fit in the 16 KB L1 cache.
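Spelling that arithmetic out as a tiny host-side function, assuming my reading of 2 cache lines per row of threads and the GT 640's per-SM limits:

    // Back-of-the-envelope L1 footprint per SM, assuming 2 cache lines
    // per row of threads (the read window can straddle two lines).
    const int cacheLineBytes  = 128;
    const int maxThreadsPerSM = 2048;

    int l1FootprintBytes(int blockX, int blockY)
    {
        int linesPerBlock = blockY * 2;                    // 2 lines per thread row
        int bytesPerBlock = linesPerBlock * cacheLineBytes;
        int blocksPerSM   = maxThreadsPerSM / (blockX * blockY);
        return blocksPerSM * bytesPerBlock;
    }

    // l1FootprintBytes(16, 16) == 32 KB  -> exceeds the 16 KB L1
    // l1FootprintBytes(32,  8) == 16 KB  -> fits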

Another test was to remove the store instruction, which made it around two times faster. What is the store efficiency? How good can it be? How is it calculated?
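One way to run that experiment without keeping two copies of the kernel is a compile-time switch. The kernel body below is a hypothetical stand-in for the real comparison code, and one caveat applies: if the result is provably unused, the compiler may remove the loads along with the store, so the no-store timing is worth double-checking.

    // Hypothetical kernel; WriteResult toggles the store at compile time.
    template <bool WriteResult>
    __global__ void compareKernel(unsigned short* dst, const unsigned short* src,
                                  int width, int height, int pitch)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x + 4 >= width || y >= height) return;

        // Stand-in for the 5-pixel horizontal comparison.
        unsigned short v = src[y * pitch + x] ^ src[y * pitch + x + 4];

        if (WriteResult)
            dst[y * pitch + x] = v;
    }

    // compareKernel<true ><<<grid, block>>>(...);  // normal run
    // compareKernel<false><<<grid, block>>>(...);  // store removed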

The GT 640 has 0.891 GHz DDR3 memory on a 128-bit bus, which means 28.5 GB/s (= 0.891 x 2 x 128 / 8) of theoretical throughput. The horizontal pitch comparison algorithm reads 5 pixels and writes 1 pixel.

If there were no computation, just pure memory transactions, the maximum throughput would be 4.75 GB/s (= 28.5 / 6). This also assumes that threads cooperate to fully utilize the cache: one read fetches 16 B (= 128 bits / 8), which is stored in the cache and eventually consumed by 16 threads.
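The two numbers come straight out of the figures above; written out, with the DDR multiplier and the 5-reads-plus-1-write ratio taken from the text:

    // Theoretical bandwidth and best-case throughput for the GT 640.
    const double memClockGHz  = 0.891;             // DDR3 memory clock
    const double busWidthBits = 128.0;

    // DDR transfers twice per clock; divide by 8 to convert bits to bytes.
    const double peakGBps     = memClockGHz * 2.0 * busWidthBits / 8.0;  // 28.5
    const double bestCaseGBps = peakGBps / 6.0;    // 5 reads + 1 write -> 4.75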

The current implementation only runs at around 0.4 GB/s, which is under 10% of that maximum.
