Below is the modified code snippet for pinned memory. Host memory is allocated with the CL_MEM_ALLOC_HOST_PTR flag, and a raw pointer to it is obtained with the mapping function (enqueueMapBuffer). These mapped raw pointers are then used for the buffer copy operations. It is a bit odd to use the raw pointer rather than the host buffer handle, but I guess it is to keep the function interface unchanged.
    ...
    // Pinned host buffers, allocated by the OpenCL runtime
    cl::Buffer hostSrc( task.GetContext(), CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, width*height );
    cl::Buffer hostDst( task.GetContext(), CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, width*height );

    // Map the pinned buffers to get raw host pointers
    uint8_t *hostSrcPtr = (unsigned char*) queue.enqueueMapBuffer( hostSrc, CL_TRUE, CL_MAP_WRITE, 0, width*height );
    uint8_t *hostDstPtr = (unsigned char*) queue.enqueueMapBuffer( hostDst, CL_TRUE, CL_MAP_READ, 0, width*height );
    ...
    // Step 1 : host to device copy
    queue.enqueueWriteBuffer( devSrc, CL_FALSE, 0, width*height, hostSrcPtr, &ve0, &steps[1] );
    ...
    // Step 2 : kernel execution
    queue.enqueueNDRangeKernel( kernelPitch0, cl::NullRange, globalws, localws, &ve1, &steps[2] );
    ...
    // Step 3 : device to host copy
    queue.enqueueReadBuffer( devDst, CL_FALSE, 0, width*height, hostDstPtr, &ve2, &steps[3] );
The result is as expected. The achieved memory copy throughput is around 6.1 to 6.3 GB/s, which brings the overall throughput to 1.4 GB/s. That is the first time over 1 GB/s. Hooray!
    Entering test case "TestPitch0Pinned"
    Step 1 : start 0 ns, end 647168 ns, duration 647168 ns, 6136.32 MB/s
    Step 2 : start 879072 ns, end 2166112 ns, duration 1287040 ns, 3085.55 MB/s
    Step 3 : start 2196800 ns, end 2824128 ns, duration 627328 ns, 6330.39 MB/s
    Total : duration 2824128 ns, 1406.18 MB/s
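To see where the total time goes, the timestamps in the log can be run through a few small helpers. This is only a sketch, not the code of the test harness: the names `Step`, `durationNs`, `throughputMBps`, `idleNs` and `pinnedRunSteps` are hypothetical, and it assumes 1 MB = 10^6 bytes, which may not be the convention the log uses.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One profiled step: start and end timestamps in nanoseconds,
// in the same form as the "Step N" lines of the log.
struct Step {
    std::uint64_t startNs;
    std::uint64_t endNs;
};

// Time spent inside a single step.
inline std::uint64_t durationNs(const Step& s) {
    assert(s.endNs >= s.startNs);
    return s.endNs - s.startNs;
}

// Throughput in MB/s for `bytes` processed in `ns` nanoseconds.
// Assumption: 1 MB = 10^6 bytes, so bytes per ns times 1000 gives MB/s.
inline double throughputMBps(std::uint64_t bytes, std::uint64_t ns) {
    return static_cast<double>(bytes) / static_cast<double>(ns) * 1000.0;
}

// Time between the first start and the last end that is not covered
// by any step. Assumes the steps are sorted and do not overlap.
inline std::uint64_t idleNs(const std::vector<Step>& steps) {
    if (steps.empty()) return 0;
    std::uint64_t busy = 0;
    for (const Step& s : steps) busy += durationNs(s);
    return (steps.back().endNs - steps.front().startNs) - busy;
}

// Timestamps copied from the TestPitch0Pinned log.
inline std::vector<Step> pinnedRunSteps() {
    return { {0, 647168}, {879072, 2166112}, {2196800, 2824128} };
}
```

For this run the three steps add up to 2561536 ns, so `idleNs(pinnedRunSteps())` gives 262592 ns, about 9% of the 2824128 ns total. Most of it is the gap between the end of step 1 and the start of step 2. That is part of why the overall figure is lower than the per-step numbers alone would suggest.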
Reference: NVIDIA OpenCL Best Practices Guide, Chapter 3.1.1, Pinned Memory