
(cuda/opencl) Leveraging implicit intra-warp synchronization in reduction algorithms

Let's consider a program that computes a 40-bin histogram of a random array of float elements. The naive approach is to use atomics at both block and grid scale: atomicAdd (atomic_add in OpenCL) increments the target bin by one every time an element falls into it. Since a large number of atomic operations creates a bandwidth bottleneck, we could think of another approach that gets rid of atomics at block scale:
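As a point of comparison, the naive atomics-based approach might look like the minimal sketch below. It is not the original kernel: the names `NUM_BINS`, `bin_of`, `histogram_atomic` and `histogram_atomic_host` are hypothetical, and the binning rule assumes the input values lie in [0, 1). Error checking on the CUDA runtime calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

constexpr int NUM_BINS = 40;  // histogram size used in the text

// Assumption: inputs lie in [0, 1); out-of-range values are clamped.
__device__ int bin_of(float x) {
    int b = static_cast<int>(x * NUM_BINS);
    return b < 0 ? 0 : (b >= NUM_BINS ? NUM_BINS - 1 : b);
}

// Naive version: atomics at block scale (shared memory) and grid scale (global memory).
__global__ void histogram_atomic(const float* data, int n, unsigned int* hist) {
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Grid-stride loop: every element triggers one shared-memory atomic.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[bin_of(data[i])], 1u);
    __syncthreads();

    // Merge the block's histogram into the global one.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);
}

// Hypothetical host wrapper: copies the input over, launches, copies the result back.
std::vector<unsigned int> histogram_atomic_host(const std::vector<float>& input) {
    std::vector<unsigned int> result(NUM_BINS, 0);
    const int n = static_cast<int>(input.size());
    if (n == 0) return result;

    float* d_data = nullptr;
    unsigned int* d_hist = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_hist, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, NUM_BINS * sizeof(unsigned int));

    const int threads = 256;
    const int blocks = std::min((n + threads - 1) / threads, 1024);
    histogram_atomic<<<blocks, threads>>>(d_data, n, d_hist);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), d_hist, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_hist);
    return result;
}
```

Every input element costs one atomic on shared memory here, which is the traffic the alternative approach tries to avoid.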

We can achieve that by computing local histograms in shared memory (32 of them here, for instance) in every block of threads, then performing a reduction on them, and finally updating the global histogram buffer (with atomics at grid scale).

Kernel with explicit synchronization
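The local-histogram approach with an explicitly synchronized reduction could be sketched as follows. This is an illustration under stated assumptions, not the original kernel: each block is assumed to hold exactly 32 threads, each thread owns one private 40-bin histogram in shared memory, and input values are assumed to lie in [0, 1). The names `BINS`, `LOCAL_HISTS`, `to_bin`, `histogram_reduce_sync` and `histogram_reduce_host` are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

constexpr int BINS = 40;         // histogram size used in the text
constexpr int LOCAL_HISTS = 32;  // assumption: one private histogram per thread,
                                 // so the block size must equal this value

// Assumption: inputs lie in [0, 1); out-of-range values are clamped.
__device__ int to_bin(float x) {
    int b = static_cast<int>(x * BINS);
    return b < 0 ? 0 : (b >= BINS ? BINS - 1 : b);
}

__global__ void histogram_reduce_sync(const float* data, int n, unsigned int* hist) {
    __shared__ unsigned int local[LOCAL_HISTS][BINS];
    const int t = threadIdx.x;

    // Each thread clears and fills only its own row, so no atomics are needed.
    for (int b = 0; b < BINS; ++b) local[t][b] = 0;
    for (int i = blockIdx.x * blockDim.x + t; i < n; i += gridDim.x * blockDim.x)
        local[t][to_bin(data[i])] += 1;
    __syncthreads();

    // Tree reduction over the 32 rows, with an explicit barrier after every step.
    for (int stride = LOCAL_HISTS / 2; stride > 0; stride /= 2) {
        if (t < stride)
            for (int b = 0; b < BINS; ++b)
                local[t][b] += local[t + stride][b];
        __syncthreads();
    }

    // Row 0 now holds the block's histogram; atomics remain only at grid scale.
    for (int b = t; b < BINS; b += blockDim.x)
        if (local[0][b] > 0) atomicAdd(&hist[b], local[0][b]);
}

// Hypothetical host wrapper: copies the input over, launches, copies the result back.
std::vector<unsigned int> histogram_reduce_host(const std::vector<float>& input) {
    std::vector<unsigned int> result(BINS, 0);
    const int n = static_cast<int>(input.size());
    if (n == 0) return result;

    float* d_data = nullptr;
    unsigned int* d_hist = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_hist, BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, BINS * sizeof(unsigned int));

    const int blocks = std::min((n + LOCAL_HISTS - 1) / LOCAL_HISTS, 1024);
    histogram_reduce_sync<<<blocks, LOCAL_HISTS>>>(d_data, n, d_hist);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), d_hist, BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_hist);
    return result;
}
```

The `__syncthreads()` call inside the reduction loop is the explicit barrier in question: it runs once per reduction step, in every block.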

However, during the reduction step, we explicitly synchronize the threads that access shared memory to avoid race conditions. Keep in mind that we are doing that in every block! Therefore, we could think of removing the barriers to boost performance once again. But would that affect the correctness of the output?

A Second's Value

On The Words of Focus Project

Well. Cancer.

I was going to write about how today was strangely a good day, mostly because of the obstacles I faced during it and the ways in which they were overcome.

I had an intense stomach ache all day... yet I took care of the Stroopwafel and planned out the Project Bamboo Idea Funnel with Jernej and Marine.

I had an intense stomach ache, and felt what it is to have others take care of you. Jenny came home, when Nico & I were talking, and just began giving me a massage. It felt AMAZING.
