
Second down

On The Jittery Bee

10am at Paragon...

One more interview down...

Was so nervous that I found myself stumbling quite a bit while reading the book to the children...

During the post-interview review, I rated my performance a 7/10. I was told that I was a bit soft and monotonous when reading the books. Well, I accepted it, as I found it was not easy to speak up with 4 pairs of eyes observing me. It was lucky that the children were lenient with me =))

Shall see how the next two interviews go later on...

(cuda/opencl) Leveraging Block/Workgroup unrolling in reduction algorithms

On GPGPnotes

In the previous post, we saw the benefits of skipping synchronization at warp scale. Certainly, this trick provided a significant speedup, but it didn't actually address the well-known thread idleness present in reduction problems.

To deal with that, we can increase the work per thread block, allowing us to launch the kernel with a smaller grid. Remember that a smaller grid means fewer blocks to schedule, and thus a shorter execution time.
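The idea can be sketched as follows: instead of loading a single element, each thread sums several input elements on load, so the same shared-memory reduction runs with a grid that is several times smaller. This is a minimal sketch under assumed names (`reduceSumUnrolled`, `UNROLL`, a block size of 256), not the exact kernel from this series:

```cuda
#define BLOCK_SIZE 256
#define UNROLL 4  // elements summed per thread at load time (assumed factor)

// Sketch: each block covers BLOCK_SIZE * UNROLL input elements, so the
// grid is UNROLL times smaller than in the classic version.
__global__ void reduceSumUnrolled(const int *in, int *out, unsigned int n)
{
    __shared__ int sdata[BLOCK_SIZE];

    unsigned int tid  = threadIdx.x;
    unsigned int base = blockIdx.x * blockDim.x * UNROLL + threadIdx.x;

    // Each thread accumulates UNROLL elements before the shared-memory phase,
    // keeping otherwise-idle threads busy with useful additions.
    int sum = 0;
    for (unsigned int k = 0; k < UNROLL; ++k) {
        unsigned int idx = base + k * blockDim.x;  // coalesced accesses
        if (idx < n)
            sum += in[idx];
    }
    sdata[tid] = sum;
    __syncthreads();

    // The rest proceeds as in the classic shared-memory tree reduction.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```

Note that the per-thread loop strides by `blockDim.x` rather than reading contiguous chunks, so global loads stay coalesced across the warp.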

Let's delve into that by computing the sum of 16777216 integers.

Kernel performing sum only with classic reduction (implicit warp synchronization included)
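The kernel body did not survive here; a minimal sketch of what this caption describes might look like the following: a classic shared-memory tree reduction whose last 32 steps are unrolled, relying on implicit warp synchronization (the pre-Volta idiom from the previous post) to skip `__syncthreads()`:

```cuda
#define BLOCK_SIZE 256

// Sketch (assumed name reduceSum): classic tree reduction in shared memory,
// with the final warp unrolled so no __syncthreads() is issued once only
// 32 threads remain. Relies on implicit warp synchronization, which is
// only safe on pre-Volta architectures.
__global__ void reduceSum(const int *in, int *out, unsigned int n)
{
    __shared__ int sdata[BLOCK_SIZE];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0;  // one element per thread
    __syncthreads();

    // Tree reduction: half the threads go idle at every step.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Last warp: lockstep execution makes explicit barriers unnecessary;
    // volatile prevents the compiler from caching sdata in registers.
    if (tid < 32) {
        volatile int *v = sdata;
        v[tid] += v[tid + 32];
        v[tid] += v[tid + 16];
        v[tid] += v[tid + 8];
        v[tid] += v[tid + 4];
        v[tid] += v[tid + 2];
        v[tid] += v[tid + 1];
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```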
