Most modern architectures have a weakly ordered memory model, meaning that memory accesses are not guaranteed to become visible to other threads in the order in which the program specifies them. Therefore, to guarantee program correctness, we must explicitly enforce a certain ordering. This can be achieved with memory fences and synchronization barriers.
In GPGPU, barriers are familiar and widely used in the community. Memory fences, however, tend to be overlooked.
- A synchronization barrier is a point at which the threads of a block wait until all of them have reached it.
- A memory fence ensures that all writes made by a thread before the fence are visible to the other threads of the block before any write it makes after the fence. Unlike a barrier, it does not have to be executed by every thread, and no thread waits on another.
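To make the barrier half of this concrete, here is a minimal sketch. It assumes CUDA, which the text above does not name, where the block-wide barrier is `__syncthreads()`. The kernel name `reverse_in_block`, the helper `run_reverse_demo` and the block size `N` are made up for the illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 64  // hypothetical block size, chosen only for this example

// Each thread stores one element in shared memory, waits at the barrier,
// then reads an element that was written by a different thread.
__global__ void reverse_in_block(int* data)
{
    __shared__ int tile[N];
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();            // barrier: every thread of the block must arrive here
    data[t] = tile[N - 1 - t];  // safe: all stores to tile[] are complete and visible
}

// Host helper (hypothetical): runs the kernel on 0..N-1 and checks the result.
bool run_reverse_demo()
{
    int host[N];
    for (int i = 0; i < N; ++i) host[i] = i;

    int* dev = nullptr;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return false;
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    reverse_in_block<<<1, N>>>(dev);
    // This copy blocks until the kernel has finished.
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < N; ++i)
        if (host[i] != N - 1 - i) return false;
    return true;
}

int main()
{
    printf("%s\n", run_reverse_demo() ? "reversed" : "FAILED");
    return 0;
}
```

Without the barrier, a thread could read `tile[N - 1 - t]` before its owner has written it. The cost is that all `N` threads stall at that line, whether or not they have anything to contribute.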
Let's take the case of a kernel in which we launch thread blocks whose data is mapped at runtime. For instance, the algorithm randomly selects X threads that perform global and/or local memory loads and stores, and their results are processed later. The remaining threads (REST) perform no writes and do not affect the kernel instructions that follow the barrier or fence (cf. drawing).
Without any ordering, the kernel would obviously turn into a mess because of race conditions. Synchronizing with a barrier is the standard way to guarantee correctness, but since only the X threads perform writes, making the REST threads wait at the barrier is wasted time. By fencing instead, only the writing threads take part in the ordering, which saves time! Keep in mind that a fence only orders writes: a thread that consumes them still needs a way to know that they have been issued, such as a flag or an atomic counter updated after the fence.
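The X / REST scenario could look roughly like the sketch below. It assumes CUDA, where the block-scope fence is `__threadfence_block()`, and it replaces the random selection with a fixed rule (every `STRIDE`-th thread writes) so that the result can be checked. The names `writers_only`, `run_fence_demo`, `scratch` and `arrived` are hypothetical. The last writer to bump the counter consumes the data, which is one common way to use a fence without a barrier, not the only one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK   128               // threads per block (hypothetical)
#define STRIDE  4                 // every STRIDE-th thread writes (stand-in for the random pick)
#define WRITERS (BLOCK / STRIDE)  // "X" in the text

__global__ void writers_only(volatile int* scratch, unsigned int* arrived, int* out)
{
    int t = threadIdx.x;
    if (t % STRIDE != 0) return;      // REST threads: no writes and no waiting

    scratch[t / STRIDE] = t;          // the writer's store
    __threadfence_block();            // order that store before the counter update
    unsigned int before = atomicAdd(arrived, 1u);

    // The writer that arrives last knows that every other writer has already
    // fenced its store, so it can consume all of them without a barrier.
    if (before == WRITERS - 1u) {
        __threadfence_block();        // keep the reads below after the counter update
        int sum = 0;
        for (int i = 0; i < WRITERS; ++i) sum += scratch[i];
        *out = sum;
    }
}

// Host helper (hypothetical): returns the sum computed on the device, or -1 on error.
int run_fence_demo()
{
    int *scratch = nullptr, *out = nullptr;
    unsigned int* arrived = nullptr;
    cudaMalloc(&scratch, WRITERS * sizeof(int));
    cudaMalloc(&arrived, sizeof(unsigned int));
    cudaMalloc(&out, sizeof(int));
    cudaMemset(arrived, 0, sizeof(unsigned int));  // counter starts at zero
    cudaMemset(out, 0, sizeof(int));

    writers_only<<<1, BLOCK>>>(scratch, arrived, out);

    int result = -1;
    cudaError_t err = cudaMemcpy(&result, out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(scratch);
    cudaFree(arrived);
    cudaFree(out);
    return (err == cudaSuccess) ? result : -1;
}

int main()
{
    int expected = 0;
    for (int t = 0; t < BLOCK; t += STRIDE) expected += t;
    printf("device sum = %d, expected = %d\n", run_fence_demo(), expected);
    return 0;
}
```

The counter lives in global memory and is zeroed from the host, so the kernel needs no barrier at all, not even to initialize it. The REST threads exit immediately instead of idling at a `__syncthreads()`.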
To put it bluntly, I would say that a synchronization barrier is a brute-force way of ordering threads at block scale.
Note that this example is only one of several situations in which memory fences can be preferable to a standard barrier.
Nevertheless, I must admit that these functions are intended for experienced GPGPU developers who already know what they are doing, that is, who can find the best trade-offs between branch divergence, streams, global/local access patterns and thread ordering when the mapping of data to the grid is not controllable.
So, in what kind of situations would you use memory fences?
Feel free to share your thoughts in the comment section below.