Inter-block communication
Remarks
Blocks in CUDA operate semi-independently. There is no safe way to synchronize all of them within a single kernel launch. That does not mean, however, that they cannot interact with each other at all.
Last-block guard
Consider a grid working on some task, e.g. a parallel reduction over a large array. Initially, each block can do its part of the work independently, producing a partial result. At the end, however, the partial results need to be merged together.
A typical approach is to invoke two kernels, one for the partial computation and the other for merging.
However, if the merging can be done efficiently by a single block, only one kernel call is required.
This is achieved by a lastBlock guard defined as:
__device__ bool lastBlock(int* counter) {
    __threadfence(); //ensure that the partial result is visible to all blocks
    int last = 0;
    if (threadIdx.x == 0)
        last = atomicAdd(counter, 1);
    return __syncthreads_or(last == gridDim.x - 1);
}
An equivalent variant, using shared memory instead of __syncthreads_or:

__device__ bool lastBlock(int* counter) {
    __shared__ int last;
    __threadfence(); //ensure that the partial result is visible to all blocks
    if (threadIdx.x == 0) {
        last = atomicAdd(counter, 1);
    }
    __syncthreads();
    return last == gridDim.x - 1;
}
With such a guard the last block is guaranteed to see all the results produced by all other blocks and can perform the merging.
__device__ void computePartial(T* out) { ... }
__device__ void merge(T* partialResults, T* out) { ... }

__global__ void kernel(int* counter, T* partialResults, T* finalResult) {
    computePartial(&partialResults[blockIdx.x]);
    if (lastBlock(counter)) {
        //this is executed by all threads of the last block only
        merge(partialResults, finalResult);
    }
}
Assumptions:
- The counter must be a global memory pointer, initialized to 0 before the kernel is invoked (see the host-side sketch below).
- The lastBlock function is invoked uniformly by all threads in all blocks.
- The kernel is launched on a one-dimensional grid (for simplicity of the example).
- T names any type you like, but the example is not intended to be a template in the C++ sense.
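To illustrate these assumptions, a hedged host-side sketch follows. It assumes, purely for illustration, that T is float and picks arbitrary launch dimensions; the essential detail is that the counter is zeroed before the kernel is launched.

//hypothetical host-side setup; T from the example above is taken to be float
int numBlocks = 128;
int threadsPerBlock = 256;

int*   counter;
float* partialResults;
float* finalResult;
cudaMalloc(&counter, sizeof(int));
cudaMalloc(&partialResults, numBlocks * sizeof(float));
cudaMalloc(&finalResult, sizeof(float));

cudaMemset(counter, 0, sizeof(int)); //lastBlock relies on the counter starting at 0

kernel<<<numBlocks, threadsPerBlock>>>(counter, partialResults, finalResult);

float result;
cudaMemcpy(&result, finalResult, sizeof(float), cudaMemcpyDeviceToHost); //implicitly synchronizes with the kernel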
Global work queue
Consider an array of work items. The time needed for each work item to complete varies greatly. In order to balance the work distribution between blocks, it may be prudent for each block to fetch the next item only when the previous one is complete. This is in contrast to assigning items to blocks a priori.
class WorkQueue {
private:
    WorkItem* gItems;
    unsigned long long totalSize;
    unsigned long long current; //unsigned long long so that atomicAdd can operate on it
public:
    __device__ WorkItem& fetch() {
        __shared__ WorkItem item;
        if (threadIdx.x == 0) {
            unsigned long long itemIdx = atomicAdd(&current, 1ULL);
            if (itemIdx < totalSize)
                item = gItems[itemIdx];
            else
                item = WorkItem::none();
        }
        __syncthreads();
        return item; //returning reference to smem - ok
    }
};
Assumptions:
- The WorkQueue object, as well as the gItems array, reside in global memory.
- No new work items are added to the WorkQueue object in the kernel that is fetching from it.
- A WorkItem is a small representation of the work assignment, e.g. a pointer to another object.
- WorkItem::none() is a static member function that creates a WorkItem object representing no work at all.
- WorkQueue::fetch() must be called uniformly by all threads in the block.
- There must not be two invocations of WorkQueue::fetch() without another __syncthreads() in between; otherwise a race condition will appear!
The example does not include how to initialize the WorkQueue or populate it. This is done by another kernel or by CPU code and should be fairly straightforward. A sketch of a consuming kernel is given below.
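The following is a minimal, hedged sketch of a kernel consuming from such a queue. It builds on the WorkQueue above; WorkItem::isNone() and processItem() are assumed helpers (not part of the example), standing for a test against WorkItem::none() and for the actual per-item work done cooperatively by the block.

//hypothetical consumer: each block keeps fetching items until the queue is exhausted
__global__ void consumeKernel(WorkQueue* queue) {
    while (true) {
        WorkItem& item = queue->fetch(); //uniform call by all threads of the block
        if (item.isNone())               //assumed helper matching WorkItem::none()
            break;                       //item lives in shared memory, so all threads exit uniformly
        processItem(item);               //assumed per-item work, done by the whole block
        __syncthreads();                 //required before the next fetch() to avoid the race described above
    }
}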