This article is part of a series on performance guidelines.
We’ve seen how important it is that threads in a warp work
together when accessing device memory. To that end, one has to abandon the
notion of threads acting on their own, so familiar from CPU programming, and
instead treat the block or the warp as the primary unit of work,
with its underlying threads as mere cogs in a larger machine. I find it
helps to mentally picture the warp as a woven fabric whose threads
are either the loom’s needles or the fabric’s mesh. As an example, consider the
following graphical representation of a well-known parallel sorting algorithm –
Batcher’s Bitonic Sort
– applied to an array of 32 elements.
Each horizontal line represents an index of the 32-element buffer
to be sorted; each arrow depicts a thread performing a “compare-and-swap”
operation on two memory indexes – the two values are compared and, if they are
out of order, swapped.
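As a minimal sketch, assuming unsigned 32-bit keys sorted in ascending order, such an operation might look like this in CUDA (the helper’s name and signature are mine):

```cuda
// Minimal compare-and-swap: leave a[i] and a[j] in ascending order.
// Illustrative helper; name and signature are mine.
__device__ void compareAndSwap(unsigned int *a, int i, int j)
{
    if (a[i] > a[j])
    {
        unsigned int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```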
This class of algorithms (called Sorting Networks) has
the characteristic that, at any point in time, each thread’s actions are
independent of any previous action of any other thread. This is an important
feature, since transferring information between warps in a parallel setup can
be expensive in terms of synchronization. It also makes it possible to
represent the algorithm with a diagram such as the one pictured above.
The diagram is split into columns (A to O) so that all
threads within each column may act in unison without risking a race condition.
Bitonic Sort
For Batcher’s Bitonic Sort, it turns out that each column
always uses the same number of threads – precisely half the number of indexes in
the buffer – and that powers of two are ubiquitous, which means many of the
arithmetic operations involved (divisions, multiplications, modulos) reduce to
cheap bit shifts and masks.
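As an illustration (function names are mine), here is the index arithmetic that recurs throughout the network – finding the lower element of a thread’s compare-and-swap pair at a given stride – first with division and modulo, then in the equivalent shift-and-mask form:

```cuda
// Illustrative only: with power-of-two strides, expensive integer ops
// can be replaced by shifts and masks. Both functions compute the lower
// index of a thread's compare-and-swap pair for a given stride.
__device__ int pairIndexDiv(int tid, int stride)
{
    return 2 * (tid / stride) * stride + (tid % stride);
}

__device__ int pairIndexBits(int tid, int stride)   // stride == 1 << k
{
    return (tid << 1) - (tid & (stride - 1));       // same value, no div/mod
}
```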
There is also an obvious compartmentalization of buffer
access that we can exploit to match the memory access patterns GPUs favour.
Let’s, for the sake of argument, assume we are using the
above diagram to sort a list of 32-bit words on a GPU with a warp size of 8
and a shared memory capacity of 16 32-bit words per warp. We would therefore
need two warps working concurrently to sort this list; call them
Warp1 and Warp2. We’ll assign the sorting buffer’s indexes 0 through 15 to
Warp1 and indexes 16 through 31 to Warp2. Under this setup, if each warp loads
the values of its assigned indexes into its own shared memory
buffer, it will be able to work exclusively with shared memory right up to the
point when the threads reach column K. At this juncture, each warp has no
alternative but to store the result of its computations back into global
memory, making it visible to the other warp. Even though we can’t use
shared memory while processing column K, we can at least have coalesced global memory
access, since the indexes to be accessed are all sequential and aligned on
power-of-two boundaries. Once column K is processed, we can revert to our
original method of using shared memory as temporary storage until the list is
fully sorted. We have, therefore, used two distinct strategies for different
stages of the sorting – one using shared memory, the other enforcing coalesced
global memory access. Unfortunately, we can’t seem to avoid shared memory bank
conflicts, as can easily be seen by examining, for instance, column H, where
thread #0 in Warp1 accesses bank 0 to read index 0 and thread #4 also accesses bank
0 to read index 8 (because there are 8 banks, as many as the warp size, and 8
% 8 == 0).
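To make the shared-memory stage concrete, here is a sketch loosely following the structure of NVIDIA’s bitonicSortShared sample; the tile size, names and key-only simplification are mine, and a realistic tile replaces the 8-wide warps of the example above. Each block loads a tile into shared memory, runs the full network on it, and writes it back; the cross-tile stage (column K above) is omitted:

```cuda
// Sketch of the shared-memory stage only. Launch with
// SHARED_SIZE_LIMIT / 2 threads per block; each block sorts one
// SHARED_SIZE_LIMIT-element tile entirely in shared memory.
#define SHARED_SIZE_LIMIT 1024U

__global__ void bitonicSortTile(unsigned int *d_key)
{
    __shared__ unsigned int s_key[SHARED_SIZE_LIMIT];

    unsigned int tid  = threadIdx.x;
    unsigned int base = blockIdx.x * SHARED_SIZE_LIMIT;

    // Each thread loads two of the tile's elements; consecutive threads
    // read consecutive addresses, so the global loads are coalesced.
    s_key[tid]                         = d_key[base + tid];
    s_key[tid + SHARED_SIZE_LIMIT / 2] = d_key[base + tid + SHARED_SIZE_LIMIT / 2];

    for (unsigned int size = 2; size <= SHARED_SIZE_LIMIT; size <<= 1)
    {
        // Sub-sequences alternate direction to build bitonic runs;
        // on the last pass (size == tile) every thread sorts ascending.
        bool asc = (tid & (size / 2)) == 0;

        for (unsigned int stride = size / 2; stride > 0; stride >>= 1)
        {
            __syncthreads();
            unsigned int pos = 2 * tid - (tid & (stride - 1));
            unsigned int a = s_key[pos];
            unsigned int b = s_key[pos + stride];
            if ((a > b) == asc)   // compare-and-swap toward 'asc' order
            {
                s_key[pos]          = b;
                s_key[pos + stride] = a;
            }
        }
    }

    __syncthreads();
    d_key[base + tid]                         = s_key[tid];
    d_key[base + tid + SHARED_SIZE_LIMIT / 2] = s_key[tid + SHARED_SIZE_LIMIT / 2];
}
```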
We conclude that the Bitonic Sort algorithm exhibits good
locality and good use of warp resources. But it also suffers from shared
memory bank conflicts and isn’t always able to use shared memory as a means to
avoid reading from global memory, particularly as the array size becomes larger.
Odd-Even Merge Sort
When it comes to algorithmic complexity, it turns out that
the best general-purpose sorting network (i.e., one that works for an arbitrary
input array size) is Batcher’s
Odd-Even Merge Sort (apart from some theoretical work with no practical
implementation), with its
¼·n·log(n)·(log(n)−1) + n − 1 compare-and-swap operations, comparing favourably
against Bitonic’s ¼·n·log(n)·(log(n)+1). While both have a worst-case complexity
of O(n·log²(n)), the hidden constants can still play a significant part in our
decision process.
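To put numbers on those constants, here is a small host-side helper of mine that evaluates both counts for power-of-two sizes:

```cuda
#include <cstdio>

// Exact compare-and-swap counts for a power-of-two n, with k = log2(n).
// Illustrative helpers evaluating the two formulas above.
unsigned long long bitonicOps(unsigned long long n, unsigned k)
{
    return n * k * (k + 1) / 4;
}

unsigned long long oddEvenOps(unsigned long long n, unsigned k)
{
    return n * k * (k - 1) / 4 + n - 1;
}

int main()
{
    for (unsigned k = 14; k <= 26; k += 4)
    {
        unsigned long long n = 1ULL << k;
        std::printf("n=2^%-2u  bitonic=%-12llu odd-even=%llu\n",
                    k, bitonicOps(n, k), oddEvenOps(n, k));
    }
    return 0;
}
```

At n = 2^20, for instance, Odd-Even needs about 100.7 million compare-and-swaps against Bitonic’s 110.1 million – roughly a 9% saving.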
That lower operation count, unfortunately, is its only advantage over Bitonic Sort,
since features such as full-warp utilization, spatial locality and a
shared-memory-friendly access pattern only apply at the very beginning of the
sorting process, while the memory access per warp is still properly segmented.
Performance comparison
List size | Bitonic (ms) | Odd-Even (ms)
----------|--------------|--------------
16K       | 0.47         | 1.20
32K       | 0.53         | 1.39
64K       | 0.71         | 1.70
128K      | 0.75         | 2.01
256K      | 0.80         | 2.32
512K      | 1.07         | 6.47
1M        | 4.49         | 14.71
2M        | 12.15        | 33.64
4M        | 30.27        | 75.34
8M        | 72.95        | 169.90
16M       | 173.42       | 382.45
32M       | 405.50       | 848.96
64M       | 932.00       | 1875.55
Is the lower operation count still a good enough justification
to choose it over Bitonic Sort? To find out, I adapted NVIDIA’s CUDA Toolkit
samples to CUDAfy and ran some tests on a GeForce GTX 680. The charted times are in
milliseconds, and the test run times were averaged over 100 repetitions after an
adequate warm-up period. Another chart plots the exact number of compare-and-swap
operations for both sorting algorithms. Both charts are in logarithmic (base 2)
scale. The list size is measured in number of elements, where each element is a
pair of a 32-bit key and a 32-bit value (8 bytes in total); for example, a list
size of 512K results in an underlying buffer of exactly 4 megabytes.
As it turns out, our
implementation of Bitonic Sort wins for array lengths up to 64 million, in
spite of its larger constant factor in the algorithmic complexity, clearly
showing how important the memory access patterns discussed above are.
It’s also worth noticing the change in slope in the vicinity of the 256K list size, and one can’t help wondering whether it is in any way related to the GTX 680’s L2 cache size of 512 KB. A list of 256×1024 elements occupies exactly 2 MB, four times the size of the L2 cache. Even though the whole list no longer fits fully in the L2 cache, the cache’s hit/miss ratio will still remain fairly high for buffers that don’t far exceed the L2 cache’s size. This effect drops sharply once the buffer becomes too large for the L2.
Further considerations
Sorting Networks aren’t the only choice available when it
comes to parallel implementations, but they certainly are the most enticing.
Although viable alternatives exist, such as parallel radix sort (implemented in
the Thrust library, for example), they are either too dependent on
the data type being sorted, fiendishly complex to implement, or plagued by poor
locality (and usually all of the above).
The log-squared factor in Sorting Networks’ complexity isn’t really
a problem in practice, given that it only becomes significant at very large array
sizes, usually beyond the memory available in typical desktop GPUs.
Given the prohibitively slow data transfer rate between host
and device, one usually only sorts arrays on the GPU as part of a larger
GPU-bound algorithm. Otherwise it would be simpler to just sort on the CPU and
do away with the additional memory transfers. Unless, that is, we opt for the
extra complexity of a hybrid CPU-GPU sorter, where three distinct operations are
simultaneously carried out (a sketch of how they can overlap follows the list):
1 – The GPU sorts small to medium-sized arrays;
2 – The CPU merges the already-sorted arrays from (1);
3 – Memory is transferred between host and device. This step may
be further subdivided into two distinct lanes, one copying data to be sorted onto
the device while the other copies sorted data back to the host.
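As a rough illustration of those three lanes, here is a minimal sketch of mine using two CUDA streams and double buffering. gpuSortChunk and cpuMergeChunk are hypothetical placeholders for the GPU sorter and the host-side merge, h_data is assumed to be pinned (cudaMallocHost) so the asynchronous copies can actually overlap, and error handling and the launch configuration are simplified:

```cuda
#include <cuda_runtime.h>

__global__ void gpuSortChunk(unsigned int *d_chunk, size_t len);   // hypothetical GPU sorter
void cpuMergeChunk(unsigned int *h_all, size_t chunk, size_t len); // hypothetical CPU merge

void hybridSort(unsigned int *h_data, size_t numChunks, size_t chunkLen)
{
    cudaStream_t stream[2];
    unsigned int *d_buf[2];
    for (int i = 0; i < 2; ++i)
    {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&d_buf[i], chunkLen * sizeof(unsigned int));
    }

    for (size_t c = 0; c < numChunks; ++c)
    {
        int lane = (int)(c % 2);
        unsigned int *h = h_data + c * chunkLen;

        // Lanes 3a + 1 + 3b, queued on one stream per chunk: upload,
        // sort, download. The chunk on the other stream runs concurrently.
        cudaMemcpyAsync(d_buf[lane], h, chunkLen * sizeof(unsigned int),
                        cudaMemcpyHostToDevice, stream[lane]);
        gpuSortChunk<<<(unsigned)(chunkLen / 1024), 512, 0, stream[lane]>>>(
            d_buf[lane], chunkLen);
        cudaMemcpyAsync(h, d_buf[lane], chunkLen * sizeof(unsigned int),
                        cudaMemcpyDeviceToHost, stream[lane]);

        if (c >= 1)
        {
            // Lane 2: while the GPU works on chunk c, the CPU merges the
            // previous, already downloaded chunk into the running result.
            cudaStreamSynchronize(stream[(c - 1) % 2]);
            cpuMergeChunk(h_data, c - 1, chunkLen);
        }
    }
    cudaStreamSynchronize(stream[(numChunks - 1) % 2]);
    cpuMergeChunk(h_data, numChunks - 1, chunkLen);

    for (int i = 0; i < 2; ++i)
    {
        cudaFree(d_buf[i]);
        cudaStreamDestroy(stream[i]);
    }
}
```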
Each of these steps may happen simultaneously because they
use different resources – the GPU, the CPU and the PCIe bus. The hybrid
approach also has the advantage of easily fixing the lack of a straightforward
implementation of GPU sorting for non-powers of two, by delegating to the CPU
the sorting of the non-power-of-two leftover arrays and their subsequent merge
with the GPU-sorted power-of-two arrays (see the sketch after the list):
1 – Split the list into a larger power-of-two list A and a
smaller list B with the remainder.
2 – Sort A normally on the GPU.
3 – Either sort B on the CPU, or apply step (1) recursively until
the remainder is small enough to be sorted on the CPU.
4 – Merge the sorted lists together on the CPU.
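A host-side sketch of that recipe, where gpuSort is a hypothetical wrapper around the device sorter:

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical wrapper around the GPU sorter (e.g., the bitonic kernel
// sketched earlier); sorts n contiguous elements, n a power of two.
void gpuSort(unsigned int *data, size_t n);

// Sort an arbitrary-length list: GPU for the largest power-of-two
// prefix, CPU for the remainder, then a single host-side merge.
void sortAnyLength(std::vector<unsigned int> &v)
{
    if (v.size() < 2)
        return;

    size_t n = v.size();
    size_t powTwo = 1;
    while (powTwo * 2 <= n)                          // largest power of two <= n
        powTwo *= 2;

    gpuSort(v.data(), powTwo);                       // step 2: GPU sorts A
    std::sort(v.begin() + powTwo, v.end());          // step 3: CPU sorts B
    std::inplace_merge(v.begin(),                    // step 4: merge A and B
                       v.begin() + powTwo, v.end());
}
```

Step (3)’s recursive variant would simply call sortAnyLength on the remainder instead of std::sort; the single std::inplace_merge at the end stays the same.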