This article is part of a series on performance guidelines.
We’ve seen how important it is that threads in a warp work
together when accessing device memory. To that end, one has to abandon the
notion of threads acting on their own, so familiar from CPU programming, and
instead treat the block or the warp as the primary unit of work,
with its underlying threads as mere cogs in a larger machine. I find it
helps to mentally picture the warp as a woven fabric whose threads
are either the loom’s needles or the fabric’s mesh. As an example, consider the
following graphical representation of a well-known parallel sorting algorithm –
Batcher’s Bitonic Sort
– applied to an array of 32 elements.
Each horizontal line represents an index of the 32-element buffer
to be sorted; each arrow depicts a thread performing a “compare-and-swap”
operation on two memory indexes – the two values are compared and, if they are
out of order, swapped.
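As a minimal sketch, assuming unsigned 32-bit keys sorted in ascending order, such an operation might look like this in CUDA (the helper’s name and signature are mine):

```cuda
// Minimal compare-and-swap: leave a[i] and a[j] in ascending order.
// Illustrative helper; name and signature are mine.
__device__ void compareAndSwap(unsigned int *a, int i, int j)
{
    if (a[i] > a[j])
    {
        unsigned int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```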
This class of algorithms (called Sorting Networks) has
the characteristic that, at any point in time, each thread’s actions are
independent of any previous action of any other thread. This is an important
feature, since transferring information between warps in a parallel setup can
be expensive in terms of synchronization. It also makes it possible to
represent the algorithm with a diagram such as the one pictured above.
The diagram is split into columns (A to O) so that all
threads within each column may act in unison without risking a race condition.
Bitonic Sort
For Batcher’s Bitonic Sort, it turns out that each column
always uses the same number of threads – precisely half the number of indexes in
the buffer – and that powers of two are ubiquitous, which means many of the
arithmetic operations involved (divisions, multiplications, modulos) reduce to
cheap bit shifts and masks.
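As an illustration (function names are mine), here is the index arithmetic that recurs throughout the network – finding the lower element of a thread’s compare-and-swap pair at a given stride – first with division and modulo, then in the equivalent shift-and-mask form:

```cuda
// Illustrative only: with power-of-two strides, expensive integer ops
// can be replaced by shifts and masks. Both functions compute the lower
// index of a thread's compare-and-swap pair for a given stride.
__device__ int pairIndexDiv(int tid, int stride)
{
    return 2 * (tid / stride) * stride + (tid % stride);
}

__device__ int pairIndexBits(int tid, int stride)   // stride == 1 << k
{
    return (tid << 1) - (tid & (stride - 1));       // same value, no div/mod
}
```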
There is also an obvious compartmentalization of buffer
access that we can exploit to match the memory access patterns GPUs favour.
Let’s, for the sake of argument, assume we are using the
above diagram to sort a list of 32-bit words on a GPU with a warp size of 8
and a shared memory capacity of 16 32-bit words per warp. We would therefore
need two warps working concurrently to sort this list; call them
Warp1 and Warp2. We’ll assign the sorting buffer’s indexes 0 through 15 to
Warp1 and indexes 16 through 31 to Warp2. Under this setup, if each warp loads
the values of its assigned indexes into its own shared memory
buffer, it will be able to work exclusively with shared memory right up to the
point when the threads reach column K. At this juncture, each warp has no
alternative but to store the result of its computations back into global
memory, making it visible to the other warp. Even though we can’t use
shared memory while processing column K, we can at least have coalesced global memory
access, since the indexes to be accessed are all sequential and aligned on
power-of-two boundaries. Once column K is processed, we can revert to our
original method of using shared memory as temporary storage until the list is
fully sorted. We have, therefore, used two distinct strategies for different
stages of the sorting – one using shared memory, the other enforcing coalesced
global memory access. Unfortunately, we can’t seem to avoid shared memory bank
conflicts, as can easily be seen by examining, for instance, column H, where
thread #0 in Warp1 accesses bank 0 to read index 0 and thread #4 also accesses bank
0 to read index 8 (because there are 8 banks, as many as the warp size, and 8
% 8 == 0).
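To make the shared-memory stage concrete, here is a sketch loosely following the structure of NVIDIA’s bitonicSortShared sample; the tile size, names and key-only simplification are mine, and a realistic tile replaces the 8-wide warps of the example above. Each block loads a tile into shared memory, runs the full network on it, and writes it back; the cross-tile stage (column K above) is omitted:

```cuda
// Sketch of the shared-memory stage only. Launch with
// SHARED_SIZE_LIMIT / 2 threads per block; each block sorts one
// SHARED_SIZE_LIMIT-element tile entirely in shared memory.
#define SHARED_SIZE_LIMIT 1024U

__global__ void bitonicSortTile(unsigned int *d_key)
{
    __shared__ unsigned int s_key[SHARED_SIZE_LIMIT];

    unsigned int tid  = threadIdx.x;
    unsigned int base = blockIdx.x * SHARED_SIZE_LIMIT;

    // Each thread loads two of the tile's elements; consecutive threads
    // read consecutive addresses, so the global loads are coalesced.
    s_key[tid]                         = d_key[base + tid];
    s_key[tid + SHARED_SIZE_LIMIT / 2] = d_key[base + tid + SHARED_SIZE_LIMIT / 2];

    for (unsigned int size = 2; size <= SHARED_SIZE_LIMIT; size <<= 1)
    {
        // Sub-sequences alternate direction to build bitonic runs;
        // on the last pass (size == tile) every thread sorts ascending.
        bool asc = (tid & (size / 2)) == 0;

        for (unsigned int stride = size / 2; stride > 0; stride >>= 1)
        {
            __syncthreads();
            unsigned int pos = 2 * tid - (tid & (stride - 1));
            unsigned int a = s_key[pos];
            unsigned int b = s_key[pos + stride];
            if ((a > b) == asc)   // compare-and-swap toward 'asc' order
            {
                s_key[pos]          = b;
                s_key[pos + stride] = a;
            }
        }
    }

    __syncthreads();
    d_key[base + tid]                         = s_key[tid];
    d_key[base + tid + SHARED_SIZE_LIMIT / 2] = s_key[tid + SHARED_SIZE_LIMIT / 2];
}
```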
We conclude that the Bitonic Sort algorithm exhibits good
locality and good use of warp resources. But it also suffers from shared
memory bank conflicts and isn’t always able to use shared memory as a means to
avoid reading from global memory, particularly as the array size becomes larger.
Odd-Even Merge Sort
When it comes to algorithmic complexity, it turns out that
the best general-purpose sorting network (i.e., one that works for an arbitrary
input array size) is Batcher’s
Odd-Even Merge Sort (apart from some theoretical work with no practical
implementation), with its
¼·n·log(n)·(log(n)−1) + n − 1 compare-and-swap operations, comparing favourably
against Bitonic’s ¼·n·log(n)·(log(n)+1). While both have a worst-case complexity
of O(n·log²(n)), the hidden constants can still play a significant part in our
decision process.
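To put numbers on those constants, here is a small host-side helper of mine that evaluates both counts for power-of-two sizes:

```cuda
#include <cstdio>

// Exact compare-and-swap counts for a power-of-two n, with k = log2(n).
// Illustrative helpers evaluating the two formulas above.
unsigned long long bitonicOps(unsigned long long n, unsigned k)
{
    return n * k * (k + 1) / 4;
}

unsigned long long oddEvenOps(unsigned long long n, unsigned k)
{
    return n * k * (k - 1) / 4 + n - 1;
}

int main()
{
    for (unsigned k = 14; k <= 26; k += 4)
    {
        unsigned long long n = 1ULL << k;
        std::printf("n=2^%-2u  bitonic=%-12llu odd-even=%llu\n",
                    k, bitonicOps(n, k), oddEvenOps(n, k));
    }
    return 0;
}
```

At n = 2^20, for instance, Odd-Even needs about 100.7 million compare-and-swaps against Bitonic’s 110.1 million – roughly a 9% saving.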
That lower operation count, unfortunately, is its only advantage over Bitonic Sort,
since features such as full-warp utilization, spatial locality and a
shared-memory-friendly access pattern only apply at the very beginning of the
sorting process, while the memory access per warp is still properly segmented.
Performance comparison
List size | Bitonic (ms) | Odd-Even (ms)
----------|--------------|--------------
16K       | 0.47         | 1.20
32K       | 0.53         | 1.39
64K       | 0.71         | 1.70
128K      | 0.75         | 2.01
256K      | 0.80         | 2.32
512K      | 1.07         | 6.47
1M        | 4.49         | 14.71
2M        | 12.15        | 33.64
4M        | 30.27        | 75.34
8M        | 72.95        | 169.90
16M       | 173.42       | 382.45
32M       | 405.50       | 848.96
64M       | 932.00       | 1875.55
Is the lower operation count still a good enough justification
to choose it over Bitonic Sort? To find out, I adapted NVIDIA’s CUDA Toolkit
samples to CUDAfy and ran some tests on a GeForce GTX 680. The charted times are in
milliseconds, and the test run times were averaged over 100 repetitions after an
adequate warm-up period. Another chart plots the exact number of compare-and-swap
operations for both sorting algorithms. Both charts are in logarithmic (base 2)
scale. The list size is measured in number of elements, where each element is a
pair of a 32-bit key and a 32-bit value (8 bytes in total); for example, a list
size of 512K results in an underlying buffer of exactly 4 megabytes.
As it turns out, our
implementation of Bitonic Sort wins for array lengths up to 64 million, in
spite of its larger constant factor in the algorithmic complexity, clearly
showing how important the memory access patterns discussed above are.
It’s also worth noticing the change in slope in the vicinity of the 256K list size, and one can’t help wondering whether it is in any way related to the GTX 680’s L2 cache size of 512 KB. A list of 256×1024 elements occupies exactly 2 MB, four times the size of the L2 cache. Even though the whole list no longer fits fully in the L2 cache, the cache’s hit/miss ratio will still remain fairly high for buffers that don’t far exceed the L2 cache’s size. This effect drops sharply once the buffer becomes too large for the L2.
Further considerations
Sorting Networks aren’t the only choice available when it
comes to parallel implementations, but they certainly are the most enticing.
Although viable alternatives exist, such as parallel radix sort (implemented in
the Thrust library, for example), they are either too dependent on
the data type being sorted, fiendishly complex to implement, or plagued by poor
locality (and usually all of the above).
The log-squared factor in Sorting Networks’ complexity isn’t really
a problem in practice, given that it only becomes significant at very large array
sizes, usually beyond the memory available in typical desktop GPUs.
Given the prohibitively slow data transfer rate between host
and device, one usually only sorts arrays on the GPU as part of a larger
GPU-bound algorithm. Otherwise it would be simpler to just sort on the CPU and
do away with the additional memory transfers. Unless, that is, we opt for the
extra complexity of a hybrid CPU-GPU sorter, where three distinct operations are
simultaneously carried out (a sketch of how they can overlap follows the list):
1 – The GPU sorts small to medium-sized arrays;
2 – The CPU merges the already-sorted arrays from (1);
3 – Memory is transferred between host and device. This step may
be further subdivided into two distinct lanes, one copying data to be sorted onto
the device while the other copies sorted data back to the host.
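As a rough illustration of those three lanes, here is a minimal sketch of mine using two CUDA streams and double buffering. gpuSortChunk and cpuMergeChunk are hypothetical placeholders for the GPU sorter and the host-side merge, h_data is assumed to be pinned (cudaMallocHost) so the asynchronous copies can actually overlap, and error handling and the launch configuration are simplified:

```cuda
#include <cuda_runtime.h>

__global__ void gpuSortChunk(unsigned int *d_chunk, size_t len);   // hypothetical GPU sorter
void cpuMergeChunk(unsigned int *h_all, size_t chunk, size_t len); // hypothetical CPU merge

void hybridSort(unsigned int *h_data, size_t numChunks, size_t chunkLen)
{
    cudaStream_t stream[2];
    unsigned int *d_buf[2];
    for (int i = 0; i < 2; ++i)
    {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&d_buf[i], chunkLen * sizeof(unsigned int));
    }

    for (size_t c = 0; c < numChunks; ++c)
    {
        int lane = (int)(c % 2);
        unsigned int *h = h_data + c * chunkLen;

        // Lanes 3a + 1 + 3b, queued on one stream per chunk: upload,
        // sort, download. The chunk on the other stream runs concurrently.
        cudaMemcpyAsync(d_buf[lane], h, chunkLen * sizeof(unsigned int),
                        cudaMemcpyHostToDevice, stream[lane]);
        gpuSortChunk<<<(unsigned)(chunkLen / 1024), 512, 0, stream[lane]>>>(
            d_buf[lane], chunkLen);
        cudaMemcpyAsync(h, d_buf[lane], chunkLen * sizeof(unsigned int),
                        cudaMemcpyDeviceToHost, stream[lane]);

        if (c >= 1)
        {
            // Lane 2: while the GPU works on chunk c, the CPU merges the
            // previous, already downloaded chunk into the running result.
            cudaStreamSynchronize(stream[(c - 1) % 2]);
            cpuMergeChunk(h_data, c - 1, chunkLen);
        }
    }
    cudaStreamSynchronize(stream[(numChunks - 1) % 2]);
    cpuMergeChunk(h_data, numChunks - 1, chunkLen);

    for (int i = 0; i < 2; ++i)
    {
        cudaFree(d_buf[i]);
        cudaStreamDestroy(stream[i]);
    }
}
```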
Each of these steps may happen simultaneously because they
use different resources – the GPU, the CPU and the PCIe bus. The hybrid
approach also has the advantage of easily fixing the lack of a straightforward
implementation of GPU sorting for non-powers of two, by delegating to the CPU
the sorting of the non-power-of-two leftover arrays and their subsequent merge
with the GPU-sorted power-of-two arrays (see the sketch after the list):
1 – Split the list into a larger power-of-two list A and a
smaller list B with the remainder.
2 – Sort A normally on the GPU.
3 – Either sort B on the CPU, or apply step (1) recursively until
the remainder is small enough to be sorted on the CPU.
4 – Merge the sorted lists together on the CPU.
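A host-side sketch of that recipe, where gpuSort is a hypothetical wrapper around the device sorter:

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical wrapper around the GPU sorter (e.g., the bitonic kernel
// sketched earlier); sorts n contiguous elements, n a power of two.
void gpuSort(unsigned int *data, size_t n);

// Sort an arbitrary-length list: GPU for the largest power-of-two
// prefix, CPU for the remainder, then a single host-side merge.
void sortAnyLength(std::vector<unsigned int> &v)
{
    if (v.size() < 2)
        return;

    size_t n = v.size();
    size_t powTwo = 1;
    while (powTwo * 2 <= n)                          // largest power of two <= n
        powTwo *= 2;

    gpuSort(v.data(), powTwo);                       // step 2: GPU sorts A
    std::sort(v.begin() + powTwo, v.end());          // step 3: CPU sorts B
    std::inplace_merge(v.begin(),                    // step 4: merge A and B
                       v.begin() + powTwo, v.end());
}
```

Step (3)’s recursive variant would simply call sortAnyLength on the remainder instead of std::sort; the single std::inplace_merge at the end stays the same.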