This article is part of a series on performance guidelines.
After spending a long time answering the same questions, over and over again,
about performance techniques on CPUs and, lately, on GPUs,
I decided to write this text using sorting algorithms as an example:
implementing them properly forces us to review all the fundamental
directives for writing performance-optimized code. Even though sorting
algorithms are at the core of this discussion, the
techniques described apply to nearly every class of algorithm.
Throughout this text, N is the problem’s input size.
Fundamental concepts
To make significant headway through this text, the reader
should already be familiar with the fundamental concepts behind GPU
programming, preferably having already written parallel code. The CUDA
Toolkit’s “CUDA C Programming Guide” is an excellent starting point. Knowledge of
the basics of computational
complexity is also required.
CUDAfy.NET
All sample source code is written in C#, using CUDAfy for all GPU-bound code.
“CUDAfy .NET allows easy development of high performance
GPGPU applications completely from the Microsoft .NET framework”
Before I used CUDAfy, I had to wrap my C code in a
static library and link it against my .NET application, which meant a
significant code-maintenance effort and the annoyance of dealing with two
languages simultaneously. With CUDAfy I can write CPU and GPU code directly in
C#, inside the same assembly and without ever leaving my Visual Studio project.
Except for very contrived scenarios, everything I used to be able to do in C
can now be done in C# without loss of functionality. In spite of its name,
CUDAfy can interface with the GPU not only through CUDA but also via OpenCL,
allowing the developer to target most architectures at will from the same C#
code with (almost) no modifications.
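To give a taste of what this looks like in practice, here is a minimal CUDAfy sketch, not taken from this series: the kernel, class, and array sizes are illustrative. It squares an array on the GPU, with host and device code living in the same C# file. A `[Cudafy]`-attributed method is translated to CUDA C at runtime by `CudafyTranslator.Cudafy()` and launched through the dynamic `Launch` interface (it requires a CUDA-capable device and the NVCC toolchain on the machine).

```csharp
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

public class SquareExample
{
    [Cudafy] // marks this method for translation from C# to CUDA C
    public static void Square(GThread thread, int[] input, int[] output)
    {
        // Standard global thread index; CUDAfy exposes array.Length on the device.
        int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        if (i < input.Length)
            output[i] = input[i] * input[i];
    }

    public static void Main()
    {
        CudafyModule km = CudafyTranslator.Cudafy();     // translate and compile
        GPGPU gpu = CudafyHost.GetDevice(eGPUType.Cuda); // or eGPUType.OpenCL
        gpu.LoadModule(km);

        int[] host = new int[256];
        for (int i = 0; i < host.Length; i++) host[i] = i;

        int[] devIn = gpu.CopyToDevice(host);            // host -> device copy
        int[] devOut = gpu.Allocate<int>(host.Length);   // device-only buffer

        gpu.Launch(1, 256).Square(devIn, devOut);        // 1 block of 256 threads

        int[] result = new int[host.Length];
        gpu.CopyFromDevice(devOut, result);              // device -> host copy
        gpu.FreeAll();
    }
}
```

Note how switching `eGPUType.Cuda` to `eGPUType.OpenCL` is essentially all it takes to retarget the same code, which is the portability point made above.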
Language and Compiler choice
The first step most programmers usually take when tuning
performance is to rewrite their inner loops in a lower-level language (e.g. C++
instead of C#), usually gaining approximately a 10% to 20% speed improvement
with the correct compiler settings. But there’s so much gain to be had
elsewhere that it’s the last thing I usually do. Rewriting your inner loop in a
different language, and setting up a parallel project to link against, is time
consuming, error prone and will lead to a greater code-maintenance effort. Do
yourself a favour and leave it for last.
GPU nomenclature
Throughout this text I’ll use NVIDIA/CUDA terms such as warps
(vs Wavefronts for AMD), shared memory (vs Local Memory
for OpenCL), etc. But in nearly every case, your favourite GPU architecture
will have concepts that match CUDA’s. Everything written herein applies only
to NVIDIA architectures with compute capability >= 2.0.
Diminishing returns
It may seem odd to give such advice in a document about
improving performance, but I have seen this behaviour so often that I felt the
need to make my point early on:
Don’t overdo it. Focus your valuable engineering time where
it may give the greatest return on your investment. Work only on your inner
loops, where the performance gains are greatest. Why spend weeks optimizing the
rest of a piece of software for a measly 5% speed improvement when you can
really tune the inner loop of your algorithms and gain a whopping 50%?
Prioritize, and know when to quit.