Thursday, August 8, 2013

1/7: Introduction


This article is part of a series on performance guidelines.


After spending a long time answering the same questions, over and over, about performance techniques on CPUs and, lately, on GPUs, I decided to write this text using sorting algorithms as a running example, since implementing them properly will force us to review all the fundamental principles of writing performance-optimized code. Even though sorting algorithms are at the core of this discussion, it turns out that all of the techniques described apply to nearly every class of algorithm.
Throughout this text, N is the problem’s input size.

Fundamental concepts

To make any significant headway with this text, the reader should already be familiar with the fundamental concepts behind GPU programming, preferably having already written parallel code. The CUDA Toolkit’s “C Programming Guide” is an excellent starting point. Knowledge of the basics of computational complexity is also required.

CUDAfy.NET

All sample source code is written in C#, using CUDAfy for all GPU-bound code.
“CUDAfy .NET allows easy development of high performance GPGPU applications completely from the Microsoft .NET framework”
Before I used CUDAfy, I had to wrap my C code in a native library and link it against my .NET application, which meant a significant code-maintenance effort and the annoyance of dealing with two languages simultaneously. With CUDAfy I can write CPU and GPU code directly in C#, inside the same assembly and without ever leaving my Visual Studio project. Except for very contrived scenarios, everything I used to be able to do in C can now be done in C# without loss of functionality. In spite of its name, CUDAfy can interface with the GPU not only through CUDA but also via OpenCL, allowing the developer to target most architectures at will from the same C# code with barely any modifications.
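
To give a feel for what this looks like in practice, here is a minimal sketch of a complete CUDAfy program. The vector-add kernel and the sizes used are my own illustration, not part of this series’ sorting code:

    using Cudafy;
    using Cudafy.Host;
    using Cudafy.Translator;

    public class VectorAddExample
    {
        // Methods marked [Cudafy] are translated to CUDA C (or OpenCL) by CUDAfy.
        [Cudafy]
        public static void Add(GThread thread, int[] a, int[] b, int[] c)
        {
            int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
            if (tid < a.Length)
                c[tid] = a[tid] + b[tid];
        }

        public static void Main()
        {
            // Translate all [Cudafy]-tagged members and load them onto the device.
            CudafyModule km = CudafyTranslator.Cudafy();
            GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
            gpu.LoadModule(km);

            const int n = 1024;
            int[] a = new int[n], b = new int[n], c = new int[n];
            for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

            // Copy the inputs to the GPU and allocate space for the result.
            int[] devA = gpu.CopyToDevice(a);
            int[] devB = gpu.CopyToDevice(b);
            int[] devC = gpu.Allocate<int>(c);

            // Launch 4 blocks of 256 threads, then fetch the result back.
            gpu.Launch(n / 256, 256).Add(devA, devB, devC);
            gpu.CopyFromDevice(devC, c);
            gpu.FreeAll();
        }
    }

Retargeting this same code at an OpenCL device is essentially a matter of changing the target (CudafyModes.Target) and the translator language, which is what makes the architecture-agnostic claim above possible.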

Language and Compiler choice

The first step most programmers take when tuning performance is to rewrite their inner loops in a lower-level language (e.g. C++ instead of C#), usually gaining approximately a 10% to 20% speed improvement with the correct compiler settings. But there’s so much gain to be had elsewhere that it’s the last thing I usually do. Rewriting your inner loop in a different language, and setting up a parallel project to link against, is time-consuming, error-prone and will lead to a greater code-maintenance effort. Do yourself a favour and leave it for last.
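
For context, that “parallel project” approach boils down to interop boilerplate like the following; the library name and export are hypothetical, purely to show the shape of what you would be maintaining:

    using System.Runtime.InteropServices;

    static class NativeInnerLoop
    {
        // Hypothetical native DLL built from a separate C/C++ project; both the
        // file name and the function signature are illustrative only.
        [DllImport("innerloop.dll", CallingConvention = CallingConvention.Cdecl)]
        public static extern void SortInts(int[] data, int length);
    }

    // Usage: NativeInnerLoop.SortInts(values, values.Length);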

GPU nomenclature

Throughout this text I’ll use NVIDIA/CUDA terms such as warps (vs. wavefronts for AMD), shared memory (vs. local memory in OpenCL), etc. In nearly every case, your favourite GPU architecture will have concepts that match CUDA’s. Everything written herein applies only to NVIDIA architectures with compute capability >= 2.0.

Diminishing returns

It may seem odd to give this kind of advice in a document about improving performance, but I have seen this behaviour so often that I felt the need to make my point early on:
Don’t overdo it. Focus your valuable engineering time on whatever gives the greatest return on your investment. Work only on your inner loops, where the performance gains are greatest. Why spend weeks optimizing the rest of a piece of software for a measly 5% speed improvement when you can really tune the inner loop of your algorithms and gain a whopping 50%? Prioritize, and know when to quit.