This article is part of a series on performance guidelines.
Nearly every developer is at least vaguely aware that most modern computers have a memory cache, but not many actually program their inner loops and algorithms with it in mind. And yet, once algorithmic complexity considerations are resolved, the memory cache should become the main focus when trying to increase a program’s performance.
While a typical CPU instruction takes, on average, one clock cycle (sometimes a lot less), an uncached memory access can easily take between 50 and 100 clock cycles (this ratio has been increasing over the years, since CPU clock frequency, driven by market forces, has risen faster than memory speed and bus bandwidth). These days, accessing memory can easily be 100 times slower than executing a single typical assembly instruction. There’s a lot of room for improvement here by working with the memory cache, and not against it.
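To get a feel for this gap on your own machine, a small C# sketch like the one below can help: it chases a chain of indices through an array far larger than any cache level, first in sequential order and then in random order. The array size, the constants and the ratio you’ll see are only indicative, and the sketch is deliberately crude rather than a rigorous benchmark.

using System;
using System.Diagnostics;

class CacheLatencyDemo
{
    static volatile int sink;   // written after each chase so the loop can't be optimized away

    static void Main()
    {
        const int N = 1 << 24;                    // ~16M ints, far larger than any cache level
        int[] next = new int[N];

        // Sequential chain: element i points to i + 1 (prefetcher-friendly).
        for (int i = 0; i < N; i++)
            next[i] = (i + 1) % N;
        Console.WriteLine($"sequential chase: {Chase(next)} ms");

        // Random chain: a shuffled permutation cycle, so nearly every hop misses the cache.
        var rnd = new Random(42);
        int[] perm = new int[N];
        for (int i = 0; i < N; i++) perm[i] = i;
        for (int i = N - 1; i > 0; i--)
        {
            int j = rnd.Next(i + 1);
            (perm[i], perm[j]) = (perm[j], perm[i]);
        }
        for (int i = 0; i < N; i++)
            next[perm[i]] = perm[(i + 1) % N];
        Console.WriteLine($"random chase:     {Chase(next)} ms");
    }

    // Walks the whole chain once; each load depends on the previous one, exposing raw latency.
    static long Chase(int[] next)
    {
        var sw = Stopwatch.StartNew();
        int p = 0;
        for (int i = 0; i < next.Length; i++)
            p = next[p];
        sw.Stop();
        sink = p;
        return sw.ElapsedMilliseconds;
    }
}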
Architecture
In this diagram of a pseudo-CPU memory access model, the arrows represent dataflow paths; the thicker the arrow, the wider the flow capacity. Usually the L1 cache resides within each core; the L2 cache is usually on-chip and may either reside within each core or be shared between them; the L3 cache, if present at all, may be either on or off-chip. The L1 cache is the fastest and usually the tiniest, while the L3 has the greatest capacity and is the slowest.
A simple guideline
While the greatest performance gains may only be achieved by
knowing the exact cache characteristics
of the CPU we’re programming against (look-aside vs look-through, write
policies, cache line and cache page sizes, cache organization – fully associative vs direct-mapped, etc.), in the vast majority of cases we can get
excellent results by following a simple guideline:
Make sure your algorithm accesses only the same small
regions of data for as long as possible before moving elsewhere.
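Even something as simple as the traversal order of a two-dimensional array illustrates the point. In the sketch below (array size chosen arbitrarily), C# stores rectangular arrays in row-major order, so keeping the last index in the inner loop walks consecutive addresses, while swapping the loops jumps a whole row ahead on every access:

const int N = 4096;
var data = new int[N, N];

long sumGood = 0;
for (int row = 0; row < N; row++)        // cache-friendly: consecutive addresses
    for (int col = 0; col < N; col++)
        sumGood += data[row, col];

long sumBad = 0;
for (int col = 0; col < N; col++)        // cache-hostile: a stride of N elements per access
    for (int row = 0; row < N; row++)
        sumBad += data[row, col];

Both loops compute the same sum; only the memory access pattern differs, and on most machines the second one is several times slower.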
Programming techniques – loop blocking
A simple programming technique to achieve this goal is loop blocking: divide your data-scanning loops into coupled sub-loops and reorder them within the hierarchy of nested loops. Here’s a typical example, for a square matrix product (X·Y = Z):
1 – Trivial implementation
for (int I = 0; I < M; I++)
    for (int K = 0; K < M; K++)
        for (int J = 0; J < M; J++)
            Z[J, I] = Z[J, I] + X[K, I] * Y[J, K];
2 – Loop split, for block size B <= M
for (int I = 0; I < M; I++)
    // K loop, part 1
    for (int K2 = 0; K2 < M; K2 += B)
        // K loop, part 2
        for (int K1 = K2; K1 < Math.Min(K2 + B, M); K1++)
            // J loop, part 1
            for (int J2 = 0; J2 < M; J2 += B)
                // J loop, part 2
                for (int J1 = J2; J1 < Math.Min(J2 + B, M); J1++)
                    Z[J1, I] = Z[J1, I] + X[K1, I] * Y[J1, K1];
3 – Reordered loops, now cache-optimized
// K loop, part 1, reordered
for (int K2 = 0; K2 < M; K2 += B)
    // J loop, part 1, reordered
    for (int J2 = 0; J2 < M; J2 += B)
        // reordered I loop
        for (int I = 0; I < M; I++)
            // K loop, part 2, reordered
            for (int K1 = K2; K1 < Math.Min(K2 + B, M); K1++)
                // J loop, part 2, reordered
                for (int J1 = J2; J1 < Math.Min(J2 + B, M); J1++)
                    Z[J1, I] = Z[J1, I] + X[K1, I] * Y[J1, K1];
Given the varying nature of the underlying hardware’s architecture, the best choice for the block size “B” must be determined empirically, by trying a range of reasonable values.
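A minimal sketch of such a search follows: it wraps the reordered loops from example 3 in a method and simply times it for a handful of candidate block sizes. The matrix size, the candidate values and the use of Stopwatch are arbitrary choices, not a rigorous benchmarking setup:

using System;
using System.Diagnostics;

class BlockSizeSearch
{
    // The reordered, blocked multiplication from example 3, wrapped in a method.
    static void MultiplyBlocked(double[,] X, double[,] Y, double[,] Z, int M, int B)
    {
        for (int K2 = 0; K2 < M; K2 += B)
            for (int J2 = 0; J2 < M; J2 += B)
                for (int I = 0; I < M; I++)
                    for (int K1 = K2; K1 < Math.Min(K2 + B, M); K1++)
                        for (int J1 = J2; J1 < Math.Min(J2 + B, M); J1++)
                            Z[J1, I] = Z[J1, I] + X[K1, I] * Y[J1, K1];
    }

    static void Main()
    {
        const int M = 1024;                       // arbitrary matrix size
        var X = new double[M, M];
        var Y = new double[M, M];
        var rnd = new Random(1);
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++)
            {
                X[j, i] = rnd.NextDouble();
                Y[j, i] = rnd.NextDouble();
            }

        foreach (int B in new[] { 8, 16, 32, 64, 128, 256 })
        {
            var Z = new double[M, M];             // fresh, zeroed result for each run
            var sw = Stopwatch.StartNew();
            MultiplyBlocked(X, Y, Z, M, B);
            sw.Stop();
            Console.WriteLine($"B = {B}: {sw.ElapsedMilliseconds} ms");
        }
    }
}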
Other similar techniques exist, all variations on the method above.
Programming techniques – memory buffer layout
The way data is laid out in memory can affect cache performance. As an example, consider the following two schemes for storing key/value 32-bit word pairs:
1 – Two uint[] arrays, one holding the sequence of keys, the
other the corresponding sequence of values. Useful when using algorithms that
iterate over either the key array or the value array, but not both at once. For
example, computing the maximum of the value array. This way, no unnecessary
memory elements are loaded.
uint[] keys, values;
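For the maximum-of-values example, a scan like the one below touches only the values array, so every cache line brought in is filled entirely with data the loop actually needs (the array names match the declaration above):

// Only 'values' is traversed; 'keys' never enters the cache.
uint max = 0;
for (int i = 0; i < values.Length; i++)
    if (values[i] > max)
        max = values[i];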
2 – An array of key/value pair structs. Used when both key and value are needed within each inner-loop iteration (for example, when sorting the array). Whenever a key is loaded from memory it’s likely that, due to proximity, the corresponding value has also been loaded into the cache, resulting in a cache hit when it’s requested.
public struct kvp
{
    public uint key;
    public uint value;
}

kvp[] kvps;
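Sorting is a good illustration: each comparison reads a key, and each move copies the whole pair, so having key and value travel together on the same cache line pays off. A minimal sketch, using the kvp struct declared above and the standard Array.Sort overload that takes a comparison delegate:

// Every comparison touches a key and every move touches the whole pair;
// with the struct layout both are almost always on the same cache line.
Array.Sort(kvps, (a, b) => a.key.CompareTo(b.key));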