This article is part of a series on performance guidelines.
Nearly every developer is at least vaguely aware that most modern computers have a memory cache, but not many actually program their inner loops and algorithms with it in mind. And yet, once algorithmic complexity considerations are resolved, the memory cache should become the main focus when trying to increase a program’s performance.
While a typical CPU instruction takes, on average, one clock cycle (sometimes a lot less), an uncached memory access can easily take between 50 and 100 clock cycles (this ratio has been increasing over the years, since CPU clock frequency, driven by market forces, has risen faster than memory speed and bus bandwidth). These days, accessing memory can easily be 100 times slower than executing a single typical assembly instruction. There’s a lot of room for improvement here by working with the memory cache, and not against it.
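To get a feel for this gap on your own machine, a small C# sketch like the one below can help: it chases a chain of indices through an array far larger than any cache level, first in sequential order and then in random order. The array size, the constants and the ratio you’ll see are only indicative, and the sketch is deliberately crude rather than a rigorous benchmark.

using System;
using System.Diagnostics;

class CacheLatencyDemo
{
    static volatile int sink;   // written after each chase so the loop can't be optimized away

    static void Main()
    {
        const int N = 1 << 24;                    // ~16M ints, far larger than any cache level
        int[] next = new int[N];

        // Sequential chain: element i points to i + 1 (prefetcher-friendly).
        for (int i = 0; i < N; i++)
            next[i] = (i + 1) % N;
        Console.WriteLine($"sequential chase: {Chase(next)} ms");

        // Random chain: a shuffled permutation cycle, so nearly every hop misses the cache.
        var rnd = new Random(42);
        int[] perm = new int[N];
        for (int i = 0; i < N; i++) perm[i] = i;
        for (int i = N - 1; i > 0; i--)
        {
            int j = rnd.Next(i + 1);
            (perm[i], perm[j]) = (perm[j], perm[i]);
        }
        for (int i = 0; i < N; i++)
            next[perm[i]] = perm[(i + 1) % N];
        Console.WriteLine($"random chase:     {Chase(next)} ms");
    }

    // Walks the whole chain once; each load depends on the previous one, exposing raw latency.
    static long Chase(int[] next)
    {
        var sw = Stopwatch.StartNew();
        int p = 0;
        for (int i = 0; i < next.Length; i++)
            p = next[p];
        sw.Stop();
        sink = p;
        return sw.ElapsedMilliseconds;
    }
}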
Architecture
In this diagram of a pseudo-CPU memory access model, the arrows represent dataflow paths; the thicker the arrow, the wider the flow capacity. Usually the L1 cache resides within each core; the L2 cache is usually on-chip and may either reside within each core or be shared between them; the L3 cache, if present at all, may be either on or off-chip. The L1 cache is the fastest and usually the tiniest, while the L3 has the greatest capacity and is the slowest.
A simple guideline
While the greatest performance gains may only be achieved by
knowing the exact cache characteristics
of the CPU we’re programming against (look-aside vs look-through, write
policies, cache line and cache page sizes, cache organization – fully associative vs direct-mapped, etc.), in the vast majority of cases we can get
excellent results by following a simple guideline:
Make sure your algorithm accesses only the same small
regions of data for as long as possible before moving elsewhere.
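Even something as simple as the traversal order of a two-dimensional array illustrates the point. In the sketch below (array size chosen arbitrarily), C# stores rectangular arrays in row-major order, so keeping the last index in the inner loop walks consecutive addresses, while swapping the loops jumps a whole row ahead on every access:

const int N = 4096;
var data = new int[N, N];

long sumGood = 0;
for (int row = 0; row < N; row++)        // cache-friendly: consecutive addresses
    for (int col = 0; col < N; col++)
        sumGood += data[row, col];

long sumBad = 0;
for (int col = 0; col < N; col++)        // cache-hostile: a stride of N elements per access
    for (int row = 0; row < N; row++)
        sumBad += data[row, col];

Both loops compute the same sum; only the memory access pattern differs, and on most machines the second one is several times slower.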
Programming techniques – loop blocking
A simple programming technique to achieve this goal is loop blocking: divide your data-scanning loops into coupled sub-loops and reorder them within the hierarchy of nested loops. Here’s a typical example, for a square matrix product (X·Y = Z):
1 – Trivial implementation
for (int I = 0; I < M; I++)
    for (int K = 0; K < M; K++)
        for (int J = 0; J < M; J++)
            Z[J, I] = Z[J, I] + X[K, I] * Y[J, K];
2 – Loop split, for block size B <= M
for (int I = 0; I < M; I++)
    // K loop, part 1
    for (int K2 = 0; K2 < M; K2 += B)
        // K loop, part 2
        for (int K1 = K2; K1 < Math.Min(K2 + B, M); K1++)
            // J loop, part 1
            for (int J2 = 0; J2 < M; J2 += B)
                // J loop, part 2
                for (int J1 = J2; J1 < Math.Min(J2 + B, M); J1++)
                    Z[J1, I] = Z[J1, I] + X[K1, I] * Y[J1, K1];
3 – Reordered loops, now cache-optimized
// K loop, part 1, reordered
for (int K2 = 0; K2 < M; K2 += B)
    // J loop, part 1, reordered
    for (int J2 = 0; J2 < M; J2 += B)
        // reordered I loop
        for (int I = 0; I < M; I++)
            // K loop, part 2, reordered
            for (int K1 = K2; K1 < Math.Min(K2 + B, M); K1++)
                // J loop, part 2, reordered
                for (int J1 = J2; J1 < Math.Min(J2 + B, M); J1++)
                    Z[J1, I] = Z[J1, I] + X[K1, I] * Y[J1, K1];
Given the varying nature of the underlying hardware’s architecture, the best choice for the block size “B” must be determined empirically, by trying a range of reasonable values.
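A minimal sketch of such a search follows: it wraps the reordered loops from example 3 in a method and simply times it for a handful of candidate block sizes. The matrix size, the candidate values and the use of Stopwatch are arbitrary choices, not a rigorous benchmarking setup:

using System;
using System.Diagnostics;

class BlockSizeSearch
{
    // The reordered, blocked multiplication from example 3, wrapped in a method.
    static void MultiplyBlocked(double[,] X, double[,] Y, double[,] Z, int M, int B)
    {
        for (int K2 = 0; K2 < M; K2 += B)
            for (int J2 = 0; J2 < M; J2 += B)
                for (int I = 0; I < M; I++)
                    for (int K1 = K2; K1 < Math.Min(K2 + B, M); K1++)
                        for (int J1 = J2; J1 < Math.Min(J2 + B, M); J1++)
                            Z[J1, I] = Z[J1, I] + X[K1, I] * Y[J1, K1];
    }

    static void Main()
    {
        const int M = 1024;                       // arbitrary matrix size
        var X = new double[M, M];
        var Y = new double[M, M];
        var rnd = new Random(1);
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++)
            {
                X[j, i] = rnd.NextDouble();
                Y[j, i] = rnd.NextDouble();
            }

        foreach (int B in new[] { 8, 16, 32, 64, 128, 256 })
        {
            var Z = new double[M, M];             // fresh, zeroed result for each run
            var sw = Stopwatch.StartNew();
            MultiplyBlocked(X, Y, Z, M, B);
            sw.Stop();
            Console.WriteLine($"B = {B}: {sw.ElapsedMilliseconds} ms");
        }
    }
}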
Other similar techniques exist, all variations on the method above.
Programming techniques – memory buffer layout
The way data is laid out in memory can affect cache performance. As an example, consider the following two schemes for storing key/value 32-bit word pairs:
1 – Two uint[] arrays, one holding the sequence of keys, the
other the corresponding sequence of values. Useful when using algorithms that
iterate over either the key array or the value array, but not both at once. For
example, computing the maximum of the value array. This way, no unnecessary
memory elements are loaded.
uint[] keys, values;
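For the maximum-of-values example, a scan like the one below touches only the values array, so every cache line brought in is filled entirely with data the loop actually needs (the array names match the declaration above):

// Only 'values' is traversed; 'keys' never enters the cache.
uint max = 0;
for (int i = 0; i < values.Length; i++)
    if (values[i] > max)
        max = values[i];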
2 – An array of key/value pair structs. Used when both key and value are needed within each inner-loop iteration (for example, when sorting the array). Whenever a key is loaded from memory it’s likely that, due to proximity, the corresponding value has also been loaded into the cache, resulting in a cache hit when it’s requested.
public struct kvp
{
    public uint key;
    public uint value;
}

kvp[] kvps;
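Sorting is a good illustration: each comparison reads a key, and each move copies the whole pair, so having key and value travel together on the same cache line pays off. A minimal sketch, using the kvp struct declared above and the standard Array.Sort overload that takes a comparison delegate:

// Every comparison touches a key and every move touches the whole pair;
// with the struct layout both are almost always on the same cache line.
Array.Sort(kvps, (a, b) => a.key.CompareTo(b.key));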