Writing cache-friendly code
Cache-friendly code is crucial in a system like Stardog. We share all the secrets. That’s how we roll.
Cache-friendly code matters a lot in systems programming. Stardog Knowledge Graph is based on the Stardog graph database, where performance is critical. In this post we look at issues around cache-friendly code, particularly as they relate to Stardog's native memory management.
CPU registers can move data around in single clock cycles, but they're expensive. Most CPU cores have fewer than a few dozen registers, which together hold a few hundred to maybe a thousand bytes. DRAM, on the other hand, is cheap, literally millions of times cheaper, but takes hundreds of cycles to return data.
To bridge this gap between registers—super fast and expensive—and RAM—super slow and cheap—modern systems use caches, which are named L1, L2, L3 in decreasing speed and cost.
The idea is that most of the executing code will be hitting a small set of variables often and hitting the rest of a much larger set of variables far less often. If the processor can’t find the data in L1 cache, then it looks in L2 cache. If not there, then L3 cache; and if not there, main memory. Each of these “misses” is expensive in time.
To achieve really good performance we need to reduce the number of cache misses and try to prevent so-called false sharing.
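Since false sharing only gets a passing mention here, a quick illustration of what it means: two threads writing to different variables that happen to sit on the same cache line force that line to bounce between cores, even though the threads never touch each other's data. A minimal sketch of our own (the class and field names are hypothetical, and the padding trick assumes the JVM keeps the declared field order; an @Contended-style annotation is the robust alternative):

class FalseSharingDemo {

    // Two hot counters declared side by side will usually share a cache line.
    static class SharedCounters {
        volatile long a;
        volatile long b;
    }

    // Padding fields push the two hot counters onto different cache lines
    // (assuming 64-byte lines and no field reordering by the JVM).
    static class PaddedCounters {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7;
        volatile long b;
    }

    static long time(Runnable first, Runnable second) throws InterruptedException {
        Thread t1 = new Thread(first);
        Thread t2 = new Thread(second);
        long aStartTime = System.currentTimeMillis();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return System.currentTimeMillis() - aStartTime;
    }

    public static void main(String[] args) throws InterruptedException {
        final int N = 100_000_000;
        final SharedCounters shared = new SharedCounters();
        final PaddedCounters padded = new PaddedCounters();

        long sharedMs = time(() -> { for (int i = 0; i < N; i++) shared.a++; },
                             () -> { for (int i = 0; i < N; i++) shared.b++; });
        long paddedMs = time(() -> { for (int i = 0; i < N; i++) padded.a++; },
                             () -> { for (int i = 0; i < N; i++) padded.b++; });

        // On typical hardware the padded version is several times faster,
        // purely because the two counters no longer share a cache line.
        System.out.println("same line: " + sharedMs + " ms, padded: " + paddedMs + " ms");
    }
}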
An important aspect of cache-friendly code is the principle of locality: place related data close together in the register-cache-RAM hierarchy so it can be cached efficiently. In terms of the CPU cache, that means we have to talk about cache lines.
First, let's look at temporal locality: when a given memory location is accessed, the same location is likely to be accessed again soon. Ideally, that data will still be in the cache.
Second, let's look at spatial locality, that is, placing related data close together. When you read from RAM, a larger chunk of memory is typically fetched than was requested, because the program is likely to need nearby data soon. Disk caches do the same.
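To see spatial locality in action, here is a toy example of our own (not Stardog code): summing a large matrix row by row walks memory sequentially and gets full value from every cache line it fetches, while summing the same matrix column by column jumps between rows and misses far more often, even though both loops do identical arithmetic.

class TraversalOrder {
    public static void main(String[] args) {
        final int N = 4_000;
        long[][] aMatrix = new long[N][N];    // roughly 128 MB of zeros
        long aSum = 0;

        // Row-major traversal: consecutive elements share cache lines.
        long aStartTime = System.currentTimeMillis();
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                aSum += aMatrix[row][col];
        System.out.println("row-major (ms)=" + (System.currentTimeMillis() - aStartTime));

        // Column-major traversal: each access lands in a different row's
        // array, so most reads miss the cache.
        aStartTime = System.currentTimeMillis();
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                aSum += aMatrix[row][col];
        System.out.println("column-major (ms)=" + (System.currentTimeMillis() - aStartTime));

        System.out.println(aSum);   // keep the loops from being optimized away
    }
}

On most machines the column-major loop is several times slower, which is the spatial-locality penalty in miniature.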
Now let's consider how cache locality affects the performance of our application. We will do that using Stardog 5's new native memory management collections.
Here’s an interesting example of how temporal locality affects performance:
class TestHashSet {
    private void testCheckAll() {
        final HashSet aHashStructure = createHashSet();
        // Insert elements from 1 to 10_000_000 into the MM-HashSet
        // ...

        long aStartTime = System.currentTimeMillis();
        for (int i = 0; i < 10_000_000; i++) {
            long aElement = i;
            assertTrue(checkContains(aHashStructure, aElement));
        }
        System.out.println("Time (ms)=" + (System.currentTimeMillis() - aStartTime));
    }
    // Output:
    // Time (ms) = 3573 ms.
}
Now we see a roughly 2x improvement doing the same number of operations but looking up a single element every time:
class TestHashSet {
    private void testCheckSingle() {
        final HashSet aHashStructure = createHashSet();
        // Insert elements from 1 to 10_000_000 into the MM-HashSet
        // ...

        long aStartTime = System.currentTimeMillis();
        long aElement = 0;
        for (int i = 0; i < 10_000_000; i++) {
            assertTrue(checkContains(aHashStructure, aElement));
        }
        System.out.println("Time (ms)=" + (System.currentTimeMillis() - aStartTime));
    }
    // Output:
    // Time (ms) = 1810 ms.
}
In the second example we see far fewer cache misses and better temporal locality. Because the element is always the same, the processor can take this data from the cache every time.
The next example shows how performance changes with the variety of the data being accessed; here the lookups cycle through a small window of 1,000 elements:
class TestHashSet {
    public void testCheckWindow() throws IOException {
        final HashSet aHashStructure = createHashSet();
        // Insert elements from 1 to 10_000_000 into the MM-HashSet
        // ...

        long aElement = 0;
        long aStartTime = System.currentTimeMillis();
        int aWindow = 1000;
        for (int i = 0; i < 10_000_000; i++) {
            assertTrue(checkContains(aHashStructure, aElement));
            aElement = (aElement + 1) % aWindow;
        }
        System.out.println("Time (ms)=" + (System.currentTimeMillis() - aStartTime));
    }
    // Output:
    // Time (ms) = 2041 ms.
}
As we described in Saying Goodbye to Garbage, we have two kinds of sorting collections: the sorted array and the long-optimized sorted array. The second exists solely for better temporal locality. Let's look at both.
A sorted array stores a pointer to the element in the address block:
Address block Data block
.___________. .___________.
| address_1 | -> | data_1 |
| ... | | ... |
| ... | | ... |
| | | |
.-----------. .-----------.
When a sorted array sorts data inside the address block, it has to access the address block and then, to compare elements, access the data blocks. Generally the address blocks and data blocks are in different parts of memory, so the processor can't use cached data during this read-compare operation.
How do long-optimized sorted arrays improve this?
Address block Data block
.___________. .___________.
| address_1 | -> | data_1 |
| ========= | | |
| long_1 | | |
.-----------. | |
| ... | | ... |
| | | |
.-----------. .-----------.
When a long-optimized sorted array sorts data inside the address block, it just reads a long value from the same block and does not touch the data blocks. It can efficiently use data already in the cache for the read-compare-swap operations.
This small change results in a significant performance increase when sorting 5,000,000 solutions: the sorted array takes 11,619 ms, while the long-optimized sorted array takes 4,759 ms.
We see a real performance difference by improving cache locality. The compact data format of long-optimized sorted arrays gives us better spatial locality, and not having to jump between memory regions during sorting gives us better temporal locality, too.
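Here is a rough sketch of the two strategies using plain Java arrays rather than Stardog's off-heap blocks (the setup and names are ours): the first sort chases a reference into a separate data region for every comparison, while the second sorts keys stored inline. The boxed indices in the first case add their own overhead, so treat the timings as illustrating the locality effect rather than measuring it precisely.

import java.util.Arrays;
import java.util.Random;

class SortLocalitySketch {
    public static void main(String[] args) {
        final int N = 5_000_000;
        Random aRandom = new Random(42);

        // "Data block": the actual values live in a separate region of memory.
        long[] aDataBlock = new long[N];
        for (int i = 0; i < N; i++) aDataBlock[i] = aRandom.nextLong();

        // Sorted array: the "address block" holds only references (indices
        // standing in for pointers), so every comparison dereferences into
        // the data block.
        Integer[] aAddresses = new Integer[N];
        for (int i = 0; i < N; i++) aAddresses[i] = i;
        long aStartTime = System.currentTimeMillis();
        Arrays.sort(aAddresses, (x, y) -> Long.compare(aDataBlock[x], aDataBlock[y]));
        System.out.println("pointer-chasing sort (ms)=" + (System.currentTimeMillis() - aStartTime));

        // Long-optimized sorted array: the comparison key is inline, so the
        // sort never leaves the block it is sorting.
        long[] aInlineKeys = Arrays.copyOf(aDataBlock, N);
        aStartTime = System.currentTimeMillis();
        Arrays.sort(aInlineKeys);
        System.out.println("inline-key sort (ms)=" + (System.currentTimeMillis() - aStartTime));
    }
}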
We also have two kinds of hash set: the conventional DefaultHashSet and an optimized LongHashSet. The conventional DefaultHashSet stores data like this:
Address block Data block
.___________. .___________.
|address_1 | -> | data_1 |
|===========| | |
|hash_code_1| | |
| ... | | ... |
| ... | | ... |
| | | |
.-----------. .-----------.
A lookup reads the corresponding hashcode value, using open-addressing probes, and then compares the user's element with the element in the data block. That requires jumps between memory blocks in different regions of RAM, destroying temporal locality.
In our optimized LongHashSet we store one long value in the address block and don't use data blocks at all:
Address block
.___________.
| long_value|
|===========|
|    ...    |
|    ...    |
|           |
.-----------.
We access only one block during lookup and achieve good temporal and spatial locality. The performance results of the optimized hash set, measured with insertion of 10,000,000 solutions followed by 10,000,000 lookups using one long, are promising. The conventional HashSet's insert time is 2883ms and lookup time is 3545ms, for a total of 6428ms. In comparison, LongHashSet is faster across the board: insert 2070ms, lookup 1859ms, and 3929ms total. Again, about a 2x improvement with cache-friendly code.
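For illustration, here is a simplified sketch of the two lookup paths with hypothetical method names, not the actual Stardog classes: the conventional layout probes the address block and then follows an address into a separate data block to confirm the match, while the long-optimized layout confirms the match from the probe itself.

class HashLookupSketch {

    // Conventional layout: the table stores hash codes plus addresses
    // (indices into a separate data block), so confirming a hit needs a
    // second memory access in another region.
    static boolean containsIndirect(int[] aHashes, int[] aAddresses, long[] aDataBlock, long aElement) {
        int aMask = aHashes.length - 1;                   // table size is a power of two
        int aHash = Long.hashCode(aElement);
        for (int aSlot = aHash & aMask; ; aSlot = (aSlot + 1) & aMask) {
            int aAddress = aAddresses[aSlot];
            if (aAddress == -1) return false;             // empty slot: not present
            if (aHashes[aSlot] == aHash && aDataBlock[aAddress] == aElement) return true;
        }
    }

    // Long-optimized layout: the long value itself sits in the table, so a
    // probe both finds and confirms the element without leaving the block.
    static boolean containsInline(long[] aTable, boolean[] aUsed, long aElement) {
        int aMask = aTable.length - 1;
        for (int aSlot = Long.hashCode(aElement) & aMask; ; aSlot = (aSlot + 1) & aMask) {
            if (!aUsed[aSlot]) return false;              // empty slot: not present
            if (aTable[aSlot] == aElement) return true;
        }
    }

    public static void main(String[] args) {
        // Tiny power-of-two tables just to exercise both paths.
        int[] aHashes = new int[8];
        int[] aAddresses = new int[8];
        java.util.Arrays.fill(aAddresses, -1);
        long[] aDataBlock = { 42L };
        int aSlot = Long.hashCode(42L) & 7;
        aHashes[aSlot] = Long.hashCode(42L);
        aAddresses[aSlot] = 0;

        long[] aTable = new long[8];
        boolean[] aUsed = new boolean[8];
        aTable[aSlot] = 42L;
        aUsed[aSlot] = true;

        System.out.println(containsIndirect(aHashes, aAddresses, aDataBlock, 42L)); // true
        System.out.println(containsInline(aTable, aUsed, 42L));                     // true
        System.out.println(containsInline(aTable, aUsed, 7L));                      // false
    }
}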
In the original version of our HashTable we stored address data in the partitions, and each partition could have many blocks. We improved spatial locality by reducing the size of an element's slot from 32 bytes to 16 bytes and by using a dynamic number of partitions so that each partition keeps a single memory block. Again, that's another roughly 2x performance improvement over the conventional HashTable.
Before we had the following design:
Partition1
.----------------. .----------------.
| MemoryBlock1 | | MemoryBlock 2 |
.----------------. .----------------.
|slot1 (32 bytes)| |slot1 (32 bytes)|
|slot2 (32 bytes)| |slot2 (32 bytes)| ....
|slot3 (32 bytes)| |slot3 (32 bytes)|
|slot4 (32 bytes)| |slot4 (32 bytes)|
.----------------. .----------------.
...
Partition_N
.----------------. .----------------.
| MemoryBlock1 | | MemoryBlock 2 |
.----------------. .----------------.
|slot1 (32 bytes)| |slot1 (32 bytes)|
|slot2 (32 bytes)| |slot2 (32 bytes)| ....
|slot3 (32 bytes)| |slot3 (32 bytes)|
|slot4 (32 bytes)| |slot4 (32 bytes)|
.----------------. .----------------.
With this approach, for 10,000,000 "adds" and "contains", the insert time was 8123ms and the lookup time was 6321ms, 14,444ms in total. Data is smeared across the memory blocks and spatial locality is poor.
Now our data layout looks like this:
Partition1
------------------
| MemoryBlock1 |
------------------
|slot1 (16 bytes)|
|slot2 (16 bytes)| ...
|slot3 (16 bytes)|
|slot4 (16 bytes)|
------------------
...
Partition_N
------------------
| MemoryBlock1 |
------------------
|slot1 (16 bytes)|
|slot2 (16 bytes)| ...
|slot3 (16 bytes)|
|slot4 (16 bytes)|
------------------
The HashSet tries to keep one memory block per partition using a resize procedure, but it can fall back to overflow blocks in the case of nasty data and hashing behavior. In most cases, though, there will be one block for each partition, and we get better spatial data locality. The new performance numbers are 2,883ms for insertion, 3,545ms for lookups, and 6,428ms total.
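To make the layout concrete, here is a small sketch of one partition kept as a single contiguous block of 16-byte slots (two longs per slot). The class, the field names, and the use of zero as an empty-slot marker are our own simplifications, not Stardog internals; the point is that linear probing walks consecutive positions of one array, and with 16-byte slots a 64-byte cache line covers four of them.

class PartitionSketch {
    private static final int LONGS_PER_SLOT = 2;   // 16 bytes: key plus payload

    private final long[] aBlock;                   // one contiguous memory block per partition
    private final int aSlotCount;                  // power of two

    PartitionSketch(int theSlotCount) {
        aSlotCount = theSlotCount;
        aBlock = new long[theSlotCount * LONGS_PER_SLOT];
    }

    // Zero marks an empty slot in this sketch, so key 0 is not supported here.
    void put(long theKey, long theValue) {
        int aMask = aSlotCount - 1;
        for (int aSlot = Long.hashCode(theKey) & aMask; ; aSlot = (aSlot + 1) & aMask) {
            int aBase = aSlot * LONGS_PER_SLOT;
            if (aBlock[aBase] == 0 || aBlock[aBase] == theKey) {
                aBlock[aBase] = theKey;
                aBlock[aBase + 1] = theValue;
                return;
            }
        }
    }

    // Probing touches consecutive slots, i.e. consecutive positions of aBlock,
    // so most probe sequences stay within one or two cache lines.
    boolean contains(long theKey) {
        int aMask = aSlotCount - 1;
        for (int aSlot = Long.hashCode(theKey) & aMask; ; aSlot = (aSlot + 1) & aMask) {
            int aBase = aSlot * LONGS_PER_SLOT;
            if (aBlock[aBase] == 0) return false;
            if (aBlock[aBase] == theKey) return true;
        }
    }

    public static void main(String[] args) {
        PartitionSketch aPartition = new PartitionSketch(1 << 20);
        aPartition.put(123L, 456L);
        System.out.println(aPartition.contains(123L)); // true
        System.out.println(aPartition.contains(789L)); // false
    }
}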
By paying careful attention to cache-friendliness, that is, spatial and temporal locality, we were able to improve performance for some core Stardog 5 data structures that impact all graph query evaluation and much else besides. And we're able to do this in a native memory management scheme that spends dramatically less time making the JVM collect garbage. Win win!