Writing cache-friendly code

Alexander Toktarev

Jun 21, 2017, 7 minute read

Cache-friendly code is crucial in a system like Stardog. We share all the secrets. That’s how we roll.

Cache-friendly code matters a lot in systems programming. Stardog Knowledge Graph is based on the Stardog graph database in which performance is critical. In this post we look at issues around cache-friendly code particularly as it relates to Stardog’s native memory management.

Background

CPU registers can move data around in single clock cycles, but they’re expensive. Most computer cores have less than a few dozen registers, which can store a few hundred to maybe a thousand bytes total. On the other hand DRAM memory is cheap—i.e. literally millions of times cheaper—but takes hundreds of cycles to return data.

To bridge this gap between registers—super fast and expensive—and RAM—super slow and cheap—modern systems use caches, which are named L1, L2, L3 in decreasing speed and cost.

The idea is that most of the executing code will be hitting a small set of variables often and hitting the rest of a much larger set of variables far less often. If the processor can’t find the data in L1 cache, then it looks in L2 cache. If not there, then L3 cache; and if not there, main memory. Each of these “misses” is expensive in time.

To achieve really good performance we need to reduce the number of cache misses and try to prevent so-called false sharing.

All the localities!

An important aspect of cache-friendly code is the principle of locality, the goal of which is to place related data close together in the register-RAM-cache hierarchy to allow efficient caching. In terms of the CPU cache, we have to talk about cache lines.

First, lets look at temporal locality. When a given memory location is accessed, the same location may be accessed again soon. Ideally, this information will still be cached.

A street sign that reads, “Friendly St.” hung on a brick wall.

Second, lets look at spatial locality, that is, placing related data close together. When you read from RAM, typically a larger chunk of memory is fetched than was requested because the program may require that data soon. Disk caches do the same.

Now let’s consider how cache-locality affects performance of our application. We will do that using Stardog 5’s new native memory management collections.

HashTables look Ups

Here’s an interesting example of how temporal locality affects performance:

class TestHashSet {
	private void testCheckAll() {
		final HashSet aHashStructure = createHashSet();

		// Insert  elements from 1 to 10_000_000 to the MM-HashSet
		//...
		//...
		long aStartTime = System.currentTimeMillis();

		for (int i = 0; i < 10_000_000; i++) {
			long aElement=i;
            assertTrue(checkContains(aHashStructure, aElement));
		}

		System.out.println("Time (ms)=" + (System.currentTimeMillis() - aStartTime));
	}

	//Output:
	//Time (ms) = 3573 ms.
}

Now we see a 2x improvement doing the same number of operations but checking different data.

class TestHashSet {
	private void testCheckSingle() {
		final HashSet aHashStructure = createHashSet();

		// Insert  elements from 1 to 10_000_000 to the MM-HashSet
		//...
		//...
		long aStartTime = System.currentTimeMillis();

		long aElement=0;
		for (int i = 0; i < 10_000_000; i++) {
            assertTrue(checkContains(aHashStructure, aElement));
		}

		System.out.println("Time (ms)=" + (System.currentTimeMillis() - aStartTime));
	}

	//Output:
	//Time (ms) = 1810 ms.
}

In the second example we see far fewer cache misses and better temporal locality. Because the elements are always the same, the processor can always take this data from the cache.

Statues

The next example shows us growth of performance depending on data variety:

class TestHashSet {
    public void testCheckWindow() throws IOException {
        final HashSet aHashStructure = createHashSet();

        // Insert  elements from 1 to 10_000_000 to the MM-HashSet
        //...


        long aElement;
        long aStartTime = System.currentTimeMillis();
        int aWindow=1000;

        for (int i = 0; i < 10_000_000; i++) {
           assertTrue(checkContains(aHashStructure, aElement));
           aElement = (aElement+1)%aWindow;
        }

        System.out.println("Time (ms)=" + (System.currentTimeMillis() - aStartTime));
     }

     //Output:
     //Time (ms) = 2041  ms.
}

Sorting solutions

As we described in Saying Goodby to Garbage, we have two types of sorting collection, sorted array and long-optimized sorted array. The second exists solely for better temporal locality. Lets look at both.

A sorted array stores a pointer to the element in the address block:

Address block       Data block
.___________.       .___________.
| address_1 |   ->  | data_1    |
|    ...    |       |    ...    |
|    ...    |       |    ...    |
|           |       |           |
.-----------.       .-----------.

When a sorted array sorts data inside the address block, it has to access to the address blocks and then compares elements access to the data blocks. Generally address blocks and data blocks are in different parts of memory, and the processor can’t use cached data during this read-compare operation.

How do long-optimized sorted arrays improve this?

Address block       Data block
.___________.       .___________.
| address_1 |   ->  | data_1    |
| ========= |       |           |
| long_1    |       |           |
.-----------.       |           |
|    ...    |       |    ...    |
|           |       |           |
.-----------.       .-----------.

When long-optimized sorted array sorts data inside the address block it just reads a long value from the same block and does not access the data blocks. It can efficiently uses data stored in cache for the read-compare-swap operations.

This small change results in significant performance increase for sorting 5,000,000 solutions: Sorted array takes 11,619 ms, while long-optimized sorted array takes 4,759 ms.

A scenic shot, focused on clouds

We see real performance difference by improving cache locality. We have a compact format of data in long-optimized sorted arrays which gives us better spatial locality, and we don’t have to jump between memory regions during sorting which gives us better temporal locality, too.

Cache-friendly HashSets

We also have two kinds of hash set: conventional DefaultHashSet and an optimized LongHashSet.

The conventional DefaultHashSet stores data like this:

Address block       Data block
.___________.       .___________.
|address_1  |   ->  | data_1    |
|===========|       |           |
|hash_code_1|       |           |
|    ...    |       |    ...    |
|    ...    |       |    ...    |
|           |       |           |
.-----------.       .-----------.

A lookup reads the corresponding hashcode value—using open addressing probes—and then tries to compare user’s element with the element in the data block. And that requires jumps between memory blocks in different regions of RAM, destroying temporal locality.

In our optimized LongHashSet we store one long value in the address block and don’t use data blocks at all.

_____________
|long_value |
|===========|
|    ...    |
|    ...    |
|           |
------------

We access the one block during lookup and achieve good temporal and spatial locality. The performance results of the optimized hash set include insertion of 10,000,000 solutions and 10,000,000 lookups using 1 long are promising. The conventional HashSet’s insert time is 2883ms and lookup time is 3545ms. The total time is 6428ms. In comparison, LongHashSet is faster across the board: insert 2070ms, lookup 1859ms, and 3929ms total.

Again, about 2x improvement with cache-friendly code.

HashTable data locality

In the original version of our HashTable we stored address data in the partitions and each partition may have many blocks. We improved spatial locality by reducing the size of the element’s slot from 32 bytes to 16 bytes and by using a dynamic number of partitions to keep single memory blocks for each partition. Again, another roughly 2x performance improvement over conventional HashTable

A table filled with cold beer bottles and one filled glass

Before we had the following design:

Partition1
.----------------.   .----------------.
|  MemoryBlock1  |   | MemoryBlock 2  |
.----------------.   .----------------.
|slot1 (32 bytes)|   |slot1 (32 bytes)|
|slot2 (32 bytes)|   |slot2 (32 bytes)|       ....
|slot3 (32 bytes)|   |slot3 (32 bytes)|
|slot4 (32 bytes)|   |slot4 (32 bytes)|
.----------------.   .----------------.

...

Parition_N

.----------------.   .----------------.
|  MemoryBlock1  |   | MemoryBlock 2  |
.----------------.   .----------------.
|slot1 (32 bytes)|   |slot1 (32 bytes)|
|slot2 (32 bytes)|   |slot2 (32 bytes)|       ....
|slot3 (32 bytes)|   |slot3 (32 bytes)|
|slot4 (32 bytes)|   |slot4 (32 bytes)|
.----------------.   .-----------------

With this approach for 10,000,000 “adds” and “contains” the insert time was 8123ms and lookup time was 6321ms—14,444ms total. Data is smeared across the memory blocks and spatial locality is poor.

Now our data layout looks like this:

Partition1
------------------
|  MemoryBlock1  |
------------------
|slot1 (16 bytes)|
|slot2 (16 bytes)|  ...
|slot3 (16 bytes)|
|slot4 (16 bytes)|
------------------

...

Partition_N

------------------
|  MemoryBlock1  |
------------------
|slot1 (16 bytes)|
|slot2 (16 bytes)| ...
|slot3 (16 bytes)|
|slot4 (16 bytes)|
------------------

The HashSet’s tries to keep one memory block for the partition using a resize procedure. But it can use overflow blocks in case of nasty data and hashing behavior. But in most cases it will be one block for each partition, and we get better spatial data locality. The new performance numbers are 2,883ms for insertion, 3,545ms for lookups, and 6,428ms total.

Conclusion

By paying careful attention to cache-friendliness, that is, spatial and temporal locality, we were able to improvement performance for some core Stardog 5 data structures that impact all graph query evaluation and much besides. And we’re able to do this in a native memory management scheme that spends dramatically less time making the JVM collect garbage. Win win!