Saying Goodbye to Garbage
Memory management is the main problem for any heavily loaded Java application that operates on huge amounts of data. We solve this problem with a new, byte-based memory management scheme that will debut in the upcoming Stardog 5 release.
The new scheme will make Stardog more efficient with respect to memory, more stable under very heavy workloads, and in some cases more performant. The first use of the new scheme will be in query evaluation, the part of Stardog most vulnerable to memory pressure. Over the 5.x release cycle we will integrate the new approach into other parts of the system.
In this post we describe that scheme and its advantages. In next week’s post we will look at some micro-benchmarks to get an idea about the overhead of the new approach.
The JVM's garbage collector has two fundamental problems: heavy allocation puts constant pressure on the GC and causes unpredictable pauses, and an OutOfMemoryException can be thrown if there is insufficient memory. How do we solve this without rewriting Stardog in C++?
Consider the main problem with the standard Java approach. Using plain Java objects in a heavily loaded application, we allocate memory for an object, use that memory, and then release the strong reference to the object. If the number of such objects is large, this inevitably puts enormous pressure on the GC, and we cannot control memory overflow.
To prevent this problem we don't use Java objects to store data. Instead we use a byte-based representation of the data: we operate directly on serialized data and control memory allocation and memory use directly, byte by byte. If there is no more memory, we can always use the disk. Because we control every byte used for our data, we can prevent OutOfMemoryException.
We were inspired by the Apache Flink approach as described in Juggling with Bits and Bytes. The same technique is used in Apache Drill, Apache Ignite (incubating), Apache Geode (incubating), and Apache Spark (Project Tungsten) to name a few.
Here are some design basics and constraints we've implemented:

- Data is stored in memory blocks, each of which implements the MemoryBlock interface and can live on or off the JVM heap.
- An on-heap block is backed by a plain byte[]-array, so data is kept in serialized form rather than as graphs of Java objects.

The main advantages of this approach: memory use is fully under our control, GC pressure drops because we allocate a small number of large blocks instead of many small objects, and data can be spilled to disk instead of failing with OutOfMemoryException. We think it's obvious how each of these benefits will improve Stardog.
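To make the block abstraction concrete, here is a minimal sketch of what a MemoryBlock might look like in Java. The method names and the ByteBuffer-based implementation are ours for illustration, not Stardog's actual API:

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; the names below are not Stardog's actual API.
// A MemoryBlock is a fixed-size chunk of memory addressed by byte offset.
interface MemoryBlock {
    int capacity();
    void putBytes(int offset, byte[] src, int srcOffset, int length);
    void getBytes(int offset, byte[] dst, int dstOffset, int length);
    void putLong(int offset, long value);
    long getLong(int offset);
}

// Backed by a ByteBuffer: ByteBuffer.allocate(..) gives an on-heap block wrapping a
// byte[] array, while ByteBuffer.allocateDirect(..) gives an off-heap block.
final class ByteBufferMemoryBlock implements MemoryBlock {
    private final ByteBuffer buffer;

    private ByteBufferMemoryBlock(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    static ByteBufferMemoryBlock onHeap(int capacity) {
        return new ByteBufferMemoryBlock(ByteBuffer.allocate(capacity));
    }

    static ByteBufferMemoryBlock offHeap(int capacity) {
        return new ByteBufferMemoryBlock(ByteBuffer.allocateDirect(capacity));
    }

    @Override public int capacity() { return buffer.capacity(); }

    @Override public void putBytes(int offset, byte[] src, int srcOffset, int length) {
        ByteBuffer view = buffer.duplicate();   // avoid mutating the shared position
        view.position(offset);
        view.put(src, srcOffset, length);
    }

    @Override public void getBytes(int offset, byte[] dst, int dstOffset, int length) {
        ByteBuffer view = buffer.duplicate();
        view.position(offset);
        view.get(dst, dstOffset, length);
    }

    @Override public void putLong(int offset, long value) { buffer.putLong(offset, value); }

    @Override public long getLong(int offset) { return buffer.getLong(offset); }
}
```

A direct ByteBuffer keeps the bytes outside the JVM heap, so the collector never has to trace the data stored in the block; the same interface works either way.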
So how does this approach get implemented and then integrated into the existing codebase? Primarily through memory-managed implementations of various collections, including a simple array, a sorted array, an optimized sorted array, a hash table, and an aggregator.
Each collection is implemented using the memory block scheme described in this post. Each collection has a main chain of blocks to which data is written. An element may be spread across blocks: it can start in one block and finish in another, so elements of any length can be written.
Elements are written sequentially to the chain of memory blocks. When there are no free blocks left to write to, the used blocks are spilled to disk and the memory blocks are reused.
The data layout of a simple array:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
A file on disk with previously spilled elements:
----------------------------------------
|length_1|data_1|length_2|data_2 .... |
----------------------------------------
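Putting the pieces together, here is a hedged sketch of how appending to such a simple array might work, reusing the MemoryBlock sketch from above. The block size, spill-file handling, and class names are our own assumptions:

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch of the "simple array" collection: length-prefixed elements are
// appended to a chain of memory blocks and spilled to disk when the chain is full.
final class SimpleByteArray {
    private static final int BLOCK_SIZE = 64 * 1024;
    private static final int MAX_BLOCKS = 16;

    private final List<MemoryBlock> chain = new ArrayList<>();
    private final Deque<MemoryBlock> free = new ArrayDeque<>();
    private final DataOutputStream spillFile;
    private int writeOffset = 0;            // offset inside the current (last) block

    SimpleByteArray(String spillPath) throws IOException {
        this.spillFile = new DataOutputStream(new FileOutputStream(spillPath));
        for (int i = 0; i < MAX_BLOCKS; i++) {
            free.push(ByteBufferMemoryBlock.onHeap(BLOCK_SIZE));
        }
        chain.add(free.pop());
    }

    // Appends one element; an element may span block boundaries.
    void add(byte[] element) throws IOException {
        byte[] record = withLengthPrefix(element);
        int written = 0;
        while (written < record.length) {
            MemoryBlock current = chain.get(chain.size() - 1);
            int room = current.capacity() - writeOffset;
            if (room == 0) {
                if (free.isEmpty()) {
                    spill();                // no blocks left: move everything to disk
                }
                chain.add(free.pop());
                writeOffset = 0;
                continue;
            }
            int n = Math.min(room, record.length - written);
            current.putBytes(writeOffset, record, written, n);
            writeOffset += n;
            written += n;
        }
    }

    // Writes the used blocks to the spill file and returns them to the free list.
    private void spill() throws IOException {
        byte[] tmp = new byte[BLOCK_SIZE];
        for (int i = 0; i < chain.size(); i++) {
            MemoryBlock block = chain.get(i);
            int used = (i == chain.size() - 1) ? writeOffset : block.capacity();
            block.getBytes(0, tmp, 0, used);
            spillFile.write(tmp, 0, used);
            free.push(block);
        }
        chain.clear();
        writeOffset = 0;
    }

    // Prepends a 4-byte big-endian length to the element's serialized bytes.
    private static byte[] withLengthPrefix(byte[] element) {
        byte[] record = new byte[4 + element.length];
        record[0] = (byte) (element.length >>> 24);
        record[1] = (byte) (element.length >>> 16);
        record[2] = (byte) (element.length >>> 8);
        record[3] = (byte) element.length;
        System.arraycopy(element, 0, record, 4, element.length);
        return record;
    }
}
```

Because every record is length-prefixed, the spilled bytes already have the |length|data|length|data| format shown above, even when an element spans two blocks.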
In the sorted array, all elements are written to the chain of memory blocks of the data layout, and the index and offset of each element in the data layout are written to the blocks of the address layout.
When there are no more blocks left to write a pointer into the address layout, the <index, offset> pairs are sorted with a binary comparator and, using merge sort, the data from the data layout is written to disk in sorted order. On iteration, the in-memory data is sorted via the address layout and emitted as output by merge sorting it with the spilled runs. If search is required, we build a solid sorted index in memory or on disk; using binary search, elements can be found in the collection with O(log(N)) time complexity.
The address layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|index1 |offset1| |index5 | offset5| |index9 |offset9 |
----------------- ------------------ ------------------
|index2 |offset2| |index6 | offset6| .... |index10|offset10|
----------------- ------------------ ------------------
|index3 |offset3| |index7 | offset7| |index11|offset11|
----------------- ------------------ ------------------
|index4 |offset4| |index8 | offset8| |index12|offset12|
────────────────- ────────────────-- ──────────────────
The data layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
On disk with previously spilled bytes in sorted order:
----------------------------------------
|length_1|data_1|length_2|data_2 .... |
----------------------------------------
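The essential trick of the sorted array is that sorting and searching shuffle only the small, fixed-size address entries; the variable-length data never moves. A hedged sketch, with types and names that are ours rather than Stardog's:

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch of the sorted array's core idea: sort and search over address
// entries, dereferencing into the data layout only to compare serialized bytes.
final class SortedAddressLayout {

    // One address entry: where an element starts in the data layout.
    static final class Address {
        final int blockIndex;
        final int offset;

        Address(int blockIndex, int offset) {
            this.blockIndex = blockIndex;
            this.offset = offset;
        }
    }

    // Reads the serialized bytes of the element starting at the given address.
    interface DataLayout {
        byte[] read(Address address);
    }

    // Sorts address entries with a "binary" comparator that compares raw bytes.
    static Address[] sort(Address[] addresses, DataLayout data) {
        Comparator<Address> byBytes =
                Comparator.comparing(data::read, Arrays::compareUnsigned);
        Address[] sorted = addresses.clone();
        Arrays.sort(sorted, byBytes);
        return sorted;
    }

    // Binary search over the sorted entries: O(log N) byte-wise comparisons.
    static int indexOf(Address[] sorted, DataLayout data, byte[] key) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = Arrays.compareUnsigned(data.read(sorted[mid]), key);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -1;
    }
}
```

The indexOf method is the O(log(N)) binary search mentioned above; each probe costs one dereference into the data layout.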
When values can be compared using only a single long value, we write the corresponding long value of each element into the address layout, which gives a significant performance improvement because of good cache locality.
Address layout:
__________________________ __________________________
| MemoryBlock1 | | MemoryBlockN |
|------------------------| |------------------------|
|index1 |offset1| long_1 | | |
|------------------------| | |
|index2 |offset2| long_2 | ... | .... |
|------------------------| | |
|index3 |offset3| long_3 | | |
|------------------------| |------------------------|
|index4 |offset4| long_4 | |indexK | offsetK |long_K|
|────────────────--------| |────────────────--------|
And the data layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
On disk with previously spilled bytes in the sorted order:
----------------------------------------
|length_1|data_1|length_2|data_2 .... |
----------------------------------------
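A sketch of that optimization: the sort key is duplicated into the address entry, so sorting reads only the address blocks and never touches the data layout. Again, the names are illustrative:

```java
import java.util.Arrays;

// Illustrative sketch: when elements can be ordered by a single long, that long is
// stored next to (index, offset) in the address layout, so comparisons stay inside
// the cache-friendly address blocks.
final class LongKeyedAddressLayout {

    // One entry of the address layout: pointer into the data layout plus the sort key.
    static final class Entry {
        final int blockIndex;
        final int offset;
        final long sortKey;

        Entry(int blockIndex, int offset, long sortKey) {
            this.blockIndex = blockIndex;
            this.offset = offset;
            this.sortKey = sortKey;
        }
    }

    // Comparisons touch only the entries themselves, never the serialized data.
    static void sort(Entry[] entries) {
        Arrays.sort(entries, (a, b) -> Long.compare(a.sortKey, b.sortKey));
    }
}
```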
In the hash table, all elements are written to the chain of memory blocks. We calculate the partition hash code of an element to determine its partition, and then insert the element into the open addressing table of the partition's current memory block.
If an element with the same key already exists, the new element is linked to the tail of the previously written elements, so all duplicates are stored as a linked list in the data layout. If only unique keys are required (as for HashSet keys), only one element is stored in the address block. When we look up by key, the partition is calculated first; then the partition's memory block, plus the index of the data spilled to disk, are used for the probe.
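A hedged sketch of one partition's open addressing table. For brevity it collapses the hash code and partition hash code shown in the diagrams below into a single hash, and it stores the data-layout address as one long; the real slot layout in Stardog differs:

```java
import java.util.function.LongPredicate;

// Illustrative sketch of a per-partition open addressing table. Each slot stores the
// element's hash code and the address of its serialized bytes in the data layout.
final class OpenAddressingPartition {
    private final long[] addresses;    // data-layout address per slot; 0 means "empty"
    private final int[] hashCodes;
    private int used;

    OpenAddressingPartition(int slots) {
        this.addresses = new long[slots];
        this.hashCodes = new int[slots];
    }

    // The caller checks this and resizes before the partition fills up,
    // so the probes below always terminate.
    boolean hasFreeSlot() {
        return used < addresses.length;
    }

    // Inserts an element; `sameKey` checks whether the key behind a stored address equals
    // the key being inserted. Returns the address of the previous duplicate (so the caller
    // can chain duplicates as a linked list in the data layout), or 0 if the key is new.
    long put(int hashCode, long address, LongPredicate sameKey) {
        int slot = Math.floorMod(hashCode, addresses.length);
        while (addresses[slot] != 0) {
            if (hashCodes[slot] == hashCode && sameKey.test(addresses[slot])) {
                long previous = addresses[slot];
                addresses[slot] = address;     // new element becomes the head of the chain
                return previous;
            }
            slot = (slot + 1) % addresses.length;   // linear probing
        }
        addresses[slot] = address;
        hashCodes[slot] = hashCode;
        used++;
        return 0;
    }

    // Probes for a key: returns the stored address, or 0 if the key is absent.
    long get(int hashCode, LongPredicate sameKey) {
        int slot = Math.floorMod(hashCode, addresses.length);
        while (addresses[slot] != 0) {
            if (hashCodes[slot] == hashCode && sameKey.test(addresses[slot])) {
                return addresses[slot];
            }
            slot = (slot + 1) % addresses.length;
        }
        return 0;
    }

    // Visits every occupied slot; used by the resize sketch further below.
    void forEachSlot(SlotConsumer consumer) {
        for (int i = 0; i < addresses.length; i++) {
            if (addresses[i] != 0) {
                consumer.accept(hashCodes[i], addresses[i]);
            }
        }
    }

    interface SlotConsumer {
        void accept(int hashCode, long address);
    }
}
```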
Initially, the number of partitions is either one or is calculated from an estimated number of distinct keys passed in up front. If the free slots are exhausted during insertion, a resize is started: it iterates over the partitions and copies the slot data into a new set of partition memory blocks, twice as many as before. The number of partitions is therefore dynamic and can grow.
Because of good cache locality, the resize procedure performs quite well.
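The resize step can then be sketched as follows, assuming the partition sketch above. Only the small fixed-size slots are copied; the serialized elements stay where they are in the data layout, which is what keeps the resize cheap:

```java
// Illustrative sketch of growing the table: when free slots run out, the number of
// partitions is doubled and every occupied slot is rehashed into the new partitions.
final class PartitionResizer {

    static OpenAddressingPartition[] doublePartitions(
            OpenAddressingPartition[] old, int slotsPerPartition) {
        OpenAddressingPartition[] grown = new OpenAddressingPartition[old.length * 2];
        for (int i = 0; i < grown.length; i++) {
            grown[i] = new OpenAddressingPartition(slotsPerPartition);
        }
        for (OpenAddressingPartition partition : old) {
            partition.forEachSlot((hashCode, address) -> {
                int target = Math.floorMod(hashCode, grown.length);
                // keys are already de-duplicated within the old table, so the key
                // equality check can be a no-op during redistribution
                grown[target].put(hashCode, address, storedAddress -> false);
            });
        }
        return grown;
    }
}
```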
Address layout with open addressing tables in each block:
____________________________________________________________
| Partition1 |
------------------------------------------------------------
|address1 (long)|hashCode1 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address2 (long)|hashCode2 (int)|partitionHashCode2 (int) |
------------------------------------------------------------
|address3 (long)|hashCode3 (int)|partitionHashCode3 (int) |
------------------------------------------------------------
|address4 (long)|hashCode4 (int)|partitionHashCode4 (int) |
────────────────────────────────────────────────────────────
| ......................... |
────────────────────────────────────────────────────────────
...
____________________________________________________________
| PartitionK |
------------------------------------------------------------
|address1 (long)|hashCode1 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address2 (long)|hashCode2 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address3 (long)|hashCode3 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address4 (long)|hashCode4 (int)|partitionHashCode1 (int) |
────────────────────────────────────────────────────────────
| ......................... |
────────────────────────────────────────────────────────────
Data layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
On disk with the spilled hash tables:
Index with data sorted by hash codes in the following format:
-------------------------------------------------------------------------------------
|segment_address (long)|hash_code (int)| partition_hashcode(int)|records_count(long)|
-------------------------------------------------------------------------------------
File on the disk with previously spilled elements:
----------------------------------------
|length_1|data_1|length_2|data_2 .... |
----------------------------------------
The aggregator's structure is very similar to the HashTable's. The difference occurs during spilling to disk: data is written to the file in a pre-aggregated format. Additionally, we support a Functor for the aggregator.
For example, when we need to calculate a function over all elements in a group, we don't write the elements as a sequence. Rather, we calculate the function's value directly in the memory block. That allows us to reduce memory consumption in addition to writing pre-calculated function values to disk.
Further, we support a second level of aggregation, where the results of the first level can be aggregated using second-level aggregation strategies.
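A minimal sketch of what such a functor might look like for numeric aggregates; the interface below is our illustration of the idea, not Stardog's actual Functor:

```java
// Illustrative sketch of in-place aggregation: instead of appending every element of a
// group to the data layout, the group's slot keeps a running value that the functor
// updates, so only one value per group is held in memory and spilled to disk.
interface AggregationFunctor {
    long initial();                                  // starting value for a new group
    long apply(long accumulated, long element);      // fold one element into the group
}

final class CountFunctor implements AggregationFunctor {
    public long initial() { return 0L; }
    public long apply(long accumulated, long element) { return accumulated + 1; }
}

final class SumFunctor implements AggregationFunctor {
    public long initial() { return 0L; }
    public long apply(long accumulated, long element) { return accumulated + element; }
}

// Usage: when a key hashes to an existing slot, the aggregator replaces
//   slotValue = functor.apply(slotValue, newElement)
// inside the memory block rather than chaining the new element in the data layout.
```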
Address layout with open addressing tables in each block:
____________________________________________________________
| Partition_1 |
------------------------------------------------------------
|address1 (long)|hashCode1 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address2 (long)|hashCode2 (int)|partitionHashCode2 (int) |
------------------------------------------------------------
|address3 (long)|hashCode3 (int)|partitionHashCode3 (int) |
------------------------------------------------------------
|address4 (long)|hashCode4 (int)|partitionHashCode4 (int) |
────────────────────────────────────────────────────────────
| ......................... |
────────────────────────────────────────────────────────────
...
____________________________________________________________
| Partition_K |
------------------------------------------------------------
|address1 (long)|hashCode1 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address2 (long)|hashCode2 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address3 (long)|hashCode3 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address4 (long)|hashCode4 (int)|partitionHashCode1 (int) |
────────────────────────────────────────────────────────────
| ......................... |
────────────────────────────────────────────────────────────
Data layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
On the disk values are stored in the pre-aggregated format:
---------------------------------------------------------------------------------------
|hash_code_1|partition_hash_code_1|elements_count_1| length_1|data_1|length_2|data_2 |
---------------------------------------------------------------------------------------
|hash_code_2|partition_hash_code_1|elements_count_2| length_3|data_3|length_4|data_4| |
---------------------------------------------------------------------------------------
...
---------------------------------------------------------------------------------------
|hash_code_k|partition_hash_code_k|elements_count_k|length_(L-1)|data_(L-1)             |
---------------------------------------------------------------------------------------
One of the strongest factors affecting the performance of the memory-managed collections is how compactly the data inside the collection is stored. For example, suppose we need to store data in a Set-based collection where each element is represented as a single 8-byte (long) value.
In this case there is no need to split storage into address blocks and data blocks, and no need to store hash codes or element lengths.
All 8-byte (long) values are written directly to the partition's memory blocks, and each block is a table using the open addressing storage approach.
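A hedged sketch of such a long-only block: the raw values are the slots, with no lengths, hash codes, or separate data layout. The empty-slot sentinel and the hashing are our choices for the sketch:

```java
import java.util.Arrays;

// Illustrative sketch of the long-only set: each slot costs exactly 8 bytes.
final class LongOpenAddressingBlock {
    private static final long EMPTY = Long.MIN_VALUE;   // assumes this value is never stored

    private final long[] slots;
    private int used;

    LongOpenAddressingBlock(int capacity) {
        slots = new long[capacity];
        Arrays.fill(slots, EMPTY);
    }

    // Adds a value; returns false if it was already present. As with the hash table,
    // the caller resizes before the block fills up, so probing terminates.
    boolean add(long value) {
        int slot = Math.floorMod(Long.hashCode(value), slots.length);
        while (slots[slot] != EMPTY) {
            if (slots[slot] == value) {
                return false;
            }
            slot = (slot + 1) % slots.length;   // linear probing
        }
        slots[slot] = value;                    // 8 bytes per element, nothing else stored
        used++;
        return true;
    }

    boolean contains(long value) {
        int slot = Math.floorMod(Long.hashCode(value), slots.length);
        while (slots[slot] != EMPTY) {
            if (slots[slot] == value) {
                return true;
            }
            slot = (slot + 1) % slots.length;
        }
        return false;
    }
}
```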
Address layout with open addressing tables in each block:
_______________________
| Partition_1 |
-----------------------
|long_value_1 (long)| |
-----------------------
|long_value_2 (long)| |
-----------------------
|long_value_3 (long)| |
-----------------------
|long_value_4 (long)| |
───────────────────────
|.....................|
───────────────────────
...
_______________________
| Partition_K |
-----------------------
|long_value_1 (long)| |
-----------------------
|long_value_2 (long)| |
-----------------------
|long_value_3 (long)| |
-----------------------
|long_value_4 (long)| |
───────────────────────
|.....................|
───────────────────────
This format provides excellent cache locality and a very compact representation of the stored data.
We’ve described the new approach in Stardog to memory management, which has several advantages over the conventional GC approach in the JVM. In particular the byte-based approach gives us greater control over memory management in Stardog, which will lead to better performance, improved stability, and a more predictable system at scale.
In next week’s post we will look at the performance overhead of this approach, especially in Java Collections with small, medium, and large numbers of elements.
Download Stardog today to start your free 30-day evaluation.