Saying Goodbye to Garbage
Memory management is the main problem for any heavily loaded Java application that operates on huge amounts of data. We solve this problem with a new, byte-based memory management scheme that will debut in the upcoming Stardog 5 release.
The new scheme will make Stardog more efficient with respect to memory, more stable under very heavy workloads, and in some cases more performant. The first use of the new scheme will be in query evaluation, the part of Stardog most vulnerable to memory pressure. Over the 5.x release cycle we will integrate the new approach into other parts of the system.
In this post we describe that scheme and its advantages. In next week’s post we will look at some micro-benchmarks to get an idea about the overhead of the new approach.
The JVM's garbage collector has two fundamental problems: heavy allocation puts constant pressure on the GC and causes unpredictable pauses, and an OutOfMemoryException can be thrown if there is insufficient memory. How do we solve this without rewriting Stardog in C++?
Consider the main problem with the standard Java approach. Using plain Java objects in a heavily loaded application, we allocate memory for an object, use that memory, and then release the strong reference to the object. If the number of such objects is large, this inevitably puts enormous pressure on the GC, and we cannot control memory overflow.
To prevent this problem we don't use Java objects to store data. Instead we use a byte-based representation of the data: we operate directly on serialized data and control memory allocation and memory use directly, byte by byte. If there is no more memory, we can always use the disk. Because we control every byte used for our data, we can prevent OutOfMemoryException.
We were inspired by the Apache Flink approach as described in Juggling with Bits and Bytes. The same technique is used in Apache Drill, Apache Ignite (incubating), Apache Geode (incubating), and Apache Spark (Project Tungsten) to name a few.
Here are some design basics and constraints we've implemented:

- Data is stored in memory blocks, each of which implements the MemoryBlock interface and can live on or off the JVM heap.
- An on-heap block is backed by a plain byte[]-array, so data is kept in serialized form rather than as graphs of Java objects.

The main advantages of this approach: memory use is fully under our control, GC pressure drops because we allocate a small number of large blocks instead of many small objects, and data can be spilled to disk instead of failing with OutOfMemoryException. We think it's obvious how each of these benefits will improve Stardog.
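To make the block abstraction concrete, here is a minimal sketch of what a MemoryBlock might look like in Java. The method names and the ByteBuffer-based implementation are ours for illustration, not Stardog's actual API:

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; the names below are not Stardog's actual API.
// A MemoryBlock is a fixed-size chunk of memory addressed by byte offset.
interface MemoryBlock {
    int capacity();
    void putBytes(int offset, byte[] src, int srcOffset, int length);
    void getBytes(int offset, byte[] dst, int dstOffset, int length);
    void putLong(int offset, long value);
    long getLong(int offset);
}

// Backed by a ByteBuffer: ByteBuffer.allocate(..) gives an on-heap block wrapping a
// byte[] array, while ByteBuffer.allocateDirect(..) gives an off-heap block.
final class ByteBufferMemoryBlock implements MemoryBlock {
    private final ByteBuffer buffer;

    private ByteBufferMemoryBlock(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    static ByteBufferMemoryBlock onHeap(int capacity) {
        return new ByteBufferMemoryBlock(ByteBuffer.allocate(capacity));
    }

    static ByteBufferMemoryBlock offHeap(int capacity) {
        return new ByteBufferMemoryBlock(ByteBuffer.allocateDirect(capacity));
    }

    @Override public int capacity() { return buffer.capacity(); }

    @Override public void putBytes(int offset, byte[] src, int srcOffset, int length) {
        ByteBuffer view = buffer.duplicate();   // avoid mutating the shared position
        view.position(offset);
        view.put(src, srcOffset, length);
    }

    @Override public void getBytes(int offset, byte[] dst, int dstOffset, int length) {
        ByteBuffer view = buffer.duplicate();
        view.position(offset);
        view.get(dst, dstOffset, length);
    }

    @Override public void putLong(int offset, long value) { buffer.putLong(offset, value); }

    @Override public long getLong(int offset) { return buffer.getLong(offset); }
}
```

A direct ByteBuffer keeps the bytes outside the JVM heap, so the collector never has to trace the data stored in the block; the same interface works either way.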
So how does this approach get implemented and then integrated into the existing codebase? Primarily through memory-managed implementations of various collections, including a simple array, a sorted array, an optimized sorted array, a hash table, and an aggregator.
Each collection is implemented using the memory block scheme described in this post. Each collection has a main chain of blocks to which data is written. An element may be spread across blocks: it can start in one block and finish in another, so elements of any length can be written.
Elements are written sequentially to the chain of memory blocks. When there are no free blocks left to write to, the used blocks are spilled to disk and the memory blocks are reused.
The data layout of a simple array:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
A file on disk with previously spilled elements:
----------------------------------------
|length_1|data_1|length_2|data_2 .... |
----------------------------------------
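Putting the pieces together, here is a hedged sketch of how appending to such a simple array might work, reusing the MemoryBlock sketch from above. The block size, spill-file handling, and class names are our own assumptions:

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch of the "simple array" collection: length-prefixed elements are
// appended to a chain of memory blocks and spilled to disk when the chain is full.
final class SimpleByteArray {
    private static final int BLOCK_SIZE = 64 * 1024;
    private static final int MAX_BLOCKS = 16;

    private final List<MemoryBlock> chain = new ArrayList<>();
    private final Deque<MemoryBlock> free = new ArrayDeque<>();
    private final DataOutputStream spillFile;
    private int writeOffset = 0;            // offset inside the current (last) block

    SimpleByteArray(String spillPath) throws IOException {
        this.spillFile = new DataOutputStream(new FileOutputStream(spillPath));
        for (int i = 0; i < MAX_BLOCKS; i++) {
            free.push(ByteBufferMemoryBlock.onHeap(BLOCK_SIZE));
        }
        chain.add(free.pop());
    }

    // Appends one element; an element may span block boundaries.
    void add(byte[] element) throws IOException {
        byte[] record = withLengthPrefix(element);
        int written = 0;
        while (written < record.length) {
            MemoryBlock current = chain.get(chain.size() - 1);
            int room = current.capacity() - writeOffset;
            if (room == 0) {
                if (free.isEmpty()) {
                    spill();                // no blocks left: move everything to disk
                }
                chain.add(free.pop());
                writeOffset = 0;
                continue;
            }
            int n = Math.min(room, record.length - written);
            current.putBytes(writeOffset, record, written, n);
            writeOffset += n;
            written += n;
        }
    }

    // Writes the used blocks to the spill file and returns them to the free list.
    private void spill() throws IOException {
        byte[] tmp = new byte[BLOCK_SIZE];
        for (int i = 0; i < chain.size(); i++) {
            MemoryBlock block = chain.get(i);
            int used = (i == chain.size() - 1) ? writeOffset : block.capacity();
            block.getBytes(0, tmp, 0, used);
            spillFile.write(tmp, 0, used);
            free.push(block);
        }
        chain.clear();
        writeOffset = 0;
    }

    // Prepends a 4-byte big-endian length to the element's serialized bytes.
    private static byte[] withLengthPrefix(byte[] element) {
        byte[] record = new byte[4 + element.length];
        record[0] = (byte) (element.length >>> 24);
        record[1] = (byte) (element.length >>> 16);
        record[2] = (byte) (element.length >>> 8);
        record[3] = (byte) element.length;
        System.arraycopy(element, 0, record, 4, element.length);
        return record;
    }
}
```

Because every record is length-prefixed, the spilled bytes already have the |length|data|length|data| format shown above, even when an element spans two blocks.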
In the sorted array, all elements are written to the chain of memory blocks of the data layout, and the index and offset of each element in the data layout are written to the blocks of the address layout.
When there are no more blocks left to write a pointer into the address layout, the <index, offset> pairs are sorted with a binary comparator and, using merge sort, the data from the data layout is written to disk in sorted order. On iteration, the in-memory data is sorted via the address layout and emitted as output by merge sorting it with the spilled runs. If search is required, we build a solid sorted index in memory or on disk; using binary search, elements can be found in the collection with O(log(N)) time complexity.
The address layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|index1 |offset1| |index5 | offset5| |index9 |offset9 |
----------------- ------------------ ------------------
|index2 |offset2| |index6 | offset6| .... |index10|offset10|
----------------- ------------------ ------------------
|index3 |offset3| |index7 | offset7| |index11|offset11|
----------------- ------------------ ------------------
|index4 |offset4| |index8 | offset8| |index12|offset12|
────────────────- ────────────────-- ──────────────────
The data layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
On disk with previously spilled bytes in sorted order:
----------------------------------------
|length_1|data_1|length_2|data_2 .... |
----------------------------------------
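The essential trick of the sorted array is that sorting and searching shuffle only the small, fixed-size address entries; the variable-length data never moves. A hedged sketch, with types and names that are ours rather than Stardog's:

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch of the sorted array's core idea: sort and search over address
// entries, dereferencing into the data layout only to compare serialized bytes.
final class SortedAddressLayout {

    // One address entry: where an element starts in the data layout.
    static final class Address {
        final int blockIndex;
        final int offset;

        Address(int blockIndex, int offset) {
            this.blockIndex = blockIndex;
            this.offset = offset;
        }
    }

    // Reads the serialized bytes of the element starting at the given address.
    interface DataLayout {
        byte[] read(Address address);
    }

    // Sorts address entries with a "binary" comparator that compares raw bytes.
    static Address[] sort(Address[] addresses, DataLayout data) {
        Comparator<Address> byBytes =
                Comparator.comparing(data::read, Arrays::compareUnsigned);
        Address[] sorted = addresses.clone();
        Arrays.sort(sorted, byBytes);
        return sorted;
    }

    // Binary search over the sorted entries: O(log N) byte-wise comparisons.
    static int indexOf(Address[] sorted, DataLayout data, byte[] key) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = Arrays.compareUnsigned(data.read(sorted[mid]), key);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -1;
    }
}
```

The indexOf method is the O(log(N)) binary search mentioned above; each probe costs one dereference into the data layout.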
When values can be compared using only a single long value, we write the corresponding long value of each element into the address layout, which gives a significant performance improvement because of good cache locality.
Address layout:
__________________________ __________________________
| MemoryBlock1 | | MemoryBlockN |
|------------------------| |------------------------|
|index1 |offset1| long_1 | | |
|------------------------| | |
|index2 |offset2| long_2 | ... | .... |
|------------------------| | |
|index3 |offset3| long_3 | | |
|------------------------| |------------------------|
|index4 |offset4| long_4 | |indexK | offsetK |long_K|
|────────────────--------| |────────────────--------|
And the data layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
On disk with previously spilled bytes in the sorted order:
----------------------------------------
|length_1|data_1|length_2|data_2 .... |
----------------------------------------
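A sketch of that optimization: the sort key is duplicated into the address entry, so sorting reads only the address blocks and never touches the data layout. Again, the names are illustrative:

```java
import java.util.Arrays;

// Illustrative sketch: when elements can be ordered by a single long, that long is
// stored next to (index, offset) in the address layout, so comparisons stay inside
// the cache-friendly address blocks.
final class LongKeyedAddressLayout {

    // One entry of the address layout: pointer into the data layout plus the sort key.
    static final class Entry {
        final int blockIndex;
        final int offset;
        final long sortKey;

        Entry(int blockIndex, int offset, long sortKey) {
            this.blockIndex = blockIndex;
            this.offset = offset;
            this.sortKey = sortKey;
        }
    }

    // Comparisons touch only the entries themselves, never the serialized data.
    static void sort(Entry[] entries) {
        Arrays.sort(entries, (a, b) -> Long.compare(a.sortKey, b.sortKey));
    }
}
```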
In the hash table, all elements are written to the chain of memory blocks. We calculate the partition hash code of an element to determine its partition, and then insert the element into the open addressing table of the partition's current memory block.
If an element with the same key already exists, the new element is linked to the tail of the previously written elements, so all duplicates are stored as a linked list in the data layout. If only unique keys are required (as for HashSet keys), only one element is stored in the address block. When we look up by key, the partition is calculated first; then the partition's memory block, plus the index of the data spilled to disk, are used for the probe.
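A hedged sketch of one partition's open addressing table. For brevity it collapses the hash code and partition hash code shown in the diagrams below into a single hash, and it stores the data-layout address as one long; the real slot layout in Stardog differs:

```java
import java.util.function.LongPredicate;

// Illustrative sketch of a per-partition open addressing table. Each slot stores the
// element's hash code and the address of its serialized bytes in the data layout.
final class OpenAddressingPartition {
    private final long[] addresses;    // data-layout address per slot; 0 means "empty"
    private final int[] hashCodes;
    private int used;

    OpenAddressingPartition(int slots) {
        this.addresses = new long[slots];
        this.hashCodes = new int[slots];
    }

    // The caller checks this and resizes before the partition fills up,
    // so the probes below always terminate.
    boolean hasFreeSlot() {
        return used < addresses.length;
    }

    // Inserts an element; `sameKey` checks whether the key behind a stored address equals
    // the key being inserted. Returns the address of the previous duplicate (so the caller
    // can chain duplicates as a linked list in the data layout), or 0 if the key is new.
    long put(int hashCode, long address, LongPredicate sameKey) {
        int slot = Math.floorMod(hashCode, addresses.length);
        while (addresses[slot] != 0) {
            if (hashCodes[slot] == hashCode && sameKey.test(addresses[slot])) {
                long previous = addresses[slot];
                addresses[slot] = address;     // new element becomes the head of the chain
                return previous;
            }
            slot = (slot + 1) % addresses.length;   // linear probing
        }
        addresses[slot] = address;
        hashCodes[slot] = hashCode;
        used++;
        return 0;
    }

    // Probes for a key: returns the stored address, or 0 if the key is absent.
    long get(int hashCode, LongPredicate sameKey) {
        int slot = Math.floorMod(hashCode, addresses.length);
        while (addresses[slot] != 0) {
            if (hashCodes[slot] == hashCode && sameKey.test(addresses[slot])) {
                return addresses[slot];
            }
            slot = (slot + 1) % addresses.length;
        }
        return 0;
    }

    // Visits every occupied slot; used by the resize sketch further below.
    void forEachSlot(SlotConsumer consumer) {
        for (int i = 0; i < addresses.length; i++) {
            if (addresses[i] != 0) {
                consumer.accept(hashCodes[i], addresses[i]);
            }
        }
    }

    interface SlotConsumer {
        void accept(int hashCode, long address);
    }
}
```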
Initially, the number of partitions is either one or is calculated from an estimated number of distinct keys passed in up front. If the free slots are exhausted during insertion, a resize is started: it iterates over the partitions and copies the slot data into a new set of partition memory blocks, twice as many as before. The number of partitions is therefore dynamic and can grow.
Because of good cache locality, the resize procedure performs quite well.
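The resize step can then be sketched as follows, assuming the partition sketch above. Only the small fixed-size slots are copied; the serialized elements stay where they are in the data layout, which is what keeps the resize cheap:

```java
// Illustrative sketch of growing the table: when free slots run out, the number of
// partitions is doubled and every occupied slot is rehashed into the new partitions.
final class PartitionResizer {

    static OpenAddressingPartition[] doublePartitions(
            OpenAddressingPartition[] old, int slotsPerPartition) {
        OpenAddressingPartition[] grown = new OpenAddressingPartition[old.length * 2];
        for (int i = 0; i < grown.length; i++) {
            grown[i] = new OpenAddressingPartition(slotsPerPartition);
        }
        for (OpenAddressingPartition partition : old) {
            partition.forEachSlot((hashCode, address) -> {
                int target = Math.floorMod(hashCode, grown.length);
                // keys are already de-duplicated within the old table, so the key
                // equality check can be a no-op during redistribution
                grown[target].put(hashCode, address, storedAddress -> false);
            });
        }
        return grown;
    }
}
```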
Address layout with open addressing tables in each block:
____________________________________________________________
| Partition1 |
------------------------------------------------------------
|address1 (long)|hashCode1 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address2 (long)|hashCode2 (int)|partitionHashCode2 (int) |
------------------------------------------------------------
|address3 (long)|hashCode3 (int)|partitionHashCode3 (int) |
------------------------------------------------------------
|address4 (long)|hashCode4 (int)|partitionHashCode4 (int) |
────────────────────────────────────────────────────────────
| ......................... |
────────────────────────────────────────────────────────────
...
____________________________________________________________
| PartitionK |
------------------------------------------------------------
|address1 (long)|hashCode1 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address2 (long)|hashCode2 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address3 (long)|hashCode3 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address4 (long)|hashCode4 (int)|partitionHashCode1 (int) |
────────────────────────────────────────────────────────────
| ......................... |
────────────────────────────────────────────────────────────
Data layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
On disk with the spilled hash tables:
Index with data sorted by hash codes in the following format:
-------------------------------------------------------------------------------------
|segment_address (long)|hash_code (int)| partition_hashcode(int)|records_count(long)|
-------------------------------------------------------------------------------------
File on the disk with previously spilled elements:
----------------------------------------
|length_1|data_1|length_2|data_2 .... |
----------------------------------------
The aggregator's structure is very similar to the HashTable's. The difference occurs during spilling to disk: data is written to the file in a pre-aggregated format. Additionally, we support a Functor for the aggregator.
For example, when we need to calculate a function over all elements in a group, we don't write the elements as a sequence. Rather, we calculate the function's value directly in the memory block. That allows us to reduce memory consumption in addition to writing pre-calculated function values to disk.
Further, we support a second level of aggregation, where the results of the first level can be aggregated using second-level aggregation strategies.
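A minimal sketch of what such a functor might look like for numeric aggregates; the interface below is our illustration of the idea, not Stardog's actual Functor:

```java
// Illustrative sketch of in-place aggregation: instead of appending every element of a
// group to the data layout, the group's slot keeps a running value that the functor
// updates, so only one value per group is held in memory and spilled to disk.
interface AggregationFunctor {
    long initial();                                  // starting value for a new group
    long apply(long accumulated, long element);      // fold one element into the group
}

final class CountFunctor implements AggregationFunctor {
    public long initial() { return 0L; }
    public long apply(long accumulated, long element) { return accumulated + 1; }
}

final class SumFunctor implements AggregationFunctor {
    public long initial() { return 0L; }
    public long apply(long accumulated, long element) { return accumulated + element; }
}

// Usage: when a key hashes to an existing slot, the aggregator replaces
//   slotValue = functor.apply(slotValue, newElement)
// inside the memory block rather than chaining the new element in the data layout.
```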
Address layout with open addressing tables in each block:
____________________________________________________________
| Partition_1 |
------------------------------------------------------------
|address1 (long)|hashCode1 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address2 (long)|hashCode2 (int)|partitionHashCode2 (int) |
------------------------------------------------------------
|address3 (long)|hashCode3 (int)|partitionHashCode3 (int) |
------------------------------------------------------------
|address4 (long)|hashCode4 (int)|partitionHashCode4 (int) |
────────────────────────────────────────────────────────────
| ......................... |
────────────────────────────────────────────────────────────
...
____________________________________________________________
| Partition_K |
------------------------------------------------------------
|address1 (long)|hashCode1 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address2 (long)|hashCode2 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address3 (long)|hashCode3 (int)|partitionHashCode1 (int) |
------------------------------------------------------------
|address4 (long)|hashCode4 (int)|partitionHashCode1 (int) |
────────────────────────────────────────────────────────────
| ......................... |
────────────────────────────────────────────────────────────
Data layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
On the disk values are stored in the pre-aggregated format:
---------------------------------------------------------------------------------------
|hash_code_1|partition_hash_code_1|elements_count_1| length_1|data_1|length_2|data_2 |
---------------------------------------------------------------------------------------
|hash_code_2|partition_hash_code_1|elements_count_2| length_3|data_3|length_4|data_4| |
---------------------------------------------------------------------------------------
...
---------------------------------------------------------------------------------------
|hash_code_k|partition_hash_code_k|elements_count_k|length_(L-1)|data_(L-1)             |
---------------------------------------------------------------------------------------
One of the strongest factors affecting the performance of the memory-managed collections is how compactly the data inside the collection is stored. For example, suppose we need to store data in a Set-based collection where each element is represented as a single 8-byte (long) value.
In this case there is no need to split storage into address blocks and data blocks, and no need to store hash codes or element lengths.
All 8-byte (long) values are written directly to the partition's memory blocks, and each block is a table using the open addressing storage approach.
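A hedged sketch of such a long-only block: the raw values are the slots, with no lengths, hash codes, or separate data layout. The empty-slot sentinel and the hashing are our choices for the sketch:

```java
import java.util.Arrays;

// Illustrative sketch of the long-only set: each slot costs exactly 8 bytes.
final class LongOpenAddressingBlock {
    private static final long EMPTY = Long.MIN_VALUE;   // assumes this value is never stored

    private final long[] slots;
    private int used;

    LongOpenAddressingBlock(int capacity) {
        slots = new long[capacity];
        Arrays.fill(slots, EMPTY);
    }

    // Adds a value; returns false if it was already present. As with the hash table,
    // the caller resizes before the block fills up, so probing terminates.
    boolean add(long value) {
        int slot = Math.floorMod(Long.hashCode(value), slots.length);
        while (slots[slot] != EMPTY) {
            if (slots[slot] == value) {
                return false;
            }
            slot = (slot + 1) % slots.length;   // linear probing
        }
        slots[slot] = value;                    // 8 bytes per element, nothing else stored
        used++;
        return true;
    }

    boolean contains(long value) {
        int slot = Math.floorMod(Long.hashCode(value), slots.length);
        while (slots[slot] != EMPTY) {
            if (slots[slot] == value) {
                return true;
            }
            slot = (slot + 1) % slots.length;
        }
        return false;
    }
}
```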
Address layout with open addressing tables in each block:
_______________________
| Partition_1 |
-----------------------
|long_value_1 (long)| |
-----------------------
|long_value_2 (long)| |
-----------------------
|long_value_3 (long)| |
-----------------------
|long_value_4 (long)| |
───────────────────────
|.....................|
───────────────────────
...
_______________________
| Partition_K |
-----------------------
|long_value_1 (long)| |
-----------------------
|long_value_2 (long)| |
-----------------------
|long_value_3 (long)| |
-----------------------
|long_value_4 (long)| |
───────────────────────
|.....................|
───────────────────────
This format provides excellent cache locality and a very compact representation of the stored data.
We’ve described the new approach in Stardog to memory management, which has several advantages over the conventional GC approach in the JVM. In particular the byte-based approach gives us greater control over memory management in Stardog, which will lead to better performance, improved stability, and a more predictable system at scale.
In next week’s post we will look at the performance overhead of this approach, especially in Java Collections with small, medium, and large numbers of elements.
Download Stardog today to start your free 30-day evaluation.