The upcoming Stardog 2.1 release improves query scalability by about 3 orders of magnitude and can handle 50 billion triples on a $10,000 server.
We’ve never been overly concerned about Stardog scalability per se: first we wanted to make it easy to use; then we wanted to make it fast. We just assumed we’d get around to make it insanely scalable, too. Stardog 2.1 makes giant leaps in query, data loading, and reasoning scalability.
Running on $10,000 of server hardware (32 cores, 256 GB RAM), Stardog 2.1 can handle 20 to 50 billion triples. Compared to Stardog 2.0.x, Stardog 2.1 loads ~100M triple datasets about two times faster; it loads ~1B triple datasets about 3 times faster—all while using much less memory. 2.1 can load 20B datasets at 300,000 triples per second. We also improved query evaluation performance significantly so that Stardog 2.1 is still very fast, even at much larger database sizes. How did we do it?
Much of the performance improvement comes from taking care with
concurrency and reducing thread contention, especially during bulk
data loading (i.e., initial database creation with large amounts of
data being added at creation time). We’re avoiding more locks and
using more non-blocking algorithms and data structures a lot more
often. Moving for example from
helps, even though the former is more space efficient than the
later. We’re also using
ThreadLocals more aggressively to reduce
thread contention and avoid synchronization. As loading performance
improved, several LRU caches became problematic since evictions were
being swamped by additions. Batching evictions in a single thread
helps but increases memory pressure and GC times.
Worse Hashing is Better Hashing
Stardog hashes URIs, bnodes, and literal values for storage in a
mapping dictionary; previously we used the 64-bit
it’s very fast, has low collision rate, and lets us store values as
longs. When handling collisions and cache misses, disk accesses are
required; at scale these random disk accesses are too
expensive. Moving to
SHA1 is perhaps non-intuitive since the hash
size goes from 64 to 160 bits. But since that makes hash collision
practically impossible, we’re able to achieve significant speedups—it
also simplifies the mapping dictionary significantly.
Off-Heap Memory Management
Before 2.1, we were using
mmap aggressively during loading in order to use the operating system’s VM, memory management, etc. But memory mapped files in JVM are notoriously crappy (
unmap!), and we’ve seen frequent JVM crashes when we had a lot of memory mapped files around. Not good.
We also realized that memory mapped files were causing performance slowdowns when there was more than 64GB of RAM available. And since we have no control over when memory mapped files were flushed to disk, that was typically happening too often. But keeping this information on the Java heap was never a viable choice at scale because of GC processing. Stardog 2.1 moves to an off-heap memory allocation scheme, which gives us very fine-grained control over disk flushes, uses available system memory more efficiently, and is (roughly) as fast as memory mapping.
Reducing GC Pauses
Finally, to significantly improve loading performance we needed to do something about GC costs, which are important in bulk loading because of high rate of object creation and destruction. Using immutable objects makes it possible to improve concurrency but with the overhead of additional garbage being created. Tuning the GC knobs and buttons didn’t really help appreciably. We’ve addressed the GC processing costs by reworking places where we were unnecessarily creating objects. That kind of careful software engineering is aided by the relatively small size of the Stardog code base. It’s like engineering works or something!
This careful software re-engineering included, for example, delving
into the guts of the RDF parser to use a single
was continuously reset rather than creating new builders for each RDF
value. We also reduced the amount of cache use on the heap to relieve
memory pressure. We now see GC pauses taking 1% or less of the overall
bulk loading time.
It doesn’t do much good to improve bulk data loading performance if you don’t also improve query evaluation performance at least as much. So we did that, too. The key to improving query evaluation performance in 2.1 came down to memory usage and how we handle intermediate results. Consider a SPARQL query from SP2B:
For a 5M triple database, this query produces 18,362,955 (!) results
DISTINCT keeps them in memory. Unworkable at scale. Stardog
2.1 addresses this issue by reusing the new off-heap memory allocation
scheme. A small heap doesn’t work; and GC processing kills you with a
large heap, so we just avoid the heap instead. Yes, that means we have
to manage memory manually, but that’s actually achievable in a JVM
application. We implemented a memory manager based on various JVM
internals; it’s responsible for allocating (and deallocating) data
used during query evaluation, including holding intermediate query
results. It manages the heap outside the JVM and will flow to disk
when needed. The new memory manager also performs some static analysis
of the query using database statistics to guide its behavior.
No public APIs were harmed in the creation of the new scalability improvements. See this slide deck for an overview of some other changes in Stardog 2.1.