Faster, More Scalable Stardog

Evren Sirin

May 23, 2019, 5 minute read

Note: We’re way beyond version 6.2 now! Check out our release blogs to see the latest and greatest.

Stardog 6.2 just shipped with scalable virtual graph caching, better Kubernetes integration, support for Amazon Redshift, and many new optimizations. Read on for the details.

In this post we’ll go over some of the new features in Stardog and how they can help you to manage and query your knowledge graph.

Distributed Caching

Stardog’s virtual graph capability allows users to map external data sources to Stardog’s graph model and run SPARQL queries over external sources in real time. In some use cases, these queries can now run up to 100% faster thanks to the new distributed caching that shipped in Stardog 6.2.

Querying a virtual graph involves, among other things, Stardog translating graph queries into the query language supported by the external system. Data virtualization opens up many possibilities for unifying data across silos, while avoiding the common pitfalls of ETL processes where data lineage and traceability can be hard to tackle.

One question that comes up frequently in the context of virtualization is the performance impact of querying external data sources. In most cases, the queries auto-generated by Stardog can be as efficient as hand-written queries against those data sources. But even then, network latencies, load on external databases, or graph-specific queries (e.g. shortest path queries) can increase query answering times over virtual graphs.

With Stardog 6.2, we are introducing a distributed cache feature that allows users to cache an entire graph, a virtual graph, or query results (currently experimental). The cache can be spread over multiple nodes in the Stardog cluster, allowing a scale-out approach to virtualization: you can cache multiple data sources that together won’t fit on a single cluster node.

In our experiments we see a 30% to 100% speed-up in queries when we simulate different levels of network latency for external data sources. Query answering over virtual graphs works seamlessly with the distributed cache, and queries do not need to be modified to take advantage of these speed-ups once caching is configured. You can also use this feature to cache data coming from external SPARQL endpoints. Federated SPARQL queries over those endpoints can use the distributed cache just like virtual graph queries.

Kubernetes

Stardog 6.2 also includes the first alpha release of our distributed Knowledge Graph platform. This release provides first-class support for running Stardog’s HA cluster and virtual graph caching in Kubernetes (K8s). Support for K8s is provided as Helm charts, which make it easy to deploy and test. Helm charts describe the services and applications to run in K8s and how they should be deployed, providing a single means for repeatable application deployment.

The Helm charts packaged with Stardog specify an initial set of defaults for the deployment; for example, that the deployment should launch 3 Stardog nodes and 3 ZooKeeper servers that run on different physical hosts. You can override the defaults of a deployment by setting different values when you install via Helm.

Amazon Redshift Support

Stardog supports many different relational and NoSQL data sources as virtual graphs, and as of version 6.2, Amazon Redshift is also supported. Amazon Redshift is a fast and scalable data warehouse that also supports loading data from S3. Arbitrary OLAP computations in Redshift can be part of the Knowledge Graph and contribute to graph-based queries in turn.

So Many Optimizations

Cardinality estimation, that is, estimating how many results a query (or part of a query) will return, is the most important step for query optimization and one of the hardest and most researched problems in the database literature. Stardog has state-of-the-art cardinality estimation capabilities thanks to the sophisticated statistics we collect over the graph. That said, misestimations are sometimes unavoidable for complex query patterns and may well lead to suboptimal query plans.

Stardog 6.2 includes a self-adjusting, auto-tuning cache for cardinality estimations whose accuracy improves as more queries are executed. During query execution we track how many actual results are returned by certain plan nodes and detect the cases where actual counts differ significantly from estimated counts. The misestimations in the cache are corrected, and subsequent queries can use the accurate estimations to generate better plans. Better plans mean faster queries.
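To make the feedback loop concrete, here is a minimal Python sketch of a self-correcting estimate cache. It is not Stardog's implementation: the class name, the string key standing in for a plan node, and the factor-of-two threshold for a "significant" difference are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CardinalityCache:
    """Toy self-correcting cache of cardinality estimates, keyed by a plan-node signature."""

    # Hypothetical threshold: correct an estimate that is off by more than this factor.
    error_factor: float = 2.0
    estimates: dict = field(default_factory=dict)

    def estimate(self, key: str, default: float) -> float:
        # Use a learned value if one exists, else the statistics-based default.
        return self.estimates.get(key, default)

    def observe(self, key: str, estimated: float, actual: int) -> bool:
        """Store the actual count if it differs significantly; return True if corrected."""
        # Clamp to 1 so that empty results do not cause a division by zero.
        low, high = sorted((max(estimated, 1.0), max(float(actual), 1.0)))
        if high / low > self.error_factor:
            self.estimates[key] = float(actual)
            return True
        return False


cache = CardinalityCache()
key = "?person :worksFor ?company"  # hypothetical plan-node signature

first = cache.estimate(key, default=100.0)           # initial statistics-based guess
corrected = cache.observe(key, first, actual=5000)   # execution reveals a misestimation
second = cache.estimate(key, default=100.0)          # later queries see the corrected value
```

In this sketch the first query plans with the default guess, while any later query that contains the same pattern plans with the observed count.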

Stardog 6.2 includes other optimizations, too.

Most notably, we improved path query evaluation to be faster while using less memory. Until now, path query evaluation in Stardog has been lazy: results are generated in a streaming fashion, with new paths computed as the client consumes previously found ones. This works fairly well for shortest paths but creates problems for queries that ask for all paths. We now have an eager evaluation mode that is significantly faster for such queries and can also account for the LIMIT or MAX LENGTH constraints defined in the query.
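The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of lazy versus eager enumeration over a small adjacency-list graph, not Stardog's path query engine; the function names, the graph, and the `max_length` and `limit` parameters (standing in for MAX LENGTH and LIMIT) are assumptions.

```python
from itertools import islice


def lazy_paths(graph, start, end, max_length):
    """Yield simple paths from start to end one at a time (streaming, depth-first)."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            yield path
            continue
        if len(path) - 1 >= max_length:
            continue  # path already has max_length edges, do not extend it
        for nxt in graph.get(node, ()):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                stack.append((nxt, path + [nxt]))


def eager_paths(graph, start, end, max_length, limit=None):
    """Materialize matching paths level by level, stopping once limit is reached."""
    results = []
    frontier = [[start]]
    for _ in range(max_length + 1):  # levels hold paths with 0..max_length edges
        next_frontier = []
        for path in frontier:
            if path[-1] == end:
                results.append(path)
                if limit is not None and len(results) >= limit:
                    return results
                continue
            for nxt in graph.get(path[-1], ()):
                if nxt not in path:
                    next_frontier.append(path + [nxt])
        frontier = next_frontier
    return results


graph = {"a": ["b", "c", "d"], "b": ["d"], "c": ["d"], "d": []}

all_lazy = sorted(lazy_paths(graph, "a", "d", max_length=2))
all_eager = sorted(eager_paths(graph, "a", "d", max_length=2))
first_lazy = list(islice(lazy_paths(graph, "a", "d", max_length=2), 1))
shortest_only = eager_paths(graph, "a", "d", max_length=2, limit=1)
```

The lazy generator does work only when the caller asks for the next path, which suits a client that stops early. The eager version computes the whole result set in one pass and uses the length bound and limit to cut the search short.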

The speed improvements you will see here depend on the size and connectivity of the graph as well as on your path query. For example, in one of our benchmarks, a path query looking for all paths of max length 4 between two nodes in a DBpedia database with 150M triples did not show any improvement with 6.2 (~850 ms execution time), but the same query with max length 5 was more than 10 times faster (260 seconds with 6.1 compared to 20 seconds with 6.2).