Virtual Graphs in Stardog 5

by Jess Balint, 9 May 2017 · 6 minute read

Stardog 5 features an all-new virtual graph engine with lots of improvements. This post describes them and how they can help you build better knowledge graphs with Stardog.

Courtesy of Alper Çuğun

Courtesy of Alper Çuğun

The new virtual graph engine—all-new and built from scratch for Stardog 5—was designed to overcome limitations in the previous version, as well as provide a platform for future growth. (See my previous post for a discussion of virtual graphs in Stardog.) While retaining the same interface, we’ve replaced the SQL rewriting backend, improved performance, and added features. I discuss some of these changes including their use cases; the optimizations are general and applicable to many use cases.

Rewrite Optimizations

Federated query execution is based on the idea that we can translate a portion of a SPARQL query into a SQL query for execution by a remote relational database engine. Given schema information, including primary and foreign keys, we can generate an efficient SQL query with a minimal number of joins.

Metadata such as types and nullability constraints allow optimizations with OPTIONAL patterns. “Pushing down” filters and aggregates reduces data transfer from the remote system, in many cases significantly. Let’s take a look at these optimizations and see how they work and why they’re beneficial.

Minimizing Joins

Query complexity increases with the number of joins in the query. Many algorithms that implement SPARQL to SQL translation introduce intermediate self-joins, which need to be pruned before execution to avoid unnecessary complexity.

An illustration will make this clear. Given a virtual graph query containing a basic graph pattern (BGP) with two triple patterns, we can imagine that the employee first and last names come from the same table, say, EMP:

graph <virtual://emp> {
  ?emp :firstName ?first ;
       :lastName  ?last
}

Naively translating this to SQL might introduce a query over the table for each predicate. In this case two predicates would query the table twice and require a join between them:

SELECT E1.EMP_NO, E1.FIRST, E2.LAST
FROM EMP AS E1 INNER JOIN EMP AS E2
ON E1.EMP_NO = E2.EMP_NO

Writing this query manually, we would know that the join is unnecessary and avoid it. If the field being joined on, EMP_NO, is a primary (or unique) key on the EMP table, then we can eliminate this self-join. The use of keys or constraints for this type of optimization is called semantic query optimization. We’re able to reduce the number of joins in this case, but it’s not always possible using this approach.

Courtesy of Ed Uthman

Courtesy of Ed Uthman

In addition to standard self-join pruning, we eagerly avoid this issue in Stardog 5 by mapping several triple patterns in the same BGP with the same subject to a single-table query. Which means we never introduce these types of redundant joins in any part of the query translation process. We still perform several types of semantic query optimization which may further reduce joins if generated as a byproduct of the mappings and query.

Optimizing OPTIONAL

SPARQL’s OPTIONAL, which is really a left join in disguise, is another opportunity for join reduction or avoidance. The concept of translating optionality is simple: if a field is nullable, we allow it to be unbound.

A field in a SQL result may be nullable for a number of reasons. The originating table’s field may be nullable or it may originate in a table used on the right side of a left join. The general approach to translating OPTIONAL to SQL is using a left join.

Again, an example makes this clear. Using the employee example again, let’s imagine the middle name is nullable in the EMP table. We can query it like so:

graph <virtual://emp> {
  ?emp :lastName  ?last
  OPTIONAL { ?emp :middleName ?middle }
}

We’re looking for employee’s last names and, optionally, their middle names. The general translation algorithm will create a left join:

SELECT E1.EMP_NO, E1.FIRST, E2.MIDDLE
FROM EMP AS E1 LEFT JOIN EMP AS E2
ON E1.EMP_NO = E2.EMP_NO

This looks suspiciously similar to the previous situation. We have a self-join, but this time it’s a left join in contrast to the inner join we had before. We don’t need semantic query optimization as before. We don’t even need a left join because there’s no FILTER or join in the OPTIONAL and the table and subject of the optional pattern is the same as one in the outer BGP. We can simply query over the EMP table and allow the MIDDLE field to be NULL. In cases where it is NULL, ?middle will remain unbound.

Pushing Aggregates

The last rewriting optimization we’ll discuss is pushing aggregation and subqueries down to the source system. Before Stardog 5, it was possible to express aggregation in virtual graph queries, but the evaluation of the aggregates was performed by Stardog.

Courtesy of Naotake Murayama

Courtesy of Naotake Murayama

This implies that all aggregate data was transferred from the source system to Stardog for evaluation. Imagine we want to know how much we’re spending in employee salaries for each state. We can express this in SPARQL:

graph <virtual://emp> {
  select ?state (sum(?sal) as ?totalSalary)
  where {
    [] :state ?state ;
       :salary ?sal
  }
  group by ?state
}

Evaluation of this query requires reading all records with employee state and salary data. With Stardog 5, we push the aggregation operation to the source system. This means that we only transfer the states and aggregate salaries, not every solution as was previously necessary. The resulting SQL query would look something like this:

SELECT EMP.STATE, SUM(SALARIES.SAL) AS TOTAL
FROM EMP, SALARIES WHERE EMP.EMP_NO = SALARIES.EMP_NO
GROUP BY EMP.STATE

This is one of several new optimizations in which we can push down query fragments for evaluation by the source system.

Other Improvements

The new virtual graph system in Stardog 5 uses Apache Calcite. Calcite gives us a mature and flexible representation of SQL algebra and a framework within which to easily rewrite queries. We still maintain a database portability layer for generating queries for different SQL syntaxes.

Improved Debugging and Development

If you’ve ever peeked under the hood at running queries using Oracle’s V$SESSION or SQL Server’s sys.dm_exec_requests, then you might notice a long chain of REPLACE() calls not shown in Stardog’s explain output. These calls are used to escape special characters in IRIs. In Stardog 5 we’re doing this on the Stardog side in most cases. This means less data transfer and easier debugging.

Courtesy of Lord Jim

Courtesy of Lord Jim

A related improvement is mapping validation. When creating a virtual graph, all mappings are validated immediately, returning a set of validation failures. This means reduced development time. Here’s an example of one of the new error messages:

Invalid mapping for subject=TEMPLATE(http://example.com/emp/{emp_no})
	table=select * from employees where emp_no = 5820
	field [INVALID_FIELD1] not found; input fields are:
          [EMP_NO, BIRTH_DATE, FIRST_NAME, LAST_NAME, HIRE_DATE, NICKNAME]
		term map=COLUMN(Invalid_Field1)^^xsd:string

You can see that this clearly identifies the mapping, the validation failure, and why it happened.

Apache Hive Support

We’ve also expanded our set of supported SQL dialects to include Apache Hive. We’ve actively working on supporting other databases. If you have specific requirements, please let us know.

Future Work

Stardog 5 is a big step forward for virtual graphs. But we’re not done. We’re working on incorporating better statistics from source databases to make better query planning decisions within Stardog. We’re also working on additional rewriting strategies that can be enabled for specific use cases.

Download Stardog today to start your free 30-day evaluation.

Jess Balint

9 May 2017

Social Media Icon Social Media Icon