Virtual Graphs in Stardog 5
Get the latest in your inbox
Get the latest in your inbox
Stardog 5 features an all-new virtual graph engine with lots of improvements. This post describes them and how they can help you build better knowledge graphs with Stardog.
The new virtual graph engine—all-new and built from scratch for Stardog 5—was designed to overcome limitations in the previous version, as well as provide a platform for future growth. (See my previous post for a discussion of virtual graphs in Stardog.) While retaining the same interface, we’ve replaced the SQL rewriting backend, improved performance, and added features. I discuss some of these changes including their use cases; the optimizations are general and applicable to many use cases.
Federated query execution is based on the idea that we can translate a portion of a SPARQL query into a SQL query for execution by a remote relational database engine. Given schema information, including primary and foreign keys, we can generate an efficient SQL query with a minimal number of joins.
Metadata such as types and nullability constraints allow optimizations with
OPTIONAL
patterns. “Pushing down” filters and aggregates reduces data transfer
from the remote system, in many cases significantly. Let’s take a look at these
optimizations and see how they work and why they’re beneficial.
Query complexity increases with the number of joins in the query. Many algorithms that implement SPARQL to SQL translation introduce intermediate self-joins, which need to be pruned before execution to avoid unnecessary complexity.
An illustration will make this clear. Given a virtual graph query containing a
basic graph pattern (BGP) with two triple patterns, we can imagine that the
employee first and last names come from the same table, say, EMP
:
graph <virtual://emp> {
?emp :firstName ?first ;
:lastName ?last
}
Naively translating this to SQL might introduce a query over the table for each predicate. In this case two predicates would query the table twice and require a join between them:
SELECT E1.EMP_NO, E1.FIRST, E2.LAST
FROM EMP AS E1 INNER JOIN EMP AS E2
ON E1.EMP_NO = E2.EMP_NO
Writing this query manually, we would know that the join is unnecessary and
avoid it. If the field being joined on, EMP_NO
, is a primary (or unique) key
on the EMP
table, then we can eliminate this self-join. The use of keys or
constraints for this type of optimization is called semantic query
optimization. We’re able to reduce the number of joins in this case, but it’s
not always possible using this approach.
In addition to standard self-join pruning, we eagerly avoid this issue in Stardog 5 by mapping several triple patterns in the same BGP with the same subject to a single-table query. Which means we never introduce these types of redundant joins in any part of the query translation process. We still perform several types of semantic query optimization which may further reduce joins if generated as a byproduct of the mappings and query.
SPARQL’s OPTIONAL
, which is really a left join in disguise, is another
opportunity for join reduction or avoidance. The concept of translating
optionality is simple: if a field is nullable, we allow it to be unbound.
A field in a SQL result may be nullable for a number of reasons. The originating
table’s field may be nullable or it may originate in a table used on the right
side of a left join. The general approach to translating OPTIONAL
to SQL is
using a left join.
Again, an example makes this clear. Using the employee example again, let’s
imagine the middle name is nullable in the EMP
table. We can query it like so:
graph <virtual://emp> {
?emp :lastName ?last
OPTIONAL { ?emp :middleName ?middle }
}
We’re looking for employee’s last names and, optionally, their middle names. The general translation algorithm will create a left join:
SELECT E1.EMP_NO, E1.FIRST, E2.MIDDLE
FROM EMP AS E1 LEFT JOIN EMP AS E2
ON E1.EMP_NO = E2.EMP_NO
This looks suspiciously similar to the previous situation. We have a self-join,
but this time it’s a left join in contrast to the inner join we had before. We
don’t need semantic query optimization as before. We don’t even need a left join
because there’s no FILTER
or join in the OPTIONAL
and the table and subject
of the optional pattern is the same as one in the outer BGP. We can simply query
over the EMP
table and allow the MIDDLE
field to be NULL
. In cases where
it is NULL
, ?middle
will remain unbound.
The last rewriting optimization we’ll discuss is pushing aggregation and subqueries down to the source system. Before Stardog 5, it was possible to express aggregation in virtual graph queries, but the evaluation of the aggregates was performed by Stardog.
This implies that all aggregate data was transferred from the source system to Stardog for evaluation. Imagine we want to know how much we’re spending in employee salaries for each state. We can express this in SPARQL:
graph <virtual://emp> {
select ?state (sum(?sal) as ?totalSalary)
where {
[] :state ?state ;
:salary ?sal
}
group by ?state
}
Evaluation of this query requires reading all records with employee state and salary data. With Stardog 5, we push the aggregation operation to the source system. This means that we only transfer the states and aggregate salaries, not every solution as was previously necessary. The resulting SQL query would look something like this:
SELECT EMP.STATE, SUM(SALARIES.SAL) AS TOTAL
FROM EMP, SALARIES WHERE EMP.EMP_NO = SALARIES.EMP_NO
GROUP BY EMP.STATE
This is one of several new optimizations in which we can push down query fragments for evaluation by the source system.
The new virtual graph system in Stardog 5 uses Apache Calcite. Calcite gives us a mature and flexible representation of SQL algebra and a framework within which to easily rewrite queries. We still maintain a database portability layer for generating queries for different SQL syntaxes.
If you’ve ever peeked under the hood at running queries using Oracle’s
V$SESSION
or SQL Server’s sys.dm_exec_requests
, then you might notice a long
chain of REPLACE()
calls not shown in Stardog’s explain
output. These calls
are used to escape special characters in IRIs. In Stardog 5 we’re doing this on
the Stardog side in most cases. This means less data transfer and easier
debugging.
A related improvement is mapping validation. When creating a virtual graph, all mappings are validated immediately, returning a set of validation failures. This means reduced development time. Here’s an example of one of the new error messages:
Invalid mapping for subject=TEMPLATE(http://example.com/emp/{emp_no})
table=select * from employees where emp_no = 5820
field [INVALID_FIELD1] not found; input fields are:
[EMP_NO, BIRTH_DATE, FIRST_NAME, LAST_NAME, HIRE_DATE, NICKNAME]
term map=COLUMN(Invalid_Field1)^^xsd:string
You can see that this clearly identifies the mapping, the validation failure, and why it happened.
We’ve also expanded our set of supported SQL dialects to include Apache Hive. We’ve actively working on supporting other databases. If you have specific requirements, please let us know.
Stardog 5 is a big step forward for virtual graphs. But we’re not done. We’re working on incorporating better statistics from source databases to make better query planning decisions within Stardog. We’re also working on additional rewriting strategies that can be enabled for specific use cases.
Download Stardog today to start your free 30-day evaluation.
How to Overcome a Major Enterprise Liability and Unleash Massive Potential
Download for free