Showing results for

Back to all articles

Joins and NULLs in SPARQL

Oct 19, 2021

Joins in SPARQL could be confusing to newcomers. You can hear some people celebrating the fact that they don’t need to write explicit join conditions (like in SQL) but if you actually look in the SPARQL spec, you will see the term “join” used like 67 times (as of Oct 2021). Furthermore, if you look at the join definition you will recognize the familiar relational operator that’s not so different from SQL.

What it all means is that while joins are not in the SPARQL syntax they are deeply ingrained in the SPARQL semantics. The definition of join often specifies what the results should be for this query or that query on some graph independently of which system executes the query. That definition, while pretty simple, can easily trick the user, especially, if they thought that using SPARQL somehow means that joins are irrelevant. This post should demystify one particular aspect of the join semantics in SPARQL, namely, NULLs (aka unbound join keys).

Joins in SPARQL semantics

Whenever two SPARQL patterns, say a BGP and UNION, end up in the same scope (or Group Graph Pattern, simply in {...}), their results are joined:

SELECT * {
    { ?person :lives ?city } UNION { ?person :worksFor ?company }
    ?person :name ?name
}

This UNION matches both city dwellers and employees. The results are joined with the ?person :name ?name BGP to retrieve names. All patterns in SPARQL generate table-like results, i.e. (multi)sets of rows called solutions. Each solution is simply a set of bindings where a variable is bound to a value (an IRI, a bnode, or a literal). Since all intermediate query results are tables (even though the base data is a graph), the relational join provides a natural foundation for combining results until they reach the final projection (e.g. SELECT or CONSTRUCT) and get shipped to the client.

Note: there could be any other pattern there in place of the UNION, like a property path or a subquery, and it would still require a join to combine its results with the BGP.

Joins in SPARQL are natural joins. That means that the solution sequences are joined on values of the same variable on both sides (that variable is called the join key). In the example above the ?person variable is the join key. Assuming the BGP generates

+--------+--------+
| person |  name  |
+--------+--------+
| :p1    | "John" |
| :p2    | "Mary" |
| :p3    | "Lisa" |
+--------+--------+

and the UNION generates

+--------+-------+----------------------+
| person | city  |  company             |
+--------+-------+----------------------+
| :p1    | :NYC  |                      |
| :p2    |       | :Google              |
| :p2    |       | :Stanford_University |
+--------+-------+----------------------+

the joined result (which also happens to be the final result of the query) would be

+--------+--------+-------+----------------------+
| person |  name  | city  |  company             |
+--------+--------+-------+----------------------+
| :p1    | "John" | :NYC  |                      |
| :p2    | "Mary" |       | :Google              |
| :p2    | "Mary" |       | :Stanford_University |
+--------+--------+-------+----------------------+

Note: there could be multiple join variables in a pair of solutions.

According to the join definition every join result represents a combination of two compatible solutions, one from the BGP and the other from the UNION, which agree on the value of ?person (the join key). For John there’s one such solution on both sides, for Mary there’s one on the BGP side but two from the UNION (so Mary appears twice in the results), and for Lisa there’s no city or company information so she does not appear in the results.

This should all look familiar to SQL users, maybe to the extent of being boring. But hang on, we are going to talk about those missing values. The dreaded NULLs.

NULLs or unbound variables

The term “NULL” actually does not occur in the SPARQL spec, perhaps because it doesn’t really have a good rep in the database world. Instead the spec talks about bound and unbound variables. If a variable, like ?city, is associated with a value in a solution, like :NYC in the top row, it’s said to be bound to that value. Otherwise it is unbound.

Unbound or NULL is about the same thing. Perhaps “unbound” better exposes the fact that “NULL” is not some sort of special marker value indicating missing data. It is the lack of a value. That confuses a lot of new SQL users when they first see IS NULL (or IS NOT NULL) predicates instead of simple comparisons. Or wonder why any comparison with nulls never returns true. Unfortunately, the same confusion also happens in SPARQL, but manifests differently.

It happens when a join key variable is unbound.

Unbound join keys or “where all these rows came from?!”

The solution compatibility definition states that two solutions are compatible if a variable is bound in both solutions, it is bound to the same value. The trick here is that this definition does not say anything about unbound variables. If a variable is unbound in a solution, it simply does not exist there. Thus it makes no impact on compatibility of that solution with any other solution.

That may have some interesting consequences as the following example illustrates. Let’s add a triple about a company but without any employees:

:Yandex :locatedIn :Moscow .

and then join the same (?person, ?name) solutions:

+--------+--------+
| person |  name  |
+--------+--------+
| :p1    | "John" |
| :p2    | "Mary" |
| :p3    | "Lisa" |
+--------+--------+

with the following solutions on the ?person variable:

+--------+----------------------+----------------+
| person |       company        |      city      |
+--------+----------------------+----------------+
|        | :Yandex              | :Moscow        |
| :p2    | :Stanford_University | :Stanford      |
| :p2    | :Google              | :Mountain_View |
+--------+----------------------+----------------+

possibly generated by a pattern like this:

?company :locatedIn ?city .
OPTIONAL {
  ?person :worksFor ?company 
}

The result would be this:

+--------+--------+----------------------+----------------+
| person |  name  |       company        |      city      |
+--------+--------+----------------------+----------------+
| :p2    | "Mary" | :Stanford_University | :Stanford      |
| :p2    | "Mary" | :Google              | :Mountain_View |
| :p2    | "Mary" | :Yandex              | :Moscow        |
| :p1    | "John" | :Yandex              | :Moscow        |
| :p3    | "Lisa" | :Yandex              | :Moscow        |
+--------+--------+----------------------+----------------+

The first two rows are no surprise: Mary jointly works for Google and Stanford. The other three, however, could look counterintuitive. Yandex is not associated with any person in the second set of solutions, why did we get it for all people in the first set? Imagine how the results would explode if there were millions of people in the data, it’d come up for all of them.

The answer is in the compatibility definition. Consider two solutions: (?person -> :p1, ?name -> "John") and (?company -> :Yandex, ?city -> :Moscow). None of the variables bound in the first solution has a different value in the second because none of them exists in the second. This is where the true meaning of unbound (or NULL) shows: if this were a marker value that’s different from :p1, the solutions would not be compatible. But it’s not the case, there’s no ?person value in the second solution, thus there is no variable to make the solutions incompatible.

Note: Another way to think about it is that generally there’s no rigid relational schema for solution sequences that a particular SPARQL pattern generates (even though it’s convenient to depict them as tables, as we do in this post). There’s no fixed set of variables for all solutions coming out of some UNION or some OPTIONAL (i.e. no fixed Heading in the relational algebra terms). In general solutions are not even fixed size. The compatibility definition naturally deals with that by operating only with those variables which are bound in the solutions in hand.

This means, in general, that a solution without a bound join key is compatible with all solutions generated by the pattern on the other side of the join. It doesn’t matter what values the same variable has on the other side. If there are multiple such solutions, on both sides, the join will generate all combinations of such solutions, up to the Cartesian product. The results could explode possibly causing issues for both the server and the client.

Unbound join keys in the wild

OK, this sounds ominous but do those unbound join keys actually occur in real life? Can realistic SPARQL queries generate them? Do I actually need to care? Yes, here are some common examples.

BIND patterns and functions raising errors

BIND patterns are a popular SPARQL feature that enables users generate new values from existing ones and assign them to variables. Suppose in our data the :lives predicate links people to cities but cities are strings and not IRIs. One may want to turn them to IRIs in order to link to other information, like population data:

SELECT * {
  ?person :lives ?city .
  BIND(iri(concat("http://test.com/", ?city)) as ?cityIri)
  ?cityIri :population ?pop
}

Executing this on

@prefix : <http://test.com/> .

:John :lives "NYC" .
:NYC :population 8419000 .
:London :population 8982000 .
:Moscow :population 11912000 .

returns the expected results:

+----------------------+-------+---------------------+---------+
|        person        | city  |       cityIri       |   pop   |
+----------------------+-------+---------------------+---------+
| http://test.com/John | "NYC" | http://test.com/NYC | 8419000 |
+----------------------+-------+---------------------+---------+

However, as soon as we add this triple to the data:

:Mary :lives :London .

the results again blow up, this time for :Mary, associating her with all cities in the data:

+----------------------+------------------------+------------------------+----------+
|        person        |          city          |        cityIri         |   pop    |
+----------------------+------------------------+------------------------+----------+
| http://test.com/John | "NYC"                  | http://test.com/NYC    | 8419000  |
| http://test.com/Mary | http://test.com/London | http://test.com/London | 8982000  |
| http://test.com/Mary | http://test.com/London | http://test.com/NYC    | 8419000  |
| http://test.com/Mary | http://test.com/London | http://test.com/Moscow | 11912000 |
+----------------------+------------------------+------------------------+----------+

The problem is that :Mary is already linked to an IRI (:London), not to a literal ("London"), as the query’s author probably assumed. The concat function in SPARQL expects a string and produces a type error which, according to the spec, does not abort query execution but leaves the target variable ?cityIri unbound. But that variable is a join key and once it’s removed the remaining (?person -> :Mary, ?city -> :London) solution is compatible with all solutions generated by the ?cityIri :population ?pop BGP, irrespective of ?cityIri values there.

This situation is not uncommon at all in various data linking/integration queries where joins keys are cleaned, transformed, and generated at query time. It is also not accidental that datasets for which queries are written tend to contain messy data which often defeats assumptions. Also, when writing a query one should always keep in mind that the data is fluid and can get messy after initial testing.

Lots of built-in functions in SPARQL produce errors under various conditions. It is pretty difficult (both to the user and the query engine) to conclude that a certain expression in a BIND will never produce an error so that the target variable is always bound. For example, one attempt to “fix” the above query would be to reformulate the expression as follows:

BIND(iri(concat("http://test.com/", str(?city))) as ?cityIri)

The idea is to use the str function to turn the ?city value into a string literal before concatenation. It almost works! However str can also produce an error if ?city is bound to a blank node in which case the join result would blow up just the same. A strictly spec-compliant engine would have to account for that possibility and be ready to handle NULLs during execution.

There are several ways to deal with this issue:

  1. Fix the data and keep it clean (for example, by preventing updates violating integrity constraints). SHACL is a great tool for the job. Of course, this is not always realistic, for example, ensuring that one has an integrity constraint in place for every implicit data quality assumption used in every query is going to be hard. But there are some query-time solutions, too:
  2. Use filters to sanitize data before joins:

    ?person :lives ?city .
    FILTER(!isBlank(?city))
    BIND(iri(concat("http://test.com/", str(?city))) as ?cityIri)
    ?cityIri :population ?pop
    

    Here solutions with bnodes are filtered out so the functions in BIND should not return errors. Filtering can also be done after the BIND by using the bound function (that’d be slightly less efficient).

  3. The if expression is a SPARQL version of the ternary operator common in many programming languages. One can use it to validate and transform values without filtering them out:

    ?person :lives ?city .
    BIND(if(!isBlank(?city), str(?city), "") as ?cityLiteral)
    BIND(iri(concat("http://test.com/", ?cityLiteral)) as ?cityIri)
    ?cityIri :population ?pop
    

    Here we check that ?city is not bound to a bnode before using str. Then ?cityLiteral is guaranteed to be bound to a literal and thus the iri(concat(...)) call should successfully generate an IRI. This approach probably does not make much sense in this particular example because if ?city is a bnode, there is no reason to preserve that intermediate result (it won’t join with ?cityIri). But in other cases it might be useful to keep unexpected values so the client could see them.

  4. Finally, the coalesce function can be used as the last row of defence: it will evaluate its arguments, left to right, till there is a valid value, then returning that value. By wrapping function calls in coalesce one can guarantee a non-null value:

    ?person :lives ?city .
    BIND(coalesce(iri(concat("http://test.com/", str(?city))), "") as ?cityIri)
    ?cityIri :population ?pop
    

    If the function calls produce an error, coalesce will bind ?cityIri to the empty string which will play the role of that special marker preventing the join result explosion.

The same problem may happen when joining aggregated results, i.e. when the join key variable is bound by a GROUP BY operator (although that is pretty rare). Unbound values also happen as the result of errors encountered by aggregation functions, like group_concat or avg (see the example in the spec).

Join keys bound in OPTIONAL patterns

The other notable case of unbound join keys involves OPTIONALs. Those are the SPARQL equivalent of left outer joins in SQL and are commonly used to deal with missing data (if you are not familiar with them, see this gentle introduction: Linking information to “missing” information in SPARQL).

One should be very careful when joining on variables bound inside OPTIONAL patterns. Those can be unbound if the pattern does not match the data (or the FILTER inside the OPTIONAL, if any, does not evaluate to true). Consider this example:

SELECT * {
  ?person a :Person
  OPTIONAL { ?person :lives ?city }
  OPTIONAL { ?city :locatedIn ?country }
}

Looks pretty harmless, doesn’t it? But try it on this data:

:John   a :Person ;
        :lives :NYC .
:Mary   a :Person .
:NYC    :locatedIn :USA .
:London :locatedIn :UK .
:Moscow :locatedIn :Russia . 

and you get the familiar result explosion for :Mary:

+--------+---------+---------+
| person |  city   | country |
+--------+---------+---------+
| :John  | :NYC    | :USA    |
| :Mary  | :NYC    | :USA    |
| :Mary  | :Moscow | :Russia |
| :Mary  | :London | :UK     |
+--------+---------+---------+

Since the city information is missing for :Mary, the ?city variable comes unbound from the first OPTIONAL. Since it’s the join key for the second OPTIONAL, the results explode in much the same way as in the previous examples.

Note: The sharp reader may notice that the second OPTIONAL keyword isn’t even essential to this example, just joining with ?city :locatedIn ?country would do. However it’s very common to chain OPTIONALs like this in real-life queries.

An easy way to get the desired results is to move the last pattern inside the OPTIONAL, in which case it becomes a part of the same BGP rather than being separately joined:

SELECT * {
  ?person a :Person
  OPTIONAL { 
      ?person :lives ?city .
      ?city :locatedIn ?country }
}

This yields the expected result:

+--------+-------+---------+
| person | city  | country |
+--------+-------+---------+
| :John  | :NYC  | :USA    |
| :Mary  |       |         |
+--------+-------+---------+

Conclusion and performance implications

The relational database community has long been aware that NULLs are mean. From that perspective it should be no surprise that they cause issues for query languages with similar to SQL semantics, too. What makes some of these issues, particularly their interaction with joins, unexpected for many SPARQL users is that joins almost seem like a foreign concept to SPARQL. They do, however, play a pretty fundamental role in the SPARQL semantics and have to deal with NULLs. It takes a bit of experience to “see” the implicit joins in the join-free syntax of SPARQL queries to avoid this kind of pitfalls. Fortunately the fixes are typically not hard.

However, result explosion is not the only issue. Even if the data matches the expectations in the query perfectly (i.e. no ill-typed errors screwing function calls in BINDs), the query optimizer may not know it before executing the query. One of the appeals of RDF is the ability to store and query data which does not satisfy the schema (even when specified explicitly using RDFS/SKOS/OWL or SHACL). The query engine, in most settings, must be able to deliver the correct results on whatever data mess lives inside your system.

This means every join in the query plan must be ready to deal with NULLs unless the optimizer can figure out ahead of time that the join key variables cannot be unbound. Regardless of the data. This is hard to do in general.

Note: we are not talking here about query execution under schema constraints. It’s possible to define SPARQL evaluation under the assumption that the data satisfies, say, the given SHACL shapes, and the results would be undefined otherwise. This however would be beyond the SPARQL spec (which predates SHACL).

Many mainstream join algorithms, in its classical formulation, don’t deal well with NULLs. Think about the hash join, in particular. It builds a hashtable for all rows generated by one relation. The rows are hashed on values of join variables. If those values are NULL, all such rows would end up in the same bucket. If a NULL value occurs in a row of the other relation (the probe side), the lookup in the table reduces to full iteration over the hashtable (which typically isn’t what it’s optimized for). These complications are a reason that the system may decide to use a less efficient join algorithm just because NULLs may occur during evaluation, even if they don’t actually occur. The result is a performance hit.

Different systems implement different ways to analyze the query to decide how to deal with NULLs. As demonstrated in the examples with BIND, NULLs can often be removed in multiple ways. However it does not mean that all resulting queries will be equally easy for your query optimizer to analyze. For example, the coalesce option should make it obvious to the optimizer that the target variable cannot be NULL. This is not the case about the other options because tracking variable value types across multiple BIND or FILTER expressions could require a pretty complex analysis.

Knowing your system and being able to read its query execution plans (or use the query profiler, if available) is key to dealing with performance issues, whether NULL-related or not.

Keep Reading:

Buildings, Systems and Data

Designing a building or a city block is a process that involves many different players representing different professions working in concert to produce a design that is ultimately realized. This coalition of professions, widely referred to as AECO (Architecture, Engineering, Construction and Operation) or AEC industry, is a diverse industry with each element of the process representing a different means of approaching, understanding, and addressing the problem of building design.

Graph analytics with Spark

Stardog Spark connector, coming out of beta with 7.8.0 release of Stardog, exposes data stored in Stardog as a Spark Dataframe and provides means to run standard graph algorithms from GraphFrames library. The beta version was sufficient for POC examples, but when confronted with real world datasets, its performance turned out not quite up to our standards for a GA release. It quickly became obvious that the connector was not fully utilizing the distributed nature of Spark.

Try Stardog Free

Stardog is available for free for your academic and research projects! Get started today.

Download now