Showing results for

Back to all articles

FROM vs FROM NAMED in SPARQL

Jun 28, 2021

FROM vs FROM NAMED, what’s the difference, and when should I use one or the other is a constant source of confusion for SPARQL users. It’s one of the main reasons why a query can surprisingly return zero results and the most experienced of us have been tricked by it at least once. This short post goes into a little bit of a detail of the difference and discusses how both can be used to address different use cases.

Others have written excellent posts demystifying parts of the SPARQL specification related to this. Jeen Broekstra’s post on the default graph and Marcelo Barbieri’s general post on working with named graphs (including Stardog specific extensions) are particularly useful. We won’t touch on the default graph matter here at all (thanks, Jeen!) and will focus on queries with FROM or FROM NAMED keywords.

Note: one can specify the query’s dataset externally to the query using HTTP parameters as defined in the SPARQL Protocol. As far as this post is concerned, that has the same effect as using FROM and FROM NAMED inside the query string.

Query dataset and the active graph

A lot of confusion around FROM and FROM NAMED stems from the assumption that any SPARQL query runs against a collection of graphs. That seems like a natural thing to assume, after all, the RDF 1.1 dataset is exactly that: a collection of graphs (including the default, aka unnamed, graph). The reality, however, is a little more nuanced: each graph pattern in the query, at any point of time during the execution, matches data in one graph only. That graph is called the active graph for that pattern. It can change during the execution and is defined by two things:

  • if the pattern is within a GRAPH block
  • FROM and FROM NAMED definitions

Understanding what the active graph is for a pattern, such as a BGP, at any point of time, is key to understanding why, for example, the pattern matches no results. The rules are pretty straightforward:

  • patterns outside of GRAPH blocks are evaluated against the merge of all graphs listed in FROM. The active graph is the merge of all such graphs, it does not change during execution.
  • patterns inside GRAPH iri {..} are evaluated against the iri graph in the data if iri is listed in FROM NAMED. The active graph is that graph in the data or the empty graph if it’s not listed in FROM NAMED. Again, it does not change.
  • patterns inside GRAPH ?g {..} are separately evaluated against every graph listed in FROM NAMED and the results are combined under set union. Every graph listed in FROM NAMED becomes the active graph for some evaluation of the pattern.

These rules have some simple, yet important, consequences:

  • if your query uses FROM but your patterns are inside GRAPH (with ?g or a constant IRI), the patterns will not match any data.
  • if your query uses FROM NAMED but your patterns are outside of graph blocks, the patterns will not match any data.

Remembering just these two facts by itself will keep you out of trouble most of the time: if you use GRAPH use FROM NAMED, if you don’t use GRAPH use FROM (and vice versa).

The rest of the differences are nuances. It’s OK to skip over to the section about the use cases for both. But if you’re interested in whether you can always replace your FROM by FROM NAMED, put your patterns inside a GRAPH blocks, and get the same results, read on.

The answer to that question is, as you may suspect, “no”.

FROM vs FROM NAMED difference in query results

So why exactly does the spec offer both FROM and FROM NAMED? Can’t we just have one or the other? At the end of the day what most people want to do is to query a collection of graphs so why do we need two ways to enumerate those graphs? To answer this question, we first examine how the difference between FROM and FROM NAMED can manifest in query results.

Matching data within graphs vs across graphs

Consider the following small data snippet (in TRiG):

    :products {
      :p1 a :Product ;
          :part-of :o1 .
      :o1 a :Order;
          :customer :c1 .
    }
    
    :customers {
      :c1 :name "Jim" .
    }

and the query:

    SELECT * FROM :products FROM :customers {
      ?product  a :Product ;
                :part-of ?order .
      ?order    :customer ?customer .
      ?customer :name ?name 
    }

The product data is stored in the :products graph and the customer data is stored in the :customers graph, they’re connected through order nodes. The query selects customer names associated with products.

This query’s BGP consisting of 4 triple patterns (TP) has to be evaluated over a single graph to match results even though the data is spread over two distinct named graphs. This is what the FROM keywords deliver: the active graph for the BGP is a merge of the :products graph and the :customers graph so the patterns match data. One can say that the data is matched across named graph boundaries.

Now, let’s see what happens if the query is rewritten to use FROM NAMED:

    SELECT * FROM NAMED :products FROM NAMED :customers {
      GRAPH ?g {
        ?product  a :Product ;
                  :part-of ?order .
        ?order    :customer ?customer .
        ?customer :name ?name
      }
    }

First, we need to move the BGP into a GRAPH block (remember the simple rules we mentioned earlier). If you don’t do that, any mature query optimiser will immediately realise that the query cannot produce results and won’t even evaluate the BGP. Second, and most importantly, to understand which results this query produces (if any) you need to understand the active graph for the BGP (for the curious, the relevant part of the spec is Section 18.6 Definition: Evaluation of Graph).

Informally GRAPH ?g means that the query loops over all graph IRIs in FROM NAMED and substitutes each of them into ?g, that becomes the active graph for the BGP, the pattern is evaluated, and the results are combined over all of the listed graphs. That is, the key part is that at any point of time the active graph is a single named graph in the data. But neither :products nor :customers by itself contains enough data to match the entire BGP, either products are missing or customers. Thus the BGP produces the empty result for both evaluations and so does the union. The query returns no results.

Again, one can say that with FROM NAMED and GRAPH patterns match within graphs rather than across graphs. But it’s important to understand what it actually means and this is where the concept of active graph helps.

Note: one way of thinking why GRAPH ?g matches within graphs is to think what value the graph variable ?g would have if matching was done across graphs, i.e. half in :products and half in :customers. It’d be very difficult to define it in a meaningful way.

Property paths and graph boundaries

Another interesting difference between FROM vs FROM NAMED manifests when property paths are involved. The spec says that property paths do not span multiple graphs in the dataset. There’s a little bit of ambiguity here on what is meant by “dataset” exactly. It doesn’t mean that one cannot match paths in the data across named graph boundaries, as the following example shows:

    :products {
      :p1 a :Product ;
          :part-of :o1 ;
          :part-of :o2 .
    }
    
    :orders {
      :o1 a :Order ;
          :customer :c1 .
      :o2 a :Order ;
          :customer :c2 .
    }
    
    :customers {
      :c1 :name "Jim" .
      :c2 :name "Mary" .
    }

It’s still the same products’n’customers dataset except that now there’s a separate graph for orders and a second order for the same product (:p1) but a different customer (:c2). Now, a query to find pairs of customers who buy the same products could look like:

    SELECT * FROM :products FROM :orders {
      ?customerX (^:customer/^:part-of/:part-of/:customer)+ ?customerY
      FILTER (str(?customerX) < str(?customerY))
    }

Pro tip: if you’re interested not just in pairs of customers but actual connections in the data that link them together, i.e. particular orders and products, check out path queries in Stardog.

The dataset for this query is defined by the two FROM statements which make a single merged active graph. Since it’s a single active graph for the query, the property path will match :c1 and :c2 even though the path spans both :products and :orders graphs in the data.

Again, switching the keyword FROM to FROM NAMED changes the execution and the results:

    SELECT * FROM NAMED :products FROM NAMED :orders {
      GRAPH ?g { ?customerX (^:customer/^:part-of/:part-of/:customer)+ ?customerY }
      FILTER (str(?customerX) < str(?customerY))
    }

Now the (^:customer/^:part-of/:part-of/:customer)+ pattern is evaluated twice, once against :products and once against :orders. Each time it cannot go across the graph boundary so both evaluations return empty results and so does the query.

One can often reason about property path evaluations as sequences of BGP evaluations till a fix-point, i.e. when no new nodes can be reached. So it should not be surprising the FROM vs FROM NAMED difference manifests similarly for property paths as for BGPs.

Named graphs in the wild: use cases

Let’s now go back to the earlier question whether both FROM and FROM NAMED are needed. We know they can lead to different query results but it’s interesting how their properties relate to actual use cases.

When it comes to named graphs in RDF, we often see that the data layout follows one of the following two patterns: either few relatively large graphs where each contains a unique part of the dataset, or numerous small graphs representing similar objects. We call these patterns graph per domain and graph per entity.

Graph per domain

By domain here we mean some part of the graph which is structurally different from other parts of the graph. Popular examples include:

  • a table or a collection of tables esp. if the graph has roots in a relational database(s). The products and customers example above follows this pattern: the :products and :customers graphs could easily correspond to product and customer tables.
  • a data source, i.e. the named graph identifies where a particular part of the data came from, such as an upstream database, a connector, a data stream, etc. The graph IRI is then convenient to record provenance information.
  • a dataset. It’s not uncommon for Stardog customers to maintain different datasets in a single database in different named graphs esp. if they need to be queried together.

Graphs following this pattern are typically linked, for example, in the graph per table case such links would be foreign key relationships. Product nodes whose attributes are stored in the :products graph have links to order nodes which can be stored in a different graph. Thus querying across graphs is usually a key requirement and it is directly supported by the FROM semantics.

Note: one reason people tend to partition their dataset onto several graphs is the relative ease of using the SPARQL Graph Store Protocol compared to generic SPARQL Update queries. It offers REST-like API for adding, updating, and deleting graphs which looks friendlier to a slew of users, esp. Web developers, than an expressive query language.

Graph per business entity

The other, substantially different pattern of using named graphs is when each graph keeps together data about a single object, like a product, a customer, or any other business entity. This pattern usually results in millions of small graphs. In contrast to the above use case, these graphs tend to be independent and structurally similar. Continuing the analogy with relational models, such graphs often correspond to a single table row or a small set of linked rows from several tables (i.e. if the relational schema is normalised to a high form).

One reason GRAPH ?g {...} blocks become useful in such a scenario (and thus FROM NAMED) is because it’s often important to figure out which graphs contain objects matching the search criteria. Consider the following simple example:

    SELECT * FROM NAMED ... {
      GRAPH ?g {
        ?p a :GroceryProduct ;
           :expires-on ?expdate .
        FILTER(?expdate < "2021-01-06"^^xsd:date)
      }
      ?g :created-on ?date
    } 

The query finds all grocery products which will soon expire and gets the information on when those products’ data was added to the dataset. The named graph IRIs are pretty useful for recording additional metadata since they can naturally be used as subjects of other triples, e.g. :created-on triples.

Note: SPARQL 1.1 unfortunately does not provide standard means to specify a large number of graphs as a dataset without enumerating them individually, which causes problems in such cases. Stardog provides special IRIs, like :context:named or :context:virtual, which work like wildcards. A more general pattern-based mechanism would be useful.

Restricting BGPs to match within graphs typically does not cause issues with this use case because products tend to be independent and not directly linked. They of course will be linked through other objects, like orders or customers, and it’s not difficult to make queries work under the GRAPH semantics:

    SELECT * FROM NAMED ... {
      GRAPH ?product-graph {
        ?product  a :Product ;
                  :part-of ?order .
      }
      GRAPH :orders {
        ?order    :customer ?customer .
      }
      GRAPH :customers {
        ?customer :name ?name
      }
    }

Using different graph terms is sufficient to match across graphs.

Note: it’s not uncommon to use the same IRI for the node representing a business entity and the named graph where the triples for that entity are stored.

To sum up

FROM and FROM NAMED work a little differently in SPARQL but in most cases it’s sufficient to remember that the former tends to be used when query patterns are outside of GRAPH blocks and the latter when they are inside. This simple rule alone can help you avoid the frustration of seeing no results for seemingly well-written queries. The key to a deeper understanding of the differences is the concept of active graph for a query pattern and how it may, or may not, change within the query execution.

This post only talks about pure RDF 1.1 named graphs but the discussed concepts, like the query dataset or the active graph, are relevant in case your database (like Stardog) is capable of data virtualization. You can learn how the query dataset can encompass virtual graphs and Stardog’s Virtual Transparency feature from our other posts and tutorials.

Keep Reading:

EKS Volume Snapshots

As discussed in a previous post Stardog Cloud relies on VolumeSnapshots in Kubernetes (k8s) for backups of user data. In this post we will go into more technical details of how to work with VolumeSnapshots in the Elastic Kubernetes Service (EKS). Kubernetes Components Here we will presents the k8s components that are used when working with VolumeSnapshots. We do not go into exhaustive details here but rather briefly give an overview to ease in understanding the concepts in this post.

Loading a million triples per second on commodity hardware

At Stardog we are continuously pushing the boundaries of performance and scalability. Last month’s 7.5.0 release brought 500% improvement to transactional write performance. This month’s 7.6.0 release improves writing data at database creation time by almost 100%, yielding a million triples per second loading speed using a commodity server. In this post we’ll talk about the details of loading performance. Let’s do the numbers The fastest way to load large amounts of data into Stardog is to do at database creation time.

Try Stardog Free

Stardog is available for free for your academic and research projects! Get started today.

Download now