Avoidably Slow Queries

By , · 4 minute read

This is an appendix to 7 Steps to Fast SPARQL Queries.

Avoidably Slow Queries

Worst case complexity is often depressing. Average case complexity, if you are fortunate, may be much happier. But now let’s consider unintentional complexity, which is to say, queries that are slow but which can be made fast.

Unselective Queries

First, let’s consider unselective queries, that is, those which return large result sets and are often user errors. In the worst case an unselective query is indistinguishable from a bulk data transfer over a highly inefficient mechanism.

Here’s a simplified version of one such query that we saw recently:

CONSTRUCT {
  ?bus a :Bus .
  ?tram a :Tram .
}
WHERE {
  ?bus rdfs:label ?busLabel.
  FILTER (CONTAINS(LCASE(?busLabel), "bus"))

  ?tram rdfs:label ?tramLabel.
  FILTER (CONTAINS(LCASE(?tramLabel), "tram"))
}

With this query the user wants to identify certain objects by their labels and create proper type triples (e.g. for buses and trams). This is a common pattern: pick some feature as the basis of an explicit type assertion—this node is part of the category or class of nodes called “foo” because it has feature “bar”).

Courtesy of Tony Webster

Courtesy of Tony Webster

The problem is that this query will cause Stardog to compute and return all pairs of buses and trams instead of a simple union of buses and trams. The former explodes and there’s nothing that the optimizer can do about it. It’s rarely what the user intends. This is visible from the query plan with the top-level loop join:

Projection(?bus AS ?subject, ?ltnuylet AS ?predicate, ?immqbtmb AS ?object;
+─         ?tram AS ?subject, ?ltnuylet AS ?predicate, ?phfyiimx AS ?object)
`─ Bind(<urn:Tram> AS ?phfyiimx) (<urn:Bus> AS ?immqbtmb) (rdf:type AS ?ltnuylet)
  `─ LoopJoin(_)
     +─ Filter(CONTAINS(LCASE(?tram), "tram"))
     │  `─ Scan[PSOC](?tram, <http://www.w3.org/2000/01/rdf-schema#label>, ?busLabel)
     `─ Filter(CONTAINS(LCASE(?bus), "bus"))
        `─ Scan[PSOC](?bus, <http://www.w3.org/2000/01/rdf-schema#label>, ?tramLabel)

The better way to write the intended query is like this:

CONSTRUCT {
  ?bus a :Bus .
  ?tram a :Tram .
}
WHERE {
  { ?bus rdfs:label ?busLabel.
    FILTER (CONTAINS(LCASE(?bus), "bus"))
  }
  UNION
  {
    ?tram rdfs:label ?tramLabel.
    FILTER (CONTAINS(LCASE(?tram), "tram"))
  }
}

Even experienced users can occasionally overlook such mistakes. The first thing to consider when diagnosing a slow query is to double check that the query is going to return the results you want it to return (and, ideally, nothing else).

SPARQL’s DARK CORNERS

Some parts of the SPARQL standard are less than ideal. Slow query behavior may be do to a lack of clarity in the specification, in user’s understanding, or both. For example, the semantics of literal equality are such that efficient evaluation of certain types of filters isn’t possible. Here’s a simplified query from the SP2B benchmark:

SELECT DISTINCT ?person ?name
WHERE {
  ?article rdf:type bench:Article .
  ?article dc:creator ?person .
  ?inproc rdf:type bench:Inproceedings .
  ?inproc dc:creator ?person2 .
  FILTER (?person=?person2)
}

This may look like an innocuous join on ?person=?person2 but it can return a superset of results of this query:

SELECT DISTINCT ?person ?name
WHERE {
  ?article rdf:type bench:Article .
  ?article dc:creator ?person .
  ?inproc rdf:type bench:Inproceedings .
  ?inproc dc:creator ?person .
}

And it’s harder to evaluate since even literals of different datatypes, e.g. "10.0"^^xsd:float and "10.0"^^xsd:decimal, can be equal.

This is called type promotion. Maintaining special literal indexes to represent the equality classes for data literals is an unnecessary overhead. Such queries are rare.

Courtesy of Sigfrid Lundberg

Courtesy of Sigfrid Lundberg

In the absence of such indexes, however, the filter has to be applied to all pairs of ?person and ?person2. This is again visible from the query plan with a loop join. Stardog 5 provides a query hint assume.iri which tells the optimizer that the given variable, e.g. ?person will be bound to IRIs only so the optimizer can remove the filter and evaluate a join instead.

Hard Analytic Queries

The more interesting problems occur with naturally hard queries which process a lot of data, which is typical for analytic queries. Those often do not include a selective constant—as opposed to look-up queries (e.g. “give me names of all of John’s friends”)—but rather perform complex computations over large portions of the graph.

This query asks for the top 10 most discussed product categories of products from a specific country based on number of reviews by reviewers from another country.

select ?productType ?reviewCount
{
 { select ?productType (count(?review) As ?reviewCount)
  {
   ?productType a bsbm:ProductType .
   ?product a ?productType .
   ?product bsbm:producer ?producer .
   ?producer bsbm:country country:china .
   ?review bsbm:reviewFor ?product .
   ?review rev:reviewer ?reviewer .
   ?reviewer bsbm:country country:us .
  }
  group by ?productType
 }
}
order by desc(?reviewCount) ?productType
limit 10

Neither country may be a selective restriction so the query engine may churn through a lot of data to find the required product categories. It is this sort of query for which the optimal plan—specifically, the order of joins—is critical. When performance is unsatisfactory, it may be possible to reformulate the query such that it’s easier for the optimizer to find the best plan. Which is exactly what 7 Steps to Fast SPARQL Queries is about.

Download Stardog today to start your free 30-day evaluation.


Top