Mapping Denormalized Data

By , · 7 minute read

Since we’re adding support for MongoDB in Stardog, it’s time to dive into the details of mapping denormalized data.

Distributed databases give us the ability to scale storage far beyond a single, logical machine. Distribution of data via partitioning allows multiple machines to work together to store arbitrarily large data sets.

As data is distributed, the cost of traditional joins becomes higher, making the normalized data model problematic. A denormalized model that co-locates joined data can be an attractive alternative.

While this obviates the need to perform traditional joins, it creates interesting challenges for virtualization. Let’s examine the challenges and explore solutions.

Denormalized Data

To illustrate, consider a partitioned database instance that is being mapped into the knowledge graph. We’ll assume a design with 5 servers where movies with titles beginning with A-L are on server 0, titles beginning with M-Z on server 1, actors on server 2, roles for actors with even IDs on server 3, and odd IDs on server 4.

For this example, the normalized tables look like this. Note the extra column that indicates the server # is informative and isn’t queried like a normal column.

movie table

(server) movie_id title
0 0 Forrest Gump
1 1 Cast Away
1 2 Unbreakable

actor table

(server) actor_id name
2 0 Tom Hanks
2 1 Robin Wright

role table

(server) movie_id actor_id
3 0 0
4 0 1
3 1 0
4 2 1

A query that returns the names of actors that played a role in Forrest Gump looks like this:

SELECT actor.name
FROM movie
INNER JOIN role ON movie.movie_id = role.movie_id
INNER JOIN actor ON role.actor_id = actor.actor_id
WHERE title = "Forrest Gump";
name
“Tom Hanks”
“Robin Wright”

This query requires information from 4 of the 5 database servers (0, 2, 3, 4). Contrast this with a denormalized design using 1 table that includes movies, roles and actors. We’ll use 5 servers again, this time allocated by movie title. Titles A-B on server 0, C-E on server 1, F-J on server 2, K-M on server 3, and N-Z on server 4.

denormalized

(server) movie_id title actor_id name
2 0 Forrest Gump 0 Tom Hanks
2 0 Forrest Gump 1 Robin Wright
1 1 Cast Away 0 Tom Hanks
4 2 Unbreakable 1 Robin Wright Penn

NOTE: When data is denormalized like this, it becomes possible for the same actor to have a different name for each role, as was the case for Robin Wright Penn in Unbreakable. This could be intentional in cases where we wish to represent the actor’s name at the time they had the role, or it could be a data consistency error that resulted from an update problem.

The SQL query for the names of actors in Forrest Gump becomes:

SELECT name
FROM denormalized
WHERE title = "Forrest Gump";
name
“Tom Hanks”
“Robin Wright”

All the data needed to answer this query comes from server 2.

Virtualization

Let’s look at how to map these tables to a virtual graph. The goal will be to create a graph like this:

This figure includes both names for actor:1. As mentioned, this mapping is only possible with the denormalized table. Later we’ll consider an alternative mapping where each node has a single name property.

The mappings, in SMS syntax, for these tables are:

Normalized Mappings

@prefix : <http://example.com/> .
@prefix sm: <tag:stardog:api:mapping:> .

:Movie-{movie_id} :title "{title}" ;
    sm:map [
        sm:table "movie"
    ] .

:Actor-{actor_id} :name "{name}" ;
    sm:map [
        sm:table "actor"
    ] .
   
:Movie-{movie_id} :actor :Actor-{actor_id} ;
    sm:map [
        sm:table "role"
    ] .

Denormalized Mappings

@prefix : <http://example.com/> .
@prefix sm: <tag:stardog:api:mapping:> .

:Movie-{movie_id} :title "{title}" ;
    :actor :Actor-{actor_id} ;
    sm:map [
        sm:table "denormalized"
    ] .

:Actor-{actor_id} :name "{name}" ;
    sm:map [
        sm:table "denormalized"
    ] .

The SPARQL for our Forrest Gump actors query is:

Normalized

select ?name {
    graph <virtual://normalized> {
        ?movie :title "Forrest Gump" ;
            :actor ?actor .
        ?actor :name ?name .
    }
}
name
“Tom Hanks”
“Robin Wright”

Denormalized

select distinct ?name {
    graph <virtual://denormalized> {
        ?movie :title "Forrest Gump" ;
            :actor ?actor .
        ?actor :name ?name .
    }
}
name
“Tom Hanks”
“Robin Wright”
“Robin Wright Penn”

You might not expect the last record from the denormalized query given there is no row in the source table with both “Forrest Gump” and “Robin Wright Penn” when the query seems to explicitly constrain the results to those actors from “Forrest Gump”. Why does this happen?

Let’s break down the query. It consists of three triple patterns. Each triple pattern matches a set of solutions:

Triple Pattern #1

?movie :title "Forrest Gump"
movie
:Movie-0

Triple Pattern #2

?movie :actor ?actor
movie actor
:Movie-0 Actor-0
:Movie-0 Actor-1
:Movie-1 Actor-0
:Movie-2 Actor-1

Triple Pattern #3

?actor :name ?name
actor name
:Actor-0 “Tom Hanks”
:Actor-1 “Robin Wright”
:Actor-1 “Robin Wright Penn”

Each of these solution sets are joined together. The order of the joining doesn’t matter. Let’s start by joining the first and second solution sets by their common binding, which is ?movie. The result is:

movie actor
:Movie-0 Actor-0
:Movie-0 Actor-1

The third and fourth solutions from the second triple pattern {(movie=:Movie-1, actor=Actor-0), (movie=:Movie-2, actor=Actor-1)} are dropped because there are no solutions from the first pattern that include movies :Movie-1 or :Movie-2. The third set of solutions is joined with the first two by their common binding, actor. The final result is:

movie actor name
:Movie-0 :Actor-0 “Tom Hanks”
:Movie-0 :Actor-1 “Robin Wright”
:Movie-0 :Actor-1 “Robin Wright Penn”

As a side note, this last last examples illustrates what logically happens. In practice database engines employ an optimizer that finds more efficient ways to build these solutions.

Scoping Queries to the Data of Interest

To avoid returning every name associated with an actor when working with denormalized data, let’s change our query from asking “What are the names of the actors that acted in Forrest Gump?” to asking “What were the names of the actors when they acted in Forrest Gump?“. To express this in SPARQL, we need a new node that represents acting in a movie. This node is analogous to the role table in the normalized model, so we’ll create a :Role node for that. The new graph becomes:

By hanging the actor name off the role node, we permit actors to have different names, yet we can formulate queries that scope the name to a specific role. Also note that we retain the Actor node, albeit without properties, so we can write Kevin Bacon-style queries that link movies by their common actors.

The SMS for this scoped mapping looks like this:

Scoped

@prefix : <http://example.com/> .
@prefix sm: <tag:stardog:api:mapping:> .

:Movie-{movie_id} :title "{title}" ;
    :role :Role-{movie_id}-{actor_id} ;
    sm:map [
        sm:table "denormalized"
    ] .

:Role-{movie_id}-{actor_id} :name "{name}" ;
    :actor :Actor-{actor_id} ;
    sm:map [
        sm:table "denormalized"
    ] .

The SPARQL for our Forrest Gump actors query becomes:

select distinct ?name {
    graph <virtual://scoped> {
        ?movie :title "Forrest Gump" ;
            :role ?role .
        ?role :name ?name .
    }
}
name
“Tom Hanks”
“Robin Wright”

If, for some reason, you wanted to know all the names of the actors from Forrest Gump, you can still achieve the results from the denormalized mappings by changing the query:

select distinct ?name {
    graph <virtual://scoped> {
        ?movie :title "Forrest Gump" ;
            :role ?role .
        ?role :actor ?actor .
        ?allRoles :actor ?actor ;
            :name ?name .
    }
}
name
“Tom Hanks”
“Robin Wright”
“Robin Wright Penn”

Conclusion

In this post we showed how the choice of mappings can affect query performance. Generally the main consideration in data modeling is how you wish to query your data. This example illustrates how a design decision in the source system (to denormalize data) can influence the choice of data model in the Knowledge Graph as well.

There’s another option for dealing with denormalized data that will be the subject of a future blog post, but I’ll briefly mention it here. Suppose your organization wishes to identify a canonical name for each actor. If this canonical name is not represented in any of the enterprise data stores, you can choose to store this name natively in Stardog. There are important data stewardship considerations with such a decision.


Top