Stardog is the world’s leading Knowledge Graph platform for the Enterprise. Stardog makes it fast and easy to turn enterprise data into knowledge.
Check out the Quick Start Guide to get Stardog installed and running in five easy steps.
Introduction
Stardog 7.4.5 (16 Dec 2020) supports the RDF graph data model; SPARQL query language; RDF*/SPARQL* extensions for Edge Properties; OWL 2 and user-defined rules for inference and data analytics; virtual graphs; geospatial query answering; and programmatic interaction via several languages and network interfaces.
Stardog is made by hand with skill, taste, and a point of view by people who care. 🌟🐶
To learn more about where we’ve been and where we’re headed, consult the Stardog Release Notes and the Stardog Studio Release Notes.
Downloading
Visit our website for more information on how to get started with Stardog. There is also information in the Getting Stardog section of the docs.
Requesting Support
Please use the appropriate channel to request support: customers should file a support ticket or use another dedicated support channel; everyone else should use Stardog Community.
When reporting an issue, please include the following information:
- A complete description of the problem you are having.
- The zip file created when running stardog-admin diagnostics report. If your question is related to a particular database, please include the .metadata file for that database as well. If you are unable to run this command, please include the stardog.log file.
- Other information as you are able and seems relevant:
  - Approximately when you hit this issue (so we can reference it in the logs)
  - Your Stardog version
  - Your operating system and any other system info
  - Which JVM you are using
Enterprise Support
Real-time Support
Get access to the core Stardog development team in real-time via voice or chat. Let us help you get the most from Stardog, 24/7. Our core team has more semantic graph application and tool development experience than any other team on the planet. Other vendors shunt you off to inexperienced level one techs. We’ve never done that and never will.
Private Maven Repositories
See Using Maven for details; this includes a per-customer, private Maven repository, CDN-powered, for 24/7 builds, updates, and feature releases.
We’re also tying Maven and Docker together, providing private Docker repositories for customers, which allows us to build out clusters, custom configurations, best practices, and devops tips-and-tricks into custom Docker images…so that you don’t have to.
Private Docker Repositories
Docker-based deliverables not only shorten your development and devops cycles but also help create seamless integration with your Kubernetes deployments. With Enterprise Support you can get the latest and greatest versions of Stardog, including security and performance hot fixes, as well as pin version numbers to those tested in your production deployments.
Previous versions of Stardog Docker images are tagged by version and available on Artifactory. To access those images you must use your credentials and first log in to Artifactory, after which you can pull any available image:
$ docker login -u <username> stardog-eps-docker.jfrog.io
$ docker pull stardog-eps-docker.jfrog.io/stardog:<version>
Priority Bug Fixes
With Maven and Docker in place, we’ve got a software delivery mechanism ready to push priority bug fixes into your enterprise as soon as they’re ready. We’ve averaged one Stardog release every two weeks since 2012. Enterprise Premium Support customers can now take advantage of our development pace in a controlled fashion.
Priority Feature Releases
We hate holding new features in a feature branch, especially for mundane business reasons; we want to release new stuff as soon as possible to our customers. With Enterprise Premium Support, we can maintain a disruptive pace of innovation without disrupting you.
Quick Start Guide
Requirements
Stardog 7.3+ is tested on Java versions 8 and 11 and requires sun.misc.Unsafe. Note that Stardog does not run on any other versions of Java. To check your version of Java, run java -version from the command line.
Java 8 and 11 can be downloaded from Oracle, which requires creating an account. Alternatively, you can use a version from OpenJDK.
Stardog is verified to run on Ubuntu 16.04 and 18.04, RHEL 7 and CentOS 7, Amazon Linux 2,
recent versions of OSX, and Microsoft Windows Server 2019.
Insecurity
We optimize Stardog out-of-the-box for ease and simplicity. You must take additional steps to secure it before production deployment - see the Security section for more detail.
Stardog ships with an insecure but usable default setting: the super user is admin and the admin password is "admin".
Getting Stardog
Stardog is available via Wget, Package Managers, Homebrew, and Docker. Once you have Stardog available, continue to Starting Stardog.
Upgrading to Stardog 7
If you are upgrading to Stardog 7 from any previous version, please see Migrating to Stardog 7 for details. Stardog 7 uses a completely new disk index format and all databases created with a previous version of Stardog must be migrated.
Wget
To download via Wget, use the following commands:
wget https://downloads.stardog.com/stardog/stardog-latest.zip
unzip stardog-latest.zip
# Stardog binaries are now located at ./stardog-<version>/bin
Homebrew
To download via Homebrew, use the following command:
brew install stardog-union/tap/stardog
Package Managers
If using a package manager, download via the Debian instructions or the RPM instructions and then be sure to follow the Package Layout configuration instructions.
Debian Based Systems
To install Stardog using apt-get run the following commands:
curl http://packages.stardog.com/stardog.gpg.pub | apt-key add -
echo "deb http://packages.stardog.com/deb/ stable main" >> /etc/apt/sources.list
apt-get update
apt-get install -y stardog[=<version>]
This will first add the Stardog GPG key to the system and then fetch and install the latest Stardog deb package.
RPM Based Systems
To install Stardog using yum run the following commands:
curl http://packages.stardog.com/rpms/stardog.repo > /etc/yum.repos.d/stardog.repo
yum install -y stardog[-<version>]
Amazon EC2
Certain Amazon EC2 instances do not let you redirect output into /etc/yum.repos.d as specified above. On such instances you can install Stardog like so:
sudo yum-config-manager --add-repo http://packages.stardog.com/rpms/stardog.repo
sudo yum-config-manager --enable stardog
yum install -y stardog[-<version>]
Package Layout
The packages require that OpenJDK 8 and all of its dependencies are installed on the system; the package managers will install them if they are not already there. Stardog is then configured to start on boot via systemd, so it can be controlled with the systemctl tool as shown below:
systemctl start stardog
systemctl restart stardog
systemctl stop stardog
To customize the environment in which Stardog runs, the file /etc/stardog.env.sh can be edited with key-value pairs, for example:
export STARDOG_HOME=/var/opt/stardog
export STARDOG_SERVER_JAVA_ARGS="-Xmx8g -Xms8g -XX:MaxDirectMemorySize=2g"
Note: If your system does not control services with systemd, you can still install Stardog with these packages; however, you must configure and run it in some other way. Altering the file /etc/stardog.env.sh will have no effect.
Docker
The latest release of Stardog is available on Docker Hub.
You can pull the image from Docker Hub with:
$ docker pull stardog/stardog:latest
Stardog home is located in /var/opt/stardog/ in the Docker image. Because stardog-admin server start is the entry point for the image, you must instruct Docker to mount a home directory with a valid license from your host machine at /var/opt/stardog in the image. For example:
$ docker run -it -v ~/stardog-home/:/var/opt/stardog -p 5820:5820 stardog/stardog
In this example, ~/stardog-home/ is a Stardog home directory that only contains a Stardog license file. /var/opt/stardog is the location of Stardog home in the Docker image. The contents of the release zip (binaries, docs, helm charts) are located in /opt/stardog/.
We also use the -p flag to map port 5820 on the container (the default Stardog port) to port 5820 on localhost for easy communication with the server.
You can change the default JVM memory settings for Stardog by setting the STARDOG_SERVER_JAVA_ARGS environment variable:
$ docker run -v ~/stardog-home/:/var/opt/stardog -p 5820:5820 -e STARDOG_SERVER_JAVA_ARGS="-Xmx8g -Xms8g -XX:MaxDirectMemorySize=2g" stardog/stardog
Windows Beta
As of version 7.3.0, Stardog 7 supports Windows as a beta release. To install on Windows, do the following:
- Download the zip file.
- Unzip the file and open a command prompt.
- Choose a location for STARDOG_HOME and set the environment variable:
  > set STARDOG_HOME=C:\Path\To\StardogHome
- In the unzipped Stardog distribution, go to the \bin\ subdirectory and run the following:
  > .\install-service.bat
At this point you can control Stardog via the Windows Services App.
Starting Stardog
Basic setup
Stardog Home
The most important piece of configuration to do before you start Stardog is setting the STARDOG_HOME environment variable. This is the directory where all the Stardog databases and other files will be stored. If STARDOG_HOME is not defined, Stardog will use the Java user.dir property value.
We recommend adding it to your ~/.bash_profile or, if you are using a package manager, /etc/stardog.env.sh.
You should not set STARDOG_HOME to be the same as the directory where you put the Stardog binary. Our convention is to put Stardog in /opt/stardog/{$version} and set STARDOG_HOME to /var/stardog.
If you are setting up Stardog for production or other serious usage, see Upgrading Stardog Server for additional guidance.
License Key
If you do not have a license key
You will be able to retrieve a trial license key via the command line once you start Stardog.
If you have a license key
Add it to your STARDOG_HOME. Ensure that the stardog-license-key.bin file is readable by the Stardog process.
$ cp stardog-license-key.bin $STARDOG_HOME
You can specify a different location for the license file by setting STARDOG_LICENSE_PATH.
Setting Path
Place the bin folder of the Stardog install on your PATH so the stardog and stardog-admin scripts can be used regardless of the current working directory.
We recommend adding it to your ~/.bash_profile (or, if you are using a package manager, /etc/stardog.env.sh), though you can also set it temporarily.
# This assumes you've followed our convention above. If you put Stardog somewhere else, update accordingly.
$ export PATH="$PATH:/opt/stardog-<version>/bin"
Running Stardog
The commands below assume you’ve followed the instructions in Setting Path. If you haven’t, make sure you are using the full Stardog path for commands, i.e. <stardog-location>/bin/stardog-admin.
- If you’re not using a package manager, run the following command to start the Stardog server. By default the server listens for HTTP connections on port 5820. If you do not have a license, this will begin a workflow to help you get one.
$ stardog-admin server start
- Create a database with some input data. If you don’t have any input data, you can download some sample music data to use.
$ stardog-admin db create -n myDB /path/to/some/data.ttl
- Query the database:
$ stardog query myDB "SELECT DISTINCT ?s WHERE { ?s ?p ?o } LIMIT 10"
- Download Stardog Studio and run the same query. You can connect to Studio using the http://localhost:5820 endpoint. If you have not secured your setup, you can log in as username admin and password admin. Otherwise, use one of your configured accounts.
Once you’ve done all four things, Stardog is up and running! If you’re not sure where to go from here, check out our Getting Started Guide.
Stardog Development Tools
Stardog Studio
Overview
Stardog Studio is Stardog’s IDE and administration tool designed to make Stardog
functionality easier to use for everyday users. Aside from administering Stardog clusters,
almost all functionality that exists through the CLI and other endpoints is available in Studio,
and most non-admin users are able to interact with Stardog using Studio
and without going to the command line.
Stardog Studio is available as a Desktop App
as well as in the browser.
Studio’s UI is split into the sections below. Note that this is not a comprehensive
list of functionality but a general summary of functionality in each section.
- Provenance:
  - Get a high-level overview of your knowledge graph
  - Visualize connections between data sources (including virtualized sources) and entities in your data
- Data Exploration:
  - Search and browse classes and properties in your databases
  - Explore connections in your data in a browser-like interface (with navigation history)
  - Visualize both inbound and outbound relationships
- Workspace:
  - Models:
    - Visualize your schema
    - Create and edit OWL and RDFS defined schemas in a form-based experience
    - Write, edit, and validate constraints
  - Virtual Graphs:
    - Create and configure virtual graphs and virtual graph mappings
  - Databases:
    - Manage databases
    - Update database properties and namespaces
    - See and kill running queries
    - Load and remove data
    - Visualize the schema
  - Security:
    - Create users and roles
    - Assign permissions to users and roles
  - Tutorials:
    - Interactive tutorials to help you use Stardog
For questions, comments, or feature requests, please post in the Studio section of the Stardog Community.
The Stardog blog includes posts about new Studio functionality. All of those posts are available under the Stardog Studio tag. Blog posts cover topics like SHACL support, Stored Queries, Query Plans, and Visualization.
Stardog Studio in the browser
Stardog Studio is available in the browser at http://stardog.studio. It is supported for the
latest versions of Firefox and Chrome (v69 and v77, respectively, as of this writing).
In-browser functionality is almost identical to the Desktop version aside from a few differences:
- Keyboard shortcuts that exist in the browser (e.g. cmd+o) will act on the browser. Those that don’t (e.g. cmd+E) will work as they do in the desktop version of Studio.
- Saving or running to a file will only prompt you to choose a location for the file if your browser preferences for downloaded files are set to require a prompt; otherwise, the file will automatically be written to your browser’s downloaded files location.
- You cannot save workspace tabs to your filesystem - note that cmd+s will be captured by the browser to save the browser tab. You can drag a file into Studio to load it, but you cannot save it back to that original file.
- Because the application menu is for the entire browser, there is no equivalent to Studio-specific menus like File that are in the Desktop version. Most of the operations are available via keyboard shortcuts or in the UI.
- Since certain keyboard shortcuts (e.g., Cmd/Ctrl+, to open preferences) are reserved by modern browsers, the shortcuts in the browser version of Studio typically differ from those in the desktop version with respect to the 'modifier' key (Ctrl, Alt, Cmd). For example, on Windows, Ctrl + , in the desktop version of Studio becomes Alt + , in the browser version, and on Mac, Cmd + , in the desktop version of Studio becomes Ctrl + , in the browser version.
No data is sent to our servers when you use the browser version of Studio. Requests are instead sent directly from your browser to the Stardog endpoint that you specify in the connection dialog, without any intermediary. (To illustrate the point: after loading http://stardog.studio, you could theoretically disconnect from the internet and still use Studio in the browser to interact with your Stardog endpoint. Aside from some telemetry data, nothing is sent from your browser to any server other than your Stardog endpoint while using Studio in the browser.) If your Stardog server is running with SSL enabled, you can use https://stardog.studio so that all information between your server and Studio is also encrypted.
Using Stardog Studio in the browser via Docker
The browser version of Stardog Studio is also available in a pre-configured Docker image via
DockerHub. Before you get started with the dockerized version of Studio in the browser,
you should get Docker (if you don’t already have it) and select a port on your
local machine for Studio to be available on in your browser (the steps below use port number 8888;
make sure to substitute whatever number you’re going to use, if it’s a different one).
To get the latest version of in-browser Studio in Docker, perform the following steps:
1. Open a command line terminal.
2. In the terminal, enter docker pull stardog/stardog-studio:current.
3. Once the command in step 2 completes, enter docker run --name=stardog-studio -p 8888:8080 -d stardog/stardog-studio:current. The first number (before the :) in the -p 8888:8080 argument should be the port number you chose before starting; it is the number you’ll use to access Studio in the next step. The --name=stardog-studio argument names the container “stardog-studio” so that you can easily reference it later; you could choose another name here, if you’d like.
4. When the command in step 3 completes successfully, you should see a long string ID printed out in the terminal (this is the ID of the running Docker container; you can ignore it for present purposes). You can now access Studio in your browser by going to http://localhost:8888 (again, substituting whatever port number you chose).
At this point, you can stop and start the container whenever you need it, running docker stop stardog-studio and docker start stardog-studio, respectively (using whatever name you provided in step 3, above). The Docker Daemon and the Studio container must be running to access in-browser Studio this way. If Studio is not accessible, please remember to start Docker, as it may not start automatically on startup.
To upgrade the dockerized version of Stardog Studio, simply open a terminal and run docker stop stardog-studio && docker rm stardog-studio (again, using whatever name you provided in step 3, above). Then, repeat steps 2 through 4, above.
Note that all in-browser versions of Studio (both at http://stardog.studio and in Docker) store user data (e.g. saved connections, query history, open workspace tabs) in your browser’s localStorage, so you must use the same browser to access your persisted user data.
Customizing Stardog Studio
Stardog Studio can be customized to have a dark or light theme. By default, Stardog Studio has a dark theme. To change between themes, open Studio’s preferences, update the option for "theme" to your preference ("dark" or "light"), and save.
Language extensions
In case you’d like to have Stardog Studio’s language intelligence available in other IDEs, we’ve also made Studio’s language servers freely available as Visual Studio Code Extensions and as unpackaged JavaScript modules. Instructions for installing them are available on github.
Logs
Studio has its own set of logs. They are available at different paths depending on the OS. If you are reaching out for support, please include the log file.
- MacOS: ~/Library/Logs/Stardog\ Studio/log.log
- Windows: %USERPROFILE%\AppData\Roaming\Stardog Studio\log.log
- Linux: ~/.config/Stardog\ Studio/log.log
- Browser: Log messages are written to the browser’s console. If reaching out for support, please share recent browser console messages.
Collecting usage data
To improve Stardog Studio, we collect anonymous usage data.
We only collect information like session duration, feature usage,
and the size of queries and results. We never collect the actual
content of queries or results. To opt-out, set telemetryConsent to "false" in your Preferences.
You can access your preferences under the Stardog Studio application menu. If you are using the in-browser version, use the following keyboard shortcuts:
- Mac: Ctrl + ,
- Windows/Linux: Meta + , (the meta key is likely alt or cmd)
Querying Stardog
Executing Queries
To execute a SPARQL query against a Stardog database with the CLI, use the query subcommand with a query string, a query file, or the name of a stored query:
$ stardog query myDb "select * where { ?s ?p ?o }"
Any SPARQL query type (SELECT, CONSTRUCT, DESCRIBE, PATHS, ASK, or any update query type) can be executed using this command.
Reasoning can be enabled by using the --reasoning flag (or -r for short):
$ stardog query --reasoning myDb "select * where { ?sub rdfs:subClassOf ?super }"
By default, all Stardog CLI commands assume the server is running on the same machine as the client using port 5820. But you can interact with a server running on another machine using a full connection string:
$ stardog query http://myHost:9090/myDb "select * where { ?s ?p ?o }"
Detailed information on using the query command in Stardog can be found in the man page. See the Managing Stored Queries section for configuration, usage, and details of stored queries.
Path Queries
Stardog extends SPARQL for path queries which can be used to find paths between two nodes in a graph. Path queries are similar to SPARQL property paths that recursively traverse a graph and find two nodes connected via a complex path of edges. But SPARQL property paths only return the start and end nodes of a path. Stardog path queries return all the intermediate nodes on the path and allow arbitrary SPARQL patterns to be used in the query.
Here’s a simple path query to find how Alice and Charlie are connected to each other:
$ stardog query exampleDB "PATHS START ?x = :Alice END ?y = :Charlie VIA ?p"
+----------+------------+----------+
| x | p | y |
+----------+------------+----------+
| :Alice | :knows | :Bob |
| :Bob | :worksWith | :Charlie |
| | | |
| :Alice | :worksWith | :Carol |
| :Carol | :knows | :Charlie |
+----------+------------+----------+
Query returned 2 paths in 00:00:00.056
Each row of the result table shows one edge. Adjacent edges are printed on subsequent rows of the table. Multiple paths in the results are separated by an empty row.
Path queries by default return only the shortest paths. See the Path Queries chapter for details about finding different kinds of paths, e.g. all paths (not just shortest ones), paths between all nodes, and cyclic paths.
DESCRIBE Queries
SPARQL provides a DESCRIBE query type that returns a subgraph containing information about a resource:
DESCRIBE <theResource>
SPARQL’s DESCRIBE keyword is deliberately underspecified. In Stardog, by default, a DESCRIBE query retrieves all the triples for which <theResource> is the subject. There are, of course, about seventeen thousand other ways to implement DESCRIBE. Starting with Stardog 5.3, we are providing two additional describe strategies out of the box. The desired describe strategy can be selected by using a special query hint.
For example, the following query will return all the triples where theResource is either the subject or the object:
#pragma describe.strategy bidirectional
DESCRIBE <theResource>
The other built-in describe strategy returns the CBD - Concise Bounded Description of the given resource:
#pragma describe.strategy cbd
DESCRIBE <theResource>
The default describe strategy can be changed by setting the query.describe.strategy database configuration option. Finally, it is also possible to implement a custom describe strategy by implementing a simple Java interface. An example can be found in the stardog-examples repo.
Federated Queries
Stardog supports the SERVICE keyword which allows users to query distributed RDF via SPARQL-compliant data sources. You can use this to federate queries between several Stardog databases or Stardog and other public endpoints.
You can also use service variables in your queries to dynamically select the endpoints for federated queries, for example:
{
?service a :MyService .
SERVICE ?service { ... }
}
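For reference, a complete federated query against a fixed endpoint might look like the following sketch. The DBpedia endpoint and the LIMIT are purely illustrative, and the rdfs: prefix is assumed to be available via Stardog’s stored namespaces; any SPARQL-protocol-compliant endpoint could be used instead:
SELECT ?s ?label {
  ?s ?p ?o .
  SERVICE <https://dbpedia.org/sparql> {
    ?s rdfs:label ?label
  }
}
LIMIT 10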
Stardog ships with a default Service implementation which uses the SPARQL Protocol to send the service fragment to the remote endpoint and retrieve the results. Any endpoint that conforms to the SPARQL protocol can be used.
The Stardog SPARQL endpoint is http://<server>:<port>/{db}/query.
HTTP Authentication
Stardog requires authentication. If the endpoint you’re referencing with the SERVICE keyword requires HTTP authentication, credentials are stored in a password file called services.sdpass located in the STARDOG_HOME directory. The default Service implementation assumes HTTP BASIC authentication; for services that use DIGEST auth, or a different authentication mechanism altogether, you’ll need to implement a custom Service implementation.
Querying Local Databases
Stardog contains a specialized service implementation that lets users query other databases stored in the server without going through HTTP. The user executing the query will still be authenticated, just via Stardog authentication. In other words, the user executing the query must have proper permissions to read from the database they are attempting to query. The URI following the SERVICE keyword must begin with db:// followed by the database name. Here’s an example querying a database named "books":
SELECT * { SERVICE <db://books> { ?s ?p ?o } }
Namespaces
Stardog allows users to store and manage custom namespace prefix bindings for each database. These stored namespaces allow users to omit prefix declarations in Turtle files and SPARQL queries. Namespace Prefix Bindings section describes how to manage these namespace prefixes in detail.
Stored namespaces allow one to use Stardog without declaring a single namespace prefix. Stardog will use its default namespace (http://api.stardog.com/) behind the scenes so that everything will still be valid RDF, but users won’t need to deal with namespaces manually. Stardog will act as if there are no namespaces, which in some cases is exactly what you want!
For example, let’s assume we have some data that does not contain any namespace declarations:
:Alice a :Person ;
:knows :Bob .
We can create a database using this file directly:
$ stardog-admin db create -n mydb data.ttl
We can also add this file to the database after it is created. After the data is loaded, we can then execute SPARQL queries without prefix declarations:
$ stardog query mydb "SELECT * { ?person a :Person }"
+--------+
| person |
+--------+
| :Alice |
+--------+
Query returned 1 results in 00:00:00.111
Note: Once we export the data from this database, the default (i.e., built-in) prefix declarations will be printed, but otherwise we will get the same serialization as in the original data file:
$ stardog data export mydb
@prefix : <http://api.stardog.com/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix stardog: <tag:stardog:api:> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:Alice a :Person ;
:knows :Bob .
Query Functions
Stardog supports all of the functions from the SPARQL spec, as well as some others from XPath and SWRL. See SPARQL Query Functions for a complete list of built-in functions supported.
Any of the supported functions can be used in queries or rules. Note that some functions appear in multiple namespaces, but using any of the namespaces will work. Namespaces can be omitted when calling functions too.[1]
XPath comparison and arithmetic operators on duration, date, and time values are supported by overloading the corresponding SPARQL operators such as =, >, +, -, etc.
In addition to the built-in functions, new functions can be defined by assigning a new name to a SPARQL expression. These function definitions can either be defined inline in a query or stored in the system and used in any query or rule. Finally, custom function implementations can be implemented in a JVM-compatible language and registered in the system. See the query functions section for more details.
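For instance, the following sketch applies two standard SPARQL string functions in a SELECT query; the :name property used here is purely hypothetical:
SELECT ?person (UCASE(?name) AS ?upper) (STRLEN(?name) AS ?length) {
  ?person :name ?name
}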
Special Named Graphs
Stardog includes aliases for several commonly used sets of named graphs. These non-standard extensions are provided for convenience and can be used wherever named graph IRIs are expected in a SPARQL query but these graphs are read-only and cannot be updated. Following is a list of special named graph IRIs.
Named Graph IRI | Refers to
tag:stardog:api:context:default | the default (no) context graph
tag:stardog:api:context:named | all named graphs, excluding the default graph
tag:stardog:api:context:local | all local graphs - the default graph and named graphs
tag:stardog:api:context:virtual | all virtual graphs (applicable when used with Virtual Transparency)
tag:stardog:api:context:all | all local graphs. If Virtual Transparency is enabled, all virtual graphs as well.
Named Graph Aliases
As of version 7.4.5, Stardog enables users to create aliases for named graph IRIs appearing in the data. The aliases can be used in SPARQL queries and provide a layer of abstraction between the queries or applications and the data. In particular, queries can run against different graphs — local or virtual — when the alias definitions are changed. Importantly, neither the queries themselves nor the relevant HTTP parameters defining the query dataset need to change. That helps make data changes transparent to consumers (applications). This is best illustrated by an example.
A common data cleansing scenario involves data being imported into a staging graph (call it :staging), preprocessed (for example, validated using SHACL, cleaned, augmented, etc.), and then moved to a graph visible to currently deployed applications (call it :production). When the data is ready, it needs to be made available to applications. Prior to 7.4.5 this could be done in two ways: 1) by moving it from :staging to :production via SPARQL Update or 2) by changing all query requests from :production to :staging. Both approaches have rather obvious shortcomings.
Named graph aliases rectify the problem by letting users declare :production as an alias which can be pointed to :staging as soon as the data is ready. That requires neither data movement nor changes on the query or application level.
To use named graph aliases one must first set the graph.aliases database property to true. This can be done at database creation time or later.
Querying Aliases
Named graph aliases are IRIs which currently can appear after the FROM and FROM NAMED keywords in read queries, as well as after the USING and USING NAMED keywords in DELETE/INSERT/WHERE queries, for example:
select ?person ?name from :graph {
?person foaf:name ?name
}
or
insert { ?person a :Person } using :graph where {
?person foaf:name ?name
}
Assuming :graph is an alias for :g, Stardog will replace :graph with :g before processing the query. Although this is already quite powerful, named graph aliases are not restricted to this simple use case and generalize it in two ways. First, an IRI can be an alias for a set of graphs in the data, not just one graph. Second, special graphs as well as virtual graphs can be used in alias definitions just like regular graphs.
Adding and Updating Aliases
Named graph aliases are defined on a per-database basis and the definitions are stored in the data as triples. The schema consists of a single predicate <tag:stardog:api:graph:alias> whose domain is the aliases and whose range is the actual graphs in the data. Alias definitions must be asserted in the special named graph <tag:stardog:api:graph:aliases>, as in the following TriG snippet:
<tag:stardog:api:graph:aliases> {
:graph <tag:stardog:api:graph:alias> :g1, :g2 .
}
The Stardog Java API provides a convenient mechanism for retrieving and updating graph aliases based on the com.complexible.stardog.query.GraphAliases interface available from the database’s Connection object. Under the hood it simply fetches and updates data in <tag:stardog:api:graph:aliases>. Note that every Connection gets its own snapshot of aliases which will not be affected by concurrent transactions updating aliases in the database.
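Because alias definitions are ordinary triples in that special graph, they can also be managed with SPARQL Update. A sketch for the staging/production scenario above (the :production and :staging IRIs are the hypothetical graphs from that example) repoints the alias in a single update request:
DELETE WHERE {
  GRAPH <tag:stardog:api:graph:aliases> { :production <tag:stardog:api:graph:alias> ?old }
};
INSERT DATA {
  GRAPH <tag:stardog:api:graph:aliases> { :production <tag:stardog:api:graph:alias> :staging }
}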
Integration with Other Features
Named graph aliases interact with several Stardog features, particularly Named Graph Security and virtual graphs. The latter is pretty straightforward: one can define an alias for any combination of local and virtual graphs, and the FROM or FROM NAMED statements for that alias will be replaced by those with the corresponding local and virtual graph IRIs. That will happen before the query engine starts any VG-specific processing of the query, like applying mappings, establishing a connection to the remote data source, etc.
As far as Named Graph Security is concerned, aliases behave like regular graphs. It is possible to define read and write permissions for an alias. If a user is allowed to read the graph :g, and :g happens to be an alias for :g1 union :g2, a query using :g in its dataset will be allowed to read :g1 and :g2. If a query uses :g1 or :g2 in its dataset directly, the user will need explicit permissions to access those graphs.
Limitations of Aliases
Named graph aliases have the following limitations: they cannot be used with the GRAPH keyword, in CONSTRUCT, INSERT, or UPDATE templates, or in ADD/DROP/CLEAR/COPY/MOVE queries. Aliases cannot be defined for other aliases. Some of these restrictions may be lifted in the future.
Obfuscating
When sharing sensitive RDF data with others, you might want to (selectively) obfuscate it so that sensitive bits are not present, but non-sensitive bits remain. For example, this feature can be used to submit Stardog bug reports using sensitive data.
Data obfuscation works much the same way as the export command and supports the same set of arguments:
$ stardog data obfuscate myDatabase obfDatabase.ttl
By default, all URIs, bnodes, and string literals in the database will be obfuscated using the SHA256 message digest algorithm. Non-string typed literals (numbers, dates, etc.) are left unchanged, as are URIs from built-in namespaces (RDF, RDFS, and OWL). It’s possible to customize obfuscation by providing a configuration file.
$ stardog data obfuscate --config obfConfig.ttl myDatabase obfDatabase.ttl
The configuration specifies which URIs and strings will be obfuscated by defining inclusion and exclusion filters. See the example configuration file in the stardog-examples Github repo.
Once the data is obfuscated, queries written against the original data will no longer work. Stardog provides query obfuscation capability, too, so that queries can be executed against the obfuscated data. If a custom configuration file is used to obfuscate the data, then the same configuration should be used for obfuscating the queries as well:
$ stardog query obfuscate --config obfConfig.ttl myDatabase myQuery.sparql > obfQuery.ttl
UNNEST Operator and Arrays
Stardog includes an UNNEST operator as a SPARQL extension. Similar to the BIND operator, UNNEST introduces new variable bindings as a result of evaluating an expression. The key difference is that UNNEST may produce more than one binding for each input solution. This is useful when dealing with arrays.
Arrays can be created with the set and split functions. The UNNEST operator allows transforming an array into a set of solutions. For example, consider the following query:
select ?person ?name {
?person :names ?csvNameString
UNNEST(split(?csvNameString, ",") as ?name)
}
If we match a triple which binds ?person to <urn:John> and ?csvNameString to "John,Johnny", the following solutions will be returned for the query:
?person | ?name
<urn:John> | "John"
<urn:John> | "Johnny"
If the array has no elements or evaluation of the source expressions produce an error, the target variable will be unbound.
UNNEST is governed by the same scope principles as BIND. Variables used in the expression must precede the UNNEST operator syntactically. References to the variable which is being assigned must occur syntactically after the UNNEST operator.
Plan Queries
The query plan returned from the query explain command can be executed with the query command in the same manner as SPARQL queries. The plan needs to be in a verbose format, which can be achieved with the --verbose flag:
$ stardog query explain --verbose myDB "SELECT DISTINCT ?s WHERE { ?s ?p ?o } LIMIT 10"
This will produce an output similar to this:
QueryPlan
Slice(offset=0, limit=10)
`─ Distinct
`─ Projection(?s)
`─ Scan[S](?s, ?p, ?o)
Assuming the output is saved in a file named query.plan, the following is equivalent to running the original query:
$ stardog query myDB query.plan
Editing this plan directly can help with performance debugging and fine-tuning - see Query Plan Syntax for details on the query plan format. This is currently supported in the CLI, Java API, and HTTP API.
Administering Stardog
In this chapter we describe the administration of Stardog Server and Stardog databases, including command-line programs, configuration options, etc.
Security is an important part of Stardog administration; it’s discussed separately (Security).
Command Line Interface
Stardog’s command-line interface (CLI) comes in two parts:
- stardog-admin: the administrative client
- stardog: the user client
The admin and user tools operate on local or remote databases using the HTTP protocol. These CLI tools are Unix-only and self-documenting, and the help output of these tools is their canonical documentation.[2]
Help
To use the Stardog CLI tools, you can start by asking them to display help:
$ stardog help
Or:
$ stardog-admin help
These work too:
$ stardog
$ stardog-admin
Security Considerations
We divide administrative functionality into two CLI programs for reasons of security: stardog-admin will need, in production environments, to have considerably tighter access restrictions than stardog.
Caution: For usability, Stardog provides a default user "admin" and password "admin" in stardog-admin commands if no user or password are given. This is insecure; before any serious use of Stardog is contemplated, read the Security section at least twice, and then—minimally—change the administrative password to something we haven’t published on the interwebs!
Command Groups
The CLI tools use "command groups" to make CLI subcommands easier to find. To print help for a particular command group, just ask for help:
$ stardog help [command_group_name]
Note: See the man pages for the canonical list of commands.
The main help command for either CLI tool will print a listing of the command groups:
usage: stardog <command> [ <args> ]

The most commonly used stardog commands are:
    data        Commands which can modify or dump the contents of a database
    help        Display help information
    icv         Commands for working with Stardog Integrity Constraint support
    namespace   Commands which work with the namespaces defined for a database
    query       Commands which query a Stardog database
    reasoning   Commands which use the reasoning capabilities of a Stardog database
    version     Prints information about this version of Stardog

See 'stardog help' for more information on a specific command.
To get more information about a particular command, simply issue the help command for it including its command group:
$ stardog help query execute
Finally, everything here about command groups, commands, and online help works for stardog-admin, too:
$ stardog reasoning consistency -u myUsername -p myPassword -r myDB
$ stardog-admin db migrate -u myUsername -p myPassword myDb
Autocomplete
Stardog also supports CLI autocomplete via bash autocompletion. To install autocomplete for the bash shell, you’ll first want to make sure bash completion is installed:
Homebrew
To install:
$ brew install bash-completion
To enable, edit .bash_profile:
if [ -f `brew --prefix`/etc/bash_completion ]; then
. `brew --prefix`/etc/bash_completion
fi
MacPorts
First, you really should be using Homebrew…ya heard?
If not, then:
$ sudo port install bash-completion
Then, edit .bash_profile:
if [ -f /opt/local/etc/bash_completion ]; then
. /opt/local/etc/bash_completion
fi
Fedora
$ sudo yum install bash-completion
All Platforms
Now put the Stardog autocomplete script—stardog-completion.sh—into your bash_completion.d directory, typically one of /etc/bash_completion.d, /usr/local/etc/bash_completion.d, or ~/bash_completion.d.
Alternately, you can put it anywhere you want, but tell .bash_profile about it:
source ~/.stardog-completion.sh
How to Make a Connection String
You need to know how to make a connection string to talk to a Stardog database. A connection string may consist solely of the database name in cases where
- Stardog is listening on the standard port 5820; and
- the command is invoked on the same machine where the server is running.
In other cases, a "fully qualified" connection string, as described below, is required.
Further, the connection string is now assumed to be the first argument of any command that requires a connection string. Some CLI subcommands require a Stardog connection string as an argument to identify the server and database upon which operations are to be performed.
Connection strings are URLs and may either be local to the machine where the CLI is run or they may be on some other remote machine. Stardog connection strings use the http:// protocol scheme.
Example Connection Strings
To make a connection string, you need to know the machine name and the port Stardog Server is running on and the name of the database:
{scheme}{machineName}:{port}/{databaseName};{connectionOptions}
Here are some example connection strings:
http://server/billion-triples-punk
http://localhost:5000/myDatabase
http://169.175.100.5:1111/myOtherDatabase;reasoning=true
Using the default port for Stardog’s use of HTTP protocol simplifies connection strings. connectionOptions are a series of ;-delimited key-value pairs which themselves are =-delimited. Key names must be lowercase and their values are case-sensitive.
Server Admin
Stardog Server supports all the administrative functions over the HTTP protocol.
Upgrading Stardog Server
The process of installation is pretty simple; see the Quick Start Guide for details. But how do we easily upgrade between versions? The key is judicious use of STARDOG_HOME. Best practice is to keep installation directories for different versions separate and use a STARDOG_HOME in another location for storing databases.[3] Once you set your STARDOG_HOME environment variable to point to this directory, you can simply stop the old version and start the new version without copying or moving any files. You can also specify the home directory using the --home option when starting the server.
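As a sketch of that practice (the version numbers and paths here are only illustrative, following the /opt/stardog/{$version} and /var/stardog convention described earlier):
$ export STARDOG_HOME=/var/stardog
$ /opt/stardog/7.4.4/bin/stardog-admin server stop
$ /opt/stardog/7.4.5/bin/stardog-admin server start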
Server Security
See the Security section for information about Stardog’s security system, secure deployment patterns, and more.
Configuring Stardog Server
Note: The properties described in this section control the behavior of the Stardog Server; to set properties or other metadata on individual Stardog databases, see Database Admin.
Stardog Server’s behavior can be configured via the JVM argument stardog.home, which sets Stardog Home, overriding the value of STARDOG_HOME set as an environment variable. Stardog Server’s behavior can also be configured via a stardog.properties file (a Java Properties file) in STARDOG_HOME. To specify another location for the stardog.properties file, you can set the STARDOG_PROPERTIES environment variable.
For most server properties to take effect in a running Stardog Server, it is necessary to restart it. However, certain properties (e.g. some of the LDAP properties) are updatable and can be mutated without a restart using the property-set admin command, which updates a property (on all Stardog servers in a cluster) and saves the change to the stardog.properties file. Use the property-get command to see a list of all set (updatable and otherwise) server properties.
Configuring Temporary ("Scratch") Space
Stardog uses the value of the JVM argument java.io.tmpdir to write temporary files for many different operations. If you want to configure temp space to use a particular disk volume or partition, use the java.io.tmpdir JVM argument on Stardog startup.
Bad (or, at least, weird) things are guaranteed to happen if this part of the filesystem runs out of (or even runs low on) free disk space. Stardog will delete temporary files when they’re no longer needed, but Stardog admins should configure their monitoring systems to make sure that free disk space is always available, both on java.io.tmpdir and on the disk volume that hosts STARDOG_HOME.[4]
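For example, one way to point temp space at a dedicated scratch volume is through the STARDOG_SERVER_JAVA_ARGS variable shown earlier; the /mnt/stardog-scratch path is only an illustration:
export STARDOG_SERVER_JAVA_ARGS="-Djava.io.tmpdir=/mnt/stardog-scratch -Xmx8g -Xms8g -XX:MaxDirectMemorySize=2g"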
Stardog Configuration
The following twiddly knobs for Stardog Server are available in stardog.properties (a brief example file follows this list):[5]
- query.all.graphs: If a SPARQL query does not use FROM or FROM NAMED keywords, this option defines what data it is evaluated against; if true, it will run over the default graph and the union of all named graphs; if false (the default), it will run only over the default graph.
- query.pp.contexts: Controls how property paths interact with named graphs in the data. When set to true and the property path pattern is in the default scope (i.e. not inside a graph keyword), Stardog will check that paths do not span multiple named graphs (per 18.1.7). For this to affect query results, either there should be multiple FROM clauses or query.all.graphs must also be set to true.
- query.timeout: Sets the upper bound for query execution time that’s inherited by all databases unless explicitly overridden. See the Managing Query Performance section below for details.
- logging.[access,audit].[enabled,type,file]: Controls whether and how Stardog logs server events; described in detail below.
- logging.slow_query.enabled, logging.slow_query.time, logging.slow_query.type: The three slow query logging options are used in the following way. To enable logging of slow queries, set enabled to true. To define what counts as a "slow" query, set time to a time duration value (positive integer plus "h", "m", "s", or "ms" for hours, minutes, seconds, or milliseconds respectively). To set the type of logging, set type to text (the default) or binary. A logging.slow_query.time that exceeds the value of query.timeout will result in empty log entries.
- http.max.request.parameters: Default is 1024; any value smaller than Integer.MAX_VALUE may be provided. Useful if you have lots of named graphs and are at risk of exceeding the value of http.max.request.parameters.
- database.connection.timeout: The amount of time a connection to the database can be open, but inactive, before being automatically closed to reclaim the resources. The timeout values specified in the property file should be a positive integer followed by either the letter h (for hours), m (for minutes), s (for seconds), or the letters ms (for milliseconds). Example intervals: 1h for 1 hour, 5m for 5 minutes, 90s for 90 seconds, 500ms for 500 milliseconds. The default value is 1h. NOTE: setting a short timeout can have adverse results, especially if updates are being performed without committing changes to the server, as the connection may be closed prematurely while still in use.
- password.length.min: Sets the password policy for the minimum length of user passwords; the value can’t be lower than password.length.min or greater than password.length.max. Default: 4.
- password.length.max: Sets the password policy for the maximum length of user passwords. Default: 1024.
- password.regex: Sets the password policy of accepted characters in user passwords, via a Java regular expression. Default: [\w@#$%!&]+
- security.named.graphs: Turns named graph security on globally. Default: false.
- spatial.use.jts: Enables support for JTS in the geospatial module. Default: false.
- spilling.max.file.length: When Stardog cannot handle an operation in memory, it spills data to disk. This property controls the maximum size of a single file Stardog will spill data to. A query can spill to multiple files and each is bound by the value set here. NOTE: this will only take effect if the server property memory.management is also enabled, which it is by default. Default: 10G.
- Additional properties related to the BI Server. See Configuring the BI Server.
- Additional properties related to LDAP. See Configuring LDAP.
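As a minimal sketch, a stardog.properties file using a few of the options above might look like this; the values shown are illustrative only:
query.all.graphs=true
query.timeout=5m
database.connection.timeout=30m
password.length.min=8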
Starting & Stopping the Server
Note: Unlike the other stardog-admin subcommands, starting the server may only be run locally, i.e., on the same machine the Stardog Server will run on.
The simplest way to start the server—running on the default port, detaching to run as a daemon, and writing stardog.log to the current working directory—is:
$ stardog-admin server start
To specify parameters:
$ stardog-admin server start --require-ssl --port=8080
The port can be specified using the --port option.
To shut the server down:
$ stardog-admin server stop
If you started Stardog on a port other than the default, or want to shut down a remote server, you can simply use the --server option to specify the location of the server to shut down.
By default Stardog will bind its server to 0.0.0.0. You can specify a different network interface for Stardog to bind to using the --bind option of server start.
Server Monitoring
Stardog provides server monitoring via the Metrics library. In addition to providing some basic JVM information, Stardog also exports information about the Stardog DBMS configuration as well as stats for all databases within the system, such as the total number of open connections, size, and average query time.
Accessing Monitoring Information
Monitoring information is available via the Java API, the HTTP API, the CLI, or (if configured) the JMX interface. Performing a GET on /admin/status will return a JSON object containing the information available about the server and all the databases. This information is also available for Prometheus via /admin/status/prometheus, allowing Prometheus servers to scrape Stardog directly. The endpoint DB/status will return the monitoring information about the database status. The stardog-admin server status command will print a subset of this information on the console.
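For example, assuming the default port and the default admin credentials, the HTTP endpoints above can be checked with curl:
$ curl -u admin:admin http://localhost:5820/admin/status
$ curl -u admin:admin http://localhost:5820/admin/status/prometheus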
Configuring JMX Monitoring
By default, JMX monitoring is not enabled. You can enable it by setting metrics.reporter=jmx in the stardog.properties file. Then you can simply use a tool like VisualVM or JConsole to attach to the process running the JVM, or connect directly to the JMX server.
If you want to connect to the JMX server remotely you need to set metrics.jmx.remote.access=true in stardog.properties. Stardog will bind an RMI server for remote access on port 5833. If you want to change the port Stardog binds the remote server to, you can set the property metrics.jmx.port in stardog.properties.
Finally, if you wish to disable monitoring completely, set metrics.enabled to false in stardog.properties.
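Putting those settings together, a stardog.properties sketch that enables remote JMX monitoring might look like this (5833 is the default RMI port mentioned above):
metrics.enabled=true
metrics.reporter=jmx
metrics.jmx.remote.access=true
metrics.jmx.port=5833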
Locking Stardog Home
Stardog Server will lock STARDOG_HOME when it starts to prevent synchronization errors and other nasties if you start more than one Stardog Server with the same STARDOG_HOME. If you need to run more than one Stardog Server instance, choose a different STARDOG_HOME or pass a different value to --home.
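For example, a second server instance on the same machine could be started against its own home directory and a different port; the path and port here are only illustrative:
$ stardog-admin server start --home /var/stardog2 --port 5821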
Access & Audit Logging
See the exemplar stardog.properties file for a complete discussion of how access and audit logging work in Stardog Server. Audit logging is a superset of the events in access logging. Access logging covers the most often required logging events; you should consider enabling audit logging if you really need to log every server event. Logging generally doesn’t have much impact on performance, but the safest way to ensure that the impact is negligible is to log to a separate disk (or to a centralized logging server, etc.).
The important configuration choices are: whether logs should be binary or plain text (both based on Protocol Buffers message formats); the type of logging (audit or access); the logging location, which may be "off disk" or even "off machine" (logging to a centralized logging facility requires a Java plugin that implements the Stardog Server logging interface; see Java Programming for more information); and the log rotation policy (file size or time).
Slow query logging is also available. See the Managing Running Queries section below.
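A sketch of the relevant stardog.properties settings, using the logging.[access,audit].[enabled,type,file] option pattern and the slow query options described earlier (the file name and threshold values are illustrative):
logging.access.enabled=true
logging.access.type=text
logging.access.file=access.log
logging.audit.enabled=false
logging.slow_query.enabled=true
logging.slow_query.time=1m
logging.slow_query.type=text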
Database Admin
Stardog is a multi-tenancy system and will happily give access to many, physically distinct databases.
Configuring a Database
To administer a Stardog database, some config options must be set at creation time; others may be changed subsequently and some may never be changed. All config options have sensible defaults (except for the database name), so you don’t have to twiddle any of the knobs till you really need to.
To configure a database, use the metadata-get and metadata-set CLI commands. See the Man Pages for the details.
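As a sketch of typical usage (the exact flags, e.g. -o for setting an option, are an assumption here; consult the man pages for the authoritative syntax):
$ stardog-admin metadata get myDb
$ stardog-admin metadata set -o query.all.graphs=true myDb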
Configuration Options
Option | Mutable | Default | API |
---|---|---|---|
|
Yes |
||
The name of one or more database archetypes, used to associate ontologies and constraints with new databases. Built-in archetypes are "PROV" and "SKOS", see the docs for instructions to create your own. |
|||
|
Yes |
||
The amount of time a connection to the database can be open, but inactive, before being automatically closed to reclaim the resources. The following unit suffixes are available: "ms" for milliseconds, "s" for seconds, "m" for minutes, and "h" for hours. |
|||
|
No |
||
A database name, the legal value of which is given by the regular expression [A-Za-z]{1}[A-Za-z0-9_-] |
|||
|
Yes |
||
The default namespaces for the database. Legal input is a comma-separated list of <prefix>=<namespace> values. |
|||
|
No |
|
|
Whether or not the database is online |
|||
|
Yes |
|
|
Comma-separated list of names of RDF extractors to use when processing documents when no RDF extractor names are given. Built-in extractors include tika, text, entities, linked and dictionary. See the docs for instructions to create your own. |
|||
|
Yes |
|
|
Comma-separated list of names of text extractors to use when processing documents when no text extractor names are given. See the docs to create your own. |
|||
|
Yes |
||
A URI indicating which FileSystem provider to use for document storage. In addition to local storage (file:///), documents can be stored on Amazon S3 (s3:///) or document storage can be disabled altogether (none) |
|||
|
Yes |
||
The directory where OpenNLP models are located |
|||
|
Yes |
|
|
The path under which documents will be stored. A relative path is relative to the database directory. S3 storage should specify an absolute path with the bucket name as the first part of the path |
|||
|
Yes |
|
|
Protocol used when storing unstructured text documents on S3 (and compatible) stores. Can be set to HTTP to disable TLS/SSL |
|||
|
No |
|
|
Option for enabling edge properties that allows RDF statements to appear as subjects of RDF statements (aka RDF*). |
|||
|
Yes |
|
|
Enables automatic generation of the default GraphQL schema if one has not been created by the user. |
|||
|
Yes |
|
|
Specifies the input source to use for generating the default GraphQL schema automatically. |
|||
|
Yes |
|
|
The maximum number of results a GraphQL query can return. The argument "first" can be used within queries to limit results as well. |
|||
|
No |
||
Specifies which part of the database, in terms of named graphs, is checked with IC validation |
|||
|
Yes |
|
|
Enables automatic ICV consistency check as part of transactions. Ensures that both reasoning consistencies and constraint violations are checked during ICV "Guard Mode" validation (with each commit). This is only effective when "icv.reasoning.enabled" is set to true. |
|||
|
Yes |
|
|
Determines if all database mutations are subject to ICV. When enabled, each commit is inspected to ensure that the contents of the database are valid. Should a transaction attempt to commit invalid data, the commit will fail and the data will not be added/removed. |
|||
|
Yes |
|
|
The number of violations that will be computed and returned in the error message when guard mode is enabled. If the option is set to 0 no explanations will be computed and transaction failure will only indicate there was a violation without specifying which constraint failed. |
|||
|
Yes |
|
|
Determines if reasoning is used during ICV. |
|||
|
Yes |
|
|
NOTE: Not used in Stardog v7+. The minimum number of statements in the Stardog database before differential indexes are used. |
|||
|
Yes |
|
|
NOTE: Not used in Stardog v7+. The size in number of RDF statements before the differential indexes are merged to the main indexes. |
|||
|
No |
|
|
Specify that non-string typed literals are canonicalized in the database. Improves query and loading performance, but does change literal values to a canonical form. For example, "1"^^xsd:byte is canonicalized to "1"^^xsd:integer. Set this option to false if you require literals to be exactly as specified, rather than canonicalized. The default value is 'true'. Note that this value can only be set at database creation time and cannot be changed later. |
|||
|
Yes |
|
|
Enables memory-mapping in lucene indices (e.g., search, spatial). |
|||
|
Yes |
|
|
The max capacity for the query pattern cardinality cache that is shared across queries to the same database. |
|||
|
Yes |
|
|
If true, Stardog will pre-compute the number of times frequent binary chains occur in the data and will use that information for query optimization. This can be disabled for very large or very complex datasets. Changes to this option take effect the next time statistics are recomputed. |
|||
|
Yes |
|
|
The max number of characteristic sets computed as a part of the statistical summary of the database. More diverse datasets may require a higher number for more accurate query planning. The downside is higher memory footprint and slower planning |
|||
|
No |
|
|
Determines whether and how selectivity statistics are computed when a database is bulk loaded. By default they are computed synchronously. |
|||
|
Yes |
|
|
Determines whether statistics are maintained automatically. When set to "true", Stardog will decide when to update statistics as the database is modified through additions and removals and update statistics as needed. If this option is set to "false", Stardog will never update the statistics regardless of how much the database is updated. |
|||
|
Yes |
|
|
Once the ratio of updated triples to database size exceeds this limit, statistics computation will be performed synchronously within the transaction instead of in a background thread. Setting this option to a non-positive number (<= 0) disables blocking updates. |
|||
|
Yes |
|
|
Minimum number of triples that should be in the database for statistics to be updated automatically |
|||
|
Yes |
|
|
Ratio of updated triples to the number of triples in the database that triggers the automatic statistics computation in a background thread |
|||
|
Yes |
|
|
Maximum number of triples to keep in memory for merging interleaving additions and removals while querying uncommitted state |
|||
|
Yes |
|
|
Controls whether query evaluation will use extended literal comparison. If enabled, literals of different datatypes are first compared based on their string values and then based on the string value of their datatypes. |
|||
|
No |
|
|
Configuration option for determining the normalization algorithm for the language tags of literals. |
|||
|
No |
|
|
Determines how the Stardog parser handles bnode (blank node) identifiers that may be present in RDF input. If this property is enabled, parsing and data loading performance are improved; but the other effect is that if distinct input files use (randomly or intentionally) the same bnode identifier, that bnode will point to one and the same node in the database. If you have input files that use explicit bnode identifiers, and more than one of those files may use the same bnode identifiers, and you don’t want those bnodes to be smushed into a single node in the database, then this configuration option should be disabled. |
|||
|
Yes |
|
|
When enabled, the progress of various tasks will be printed in the server log. |
|||
|
Yes |
||
This option controls the behavior for answering queries that don’t specify a dataset (FROM or FROM NAMED) in the query. In such cases, the SPARQL specification says that the query should be answered only using the information in default graph (no context). However, sometimes it is desirable to answer such queries using all the information in the database including the default graph and all named graphs. Setting this option to true changes the behavior of Stardog to do this. Queries that specify a dataset are not affected by this option. |
|||
|
Yes |
|
|
The default DESCRIBE query strategy for the database. Built-in strategies include "default", "cbd" and "bidirectional". See the docs for instructions to create your own describe strategy. |
|||
|
Yes |
|
|
The conditions under which a cached plan will be reused. "ALWAYS" and "NEVER" determine query plan reuse as you would expect. "CARDINALITY" instructs Stardog to reuse cached query plans for structurally equivalent queries if the cardinality estimations of scans are similar. |
|||
|
Yes |
|
|
Determines how property paths interact with named graphs in the data. When set to true and the property path pattern is in the default scope (i.e. not inside a graph keyword), Stardog will check that paths do not span multiple named graphs (per section 18.1.7 of the W3C SPARQL 1.1 Query Language Recommendation). For this to affect query results either there should be multiple FROM clauses or query.all.graphs must be also set to true. |
|||
|
Yes |
||
Determines max execution time for query evaluation. This can also be overridden in a query’s parameters. The following unit suffixes are available: "ms" for milliseconds, "s" for seconds, "m" for minutes, and "h" for hours. |
|||
|
Yes |
|
|
Enables approximate reasoning. With this flag enabled Stardog will approximate an axiom that is outside the profile Stardog supports and normally ignored. For example, an equivalent class axiom might be split into two subclass axioms and only one subclass axiom is used. |
|||
|
Yes |
|
|
Perform schema classification eagerly when the schema is loaded. Classifying eagerly ensures subclass and equivalence queries between named classes can be answered with a simple lookup. However, if the schema is changing frequently then this option can be turned off so classification is performed only if necessary. |
|||
|
Yes |
|
|
Enables automatic consistency checking as part of every query performed with reasoning enabled. If the underlying database did not change since the last consistency check, the check will not be performed. |
|||
|
Yes |
|
|
If true, Stardog will pre-compute class and property names which have assertions in the data. That can speed-up reasoning but may slow things down when data changes often. |
|||
|
Yes |
|
|
Enables punning; the ability for an IRI to represent both a class and an individual. |
|||
|
Yes |
|
|
Allows one to choose how query patterns are rewritten for reasoning: as a whole (per scope) or individually (per pattern). |
|||
|
Yes |
|
|
Option to enable owl:sameAs reasoning. When this option is set to "ON", then the reflexive, symmetric, and transitive closure of owl:sameAs triples in the database are computed. When it is set to "FULL", then owl:sameAs inferences are computed based on schema axioms, such as functional properties. See the docs for more information. |
|||
|
Yes |
||
Determines which, if any, named graph or graphs contains the schema (ontology, "TBox") part of the data. The legal value is a comma-separated list of named graph identifiers, including (optionally) the special names, tag:stardog:api:context:default and tag:stardog:api:context:local, which represent the default graph and the union of all local (non-virtual) named graphs and the default graph, respectively. In the context of database configurations only, Stardog will recognize default and * as short forms of those URIs, respectively. |
|||
|
Yes |
|
|
Timeout for schema reasoning. If schema reasoning cannot be completed in the specified time then only RDFS reasoning will be performed for the schema, which might yield incomplete answers for schema queries. The timeout value is specified as a positive integer followed by either the letter 'h' (for hours), 'm' (for minutes), 's' (for seconds), or the letters 'ms' (for milliseconds). Examples: '1h' for 1 hour, '5m' for 5 minutes, '90s' for 90 seconds, '500ms' for 500 milliseconds. |
|||
|
Yes |
||
Option to specify the schemas and the named graphs that constitute each schema. The value is a comma-separated collection of schema=IRI pairs. There should be one pair for each named graph in a schema. The graphs for the default schema are set via the reasoning.schema.graphs option. |
|||
|
Yes |
|
|
Option to specify the number of schemas to keep in memory. There can be more schemas defined in the database but only this many schemas will be kept in memory and other schemas will be pulled into memory as queries are getting answered. If this limit is too high, the amount of memory used for schemas will increase and might cause memory problems. If it is too low then answering reasoning queries might slow down. |
|||
|
Yes |
|
|
Specifies the reasoning type associated with the database, mostly corresponding to the OWL Profiles of the same name. The following reasoning types are available: RDFS (OWL 2 axioms allowed in RDF Schema), QL (OWL 2 QL axioms), RL (OWL 2 RL axioms), EL (OWL 2 EL axioms), DL (OWL 2 DL axioms), SL (a combination of RDFS, QL, RL and EL axioms + user-defined rules) and NONE (disables reasoning). Any axiom outside the selected type will be ignored by the reasoner. |
|||
|
Yes |
|
|
Flag to enable reasoning over virtual graphs and SERVICE clauses. |
|||
|
Yes |
|
|
Specify the default limit on the number of results returned from a full-text search (-1 returns all results). This only limits the number of results returned from the Lucene full-text index, not from its containing query. |
|||
|
Yes |
|
|
Enables the full-text (unstructured) search index for the database; important for Semantic Search applications. |
|||
|
No |
||
Option to specify the datatypes for which to index literals in Lucene. Literals with other datatypes will not be accessible via full-text search. |
|||
|
Yes |
|
|
Whether literals added during a transaction are automatically indexed. If this flag is set to 'false', then full-text search queries will return incomplete results until the index is rebuilt. |
|||
|
Yes |
|
|
Enable support to query the Lucene full-text search index with leading wildcards. |
|||
|
Yes |
|
|
If enabled, named graphs are an explicit resource type in Stardog’s security model. |
|||
|
Yes |
|
|
Enables the geospatial search index for the database. |
|||
|
No |
|
|
Specifies the precision used for the indexing of geospatial data. The smaller the value, the less precision, but the better the performance of geospatial queries. The default value is 11 which yields sub-meter precision; a value of 8 will give a precision +/- 50m. |
|||
|
Yes |
|
|
Specify the default limit on the number of results returned from a geospatial query (-1 returns all results). This only limits the number of results returned from the geospatial index, not from its containing query. |
|||
|
Yes |
|
|
Enables automatic SQL schema generation when one does not exist in the database. |
|||
|
Yes |
|
|
Specifies the input source to use for generating the SQL schema automatically when one does not exist in the database. |
|||
|
Yes |
||
Specifies which named graph in the database is used to read SQL schema mapping. |
|||
|
No |
|
|
Controls whether Stardog parses RDF strictly (true, the default) or loosely (false). Setting this to "false" will allow, for example, integer literals with leading zeros. |
|||
|
Yes |
|
|
Configures isolation level for transactions |
|||
|
Yes |
|
|
Option for whether or not the database logs all transaction events to disk. The default when not in Cluster mode is "false", and when in Cluster mode the default is "true". |
|||
|
Yes |
|
|
Determines whether old log files will be deleted after rotation |
|||
|
Yes |
|
|
Determines the size (in bytes) at which the transaction log will be rotated |
|||
|
Yes |
|
|
Enables using rotated transaction log |
|||
|
No |
|
|
Which conflict resolution strategy to use for this database. Write conflicts occur when two transactions attempt to modify the same statement in the database simultaneously. |
|||
|
Yes |
|
|
Determines what data the database evaluates queries against; if true, it will query over the default graph and the union of all accessible virtual graphs; if false (the default), it will query only over the default graph. Requires query.all.graphs to be true |
A Note About Database Status
A database must be set to offline
status before most configuration
parameters may be changed. Hence, the normal routine is to set the database
offline, change the parameters, and then set the database to online. All
of these operations may be done programmatically from CLI tools, such
that they can be scripted in advance to minimize downtime. In a future
version, we will allow some properties to be set while the database
remains online.
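For example, a minimal sketch of that routine for a hypothetical database myDatabase, using the stardog-admin metadata set command (the full-text search option search.enabled is used purely as an illustration; see stardog-admin help metadata for the exact syntax):
$ stardog-admin db offline myDatabase
$ stardog-admin metadata set -o search.enabled=true myDatabase
$ stardog-admin db online myDatabase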
Managing Database Status
Databases are either online or offline; this allows database maintenance to be decoupled from server maintenance.
Online and Offline
Databases are put online or offline synchronously: these operations
block until other database activity is completed or terminated. See
stardog-admin help db
for details.
Examples
To set a database from online to offline:
$ stardog-admin db offline myDatabase
To set the database online:
$ stardog-admin db online myDatabase
If Stardog Server is shutdown while a database is offline, the database will be offline when the server restarts.
Creating a Database
Stardog databases may be created locally or remotely; but
performance is better if data files don’t have to be transferred over a
network during creation and initial loading. See the section below about
loading compressed data. All data files, indexes, and server metadata
for the new database will be stored in Stardog Home. Stardog won’t
create a database with the same name as an existing database. Stardog
database names must start with an alpha character followed by zero or more
alphanumeric, hyphen or underscore characters, as is given by the regular
expression [A-Za-z]{1}[A-Za-z0-9_-]*
.
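For example, sales2021, Sales-DB, and kg_test are all valid database names, while 2021sales (leading digit) and my db (contains a space) are not.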
Note
|
The following reserved words may not be used as the
names of Stardog databases: system , admin , and docs .
|
Minimally, the only thing you must know to create a Stardog database is a database name; alternately, you may customize some other database parameters and options depending on anticipated workloads, data modeling, and other factors.
See stardog-admin help db create
for all the
details including examples.
Database Archetypes
A database archetype is a simple templating mechanism for bundling a set of namespaces, schemas and constraints to populate a newly created database. Archetypes are an easy way to register the namespaces, reasoning schemas and constraints for standardized vocabularies and ontologies with a database. Archetypes are composable so multiple archetypes can be specified at database creation time to load all the defined namespaces, schemas and constraints into the database. Archetypes are intended to be used alongside your domain data, which may include as many other schemas and constraints as are required.
As of Stardog 7.2.0, the preferred way of using archetypes is via the Stardog Archetype Repository which comes with archetypes for FOAF, SKOS, PROV and CIM. Follow the instructions on the GitHub repository for setting up and using archetypes.
Once the archetypes have been set up you can use the following command to create a new database that will load the namespaces, schemas and constraints associated with an archetype:
$ stardog-admin db create -o database.archetypes="cim" -n db
Inline Archetypes
Archetypes can be used as a predefined way of loading a schema and a set of constraints to the database, just like any RDF data can be loaded into a database. These kinds of archetypes are called "inline" because their contents appear in the database under predefined named graphs, as explained next. The named graphs automatically created by archetypes can be queried and modified by the user like any other named graph.
Each archetype has a unique IRI identifying it and the schema contents of inline
archetypes will be loaded into a named graph with that IRI. To see an example, follow the
setup instructions
to download the archetypes to ${STARDOG_HOME}/.archetypes
and create a new
database with the FOAF archetype:
$ stardog-admin db create -o database.archetypes="foaf" myDb
If you query the database you will see a named graph automatically created:
$ stardog query myDb "select distinct ?g { graph ?g { } }"
+----------------------------+
| g |
+----------------------------+
| http://xmlns.com/foaf/0.1/ |
+----------------------------+
Protected Archetypes
Archetypes can also be defined in a "protected" mode where the schema and the constraints will be available for reasoning and validation services but they will not be stored in the database. In this mode, archetypes prevent unintended modifications to the schema and the constraints without losing their reasoning and validation functionality. An ontology like PROV is standardized by W3C and is not meant to change over time so the protected mode can be used with it.
User-defined archetypes are inline by default, but the archetype definition can be configured to make the schema and/or the constraints protected, as explained in the GitHub repository.
The following example shows how using a protected archetype would look:
$ stardog-admin db create -o database.archetypes="prov" -n provDB
Successfully created database 'provDB'.
$ stardog query provDB "select distinct ?g { graph ?g { } }"
+-------+
| g |
+-------+
+-------+
$ stardog reasoning schema provDB
prov:wasDerivedFrom a owl:ObjectProperty
prov:wasGeneratedBy owl:propertyChainAxiom (prov:qualifiedGeneration prov:activity)
prov:SoftwareAgent a owl:Class
prov:wasInfluencedBy rdfs:domain (prov:Activity or prov:Agent or prov:Entity)
...
$ stardog query --reasoning provDB "select * { ?cls rdfs:subClassOf prov:Agent }"
+--------------------+
| cls |
+--------------------+
| prov:Agent |
| prov:SoftwareAgent |
| owl:Nothing |
| prov:Person |
| prov:Organization |
+--------------------+
$ stardog icv export provDB
AxiomConstraint{prov:EmptyCollection rdfs:subClassOf (prov:hadMember max 0 owl:Thing)}
AxiomConstraint{prov:Entity owl:disjointWith prov:Derivation}
SPARQLConstraint{
...
This example demonstrates that the database looks empty to regular SPARQL queries but
reasoning queries see the PROV ontology. Similarly PROV constraints are visible
for validation purposes but they cannot be removed by the icv drop
command.
Built-in Archetypes
Before Stardog 7.2.0, the only way to define archetypes was by creating and registering a new Java class that contained the archetype definition. This method is deprecated as of Stardog 7.2.0 but it will continue to work until Stardog 8, at which point support for Java-based archetypes will be removed. Until that time, the Java-based PROV and SKOS archetypes that were bundled in the Stardog distribution as built-in archetypes will be available and can be used without setting up the archetype location as described above.
Database Creation Templates
As a boon to the overworked admin or devops peeps, Stardog Server supports database creation templates: you can pass a Java Properties file with config values set, while the values unique to a specific database (typically just the database name) are passed as CLI parameters.
Examples
To create a new database with the default options by simply providing a name and a set of initial datasets to load:
$ stardog-admin db create -n myDb input.ttl another_file.rdf moredata.rdf.gz
Datasets can be loaded later as well. To create a database (in this case, an empty one) from a template file:
$ stardog-admin db create -c database.properties
At a minimum, the configuration file must have a value for the
database.name
option.
If you want to change only a few configuration options, you can give the values for these options directly in the CLI args as follows:
$ stardog-admin db create -n db -o icv.enabled=true icv.reasoning.enabled=true -- input.ttl
“--” is used in this case when “-o” is the last option to delimit the value for “-o” from the files to be bulk loaded.
Please refer to the CLI help for more details of the db create
command.
Database Create Options
Name | Description | Arg values | Default |
---|---|---|---|
|
Required, the name of the database to create |
||
|
Flag to specify whether bulk loaded files should be first copied to the server |
|
|
|
Specifies the kind of database indexes: memory or disk |
|
disk |
|
Specifies that the database’s indexes should be optimized for RDF triples only |
|
Backing Up and Restoring
Stardog provides two different kinds of backup operations: database backups and server backups. These commands perform physical backups, including database metadata, rather than logical backups via some RDF serialization. They are native Stardog backups and can only be restored with Stardog tools as explained below. Backups may be accomplished while a database is online; backup is performed in a read transaction: reads and writes may continue, but writes performed during the backup are not reflected in the backup.
In addition to physical backups one can perform a logical backup using the
stardog data export
command that will save the
contents of a database into a standard RDF file. Logical backups do not contain
database metadata or configuration options.
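For example, a hedged sketch of a logical backup of a hypothetical database myDb to a Turtle file (see stardog help data for the full set of options):
$ stardog data export myDb myDb-export.ttl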
In addition to that, Stardog 7+ supports backup/restore functionality for the entire
system in one step via the stardog-admin server backup
and
stardog-admin server restore
commands. This functionality
is convenient when all databases need to be backed up and restored simultaneously.
Backup
We explain database backups and server backups in the following sections.
Database Backups
Database backup saves the contents of a single database along with database metadata
including user and role permissions associated with the database. The
stardog-admin db backup
command assumes
a default location for its output, namely,
$STARDOG_HOME/.backup
; that default may be overridden by setting backup.dir
.
Backups are stored in directories by database name and then in
date-versioned subdirectories for each backup volume.
If you need to specify a location outside of $STARDOG_HOME
(e.g. a network
mount) you can set backup.location
or pass it to the --to
argument.
To backup a Stardog database called foobar
:
$ stardog-admin db backup foobar
To perform a remote backup, for example, pass in a specific directory that may be mounted in the current OS namespace via some network protocol, thus:
$ stardog-admin db backup --to /my/network/share/stardog-backups foobar
Finally, database backups can also be performed directly to S3 or GCP. For S3 backups use a URL in the following format:
s3://[<endpoint hostname>:<endpoint port>]/<bucket name>/<path prefix>?region=<AWS Region>&AWS_ACCESS_KEY_ID=<access key>&AWS_SECRET_ACCESS_KEY=<verySecretKey1>
The endpoint hostname
and endpoint port
values are only used for
on-premises S3 clones. To use Amazon S3 those values can be left blank
and the URL will have three /
before the bucket as in:
s3:///mybucket/backup/prefix?region=us-east-1&AWS_ACCESS_KEY_ID=accessKey&AWS_SECRET_ACCESS_KEY=secret
For GCP backups use a URL in the following format:
gs://<bucket name>/<path prefix>?GOOGLE_APPLICATION_CREDENTIALS=<path to Google Credentials JSON file>
See the GCP documentation (https://cloud.google.com/docs/authentication/production) for creating a Google credentials JSON file.
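Putting the pieces together, a hedged example of backing up the foobar database directly to cloud storage via the --to argument (the bucket names, prefixes, and credentials below are placeholders):
$ stardog-admin db backup --to "s3:///mybucket/backups?region=us-east-1&AWS_ACCESS_KEY_ID=accessKey&AWS_SECRET_ACCESS_KEY=secret" foobar
$ stardog-admin db backup --to "gs://mybucket/backups?GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json" foobar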
A default S3 or GCP backup location can also be specified in the stardog.properties
file with the key backup.location
.
Server Backups
Server backup will back up the entire Stardog server: all databases and associated metadata. Unlike database backup, which takes a full backup of the database every time it is run, server backup takes incremental backups of the data. That way, each time the command is run, only the updates to the databases since the last backup need to be saved.
Server backups are accomplished with the following command:
$ stardog-admin server backup
You can optionally specify the backup location, otherwise the Stardog defaults will be used,
similar to the db backup
command:
$ stardog-admin server backup /path/to/custom/backup/location
This command only supports file-based backups; it cannot be used with S3, for instance.
To copy the backups off local disk we recommend using a tool such as rclone.
After setting up rclone you can use it to send the backups to another server:
$ rclone sync /path/to/backup/location/ sftp:/path/to/other/location/
Consult the rclone docs for the full list of supported storage systems.
Restore
There are different restore commands corresponding to database and server backups.
Database Restores
To restore a Stardog database from a Stardog backup volume, simply pass a fully-qualified path to the volume in question. The location of the backup should be the full path to the backup, not the location of the backup directory as specified in your Stardog configuration. There is no need to specify the name of the database to restore.
To restore a database from its backup:
$ stardog-admin db restore $STARDOG_HOME/.backups/myDb/2012-06-21
Backups can also be restored directly from S3 by using an S3 URL in the following format:
s3://[<endpoint hostname>:<endpoint port>]/<bucket name>/<path prefix>/<database name>?region=<AWS Region>&AWS_ACCESS_KEY_ID=<access key>&AWS_SECRET_ACCESS_KEY=<verySecretKey1>
Note: Unlike the backup URL the database name must be specified as the last entry of the path
field in the URL.
Stardog can be configured to automatically restore databases from a backup location on startup. For example, when a Stardog cluster node first starts it could pull all of the database data down from an S3 backup before joining the cluster.
There are two options that control this behavior.
Option | Description |
---|---|
|
A regular expression that matches the names of the databases to automatically
restore on startup, eg: |
|
A boolean value that determines if all databases which failed to load should be automatically restored from a backup location. |
Server Restores
You can use the server restore
command to restore server backups created by server backup
.
To do so you must shut down Stardog and set $STARDOG_HOME
to an empty home directory.
The server restore
command will restore the complete server to $STARDOG_HOME
.
Once complete, you can start the Stardog server.
$ export STARDOG_HOME=/path/to/empty/stardog/home
$ stardog-admin server restore /path/to/server/backup
Server backups do not contain your license file, stardog.properties
or any other additional
files or directories created externally under STARDOG_HOME
so you need to back up and restore
those files and directories separately.
By default server restore
will restore the latest backup found in the backup directory. The
server configuration option backup.keep.last.number.backups
, which can be set in stardog.properties
, controls how many backups will be retained. By default, this option is set to 4, and any one of the
older backups can be restored if desired by specifying a backup ID in the command:
$ stardog-admin server restore -b 3 /path/to/server/backup
The server backup
command prints the ID for the backup created which is the value that can be
passed to the server restore
command. The backup IDs correspond to directories under the
versions
directory of the backup directory. The creation date for these directories will
indicate when the corresponding backup was created.
Namespace Prefix Bindings
Stardog allows database administrators to persist and manage custom namespace prefix bindings:
-
At database creation time, if data is loaded to the database that has namespace prefixes, then those are persisted for the life of the database. This includes setting the default namespace to the default that appears in the file. Any subsequent queries to the database may simply omit the
PREFIX
declarations:
$ stardog query myDB "select * {?s rdf:type owl:Class}"
-
To add new bindings, use the
namespace
subcommand in the CLI:
$ stardog namespace add myDb --prefix ex --uri 'http://example.org/test#'
-
To change the default binding, use an empty (quoted) prefix when adding a new one:
$ stardog namespace add myDb --prefix "" --uri http://new.default
-
To change an existing binding, delete the existing one and then add a new one:
$ stardog namespace remove myDb --prefix ex
-
Finally, to see all the existing namespace prefix bindings:
$ stardog namespace list myDB
If no files are used during database creation, or if the files do not
define any prefixes (e.g. NTriples), then the "Big Four" default
prefixes are stored: RDF, RDFS, XSD
, and OWL
.
When executing queries in the CLI, the default table format for SPARQL
SELECT
results will use the bindings as qnames. SPARQL CONSTRUCT
query output (including export) will also use the stored prefixes. To reiterate,
namespace prefix bindings are per database, not global.
Loading Compressed Data
Stardog supports loading data from compressed files directly: there’s no need to uncompress files before loading. Loading compressed data is the recommended way to load large input files. Stardog supports GZIP, BZIP2 and ZIP compressions natively.
GZIP and BZIP2
A file passed to create
will be treated as compressed if the file name ends
with .gz
or .bz2
. The RDF format of the file is determined by the
penultimate extension. For example, if a file named test.ttl.gz
is used as input, Stardog will perform GZIP decompression during loading and
parse the file with Turtle parser. All the formats supported by Stardog
(RDF/XML, Turtle, Trig, etc.) can be used with compression.
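For example, compressed files can be passed directly at database creation time or added to an existing database later (the file names here are only illustrative):
$ stardog-admin db create -n myDb data.ttl.gz history.rdf.bz2
$ stardog data add myDb more-data.nt.gz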
ZIP
The ZIP support works differently since zipped files can contain
many files. When an input file name ends with .zip
, Stardog
performs ZIP decompression and tries to load all the files inside the
ZIP file. The RDF format of the files inside the zip is determined
by their file names as usual. If there is an unrecognized file extension
(e.g. '.txt'), then that file will be skipped.
Dropping a Database
This command removes a database and all associated files and metadata.
This means all files on disk related to the database will be deleted,
so only use drop
when you’re certain!
It takes as its only argument a valid database name. For example,
$ stardog-admin db drop my_db
Using Integrity Constraint Validation
Stardog supports integrity constraint validation as a data quality mechanism via closed world reasoning. Constraints can be specified in SHACL as well as OWL, SWRL, and SPARQL. Please see the Validating Constraints section for more information about using ICV in Stardog.
The CLI icv
subcommand can be used to add, delete, or drop all
constraints from an existing database. It may also be used to validate
an existing database with constraints that are passed into the icv
subcommand; that is, using different constraints than the ones already
associated with the database.
For details of ICV usage, see stardog help icv
and stardog-admin help icv
.
For ICV in transacted mutations of Stardog databases, see the database creation
section above.
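As a hedged example, the following validates a database against constraints supplied in a file rather than the constraints stored in the database (check stardog help icv for the exact arguments and output format):
$ stardog icv validate myDb constraints.ttl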
Migrating a Database
The migrate
subcommand migrates an older Stardog database to the
latest version of Stardog. Its only argument is the name of the database
to migrate. migrate
won’t necessarily work between arbitrary Stardog
versions, so before upgrading check the release notes for a new version
carefully to see whether migration is required or possible.
$ stardog-admin db migrate myDatabase
will update myDatabase
to the latest database format.
Getting Database Information
You can get some information about a database by running the following command:
$ stardog-admin metadata get my_db_name
This will return all the metadata stored about the database, including the values of configuration options used for this database instance. If you want to get the value for a specific option then you can run the following command:
$ stardog-admin metadata get -o index.named.graphs my_db_name
Managing Stored Functions
Stored functions, available since Stardog 5.1, provide the ability to
reuse expressions. This avoids duplication and ensures consistency
across instances of the same logic. Stored functions are treated
similarly to built-in and user-defined functions in that they can be
used in FILTER
constraints and BIND
assignments in SPARQL queries,
path queries and rules.
Creating and Using Functions
Functions are useful to encapsulate computational or business logic
for reuse. We can create a new function to compute the permutation
using the function add
command with
stardog-admin
on the command line:
stardog-admin function add "function permutation(?n, ?r) { factorial(?n) / factorial(?n - ?r) }"
We can use this function in a SPARQL query and see that the function is expanded in the query plan:
Explaining Query:
select * where { ?x :p :q. filter(permutation(?x, 3) > 1) }
The Query Plan:
Projection(?x) [#1]
`─ Filter((factorial(?x) / factorial((?x - "3"^^xsd:integer))) > "1"^^xsd:integer) [#1]
`─ Scan[POS](?x, :p, :q) [#1]
Stored Function Syntax
Function definitions provided to the add
command must adhere to the
following grammar:
FUNCTIONS ::= Prolog FUNCTION+
FUNCTION ::= 'function' FUNC_NAME '(' ARGS ')' '{' Expression '}'
FUNC_NAME ::= IRI | PNAME | LOCAL_NAME
ARGS ::= [Var [',' Var]* ]?
Prolog ::= // BASE and PREFIX declarations as defined by SPARQL 1.1
Expression ::= // as defined by SPARQL 1.1
Var ::= // as defined by SPARQL 1.1
We can use IRIs or prefixed names as function names and include
several functions in one add
call:
$ stardog-admin function add "prefix ex: <http://example/> \
function ex:permutation(?n, ?r) { factorial(?n) / factorial(?n - ?r) } \
function <http://example/combination>(?n, ?r) { permutation(?n, ?r) / factorial(?r) }"
Stored 2 functions successfully
Additional Function Management
The admin commands cover adding, listing and removing functions. Examples of these commands are shown below:
$ stardog-admin function list
FUNCTION combination(?n,?r) {
((factorial(?n) / factorial((?n - ?r))) / factorial(?r))
}
FUNCTION permutation(?n,?r) {
(factorial(?n) / factorial((?n - ?r)))
}
$ stardog-admin function remove permutation
Removed stored function successfully
HTTP APIs are also provided to add, list and remove stored functions:
-
GET /admin/functions/stored[/?name={functionName}]
-
DELETE /admin/functions/stored[/?name={functionName}]
-
POST /admin/functions/stored
The contents of the POST
request should be a document containing one or more
function definitions using the syntax described above. The GET
request by default
returns the definitions for all the functions. If the name
parameter is specified
a definition for the function with that name is returned. Similarly, the DELETE
request deletes all the functions by default or deletes a single function if the
name
parameter is specified.
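For example, a minimal sketch of these calls using curl against a local server (the default port 5820, the credentials, and the cube function are illustrative):
$ curl -u admin:admin -X POST --data-binary "function cube(?x) { ?x * ?x * ?x }" http://localhost:5820/admin/functions/stored
$ curl -u admin:admin "http://localhost:5820/admin/functions/stored?name=cube"
$ curl -u admin:admin -X DELETE "http://localhost:5820/admin/functions/stored?name=cube"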
Stored functions are persisted in the system database. The system database should be backed up properly to avoid loss of functions.
Dependencies Across Stored Functions
Stored functions are compiled at creation time in a way that guarantees they will keep working even if functions they depend on are later removed or changed. For this reason, if you want a dependent function to pick up changes to its dependencies, it must be reloaded (re-added) after those dependencies change.
Managing Stored Queries
Stardog 4.2 added the capability to name and store SPARQL queries for future evaluation by referring to the query’s name.
Queries of any type can be stored in Stardog and executed directly by using the name of the stored query. Stored queries can be shared with other users, which gives those users the ability to run those queries provided that they have appropriate permissions for a database.
Stored queries can be managed via CLI, Java API, and HTTP API. The CLI command group is
stardog-admin stored
. The HTTP API is detailed in Network Programming.
Storing Queries
Queries can be stored using the stored add
admin command and
specifying a unique name for the stored query:
$ stardog-admin stored add -n types "select distinct ?type {?s a ?type}"
If a file is used to specify the query string without an explicit
-n/--name
option then the name of the query file is used for the stored
query:
$ stardog-admin stored add listProperties.sparql
By default, stored queries can be executed over any database. But they can be
scoped by providing a specific database name with the -d/--database
option.
Also, by default, only the user who stored the query can access that stored
query. Using the --shared
flag will allow other users to execute the stored
query.
The following example stores a shared query with a custom name that can
be executed over only the database myDb
:
$ stardog-admin stored add --shared -d myDb -n listProperties "select distinct ?p {?s ?p ?o}"
The JSON attributes which correspond to --shared
and -d
are shared
and database
.
Stored query names must be unique for a Stardog instance. Existing stored
queries can be replaced using the --overwrite
option in the command.
Updating Stored Queries
Queries can be updated using the --overwrite
option on the stored add
admin command and specifying an existing name for a stored query:
$ stardog-admin stored add --overwrite -n types "select distinct ?p {?s ?p ?o}"
Importing and Exporting Stored Queries
Stored queries are saved as RDF statements in the Stardog system database and it is possible to export the RDF representation of the queries:
$ stardog-admin stored export
@prefix system: <http://system.stardog.com/> .
system:QueryExportAll a system:StoredQuery , system:SharedQuery ;
system:queryName "ExportAll" ;
system:queryString """construct where {?s ?p ?o}""" ;
system:queryCreator "admin" ;
system:queryDatabase "*" .
system:QuerylistDroids a system:StoredQuery , system:ReasoningQuery ;
system:queryName "listDroids" ;
system:queryString "select ?x { ?x a :Droid }" ;
system:queryCreator "luke" ;
system:queryDatabase "starwars" .
The same RDF representation can be used to import the stored queries as an alternative way of storing new queries or updating existing stored queries.
$ stardog stored import queries.ttl
In addition to the built-in properties from the system
database
, arbitrary RDF properties can be used for stored queries. The values of these
additional annotation properties should be IRIs or literals. Only the values
directly linked to the stored query subject in the RDF document will be saved;
triples whose subject is not a stored query will be ignored.
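For instance, a hedged sketch of an import file that attaches an extra rdfs:comment annotation to a stored query, following the export format shown above:
@prefix system: <http://system.stardog.com/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
system:QuerylistDroids a system:StoredQuery ;
  system:queryName "listDroids" ;
  system:queryString "select ?x { ?x a :Droid }" ;
  system:queryDatabase "starwars" ;
  rdfs:comment "Returns all droids; maintained by the data team" .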
Running Stored Queries
Stored queries can be executed using the regular query execution CLI command by passing the name of the stored query:
$ stardog query myDb listProperties
Other commands like query explain
also accept stored query names.
The stored query name can also be passed instead of a query string in HTTP API calls.
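For example, a hedged sketch of running the stored query by name over HTTP with curl (default port 5820; credentials illustrative):
$ curl -u admin:admin --data-urlencode "query=listProperties" http://localhost:5820/myDb/query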
Listing Stored Queries
To see all the stored queries, use the stored list
subcommand:
$ stardog-admin stored list
The results are formatted tabularly:
+--------+-----------------------------------------+
| Name | Query String |
+--------+-----------------------------------------+
| graphs | SELECT ?graph (count(*) as ?size) |
| | FROM NAMED stardog:context:all |
| | WHERE { GRAPH ?graph {?s ?p ?o}} |
| | GROUP BY ?graph |
| | ORDER BY desc(?size) |
| people | CONSTRUCT WHERE { |
| | ?person a foaf:Person ; |
| | ?p ?o |
| | } |
| types | SELECT DISTINCT ?type ?label |
| | WHERE { |
| | ?s a ?type . |
| | OPTIONAL { ?type rdfs:label ?label } |
| | } |
+--------+-----------------------------------------+
3 stored queries
Users can only see the queries they’ve stored and the queries stored by other
users that have been --shared
. The --verbose
option will show more details
about the stored queries.
Removing Stored Queries
Stored queries can be removed using the stored remove
command:
$ stardog-admin stored remove storedQueryName
If you would like to clear all the stored queries then use the
-a/--all
option:
$ stardog-admin stored remove -a
Stored Query Service
Stardog supports a way to invoke stored queries, including path queries, in the context
of another SPARQL query using the SERVICE
keyword. The Stored Query Service was released as beta in Stardog 7.3.2
and is generally available (GA) as of version 7.4.0. Previous versions of Stardog already employed the service mechanism
in SPARQL to support full-text search and Entity Extraction, and now this is naturally extended to
stored queries. Suppose the following query is stored with the name "cities":
stardog-admin stored add -n "cities" "SELECT ?country ?city { ?city :locatedIn ?country }"
Then it is possible to use it as a named view in another query:
prefix sqs: <tag:stardog:api:sqs:>
SELECT ?person ?city ?country {
SERVICE <query://cities> { [] sqs:vars ?country, ?city }
?person :from ?city
}
This query uses the "cities" query to look up information about the country given the city where a person lives. It is similar to using a Wikidata endpoint or an explicit subquery except that the subquery is referenced by name. The same query with an explicit subquery would look like this:
SELECT ?person ?city ?country {
{
SELECT ?country ?city {
?city :locatedIn ?country
}
}
?person :from ?city
}
Invoking stored queries by name has the major benefit that it avoids duplication of their query strings. Stored queries become reusable query building blocks maintained in one place rather than copy-pasted over the many queries which use them.
The body pattern of SERVICE <query://name> { … }
specifies which variables of the stored query are used in the outer
scope of the calling query. sqs:vars
is a shortcut which is useful when stored query variables retain their names.
However it’s possible to map stored query variable names to other identifiers to avoid naming conflicts:
prefix sqs: <tag:stardog:api:sqs:>
SELECT ?person ?city ?livesIn ?country {
SERVICE <query://countries> {
[] sqs:var:city ?livesIn ;
sqs:var:country ?country
}
?person :from ?livesIn ;
:born ?city
}
Furthermore, it’s possible to statically bind some stored query variables to constants so the query would behave like a parameterized view:
prefix sqs: <tag:stardog:api:sqs:>
SELECT ?city ?country {
SERVICE <query://countries> {
[] sqs:var:city ?city ;
sqs:var:country :The_United_States
}
}
The Stored Query Service allows running stored queries against a different RDF dataset than the main query
(something that is not possible for standard SPARQL subqueries). If the dataset is specified in
the stored query itself using FROM
or FROM NAMED
keywords, it is used when the query is evaluated
through the service regardless of main query’s dataset. In addition the dataset can be specified inside
the service pattern using the sqs:default-graph
and sqs:named-graph
predicates.
In that case it overrides any FROM
or FROM NAMED
definitions in the stored query. This is similar
to how a SPARQL query’s dataset can be overridden using HTTP parameters defined in the SPARQL Protocol.
Example:
prefix sqs: <tag:stardog:api:sqs:>
SELECT ?city ?country {
SERVICE <query://countries> {
[] sqs:var:city ?city ;
sqs:var:country :The_United_States ;
sqs:default-graph <http://.../countries> .
}
}
Another interesting feature is the ability to call path queries from SELECT
/CONSTRUCT
/ASK
queries.
One cannot directly use a path query in a subquery because those do not return SPARQL binding sets, aka solutions
(we discussed that issue in an earlier blog post on Extended Solutions).
However, this service circumvents that restriction:
prefix sqs: <tag:stardog:api:sqs:>
SELECT ?start (count(*) as ?paths) {
SERVICE <query://paths> {
[] sqs:vars ?start
}
} GROUP BY ?start
The stored path query returns paths (according to some VIA
pattern) and uses ?start
as the start node variable.
The main query aggregates the returned paths by the start node and returns the number of paths for each. In contrast
to the earlier SELECT
example, this would not be possible directly because path queries cannot be used as subqueries.
Note
|
One should be aware of the potential explosive nature of path queries when using them through the stored query service. They can return a very high number of paths to be joined or aggregated and thus create substantial memory pressure on the server. |
Stardog 7.3.2+ supports two new SPARQL functions which take paths as the argument: stardog:length
and stardog:nodes
.
The former returns the length of the path and the latter generates a comma-separated string of all path nodes. Since
SELECT query results do not support paths as first-class citizens (that is, any value in a binding set is either an IRI or
a literal or a blank node), these provide means to return path information by generating literals. Paths returned by
the stored query service can be accessed via the reserved variable name ?path
:
prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>
SELECT ?start (avg(stardog:length(?path)) as ?avg_length) {
SERVICE <query://paths> {
[] sqs:vars ?start, ?path
}
} GROUP BY ?start
Stardog 7.4.4 supports additional stardog:all
and stardog:any
functions to check Boolean conditions over
edges in paths returned by a stored path query. These are useful for filtering path query results on the server side:
prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>
SELECT (str(stardog:nodes(?path))) {
SERVICE <query://paths> {
[] sqs:vars ?path
}
FILTER(stardog:all(?path, ?attribute = 10))
}
Here ?attribute
is a variable occurring in the VIA pattern of the stored path query. stardog:all
returns true
if the ?attribute = 10
condition is true
for all edges in the path. The second argument can be an arbitrary
SPARQL expression. stardog:any
is the complementary function returning true
if the condition is true
for at
least one edge. It is particularly useful for querying paths which must pass through a particular node(s) in the graph.
Managing Running Queries
Stardog includes the capability to manage running queries according to configurable policies set at run-time; this capability includes support for listing running queries; deleting running queries; reading the status of a running query; killing running queries that exceed a time threshold automatically; and logging slow queries for analysis.
Stardog is pre-configured with sensible server-wide defaults for query management parameters; these defaults may be overridden or disabled per database, or even per query.
Configuring Query Management
For many use cases the default configuration will be sufficient. But
you may need to tweak the timeout parameter to be longer or shorter,
depending on the hardware, data load, queries, throughput, etc. The
default configuration sets a server-wide query timeout via the
query.timeout
property, which is inherited by all the databases in the server.
You can customize the server-wide timeout value and then set
per-database custom values, too. Any database without a custom value
inherits the server-wide value. To disable query timeout, set
query.timeout
to 0
. If individual queries need to set their own timeout,
this can be done (by passing a timeout
parameter over HTTP or using
the --timeout
flag on the CLI), but only if the query.timeout.override.enabled
property is set to true for the database (true is the default).
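For example, a hedged sketch of the three levels of configuration (exact value formats are documented in the CLI help): a server-wide default in stardog.properties, a per-database override, and a per-query override:
# in stardog.properties
query.timeout=10m
$ stardog-admin metadata set -o query.timeout=1h myDb
$ stardog query --timeout 30s myDb slow-query.sparql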
Listing Queries
To see all running queries, use the query list
subcommand:
$ stardog-admin query list
The results are formatted tabularly:
+----+----------+-------+--------------+
| ID | Database | User | Elapsed time |
+----+----------+-------+--------------+
| 2 | test | admin | 00:00:20.165 |
| 3 | test | admin | 00:00:16.223 |
| 4 | test | admin | 00:00:08.769 |
+----+----------+-------+--------------+
3 queries running
You can see which user owns the query (superusers can see all running queries), as well as the elapsed time and the database against which the query is running. The ID column is the key to deleting queries.
Terminating Queries
To terminate a running query, simply pass its ID to the query kill
command:
$ stardog-admin query kill 3
The output confirms the query kill completing successfully:
Query 3 killed successfully
Automatically Killing Queries
For production use, especially when a Stardog database is exposed to arbitrary query input, some of which may not execute in an acceptable time, the automatic query killing feature is useful. It will protect a Stardog Server from queries that consume too many resources.
Once the execution time of a query exceeds the value of query.timeout
,
the query will be killed automatically.[6] The client that submitted the
query will receive an error message. The value of query.timeout
may be
overridden by setting a different value (smaller or larger) in database
options. To disable the timeout, set query.timeout
to 0
.
The value of query.timeout
is a positive integer concatenated with a
letter, interpreted as a time duration: 'h' (for hours),
'm' (for minutes), 's' (for seconds), or 'ms' (for milliseconds). For
example, '1h' for 1 hour, '5m' for 5 minutes, '90s' for 90 seconds, and
'500ms' for 500 milliseconds.
The default value of query.timeout
is five minutes.
Query Status
To see more detail about an in-flight query, use the query status
subcommand:
$ stardog-admin query status 1
The resulting output includes query metadata, including the query itself:
Username: admin
Database: test
Started : 2013-02-06 09:10:45 AM
Elapsed : 00:01:19.187
Query :
select ?x ?p ?o1 ?y ?o2
where {
?x ?p ?o1.
?y ?p ?o2.
filter (?o1 > ?o2).
}
order by ?o1
limit 5
Slow Query Logging
Stardog does not log slow queries in the default configuration because there isn’t a single value for what counts as a "slow query", which is entirely relative to queries, access patterns, dataset sizes, etc. While slow query logging has minimal overhead, what counts as a slow query in some context may be acceptable in another. See Configuring Stardog Server above for the details.
Protocols and Java API
For HTTP protocol support, see Stardog’s Apiary docs.
For Java, see the Javadocs.
Security and Query Management
The security model for query management is simple: any user can kill any running query submitted by that user, and a superuser can kill any running query. The same general restriction is applied to query status; you cannot see status for a query that you do not own, and a superuser can see the status of every query.
Managing Query Performance
Stardog answers queries in two major phases: determining the query plan and executing that plan to obtain answers from the data. The former is called query planning (or query optimization) and includes all steps required to select the most efficient way to execute the query. How Stardog evaluates a query can only be understood by analyzing the query plan. Query plan analysis is also the main tool for investigating performance issues as well as addressing them, in particular, by re-formulating the query to make it more amenable to optimization.
Query Plan Syntax
We will use the following running example to explain query plans in Stardog.
SELECT DISTINCT ?person ?name
WHERE {
?article rdf:type bench:Article .
?article dc:creator ?person .
?inproc rdf:type bench:Inproceedings .
?inproc dc:creator ?person .
?person foaf:name ?name
}
This query returns the names of all people who have authored both a journal article and a paper in a conference proceedings. The query plan used by Stardog (in this example, 4.2.2) to evaluate this query is:
Distinct [#812K]
`─ Projection(?person, ?name) [#812K]
`─ MergeJoin(?person) [#812K]
+─ MergeJoin(?person) [#391K]
│ +─ Sort(?person) [#391K]
│ │ `─ MergeJoin(?article) [#391K]
│ │ +─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
│ │ `─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
│ `─ Scan[PSOC](?person, foaf:name, ?name) [#433K]
`─ Sort(?person) [#503K]
`─ MergeJoin(?inproc) [#503K]
+─ Scan[POSC](?inproc, rdf:type, bench:Inproceedings) [#255K]
`─ Scan[PSOC](?inproc, dc:creator, ?person) [#898K]
The plan is arranged in a hierarchical, tree-like structure. The nodes, called
operators, represent units of data processing during evaluation. They
correspond to evaluations of graph patterns or solution modifiers as defined in
the SPARQL 1.1
specification. All operators can be regarded as functions which may take some
data as input and produce some data as output. All input and output data is
represented as streams of
solutions, that is, sets
of bindings of the form x → value
where x
is a variable used in the query
and value
is some RDF term (IRI, blank node, or literal). Examples of
operators include scans, joins, filters, unions, etc.
Numbers in square brackets after each node refer to the estimated cardinality of the node, i.e. how many solutions Stardog expects this operator to produce when the query is evaluated. Statistics-based cardinality estimation in Stardog merits a separate blog post, but here are the key points for the purpose of reading query plans:
-
all estimations are approximate and their accuracy can vary greatly (generally: more precise for bottom nodes, less precise for upper nodes)
-
estimations are only used for selecting the best plan but have no bearing on the actual results of the query
-
in most cases a sub-optimal plan can be explained by inaccurate estimations
Stardog Evaluation Model
Stardog generally evaluates query plans according to the
bottom-up SPARQL
semantics. Leaf nodes are evaluated first and without input, and their results
are then sent to their parent nodes up the plan. Typical examples of leaf nodes
include scans, i.e. evaluations of triple patterns, evaluations of full-text
search predicates, and
VALUES
operators. They
contain all information required to produce output, for example, a triple
pattern can be directly evaluated against Stardog indexes. Parent nodes, such as
joins, unions, or filters, take solutions as inputs and send their results
further towards the root of the tree. The root node in the plan, which is
typically one of the
solution modifiers,
produces the final results of the query which are then encoded and sent to the
client.
Pipelining And Pipeline Breakers
Stardog implements the
Volcano model, in which
evaluation is as lazy as possible. Each operator does just enough work to
produce the next solution. This is important for performance, especially for
queries with a LIMIT
clause (of which ASK
queries are a special case) and
also enables Stardog’s query engine to send the first result(s) as soon as they
are available (as opposed to waiting till all results have been computed).
Not all operators can produce output solutions as soon as they get first input solutions from their children nodes. Some need to accumulate intermediate results before sending output. Such operators are called pipeline breakers, and they are often the culprits for performance problems, typically resulting from memory pressure. It is important to be able to spot them in the plan since they can suggest either a way to re-formulate the query to help the planner or a way to make the query more precise by specifying extra constants where they matter.
Here are some important pipeline breakers in the example plan:
-
HashJoin
algorithms build a hash table for solutions produced by the right operand. Typically all such solutions need to be hashed, either in memory or spilled to disk, before the first output solution is produced by the HashJoin
operator. -
Sort
: the sort operator builds an intermediate sorted collection of solutions produced by its child node. The main use case for sorting solutions is to prepare data for an operator which can benefit from sorted inputs, such as MergeJoin
, Distinct
, or GroupBy
. All solutions have to be fetched from the child node before the smallest (w.r.t. the sort key) solution can be emitted. -
GroupBy
: group-by operators are used for aggregation, e.g. counting or summing results. When evaluating a query like select ?x (count(?y) as ?count) where { … } group by ?x
Stardog has to scroll through all solutions to compute the count for every ?x
key before returning the first result.
Other operators can produce output as soon as they get input:
-
MergeJoin
: merge join algorithms do a single zig-zag pass over sorted streams of solutions produced by children nodes and output a solution as soon as the join condition is satisfied. -
DirectHashJoin
: contrary to the classical hash join algorithm, this operator does not build a hash table. It utilizes Stardog indexes for look-ups, which doesn’t require extra data structures. This is only possible when the right operand is sorted by the join key but the left isn’t; otherwise Stardog would use a merge join. -
Filter
: a solution modifier which evaluates the filter condition on each input solution. -
Union
: combines streams of children solutions without any extra work, e.g. joining, so there’s no need for intermediate results.
Now, returning to the above query, one can see Sort
pipeline breakers in the
plan:
Sort(?person) [#391K]
`─ MergeJoin(?article) [#391K]
+─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
`─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
This means that all solutions representing the join of ?article rdf:type
bench:Article
and ?article dc:creator ?person
will be put in a sequence
ordered by the values of ?person
. Stardog expects to sort 391K
solutions
before they can be further merge-joined with the results of the ?person
foaf:name ?name
pattern. Alternatively, the engine may build a hash table
instead of sorting solutions; such decisions are made by the optimizer based on
a number of factors.
Skipping Intermediate Results
One tricky part of understanding Stardog query plans is that evaluation of each operator in the plan is context-sensitive, i.e. it depends on what other nodes are in the same plan, maybe in a different sub-tree. In particular, the cardinality estimations, even if assumed accurate, only specify how many solutions the operator is expected to produce when evaluated as the root node of a plan.
However, as it is joined with other parts of the plan, the results can be different. This is because Stardog employs optimizations to reduce the number of solutions produced by a node by pruning those which are incompatible with other solutions with which they will later be joined.
Consider the following basic graph pattern and the corresponding plan:
?erdoes rdf:type foaf:Person .
?erdoes foaf:name "Paul Erdoes"^^xsd:string .
?document dc:creator ?erdoes .
MergeJoin(?erdoes) [#10]
+─ MergeJoin(?erdoes) [#1]
│ +─ Scan[POSC](?erdoes, rdf:type, foaf:Person) [#433K]
│ `─ Scan[POSC](?erdoes, foaf:name, "Paul Erdoes") [#1]
`─ Scan[POSC](?document, dc:creator, ?erdoes) [#898K]
The pattern matches all documents created by a person named Paul Erdoes. Here the second pattern is selective (only one entity is expected to have the name "Paul Erdoes"). This information is propagated to the other two scans in the plan via merge joins, which allows them to skip scanning large parts of data indexes.
In other words, the node Scan[POSC](?erdoes, rdf:type, foaf:Person) [#433K]
will not produce all 433K
solutions corresponding to all people in the
database and, similarly, Scan[POSC](?document, dc:creator, ?erdoes) [#898K]
will not go through all 898K
document creators.
Diagnosing Performance Problems
Performance problems may arise for two reasons:
-
complexity of the query itself, especially the amount of returned data
-
failure to select a good plan for the query.
It is important to distinguish the two. In the former case the best way forward
is to make the patterns in WHERE
more selective. In the latter case, i.e. when
the query returns some modest number of results but takes an unacceptably long
time to do so, one needs to look at the plan, identify the bottlenecks (most often, pipeline breakers),
and reformulate the query or report it to us for further analysis.
Here’s an example of an unselective query:
SELECT DISTINCT ?name1 ?name2
WHERE {
?article1 rdf:type bench:Article .
?article2 rdf:type bench:Article .
?article1 dc:creator ?author1 .
?author1 foaf:name ?name1 .
?article2 dc:creator ?author2 .
?author2 foaf:name ?name2 .
?article1 swrc:journal ?journal .
?article2 swrc:journal ?journal
FILTER (?name1<?name2)
}
The query returns all distinct pairs of authors who published (possibly different) articles in the same journal. It returns more than 18M results from a database of 5M triples. Here’s the plan:
Distinct [#17.7M]
`─ Projection(?name1, ?name2) [#17.7M]
`─ Filter(?name1 < ?name2) [#17.7M]
`─ HashJoin(?journal) [#35.4M]
+─ MergeJoin(?author2) [#391K]
│ +─ Sort(?author2) [#391K]
│ │ `─ NaryJoin(?article2) [#391K]
│ │ +─ Scan[POSC](?article2, rdf:type, bench:Article) [#208K]
│ │ +─ Scan[PSOC](?article2, swrc:journal, ?journal) [#208K]
│ │ `─ Scan[PSOC](?article2, dc:creator, ?author2) [#898K]
│ `─ Scan[PSOC](?author2, foaf:name, ?name2) [#433K]
`─ MergeJoin(?author1) [#391K]
+─ Sort(?author1) [#391K]
│ `─ NaryJoin(?article1) [#391K]
│ +─ Scan[POSC](?article1, rdf:type, bench:Article) [#208K]
│ +─ Scan[PSOC](?article1, swrc:journal, ?journal) [#208K]
│ `─ Scan[PSOC](?article1, dc:creator, ?author1) [#898K]
`─ Scan[PSOC](?author1, foaf:name, ?name1) [#433K]
This query requires an expensive join on ?journal
which is evident from the
plan (it’s a hash join in this case). It produces more than 18M results (Stardog
expects 17.7M which is pretty accurate here) that need to be filtered
and examined for duplicates. Given all this information from the plan, the only
reasonable way to address the problem would be to restrict the criteria, e.g. to
particular journals, people, time periods, etc.
If a query is well-formulated and selective, but performance is unsatisfactory, one may look closer at the pipeline breakers, e.g. this part of the query plan:
MergeJoin(?person) [#391K]
+─ Sort(?person) [#391K]
| `─ MergeJoin(?article) [#391K]
| +─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
| `─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
`─ Scan[PSOC](?person, foaf:name, ?name) [#433K]
A reasonable thing to do would be to evaluate the join of ?article rdf:type
bench:Article
and ?article dc:creator ?person
separately, i.e. as separate
queries, to see if the estimation of 391K
is reasonably accurate and to get an
idea about memory pressure. This is a valuable piece of information for a
performance problem report, especially when the data cannot be shared with us.
Similar analysis can be done for hash joins.
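A sketch of such a check, using a COUNT aggregate so the client does not need to retrieve all of the rows:
SELECT (COUNT(*) AS ?solutions)
WHERE {
  ?article rdf:type bench:Article .
  ?article dc:creator ?person .
}
Comparing the count with the estimate in the plan node shows whether the optimizer's cardinality estimation is accurate for that sub-tree.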
In addition to pipeline breakers, there could be other clear indicators of
performance problems. One of them is the presence of LoopJoin
nodes in the
plan. Stardog implements the
nested loop join algorithm which
evaluates the join by going through the Cartesian product of its inputs. This is
the slowest join algorithm and it is used only as a last resort. It sometimes,
but not always, indicates a problem with the query.
Here’s an example:
SELECT DISTINCT ?person ?name
WHERE {
?article rdf:type bench:Article .
?article dc:creator ?person .
?inproc rdf:type bench:Inproceedings .
?inproc dc:creator ?person2 .
?person foaf:name ?name .
?person2 foaf:name ?name2
FILTER (?name=?name2)
}
The query is similar to an earlier query we saw but runs much slower. The plan shows why:
Distinct [#98456.0M]
`─ Projection(?person, ?name) [#98456.0M]
`─ Filter(?name = ?name2) [#98456.0M]
`─ LoopJoin(_) [#196912.1M]
+─ MergeJoin(?person) [#391K]
│ +─ Sort(?person) [#391K]
│ │ `─ MergeJoin(?article) [#391K]
│ │ +─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
│ │ `─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
│ `─ Scan[PSOC](?person, foaf:name, ?name) [#433K]
`─ MergeJoin(?person2) [#503K]
+─ Sort(?person2) [#503K]
│ `─ MergeJoin(?inproc) [#503K]
│ +─ Scan[POSC](?inproc, rdf:type, bench:Inproceedings) [#255K]
│ `─ Scan[PSOC](?inproc, dc:creator, ?person2) [#898K]
`─ Scan[PSOC](?person2, foaf:name, ?name2) [#433K]
The loop join near the top of the plan computes the Cartesian product of the
arguments which produces almost 200B solutions. This is because there is no
shared variable between the parts of the query which correspond to authors of
articles and conference proceedings papers, respectively. The filter condition
?name = ?name2
cannot be transformed into an equi-join because the semantics
of term equality used in
filters is different from the
solution compatibility
semantics used for checking join conditions.
The difference manifests itself in the presence of numerical literals, e.g.
"1"^^xsd:integer
= "1.0"^^xsd:float
, where they are different RDF terms.
However, as long as all names in the data are strings, one can re-formulate this
query by renaming ?name2
to ?name
which would enable Stardog to use a more
efficient join algorithm.
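A sketch of that reformulation, with the filter dropped and ?name shared between the two branches:
SELECT DISTINCT ?person ?name
WHERE {
  ?article rdf:type bench:Article .
  ?article dc:creator ?person .
  ?inproc rdf:type bench:Inproceedings .
  ?inproc dc:creator ?person2 .
  ?person foaf:name ?name .
  ?person2 foaf:name ?name
}
With the shared variable, the two branches can be joined on ?name with a hash or merge join rather than a loop join.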
Query Plan Operators
The following operators are used in Stardog query plans:
-
Scan[Index]
: evaluates a triple/quad pattern against Stardog indexes. Indicates the index used, e.g. CSPO or POSC, where S, P, O, C stand for the kind of lexicographic ordering of quads that the index provides. SPOC means that the index is sorted first by Subject, then Predicate, then Object, then Context (named graph IRI). -
HashJoin(join key)
: hash join algorithm, hashes the right operand. Pipeline breaker. -
BindJoin(join key)
: a join algorithm which binds the join key variables of the right operand to the current values of the same variables in the current solution on the left and re-evaluates the right operand. Can be seen as an optimization of the nested loop join for the case when the left operand produces far fewer results than the right. Not a pipeline breaker. -
DirectHashJoin(join key)
: a hash join algorithm which directly uses indexes for lookups instead of building a hash table. Not a pipeline breaker. -
MergeJoin(join key)
: merge join algorithm, the fastest option for joining two streams of solutions. Requires both operands be sorted on the join key. Not a pipeline breaker. -
NaryJoin(join key)
: same as MergeJoin
but for N operators sorted on the same join key. -
NestedLoopJoin
: the nested loop join algorithm, the slowest join option. The only join option when there is no join key. Not a pipeline breaker. -
Shortest|All(Cyclic)Paths
: path query operators. -
Sort(sort key)
: sorts the argument solutions by the sort key, typically used as a part of a merge join. Pipeline breaker. -
Filter(condition)
: filters argument solutions according to the condition. Not a pipeline breaker. -
Union
: combines streams of argument solutions. If both streams are sorted by the same variable, the result is also sorted by that variable. Not a pipeline breaker. -
Minus
: Removes solutions from the left operand that are compatible with solutions from the right operand. Pipeline breaker. -
PropertyPath
: evaluates a property path pattern against Stardog indexes. Not a pipeline breaker. -
GroupBy
: groups results of the child operator by values of the group-by expressions (i.e. keys) and aggregates solutions for each key. Pipeline breaker (unless the input is sorted by the first key). -
Distinct
: removes duplicate solutions from the input. Not a pipeline breaker but accumulates solutions as it runs so the memory pressure increases with the number of unique solutions. -
VALUES
: produces the inlined results specified in the query. Not a pipeline breaker. -
Search
: evaluates full-text search predicates against the Lucene index within a Stardog database. -
Projection
: projects variables as results of a query or a sub-query. Not a pipeline breaker. -
Bind
: evaluates expressions on each argument solution and binds their values to (new) variables. Not a pipeline breaker. -
Unnest
: unnest array expressions. See the UNNEST operator. Not a pipeline breaker. -
Empty
andSingleton
: correspond to the empty solution set and a single empty solution, respectively. -
Type
: reasoning operator for evaluating patterns of the form ?x rdf:type ?type
or :instance rdf:type ?type
. Not a pipeline breaker. -
Property
: operator for evaluating triple patterns with an unbound predicate under reasoning. Not a pipeline breaker but can be very expensive, especially for large schemas. It is best avoided by either using an IRI in the predicate position or turning off reasoning for such patterns using a hint. -
Service
: SPARQL federation operator which evaluates a pattern against a remote SPARQL endpoint or a virtual graph. -
ServiceJoin(join key)
: a join algorithm used when one of the operators is aService
(see above). Propagates bindings from the other operator to reduce the number of results coming over the network. -
Slice(offset=<>, limit=<>)
: combines LIMIT
and OFFSET
solution modifiers in SPARQL. -
OrderBy
: an operator which implements the ORDER BY
solution modifier in SPARQL. -
Describe
: a SPARQL Describe operator. -
ADD
,CLEAR
,COPY
,LOAD
,MOVE
,DELETE
,DELETE DATA
,INSERT
,INSERT DATA
: SPARQL Update operators.
Using Query Hints
Query hints help Stardog generate optimized query plans. They can be expressed in queries in two ways:
-
as SPARQL comments started with the
pragma
keyword, e.g. #pragma push.filters aggressive
, or -
as SPARQL triple patterns of the form
[] <tag:stardog:api:hint:{hint name}> {hint value}
, e.g. [] hint:push.filters "aggressive"
The first approach makes hints transparent to other query processing tools (other than Stardog). The second approach is preferred when using 3rd party tools which do not preserve SPARQL comments. In both cases a hint applies to the scope where it’s used and all nested scopes (unless overridden by the same hint with a different value).
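As a sketch (the :p and :r properties are illustrative), the same hint could be written in either form; the hint: prefix below is shorthand for tag:stardog:api:hint: as shown above.
# comment form
SELECT ?s WHERE {
  #pragma push.filters aggressive
  ?s :p ?o ;
     :r ?o2 .
  FILTER (?o2 > 10)
}

# triple pattern form
PREFIX hint: <tag:stardog:api:hint:>
SELECT ?s WHERE {
  [] hint:push.filters "aggressive" .
  ?s :p ?o ;
     :r ?o2 .
  FILTER (?o2 > 10)
}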
The equality.identity
hint expects a comma-separated list of variables. It tells
Stardog that these variables will be bound to RDF terms (IRIs, bnodes, or literals) for
which equality coincides with identity (i.e. any term is equal only to itself). This is not true
for literals of certain numerical datatypes (cf. Operator Mapping).
However, assuming that the listed variables do not take on values of such datatypes can sometimes lead to faster query plans,
for example, because of converting some filters to joins and through value inlining.
SELECT ?o ?o2 WHERE {
#pragma equality.identity ?o,?o2
?s :p ?o ;
:q ?o2
FILTER (?o = ?o2)
}
Sometimes our query planner can produce sub-optimal join orderings.
The group.joins
hint introduces an explicit scoping mechanism to help with join
order optimization. Patterns in the scope of the hint, given by the enclosing {}
,
will be joined together before being joined with anything else. This way, you can tell
the query planner what you think is the optimal way to join variables.
select ?s where {
?s :p ?o1 .
{
#pragma group.joins
#these patterns will be joined first, before being joined with the other pattern
?s :p ?o2 .
?o1 :p ?o3 .
}
}
The push.filters
hint controls how the query optimizer pushes filters down the query plan.
There are three possible values: default
, aggressive
, and off
. The aggressive
option
means that the optimizer will push every filter to the deepest operator in the plan which binds
variables used in the filter expression. The off
option turns the optimization off and each filter
will be applied to the top operator in the filter’s graph pattern (if there are multiple filters,
their order is not specified). Finally, the default
option (or absence of the hint) means that the
optimizer will decide whether to push each filter down the plan
based on various factors, e.g. the filter’s cost, selectivity of the graph pattern, etc.
select ?s where {
#pragma push.filters off
#the filter in the top scope will not be pushed into the union
?s :p ?o1 .
FILTER (?o2 > 10)
{
#pragma push.filters aggressive
#the optimizer will place this filter directly on top of ?s :r ?o3
#and it will be evaluated before the results are joined with ?s :p ?o2
?s :p ?o2 ;
:r ?o3 .
FILTER (?o3 > 1000)
}
UNION
{
#pragma push.filters default
#the optimizer will decide whether to place the filter directly
#on top of ?s :q ?o3 or leave it on top of the join
?s a :Type ;
:q ?o3 .
FILTER (?o3 < 50)
}
}
ACID Transactions
What follows is specific guidance about Stardog’s transactional semantics and guarantees.[7]
Atomicity
Databases may guarantee atomicity—groups of database actions (i.e., mutations) are irreducible and indivisible: either all the changes happen or none of them happens. Stardog’s transacted writes are atomic. Stardog does not support nested transactions.[8]
Consistency
Data stored should be valid according to the data model (in this case, RDF) and to the guarantees offered by the database, as well as to any application-specific integrity constraints that may exist. Stardog’s transactions are guaranteed not to violate integrity constraints during execution. A transaction that would leave a database in an inconsistent or invalid state is aborted.
See the Validating Constraints section for a more detailed consideration of Stardog’s integrity constraint mechanism.
Isolation
A Stardog connection will run in
SNAPSHOT
isolation level if it has not started an explicit transaction and
will run in SNAPSHOT
or SERIALIZABLE
isolation level
depending on the value of the transaction.isolation database option. In any of these modes,
uncommitted changes will only be visible to the connection that made the changes:
no other connection can see those values before they are committed. Thus,
"dirty reads" can never occur. Additionally, a transaction will only see changes which were
committed before the transaction began, so there are no "non-repeatable reads".
SNAPSHOT
isolation does suffer from the write skew anomaly, which poses a problem when
operating under external logical constraints. We illustrate this with the following
example, where the database initially has two triples :a :val 100
and :b :val 50
,
and the application imposes the constraint that the total can never be less than 0.
Time | Connection 1 | Connection 2 | Connection 3 |
---|---|---|---|
t1 | begin transaction | begin transaction | |
t2 | read :a :val 100, :b :val 50 | | |
t3 | | read :a :val 100, :b :val 50 | |
t4 | write :a :val 0 | | |
t5 | | write :b :val 0 | |
t6 | commit | | |
t7 | | commit | |
t8 | | | begin transaction |
t9 | | | read :a :val 0, :b :val 0 |
At the end of this scenario, Connection 1 believes the state of the database to be
:a :val 0
and :b :val 50
, so the constraint is not violated.
Similarly, Connection 2 believes the state of the database to be :a :val 100
and :b :val 0
, which also does not violate the constraint. However,
Connection 3 sees :a :val 0
and :b :val 0
which violates
the logical constraint.
No locks are taken, or any conflict resolution performed, for concurrent
transactions in SNAPSHOT
isolation level. If there are
conflicting changes, the transaction with the highest commit timestamp (functionally,
the transaction which committed "last") will be the result held in the database.
This may yield unexpected results since every transaction reads from a snapshot
that was created at the time its transaction started.
Consider the following query being executed by two concurrent threads in
READ COMMITTED SNAPSHOT
isolation level against a database having the triple
:counter :val 1
initially:
DELETE { :counter :val ?oldValue }
INSERT { :counter :val ?newValue }
WHERE  { :counter :val ?oldValue
         BIND (?oldValue+1 AS ?newValue) }
Since each transaction will read the current value from its snapshot, it is
possible that both transactions will read the value 1
and insert the value 2
even though we expect the final value to be 3
.
Isolation level SERIALIZABLE
can be used to avoid these situations. In
SERIALIZABLE
mode an exclusive lock needs to be acquired before a transaction
begins. This ensures concurrent updates cannot interfere with each other, but as a
result update throughput will decrease since only one transaction can run at a time.
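Assuming the standard metadata set command, the isolation level for a database hypothetically named myDb could be changed like this:
$ stardog-admin metadata set -o transaction.isolation=SERIALIZABLE myDb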
Durability
By default Stardog’s transacted writes are durable and no other actions are required.
Commit Failure Autorecovery
Stardog’s transaction framework is low maintenance; but there are some rare conditions in which manual intervention may be needed.
Stardog’s strategy for recovering automatically from commit failure is as follows:
-
Stardog will roll back the transaction upon a commit failure;
-
Stardog takes the affected database offline for maintenance;[9] then
-
Stardog will begin recovery, bringing the recovered database back online once that task is successful so that operations may resume.
With an appropriate logging configuration for production usage (at least
error-level logging), log messages for the preceding recovery operations
will occur. If for whatever reason the database fails to be returned
automatically to online status, an administrator may use the CLI tools
(i.e., stardog-admin db online
) to try to online the database.
Optimizing Bulk Data Loading
Stardog tries hard to do bulk loading at database creation time in the most efficient and scalable way possible. But data loading time can vary widely, depending on factors in the data to be loaded, including the number of unique resources, etc. Here are some tuning tips that may work for you:
-
Use the
bulk_load
memory configuration for loading large databases (see Memory Configuration section). -
Load compressed data since compression minimizes disk access
-
Use a multicore machine since bulk loading is highly parallelized and indexes are built concurrently
-
Load many files together at creation time since different files will be parsed and processed concurrently improving the load speed
-
Turn off strict parsing (see Configuring a Database for the details).
Memory Management
As of version 5.0, Stardog by default uses a custom memory management approach to minimize GC activity during query evaluation. All intermediate query results are now managed in native (aka off-heap or direct) memory which is pre-allocated on server start-up and never returned to the OS until server shutdown. Every query, including SPARQL Update queries with the WHERE clause, gets a chunk of memory from that pre-allocated pool to handle intermediate results and will return it back to the pool when it finishes or gets cancelled. More technical details about this GC-less memory management scheme are available in a blog post.
The main goal of this memory management approach is to improve server’s resilience under heavy load. A common problem with JVM applications under load is the notorious Out-Of-Memory (OOM) exceptions which are hard to foresee and impossible to reliably recover from. Also, in the SPARQL world, it is generally difficult to estimate how many intermediate results any particular query will have to process before the query starts (although the selectivity statistics offers great help to this end). As such, the server has to deal with the situation when there is no memory available to continue with the current query. Stardog handles this by placing all intermediate results into custom collections which are tightly integrated with the memory manager. Every collection, e.g. for hashing, sorting, or aggregating binding sets, requests memory blocks from the manager and transparently spills data to disk when such requests are denied.
This helps avoid OOMs at any time during query evaluation since running out of memory only means triggering spilling and
the query will continue, just slower because of additional disk access. This also means Stardog 5.0+ can run harder queries, e.g. analytic queries,
which may exceed the memory capacity on your server. We have also seen performance improvements
in specific (but common) scenarios, such as with many concurrent queries, where the GC pressure would considerably
slow down the server running on heap. However, everything comes at a price and the custom collections can be
slightly slower than those based on JDK collections when the server is under light load, all queries are selective,
and there is no GC pressure. For that reason Stardog has a server option memory.management
which you can set to JVM
in stardog.properties
to disable custom memory management and have Stardog run all queries on heap.
The spilling.dir
server option specifies the directory which will be used for spilling data in case the server runs out of native memory.
It may make sense to set this to another disk to minimize disk contention.
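A minimal stardog.properties sketch for these options (the path is only an example; each option is independent and optional):
# optional: disable custom memory management and run queries on the JVM heap
# memory.management=JVM

# optional: spill to a dedicated disk to reduce contention (example path)
spilling.dir=/var/stardog/spill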
Memory Configuration
Stardog provides a range of configuration options related to memory management.
The query engine by default uses the custom memory management approach described above, but it is not the only
critical Stardog component which may require a large amount of memory. Memory is also
consumed aggressively during bulk loading and updates. Stardog defines three standard
memory consumption modes to allow users to configure how memory should be distributed based on the usage scenario.
The corresponding server property is memory.mode
which accepts the following values:
-
default
: This is the default option, which provides a roughly equal amount of memory for queries and updates (including bulk loading). It should be used either when the server is expected to run both read queries and updates in roughly equal proportion or when the expected load is unknown. -
read_optimized
: This option provides more memory to read queries and SPARQL Update queries with the WHERE clause. This minimizes the chance of having to spill data to disk during query execution at the expense of update and bulk loading operations. This option should be used when transactions will be infrequent or small in size, e.g. up to a thousand triples, since such transactions do not use a significant amount of memory. -
write_optimized
: This option should be used for optimal loading and update performance. Queries may run slower if there is not enough memory for processing intermediate results. It may be also suitable when the server is doing a lot of updates and some read queries but the latter are selective and are not highly concurrent. -
bulk_load
: This option should be used for bulk loading very large databases (billions of triples) where there is no other workload on the server. When bulk loading is complete, the memory configuration should be changed and the server restarted.
As with any server option the server has to be restarted after the user changes the memory
mode. The stardog-admin server status
command displays
detailed information on memory usage and the current configuration.
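For example, to favor bulk loading a very large database, one could set the mode in stardog.properties, restart the server, and confirm the setting:
memory.mode=bulk_load

$ stardog-admin server status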
Literal Index
Note: This feature is in beta in Stardog 7.3.2.
Stardog 7.3.2 introduces a new kind of index: an in-memory sorted index for numerical values occurring in the graph. It’s aimed at improving performance of queries with range filters over values of numerical properties. For example, prior to 7.3.2 the filter in the following query:
SELECT * {
?product a :Product ;
:price ?price
FILTER(?price > 1000)
}
would be applied to ?price
values for all products which is inefficient if there are many products with a price
below 1000
. If the literal index is enabled, however, the engine would first scan it to obtain all price values
in the range and then scan only the corresponding products. This is visible in the query plan:
Projection(?product, ?price)
`─ MergeJoin(?product)
+─ Scan[POS](?product, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, :Product)
`─ Sort(?product)
`─ DirectHashJoin(?price)
+─ LiteralRangeScan("1000"^^xsd:integer, inf) -> price
`─ Scan[POS](?product, :price, ?price)
The LiteralRangeScan
operator scans the literal index so the matching ?price
values can be efficiently joined
with the ?product :price ?price
pattern. The supposedly small result of that join is then further joined with
?product a :Product
pattern to obtain the final query result. See below for more information on query optimization
when the literal index is enabled.
Configuration and Logging
The literal index is disabled by default. To enable it, specify a comma-separated list of IRIs of properties whose
values should be indexed using the index.literals.properties
database option. The option accepts two meta IRIs:
tag:stardog:api:property:all
and tag:stardog:api:property:none
meaning that all properties should be indexed or
none, respectively (the latter is the default value).
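For example (the property IRI here is hypothetical), the option could be set to index a single price property, or to index every numerical property:
index.literals.properties=http://example.com/price

index.literals.properties=tag:stardog:api:property:all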
In addition to that, there are two options to put a cap on the total size of the literal index (index.literals.limit
)
and the size of the differential literal index (index.literals.merge.limit
). The former is straightforward: it prevents
the index from growing beyond the limit and consuming too much of the server’s memory. After reaching that limit, the index
is dropped and no longer used in queries. The latter controls the max size of the index that’s maintained separately
during transactions to avoid updates to the main literal index on every small change to the data. On reaching that limit
the differential index is merged into the main literal index.
Detailed information regarding building, maintaining the literal index, as well as using it in queries will be
printed in stardog.log
if DEBUG
logging is enabled for the com.complexible.stardog.db.index.literals
package.
Query Optimization
The literal index is fully integrated into Stardog’s query engine. The optimizer decides not only whether the index is applicable to a query (that is, whether it’s up-to-date and the query uses filters over indexed properties) but also whether its use would actually speed up the query. The latter is not always the case; consider the following example:
SELECT * {
?product a :Product ;
:from ?supplier ;
:price ?price .
?vendor :city :LittleRock
FILTER(?price > 10)
}
Assuming vendors from Little Rock don’t produce many products and the ?price > 10
predicate is not selective
(many price points in that range in the database), the optimizer will probably decide to join vendors and products
first and then simply apply the filter to the (supposedly small) result set. On the other hand, when the price range is
tighter but the vendor predicate is not selective (let’s say it’s the state of California), the optimizer would first
scan the literal index to obtain the price points in the range, and then look up corresponding products and vendors.
In either case, the query plan with cardinality estimations will show how the query will be evaluated. As always,
there’s the possibility to force the optimizer to either use or not use the literal index by specifying the
#pragma literal.index
hint. It admits the following values: aggressive
(always use the index if available),
off
(never use it) and default
(the same as not having the hint at all).
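For instance, to force the use of the index in the earlier price query:
SELECT * {
  #pragma literal.index aggressive
  ?product a :Product ;
           :price ?price
  FILTER(?price > 1000)
}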
Current limitations
The following needs to be kept in mind when using the literal index while it is in beta:
-
The index is in-memory only and is not persisted. As such, it is rebuilt on each server restart. The time it takes to rebuild the index is proportional to the number of triples with numerical predicates (configured via
index.literals.properties
). Each predicate is indexed concurrently with others. -
The index is currently stored on Java heap. Its memory footprint depends greatly on the data but when literal inlining is enabled (via
index.literals.canonical
which is true
by default) most numerical values are encoded as 64-bit integers. Then each literal index entry takes around 100 bytes on heap. Still, it increases heap consumption so might require a higher value for -Xmx
(see Capacity Planning below). -
While the index supports transactional updates, we do not currently recommend using it when values of numerical properties are frequently updated (i.e. when those triples are often added or deleted). It may have a negative impact on transactional throughput and increase GC pressure on the system. The index can instead be enabled once the data is relatively stable and read query performance is prioritized.
-
The index is managed as a single data structure for all indexed properties across all named graphs. If there are many indexed properties with overlapping ranges, it can have a negative impact on cardinality estimations (i.e. the optimizer can believe that there are many numerical values in the range even though they may belong to different properties than the one used in the query).
Capacity Planning
The primary system resources used by Stardog are CPU, memory, and disk.[10] Stardog will take advantage of more than one CPU, core, and core-based thread in data loading and in throughput-heavy or multi-user loads. Stardog performance is influenced by the speed of CPUs and cores. But some workloads are bound by main memory or by disk I/O (or both) more than by CPU. Use the fastest CPUs you can afford with the largest secondary caches and the most number of cores and core-based threads of execution, especially in multi-user workloads.
The following subsections provide more detailed guidance on the memory and disk resource requirements of Stardog.
Memory usage
Stardog uses system memory aggressively, and the total system memory available to Stardog is often the most important factor in performance. Stardog uses both JVM memory (heap memory) and the operating system memory outside the JVM (direct or native memory). Having more system memory available is always good; however, increasing the total memory limit too close to total system memory is not prudent, as the operating system will not have enough memory for its own operations (see guidelines below).
The following table shows recommended system memory for Stardog based on the graph stored locally in Stardog and how the system memory should be divided between JVM heap memory and direct memory. Note that the exact amount of memory needed can vary considerably depending on many factors other than graph size, such as the characteristics of your graph, the amount of data processed by queries, virtual graph access patterns, transactional load on the system, the amount of data bulk loaded into new databases, the number of concurrent users, and so on. The values in this table can be used as a guideline, but the only way to make sure you have optimal settings is to try your workload on your data and analyze the memory metrics provided by Stardog (see the memory metrics categories in Metrics in Stardog 7).
# of Triples | JVM Heap Memory | Direct Memory | Total System Memory |
---|---|---|---|
100 million | 3GB | 4GB | 8GB |
1 billion | 8GB | 20GB | 32GB |
10 billion | 30GB | 80GB | 128GB |
25 billion | 60GB | 160GB | 256GB |
50 billion | 80GB | 380GB | 512GB |
Out of the box, Stardog sets the maximum JVM memory to 2GB and
direct memory to 1GB which works fine for most small databases (less than 100
million triples) but should be increased as recommended above for larger datasets.
You can increase the memory for Stardog by setting the
system property STARDOG_SERVER_JAVA_ARGS
using the standard JVM options. For
example, you can set this property to "-Xms3g -Xmx3g -XX:MaxDirectMemorySize=4g"
to increase the JVM heap memory to 3GB and off-heap (direct) memory to 4GB. We recommend
setting the minimum heap size (-Xms
option) and
max heap size (-Xmx
option) to the same value.
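For example, following the 1 billion triple row of the table above, the server could be started with:
$ export STARDOG_SERVER_JAVA_ARGS="-Xms8g -Xmx8g -XX:MaxDirectMemorySize=20g"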
Some general guidelines that can be used in addition to the above table:
-
Heap memory should not be less than 2GB and setting it higher than 100GB is typically not recommended due to increased GC pauses.
-
The JVM uses the Compressed OOPs optimization if the heap limit is less than 32GB. If you want to set the heap limit higher than 32GB, you will only see noticeable benefits if you go to 50-60GB or higher.
-
Direct memory should be set higher than heap memory except for very small scales to prevent the heap size going below the recommended 2GB limit.
-
The sum of heap and direct memory settings should be around 90% of the total system memory available so that the operating system has enough memory for its own operations.
-
It is not recommended to run any other memory-intensive application on the same machine where a Stardog server is running, as those applications would compete for the same resources, and if the overall memory usage in the system increases to dangerously high levels the operating system or the container will kill the Stardog process.
Disk Usage
Stardog stores data on disk in a compressed format. The disk space needed for a database depends on many factors besides the number of triples, including the number of unique resources and literals in the data, average length of resource identifiers and literals, and how much the data is compressed. As a general rule of thumb, every million triples require 70 MB to 100 MB of disk space. The actual disk usage for a database may be different in practice. It is also important to realize the amount of disk space needed at creation time for bulk loading data is higher as temporary files will be created. The extra disk space needed at bulk loading time can be 40% to 70% of the final database size.
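As a rough illustration of these numbers, a database of 1 billion triples would typically occupy 70 GB to 100 GB on disk and could temporarily need an additional 30 GB to 70 GB of free space while it is being bulk loaded.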
The disk space used by Stardog is additive for more than one database and there is little disk space used other than what is required for the databases. To calculate the total disk space needed for more than one database, one may sum the disk space needed by each database.
Using Stardog with Kerberos
Stardog can be configured to run in both MIT and Active Directory Kerberos
environments. In order to
do so a keytab
file must be properly created.
Once the keytab file is acquired the following options can be set in
stardog.properties
:[11]
-
krb5.keytab
: The path to the keytab file for the Stardog server. -
krb5.admin.principal
: The Kerberos principal that will be the default administrator of this service. -
krb5.debug
: A boolean value to enable debug logging in the Java Kerberos libraries. -
krb5.user.translation.regex
: A string value used to translate a krb5 principal name to a Stardog username. The string is an expression in two parts divided by a:
. On the left side is a matching regex of the krb5 principal name to replace and on the right side is the string to replace it with. By default this is /:-
. This means "replace any /
character in the krb5 principal with a -
character and use that as the Stardog username". Thus the krb5 principal name stardog/admin
will be translated to stardog-admin
. The details of the substitution rules are that of Java String.replaceAll(). -
pack.krb5.principal
: The Kerberos principal that is authorized to connect as a cluster peer. Stardog cluster nodes connect directly to each other. This directive tells Stardog to use Kerberos authentication for this communication and to only allow connections from entities with the given Kerberos principal. -
pack.krb5.keytab
: The path to the keytab file that Stardog cluster peers will use to prove to other nodes they are authorized peers. The principal in this keytab must match the value ofpack.krb5.principal
.
Once Stardog is properly configured for Kerberos, Stardog usernames should be created that match the associated Kerberos principal names. Authentication is done based on the Kerberos environment, and authorization is done based on the principal names matching Stardog users.
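A sketch of the relevant stardog.properties entries, with placeholder principal and path values:
krb5.keytab=/etc/stardog/stardog.keytab
krb5.admin.principal=stardog-admin@EXAMPLE.COM
krb5.debug=false
krb5.user.translation.regex=/:-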
Enterprise Data Unification
State-of-the-art IT management tells us to organize data, systems, assets, staff, schedules, and budgets vertically to mirror lines of business. But the internal and external demands on IT are increasingly horizontal in nature: the data is organized vertically, but the enterprise needs to access and understand it horizontally.
Structured Data (Virtual Graphs)
Stardog supports a set of techniques for unifying structured enterprise data,
chiefly, Virtual Graphs which let you declaratively map data into a Stardog
knowledge graph and query it via Stardog in situ.
Stardog intelligently rewrites (parts of) SPARQL queries against Stardog into
native query syntaxes like SQL, issues the native queries to remote datasources,
and then translates the native results into SPARQL results. Virtual Graphs can
be used to map both tabular (relational) data from RDBMSs and CSVs as well as
semi-structured hierarchical data from NoSQL sources such as MongoDB, Elasticsearch, Cassandra and JSON to RDF.
A Virtual Graph has three components:
-
a unique name
-
a properties file specifying configuration options
-
data source connection parameters
-
query and data parameters
-
-
a data mapping file (which can be omitted and automatically generated for most sources)
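Putting these pieces together, a virtual graph is typically registered with the virtual add command, supplying the properties file and (optionally) the mappings file; the file names below are only placeholders:
$ stardog-admin virtual add dept.properties dept.ttl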
Supported Data Sources
Stardog currently supports all the data sources below. Please
inquire if you need support for another.
Please Note: Sources with a *
are not included in any Stardog trial offering and are only available upon request.
Relational Databases:
-
Apache Hive
-
Apache/Cloudera Impala
-
AWS Athena (additional info)
-
AWS Aurora
-
AWS Redshift
-
Derby
-
Exasol
-
Google BigQuery (additional info)
-
H2
-
IBM DB2
-
MariaDB
-
Microsoft SQL Server
-
MySQL
-
Odata *
-
Oracle
-
PostgreSQL
-
SAP Business One DI *
-
SAP HANA
-
Sybase ASE
-
Teradata
NoSQL Databases:
-
Apache Cassandra (additional info)
-
Cosmos DB
-
DataStax
-
Elasticsearch (additional info)
-
MongoDB (additional info)
SPARQL Engine/Service:
-
Stardog (additional info)
Cloud Service
-
Active Directory *
-
AWS Management *
-
Azure Management *
-
Facebook *
-
Hubspot *
-
Instagram *
-
Jira *
-
LDAP *
-
Linkedin *
-
Marketo *
-
Microsoft Teams *
-
Oracle Eloqua *
-
Oracle SalesCloud *
-
Salesforce Chatter *
-
Salesforce Marketing *
-
Salesforce Pardot *
-
SAP *
-
SAP SuccessFactors *
-
ServiceNow *
-
Slack *
-
Splunk *
-
Twilio *
-
Veeva *
-
Zendesk *
Files/Unstructured Data
-
Box *
-
CSV (additional info)
-
Dropbox *
-
Email *
-
Excel *
-
Excel Online *
-
Excel Services *
-
Gmail *
-
Google Calendar *
-
Google Contacts *
-
Google Drive *
-
Google Sheets *
-
JSON (additional info)
-
Microsoft CDS *
-
Microsoft Exchange *
-
Microsoft OneDrive *
-
Microsoft OneNote *
-
Microsoft Planner *
-
Microsoft Project *
-
Office365 *
-
Parquet *
-
REST *
-
Sharepoint *
CRM
-
Dynamics 365 Sales *
-
Dynamics CRM *
-
Netsuite *
-
Odoo *
-
Salesforce (via Simba Salesforce.com JDBC Connector)
-
Salesforce Einstein *
-
SAP ByDesign *
-
SAP Netweaver Gateway *
-
Sugar CRM *
-
Veeva CRM *
Data Analytics
-
Adobe Analytics *
Specific Data Source Considerations
AWS Athena Virtual Graph Considerations
To connect to an Athena database, first download the driver for JDBC version 4.2
here. Then follow the
instructions here.
The Athena JDBC driver does not expose the list of accessible databases. To save having
to qualify every table name with the database name in your mappings, provide the default
database in your connection URL using the Schema
parameter. For example:
jdbc.url=jdbc:awsathena://athena.us-west.amazonaws.com:443;S3OutputLocation=s3://mybucket/output;Schema=mydb
Athena does not support primary (or unique) keys. This can negatively impact Stardog’s
ability to create optimal queries. Use the unique.key.sets
virtual graph option to define unique columns manually.
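For example, if an orders table in the mydb database were known to be unique by its order_id column (the table and column names here are hypothetical, and the schema.table.column pattern mirrors the examples later in this chapter), the option could be set in the virtual graph properties file as:
unique.key.sets=(mydb.orders.order_id)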
Google BigQuery Virtual Graph Considerations
To connect to a BigQuery database, first download the driver for JDBC version 4.2
here.
Then follow the instructions here. You’ll need the GoogleBigQueryJDBC42.jar
,
google-api-client-1.28.0.jar
, google-auth-library-oauth2-http-0.13.0.jar
,
gax-1.42.0.jar
, google-api-services-bigquery-v2-rev426-1.25.0.jar
, and avro-1.9.0.jar
jar files.
The connection string requires a number of parameters. For example:
jdbc.url=jdbc:bigquery://https://bigquery.googleapis.com/bigquery/v2;ProjectId=myproject;DefaultDataset=mydataset;OAuthType=0;OAuthServiceAcctEmail=vgtestaccount@myacct.iam.gserviceaccount.com;OAuthPvtKeyPath=/path/creds/bq_pk.json;Timeout=60
#jdbc.username=
#jdbc.password=
Note the jdbc.username
and jdbc.password
properties are not set. An example
/path/creds/bq_pk.json
file looks like this:
{
"type": "service_account",
"project_id": "myproject",
"private_key_id": "363287712aaedb3cd20ebdc11213d43a15760768",
"private_key": "-----BEGIN PRIVATE KEY-----\nBRIngMEMybROWnpANTsb4THenEXTbaTTLe/or\Ill8mYlUNC5+Din2tH...3er45/2ed\53fg\g+d==\n-----END PRIVATE KEY-----\n",
"client_email": "myaccount@myid.iam.gserviceaccount.com",
"client_id": "12345678987654321239",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/myaccount%40myid.iam.gserviceaccount.com"
}
See https://www.simba.com/products/BigQuery/doc/JDBC_InstallGuide/content/jdbc/bq/using/connectionurl.htm for details. Note that alternative means of providing credentials may require additional jar files.
BigQuery does not support primary (or unique) keys. This can negatively impact Stardog’s
ability to create optimal queries. Use the unique.key.sets
virtual graph option to define unique columns manually.
MongoDB Virtual Graph Considerations
To connect to a MongoDB database, first download the client jar. The client jar
for MongoDB version "x.y.z" can be obtained from
http://central.maven.org/maven2/org/mongodb/mongo-java-driver/x.y.z/mongo-java-driver-x.y.z.jar
Then follow the instructions here.
MongoDB has one Date type. It is stored as a 64-bit integer that represents the number
of milliseconds since Jan 1, 1970, Universal Time Coordinated (UTC). This Date type can
be mapped to xsd:date
, xsd:dateTime
or xsd:dateTimeStamp
data types. (The
xsd:dateTimeStamp
data type is the same as xsd:dateTime
except instead of having an
optional timezone the timezone is required.) When a Date is mapped to either
xsd:date
or xsd:dateTimeStamp
, it will be represented in the UTC timezone. When a
Date field is mapped to an xsd:dateTime
, the Date will be converted to the local
timezone of the Stardog server and the label will include the timezone.
Setting Unique Keys Manually
The unique.key.sets
option can be used with
MongoDB, but the format is MongoDB-specific. In place of schema and table names, the
unique keys must be specified in terms of the collection and a list of nested arrays.
For example, take an accounts
virtual graph with this SMS2 mapping:
prefix : <http://example.com/>
MAPPING <urn:accounts>
FROM JSON {
"accounts" : {
"_id" : "?id",
"acct" : "?acctNum",
"customerName" : [ "?name" ],
"card" : [ {
"number" : "?ccNumber",
"expiration" : "?ccExpr" }
]
}
}
TO {
?holder :hasAcct ?acct .
?holder :hasName ?name .
}
WHERE {
BIND (template("http://example.com/acct/{acctNum}") AS ?acct)
BIND (template("http://example.com/holder/{ccNumber}_{name}") AS ?holder)
}
And this query:
SELECT * {
graph <virtual://accounts> {
?holder :hasAcct ?acct .
?holder :hasName ?name .
}
}
When Stardog translates this query, it creates a flattened view of the collection
(using the $unwind stage)
giving a relational view. In this example both the customerName
and card
arrays will
be flattened because both are referenced in the template for the ?holder
variable.
The plan for the example query will include a join because Stardog has no way of knowing
that the card.number
/customerName
pair is unique. If we know that this pair of fields
is indeed unique in this collection, we can make the query more efficient by adding the
pair as a unique key to the unique.key.sets
property:
(accounts.[card;customerName].customerName,accounts.[card;customerName].card.number)
It is required that the flattened arrays are listed in alphabetical order, are separated by
semicolons, and are enclosed in square brackets. For nested arrays, use periods to delimit
the names (level1.level2
).
Multiple key sets can be separated with commas. For example, if we also know that the
acct
field is unique, the property value becomes:
(accounts.[].acct),(accounts.[card;customerName].customerName,accounts.[card;customerName].card.number)
Elasticsearch Virtual Graph Considerations
To create an Elasticsearch virtual graph, you need to download the Elasticsearch client jar
along with two supporting jars. The client jar for Elasticsearch version "x.y.z" can be obtained from
https://repo1.maven.org/maven2/org/elasticsearch/client/elasticsearch-rest-client/x.y.z/elasticsearch-rest-client-x.y.z.jar
Two supporting jars are also required: httpasyncclient and httpcore-nio (for example, httpasyncclient-4.1.2.jar and httpcore-nio-4.4.5.jar, as listed in the driver table in the Configuration section).
Then follow the instructions here.
Supported Field Types
Stardog supports the following Elasticsearch field types:
-
keyword
-
long
-
integer
-
short
-
byte
-
double
-
float
-
half_float
-
scaled_float
-
date
Note that only the keyword
data type is supported for strings. Strings indexed as text
cannot be mapped for the purpose of SPARQL/GraphQL querying.
Virtual Graph Mappings for Elasticsearch
To create virtual graph mappings for Elasticsearch, use SMS2
mapping syntax with the FROM JSON
clause.
Note: There are two types of mappings being discussed here: Stardog Virtual Graph mappings, which describe how Elasticsearch fields are mapped to RDF, and Elasticsearch mappings, which define a schema for an Elasticsearch index.
For an index named simple
with a single Elasticsearch mapping type named _doc
and an
Elasticsearch mapping like:
{
"simple" : {
"mappings" : {
"_doc" : {
"properties" : {
"StudentId" : {
"type" : "integer"
},
"StudentName" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
an example Stardog mapping could look like this:
prefix ex: <http://example.com/>
MAPPING <urn:example>
FROM JSON {
"simple._doc":{
"_id": "?docId",
"StudentId": "?id",
"StudentName.keyword": "?name"
}
}
TO {
?studentIri a ex:Student ;
rdfs:label ?name ;
ex:esId ?docId .
}
WHERE {
bind(template("http://example.com/{id}") as ?studentIri)
}
The top-level key in the FROM JSON
(simple._doc
) is formed by joining the index name and
the Elasticsearch mapping type with a period. This is similar to the schemaName.tableName
convention that is used for SQL databases. As a shorthand, for indexes with
only one mapping type, the mapping type can be omitted. In this example, simple._doc
can
be replaced with simple
assuming _doc
is the only mapping type.
Note: For Elasticsearch versions 6 and later, indexes are allowed only one mapping type, where the name of the mapping type defaults to _doc. For version 5 it is possible for an index to have more than one mapping type. See Removal of mapping types in the Elasticsearch online documentation for details.
Notice in the above example that the built-in _id
field is mapped. Stardog knows that the
_id
field is unique across all documents in the index and it uses this information to
simplify the queries it generates. Stardog is not able to determine the uniqueness of any
other fields but if you know certain fields (or combinations of fields) are unique you can
indicate which field sets are unique in the configuration options.
For example, suppose we know that StudentId
is in fact a unique field. We can tell Stardog so
by setting the unique.key.sets
configuration option:
unique.key.sets=(simple._doc.StudentId)
or if the simple
index has only the one mapping type:
unique.key.sets=(simple.StudentId)
Automatically Generating Mappings
Elasticsearch indexes have well-defined schemas. Stardog can use that schema information to
automatically generate virtual graph mappings to RDF. By default, the generated templates
for the IRIs will be based on the Elasticsearch mapping type names, which are _doc
for
all indexes on recent versions of Elasticsearch. This makes the IRIs difficult to
distinguish. To address this, Stardog defaults the schema.in.generated.mappings
configuration option to true
when generating virtual graph
mappings for Elasticsearch.
Apache Cassandra Virtual Graph Considerations
To create a Cassandra virtual graph, you’ll need to first download the
shaded
Cassandra client jar. The client jar for Cassandra version "x.y.z" can be obtained from
http://central.maven.org/maven2/com/datastax/cassandra/cassandra-driver-core/x.y.z/cassandra-driver-core-x.y.z-shaded.jar
Then follow the instructions here.
Cassandra is special in the way it attempts to prevent users from distributing queries
over a large number of server nodes. If you have experience with CQL queries, you have no
doubt seen the ubiquitous error message, Cannot execute this query as it might involve
data filtering and thus may have unpredictable performance. If you want to execute this
query despite the performance unpredictability, use ALLOW FILTERING
.
This reflects the
Cassandra modeling principle
that favors writing the same data to multiple tables (perhaps through the use of
Materialized Views), where each table is optimized for answering different queries.
In order to support as many queries as possible, we recommend creating mappings to each of
these tables and letting Stardog choose which mappings apply for each query. It is
possible that no mappings can support a particular query. In such cases, Stardog will
write an entry to the log file and return no results.
This is the default behavior, which can be changed by setting the
cassandra.allow.filtering
virtual graph option
to true. When set, Stardog will include the ALLOW FILTERING
clause at the end of each
CQL query. Please note that the use of this option is highly discouraged in large-scale
production environments.
Cassandra is also special for how SQL-like its query language is (for a NoSQL database).
As this is the case, Stardog supports the use of SQL queries in the mappings files for
Cassandra virtual graphs. That is, you can use the
rr:sqlQuery
predicate for R2RML
mappings, the sm:query
predicate
for Stardog Mapping Syntax, or the FROM SQL
clause for Stardog
Mapping Syntax 2. In all cases, you can supply a SQL query to describe a view to use for
a virtual graph mapping, however, the SQL query can only contain operators that are
supported in CQL - no joins, subqueries, SQL functions, etc. are allowed.
Stardog/SPARQL Virtual Graph Considerations
To create a Stardog Engine or SPARQL service virtual graph, the properties file should specify values for the following properties:
-
sparql.url
(SPARQL endpoint with database specified, for example,http://myhost:26023/testdb/query
) -
sparql.username
(username to access the SPARQL endpoint) -
sparql.password
(password to access the SPARQL endpoint) -
sparql.graphname
(the named graph on the remote endpoint to query) -
sparql.statsbasedoptimization
(boolean value to enable/disable statistics based optimization, only valid for Virtual Graphs to other Stardog servers)
Virtual graph mappings are not supported.
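A sketch of such a properties file, reusing the example endpoint above and with placeholder credentials:
sparql.url=http://myhost:26023/testdb/query
sparql.username=admin
sparql.password=admin
sparql.statsbasedoptimization=true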
Importing Text Files
The same Virtual Graph commands and mappings that are used for creating virtual graphs
can be used to import data from delimited (CSV or TSV) and JSON files. It is not truly
virtual, but is part of our Virtual Graph APIs and docs because it shares the same
mappings syntax.
The mappings files for importing text files must be expressed in
SMS2 (Stardog Mapping Syntax 2).
NOTE: Unlike all other Virtual Graph data sources, the WHERE
clause in SMS2 mappings
for text files supports any SPARQL function when BIND
-ing
transformed values to new variables. This includes unnest
.
Importing CSV Files
To import a CSV file, provide the file as the last argument to the import command:
$ stardog-admin virtual import myDB cars.sms cars.csv
If the input file is using different kind of separators, e.g. tab character, a properties file can be provided:
$ stardog-admin virtual import myDB cars.properties cars.sms cars.tsv
The properties file for CSVs can specify values for the following properties:
-
csv.separator
(character for separating fields) -
csv.quote
(used for strings that contain field separators) -
csv.escape
(character for escaping special characters) -
csv.header
(boolean value for specifying whether or not the input file has a header line at the beginning) -
csv.hash.function
(the hash function to use for fields prefixed with a#
)
The csv.escape
character is used as an alternative to the csv.quote
character. To
escape a csv.quote
character within a string that is enclosed in csv.quote
characters, use two consecutive csv.quote
characters. Do not set csv.escape
to the
csv.quote
character.
Note that whitespace characters in the Java properties file need to be escaped,
so if you want to import tab-separated value files, set csv.separator=\t
in the
properties file.
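For instance, a cars.properties file for the tab-separated import shown above might contain just:
csv.separator=\t
csv.header=true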
In addition to directly referencing columns by name, CSV mappings may include a special
?ROW_NUMBER
variable used to obtain the current line number.
The mappings for delimited files can be automatically generated given a couple additional properties:
-
csv.class
(indicate the class, orrdf:type
, to use for the subjects of each row) -
unique.key.sets
(the set of columns that uniquely identify each row)
To import with automatically generated mappings, omit the command line argument for the mappings file:
$ stardog-admin virtual import myDB cars.properties cars.csv
There is a complete example available in our examples repo.
Importing JSON Files
To import a JSON file, provide the file name as the final argument to the import command:
$ stardog-admin virtual import myDB bitcoin.sms bitcoin.json
Here is an example JSON file:
{
"hash": "00000000000000000028484e3ba77273ebd245f944e574e1d4038d9247a7ff8e",
"time": 1569266867591,
"block_index": 1762564,
"height": 575144,
"txIndexes": [
445123952,
445058113,
445054577,
445061250
]
}
and a corresponding SMS2 mapping:
PREFIX : <http://example.com/>
mapping
from json {
{
"hash" : "?hash",
"time" : "?time",
"block_index" : "?block_index",
"height" : "?height",
"txIndexes" : [ "?txIndex" ]
}
}
to {
?block a :Block ;
:hash ?hash ;
:time ?dateTime ;
:height ?height ;
:includesTx ?tx .
?tx a :Tx ;
:index ?txIndex .
}
where {
bind(xsd:dateTime(?time) as ?dateTime)
bind(template("http://example.com/tx/{txIndex}") as ?tx)
bind(template("http://example.com/block/{hash}") as ?block)
}
Note that, unlike from json mappings used with MongoDB, an SMS2 mapping for a JSON file has no MongoDB collection name serving as the key for the top-level object.
Configuration
Supported Client Drivers
To connect to your data sources, you must supply Stardog with the appropriate client driver. You need to manually copy the JAR file containing the driver to the stardog_install_dir/server/dbms/ directory (or to the location pointed to by the STARDOG_EXT environment variable if you have set one) so that it will be available to the Stardog server.
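For example, to make a PostgreSQL driver available (the jar file name here is illustrative):
$ cp postgresql-42.2.14.jar stardog_install_dir/server/dbms/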
Stardog is tested against all supported databases using the drivers listed below. Stardog requires a JDBC 4.2 compatible client driver. While other drivers may work, your mileage may vary. For best results, set the sql.dialect property when using an unsupported driver.
Database | Driver |
---|---|
Apache Hive | |
Apache/Cloudera Impala | https://www.cloudera.com/downloads/connectors/impala/jdbc/2-5-42.html |
AWS Athena | |
AWS Aurora | |
AWS Redshift | |
Cassandra | |
Derby | |
Elasticsearch | Three jars are required: elasticsearch-rest-client-6.7.1.jar, httpasyncclient-4.1.2.jar, httpcore-nio-4.4.5.jar |
Exasol | |
Google BigQuery | These jars from https://storage.googleapis.com/simba-bq-release/jdbc/SimbaJDBCDriverforGoogleBigQuery42_1.2.2.1004.zip: GoogleBigQueryJDBC42.jar, google-api-client-1.28.0.jar, google-auth-library-oauth2-http-0.13.0.jar, gax-1.42.0.jar, google-api-services-bigquery-v2-rev426-1.25.0.jar, avro-1.9.0.jar |
H2 | |
IBM DB2 | |
Microsoft SQL Server | |
MongoDB | |
MySQL & MariaDB | |
Oracle | https://www.oracle.com/technetwork/database/features/jdbc/default-2280470.html |
PostgreSQL | |
SAP HANA | |
Sybase ASE | |
Teradata | https://downloads.teradata.com/download/connectivity/jdbc-driver (version 16.20) |
Available Properties
The following table lists the available options for use in virtual graph
properties files. The first prefix indicates the type of datasource that the property
is used for. jdbc.
properties are used for all relational data sources.
Additionally, connection pool properties for the built-in Tomcat connection
pool are allowed. This set of additional allowed properties is listed in the
Tomcat
JDBC Connection Pool documentation. Stardog sets these connection pool defaults:
initialSize=3
, testWhileIdle=true
, timeBetweenEvictionRunsMillis=14400000
(4 hours), and validationQueryTimeout=10
.
Any options with the prefix ext.
will be passed directly to the JDBC Driver.
Any unknown options will be ignored.
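As an illustration, a relational data source might combine jdbc. options, a Tomcat pool property, and a pass-through driver option like this (all values are placeholders; ext.ssl is simply handed to the driver as ssl):
jdbc.url=jdbc:postgresql://localhost/dept
jdbc.username=pguser
jdbc.password=pgpass
jdbc.driver=org.postgresql.Driver
maxActive=20
ext.ssl=false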
Option | Default |
---|---|
|
|
Base IRI used to resolve relative IRIs from virtual graphs. |
|
|
|
The URL of the JDBC connection. |
|
|
|
The username used to make the JDBC connection. |
|
|
|
The password used to make the JDBC connection. |
|
|
|
The driver class name used to make the JDBC connection. |
|
|
|
A single-character separator used when importing tabular data files. |
|
|
|
A single character used to encapsulate values containing special characters. |
|
A single character used to escape values containing special characters. |
|
|
|
Should the import process read the header row? When headers are enabled the first row of the input file is used to retrieve the column names and mappings can refer to those column names. ( |
|
|
|
Should empty values be skipped in the CSV file? If |
|
Which class should imported rows from CSV files be members of? |
|
|
|
Which hash function should be used when the |
|
|
|
The URI for the MongoDB connection. Examples: |
|
|
|
Whitespace-delimited list of connection scheme://host:port values for Elasticsearch. Scheme defaults to http. Example: |
|
|
|
Username for Elasticsearch connections |
|
|
|
Password for Elasticsearch connections |
|
|
|
The address of the Cassandra node(s) that the driver uses to discover the cluster topology. Example: |
|
|
|
The Cassandra keyspace to use for this session |
|
|
|
The username for the Cassandra cluster |
|
|
|
The password for the Cassandra cluster |
|
|
|
Whether to include the |
|
|
|
Should IRI template strings be percent-encoded to be valid IRIs? ( |
|
|
|
Should |
|
|
|
If unspecified, R2RML views (using |
|
|
|
A comma-separated list of SQL function names to register with the parser. If an R2RML view (using |
|
For data sources that do not express unique constraints in their metadata, either because unique constraints are not supported or because the data source did not include some or all of the valid constraints for reasons such as performance concerns, this property is used to define additional constraints manually. The property value is a comma-separated list of keys that define unique rows in a table. Each key is itself a comma-separated list of schema-qualified columns, enclosed in parentheses. For example, if table |
|
|
Inferred from supported JDBC drivers. |
When using an unsupported JDBC driver, this option can be used to specify the format of the generated SQL. The options supported are |
|
|
|
A comma-separated list of schemas to append to the schema search path. This option allows R2RML tables and queries to reference tables that are outside of the default schema for the connected user. |
|
|
|
Override the default schema for the connected user. Tables in the default schema may be referenced without qualification ( |
|
|
|
A comma-separated list of tables to include when generating default mappings. If blank, mappings will be generated for all tables in the default schema for the connected user, plus any schemas listed in |
|
|
|
A comma-separated list of tables to exclude when generating default mappings. Mappings will be generated for all tables in the default schema for the connected user, plus any schemas listed in |
|
|
|
Whether to include the name of the schema (along with the table name) in the templates for IRIs when automatically generating mappings based on source database metadata. For Elasticsearch, setting this to true will cause the index name to be included in the template. |
|
SPARQL query endpoint/connection string with database specified, e.g. "http://myhost:26023/testdb/query"
|
The username to access the SPARQL endpoint. |
|
The password to access the SPARQL endpoint. |
|
The name of the graph on the SPARQL endpoint to be mapped as a virtual graph.
|
|
|
A boolean value to enable/disable statistics-based optimization when accessing the SPARQL endpoint. Enabled by default.
Mapping
Stardog Virtual Graphs support three mapping formats, but not all data sources support all formats. Moreover, one mapping format, SMS2, supports multiple source data models via the FROM clause. Please review the following table to understand your options.
Data Source | R2RML & SMS | SMS2 | Source Data Model | Functions in BIND Expressions |
---|---|---|---|---|
Relational | | | | |
* | Yes | Yes | SQL | Type casts |
NoSQL | | | | |
Apache Cassandra | Yes | Yes | SQL | Type casts |
Cosmos DB | No | Yes | JSON/GraphQL | Type casts |
MongoDB | No | Yes | JSON/GraphQL | Type casts |
Elasticsearch | No | Yes | JSON/GraphQL | Type casts |
Static Files | | | | |
CSV | Yes | Yes | CSV | Full Support |
JSON | No | Yes | JSON/GraphQL | Full Support |
Support for type cast functions in BIND expressions means only the template and datatype casting functions are available. Valid datatype casting functions include specific types, e.g. xsd:integer(?field), and the generic strdt() function, which can also accept a user-defined type. Full support makes all SPARQL functions available.
We recommend SMS2, which supports all features.
R2RML and SMS (Stardog Mapping Syntax)
R2RML is the W3C-recommended language for mapping
relational databases to RDF. For this reason all SMS mappings, and all SMS2 mappings
using FROM SQL
, can be converted to R2RML.
The Stardog Mapping Syntax (SMS) is an alternative way to write R2RML mappings
that is much simpler to read and write than R2RML.
We will use the example
database from the R2RML specification to explain SMS. The SQL schema that
corresponds to this example is:
CREATE TABLE "DEPT" (
"deptno" INTEGER UNIQUE,
"dname" VARCHAR(30),
"loc" VARCHAR(100));
INSERT INTO "DEPT" ("deptno", "dname", "loc")
VALUES (10, 'APPSERVER', 'NEW YORK');
CREATE TABLE "EMP" (
"empno" INTEGER PRIMARY KEY,
"ename" VARCHAR(100),
"job" VARCHAR(30),
"deptno" INTEGER REFERENCES "DEPT" ("deptno"),
"etype" VARCHAR(30));
INSERT INTO "EMP" ("empno", "ename", "job", "deptno", "etype" )
VALUES (7369, 'SMITH', 'CLERK', 10, 'PART_TIME');
Suppose we would like to represent this information in RDF using the same translation for job codes as in the original example:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix emp: <http://example.com/emp/> .
@prefix dept: <http://example.com/dept/> .
dept:10 a dept:Department ;
dept:location "NEW YORK" ;
dept:deptno "10"^^xsd:integer .
emp:7369 a emp:Employee ;
emp:name "SMITH" ;
emp:role emp:general-office ;
emp:department dept:10 .
SMS looks very similar to the output RDF representation:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix emp: <http://example.com/emp/> .
@prefix dept: <http://example.com/dept/> .
@prefix sm: <tag:stardog:api:mapping:> .
dept:{"deptno"} a dept:Department ;
dept:location "{\"loc\"}" ;
dept:deptno "{\"deptno\"}"^^xsd:integer ;
sm:map [
sm:table "DEPT" ;
] .
emp:{"empno"} a emp:Employee ;
emp:name "{\"ename\"}" ;
emp:role emp:{ROLE} ;
emp:department dept:{"deptno"} ;
sm:map [
sm:query """
SELECT \"empno\", \"ename\", \"deptno\", (CASE \"job\"
WHEN 'CLERK' THEN 'general-office'
WHEN 'NIGHTGUARD' THEN 'security'
WHEN 'ENGINEER' THEN 'engineering'
END) AS ROLE FROM \"EMP\"
""" ;
] .
SMS is based on Turtle, but it’s not valid Turtle since it uses the
URI templates of R2RML—curly braces
can appear in URIs. Other than this difference, we can treat an SMS document as
a set of RDF triples. SMS documents use the special namespace
tag:stardog:api:mapping:
that we will represent with the sm
prefix below.
Every subject in the SMS document that has a sm:map
property maps a
single row from the corresponding table/view to one or more triples. If an
existing table/view is being mapped, sm:table
is used to refer to the table.
Alternatively, a SQL query can be provided inline using the sm:query
property.
The values generated will be a URI, blank node, or a literal based on the type
of the value used in the mapping. The column names referenced between curly
braces will be replaced with the corresponding values from the matching row.
SMS can be translated to the standard R2RML syntax automatically by Stardog. For
completeness, we provide the R2RML mappings corresponding to the above example:
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix emp: <http://example.com/emp#> .
@prefix dept: <http://example.com/dept#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@base <http://example.com/base/> .
<DeptTriplesMap>
a rr:TriplesMap;
rr:logicalTable [ rr:tableName "DEPT" ];
rr:subjectMap [ rr:template "http://data.example.com/dept/{\"deptno\"}" ;
rr:class dept:Department ];
rr:predicateObjectMap [
rr:predicate dept:deptno ;
rr:objectMap [ rr:column "\"deptno\""; rr:datatype xsd:positiveInteger ]
];
rr:predicateObjectMap [
rr:predicate dept:location ;
rr:objectMap [ rr:column "\"loc\"" ]
].
<EmpTriplesMap>
a rr:TriplesMap;
rr:logicalTable [ rr:sqlQuery """
SELECT "EMP".*, (CASE "job"
WHEN 'CLERK' THEN 'general-office'
WHEN 'NIGHTGUARD' THEN 'security'
WHEN 'ENGINEER' THEN 'engineering'
END) AS ROLE FROM "EMP"
""" ];
rr:subjectMap [
rr:template "http://data.example.com/employee/{\"empno\"}";
rr:class emp:Employee
];
rr:predicateObjectMap [
rr:predicate emp:name ;
rr:objectMap [ rr:column "\"ename\"" ];
];
rr:predicateObjectMap [
rr:predicate emp:role;
rr:objectMap [ rr:template "http://data.example.com/roles/{ROLE}" ];
];
rr:predicateObjectMap [
rr:predicate emp:department;
rr:objectMap [ rr:template "http://example.com/dept/{\"deptno\"}"; ];
].
SMS2 (Stardog Mapping Syntax 2)
Stardog Mapping Syntax 2 (SMS2) is a way to represent virtual graph
mappings that is designed to support a broader range of source data formats than
R2RML and SMS, including semi-structured data sources such as JSON, MongoDB and
Elasticsearch, as well as structured formats like SQL RDBMS.
SMS2 is loosely based on the SPARQL CONSTRUCT
query. An abbreviated example looks
like this:
PREFIX : <http://stardog.com/movies/>
MAPPING <urn:movies>
FROM JSON {
"movie":{
"_id":"?movieId",
"name":"?name",
}
}
TO {
?movie a :Movie ;
:name ?name .
}
WHERE {
BIND (template("http://stardog.com/movies/Title_{movieId}") AS ?movie)
}
SMS2 consists of five parts: PROLOGUE, MAPPING, FROM, TO, and WHERE.
- The PROLOGUE is a series of prefix declarations at the beginning of the file.
- The MAPPING through WHERE clauses define a mapping, and the set of them can be repeated, separated by a semicolon.
  - The MAPPING clause consists of the MAPPING keyword followed by an optional IRI for naming the mapping.
  - The FROM clause describes the input. It starts with the FROM keyword, followed by a data format keyword (JSON in this case, but it can be JSON, CSV, GraphQL, or SQL) and a definition that describes the structure of the data and assigns fields to variable names.
  - The TO clause defines how the output RDF should look. It is analogous to the CONSTRUCT portion of the SPARQL CONSTRUCT query. It consists of a set of triples where variables can be used in any position.
  - The WHERE clause is where you can transform source data and BIND the transformed values to new variables. The currently supported functions for use within BIND are template for IRI construction and the cast functions (xsd:string, xsd:boolean, xsd:integer, xsd:float, xsd:double, xsd:decimal, xsd:dateTime, xsd:date) for literal value type conversion.
Notice there are no platform-specific query elements (such as MongoDB query syntax)
present in the mapping, only descriptions of the source and target data schemas
and transformations for mapping the relationship between the source and target.
To help illustrate SMS2, we’ll use the following JSON for a movie
collection
from a MongoDB database:
{
"_id":"unforgiven",
"name":"Unforgiven",
"datePublished":new Date("1992-08-07T00:00:00.000Z"),
"genre":["Drama", "Western"],
"boxOffice":101157447,
"description":"Retired gunslinger reluctantly takes on one last job.",
"director":[
{"director":"clintEastwood", "name":"Clint Eastwood"}
],
"actor":[
{"actor":"morganFreeman", "name":"Morgan Freeman"},
{"actor":"clintEastwood", "name":"Clint Eastwood"},
{"actor":"geneHackman", "name":"Gene Hackman"}
]
}
{
"_id":"noWayOut",
"name":"No Way Out",
"datePublished":new Date("1987-08-14T00:00:00.000Z"),
"genre":["Action", "Mystery", "Drama", "Thriller"],
"boxOffice":35509515,
"description":"A coverup and witchhunt occur after a politician accidentally kills his mistress.",
"director":[
{"director":"rogerDonaldson", "name":"Roger Donaldson"}
],
"actor":[
{"actor":"geneHackman", "name":"Gene Hackman"},
{"actor":"kevinCostner", "name":"Kevin Costner"}
]
}
For this example we’ll create mappings that represent the data as this RDF:
@prefix : <http://stardog.com/movies/> .
:Title_noWayOut a :Movie ;
:name "No Way Out" ;
:datePublished "1987-08-14"^^xsd:date ;
:boxOffice 35509515 ;
:description "A coverup and witchhunt occur after a politician accidentally kills his mistress." ;
:genre "Action", "Mystery", "Drama", "Thriller" ;
:directed :Job_noWayOut_rogerDonaldson ;
:actedIn :Job_noWayOut_geneHackman, :Job_noWayOut_kevinCostner .
:Title_unforgiven a :Movie ;
:name "Unforgiven" ;
:datePublished "1992-08-07"^^xsd:date ;
:boxOffice 101157447 ;
:description "Retired gunslinger reluctantly takes on one last job." ;
:genre "Drama", "Western" ;
:directed :Job_unforgiven_clintEastwood ;
:actedIn :Job_unforgiven_morganFreeman, :Job_unforgiven_clintEastwood, :Job_unforgiven_geneHackman .
:Job_noWayOut_rogerDonaldson a :DirectedMovie ;
:name "Roger Donaldson" ;
:director :Name_rogerDonaldson .
:Name_rogerDonaldson a :Person .
:Job_unforgiven_clintEastwood a :DirectedMovie ;
:name "Clint Eastwood" ;
:director :Name_clintEastwood .
:Job_unforgiven_clintEastwood a :ActedInMovie ;
:name "Clint Eastwood" ;
:actor :Name_clintEastwood .
:Name_clintEastwood a :Person .
:Job_noWayOut_geneHackman a :ActedInMovie ;
:name "Gene Hackman" ;
:actor :Name_geneHackman .
:Job_unforgiven_geneHackman a :ActedInMovie ;
:name "Gene Hackman" ;
:actor :Name_geneHackman .
:Name_geneHackman a :Person .
:Job_noWayOut_kevinCostner a :ActedInMovie ;
:name "Kevin Costner" ;
:actor :Name_kevinCostner .
:Name_kevinCostner a :Person .
:Job_unforgiven_morganFreeman a :ActedInMovie ;
:name "Morgan Freeman" ;
:actor :Name_morganFreeman .
:Name_morganFreeman a :Person .
Notice there are many IRIs that contain both Movie and Person ids.
These scoped IRIs are redundant in this dataset but they serve a purpose
when working with denormalized datasources, which is common in NoSQL
databases like MongoDB. In this dataset, the name of a Person can appear
in both actor and director objects. The name is repeated for every directing or
acting job that Person has had. There is no guarantee that a Person’s name
is constant across all their jobs, either because the field reflects the name
the person had at the time of the job, or because of a problem during an update
that led to the inconsistency. Without IRIs that scope a Person to a specific
Movie, when you query for the Person’s name, the correct response is a record for
every Person/name pair, which can be an expensive query. See the blog post
Mapping Denormalized Data
for more details.
Here is the SMS2 mapping for this exercise:
PREFIX : <http://stardog.com/movies/>
MAPPING <urn:movies>
FROM JSON {
"movie":{
"_id":"?movieId",
"name":"?name",
"datePublished":"?datePublished",
"genre":["?genre"],
"boxOffice":"?boxOffice",
"description":"?description",
"director":[ {
"director":"?directorId",
"name":"?directorName"
}
],
"actor":[ {
"actor":"?actorId",
"name":"?actorName"
}
]
}
}
TO {
?movie a :Movie ;
:name ?name ;
:datePublished ?xsdDatePublished ;
:genre ?genre ;
:boxOffice "?boxOffice"^^xsd:integer ;
:description ?description ;
:directed ?directedMovie ;
:actedIn ?actedInMovie .
?directedMovie a :DirectedMovie ;
:director ?director ;
:name ?directorName .
?director a :Person .
?actedInMovie a :ActedInMovie ;
:actor ?actor ;
:name ?actorName .
?actor a :Person .
}
WHERE {
BIND (template("http://stardog.com/movies/Job_{movieId}_{directorId}") AS ?directedMovie)
BIND (template("http://stardog.com/movies/Job_{movieId}_{actorId}") AS ?actedInMovie)
BIND (template("http://stardog.com/movies/Title_{movieId}") AS ?movie)
BIND (template("http://stardog.com/movies/Name_{directorId}") AS ?director)
BIND (template("http://stardog.com/movies/Name_{actorId}") AS ?actor)
BIND (xsd:date(?datePublished) AS ?xsdDatePublished)
}
Details of the various FROM formats follow.
FROM JSON
The structure of the FROM JSON clause resembles the source JSON structure with some changes:
- Values are replaced by variable names.
- Arrays contain a single element.
- Only one JSON document is supplied.
- (MongoDB and Cosmos only) There is an outermost key to indicate the name of the collection (movie).
Fields are interpreted as strings unless given a specific data type, by using a
cast function in either the WHERE
clause as illustrated in the example with the
datePublished
field, or directly in the TO
clause as illustrated by the
boxOffice
field.
See the example directly above which uses FROM JSON
in a mapping file for MongoDB.
FROM CSV
FROM CSV
is used when importing CSV or TSV delimited files.
There is no content in the FROM CSV clause. Either a set of empty braces can follow FROM CSV, or the braces can be omitted.
FROM GraphQL
The FROM GraphQL
definition is an alternative format for hierarchical data. It
is a Selection Set
consisting of Fields,
which can be aliased and can contain nested selection sets. By default, each field
will be mapped to a variable with the same name as the field. If the field is aliased
the alias will serve as the variable name. To identify an array, use an @array
directive.
The following mapping uses a FROM GraphQL
clause to produce the same results as our prior
example that used a FROM JSON
clause.
A noteworthy difference between FROM GraphQL and FROM JSON is the order in which source names are replaced with target names. In FROM JSON you reference the value associated with each "_id" attribute by specifying "_id":"?movieId"; the variable is on the right. In FROM GraphQL you do the same by specifying movieId: _id; the variable is on the left.
PREFIX : <http://stardog.com/movies/>
MAPPING <urn:movies>
FROM GraphQL {
movie {
movieId: _id
name
datePublished
genre @array
boxOffice
description
director @array {
directorId: director
directorName: name
}
actor @array {
actorId: actor
actorName: name
}
}
}
TO {
?movie a :Movie ;
:name ?name ;
:datePublished ?xsdDatePublished ;
:genre ?genre ;
:boxOffice "?boxOffice"^^xsd:integer ;
:description ?description ;
:directed ?directedMovie ;
:actedIn ?actedInMovie .
?directedMovie a :DirectedMovie ;
:director ?director ;
:name ?directorName .
?director a :Person .
?actedInMovie a :ActedInMovie ;
:actor ?actor ;
:name ?actorName .
?actor a :Person .
}
WHERE {
BIND (template("http://stardog.com/movies/Job_{movieId}_{directorId}") AS ?directedMovie)
BIND (template("http://stardog.com/movies/Job_{movieId}_{actorId}") AS ?actedInMovie)
BIND (template("http://stardog.com/movies/Title_{movieId}") AS ?movie)
BIND (template("http://stardog.com/movies/Name_{directorId}") AS ?director)
BIND (template("http://stardog.com/movies/Name_{actorId}") AS ?actor)
BIND (xsd:date(?datePublished) AS ?xsdDatePublished)
}
Note how an array of primitives like genre
has the @array
directive while an
array of objects has the @array
directive followed by a selection set. If we wished
to map, say, the genre
field to a genres
variable, we would use an alias, giving this
complete line for the genre field: genres: genre @array
.
FROM SQL
The third option for the FROM
clause is FROM SQL
, which is for RDBMS datasources and Cassandra. It
differs from the JSON
and GraphQL
source template formats in that for SQL
we
provide a query in place of a data description. Stardog will interrogate the database
schema to determine the field names (which will become variable names) to use for mapping.
To explain the FROM SQL
format, recall the SMS mapping from above:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix emp: <http://example.com/emp/> .
@prefix dept: <http://example.com/dept/> .
@prefix sm: <tag:stardog:api:mapping:> .
dept:{"deptno"} a dept:Department ;
dept:location "{\"loc\"}" ;
dept:deptno "{\"deptno\"}"^^xsd:integer ;
sm:map [
sm:table "DEPT" ;
] .
emp:{"empno"} a emp:Employee ;
emp:name "{\"ename\"}" ;
emp:role emp:{ROLE} ;
emp:department dept:{"deptno"} ;
sm:map [
sm:query """
SELECT \"empno\", \"ename\", \"deptno\", (CASE \"job\"
WHEN 'CLERK' THEN 'general-office'
WHEN 'NIGHTGUARD' THEN 'security'
WHEN 'ENGINEER' THEN 'engineering'
END) AS ROLE FROM \"EMP\"
""" ;
] .
The SMS2 equivalent of this mapping looks like this:
PREFIX emp: <http://example.com/emp/>
PREFIX dept: <http://example.com/dept/>
MAPPING <urn:departments>
FROM SQL {
SELECT * FROM "DEPT"
}
TO {
?deptIri a dept:Department ;
dept:location ?loc ;
dept:deptno "?deptno"^^xsd:integer .
}
WHERE {
BIND (template("http://example.com/dept/{deptno}") AS ?deptIri)
}
;
MAPPING <urn:employees>
FROM SQL {
SELECT \"empno\", \"ename\", \"deptno\", (CASE \"job\"
WHEN 'CLERK' THEN 'general-office'
WHEN 'NIGHTGUARD' THEN 'security'
WHEN 'ENGINEER' THEN 'engineering'
END) AS ROLE FROM \"EMP\"
}
TO {
?empIri a emp:Employee ;
emp:name ?ename ;
emp:role ?roleIri ;
emp:department ?deptIri .
}
WHERE {
BIND (template("http://example.com/emp/{empno}") AS ?empIri)
BIND (template("http://example.com/dept/{deptno}") AS ?deptIri)
BIND (template("http://example.com/emp/{ROLE}") AS ?roleIri)
}
Note the use of the semicolon to separate multiple mappings, which is necessary here because we needed two separate SQL queries.
How To Use Virtual Graphs
Connect
To query a non-materialized Virtual Graph it must first be registered with Stardog. Adding a new virtual graph is done via the following command:
$ stardog-admin virtual add dept.properties dept.ttl
When adding a Virtual Graph Stardog will establish a connection to the data
source to verify the provided configuration and mappings.
Properties file
The properties file (dept.properties
in this example) contains all of the configuration for the JDBC data source and
virtual graph configuration. It must be in the Java properties file format.
A minimal example (in this case, for MySQL) looks like this:
jdbc.url=jdbc:mysql://localhost/dept
jdbc.username=MySqlUserName
jdbc.password=MyPassword
jdbc.driver=com.mysql.jdbc.Driver
Important: Stardog does not ship with client drivers. You must add drivers for each data source you want to connect to. See Supported Client Drivers for more information.
The credentials for the JDBC connection need to be provided in plain text. An alternative way to provide credentials is to use the password file mechanism. The credentials should be stored in a password file called services.sdpass located in the STARDOG_HOME directory. The password file entries are in the format hostname:port:database:username:password, so for the above example there should be an entry localhost:*:dept:MySqlUserName:MyPassword in this file. The credentials in the properties file can then be omitted.
The properties file can also contain a property called base
to specify a
base URI for resolving relative URIs
generated by the mappings (if any). If no value is provided, the base URI will
be virtual://myGraph
where myGraph
is the name of the virtual graph.
Mapping file
The mapping file (dept.ttl
in this example) contains the mapping from the virtual data source into RDF. The mapping can be in one of three formats:
- SMS, which is the default
- Standard R2RML, which is indicated using --format r2rml
- SMS2 (Stardog Mapping Syntax 2), a syntax that better supports hierarchical datasources like JSON and MongoDB. This is indicated using --format sms2
A mapping file is required for data sources without a built-in schema, e.g. some NoSQL databases like MongoDB.
A mapping file is not required if your data has a built-in schema, e.g. MySQL or other relational databases. In this case you can omit the mapping file and the virtual graph will be automatically mapped using R2RML direct mapping. Omitting a mapping file is most commonly used with one or both of the options default.mapping.include.tables and sql.schemas to indicate the specific tables to include.
List Registered VGs and Inspect Mappings / Properties
Registered virtual graphs can be listed:
$ stardog-admin virtual list
+----------------|----------|--------+
| Virtual Graphs | Database | Online |
+----------------|----------|--------+
| virtual://dept | *        | true   |
+----------------|----------|--------+
1 virtual graphs
Notice the * in the Database column of the output of the virtual list command. This indicates that the dept virtual graph can be used with any database. To associate a virtual graph with a specific database, use the -d <db> or --database <db> command-line option with the virtual add command.
If a virtual graph fails to load during startup it will be listed as offline (Online
false
). Use the virtual online
command to retry loading
an offline virtual graph.
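For example, to retry loading the dept virtual graph (a sketch of the command described above):
$ stardog-admin virtual online dept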
The commands virtual mappings
and
virtual options
can be used to retrieve the mappings and
configuration options associated with a virtual graph, respectively. Registered virtual
graphs can be removed using the virtual remove
command.
See the Man Pages for the details of these commands.
Query
Querying Virtual Graphs is done with the GRAPH clause, using a special graph URI of the form virtual://myGraph to query the Virtual Graph named myGraph.
The following example shows how to query dept
:
SELECT * {
GRAPH <virtual://dept> {
?person a emp:Employee ;
emp:name "SMITH"
}
}
Virtual graphs can be defined globally in Stardog Server, which is the default, or they
can be linked to a specific database when they are created (see
here). If a virtual
graph is linked to a specific database, it can only be accessed from that
database. Attempts to access a linked virtual graph from some other database
will result in no data being returned from that virtual graph.
Once a virtual graph is registered, it can be accessed as allowed by the
access rules.
We can query the local Stardog database and virtual graph’s remote data in a single
query. Suppose we have the dept
virtual graph, defined as above, that contains
employee and department information, and the Stardog database contains data
about the interests of people. We can use the following query to combine the
information from both sources:
SELECT * {
GRAPH <virtual://dept> {
?person a emp:Employee ;
emp:name "SMITH" .
}
?person foaf:interest ?interest
}
Or, with Virtual Transparency enabled, the following query will include remote data from the virtual graph as well as from the default graph.
SELECT * {
?person a emp:Employee ;
emp:name "SMITH" .
?person foaf:interest ?interest
}
Note: Query performance will be best if the GRAPH clause for Virtual Graphs is as selective as possible.
Virtual Graph queries are implemented by executing a query against the remote data source. This is a powerful feature and care must be taken to ensure peak performance. SPARQL and SQL don’t have feature parity, especially given the varying capabilities of SQL implementations. Stardog’s query translator supports most of the salient features of SPARQL including:
- Arbitrarily nested subqueries (including solution modifiers)
- Aggregation
- FILTER (including most SPARQL functions)
- OPTIONAL, UNION, BIND
That said, there are also limitations on translated queries. This includes:
- SPARQL MINUS is not currently translated to SQL
- Comparisons between objects with different datatypes don’t always follow XML Schema semantics
- Named graphs in R2RML are not supported
Import
In some cases you need to materialize the information stored in RDBMS directly
into RDF. For example, a combination of high network latency, slow-changing
data, and strict query performance requirements can make materialization a good
fit.
There is a command virtual import
that can be used
to import the contents of the RDBMS into Stardog. The command can be used as follows:
$ stardog-admin virtual import myDb dept.properties dept.ttl
This command adds all the mapped triples from the RDBMS into the default graph.
Similar to virtual add
, this command assumes
Stardog Mapping Syntax by default and can accept
R2RML mappings using the --format r2rml
option or
Stardog Mapping Syntax 2 mappings using the --format sms2
option.
It is also possible to specify a target named graph by using the
-g
/--namedGraph
option:
$ stardog-admin virtual import -g http://example.com/targetGraph myDb dept.properties dept.ttl
This virtual import
command is equivalent to the
following SPARQL update query:
ADD <virtual://dept> TO <http://example.com/targetGraph>
If the RDBMS contents change over time and we need to update the materialization results in the future, we can clear the named graph contents and rematerialize. This can be done by using the --remove-all option of virtual import or with the following SPARQL update query:
COPY <virtual://dept> TO <http://example.com/targetGraph>
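The equivalent CLI invocation would look roughly like this (a sketch combining the options mentioned above):
$ stardog-admin virtual import --remove-all -g http://example.com/targetGraph myDb dept.properties dept.ttl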
Query performance over materialized graphs will be better because the data is indexed locally by Stardog, but materialization may not be practical when the data changes very frequently.
Secure
To manage virtual graphs, the user must be granted access to the virtual-graph security resource type (see Security). The create permission is required to add a virtual graph, the delete permission is needed to either remove a virtual graph or to add one with the -o or --overwrite option, and the read permission is required for all other management commands such as options or mappings.
Accessing virtual graphs is controlled the same way as regular named graphs as
explained in the Named Graph Security section. If named graph security is
not enabled for a database, all registered virtual graphs in the server will
be accessible through that database. If named graph security is enabled for a
database, then users will be able to query only the virtual graphs for which
they have been granted access.
If the virtual graphs contain any sensitive information, then it is recommended
to enable named graph security globally by setting security.named.graphs=true
in stardog.properties
. Otherwise creating a new database without proper
configuration would allow users to access those virtual graphs.
The Named Graph Security settings apply to virtual graphs regardless of the manner in which they are accessed. The following three queries are identical, with one exception: when permissions are insufficient, attempts to access a virtual graph using the SERVICE keyword result in an error, while queries that use the GRAPH or FROM keywords treat the virtual graphs as empty and return no results, without error.
SELECT * {
GRAPH <virtual://dept> {
?person a emp:Employee ;
emp:name "SMITH"
}
}
SELECT * FROM <virtual://dept> {
?person a emp:Employee ;
emp:name "SMITH"
}
SELECT * {
SERVICE <virtual://dept> {
?person a emp:Employee ;
emp:name "SMITH"
}
}
Troubleshooting Virtual Graphs
The most common problem with virtual graphs is configuring the mappings correctly. Each database has different casing behavior, quoting characters, etc. The following features can be helpful when troubleshooting these issues.
Enable Debug Logging
Stardog uses log4j2 for logging. You can enable debug logging at the
virtual graph level by adding this line to the Loggers
section of your log4j2.xml
file:
<Logger name="com.complexible.stardog.virtual" level="DEBUG" additivity="false">
<AppenderRef ref="stardogAppender"/>
</Logger>
This will increase the logging to the stardog.log
file (but not to the console).
Metadata Inspection Tool
You can use the metadata inspection tool to retrieve the schema and column names as they appear to Stardog when connecting using the virtual graph connection properties (driver name, connection string, credentials, etc.). The inspection tool is accessed using the command-line interface (CLI). The documentation is accessed using the help command:
stardog-admin help source_metadata
Unstructured Data (BITES)
Unifying unstructured data is, by necessity, a different process from unifying structured or semistructured data. Stardog includes a document storage subsystem called BITES[12], which provides configurable storage and processing for unifying unstructured data with the Stardog graph. The following figure shows the main BITES components:

Storage
BITES allows storage and retrieval of documents in the form of files. Stardog treats documents as opaque blobs of data; it defers to the extraction process to make sense of individual documents. Document storage is independent of file and data formats.
Stardog internally stores documents as files. The location of these files
defaults to a subdirectory of STARDOG_HOME
but this can be overridden.
Documents can be stored on a local filesystem (or an abstraction thereof) accessible from the Stardog server, or on Amazon S3 by setting the docs.filesystem.uri configuration option. The exact location is given by the docs.path configuration option.
Structured Data Extraction
BITES supports an optional processing stage in which a document is processed to extract an RDF graph to add to the database. BITES has the following built-in RDF extractors:
- tika: This extractor is based on Apache Tika; it collects metadata about the document and asserts this set of RDF statements to a named graph specific to the document.
- text: Adds an RDF statement with the full text extracted from the document. A side effect of this extractor is that a document’s text will be indexed by the search index twice: once for the document itself and once for the value of this RDF statement.
- entities: This extractor uses OpenNLP to extract all the mentions of named entities from the document and adds this information to the document named graph.
- linker: This extractor works just like entities, but after it finds a named entity mention in the document it also finds the entity in the database that best matches that mention.
- dictionary: Similar to linker, but uses a user-provided dictionary that maps named entity mentions to IRIs.
- CoreNLPEntityLinkerRDFExtractor, CoreNLPMentionRDFExtractor, and CoreNLPRelationRDFExtractor, available through the bites-corenlp repository.
See Entity Extraction and Linking section for more details about some of these extractors.
Text Extraction
The document store is fully integrated with Stardog’s Search. As
with RDF extraction, text extraction supports arbitrary file formats and
pluggable extractors are able to retrieve the textual contents of a document for
indexing. Once a document is added to BITES, its contents can be searched in the
same way as other literals using the standard textMatch
predicate in SPARQL
queries.
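For example, a sketch of such a query, assuming the standard textMatch predicate IRI, would bind matching documents like this:
SELECT ?doc {
  ?doc <tag:stardog:api:property:textMatch> "gunslinger"
}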
Managing Documents
CRUD operations on documents can be performed from the command line, Java API or HTTP API. Please refer to the StardocsConnection API for details of using the document store from Java.
The following is an example session showing how to manage documents from the command line:
# We have a document stored in the file `whyfp90.pdf' which we will add to the document store
$ ls -al whyfp90.pdf
-rw-r--r-- 1 user user 200007 Aug 30 09:46 whyfp90.pdf
# We add it to the document store and receive the document's IRI as a return value
$ bin/stardog doc put myDB whyfp90.pdf
Successfully put document in the document store: tag:stardog:api:docs:myDB:whyfp90.pdf
# Adding the same document again will delete all previous extraction results and insert new ones.
# By setting the correct argument, previous assertions will be kept, and new ones appended.
$ bin/stardog doc put myDB --keep-assertions -r text whyfp90.pdf
Successfully put document in the document store: tag:stardog:api:docs:myDB:whyfp90.pdf
# Alternatively, we can add it with a different name. Repeated calls
# will update the document and refresh extraction results
$ bin/stardog doc put myDB --name why-functional-programming-matters.pdf whyfp90.pdf
Successfully put document in the document store: tag:stardog:api:docs:myDB:why-functional-programming-matters.pdf
# We can subsequently retrieve documents and store them locally
$ bin/stardog doc get myDB whyfp90.pdf
Wrote document 'whyfp90.pdf' to file 'whyfp90.pdf'
# Local files will not be overwritten
$ bin/stardog doc get myDB whyfp90.pdf
File 'whyfp90.pdf' already exists. You must remove it or specify a different filename.
# How many documents are in the document store?
$ bin/stardog doc count myDB
Count: 2 documents
# Removing a document will also clear its named graph and full-text search index entries
$ bin/stardog doc delete myDB whyfp90.pdf
Successfully executed deletion.
# Re-indexing the docstore allows applying a different RDF or text extractor
# to all the documents, refreshing extraction results
$ bin/stardog doc reindex myDB -r entities
"Re-indexed 1 documents"
See the Man Pages for more details about the CLI commands.
Named Graphs and Document Queries
Documents in BITES are identified by IRI. As shown in the command line examples
above, the IRI is returned from a document put
call. The IRI is a combination
of a prefix, the database name, and the document name. The CLI uses the document
name to refer to the documents. The RDF index, and therefore SPARQL queries, use
the IRIs to refer to the documents. RDF assertions extracted from a document are
placed into a named graph identified by the document’s IRI.
Here we can see the results of querying a document’s named graph when using the default metadata extractor:
$ stardog query execute myDB "select ?p ?o { graph <tag:stardog:api:docs:myDB:whyfp90.pdf> { ?s ?p ?o } }"
+--------------------------------------------+--------------------------------------+
| p | o |
+--------------------------------------------+--------------------------------------+
| rdf:type | http://xmlns.com/foaf/0.1/Document |
| rdf:type | tag:stardog:api:docs:Document |
| tag:stardog:api:docs:fileSize | 200007 |
| http://purl.org/dc/elements/1.1/identifier | "whyfp90.pdf" |
| rdfs:label | "whyfp90.pdf" |
| http://ns.adobe.com/pdf/1.3/PDFVersion | "1.3" |
| http://ns.adobe.com/xap/1.0/CreatorTool | "TeX" |
| http://ns.adobe.com/xap/1.0/t/pg/NPages | 23 |
| http://purl.org/dc/terms/created | "2006-05-19T13:42:00Z"^^xsd:dateTime |
| http://purl.org/dc/elements/1.1/format | "application/pdf; version=1.3" |
| http://ns.adobe.com/pdf/1.3/encrypted | "false" |
+--------------------------------------------+--------------------------------------+
Query returned 11 results in 00:00:00.045
Entity Extraction and Linking
BITES, by default, uses the tika
RDF extractor that only extracts metadata from documents. Stardog
can be configured to use the OpenNLP library to detect named entities
mentioned in documents and optionally link those mentions to existing resources in the database.
Stardog can also be configured to use Stanford’s CoreNLP library for entity extraction, linking, and relationship extraction. More information is available in the bites-corenlp repository.
The first step to use entity extractors is to identify the set of OpenNLP models that will be used. The following models are always required:
- A tokenizer and sentence detector. OpenNLP provides models for several languages (e.g., en-token.bin and en-sent.bin).
- At least one name finder model. Stardog supports both dictionary-based and custom trained models. OpenNLP provides models for several types of entities and languages (e.g., en-ner-person.bin). We provide our own name finder models created from Wikipedia and DBPedia, which provide high recall / low precision in identifying Person, Organization, and Location types from English language documents.
All these files should be put in the same directory and, after or during
database creation time, the configuration option docs.opennlp.models.path
should be set to its location.
For example, suppose you have a folder /data/stardog/opennlp
with files en-token.bin
, en-sent.bin
, and en-ner-person.bin
.
The database creation command would be as follows:
$ stardog-admin db create -o docs.opennlp.models.path=/data/stardog/opennlp -n movies
For consistency, model filenames should follow specific patterns:
- *-token.bin for tokenizers (e.g., en-token.bin)
- *-sent.bin for sentence detectors (e.g., en-sent.bin)
- *-ner-*.dict for dictionary-based name finders (e.g., dbpedia-en-ner-person.dict)
- *-ner-*.bin for custom trained name finders (e.g., wikipedia-en-ner-organization.bin)
Entities
The entities
extractor detects the mentions of named entities based on the configured models and creates
RDF statements for those entities. When we are putting a document we need to specify that we want to use a
non-default extractor. We can use both the tika
metadata extractor and the entities
extractor at the
same time:
$ stardog doc put --rdf-extractors tika,entities movies CastAwayReview.pdf
The result of entity extraction will be in a named graph where an auto-generated IRI is used for the entity:
<tag:stardog:api:docs:movies:CastAwayReview.pdf> {
<tag:stardog:api:docs:entity:9ad311b4-ddf8-4da2-a49f-3fa8f79813c2> rdfs:label "Wilson" .
<tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" .
<tag:stardog:api:docs:entity:e559b828-714f-407d-aa73-7bdc39ee8014> rdfs:label "Robert Zemeckis" .
}
Linker
The linker
extractor performs the same task as entities
but after the entities are extracted
it links those entities to the existing resources in the database. Linking is done by matching the
mention text with the identifier and labels of existing resources in the database. This extractor requires the
search feature to be enabled to find the matching candidates and uses string similarity metrics to choose
the best match. The commonly used properties for labels are supported: rdfs:label
, foaf:name
, dc:title
,
skos:prefLabel
and skos:altLabel
.
$ stardog doc put --rdf-extractors linker movies CastAwayReview.pdf
The extraction results of linker
will be similar to entities
, but only contain existing resources
for which a link was found. The link is available through the dc:references
property.
<tag:stardog:api:docs:movies:CastAwayReview.pdf> {
<tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" ;
<http://purl.org/dc/terms/references> <http://www.imdb.com/name/nm0000158> .
<tag:stardog:api:docs:entity:e559b828-714f-407d-aa73-7bdc39ee8014> rdfs:label "Robert Zemeckis" ;
<http://purl.org/dc/terms/references> <http://www.imdb.com/name/nm0000709> .
}
Dictionary
The dictionary extractor fulfills the same purpose as the linker, but instead of heuristically trying to match a mention’s text with existing resources, it uses a user-defined dictionary to perform that task.
The dictionary provides a set of mappings between text and IRIs. Each mention found in the document will be
searched in the dictionary and, if found, the IRIs will be added as dc:references
links.
Dictionaries are .linker
files, which need to be available in the docs.opennlp.models.path
folder.
Stardog provides several dictionaries
created from Wikipedia and DBPedia, which allow users to automatically link entity mentions to IRIs in
those knowledge bases.
$ stardog doc put --rdf-extractors dictionary movies CastAwayReview.pdf
When using the dictionary
option, all .linker
files in the docs.opennlp.models.path
folder will be used.
The output follows the same syntax as the linker
.
<tag:stardog:api:docs:movies:CastAwayReview.pdf> {
<tag:stardog:api:docs:entity:0d25b4ed-9cd4-4e00-ac3d-f984012b67f5> rdfs:label "Tom Hanks" ;
<http://purl.org/dc/terms/references> <http://en.wikipedia.org/wiki/Tom_Hanks> ;
<http://purl.org/dc/terms/references> <http://dbpedia.org/resource/Tom_Hanks> .
}
User-defined dictionaries can be created programmatically. For example, the Java class below will create a
dictionary that links every mention of Tom Hanks
to two IRIs.
import java.io.File;
import java.io.IOException;
import com.complexible.stardog.docs.nlp.impl.DictionaryLinker;
import com.google.common.collect.ImmutableMultimap;
import com.stardog.stark.model.IRI;
import static com.stardog.stark.Values.iri;
public class CreateLinker {
public static void main(String[] args) throws IOException {
ImmutableMultimap<String, IRI> aDictionary = ImmutableMultimap.<String, IRI>builder()
.putAll("Tom Hanks", iri("https://en.wikipedia.org/wiki/Tom_Hanks"), iri("http://www.imdb.com/name/nm0000158"))
.build();
DictionaryLinker.Linker aLinker = new DictionaryLinker.Linker(aDictionary);
aLinker.to(new File("/data/stardog/opennlp/TomHanks.linker"));
}
}
SPARQL
The entities, linker, and dictionary extractors are also available as a SPARQL service, which makes them applicable to any data in the graph, whether stored directly in Stardog or accessed remotely on SPARQL endpoints or virtual graphs.
prefix docs: <tag:stardog:api:docs:>
select * {
?review :content ?text
service docs:entityExtractor {
[] docs:text ?text ;
docs:mention ?mention
}
}
The entities
extractor is accessed by using the docs:entityExtractor
service, which receives one input argument, docs:text
, with the text to be analyzed.
The output will be the extracted named entity mentions, bound to the variable given in the docs:mention
property.
+-----------------------------------------------------------------------------------+------------------+---------------+
| text | mention | review |
+-----------------------------------------------------------------------------------+------------------+---------------+
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Robert Zemeckis"| :MovieReview |
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Tom Hanks" | :MovieReview |
| "Directed by Robert Zemeckis, featuring Tom Hanks and a volleyball called Wilson" | "Wilson" | :MovieReview |
+-----------------------------------------------------------------------------------+------------------+---------------+
By adding an extra output variable, docs:entity
, the linker
extractor will be used instead.
prefix docs: <tag:stardog:api:docs:>
select * {
?review :content ?text
service docs:entityExtractor {
[] docs:text ?text ;
docs:mention ?mention ;
docs:entity ?entity
}
}
+-------------------------+------------------+----------------+---------------+
| text | mention | entity | review |
+-------------------------+------------------+----------------+---------------+
| "Directed by Robert..." | "Tom Hanks" | imdb:nm0000158 | :MovieReview |
| "Directed by Robert..." | "Robert Zemeckis"| imdb:nm0000709 | :MovieReview |
+-------------------------+------------------+----------------+---------------+
The dictionary
extractor is called in a similar way to linker
, with an extra argument docs:mode
set to docs:Dictionary
.
prefix docs: <tag:stardog:api:docs:>
select * {
?review :content ?text
service docs:entityExtractor {
[] docs:text ?text ;
docs:mention ?mention ;
docs:entity ?entity ;
docs:mode docs:Dictionary
}
}
+-------------------------+------------------+---------------------+---------------+
| text | mention | entity | review |
+-------------------------+------------------+---------------------+---------------+
| "Directed by Robert..." | "Tom Hanks" | imdb:nm0000158 | :MovieReview |
| "Directed by Robert..." | "Tom Hanks" | wikipedia:Tom_Hanks | :MovieReview |
+-------------------------+------------------+---------------------+---------------+
All extractors accept one more output variable, docs:type
, which will output the type of entity (e.g., Person, Organization),
when available.
prefix docs: <tag:stardog:api:docs:>
select * {
?review :content ?text
service docs:entityExtractor {
[] docs:text ?text ;
docs:mention ?mention ;
docs:entity ?entity ;
docs:type ?type
}
}
+-------------------------+------------------+----------------+-----------+---------------+
| text | mention | entity | type | review |
+-------------------------+------------------+----------------+-----------+---------------+
| "Directed by Robert..." | "Tom Hanks" | imdb:nm0000158 | :Person | :MovieReview |
| "Directed by Robert..." | "Robert Zemeckis"| imdb:nm0000709 | :Person | :MovieReview |
+-------------------------+------------------+----------------+-----------+---------------+
Custom Extractors
The included extractors are intentionally basic, especially when compared to machine learning or text mining algorithms. A custom extractor connects the document store to algorithms tailored specifically to your data. The extractor SPI allows integration of any arbitrary workflow or algorithm from NLP methods like part-of-speech tagging, entity recognition, relationship learning, or sentiment analysis to machine learning models such as document ranking and clustering.
Extracted RDF assertions are stored in a named graph specific to the document, allowing provenance tracking and versatile querying. The extractor must implement the RDFExtractor interface. A convenience class, TextProvidingRDFExtractor, extracts the text from the document before calling the extractor. In addition, AbstractEntityRDFExtractor - or one of its existing subclasses - extends TextProvidingRDFExtractor so you can customize entity linking extraction to your specific needs.
The text extractor SPI gives you the opportunity to support arbitrary document formats. Implementations will be given a raw document and be expected to extract a string of text which will be added to the full-text search index. Text extractors should implement the TextExtractor interface.
Custom extractors are registered with the Java ServiceLoader under the RDFExtractor or TextExtractor class names. Custom extractors can be referred to from the command line or APIs by their fully qualified or "simple" class names.
For an example of a custom extractor, see our github repository.
Virtual Transparency
Virtual graphs provide a facility for accessing external data sources by mapping them to individual named graphs. The example queries shown previously all specify the source of the data using the virtual graph name. This fine-grained declaration can be useful in some circumstances but it’s also desirable to query over the set of all graphs without enumerating them individually. Virtual transparency is a feature that, when enabled, will include results from virtual graphs in queries over the default or set of named graphs.
How does it work? First you need to enable the VIRTUAL_TRANSPARENCY database option. When this is enabled, queries are evaluated not only over local graphs, but also over accessible virtual graphs. The set of accessible virtual graphs is determined by the virtual graph access rules. It may differ by database and user.
Virtual Transparency Query Semantics
The example queries shown previously use explicit graph blocks to name the source graph of the data, e.g.:
SELECT * {
GRAPH <virtual://dept> {
?person a emp:Employee ;
emp:name "SMITH"
}
}
In contrast, a query without a graph block would only return data from the local default graph:
SELECT * {
?person a emp:Employee ;
emp:name "SMITH"
}
However, if virtual transparency is enabled, this query will return data from both the local default graph and any accessible virtual graphs. Note that this requires that the query.all.graphs database property be set to true.
Additionally, a graph block with a variable will bind to both local named graphs and any accessible virtual graphs. For instance, the original query can be restated to use a graph variable:
SELECT * {
GRAPH ?g {
?person a emp:Employee ;
emp:name "SMITH"
}
}
The result would include results from the virtual graph, with ?g
bound to <virtual://dept>
, along with any results from local named
graphs.
With virtual transparency, the key difference between including or omitting a graph block comes from how triple patterns are joined together. Just as a graph block over a set of local named graphs limits BGP matches to a single named graph, a graph block with virtual transparency limits BGP matches to a single local or virtual graph. To illustrate, consider the query with a graph block:
SELECT * {
GRAPH ?g {
?person a emp:Employee ;
emp:name "SMITH"
}
}
If the set of employees is stored in a different virtual graph than the employee names, this query will return an empty result because the entire BGP will not match any set of triples in any individual graph. However, if we remove the graph block, each individual triple pattern will match triples from different graphs and these results will be joined together. The result is similar to what we would obtain by specifying the sources manually:
SELECT * {
GRAPH <virtual://employees> {
?person a emp:Employee
}
GRAPH <virtual://names> {
?person emp:name "SMITH"
}
}
Virtual Transparency Dataset Specification
Fine grained control of virtual graph selection is possible using
SPARQL’s dataset specification. When virtual transparency is enabled,
the dataset specification allows inclusion of virtual graphs in FROM
and FROM NAMED
clauses. Arbitrary mixing of local and virtual graphs
is allowed. For instance, the following query will include results
from multiple virtual graphs:
SELECT *
FROM <http://local-employees-graph>
FROM <virtual://names>
{
?person a emp:Employee ;
emp:name "SMITH"
}
The Special Named Graphs tag:stardog:api:context:virtual
and
tag:stardog:api:context:all
can be used to refer to the set of all
virtual graphs and the union of all virtual and local graphs,
respectively.
Virtual Transparency Features and Options
Virtual transparency is compatible with all SPARQL operators with the exception of "zero or more" and "one or more" property paths. These constructs are supported on some DBMS platforms when placed inside the graph block specifying the virtual graph source.
A query hint is provided to disable virtual transparency for all or
part of a query. Placing the hint #pragma virtual.transparency off
in a SPARQL block will disable consideration of virtual graphs for
that block.
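For example, the hint can be scoped to a nested group so that only that part of the query ignores virtual graphs (a sketch based on the hint described above):
SELECT * {
  GRAPH <virtual://dept> {
    ?person a emp:Employee
  }
  {
    #pragma virtual.transparency off
    ?person foaf:interest ?interest
  }
}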
High Availability Cluster
In this section we explain how to configure, use, and administer Stardog Cluster
for uninterrupted operations. Stardog Cluster is a collection of Stardog Server
instances running on one or more virtual or physical machines that, from the
client’s perspective, behave like a single Stardog Server instance. To fully
achieve this effect requires DNS (i.e., with SRV
records) and proxy
configuration that’s left as an exercise for the user.
Of course Stardog Cluster should have some different operational properties, the main one of which is high availability. But from the client’s perspective Stardog Cluster should be indistinguishable from non-clustered Stardog.[13] While Stardog Cluster is primarily geared toward HA, it is also important to remember that it should be tuned for your specific use case. Our detailed blog post discusses a variety of factors that you should consider when deploying Stardog Cluster as well as some adjustments you should make depending on your workload.

Note
|
Stardog Cluster depends on Apache ZooKeeper. High Availability requires at least three Stardog and three ZooKeeper nodes in the Cluster. ZooKeeper works best, with respect to fault resiliency, with an ensemble size that is an odd-number greater than or equal to three: 3, 5, 7, etc.[14] With respect to performance, larger Stardog clusters perform better than smaller ones for reads, while larger cluster sizes perform worse for writes. It is the responsibility of the administrator to find the right balance. |
Guarantees
A cluster is composed of a set of Stardog servers and a ZooKeeper ensemble running together. One of the Stardog servers is the Coordinator and the others are Participants. The Coordinator orchestrates transactions and maintains consistency by expelling any nodes that fail an operation. An expelled node must sync with a current member to rejoin the cluster.
In case the Coordinator fails at any point, a new Coordinator will be elected out of the remaining available Participants. Stardog Cluster supports both read (e.g., querying) and write (e.g., adding data) requests. All read and write requests can be handled by any of the nodes in the cluster.
When a client commits a transaction (containing a list of write requests), it will be acknowledged by the receiving node only after every non-failing peer node has committed the transaction. If a peer node fails during the process of committing a transaction, it will be expelled from the cluster by the Coordinator and put in a temporary failed state. If the Coordinator fails during the process, the transaction will be aborted. At that point the client can retry the transaction and it should succeed with the new cluster coordinator.
Since failed nodes are not used for any subsequent read or write requests, if a commit is acknowledged, then Stardog Cluster guarantees that the data has been accordingly modified at every available node in the cluster.
While this approach is less performant with respect to write operations than the eventual consistency used by other distributed databases, typically those databases offer a much less expressive data model than Stardog, which makes an eventual consistency model more appropriate for those systems (and less so for Stardog). But since Stardog’s data model is not only richly expressive but rests in part on provably correct semantics, we think that a strong consistency model is worth the cost.[15]
Single Server Migration
It is assumed that Stardog nodes in a Stardog Cluster are always going to be used within a cluster context. Therefore, if you want to migrate from a Stardog instance running in single server mode to running in a cluster, it is advised that you create backups of your current databases and then import them to the cluster in order to be able to provide the guarantees explained above. If you simply add a Stardog instance that was previously running in single server mode to a cluster, it will sync to the state of the cluster; local data could be removed when syncing with the cluster state.
Configuration
In this section we will explain how to manually deploy a Stardog Cluster using
stardog-admin
commands and some additional configuration. If you are deploying
your cluster to AWS, then you can use Stardog Graviton, which will automate this process.
You can use the stardog-admin cluster generate
command to bootstrap a cluster configuration and, thus, to ease installation by
simply passing a list of hostnames or IP addresses for the cluster’s nodes.
$ stardog-admin cluster generate --output-dir /home/stardog 10.0.0.1 10.0.0.2 10.0.0.3
See the man page for the details.
In a production environment we strongly recommend that you deploy and configure ZooKeeper according to its documentation. Per the ZooKeeper documentation, we also recommend that each ZooKeeper process runs on a different machine and, if possible, that ZooKeeper has a separate drive for its data directory. If you need a larger cluster, adjust accordingly.
In the following example we will set up a cluster with a total of 6 nodes. ZooKeeper will be deployed on nodes 1-3, whereas Stardog will be deployed on nodes 4-6. In this example we’ll generate the required ZooKeeper configuration in a zookeeper.properties file.
-
Install Stardog 7.4.5 on each machine in the cluster.
Note: The best thing to do here, of course, is to use whatever infrastructure you have in place to automate software installation. Adapting Stardog installation to Chef, Puppet, cfengine, etc. is left as an exercise for the reader.
-
Make sure a valid Stardog license key (whether Developer, Enterprise, or a 30-day eval key) for the size of cluster you’re creating exists and resides in
STARDOG_HOME
on each node. You must also have a stardog.properties file with the following information for each Stardog node in the cluster:
# Flag to enable the cluster; without this flag set, the rest of the properties have no effect
pack.enabled=true
# this node's IP address (or hostname) where other Stardog nodes are going to connect
# this value is optional but if provided it should be unique for each Stardog node
pack.node.address=196.69.68.4
# the connection string for ZooKeeper where cluster state is stored
pack.zookeeper.address=196.69.68.1:2180,196.69.68.2:2180,196.69.68.3:2180
pack.zookeeper.address is a ZooKeeper connection string where the cluster stores its state. pack.node.address is not a required property. The local address of the node, by default, is InetAddress.getLocalhost().getAddress(), which should work for many deployments. However, if you’re using an atypical network topology and the default value is not correct, you can provide a value for this property.
-
Create the ZooKeeper configuration for each ZooKeeper node. This config file is just a standard ZooKeeper configuration file and the same config file can be used for all ZooKeeper nodes. The following config file should be sufficient for most cases.
tickTime=3000
# Make sure this directory exists and
# ZK can write and read to and from it.
dataDir=/data/zookeeperdata/
clientPort=2180
initLimit=5
syncLimit=2
# This is an enumeration of all Zk nodes in
# the cluster and must be identical in
# each node's config.
server.1=196.69.68.1:2888:3888
server.2=196.69.68.2:2888:3888
server.3=196.69.68.3:2888:3888
Note: The clientPort specified in zookeeper.properties and the ports used in pack.zookeeper.address in stardog.properties must be the same.
-
dataDir is where ZooKeeper persists cluster state and where it writes log information about the cluster.
$ mkdir /data/zookeeperdata # on node 1
$ mkdir /data/zookeeperdata # on node 2
$ mkdir /data/zookeeperdata # on node 3
-
ZooKeeper requires a myid file in the dataDir folder to identify itself. You will create that file as follows for node1, node2, and node3, respectively:
$ echo 1 > /data/zookeeperdata/myid # on node 1
$ echo 2 > /data/zookeeperdata/myid # on node 2
$ echo 3 > /data/zookeeperdata/myid # on node 3
Installation
In the next few steps you will use the Stardog Admin CLI commands to deploy Stardog Cluster: that is, ZooKeeper and Stardog itself. We’ll also configure HAProxy as an example of how to use Stardog Cluster behind a proxy for load-balancing and fail-over capability. There’s nothing special about HAProxy here; you could implement this proxy functionality in many different ways. For example, Stardog Graviton uses Amazon’s Elastic Load Balancer.
-
Start ZooKeeper instances
First, you need to start ZooKeeper nodes. You can do this using the standard command line tools that come with ZooKeeper.
$ ./zookeeper-3.4.14/bin/zkServer.sh start /path/to/zookeeper/config # on node 1
$ ./zookeeper-3.4.14/bin/zkServer.sh start /path/to/zookeeper/config # on node 2
$ ./zookeeper-3.4.14/bin/zkServer.sh start /path/to/zookeeper/config # on node 3
-
Start Stardog instances
Once ZooKeeper is started, you can start Stardog instances:
$ ./stardog-admin server start --home ~/stardog --port 5821 # on node 4
$ ./stardog-admin server start --home ~/stardog --port 5821 # on node 5
$ ./stardog-admin server start --home ~/stardog --port 5821 # on node 6
Important: When starting Stardog instances for the cluster, unlike single server mode, you need to provide the credentials of a superuser that will be used for securing the data stored in ZooKeeper and for intra-cluster communication. Each node should be started with the same superuser credentials. By default, Stardog comes with a superuser admin that has the password "admin", and those are the default credentials used by the above command. For a secure installation of Stardog Cluster you should change these credentials by specifying the pack.zookeeper.auth setting in stardog.properties and restarting the cluster with the new credentials:
pack.zookeeper.auth=username:password
And again, if your $STARDOG_HOME is set to ~/stardog, you don’t need to specify the --home option.
Note: Make sure to allocate roughly twice as much heap for Stardog as you normally would for single-server operation, since there can be additional overhead involved for replication in the cluster. Also, we start Stardog here on the non-default port (5821) so that you can use a proxy or load-balancer on the same machine running on the default port (5820), meaning that Stardog clients can act normally (i.e., use the default port, 5820) since they need to interact with HAProxy.
-
Start HAProxy (or equivalent)
In most Unix-like systems, HAProxy is available via package managers, e.g. in Debian-based systems:
$ sudo apt-get update
$ sudo apt-get install haproxy
At the time of this writing, this will install HAProxy 1.4. You can refer to the official site to install a later release.
Place the following configuration in a file (such as haproxy.cfg) in order to point HAProxy to the Stardog Cluster. You’ll notice that there are two backends specified in the config file: stardog_coordinator and all_stardogs. An ACL is used to route all requests containing transaction in the path to the coordinator. All other traffic is routed via the default backend, which is simply round-robin across all of the Stardog nodes. For some use cases routing transaction-specific operations (e.g. commit) directly to the coordinator performs slightly better. However, round-robin routing across all of the nodes is generally sufficient.
global
    daemon
    maxconn 256

defaults
    # you should update these values to something that makes
    # sense for your use case
    timeout connect 5s
    timeout client 1h
    timeout server 1h
    mode http

# where HAProxy will listen for connections
frontend stardog-in
    option tcpka # keep-alive
    bind *:5820
    # the following lines identify any routes with "transaction"
    # in the path and send them directly to the coordinator, if
    # haproxy is unable to determine the coordinator all requests
    # will fall through and be routed via the default_backend
    acl transaction_route path_sub -i transaction
    use_backend stardog_coordinator if transaction_route
    default_backend all_stardogs

# the Stardog coordinator
backend stardog_coordinator
    option tcpka
    # the following line returns 200 for the coordinator node
    # and 503 for non-coordinators so traffic is only sent
    # to the coordinator
    option httpchk GET /admin/cluster/coordinator
    # the check interval can be increased or decreased depending
    # on your requirements and use case, if it is imperative that
    # traffic be routed to the coordinator as quickly as possible
    # after the coordinator changes, you may wish to reduce this value
    default-server inter 5s
    # replace these IP addresses with the corresponding node address
    # maxconn value can be upgraded if you expect more concurrent
    # connections
    server stardog1 196.69.68.1:5821 maxconn 64 check
    server stardog2 196.69.68.2:5821 maxconn 64 check
    server stardog3 196.69.68.3:5821 maxconn 64 check

# the Stardog servers
backend all_stardogs
    option tcpka # keep-alive
    # the following line performs a health check
    # HAProxy will check that each node accepts connections and
    # that it's operational within the cluster. Health check
    # requires that Stardog nodes do not use --no-http option
    option httpchk GET /admin/healthcheck
    default-server inter 5s
    # replace these IP addresses with the corresponding node address
    # maxconn value can be upgraded if you expect more concurrent
    # connections
    server stardog1 196.69.68.1:5821 maxconn 64 check
    server stardog2 196.69.68.2:5821 maxconn 64 check
    server stardog3 196.69.68.3:5821 maxconn 64 check
If you wish to operate the cluster in HTTP-only mode, you can add mode http to the backend settings.
Finally,
$ haproxy -f haproxy.cfg
For more info on configuring HAProxy please refer to the official documentation. In production environments we recommend running multiple proxies to avoid a single point of failure, and using DNS solutions for fail-over.
Now Stardog Cluster is running on 3 nodes, one each on 3 machines. Since HAProxy was conveniently configured to use port 5820, you can execute standard Stardog CLI commands to the Cluster:
$ ./stardog-admin db create -n myDb
$ ./stardog data add myDb /path/to/my/data
$ ./stardog query myDb "select * { ?s ?p ?o } limit 5"
If your cluster is running on another machine then you will need to provide a fully qualified connection string in the above commands.
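For example (a sketch with a placeholder address):
$ ./stardog-admin --server http://<node or proxy address>:5820 db create -n myDb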
Shutdown
In order to shut down the cluster you only need to execute the following command once:
$ ./stardog-admin cluster stop
The cluster stop request will cause all available nodes in the cluster to shut down. If a node was expelled from the cluster due to a failure, it would not receive this command and might need to be shut down manually. In order to shut down a single node in the cluster, use the regular server stop command and be sure to specify the server address:
$ ./stardog-admin --server http://localhost:5821 server stop
If you send the server stop command to the load balancer, then a random node selected by the load balancer will shut down.
In addition to the Stardog cluster, you need to shut down the ZooKeeper ensemble separately; refer to the ZooKeeper documentation for details. You’ll need to stop each ZooKeeper node individually:
$ ./zookeeper-3.4.14/bin/zkServer.sh stop /path/to/zookeeper/config
Standby Nodes
Note
|
This feature is in Beta. |
The notion of a standby node was introduced in Stardog 6.2.3. A standby
node runs next to the Stardog cluster and periodically requests updates.
The standby does not service any user requests, neither reads nor writes.
Its purpose is to stay very closely synchronized with the cluster but without disturbing the cluster with the more difficult join event. Because it drifts from full synchronization only by limited time windows, it allows for two important features:
-
The standby node can safely run database and server backups without taking CPU cycles from servicing user requests.
-
The standby node can be upgraded to a full node and thereby quickly join the cluster because it is already closely in sync.
This latter point is important for maintaining HA clusters. If one node goes down, a standby node can be promoted to a real, functional node, quickly restoring the cluster to full strength.
Managing A Standby Node
To start a cluster node as a standby node, simply add the following lines to stardog.properties:
pack.standby=true
pack.standby.node.sync.interval=5m
This will configure the node to be in standby mode and to wait 5 minutes between synchronization attempts. The interval begins when a synchronization completes; e.g., if a synchronization takes 3 minutes, it will be 8 minutes before the next synchronization attempt.
Once a standby node is running it can be converted to a full node with
the stardog-admin
command.
$ ./stardog-admin --server http://<standby node IP>:5820 cluster standby-join
Note that you cannot use the IP address of a full cluster node nor that of
a load balancer directing requests to full cluster nodes. You must
point directly to the standby node. Once upgraded it may take a bit
of time for the node to fully join the cluster. Its progress can be monitored
with stardog-admin cluster status
.
Another feature of a standby node is the ability to pause
synchronization.
To request a pause of synchronization run:
$ ./stardog-admin --server http://<standby node IP>:5820 cluster standby-pause
This tells the standby node that you want to pause it; however, it does not mean it is paused yet. Pausing can take some time if the node is in the middle of a large synchronization event. The status of pausing can be monitored with:
$ ./stardog-admin --server http://<standby node IP>:5820 cluster standby-status
A node is not safely paused until the state PAUSED
is returned. To
resume synchronization run:
$ ./stardog-admin --server http://<standby node IP>:5820 cluster standby-resume
Automated Deployment
As of Stardog 5, we support AWS as a first-class deployment environment.
Stardog Graviton
Configuring and managing highly available cluster applications can be a complex black art. Graviton is a tool that leverages the power of Amazon Web Services to make launching the Stardog cluster easy.
The source code is available under the Apache 2.0 license.
Download
-
Linux
-
OSX
Setup Your Environment
In order to use stardog-graviton
in its current form the following environment
variables must be set.
AWS_ACCESS_KEY_ID=<a valid aws access key>
AWS_SECRET_ACCESS_KEY=<a valid aws secret key>
The account associated with the access tokens must have the ability to create IAM credentials and full EC2 access.
Both terraform
and packer
must be in your system path.
The easiest way to launch a cluster is to run stardog-graviton
in interactive
mode. This will cause the program to ask a series of questions in order to get
the needed values to launch a cluster. Here is a sample session:
$ stardog-graviton --log-level=DEBUG launch mystardog423
What version of stardog are you launching?: 4.2.3
What is the path to the Stardog release?:
A value must be provided.
What is the path to the Stardog release?: /Users/bresnaha/stardog-4.2.3.zip
There is no base image for version 4.2.3.
- Running packer to build the image...
done
AMI Successfully built: ami-c06246a0
Creating the new deployment mystardog423
Would you like to create an SSH key pair? (yes/no): no
EC2 keyname (default): <aws key name>
Private key path: /path/to/private/key
What is the path to your Stardog license?: /path/to/stardog/license
\ Calling out to terraform to create the volumes...
- Calling out to terraform to stop builder instances...
Successfully created the volumes.
\ Creating the instance VMs......
Successfully created the instance.
Waiting for stardog to come up...
The instance is healthy
Changing the default password...
Password changed successfully for user admin.
\ Opening the firewall......
Successfully opened up the instance.
The instance is healthy
The instance is healthy
Stardog is available here: http://mystardog423sdelb-1763823291.us-west-1.elb.amazonaws.com:5821
ssh is available here: mystardog423belb-124202215.us-west-1.elb.amazonaws.com
Using 3 stardog nodes
10.0.101.189:5821
10.0.100.107:5821
10.0.100.140:5821
Success.
To avoid being asked questions, a file named ~/.graviton/default.json can be created. An example can be found in the defaults.json.example file.
All of the components needed to run a Stardog cluster are considered part of a deployment. Every deployment must be given a name that is unique to each cloud account. In the above example the deployment name is mystardog423.
Status
Once the image has been successfully launched its health can be monitored with
the status
command:
$ stardog-graviton --log-level=DEBUG status mystardog423
The instance is healthy
Stardog is available here: http://mystardog423sdelb-1763823291.us-west-1.elb.amazonaws.com:5821
ssh is available here: mystardog423belb-124202215.us-west-1.elb.amazonaws.com
Using 3 stardog nodes
10.0.101.189:5821
10.0.100.107:5821
10.0.100.140:5821
Success.
Cleanup
AWS EC2 charges by the hour for the VMs that Graviton runs; thus, when the cluster is no longer in use, it is important to clean it up with the destroy command.
$ stardog-graviton --log-level=DEBUG destroy mystardog423
This will destroy all volumes and instances associated with this deployment.
Do you really want to destroy? (yes/no): yes
/ Deleting the instance VMs...
Successfully destroyed the instance.
\ Calling out to terraform to delete the images...
Successfully destroyed the volumes.
Success.
Configuration Issues
Topologies & Size
In the configuration instructions above, we assume a particular Cluster topology: for each node n of a cluster, we run Stardog, ZooKeeper, and a load balancer. But this is not the only topology supported by Stardog Cluster. ZooKeeper nodes run independently, so other topologies (e.g., three ZooKeeper servers and five Stardog servers) are possible; you just have to point Stardog to the corresponding ZooKeeper ensemble.
To add more Stardog Cluster nodes, simply repeat the steps for Stardog on additional machines. Generally, as mentioned above, Stardog Cluster size should be an odd number greater than or equal to 3.
Warning
|
ZooKeeper uses a very write-heavy protocol; having Stardog and ZooKeeper both writing to the same disk can yield contention issues, resulting in timeouts at scale. We recommend at a minimum having the two services write to separate disks to reduce contention or, ideally, having them run on separate nodes entirely. |
Open File Limits
If you expect to use Stardog Cluster with heavy concurrent write workloads, then you should probably increase the number of open files that the host OS will permit on each Cluster node. You can typically do this on a Linux machine with ulimit -n or some variant thereof. Because nodes communicate between themselves and with ZooKeeper, it’s important to make sure that there are sufficient file handle resources available.[16]
Connection/Session Timeouts
Stardog nodes connect to the ZooKeeper cluster and establish a session. The session is kept alive by PING requests sent by the client. If a Stardog node does not send these requests to the ZooKeeper server (due to network issues, node failure, etc.), the session will time out and the Stardog node will go into a suspended state, rejecting any queries or transactions until it can establish the session again.
If a Stardog node is overloaded, it might fail to send the PING requests to the ZooKeeper server in a timely manner. This usually happens when Stardog’s memory usage is close to the limit and there are frequent GC pauses, which can cause Stardog nodes to be suspended unnecessarily. In order to prevent this problem, make sure Stardog nodes have enough memory allocated and tweak the timeout options.
There are two different configuration options that control timeouts for the
ZooKeeper server. The pack.connection.timeout
option specifies the max time
that Stardog waits to establish a connection to ZooKeeper. The pack.session.timeout
option specifies the session timeout explained above. You can set these values
in stardog.properties
as follows:
pack.connection.timeout=15s
pack.session.timeout=60s
Note that ZooKeeper has limitations on how these values can be set, based on the tickTime value specified in the ZooKeeper configuration file. The session timeout needs to be a minimum of 2 times the tickTime and a maximum of 20 times the tickTime. So a session timeout of 60s requires the tickTime to be at least 3s (in the ZooKeeper configuration file this value should be entered in milliseconds). If the session timeout is not in the allowed range, ZooKeeper will negotiate a new timeout value and Stardog will print a warning about this in the stardog.log file.
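As a concrete illustration, a sketch combining the configuration files already shown above: a 60s session timeout pairs with a tickTime of at least 3 seconds.
# zookeeper.properties (value in milliseconds)
tickTime=3000

# stardog.properties; 60s is within 2x-20x of a 3s tickTime
pack.connection.timeout=15s
pack.session.timeout=60s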
Client Usage
To use Stardog Cluster with standard Stardog clients and CLI tools (stardog-admin and stardog) in the ordinary way, you must have Stardog installed locally. With the Stardog binaries provided in the Stardog Cluster distribution you can query the state of the Cluster:
$ ./stardog-admin --server http://<ipaddress>:5820/ cluster info
where ipaddress
is the IP address of any of the nodes in the cluster. This
will print the available nodes in the cluster, as well as the roles (participant
or coordinator). You can also input the proxy IP address and port to get the
same information.
To add or remove data, issue stardog data add or remove commands to any node in the cluster. Queries can be issued to any node in the cluster using the stardog query command. All the stardog-admin features are also available in Cluster, which means you can use any of the commands to create databases, administer users, and the rest of the functionality.
Adding Nodes to a Cluster
Stardog cluster stores the UUID of the last committed transaction for each database in ZooKeeper. When a new node is joining the cluster it will compare the local transaction ID of each database with the corresponding transaction ID stored in ZooKeeper. If there is a mismatch the node will synchronize the database contents from another node in the cluster. If there are no nodes in the cluster the new node cannot join the cluster and will shut itself down. For this reason, if you are starting a new cluster then you should make sure that the ZooKeeper state is cleared. If you are retaining an existing cluster then new nodes should be started when there is at least one node in the cluster.
If there are active transactions in the cluster, the joining node will wait for those transactions to finish and then synchronize its databases. More transactions may take place during synchronization; in that case the joining node will continue synchronizing and retrieve the data from the new transactions. Thus, it will take longer for a node to join the cluster if there are continuous transactions. Note that the new node will not be available for requests until all the databases are synchronized. The proxy/load-balancer should perform a health check before forwarding requests to a new node (as shown in the above configuration) so user requests will always be forwarded to available nodes.
Upgrading the Cluster
The process to upgrade Stardog Cluster is straightforward; however, there are a few extra steps you should take to ensure the upgrade goes as quickly and smoothly as possible. Before you begin the upgrade, make sure to place the new Stardog binaries on all of the cluster nodes.
Also make sure to note which node is the coordinator since this is the first node
that will be started as part of the upgrade. stardog-admin cluster info
will show the nodes in the cluster and which one is the coordinator.
Next you should ensure that there are no transactions running; e.g., stardog-admin db status <db name> will show if there are any open transactions for a database. This step is not strictly required; however, it can minimize downtime and streamline the process, allowing the cluster to stop quickly and helping to prevent non-coordinator nodes from having to re-sync when they attempt to join the upgraded cluster.
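For example, the pre-upgrade checks might look like this (a sketch with a hypothetical database name):
$ ./stardog-admin cluster info
$ ./stardog-admin db status myDb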
When you are ready to begin the upgrade, you can shutdown the cluster with
stardog-admin cluster stop
. Once all nodes have stopped, backup the STARDOG_HOME
directories on all of the nodes.
With the new version of Stardog, bring the cluster up one node at a time, starting with the previous coordinator. As each node starts make sure that it is able to join the cluster cleanly before moving on to the next node.
Backing Up the Cluster
Backing up the cluster is similar to single-node backups. However, there are a few points to be aware of. All nodes in the cluster will perform a backup unless S3 is the backup location, in which case only a single node will perform the backup to the S3 bucket.
If you are backing up to S3 then backup.location
should be the same on all nodes
in the cluster since any node may perform the backup.
You can also disable backup replication in the cluster by setting the following option:
pack.backups.replicated.scheme=none
If replication is disabled, then only the node that receives the command will perform the backup; otherwise all nodes in the cluster will run the backup (either db backup or server backup).
If backup.location
is used to specify a backup directory mounted on the Stardog nodes then backup.location
can specify different directories on each node in the cluster, if required. In this case
you should disable replicated backups and issue the backup command to each node individually.
Finally, if --to
is passed to the backup command, it will take precedence over either
backup.dir
or backup.location
specified in stardog.properties
and all nodes will
perform a backup to the specified location.
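For example, a backup to a mounted backup directory might be issued like this (a sketch with a hypothetical path; check the db backup man page for the exact usage):
$ ./stardog-admin db backup --to /backups/myDb myDb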
Restoring the Cluster
Similar to single-node Stardog, you can restore individual databases with db restore
and the cluster will replicate the database to all nodes in the cluster.
Because Stardog Cluster uses ZooKeeper to ensure strong consistency between all of the
nodes in the cluster, we recommend server restore
be done on only one node with a
fresh deploy of ZooKeeper (i.e., clear ZooKeeper’s state once it is no longer in use).
The operational process is to:
-
Shut down Stardog on all nodes in the cluster
-
Shut down the ZooKeeper ensemble, if possible. If that’s not possible, we recommend backing up ZooKeeper’s state and wiping the contents stored by Stardog.
-
Create an empty
$STARDOG_HOME
directory on all of the Stardog Cluster nodes. -
Export
$STARDOG_HOME
to the empty home and runserver restore
(the same as you would for a single node) on a single node. -
Start a fresh ZooKeeper ensemble with an empty data directory.
-
Start ONLY the Stardog node where you performed
server restore
. Verify the node starts and is in the cluster with thecluster info
command before continuing to step 7. -
Start a second node in the cluster with its empty home directory, wait for it to sync and join the cluster, as reported by
cluster info
. Wait until the node joins before moving to step 8. -
Repeat step 7, one node at a time, for the remaining cluster nodes.
Errors During Database Creation
In order to ensure consistency in the cluster, if there is an error adding one or more data files at database creation time, the
operation will fail and no database will be created. In a single node setup, this is not the case as the configuration
option database.ignore.bulk.load.errors
is set equal to true
. When this configuration option is set, Stardog will
continue to load any other data files specified at creation time and ignore those that triggered errors.
It is not safe to ignore these bulk load errors in a cluster, which is why database.ignore.bulk.load.errors is set to false by default in a cluster. This discussion is merely to describe the differences between a single node and a cluster in how they handle errors during database creation. No additional configuration is required to ensure consistency in the cluster.
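For reference, the single-node behavior described above corresponds to creating a database with the option enabled and several data files (a sketch with hypothetical file names):
$ stardog-admin db create -o database.ignore.bulk.load.errors=true -n myDb data1.ttl data2.ttl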
Search
Indexing Strategy
The indexing strategy creates a "search document" per RDF literal. Each document consists of two fields: literal ID and literal value. See User-defined Lucene Analyzer for details on customizing Stardog’s search programmatically.
Enabling Search
Full-text support for a database is disabled by default but can be
enabled at any time by setting the configuration option
search.enabled
to true. For example, you can create a database with
full-text support as follows:
$ stardog-admin db create -o search.enabled=true -n myDb
Similarly, you can set the option using SearchOptions#SEARCHABLE
when
creating the database programmatically:
adminConnection.newDatabase("myDB")
.set(SearchOptions.SEARCHABLE, true)
.create()
Integration with SPARQL
We use the predicate tag:stardog:api:property:textMatch
(or
http://jena.hpl.hp.com/ARQ/property#textMatch
) to access the search index in
a SPARQL query.
The textMatch function has one required argument, the search query in Lucene syntax, and it returns, by default, all literals matching the query string.
For example,
SELECT DISTINCT ?s ?score
WHERE {
?s ?p ?l.
(?l ?score) <tag:stardog:api:property:textMatch> 'mac'.
}
This query selects all literals which match 'mac'. These literals are
then joined with the generic BGP ?s ?p ?l
to get the resources (?s
)
that have those literals. Alternatively, you could use
?s rdf:type ex:Book
if you only wanted to select the books which
reference the search criteria; you can include as many other BGPs as
you like to enhance your initial search results.
You can change the number of results textMatch
returns by providing an
optional second argument with the limit:
SELECT DISTINCT ?s ?score
WHERE {
?s ?p ?l.
(?l ?score) <tag:stardog:api:property:textMatch> ('mac' 100).
}
The limit in textMatch only limits the number of literals returned, which is different from the total number of results the query will return. When a LIMIT is specified in the SPARQL query, it does not affect the full-text search; rather, it only restricts the size of the result set.
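To make the distinction concrete, the following sketch asks the full-text index for up to 100 matching literals but returns at most 10 rows overall:
SELECT DISTINCT ?s
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> ('mac' 100).
}
LIMIT 10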
Lucene returns a score with each match. It is possible to return these scores and define filters based on the score:
SELECT DISTINCT ?s ?score
WHERE {
?s ?p ?l.
(?l ?score) <tag:stardog:api:property:textMatch> ('mac' 0.5 10).
}
This query returns 10 matching literals where the score is greater than 0.5. Note that, as explained in the Lucene documentation, scoring is very much dependent on the way documents are indexed, and the range of scores might change significantly between different databases.
Service Form of Search
The textMatch predicate is concise for simple queries. With up to four input constants and two or more output variables, positional arguments can become confusing. An alternate syntax based on the SPARQL SERVICE clause is provided. Not only does it make the arguments clear, but it also provides some additional features, such as the ability to search over variable bindings and to return highlighted fragments, both described below.
With the SERVICE
clause syntax, we specify each parameter by
name. Here’s an example using a number of different parameters:
prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
service fts:textMatch {
[] fts:query 'Mexico AND city' ;
fts:threshold 0.6 ;
fts:limit 10 ;
fts:offset 5 ;
fts:score ?score ;
fts:result ?res ;
}
}
Searching over Variable Bindings
Search queries aren’t always as simple as a single constant
query. It’s possible to perform multiple search queries using other
bindings in the SPARQL query as input. This can be accomplished by
specifying a variable for the fts:query
parameter. In the following
example, we use the titles of new books to find related books:
prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
# Find new books and their titles. Each title will be used as input to a
# search query in the full-text index
?newBook a :NewBook ; :title ?title .
service fts:textMatch {
[] fts:query ?title ;
fts:score ?score ;
fts:result ?relatedText ;
}
# Bindings of ?relatedText will be used to look up other books in the database
?relatedBook :title ?relatedText .
filter(?newBook != ?relatedBook)
}
Highlighting Relevant Fragments of Search Results
When building search engines, it’s essential not only to find the most
relevant results, but also to display them in a way that helps users
select the entry most relevant to them. To this end, Stardog provides
a highlight
argument to the SERVICE
clause search syntax. When
this argument is given an otherwise unbound variable, the result will
include one or more fragments from the string literal returned by the
search which include the search terms. The highlightMaxPassages argument can be used to limit the maximum number of fragments which will be included in the highlight result.
To illustrate, an example query and results are given.
prefix fts: <tag:stardog:api:search:>
SELECT * WHERE {
service fts:textMatch {
[] fts:query "mexico AND city" ;
fts:score ?score ;
fts:result ?result ;
fts:highlight ?highlight
}
}
order by desc(?score)
limit 4
The results might include highlighted fragments such as:
a <b>city</b> in south central <b>Mexico</b> (southeast of <b>Mexico</b>
<b>City</b>) on the edge of central Mexican plateau
Search Syntax
Stardog search is based on Lucene 7.4.0: we support all of the search modifiers that Lucene supports, with the exception of fields.
-
wildcards:
?
and*
-
fuzzy:
~
and~
with similarity weights (e.g.foo~0.8
) -
proximities:
"semantic web"~5
-
term boosting
-
booleans:
OR
,AND
,NOT
, +, and `-
. -
grouping
For a more detailed discussion, see the Lucene docs.
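For instance, the proximity modifier above can be used directly inside a textMatch query (a sketch following the earlier examples):
SELECT DISTINCT ?s
WHERE {
  ?s ?p ?l.
  (?l ?score) <tag:stardog:api:property:textMatch> '"semantic web"~5'.
}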
Escaping Characters in Search
The "/" character must be escaped because Lucene says so. In fact, there are several characters that are part of Lucene’s query syntax that must be escaped.
OWL & Rule Reasoning
In this chapter we describe how to use Stardog’s reasoning capabilities; we address some common problems and known issues. We also describe Stardog’s approach to query answering with reasoning in some detail, as well as a set of guidelines that contribute to efficient query answering with reasoning. If you are not familiar with the terminology, you can peruse the section on terminology.
The semantics of Stardog’s reasoning is based in part on the OWL 2 Direct Semantics Entailment Regime. However, the implementation of Stardog’s reasoning system is worth understanding as well. For the most part, Stardog performs reasoning in a lazy and late-binding fashion: it does not materialize inferences; rather, reasoning is performed at query time according to a user-specified "reasoning type". This approach allows for maximum flexibility[17] while maintaining excellent performance. The one exception to this general approach is equality reasoning, which is eagerly materialized. See Same As Reasoning for more details.
Reasoning Types
Reasoning can be enabled or disabled using a simple boolean flag—in HTTP,
reasoning
; in CLI, -r
or --reasoning
; and in Java APIs,
a
connection option or
a
query option:
-
false
: No axioms or rules are considered; no reasoning is performed. -
true
: Axioms and rules are considered and reasoning is performed according to the value of thereasoning.type
database option.
Reasoning is disabled by default; that is, no reasoning is performed without explicitly setting the reasoning flag to "true".
When reasoning is enabled by the boolean flag, the axioms and rules
in the database are first filtered according to the value of the
reasoning.type
database option. The default value of reasoning.type
is SL
and for the most part users don’t need to
worry too much about which reasoning type is necessary since SL
covers
all of the OWL 2 profiles as well as user-defined rules via SWRL.
However, this value may be set to any other reasoning type that Stardog supports: RDFS for the OWL 2 axioms allowed in RDF Schema (mainly subclasses, subproperties, domain, and ranges); QL for the OWL 2 QL axioms; RL for the OWL 2 RL axioms; EL for the OWL 2 EL axioms; DL for OWL 2 DL axioms; and SL for a combination of RDFS, QL, RL, and EL axioms, plus SWRL rules. Any axiom outside the selected type will be ignored by the reasoner.
The DL reasoning type behaves significantly differently than the other types. Stardog normally uses the Query Rewriting technique for reasoning, which scales very well with an increasing number of instances; only the schema needs to be kept in memory. But query rewriting cannot handle axioms outside the OWL 2 profiles; the DL reasoning type, however, can be used so that no axiom or rule is ignored as long as it satisfies the OWL 2 DL restrictions. With DL reasoning, both the schema and the instance data need to be pulled into memory, which limits its applicability with large numbers of instances. DL reasoning also requires the database to be logically consistent or no reasoning can be performed. Finally, DL reasoning requires more computation upfront compared to query rewriting, which exhibits a "pay-as-you-go" behavior.
Note
|
DL reasoning is not something that we recommend using due to scalability issues.
It may be deprecated and removed in a future version of Stardog.
|
The reasoning.type option can also be set to the special value NONE, which will filter all axioms and rules, thus effectively disabling reasoning. This value can be used for the database option to prevent reasoning from being used by any client, even if they enable it with the boolean flag on the client side.
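For example, assuming the standard stardog-admin metadata set command, the option could be set on an existing database like this (a sketch with a hypothetical database name):
$ stardog-admin metadata set -o reasoning.type=NONE myDb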
Using Reasoning
In order to perform query evaluation with reasoning, Stardog requires a schema[18] to be present in the database. Since schemas are serialized as RDF, they are loaded into a Stardog database in the same way that any RDF is loaded into a Stardog database. Also, note that, since the schema is just more RDF triples, it may change as needed: it is neither fixed nor compiled in any special way.
The schema may reside in the default graph, in a specific named graph,
or in a collection of graphs. You can tell Stardog where the schema is
by setting the reasoning.schema.graphs
property to one or more named
graph URIs. If you want the default graph to be considered part of the
schema, then you can use the special built-in URI
tag:stardog:api:context:default
. If you want to use all local (non-virtual)
named graphs (that is, to tell Stardog to look for the schema in every local
named graph), you can use tag:stardog:api:context:local
.
Note
|
The default value for this property is to
use all graphs, i.e., tag:stardog:api:context:local .
|
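If you need the schema restricted to specific graphs, the option can be set when creating the database, following the -o pattern used elsewhere in these docs (a sketch with a hypothetical graph IRI):
$ stardog-admin db create -o reasoning.schema.graphs=http://example.org/schemaGraph -n myDb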
This design is intended to support both of Stardog’s primary use cases:
-
managing the data that constitutes the schema
-
reasoning with the schema during query evaluation
Query Answering
All of Stardog’s interfaces (API, network, and CLI) support reasoning
during query evaluation. All types of queries (that is, SELECT
, ASK
, CONSTRUCT
,
PATHS
, DESCRIBE
, and updates) can be evaluated with reasoning. When reasoning
is enabled, it applies to all query patterns in WHERE
and VIA
blocks.
However, as of 6.1 it is possible to selectively disable it for certain parts of the query using
the #pragma reasoning
hint as follows:
SELECT * WHERE {
?person rdf:type :Employee .
{ #pragma reasoning off
?person ?p ?o
}
}
This query uses reasoning to select all employees (thus retrieving managers, etc.) but returns only asserted properties for each of them. Disabling reasoning for ?s ?p ?o patterns is often handy since those may cause performance problems while not providing particularly useful inferences. In complex queries it is possible to re-enable reasoning for a nested graph scope with #pragma reasoning on. The hint is ignored when the query is evaluated without reasoning.
Command Line
In order to evaluate queries in Stardog using reasoning via the command line, we use the reasoning flag:
$ ./stardog query --reasoning myDB "SELECT ?s { ?s a :Pet } LIMIT 10"
HTTP
For HTTP, the reasoning flag is specified either with the other HTTP request parameters:
$ curl -u admin:admin -X GET "http://localhost:5820/myDB/query?reasoning=true&query=..."
or, as a segment in the URL:
$ curl -u admin:admin -X GET "http://localhost:5820/myDB/query/reasoning?query=..."
Reasoning Connection API
In order to use the ReasoningConnection
API one needs to enable
reasoning. See the Java Programming section for details.
Currently, the API has two methods (a usage sketch follows the list):
-
isConsistent()
, which can be used to check if the database is (logically) consistent with respect to the reasoning type. -
isSatisfiable(URI theURIClass)
, which can be used to check if the given class is satisfiable with respect to the database and reasoning type.
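A minimal usage sketch follows; aConn and theClassUri are hypothetical (a Connection opened with reasoning enabled and the URI of the class to check, respectively), and the snippet omits imports and connection setup, which are covered in the Java Programming section:
// obtain the reasoning view of an existing connection (hypothetical aConn)
ReasoningConnection aReasoningConn = aConn.as(ReasoningConnection.class);
// check whether the database is logically consistent under the configured reasoning type
boolean consistent = aReasoningConn.isConsistent();
// check whether the given class (hypothetical theClassUri) can have any instances
boolean satisfiable = aReasoningConn.isSatisfiable(theClassUri);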
Reasoning with Multiple Schemas
There is a default schema associated with each database whose content is
controlled by the reasoning.schema.graphs
property as explained above.
However, there are certain use cases where one might need to use different
schemas to answer different queries. Some example use cases are as follows:
-
There are two different versions of a schema that evolved over time and older legacy applications need to use the previous version of the schema whereas the newer applications need to use the newer version.
-
Different applications require different rules and business logic, e.g. threshold for a concept like
Low
orHigh
might change based on the context. -
There could be a very large number of axioms and rules in the domain that can be partitioned into smaller schema subsets for performance reasons.
Starting with version 7.0, Stardog supports schema multi-tenancy: reasoning with multiple schemas and
specifying a schema to be used for answering a query. Each schema has a name and
a set of named graphs and when the schema is selected for answering a query the
axioms and rules stored in the associated graphs will be taken into account. A
named schema can be selected for a query using the --schema
parameter:
$ ./stardog query --schema petSchema myDB "SELECT ?s { ?s a :Pet } LIMIT 10"
When the --schema parameter is used, the --reasoning parameter does not need to be specified and will have no effect. But using the --reasoning flag without a --schema parameter is equivalent to specifying --schema default.
The named schemas are defined via the reasoning.schemas configuration option, which is a set of schema name and graph IRI pairs. There is convenience functionality provided in the CLI and Java API to manage schemas. The named graphs for a new or an existing schema can be set as follows using stored namespaces or full IRIs:
$ ./stardog reasoning schema --add dogSchema --graphs :dogGraph :petGraph -- myDB
The schemas can be removed using the reasoning
schema --remove
command. The --list
option will list all the defined
schemas and their named graphs:
$ ./stardog reasoning schema --list myDB
+-----------+---------------------------------+
| Schema | Graphs |
+-----------+---------------------------------+
| default | <tag:stardog:api:context:local> |
| catSchema | :petGraph, :catGraph |
| dogSchema | :petGraph, :dogGraph |
| petSchema | :petGraph |
+-----------+---------------------------------+
Explaining Reasoning Results
Stardog can be used to check if the current database logically entails a set of triples; moreover, Stardog can explain why this is so.[19] An explanation of an inference is the minimum set of statements explicitly stored in the database that, together with the schema and any valid inferences, logically justify the inference. Explanations are useful for understanding data, schema, and their interactions, especially when a large number of statements interact with each other to infer new statements.
Explanations can be retrieved using the CLI by providing an input file that contains the inferences to be explained:
$ stardog reasoning explain myDB inference_to_explain.ttl
The output is displayed in a concise syntax designed to be legible; but it can be rendered in any one of the supported RDF syntaxes if desired. Explanations are also accessible through Stardog’s extended HTTP protocol and in Java. See the examples in the stardog-examples Github repo for more details about retrieving explanations programmatically.
Proof Trees
Proof trees are a hierarchical presentation of multiple explanations (of inferences) to make data, schemas, and rules more intelligible. Proof trees[20] provide an explanation for an inference or an inconsistency as a hierarchical structure. Nodes in the proof tree may represent an assertion in a Stardog database. Multiple assertion nodes are grouped under an inferred node.
Example
For example, if we are explaining the inferred triple :Alice
rdf:type :Employee
, the root of the proof tree will show that
inference:
INFERRED :Alice rdf:type :Employee
The children of an inferred node will provide more explanation for that inference:
INFERRED :Alice rdf:type :Employee
ASSERTED :Manager rdfs:subClassOf :Employee
INFERRED :Alice rdf:type :Manager
The fully expanded proof tree will show the asserted triples and axioms for every inference:
INFERRED :Alice rdf:type :Employee
ASSERTED :Manager rdfs:subClassOf :Employee
INFERRED :Alice rdf:type :Manager
ASSERTED :Alice :supervises :Bob
ASSERTED :supervises rdfs:domain :Manager
The CLI explanation command prints the proof tree using indented text; but, using the SNARL API, it is easy to create a tree widget in a GUI to show the explanation tree, such that users can expand and collapse details in the explanation.
Another feature of proof trees is the ability to merge multiple explanations into a single proof tree with multiple branches when explanations have common statements. Consider the following example database:
#schema
:Manager rdfs:subClassOf :Employee
:ProjectManager rdfs:subClassOf :Manager
:ProjectManager owl:equivalentClass (:manages some :Project)
:supervises rdfs:domain :Manager
:ResearchProject rdfs:subClassOf :Project
:projectID rdfs:domain :Project
#instance data
:Alice :supervises :Bob
:Alice :manages :ProjectX
:ProjectX a :ResearchProject
:ProjectX :projectID "123-45-6789"
In this database, there are three different unique explanations
for the inference :Alice rdf:type :Employee
:
Explanation 1
:Manager rdfs:subClassOf :Employee
:ProjectManager rdfs:subClassOf :Manager
:supervises rdfs:domain :Manager
:Alice :supervises :Bob
Explanation 2
:Manager rdfs:subClassOf :Employee
:ProjectManager rdfs:subClassOf :Manager
:ProjectManager owl:equivalentClass (:manages some :Project)
:ResearchProject rdfs:subClassOf :Project
:Alice :manages :ProjectX
:ProjectX a :ResearchProject
Explanation 3
:Manager rdfs:subClassOf :Employee
:ProjectManager rdfs:subClassOf :Manager
:ProjectManager owl:equivalentClass (:manages some :Project)
:projectID rdfs:domain :Project
:Alice :manages :ProjectX
:ProjectX :projectID "123-45-6789"
All three explanations have some triples in common; but when explanations are retrieved separately, it is hard to see how these explanations are related. When explanations are merged, we get a single proof tree where alternatives for subtrees of the proof are shown inline. In indented text rendering, the merged tree for the above explanations would look as follows:
INFERRED :Alice a :Employee
ASSERTED :Manager rdfs:subClassOf :Employee
1.1) INFERRED :Alice a :Manager
ASSERTED :supervises rdfs:domain :Manager
ASSERTED :Alice :supervises :Bob
1.2) INFERRED :Alice a :Manager
ASSERTED :ProjectManager rdfs:subClassOf :Manager
INFERRED :Alice a :ProjectManager
ASSERTED :ProjectManager owl:equivalentClass (:manages some :Project)
ASSERTED :Alice :manages :ProjectX
2.1) INFERRED :ProjectX a :Project
ASSERTED :projectID rdfs:domain :Project
ASSERTED :ProjectX :projectID "123-45-6789"
2.2) INFERRED :ProjectX a :Project
ASSERTED :ResearchProject rdfs:subClassOf :Project
ASSERTED :ProjectX a :ResearchProject
In the merged proof tree, alternatives for an
explanation are shown with a number id. In the above tree,
:Alice a :Manager
is the first inference for which we have
multiple explanations so it gets the id 1
. Then each alternative
explanation gets an id appended to this (so explanations 1.1
and
1.2
are both alternative explanations for inference 1
). We
also have multiple explanations for inference :ProjectX a :Project
so its alternatives get ids 2.1
and 2.2
.
User-defined Rule Reasoning
Many reasoning problems may be solved with OWL’s axiom-based approach; but, of course, not all reasoning problems are amenable to this approach. A user-defined rules approach complements the OWL axiom-based approach nicely and increases the expressive power of a reasoning system from the user’s point of view. Many RDF databases support user-defined rules only. Stardog is the only RDF database that comprehensively supports both axioms and rules. Some problems (and some people) are simply a better fit for a rules-based approach to modeling and reasoning than to an axioms-based approach (and, of course, vice versa).
Note
|
There isn’t a one-size-fits-all answer to the question "rules or axioms or both?" Use the thing that makes the most sense given the task at hand. This is engineering, not religion. |
Stardog supports user-defined rule reasoning together with a rich set of built-in functions using the SWRL syntax and builtins library. In order to apply SWRL user-defined rules, you must include the rules as part of the database’s schema: that is, put your rules where your axioms are, i.e., in the schema. Once the rules are part of the schema, they will be used for reasoning automatically when using the SL reasoning type.
Assertions implied by the rules will not be materialized. Instead, rules are used to expand queries just as regular axioms are used.
Note
|
To trigger rules to fire, execute a relevant query—simple and easy as the truth. |
Stardog Rules Syntax
Stardog supports two different syntaxes for defining rules. The first is native Stardog Rules syntax and is based on SPARQL, so you can re-use what you already know about SPARQL to write rules. Unless you have specific requirements otherwise, you should use this syntax for user-defined rules in Stardog. The second is the de facto standard RDF/XML syntax for SWRL. It has the advantage of being supported in many tools; but it’s not fun to read or to write. You probably don’t want to use it. Better: don’t use this syntax!
Stardog Rules Syntax is basically SPARQL "basic graph patterns" (BGPs)
plus some very explicit new bits (IF-THEN
) to denote the head and the
body of a rule.[21] You define URI prefixes
in the normal way (examples below) and use regular SPARQL variables for
rule variables. As you can see, some SPARQL 1.1 syntactic
sugar—property paths, especially, but also bnode syntax—make complex
Stardog Rules concise and elegant.
Note
|
Starting in Stardog 3.0, it’s legal to use any valid Stardog function in Stardog Rules (see rule limitations below for few exceptions). |
How to Use Stardog Rules
There are three things to sort out:
-
Where to put these rules?
-
How to represent these rules?
-
What are the gotchas?
First, the rules go into the database, of course.
Unless you’ve changed the value of the reasoning.schema.graphs option, you can store the rules in any named graph (or the default graph) in the database and you will be fine; that is, just add the rules to the database and it will all work out.[22]
Second, you include the rules directly in a Turtle file loaded into Stardog. Rules can be mixed with triples in the file. Here’s an example:
:r a :Rectangle ;
:width 5 ;
:height 8 .
IF {
?r a :Rectangle ;
:width ?w ;
:height ?h
BIND (?w * ?h AS ?area)
}
THEN {
?r :area ?area
}
That’s pretty easy. Third, what are the gotchas?
Rule Representation Options
Inline rules in Turtle data can be named for later reference and
management. We assign an IRI, :FatherRule
in this example, to the
rule and use it as the subject of other triples:
@prefix : <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
RULE :FatherRule
IF {
?x a <http://example.org/Male> , <http://example.org/Parent> .
}
THEN {
?x a <http://example.org/Father> .
}
:FatherRule rdfs:comment "This rule defines fathers" ;
a :MyRule .
In addition to the inline Turtle representation of rules, you can
represent the rules with specially constructed RDF triples. This is
useful for maintaining Turtle compatibility or for use with SPARQL
INSERT DATA
. This example shows the object of a triple which
contains one rule in Stardog Rules syntax embedded as a literal.
@prefix rule: <tag:stardog:api:rule:> .
[] a rule:SPARQLRule;
rule:content """
IF {
?r a :Rectangle ;
:width ?w ;
:height ?h
BIND (?w * ?h AS ?area)
}
THEN {
?r :area ?area
}
""".
Rule Limitations & Gotchas
-
The RDF serialization of rules in, say, a Turtle file has to use the
tag:stardog:api:rule:
namespace URI and then whatever prefix, if any, mechanism that’s valid for that serialization. In the examples here, we use Turtle. Hence, we use@prefix
, etc. -
However, the namespace URIs used by the literal embedded rules can be defined in two places: the string that contains the rule (in the examples below, the default namespace is urn:test:) or the Stardog database in which the rules are stored. Either place will work; if there are conflicts, the "closest definition wins"; that is, if foo:Example is defined in both the rule content and in the Stardog database, the definition in the rule content is the one that Stardog will use.
-
Stardog Rule Syntax has the same expressivity as SWRL, which means the SPARQL features allowed in rules are limited. Specifically, a triple pattern in a rule should be in one of the following forms:
a) term1 rdf:type class-uri
b) term1 prop-uri term2
where class-uri is a URI referring to a user-defined class and prop-uri is a URI referring to a user-defined property.[23]
The only types of property paths allowed in rules are inverse paths (^p), sequence paths (p1 / p2), and alternative paths (p1 | p2), but these paths should not violate the above conditions. For example, the property path rdf:type/rdfs:label is not valid because, according to the SPARQL spec, this would mean the object of a rdf:type triple pattern is a variable and not a user-defined class.
The rule body (IF), and only the rule body, may optionally contain UNION, BIND, or FILTER clauses. However, the functions EXISTS, NOT EXISTS, or NOW() cannot be used in rules. User-defined functions (UDF) may be used in rules, but if the UDF is not a pure function then the results are undefined.
Other SPARQL features are not allowed in rules.
-
Using the same predicate both in the rule body (IF) and the rule head (THEN) is supported in a limited way. Cycles are allowed only if the rule body does not contain type triples or filters and the triple patterns in the rule body are linear (i.e. there are no cycles in the rule body either).
In other words, a property used in the rule head may depend on a property used in the rule body, and this dependency graph may contain cycles, subject to some limits. One of these limits is that the rule body must not contain type triples or filters. Tree-like dependencies are always allowed.
The triple patterns in the rule body constitute a different kind of graph: this graph must be linear when edge directions are ignored, so no cycles or trees are allowed among the triple patterns either. "Linear when directions are ignored" means that { ?x :p ?y . ?x :p ?z } is linear but { ?x :p ?y . ?x :p ?z . ?x :p ?t } is not, because there are three edges for the node represented by ?x.
The reason for these limits boils down to the fact that recursive rules and axioms are rewritten as SPARQL property paths, which is why such rule bodies cannot contain anything but property atoms. Cycles are allowed as long as we can express them as a regular grammar. Another way to think about this is that these rules should be as expressive as OWL property chains, and the same restrictions defined for property chains apply here, too.
Let’s consider some examples.
These rules are acceptable since no cycles appear in dependencies:
IF { ?x :hasFather ?y . ?y :hasBrother ?z } THEN { ?x :hasUncle ?z }
IF { ?x :hasUncle ?y . ?y :hasWife ?z } THEN { ?x :hasAuntInLaw ?z }
These rules are not acceptable since there is a cycle:
IF { ?x :hasFather ?y . ?y :hasBrother ?z } THEN { ?x :hasUncle ?z }
IF { ?x :hasChild ?y . ?y :hasUncle ?z } THEN { ?x :hasBrother ?z }
This kind of cycle is allowed:
IF { ?x :hasChild ?y . ?y :hasSibling ?z } THEN { ?x :hasChild ?z }
Note
|
(3) is a general limitation, not specific to Stardog Rules Syntax: recursion or cycles can occur through multiple rules, or it may occur as a result of interaction of rules with other axioms (or just through axioms alone). |
Stardog Rules Examples
PREFIX rule: <tag:stardog:api:rule:>
PREFIX : <urn:test:>
PREFIX gr: <http://purl.org/goodrelations/v1#>
:Product1 gr:hasPriceSpecification [ gr:hasCurrencyValue 100.0 ] .
:Product2 gr:hasPriceSpecification [ gr:hasCurrencyValue 500.0 ] .
:Product3 gr:hasPriceSpecification [ gr:hasCurrencyValue 2000.0 ] .
IF {
?offering gr:hasPriceSpecification ?ps .
?ps gr:hasCurrencyValue ?price .
FILTER (?price >= 200.00).
}
THEN {
?offering a :ExpensiveProduct .
}
This example is self-contained: it contains some data (the :Product…
triples) and a rule. It also demonstrates the use of SPARQL’s FILTER
to do numerical (and other) comparisons.
Here’s a more complex example that includes four rules and, again, some data.
PREFIX rule: <tag:stardog:api:rule:>
PREFIX : <urn:test:>
:c a :Circle ;
:radius 10 .
:t a :Triangle ;
:base 4 ;
:height 10 .
:r a :Rectangle ;
:width 5 ;
:height 8 .
:s a :Rectangle ;
:width 10 ;
:height 10 .
IF {
?r a :Rectangle ;
:width ?w ;
:height ?h
BIND (?w * ?h AS ?area)
}
THEN {
?r :area ?area
}
IF {
?t a :Triangle ;
:base ?b ;
:height ?h
BIND (?b * ?h / 2 AS ?area)
}
THEN {
?t :area ?area
}
IF {
?c a :Circle ;
:radius ?r
BIND (math:pi() * math:pow(?r, 2) AS ?area)
}
THEN {
?c :area ?area
}
IF {
?r a :Rectangle ;
:width ?w ;
:height ?h
FILTER (?w = ?h)
}
THEN {
?r a :Square
}
This example also demonstrates how to use SPARQL’s BIND
to introduce
intermediate variables and perform calculations with them.
Let’s look at some other rules, but just the rule content this time for concision, to see some use of other SPARQL features.
This rule says that a person between 13 and 19 (inclusive) years of age is a teenager:
PREFIX swrlb: <http://www.w3.org/2003/11/swrlb#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
IF {
?x a :Person; :hasAge ?age.
FILTER (?age >= 13 && ?age <= 19)
}
THEN {
?x a :Teenager.
}
This rule says that a male person with a sibling who is the parent of a female is an "uncle with a niece":
IF {
?x a :Person , :Male ; :hasSibling ?y .
?y :isParentOf ?z .
?z a :Female .
}
THEN {
?x a :UncleOfNiece.
}
We can use SPARQL 1.1 property paths (and bnodes for unnecessary
variables (that is, ones that aren’t used in the THEN
)) to render this
rule even more concisely:
IF {
?x a :Person, :Male; :hasSibling/:isParentOf [a :Female]
}
THEN {
?x a :UncleOfNiece.
}
Aside: that’s pure awesome.
And of course a person who’s male and has a niece or nephew is an uncle of his niece(s) and nephew(s):
IF {
?x a :Male; :isSiblingOf/:isParentOf ?z
}
THEN {
?x :isUncleOf ?z.
}
Next rule example: a super user can read all of the things!
IF {
?x a :SuperUser.
?y a :Resource.
?z a <http://www.w3.org/ns/sparql#UUID>.
}
THEN {
?z a :Role.
?x :hasRole ?z; :readPermission ?y.
}
Supported Built-Ins
Stardog supports a wide variety of functions from SPARQL, XPath, SWRL, and some native Stardog functions, too. All of them may be used in either Stardog Rules syntax or in SWRL syntax. The supported functions are enumerated here.
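For example, standard SPARQL string functions can be combined in a rule’s BIND; this is a small illustrative sketch (the :Person class and the :firstName, :lastName, and :displayName properties are hypothetical):
IF {
  ?p a :Person ;
     :firstName ?first ;
     :lastName ?last
  BIND (UCASE(CONCAT(?first, " ", ?last)) AS ?name)
}
THEN {
  ?p :displayName ?name
}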
Special Predicates
Stardog supports some builtin predicates with special meaning in order to make queries easier to read and write. These special predicates are primarily syntactic sugar for more complex structures.
Direct/Strict Subclasses, Subproperties, & Direct Types
Besides the standard RDF(S) predicates rdf:type
, rdfs:subClassOf
and
rdfs:subPropertyOf
, Stardog supports the following special built-in
predicates:
-
sp:directType
-
sp:directSubClassOf
-
sp:strictSubClassOf
-
sp:directSubPropertyOf
-
sp:strictSubPropertyOf
Where the sp
prefix binds to tag:stardog:api:property:
. Stardog also
recognizes sesame:directType
, sesame:directSubClassOf
, and
sesame:strictSubClassOf
predicates where the prefix sesame
binds to
http://www.openrdf.org/schema/sesame#
.
We show what each of these predicates means by relating it to an equivalent triple pattern; that is, you can just write the predicate rather than the (more unwieldy) triple pattern.
#c1 is a subclass of c2 but not equivalent to c2
:c1 sp:strictSubClassOf :c2 => :c1 rdfs:subClassOf :c2 .
FILTER NOT EXISTS {
:c1 owl:equivalentClass :c2 .
}
#c1 is a strict subclass of c2 and there is no c3 between c1 and c2 in
#the strict subclass hierarchy
:c1 sp:directSubClassOf :c2 => :c1 sp:strictSubClassOf :c2 .
FILTER NOT EXISTS {
:c1 sp:strictSubClassOf :c3 .
:c3 sp:strictSubClassOf :c2 .
}
#ind is an instance of c1 but not an instance of any strict subclass of c1
:ind sp:directType :c1 => :ind rdf:type :c1 .
FILTER NOT EXISTS {
:ind rdf:type :c2 .
:c2 sp:strictSubClassOf :c1 .
}
The predicates sp:directSubPropertyOf
and sp:strictSubPropertyOf
are defined
analogously.
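As an illustration, the following query sketch (typically run with reasoning enabled) returns only the most specific types of a hypothetical individual :Alice, rather than every inferred type:
PREFIX sp: <tag:stardog:api:property:>

SELECT ?type WHERE {
  :Alice sp:directType ?type
}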
New Individuals with SWRL
Stardog also supports a special predicate that extends the expressivity of SWRL rules. According to SWRL, you can’t create new individuals (i.e., new instances of classes) in a SWRL rule.
Note
|
Don’t get hung up by the tech vocabulary here…"new individual" just means that you can’t have a rule that creates a new instance of some RDF or OWL class as a result of the rule firing. |
This restriction is well-motivated; without it, you can easily create rules that do not terminate, that is, never reach a fixed point. Stardog’s user-defined rules weaken this restriction in some crucial respects, subject to the following restrictions, conditions, and warnings.
Warning
|
This special predicate is basically a loaded gun with which you may shoot yourselves in the foot if you aren’t very careful. |
So despite the general restriction in SWRL, in Stardog we actually can
create new individuals with a rule by using the function UUID()
as
follows:
IF {
?p a :Parent .
BIND (UUID() AS ?parent) .
}
THEN {
?parent a :Person .
}
Note
|
Alternatively, we can use the predicate
http://www.w3.org/ns/sparql#UUID as a unary SWRL built-in.
|
This rule will create a random URI for each instance of the class
:Parent
and also assert that each new instance is an instance of
:Person
--parents are people, too!
Remarks
-
The URIs for the generated individuals are meaningless in the sense that they should not be used in further queries; that is to say, these URIs are not guaranteed by Stardog to be stable.
-
Due to normalization, rules with more than one atom in the head are broken up into several rules.
Thus,
IF {
?person a :Person .
BIND (UUID() AS ?parent) .
}
THEN {
?parent a :Parent ;
a :Male .
}
will be normalized into two rules:
IF {
?person a :Person .
BIND (UUID() AS ?parent) .
}
THEN {
?parent a :Parent .
}
IF {
?person a :Person .
BIND (UUID() AS ?parent) .
}
THEN {
?parent a :Male .
}
As a consequence, instead of stating that the new individual is both an instance
of :Male
and :Parent
, we would create two different new individuals and
assert that one is male and the other is a parent. If you need to assert various
things about the new individual, we recommend the use of extra rules or axioms. In
the previous example, we can introduce a new class (:Father
) and add the
following rule to our schema:
IF {
?person a :Father .
}
THEN {
?person a :Parent ;
a :Male .
}
And then modify the original rule accordingly:
IF {
?person a :Person .
BIND (UUID() AS ?parent) .
}
THEN {
?parent a :Father .
}
Query Rewriting
Reasoning in Stardog is based (mostly) on a query rewriting technique: Stardog rewrites the user’s query with respect to any schema or rules, and then executes the resulting expanded query (EQ) against the data in the normal way. This process is completely automated and requires no intervention from the user.
As can be seen in Figure 1, the rewriting process involves five different phases.
[Figure 1: the phases of the query rewriting process]
We illustrate the query answering process by means of an example. Consider a Stardog database, MyDB1, containing the following schema:
:SeniorManager rdfs:subClassOf :manages some :Manager
:manages some :Employee rdfs:subClassOf :Manager
:Manager rdfs:subClassOf :Employee
Which says that a senior manager manages at least one manager, that every person that manages an employee is a manager, and that every manager is also an employee.
Let’s also assume that MyDB1 contains the following data assertions:
:Bill rdf:type :SeniorManager
:Robert rdf:type :Manager
:Ana :manages :Lucy
:Lucy rdf:type :Employee
Finally, let’s say that we want to retrieve the set of all employees. We do this by posing the following query:
SELECT ?employee WHERE { ?employee rdf:type :Employee }
To answer this query, Stardog first rewrites it using the information in the schema. So the original query is rewritten into four queries:
SELECT ?employee WHERE { ?employee rdf:type :Employee }
SELECT ?employee WHERE { ?employee rdf:type :Manager }
SELECT ?employee WHERE { ?employee rdf:type :SeniorManager }
SELECT ?employee WHERE { ?employee :manages ?x. ?x rdf:type :Employee }
Then Stardog executes these queries over the data as if they were written that way to begin with. In fact, Stardog can’t tell that they weren’t. Reasoning in Stardog just is query answering in nearly every case.
The form of the EQ depends on the reasoning type. For OWL 2 QL, every EQ produced by Stardog is guaranteed to be expanded into a set of queries. If the reasoning type is OWL 2 RL or EL, then the EQ may (or may not) include a recursive rule. If a recursive rule is included, Stardog’s answers may be incomplete with respect to the semantics of the reasoning type.
Why Query Rewriting?
Query rewriting has several advantages over materialization. In materialization, the data gets expanded with respect to the schema, not with respect to any actual query. And it’s the data—all of the data—that gets expanded, whether any actual query subsequently requires reasoning or not. The schema is used to generate new triples, typically when data is added or removed from the system. However, materialization introduces several thorny issues:
-
data freshness. Materialization has to be performed every time the data or the schema change. This is particularly unsuitable for applications where the data changes frequently.
-
data size. Depending on the schema, materialization can significantly increase the size of the data, sometimes dramatically so. The cost of this data size blowup may be applied to every query in terms of increased I/O.
-
OWL 2 profile reasoning. Given the fact that QL, RL, and EL are not comparable with respect to expressive power, an application that requires reasoning with more than one profile would need to maintain different corresponding materialized versions of the data.
-
Resources. Depending on the size of the original data and the complexity of the schema, materialization may be computationally expensive. And truth maintenance, which materialization requires, is always computationally expensive.
Same As Reasoning
Stardog 3.0 adds full support for OWL 2 sameAs
reasoning. However,
sameAs
reasoning works in a different way than the rest of the
reasoning mechanism. The sameAs
inferences are computed and indexed
eagerly so that these materialized inferences can be used directly at
query rewriting time. The sameAs
index is updated automatically as the
database is modified so the difference is not of much direct concern to users.
In order to use sameAs
reasoning, the database configuration
option reasoning.sameas
should be set either at database creation
time or at a later time when the database is offline. This can be done
using the command line as follows:
$ ./stardog-admin db create -o reasoning.sameas=FULL -n myDB
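To change the option on an existing database, take it offline first; this is a sketch that uses the same admin commands shown elsewhere in these docs:
$ ./stardog-admin db offline myDB
$ ./stardog-admin metadata set -o reasoning.sameas=ON myDB
$ ./stardog-admin db online myDB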
There are three legal values for this option:
-
OFF
disables allsameAs
inferences, that is, only assertedsameAs
triples will be included in query results.[24] -
ON
computessameAs
inferences using only assertedsameAs
triples, considering the reflexivity, symmetry and transitivity of thesameAs
relation. -
FULL
same asON
but also considers OWL functional properties, inverse functional properties, andhasKey
axioms while computingsameAs
inferences.
Note
|
The way sameAs reasoning works differs from the OWL semantics
slightly in the sense that Stardog designates one canonical individual
for each sameAs equivalence set and only returns the canonical
individual. This avoids the combinatorial explosion in query results
while providing the data integration benefits.
|
Let’s see an example showing how sameAs
reasoning works. Consider the
following database where sameAs
reasoning is set to ON
:
dbpedia:Elvis_Presley
dbpedia-owl:birthPlace dbpedia:Mississippi ;
owl:sameAs freebase:en.elvis_presley .
nyt:presley_elvis_per
nyt:associated_article_count 35 ;
rdfs:label "Elvis Presley" ;
owl:sameAs dbpedia:Elvis_Presley .
freebase:en.elvis_presley
freebase:common.topic.official_website <http://www.elvis.com/> .
Now consider the following query and its results:
$ ./stardog query --reasoning elvis 'SELECT * { ?s dbpedia-owl:birthPlace ?o; rdfs:label "Elvis Presley" }'
+-----------------------+---------------------+
| s | o |
+-----------------------+---------------------+
| nyt:presley_elvis_per | dbpedia:Mississippi |
+-----------------------+---------------------+
Let’s unpack this carefully. There are three things to note.
First, the query returns only one result even though there are three
different URIs that denote Elvis Presley. Second, the URI returned is
fixed but chosen randomly. Stardog picks one of the URIs as the
canonical URI and always returns that and only that canonical URI in
the results. If more sameAs
triples are added the chosen canonical
individual may change. Third, it is important to point out that even
though only one URI is returned, the effect of sameAs
reasoning is
visible in the results since the rdfs:label
and
dbpedia-owl:birthPlace
properties were asserted about different
instances (i.e., different URIs).
Now, you might be inclined to write queries such as this to get all the properties for a specific URI:
SELECT * {
nyt:presley_elvis_per owl:sameAs ?elvis .
?elvis ?p ?o
}
However, this is completely unnecessary; rather, you can write the
following query and get the same results since sameAs
reasoning would
automatically merge the results for you. Therefore, the query
SELECT * {
nyt:presley_elvis_per ?p ?o
}
would return these results:
+----------------------------------------+-----------------------+
| p | o |
+----------------------------------------+-----------------------+
| rdfs:label | "Elvis Presley" |
| dbpedia-owl:birthPlace | dbpedia:Mississippi |
| nyt:associated_article_count | 35 |
| freebase:common.topic.official_website | http://www.elvis.com/ |
| rdf:type | owl:Thing |
+----------------------------------------+-----------------------+
Note
|
The URI used in the query does not need to be the same one returned in the results. Thus, the following query would return the exact same results, too: |
SELECT * {
dbpedia:Elvis_Presley ?p ?o
}
The only time Stardog will return a non-canonical URI in the query
results is when you explicitly query for the sameAs
inferences as in
this next example:
$ ./stardog query -r elvis 'SELECT * { freebase:en.elvis_presley owl:sameAs ?elvis }'
+---------------------------+
| elvis |
+---------------------------+
| dbpedia:Elvis_Presley |
| freebase:en.elvis_presley |
| nyt:presley_elvis_per |
+---------------------------+
In the FULL
sameAs
reasoning mode, Stardog will also take other
OWL axioms into account when computing sameAs
inferences. Consider
the following example:
#Everyone has a unique SSN number
:hasSSN a owl:InverseFunctionalProperty , owl:DatatypeProperty .
:JohnDoe :hasSSN "123-45-6789" .
:JDoe :hasSSN "123-45-6789" .
#Nobody can work for more than one company (for the sake of the example)
:worksFor a owl:FunctionalProperty , owl:ObjectProperty ;
rdfs:domain :Employee ;
rdfs:range :Company .
:JohnDoe :worksFor :Acme .
:JDoe :worksFor :AcmeInc .
#For each company, there can only be one employee with the same employee ID
:Employee owl:hasKey (:employeeID :worksFor ).
:JohnDoe :employeeID "1234-ABC" .
:JohnD :employeeID "1234-ABC" ;
:worksFor :AcmeInc .
:JD :employeeID "5678-XYZ" ;
:worksFor :AcmeInc .
:John :employeeID "1234-ABC" ;
:worksFor :Emca .
For this database, with sameAs
reasoning set to FULL
, we would get
the following answers:
$ ./stardog query -r acme "SELECT * {?x owl:sameAs ?y}"
+----------+----------+
| x | y |
+----------+----------+
| :JohnDoe | :JohnD |
| :JDoe | :JohnD |
| :Acme | :AcmeInc |
+----------+----------+
We can follow the chain of inferences to understand how these results were computed:
-
:JohnDoe owl:sameAs :JDoe
can be computed due to the fact that both have the same SSN numbers andhasSSN
property is inverse functional. -
We can infer
:Acme owl:sameAs :AcmeInc
since:JohnDoe
can work for at most one company. -
:JohnDoe owl:sameAs :JohnD
can be inferred using theowl:hasKey
definition since both individuals are known to work for the same company and have the same employee ID. -
No more
sameAs
inferences can be computed due to the key definition, since other employees either have different IDs or work for other companies.
Removing Unwanted Inferences
Sometimes reasoning can produce unintended inferences. Perhaps there are modeling errors in the schema or incorrect assertions in the data. After an unintended inference is detected, it might be hard to figure out how to fix it, because there might be multiple different reasons for the inference. The reasoning explain command can be used to see the different explanations and the reasoning undo command can be used to generate a SPARQL update query that will remove the minimum amount of triples necessary to remove the unwanted inference:
$ stardog reasoning undo myDB ":AcmeInc a :Person"
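To see why the inference holds before removing it, the reasoning explain command mentioned above can be invoked with the same kind of arguments; a sketch:
$ stardog reasoning explain myDB ":AcmeInc a :Person"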
Performance Hints
The query rewriting approach suggests some guidelines for more efficient query answering.
Hierarchies and Queries
- Avoid unnecessarily deep class/property hierarchies.
-
If you do not need to model several different types of a given class or property in your schema, then don’t do that! The reason shallow hierarchies are desirable is that the maximal hierarchy depth in the schema partly determines the maximal size of the EQs produced by Stardog. The larger the EQ, the longer it takes to evaluate, generally.
For example, suppose our schema contains a very thorough and detailed set of subclasses of the class
:Employee
::Manager rdfs:subClassOf :Employee :SeniorManager rdfs:subClassOf :Manager ... :Supervisor rdfs:subClassOf :Employee :DepartmentSupervisor rdfs:subClassOf :Supervisor ... :Secretary rdfs:subClassOf :Employee ...
If we wanted to retrieve the set of all employees, Stardog would produce an EQ containing a query of the following form for every subclass
:Ci
of:Employee
:SELECT ?employee WHERE { ?employee rdf:type :Ci }
Thus, ask the most specific query sufficient for your use case. Why? More general queries—that is, queries that contain concepts high up in the class hierarchy defined by the schema—will typically yield larger EQs.
- Avoid variable predicates and variable types.
-
When reasoning is enabled, triple patterns with a variable in the predicate position or a variable in the object position for
rdf:type
often cause performance problems, especially with large class hierarchies. In some cases, Stardog’s optimizer can address this issue by using other patterns in the query which bind those variables, but it is not always possible. Consider enumerating values for those variables explicitly; for example, instead of
SELECT * WHERE { ?employeeX ?property ?employeeY }
use
SELECT * WHERE { ?employeeX :colleague | :manager ?employeeY }
or, if all relevant properties have a common super-property, say, :worksWith, then
SELECT * WHERE { ?employeeX ?property ?employeeY . ?property rdfs:subPropertyOf :worksWith }
should also work better than the first query (though likely less efficient than the second).
Domains and Ranges
- Specify domain and range of the properties in the schema.
-
These types of axiom can improve query performance significantly. Consider the following query asking for people and the employees they manage:
SELECT ?manager ?employee WHERE { ?manager :manages ?employee. ?employee rdf:type :Employee. }
We know that this query would cause a large EQ given a deep hierarchy of
:Employee
subclasses. However, if we added the following single range axiom::manages rdfs:range :Employee
then the EQ would collapse to
SELECT ?manager ?employee WHERE { ?manager :manages ?employee }
which is considerably easier to evaluate.
Very Large Schemas
If you are working with a very large schema like SNOMED, there are a couple of things to note. First of all, Stardog reasoning works by pulling the complete schema into memory, so you might need to increase the default memory settings for Stardog for a large schema. Stardog performs all schema reasoning upfront and only once, but it waits until the first reasoning query arrives to do so. With a large schema this step can be slow, but subsequent reasoning queries will be fast. Also note that Stardog will update schema reasoning results automatically after the database is modified, so some processing time will be spent then as well.
Reasoning with very expressive schemas can be time consuming and use a lot of
memory. To get the best performance out of Stardog with large schemas, limit the
expressivity of your schema to
OWL 2 EL. You
can also set the reasoning type of the database to EL
and Stardog will
automatically filter any axiom outside the EL expressivity. See
Reasoning Types for more details on reasoning types. OWL 2 EL allows
range declarations for properties and user-defined datatypes but avoiding these
two constructs will further improve schema reasoning performance in Stardog.
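For example, the reasoning type can be set when the database is created, in the same way as the other configuration options shown in this chapter (the reasoning.type option name is an assumption here):
$ ./stardog-admin db create -o reasoning.type=EL -n myDB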
Not Seeing Expected Results?
Here are a few things that you might want to consider.
Are variable types ambiguous?
When a SPARQL query gets executed, each variable is bound to a URI, blank node, or to a literal to form a particular result (a collection of these results is a result set). In the context of reasoning, URIs might represent different entities: individuals, classes, properties, etc. According to the relevant standard, every variable in a SPARQL query must bind to at most one of these types of entity.
Stardog can often figure out the right entity type from the query itself (e.g.,
given the triple pattern ?i ?p "a literal"
, we know ?p
is supposed to bind
to a data property); however, sometimes this isn’t possible (e.g., ?s ?p
?o
). In case the types can’t be determined automatically, Stardog logs a
message and evaluates the query by making some assumptions, which may not be
what the query writer intended, about the types of variables.
You can add one or more type triples to the query to resolve these ambiguities.[25]
These "type triples" have the form ?var a TYPE
, where TYPE
is a URI
representing the type of entity to which the variable ?var
is supposed to
bind: the most common are owl:ObjectProperty
or owl:DatatypeProperty
; in
some cases, you might want owl:NamedIndividual
, or owl:Class
. For instance,
you can use the following query to retrieve all object properties and their
characteristics; without the type triple, ?s
will bind only to individuals:
SELECT ?o
WHERE {
?s rdf:type ?o.
?s a owl:ObjectProperty.
}
Since Stardog now knows that ?s
should bind to an object property, it can now
infer that ?o
binds to property characteristics of ?s
.
Is the schema where you think it is?
Starting in Stardog 3.0, Stardog will extract the schema from all named graphs and the default graph.
If you require that the schema only be extracted from one or more specific named
graphs, then you must tell Stardog where to find the schema. See database
configuration options for details. You can also use
the reasoning schema
command to
export the contents of the schema to see exactly what is included in the schema
that Stardog uses.
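A minimal sketch of exporting the schema with that command:
$ stardog reasoning schema myDB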
Are you using the right reasoning type?
Perhaps some of the modeling constructs (a.k.a. axioms) in your database are
being ignored. By default, Stardog uses the SL
reasoning type. You can find
out which axioms are being ignored by looking at the Stardog log file.
Are you using DL?
Stardog supports full OWL 2 DL reasoning but only for data that fits into main memory.
Are you using SWRL?
SWRL rules—whether using SWRL syntax or Stardog Rules Syntax—are only taken into account using the SL reasoning type.
Do you know what to expect?
The OWL 2 primer is a good place to start.
Known Issues
Stardog 7.4.5 does not
-
Follow ontology
owl:imports
statements automatically; any imported OWL ontologies that are required must be loaded into a Stardog database in the normal way. -
Handle arbitrary recursion in OWL axioms and rules (see Rule Limitations & Gotchas). If recursion is unsupported, query results will be sound (no wrong answers) but potentially incomplete (some correct answers not returned) with respect to the requested reasoning type.
Terminology
This chapter uses the following terms of art.
Databases
A database (DB), a.k.a. ontology, is composed of two different parts: the schema or Terminological Box (TBox) and the data or Assertional Box (ABox). Analogous to relational databases, the TBox can be thought of as the schema and the ABox as the data. In other words, the TBox is a set of axioms, whereas the ABox is a set of assertions.
As we explain in OWL 2 Profiles, the kinds of assertion and axiom that one might use for a particular database are determined by the fragment of OWL 2 to which you’d like to adhere. In general, you should choose the OWL 2 profile that most closely fits the data modeling needs of your application.
The most common data assertions are class and property assertions. Class assertions are used to state that a particular individual is an instance of a given class. Property assertions are used to state that two particular individuals (or an individual and a literal) are related via a given property. For example, suppose we have a DB MyDB2 that contains the following data assertions. We use the usual standard prefixes for RDF(S) and OWL.
:complexible rdf:type :Company
:complexible :maintains :Stardog
Which says that :complexible
is a company, and that
:complexible
maintains :Stardog
.
The most common schema axioms are subclass axioms. Subclass axioms are used to state that every instance of a particular class is also an instance of another class. For example, suppose that MyDB2 contains the following TBox axiom:
:Company rdfs:subClassOf :Organization
stating that companies are a type of organization.
Queries
When reasoning is enabled, Stardog executes SPARQL queries depending on the type of Basic Graph Patterns they contain. A BGP is said to be an "ABox BGP" if it is of one of the following forms:
-
term1
rdf:type
uri -
term1 uri term2
-
term1
owl:differentFrom
term2 -
term1
owl:sameAs
term2
A BGP is said to be a TBox BGP if it is of one of the following forms:
-
term1
rdfs:subClassOf
term2 -
term1
owl:disjointWith
term2 -
term1
owl:equivalentClass
term2 -
term1
rdfs:subPropertyOf
term2 -
term1
owl:equivalentProperty
term2 -
term1
owl:inverseOf
term2 -
term1
owl:propertyDisjointWith
term2 -
term1
rdfs:domain
term2 -
term1
rdfs:range
term2
A BGP is said to be a Hybrid BGP if it is of one of the following forms:
-
term1
rdf:type
?var -
term1 ?var term2
where term (possibly with subscripts) is either an URI or variable; uri is a URI; and ?var is a variable.
When executing a query, ABox BGPs are handled by Stardog. TBox BGPs are executed by Pellet embedded in Stardog. Hybrid BGPs are handled by a combination of both.
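For example, the following query sketch combines an ABox BGP with a TBox BGP, so its evaluation is split between Stardog and the embedded Pellet:
SELECT ?person ?class WHERE {
  ?person rdf:type :Employee .        # ABox BGP
  ?class rdfs:subClassOf :Employee .  # TBox BGP
}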
Reasoning
Intuitively, reasoning with a DB means to make implicit knowledge explicit. There are two main use cases for reasoning: to infer implicit knowledge and to discover modeling errors.
With respect to the first use case, recall that MyDB2 contains the following assertion and axiom:
:complexible rdf:type :Company
:Company rdfs:subClassOf :Organization
From this DB, we can use Stardog in order to infer that
:complexible
is an organization:
:complexible rdf:type :Organization
Using reasoning in order to infer implicit knowledge in the context of an enterprise application can lead to simpler queries. Let us suppose, for example, that MyDB2 contains a complex class hierarchy including several types of organization (including company). Let us further suppose that our application needs to use Stardog to get the list of all organizations. If Stardog were used with reasoning, then we would only need to issue the following simple query:
SELECT ?org WHERE { ?org rdf:type :Organization}
In contrast, if we were using Stardog with no reasoning, then we would have to issue a more complex query that considers all possible types of organization, thus coupling queries to domain knowledge in a tight way:
SELECT ?org WHERE
{ { ?org rdf:type :Organization } UNION
{ ?org rdf:type :Company } UNION
...
}
Which of these queries seems more loosely coupled and more resilient to change?
Stardog can also be used in order to discover modeling errors in a DB. The most common modeling errors are unsatisfiable classes and inconsistent DBs.
An unsatisfiable class is simply a class that cannot have any instances. Say, for example, that we added the following axioms to MyDB2:
:Company owl:disjointWith :Organization
:LLC owl:equivalentClass :Company and :Organization
stating that companies cannot be organizations and vice versa, and that
an LLC is a company and an organization. The disjointness axiom causes
the class :LLC
to be unsatisfiable because, for the DB to be
free of any logical contradiction, there can be no instances of :LLC
.
Asserting (or inferring) that an unsatisfiable class has an instance causes the
DB to be inconsistent. In the particular case of MyDB2, we know that
:complexible
is a company and an organization; therefore, we also know
that it is an instance of :LLC
, and as :LLC
is known to be unsatisfiable, we
have that MyDB2 is inconsistent.
Using reasoning in order to discover modeling errors in the context of
an enterprise application is useful in order to maintain a correct
contradiction-free model of the domain. In our example, we discovered
that :LLC
is unsatisfiable and MyDB2 is inconsistent, which leads us
to believe that there is a modeling error in our DB. In this case, it is
easy to see that the problem is the disjointness axiom between
:Company
and :Organization
.
OWL 2 Profiles
As explained in the OWL 2 Web Ontology Language Profiles Specification, an OWL 2 profile is a reduced version of OWL 2 that trades some expressive power for efficiency of reasoning. There are three OWL 2 profiles, each of which achieves efficiency differently.
-
OWL 2 QL is aimed at applications that use very large volumes of instance data, and where query answering is the most important reasoning task. The expressive power of the profile is necessarily limited; however, it includes most of the main features of conceptual models such as UML class diagrams and ER diagrams.
-
OWL 2 EL is particularly useful in applications employing ontologies that contain very large numbers of properties and classes. This profile captures the expressive power used by many such ontologies and is a subset of OWL 2 for which the basic reasoning problems can be performed in time that is polynomial with respect to the size of the ontology.
-
OWL 2 RL is aimed at applications that require scalable reasoning without sacrificing too much expressive power. It is designed to accommodate OWL 2 applications that can trade the full expressivity of the language for efficiency, as well as RDF(S) applications that need some added expressivity.
Each profile restricts the kinds of axiom and assertion that can be used
in a DB. Colloquially, QL is the least expressive of the profiles,
followed by RL and EL; however, strictly speaking, no profile is more
expressive than any other as they provide incomparable sets of
constructs. The SL
profile, which is the default for Stardog, contains all
three of them.
Validating Constraints
Stardog Integrity Constraint Validation ("ICV") validates RDF data stored in a Stardog database according to constraints, described by users, that make sense for their domain, application, and data. These constraints may be written in SPARQL, OWL, SWRL, or SHACL.
The use of high-level languages (OWL 2, SWRL, and SPARQL) to validate RDF data using closed world semantics is one of Stardog’s unique capabilities. Using high level languages like OWL, SWRL, and SPARQL as schema or constraint languages for RDF and Linked Data has several advantages:
-
Unifying the domain model with data quality rules
-
Aligning the domain model and data quality rules with the integration model and language (i.e., RDF)
-
Being able to query the domain model, data quality rules, integration model, mapping rules, etc. with SPARQL
-
Being able to use automated reasoning about all of these things to ensure logical consistency, explain errors and problems, etc.
Tip
|
See the extended ICV tutorial in the stardog-examples repo on Github and our blog post Data Quality with ICV for more details about using ICV. Use of SHACL is demonstrated in our Data Validation and SHACL webinar. |
Using ICV from CLI
To add constraints to a database:
$ stardog-admin icv add myDb constraints.rdf
To drop all constraints from a database:
$ stardog-admin icv drop myDb
To remove one or more specific constraints from a database:
$ stardog-admin icv remove myDb constraints.rdf
To convert new or existing constraints into SPARQL queries for export:
$ stardog icv convert myDb constraints.rdf
To explain a constraint violation:
$ stardog icv explain --contexts http://example.org/context1 http://example.org/context2 -- myDb
To export constraints:
$ stardog icv export myDb constraints.rdf
To validate a database (or some named graphs) with respect to constraints:
$ stardog icv validate --contexts http://example.org/context1 http://example.org/context2 -- myDb
ICV & OWL 2 Reasoning
An integrity constraint may be satisfied or violated in either of two ways: by an explicit statement in a Stardog database or by a statement that’s been validly inferred by Stardog.
When ICs are being validated the user needs to specify if reasoning will be used or not. So ICV is performed with three inputs:
-
a Stardog database,
-
a set of constraints, and
-
a reasoning flag (which may be, of course, set to false for no reasoning).
This is the case because domain modelers, ontology developers, or integrity constraint authors must consider the interactions between explicit and inferred statements and how these are accounted for in integrity constraints.
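On the command line, the reasoning flag is simply passed to the validation command; this sketch assumes that icv validate accepts the same --reasoning flag used by icv explain below:
$ stardog icv validate --reasoning myDb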
ICV Guard Mode
Stardog will also apply constraints as part of its transactional cycle and fail transactions that violate constraints. We call this "guard mode". It must be enabled explicitly in the database configuration options. Using the command line, these steps are as follows:
$ ./stardog-admin db offline myDb #take the database offline
$ ./stardog-admin metadata set -o icv.enabled=true myDb #enable ICV
$ ./stardog-admin db online myDb #put the database online
Once guard mode is enabled, modifications of the database (via SPARQL Update or any other method), whether adds or deletes, that violate the integrity constraints will cause the transaction to fail.
Explaining ICV Violations
ICV violations can be explained using Stardog’s Proof Trees. The following command will explain the IC violations for constraints stored in the database:
$ stardog icv explain --reasoning "myDB"
The command also lets you change the number of violations displayed and explain violations for external constraints by passing a file with those constraints as an additional argument:
$ stardog icv explain --reasoning --limit 2 "myDB" constraints.ttl
Security Note
Warning
|
There is a security implication in this design that may not be obvious. Changing the reasoning type associated with a database and integrity constraint validation may have serious security implications with respect to a Stardog database and, thus, may only be performed by a user role with sufficient privileges for that action. |
Repairing ICV Violations
Stardog 3.0 adds support for automatic repair of some kinds of integrity
violation. This can be accomplished programmatically via API, as well as via CLI
using the icv fix
subcommand.
$ stardog help icv fix
Repair plans are emitted as a sequence of SPARQL Update queries, which
means they can be applied to any system that understands SPARQL
Update. If you pass --execute
the repair plan will be applied
immediately.
icv fix
will repair violations of all constraints in
the database; if you’d prefer to fix the violations for only some constraints,
you can pass those constraints as an additional argument. Although a possible
(but trivial) fix for any violation is to remove one or more constraints,
icv fix
does not suggest that kind of repair, even
though it may be appropriate in some cases.
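Putting the pieces of this section together, a sketch of first capturing a repair plan and then applying repairs directly:
$ stardog icv fix myDB > fix-plan.ru
$ stardog icv fix --execute myDB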
SHACL Constraints
As of version 6.1, Stardog supports validation of SHACL constraints. SHACL constraints can be managed like any other constraint Stardog supports and all the existing validation commands work with SHACL constraints.
Normally constraints are stored in the system database and managed with special
commands icv add
and icv
remove
. This is still possible with SHACL constraints but if desired SHACL
constraints can be loaded into the database along with regular data using
data add
. Validation results will be the same in both
cases.
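For example, a simple SHACL shape (in Turtle, using an example namespace) requiring every :Employee to have exactly one integer-valued :employeeNum could be loaded with either icv add or data add:
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <http://example.org/> .

:EmployeeShape a sh:NodeShape ;
    sh:targetClass :Employee ;
    sh:property [
        sh:path :employeeNum ;
        sh:datatype xsd:integer ;
        sh:minCount 1 ;
        sh:maxCount 1
    ] .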
SHACL support comes with a new validation command that outputs the SHACL validation report:
$ stardog icv report myDb
SHACL Support Limitations
Stardog supports all the features in the core SHACL language with the following exceptions:
-
Stardog supports SPARQL-based constraints but does not support prebinding the
$shapesGraph
or$currentShape
variables in SPARQL -
Stardog does not support property validators
-
Stardog does not support the Advanced Features or the JavaScript Extensions
OWL Constraint Examples
Let’s look at some OWL constraint examples that show how these constraints work.
These examples express the constraints as OWL axioms in Turtle, and they assume a simple data schema, which is
available as an OWL ontology and as a
UML diagram. The examples assume that the
default namespace is http://example.com/company.owl#
and that
xsd:
is bound to the standard,
http://www.w3.org/2001/XMLSchema#
.
Reference Java code is available for each of the following examples and is also distributed with Stardog.
Subsumption Constraints
This kind of constraint guarantees certain subclass and superclass (i.e., subsumption) relationships exist between instances.
Managers must be employees.
Constraint
:Manager rdfs:subClassOf :Employee
Database A (invalid)
:Alice a :Manager .
Database B (valid)
:Alice a :Manager , :Employee .
This constraint says that if an RDF individual is an instance of
Manager
, then it must also be an instance of Employee
. In
A, the only instance of Manager
, namely Alice
, is not an instance of
Employee
; therefore, A is invalid. In B, Alice
is an instance of
Database both Manager
and Employee
; therefore, B is valid.
Domain-Range Constraints
These constraints control the types of subjects and objects used with a property.
Only project leaders can be responsible for projects.
Constraint
:is_responsible_for rdfs:domain :Project_Leader ;
rdfs:range :Project .
Database A (invalid)
:Alice :is_responsible_for :MyProject .
:MyProject a :Project .
Database B (invalid)
:Alice a :Project_Leader ;
:is_responsible_for :MyProject .
Database C (valid)
:Alice a :Project_Leader ;
:is_responsible_for :MyProject .
:MyProject a :Project .
This constraint says that if two RDF instances are related to each other via the
property is_responsible_for
, then the subject must be an instance of
Project_Leader
and the object must be an instance of Project
. In
Database A, there is only one pair of individuals related via
is_responsible_for
, namely (Alice, MyProject)
, and MyProject
is an
instance of Project
; but Alice
is not an instance of Project_Leader
.
Therefore, A is invalid. In B, Alice
is an instance of Project_Leader
, but
MyProject
is not an instance of Project
; therefore, B is not valid. In C,
Alice
is an instance of Project_Leader
, and MyProject
is an instance of
Project
; therefore, C is valid.
Only employees can have an SSN.
Constraint
:ssn rdfs:domain :Employee
Database A (invalid)
:Bob :ssn "123-45-6789" .
Database B (valid)
:Bob a :Employee ;
:ssn "123-45-6789" .
This constraint says that if an RDF instance i
has a data assertion via the
the property SSN
, then i
must be an instance of Employee
. In A, Bob
is
not an instance of Employee
but has SSN
; therefore, A is invalid. In B,
Bob
is an instance of Employee
; therefore, B is valid.
A date of birth must be a date.
Constraint
:dob rdfs:range xsd:date
Database A (invalid)
:Bob :dob "1970-01-01" .
Database B (valid)
:Bob :dob "1970-01-01"^^xsd:date
This constraint says that if an RDF instance i
is related to a literal
l
via the data property DOB
, then l
must have the XML Schema type
xsd:date
. In A, Bob
is related to the untyped literal
"1970-01-01"
via DOB
so A is invalid. In B, the literal
"1970-01-01"
is properly typed so it’s valid.
Participation Constraints
These constraints control whether or not an RDF instance participates in some specified relationship.
Each supervisor must supervise at least one employee.
Constraint
:Supervisor rdfs:subClassOf
[ a owl:Restriction ;
owl:onProperty :supervises ;
owl:someValuesFrom :Employee
] .
Database A (valid)
:Alice a owl:Thing .
Database B (invalid)
:Alice a :Supervisor .
Database C (invalid)
:Alice a :Supervisor ;
:supervises :Bob .
Database D (valid)
:Alice a :Supervisor ;
:supervises :Bob .
:Bob a :Employee
This constraint says that if an RDF instance i
is of type Supervisor
, then
i
must be related to an individual j
via the property supervises
and also
that j
must be an instance of Employee
. In A, Supervisor
has no instances;
therefore, A is trivially valid. In B, the only instance of Supervisor
, namely
Alice
, is related to no individual; therefore, B is invalid. In C, Alice
is
related to Bob
via supervises
, but Bob
is not an instance of Employee
;
therefore, C is invalid. In D, Alice
is related to Bob
via supervises
, and
Bob
is an instance of Employee
; hence, D is valid.
Each project must have a valid project number.
Constraint
:Project rdfs:subClassOf
[ a owl:Restriction ;
owl:onProperty :number ;
owl:someValuesFrom
[ a rdfs:Datatype ;
owl:onDatatype xsd:integer ;
owl:withRestrictions ([xsd:minInclusive 0] [ xsd:maxExclusive 5000])
]
] .
Database A (valid)
:MyProject a owl:Thing .
Database B (invalid)
:MyProject a :Project
Database C (invalid)
:MyProject a :Project ;
:number "23" .
Database D (invalid)
:MyProject a :Project ;
:number "6000"^^xsd:integer .
Database E (valid)
:MyProject a :Project ;
:number "23"^^xsd:integer .
This constraint says that if an RDF instance i
is of type Project
, then i
must be related via the property number
to an integer between 0
and 5000
(inclusive)—that is, projects have project numbers in a certain range. In A,
the individual MyProject
is not known to be an instance of Project
so the
constraint does not apply at all and A is valid. In B, MyProject
is an
instance of Project
but doesn’t have any data assertions via number
so A is
invalid. In C, MyProject
does have a data property assertion via number
but
the literal "23"
is untyped—that is, it’s not an integer—therefore, C is
invalid. In D, MyProject
is related to an integer via number
but it is out
of the range: D is invalid. Finally, in E, MyProject
is related to the integer
23
which is in the range of [0,5000]
so E is valid.
Cardinality Constraints
These constraints control the number of various relationships or property values.
Employees must not work on more than 3 projects.
Constraint
:Employee rdfs:subClassOf
[ a owl:Restriction ;
owl:onProperty :works_on;
owl:maxQualifiedCardinality "3"^^xsd:nonNegativeInteger ;
owl:onClass :Project
] .
Database A (valid)
:Bob a owl:Thing.
Database B (valid)
:Bob a :Employee ;
:works_on :MyProject .
:MyProject a :Project .
Database C (invalid)
:Bob a :Employee ;
:works_on :MyProject , :MyProjectFoo , :MyProjectBar , :MyProjectBaz .
:MyProject a :Project .
:MyProjectFoo a :Project .
:MyProjectBar a :Project .
:MyProjectBaz a :Project .
If an RDF instance i
is an Employee
, then i
must not be related via the
property works_on
to more than 3 instances of Project
. In A, Bob
is not
known to be an instance of Employee
so the constraint does not apply and the A
is valid. In B, Bob
is an instance of Employee
but is known to work on only
a single project, namely MyProject
, so B is valid. In C, Bob
is related to 4
instances of Project
via works_on
.
Note
|
Stardog ICV implements a weak form of the unique name assumption, that is, it assumes that things which have different names are, in fact, different things.[26] |
Since Stardog ICV uses closed world (instead of open world) semantics,[27] it assumes that the different projects with different names are, in fact, separate projects, which (in this case) violates the constraint and makes C invalid.
Departments must have at least 2 employees.
Constraint
:Department rdfs:subClassOf
[ a owl:Restriction ;
owl:onProperty [owl:inverseOf :works_in] ;
owl:minQualifiedCardinality "2"^^xsd:nonNegativeInteger ;
owl:onClass :Employee
] .
Database A (valid)
:MyDepartment a owl:NamedIndividual .
Database B (invalid)
:MyDepartment a :Department .
:Bob a :Employee ;
:works_in :MyDepartment .
Database C (valid)
:MyDepartment a :Department .
:Alice a :Employee ;
:works_in :MyDepartment .
:Bob a :Employee ;
:works_in :MyDepartment .
This constraint says that if an RDF instance i
is a Department
, then there
should exist at least 2 instances j
and k
of class Employee
which are
related to i
via the property works_in
(or, equivalently, i
should be
related to them via the inverse of works_in
). In A, MyDepartment
is not
known to be an instance of Department
so the constraint does not apply. In B,
MyDepartment
is an instance of Department
but only one instance of
Employee
, namely Bob
, is known to work in it, so B is invalid. In C,
MyDepartment
is related to the individuals Bob
and Alice
, which are both
instances of Employee
and (again, due to weak Unique Name Assumption in
Stardog ICV), are assumed to be distinct, so C is valid.
Managers must manage exactly 1 department.
Constraint
:Manager rdfs:subClassOf
[ a owl:Restriction ;
owl:onProperty :manages ;
owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger ;
owl:onClass :Department
] .
Database A (valid)
:Isabella a owl:NamedIndividual .
Database B (invalid)
:Isabella a :Manager .
Database C (invalid)
:Isabella a :Manager ;
:manages :MyDepartment .
Database D (valid)
:Isabella a :Manager ;
:manages :MyDepartment .
:MyDepartment a :Department .
Database E (invalid)
:Isabella a :Manager ;
:manages :MyDepartment , :MyDepartment1 .
:MyDepartment a :Department .
:MyDepartment1 a :Department .
This constraint says that if an RDF instance i
is a Manager
, then it must be
related to exactly 1 instance of Department
via the property manages
. In A,
the individual Isabella
is not known to be an instance of Manager
so the
constraint does not apply and A is valid. In B, Isabella
is an instance of
Manager
but is not related to any instances of Department
, so B is invalid.
In C, Isabella
is related to the individual MyDepartment
via the property
manages
but MyDepartment
is not known to be an instance of Department
, so
C is invalid. In D, Isabella
is related to exactly one instance of
Department
, namely MyDepartment
, so D is valid. Finally, in E, Isabella
is
related to two (assumed to be) distinct (again, because of weak UNA) instances
of Department
, namely MyDepartment
and MyDepartment1
, so E is invalid.
Entities may have no more than one name.
Constraint
:name a owl:FunctionalProperty .
Database A (valid)
:MyDepartment a owl:Thing .
Database B (valid)
:MyDepartment :name "Human Resources" .
Database C (invalid)
:MyDepartment :name "Human Resources" , "Legal" .
This constraint says that no RDF instance i
can have more than one assertion via
the data property name
. In A, MyDepartment
does not have any data property
assertions so A is valid. In B, MyDepartment
has a single assertion via
name
, so the ontology is also valid. In C, MyDepartment
is related to 2
literals, namely "Human Resources"
and "Legal"
, via name
, so C is invalid.
Property Constraints
These constraints control how instances are related to one another via properties.
The manager of a department must work in that department.
Constraint
:manages rdfs:subPropertyOf :works_in .
Database A (invalid)
:Bob :manages :MyDepartment
Database B (valid)
:Bob :works_in :MyDepartment ;
:manages :MyDepartment .
This constraint says that if an RDF instance i
is related to j
via the
property manages
, then i
must also be related to j
via the property
works_in
. In A, Bob
is related to MyDepartment
via manages
, but not via
works_in
, so A is invalid. In B, Bob
is related to MyDepartment
via both
manages
and works_in
, so B is valid.
Department managers must supervise all the department’s employees.
Constraint
:is_supervisor_of owl:propertyChainAxiom (:manages [owl:inverseOf :works_in]) .
Database A (invalid)
:Jose :manages :MyDepartment ;
:is_supervisor_of :Maria .
:Maria :works_in :MyDepartment .
:Diego :works_in :MyDepartment .
Database B (valid)
:Jose :manages :MyDepartment ;
:is_supervisor_of :Maria , :Diego .
:Maria :works_in :MyDepartment .
:Diego :works_in :MyDepartment .
This constraint says that if an RDF instance i
is related to j
via the
property manages
and k
is related to j
via the property works_in
, then
i
must be related to k
via the property is_supervisor_of
. In A, Jose
is
related to MyDepartment
via manages
, Diego
is related to MyDepartment
via works_in
, but Jose
is not related to Diego
via any property, so A is
invalid. In B, Jose
is related to Maria
and Diego
--who are both related to
MyDepartment
by way of works_in
--via the property is_supervisor_of
, so B
is valid.
Complex Constraints
Constraints may be arbitrarily complex and include many conditions.
Employee Constraints
Each employee works on at least one project, or supervises at least one employee that works on at least one project, or manages at least one department.
Constraint
:Employee rdfs:subClassOf
[ a owl:Restriction ;
owl:onProperty :works_on ;
owl:someValuesFrom
[ owl:unionOf (:Project
[ a owl:Restriction ;
owl:onProperty :supervises ;
owl:someValuesFrom
[ owl:intersectionOf (:Employee
[ a owl:Restriction ;
owl:onProperty :works_on ;
owl:someValuesFrom :Project
])
]
]
[ a owl:Restriction ;
owl:onProperty :manages ;
owl:someValuesFrom :Department
])
]
] .
Database A (invalid)
:Esteban a :Employee .
Database B (invalid)
:Esteban a :Employee ;
:supervises :Lucinda .
:Lucinda a :Employee .
Database C (valid)
:Esteban a :Employee ;
:supervises :Lucinda .
:Lucinda a :Employee ;
:works_on :MyProject .
:MyProject a :Project .
Database D (valid)
:Esteban a :Employee ;
:manages :MyDepartment .
:MyDepartment a :Department .
Database E (valid)
:Esteban a :Employee ;
:manages :MyDepartment ;
:works_on :MyProject .
:MyDepartment a :Department .
:MyProject a :Project .
This constraint says that if an individual i
is an instance of
Employee
, then at least one of three conditions must be met:
-
it is related to an instance of
Project
via the propertyworks_on
-
it is related to an instance
j
via the propertysupervises
andj
is an instance ofEmployee
and is also related to some instance ofProject
via the propertyworks_on
-
it is related to an instance of
Department
via the propertymanages
.
A and B are invalid because none of the conditions are met. C
meets the second condition: Esteban
(who is an Employee
) is related
to Lucinda
via the property supervises
whereas Lucinda
is both an
Employee
and related to MyProject
, which is a Project
, via the
property works_on
. D meets the third condition: Esteban
is related
to an instance of Department
, namely MyDepartment
, via the property
manages
. Finally, E meets the first and the third conditions because
in addition to managing a department Esteban
is also related to an
instance of Project
, namely MyProject
, via the property works_on
.
Employees and US government funding
Only employees who are American citizens can work on a project that receives funds from a US government agency.
Constraint
[ owl:intersectionOf (:Project
[ a owl:Restriction ;
owl:onProperty :receives_funds_from ;
owl:someValuesFrom :US_Government_Agency
])
] rdfs:subClassOf
[ a owl:Restriction ;
owl:onProperty [owl:inverseOf :works_on] ;
owl:allValuesFrom [ owl:intersectionOf (:Employee
[ a owl:Restriction ;
owl:hasValue "US" ;
owl:onProperty :nationality
])
]
] .
Database A (valid)
:MyProject a :Project ;
:receives_funds_from :NASA .
:NASA a :US_Government_Agency
Database B (invalid)
:MyProject a :Project ;
:receives_funds_from :NASA .
:NASA a :US_Government_Agency .
:Andy a :Employee ;
:works_on :MyProject .
Database C (valid)
:MyProject a :Project ;
:receives_funds_from :NASA .
:NASA a :US_Government_Agency .
:Andy a :Employee ;
:works_on :MyProject ;
:nationality "US" .
Database D (invalid)
:MyProject a :Project ;
:receives_funds_from :NASA .
:NASA a :US_Government_Agency .
:Andy a :Employee ;
:works_on :MyProject ;
:nationality "US" .
:Heidi a :Supervisor ;
:works_on :MyProject ;
:nationality "US" .
Database E (valid)
:MyProject a :Project ;
:receives_funds_from :NASA .
:NASA a :US_Government_Agency .
:Andy a :Employee ;
:works_on :MyProject ;
:nationality "US" .
:Heidi a :Supervisor ;
:works_on :MyProject ;
:nationality "US" .
:Supervisor rdfs:subClassOf :Employee .
This constraint says that if an individual i
is an instance of Project
and
is related to an instance of US_Government_Agency
via the property
receives_funds_from
, then any individual j
which is related to i
via the
property works_on
must satisfy two conditions:
-
it must be an instance of
Employee
-
it must not be related to any literal other than
"US"
via the data propertynationality
.
A is valid because there is no individual related to MyProject via works_on, so the constraint is trivially satisfied. B is invalid since Andy is related to MyProject via works_on, and MyProject is an instance of Project related to an instance of US_Government_Agency, namely NASA, via receives_funds_from, but Andy does not have any data property assertions. C is valid because both conditions are met. D is not valid because Heidi violates the first condition: she is related to MyProject via works_on but is not known to be an instance of Employee. Finally, this is fixed in E by way of a handy OWL axiom which states that every instance of Supervisor is an instance of Employee, so Heidi is inferred to be an instance of Employee and, consequently, E is valid.[28]
If you made it this far, you deserve a drink!
Constraint Formats
In addition to OWL, ICV constraints can be expressed in SPARQL and Stardog Rules. In both cases, the constraints define queries and rules to find violations. These constraints can be added individually, or defined together in a file as shown below:
@prefix rule: <tag:stardog:api:rule:> .
@prefix icv: <tag:stardog:api:icv:> .
# Rule Constraint
[] a rule:SPARQLRule ;
rule:content """
prefix : <http://example.org/>
IF {
?x a :Employee .
}
THEN {
?x :employeeNum ?number .
}
""" .
# SPARQL Constraint
[] a icv:Constraint ;
icv:query """
prefix : <http://example.org/>
select * {
?x a :Employee .
FILTER NOT EXISTS {
?x :employeeNum ?number .
}
}
""" .
Using ICV Programmatically
Here we describe how to use Stardog ICV via the SNARL APIs. For more information on using SNARL in general, please refer to the chapter on Java Programming.
There is command-line interface support for many of the operations necessary to using ICV with a Stardog database; please see Administering Stardog for details.
To use ICV in Stardog, one must:
-
create some constraints
-
associate those constraints with a Stardog database
Creating Constraints
Constraints can be created using the ConstraintFactory, which provides methods for creating integrity constraints. ConstraintFactory expects your constraints, if they are defined as OWL axioms, as RDF triples (or graph). To aid in authoring constraints in OWL, ExpressionFactory is provided for building the RDF equivalent of the OWL axioms of your constraint.
You can also write your constraints in OWL in your favorite editor and load them into the database from your OWL file.
We recommend defining your constraints as OWL axioms, but you are free to define them using SPARQL SELECT queries. If you choose to define a constraint using a SPARQL SELECT query, please keep in mind that if your query returns results, those are interpreted as the violations of the integrity constraint.
An example of creating a simple constraint using ExpressionFactory:
IRI Product = Values.iri("urn:Product");
IRI Manufacturer = Values.iri("urn:Manufacturer");
IRI manufacturedBy = Values.iri("urn:manufacturedBy");
// we want to say that a product should be manufactured by a Manufacturer:
Constraint aConstraint = ConstraintFactory.constraint(subClassOf(Product,
some(manufacturedBy, Manufacturer)));
Adding Constraints to Stardog
The ICVConnection interface provides programmatic access to the ICV support in Stardog. It provides support for adding, removing and clearing integrity constraints in your database as well as methods for checking whether or not the data is valid; and when it’s not, retrieving the list of violations.
This example shows how to add an integrity constraint to a Stardog database.
// We'll start out by creating a validator from our SNARL Connection
ICVConnection aValidator = aConn.as(ICVConnection.class);
// add a constraint, which must be done in a transaction.
aValidator.addConstraint(aConstraint);
Here we show how to add a set of constraints as defined in a local OWL ontology.
// We'll start out by creating a validator from our SNARL Connection
ICVConnection aValidator = aConn.as(ICVConnection.class);
// add constraints from a local OWL file
aValidator.addConstraints()
.format(RDFFormat.RDFXML)
.file(Paths.get("myConstraints.owl"));
IC Validation
Checking whether or not the contents of a database are valid is easy. Once you
have an
ICVConnection
you can simply call its
isValid()
method which will return whether or not the contents of the database are valid
with respect to the constraints associated with that database. Similarly, you
can provide some
constraints
to
the isValid()
method to see if the data in the database is invalid for those
specific constraints; which can be a subset of the constraints associated
with the database, or they can be new constraints you are working on.
If the data is invalid for some constraints—either the explicit
constraints in your database or a new set of constraints you have
authored—you can get some information about what the violation was from
the SNARL IC Connection.
ICVConnection.getViolationBindings()
will return the constraints which are violated, and for each constraint,
you can get the violations as the set of bindings that satisfied the
constraint query. You can turn the bindings into the individuals which
are in the violation using
ICV.asIndividuals()
.
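A minimal sketch of this flow, reusing the aValidator connection from the earlier examples and assuming the no-argument isValid() overload described above (the violation handling is only indicated in comments, since the shape of the bindings depends on your constraints):
// check the database against its associated constraints
boolean aIsValid = aValidator.isValid();
if (!aIsValid) {
    // inspect which constraints were violated and the offending bindings via
    // ICVConnection.getViolationBindings(), then map them to individuals with
    // ICV.asIndividuals() as described above
}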
ICV and Transactions
In addition to using the ICVConnection as a data oracle to tell whether or not your data is valid with respect to some constraints, you can also use Stardog’s ICV support to protect your database from invalid data by using ICV as a guard within transactions.
When guard mode for ICV is enabled in Stardog, each commit is inspected to ensure that the contents of the database are valid for the set of constraints that have been associated with it. Should someone attempt to commit data which violates one or more of the constraints defined for the database, the commit will fail and the data will not be added to or removed from your database.
By default, reasoning is not used when you enable guard mode; however, you are free to specify any of the reasoning types supported by Stardog when enabling guard mode. If you have provided a specific reasoning type for guard mode, it will be used during validation of the integrity constraints. This means you can author your constraints with the expectation that inference results will satisfy them.
try (AdminConnection dbms = AdminConnectionConfiguration.toEmbeddedServer().credentials("admin", "admin").connect()) {
dbms.disk("icvWithGuard") // disk db named 'icvWithGuard'
.set(ICVOptions.ICV_ENABLED, true) // enable icv guard mode
.set(ICVOptions.ICV_REASONING_ENABLED, true) // specify that guard mode should use reasoning
.create(new File("data/sp2b_10k.n3")); // create the db, bulk loading the file(s) to start
}
This illustrates how to create a persistent disk database with ICV guard mode and reasoning enabled. Guard mode can also be enabled when the database is created on the CLI.
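As a sketch, the CLI equivalent might look like the following, assuming the option keys icv.enabled and icv.reasoning.enabled correspond to the Java options used above (check the db create documentation for the exact names):
$ stardog-admin db create -o icv.enabled=true icv.reasoning.enabled=true -n icvWithGuard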
Terminology
This chapter may make more sense if you read this section on terminology a few times.
ICV, Integrity Constraint Validation
The process of checking whether some Stardog database is valid with respect to some integrity constraints. The result of ICV is a boolean value (true if valid, false if invalid) and, optionally, an explanation of constraint violations.
Schema, TBox
A schema (or "terminology box" a.k.a., TBox) is a set of statements that define the relationships between data elements, including property and class names, their relationships, etc. In practical terms, schema statements for a Stardog database are RDF Schema and OWL 2 terms, axioms, and definitions.
Data, ABox
All of the triples in a Stardog database that aren’t part of the schema are part of the data (or "assertional box" a.k.a. ABox).
Integrity Constraint
A declarative expression of some rule or constraint which data must conform to in order to be valid. Integrity Constraints are typically domain and application specific. They can be expressed in OWL 2 (any legal syntax), SWRL rules, or (a restricted form of) SPARQL queries.
Constraints
Constraints that have been associated with a Stardog database and which are used to validate the data it contains. Each Stardog database may optionally have one and only one set of constraints associated with it.
Closed World Assumption, Closed World Reasoning
Stardog ICV assumes a closed world with respect to data and constraints: that is, it assumes that all relevant data is known to it and included in a database to be validated. It interprets the meaning of Integrity Constraints in light of this assumption; if a constraint says a value must be present, the absence of that value is interpreted as a constraint violation and, hence, as invalid data.
Open World Assumption, Open World Reasoning
A legal OWL 2 inference may violate or satisfy an Integrity Constraint in Stardog. In other words, you get to have your cake (OWL as a constraint language) and eat it, too (OWL as modeling or inference language). This means that constraints are applied to a Stardog database with respect to an OWL 2 profile.
Monotonicity
OWL is a monotonic language: that means you can never add anything to a Stardog database that causes there to be fewer legal inferences. Or, put another way, the only way to decrease the number of legal inferences is to delete something.
Monotonicity interacts with ICV in the following ways:
-
Adding data to or removing it from a Stardog database may make it invalid.
-
Adding schema statements to or removing them from a Stardog database may make it invalid.
-
Adding new constraints to a Stardog database may make it invalid.
-
Deleting constraints from a Stardog database cannot make it invalid.
GraphQL Queries
Introduction
Stardog supports querying data stored (or mapped) in a Stardog database using GraphQL queries. You can load data into Stardog as usual and execute GraphQL queries without creating a GraphQL schema. You can also associate one or more GraphQL schemas with a database and execute GraphQL queries against one of those schemas.
The following table shows the correspondence between RDF concepts and GraphQL:
RDF |
GraphQL |
Node |
Object |
Class |
Type |
Property |
Field |
Literal |
Scalar |
Execution of GraphQL queries in Stardog does not follow the procedural rules defined in the GraphQL spec. Instead Stardog translates GraphQL queries to SPARQL and then SPARQL results to GraphQL results based on the correspondences shown in the preceding table. Each RDF node represents a GraphQL object. Properties of the node are the fields of the object, with the exception of the rdf:type property, which represents the type of the object. Literals in RDF are mapped to GraphQL scalars.
RDF |
GraphQL |
|
|
|
|
|
|
|
|
|
|
IRI |
|
In the following sections we will use a slightly modified version of the canonical GraphQL Star Wars example to explain how GraphQL queries work in Stardog. The following graph shows the core elements of the dataset and links between those nodes:

The full dataset in Turtle format is available in the examples repo.
Executing GraphQL Queries
The GraphQL command can be executed by providing a query string:
$ stardog graphql starwars "{ Human { name }}"
or a file containing the query:
$ stardog graphql starwars query.file
The --reasoning flag can be used with the CLI command to enable reasoning.
The HTTP API can be used to execute GraphQL queries. The endpoint for GraphQL queries is http://HOST:PORT/{db}/graphql.
The following command uses curl to execute a GraphQL query:
$ curl -G -vsku admin:admin --data-urlencode query="{ Human { name }}" localhost:5820/starwars/graphql
Reasoning can be enabled by setting a special variable @reasoning
in the GraphQL query variables.
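For example, when sending the query over HTTP as a standard GraphQL JSON request body (a sketch; the exact payload shape depends on your client), the flag sits in the variables object:
{
  "query": "{ Character { name } }",
  "variables": { "@reasoning": true }
}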
Any standard GraphQL client, like GraphiQL, can be used with Stardog:

Note
|
Stardog by default uses HTTP basic access authentication. In order to use GraphiQL with Stardog, you either need to start the Stardog server with the --disable-security option so it won’t require credentials, or set the HTTP header Authorization in the request. If the default credentials admin/admin are being used in non-production settings, the HTTP header Authorization may be set to the value Basic YWRtaW46YWRtaW4= in the GraphiQL UI. The curl example above can be used to see the correct value of the header for your credentials.
|
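For example, the earlier curl query can be sent with the header instead of the -u credentials (again, only with the default non-production credentials):
$ curl -G -H "Authorization: Basic YWRtaW46YWRtaW4=" --data-urlencode query="{ Human { name }}" localhost:5820/starwars/graphql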
Fields and Selection Sets
A top-level element in GraphQL by default represents a type and will return all the nodes with that type. The fields in the query will return matching properties:
Query |
Result |
|
|
Each field in the query is treated as a required property of the node (unless an @optional directive is used), so any node without corresponding properties will not appear in the results:
Query |
Result |
|
|
If a node in the graph has multiple properties, then in the query results those results will be returned as an array:
Query |
Result |
|
|
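A query along these lines, selecting a droid’s name and its friends without any sub-fields, would look like this (field names taken from the Star Wars data above):
{
  Droid {
    name
    friends
  }
}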
Also note that Stardog does not enforce the GraphQL requirement that leaf fields must be scalars. In the previous example friends of a droid are objects but the query does not provide any fields. In those cases, the identifier of the node will be returned as a string.
Arguments
In GraphQL fields are, conceptually, functions which return values and may accept arguments that alter their behavior. Arguments have no predefined semantics but the typical usage is for defining lookup values for fields. Stardog adopts this usage and treats arguments as filters for the query. The following query returns only the node whose id field is 1000:
Query |
Result |
|
|
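A query of this shape, using the id argument as a lookup filter, looks like the following (the same query appears again in the Implementation section below):
{
  Human(id: 1000) {
    name
  }
}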
Arrays can be used to specify multiple values for a field, in which case nodes matching any of the values will be returned:
Query |
Result |
|
|
Reasoning
GraphQL queries by default only return results based on explicit nodes and edges in the graph. Reasoning may be enabled in the usual ways to run queries with inferred nodes and edges, e.g. to perform type inheritance. In the example graph, Human and Droid are defined as subclasses of the Character class. The following query will return no results without reasoning, but when reasoning is enabled Character will act like a GraphQL interface and the query will return both humans and droids:
Query |
Result |
|
|
Query Variables |
|
|
Fragments
Stardog supports GraphQL fragments (both inline and via fragment definitions). This query shows how fragments can be combined with reasoning to select different fields for subtypes:
Query |
Result |
|
|
Aliases
By default, the key in the response object will use the field name queried. However, you can define a different name by specifying an alias. The following query renames both of the fields in the query:
Query |
Result |
|
|
Variables
A GraphQL query can be parameterized with variables, which must be defined at the top of an operation. Variables are in scope throughout the execution of that operation. A value should be provided for GraphQL variables before execution or an error will occur. The following query will return a single result when executed with the input {"id": 1000}:
Query |
Result |
|
|
Query Variables |
|
|
Ordering Results
The results of GraphQL queries may be returned in an arbitrary order. A special argument orderBy can be used at the top level to specify which field to use for ordering the results. The following query uses the values of the name field for ordering the results:
Query |
Result |
|
|
The results are ordered in ascending order by default. We can sort results in descending order as follows:
Query |
Result |
|
|
Multiple ordering criteria can be used:
Query |
Result |
|
|
We first use the homePlanet field for ordering, and the results with no home planet come up first. If two results have the same value for the first ordering criterion, e.g. Luke Skywalker and Darth Vader, then the second criterion is used for ordering.
Paging Results
Paging through the GraphQL results is accomplished with the first and skip arguments used at the top level. The following query returns the first three results:
Query |
Result |
|
|
The following query skips the first result and returns the next two results:
Query |
Result |
|
|
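For reference, the paging arguments described above are used like this (a sketch; results omitted):
{
  Human(first: 3) {
    name
  }
}

{
  Human(first: 2, skip: 1) {
    name
  }
}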
Directives
Directives provide a way to describe alternate runtime execution and type validation behavior in GraphQL.
The spec defines two built-in directives: @skip and @include. Stardog supports both directives and introduces several others.
@skip(if: EXPR)
The @skip directive includes a field value in the result conditionally. If the provided expression evaluates to true, the field will not be included. Stardog allows arbitrary SPARQL expressions to be used as the conditions. Any of the supported SPARQL Query Functions can be used in these expressions. The expression can refer to any field in the same selection set and is not limited to the field the directive is attached to. The following query returns the name field only if the name does not start with the letter L:
Query |
Result |
|
|
@include(if: EXPR)
The @include directive works as the negation of the @skip directive; that is, the field is included only if the expression evaluates to true. We can use variables inside the conditions, too. The following example executed with the input {"withFriends": false} will not include friends in the results:
Query |
Result |
|
|
Query Variables |
|
|
@filter(if: EXPR)
The @filter directive looks similar to @skip and @include but filters the whole object instead of just a single field. In that regard it works more like arguments, but arbitrary expressions can be used to select specific nodes. The next query returns all humans whose id is less than 1003:
Query |
Result |
|
|
Unlike the previous two directives, it doesn’t matter which field the @filter directive is syntactically adjacent to, since it applies to the whole selection set.
@optional
Stardog treats every field as required by default and will not return any nodes if they don’t have a matching value for the fields in the selection set. The @optional directive can be used to mark a field as optional. The following query returns the home planet for each human if it exists but skips that field if it doesn’t:
Query |
Result |
|
|
@type
By default every field in the GraphQL query other than the topmost field represents a property in the graph. Sometimes we might want to filter some nodes based on their types; that is, based on the values of the special built-in property rdf:type. Stardog provides a directive as a shortcut for this purpose. The following query returns only the droid friends of humans because the Droid field is marked with the @type directive:
Query |
Result |
|
|
@bind(to: EXPR)
Fields bind to properties in the graph, but it is also possible to have fields with computed values. When the @bind directive is used for a field, the value of that field will be computed by evaluating the given SPARQL expression. The following example splits the name field on a space to compute firstName and lastName fields:
Query |
Result |
|
|
@hide
Query results can be flattened using the @hide directive. For example, in our data characters are linked to episode instances that have an index property. The following query retrieves the episode indexes but, by hiding the intermediate episode instances, humans are directly linked to the episode index:
Query |
Result |
|
|
Namespaces
RDF uses IRIs as identifiers whereas in GraphQL we have simple names as identifiers. The examples so far use a single default namespace where names in GraphQL are treated as local names in that namespace. If a Stardog graph uses multiple namespaces, then it is possible to use them in GraphQL queries in several different ways.
If there are stored namespaces in the database, then the associated prefixes can be used in the queries. For example, suppose we have the prefix foaf associated with the namespace http://xmlns.com/foaf/0.1/ in the database. In SPARQL the prefixed name foaf:Person would be used for the IRI http://xmlns.com/foaf/0.1/Person. In GraphQL, the : character cannot be used in field names, so Stardog uses the _ character instead: the prefixed name here would be foaf_Person.
The query using the FOAF namespace would look like this:
{
foaf_Person {
foaf_name
foaf_mbox
}
}
If the namespace is not stored in the database an inline prefix definition can be provided with the @prefix
directive:
query withPrefixes @prefix(foaf: "http://xmlns.com/foaf/0.1/") {
foaf_Person {
foaf_name
foaf_mbox
}
}
Note
|
Sometimes field names might use the underscore character where it does not indicate a prefix. To differentiate the two cases, Stardog looks at the prefix before the underscore and checks if it is defined in the query or stored in the database.
In some cases the IRI local name might use characters, like -, that are not allowed in GraphQL names. In those cases an alias can be defined to map a field name to an IRI. These aliases are defined in a @config directive at the query level as follows:
|
query withAliases @config(alias: {myType: "http://example.com/my-type",
myProp: "http://example.com/my-prop"})
{
myType {
myProp
}
}
Named Graphs
GraphQL queries by default are evaluated over the union of all graphs stored in the Stardog database. It is possible to limit the scope of the query to one or more specific named graphs. Suppose we partition the Star Wars dataset by moving instances of each type to a different named graph using the following SPARQL update query:
DELETE { ?s ?p ?o }
INSERT { GRAPH ?type { ?s ?p ?o } }
WHERE { ?s a ?type ; ?p ?o }
The following queries (with reasoning) will return 5 humans, 2 droids and all 7 characters respectively:
Query |
Result |
|
|
Query |
Result |
|
|
Query |
Result |
|
|
GraphQL Schemas
GraphQL is a strongly-typed language where the fields used in a query should conform to the type definitions in a GraphQL schema. By default, Stardog relaxes this restriction and allows queries to be executed without an explicit schema. However, if desired, one or more GraphQL schemas can be added to the database and used during query execution. The benefits of using an explicit schema are as follows:
-
Queries will be validated with strict typing
-
Default translation of RDF values to GraphQL values can be overridden
-
Only the parts of the graph defined in the schema will be exposed to the user
A GraphQL schema can be created and registered manually or a default GraphQL schema can be generated automatically from the RDFS/OWL schema or SHACL constraints defined in the database as explained in the next sections.
Manually registered GraphQL schemas
Here is an example schema that can be used with the Star Wars dataset:
schema {
query: QueryType
}
type QueryType {
Character: Character
Human(id: Int, first: Int, skip: Int, orderBy: ID): Human
Droid(id: Int): Droid
}
interface Character {
id: Int!
name: String!
friends(id: Int): [Character]
appearsIn: [Episode]
}
type Human implements Character {
iri: ID!
id: Int!
name: String!
friends(id: Int): [Character]
appearsIn: [Episode]
}
type Droid implements Character {
id: Int!
name: String!
friends(id: Int): [Character]
appearsIn: [Episode]
primaryFunction: String
}
type Episode {
index: Int!
name: String!
}
Each GraphQL schema defines a query type which specifies the top-level fields that can be used in a query. In Stardog the query type is simply an enumeration of the classes in the database that we want to expose in queries. For example, the schema defines the Episode type but does not list it under QueryType, which means you cannot query for episodes directly.
Note that, without a schema, each top-level type can have various built-in arguments like first or skip. In this schema we chose to define them for the Human type but not for others. This means a query like { Droid(first: 1) { name } } will be invalid with respect to this schema and rejected, even though it is valid if executed without a schema.
This schema can be added to the database by giving it a name:
$ stardog graphql schema --add characters starwars characters.graphqls
Added schema characters
We can then execute the query by specifying the schema name along with the query:
$ stardog graphql --schema characters starwars "{ Human { name friends { name } } }"
When a schema is specified for a query it gets added to the query parameters using a special variable named @schema. When using the HTTP API directly, this variable can be set to choose the schema for a query by sending the query variable {"schema": "characters"}.
Query |
Result |
|
|
Query Variables |
|
|
Note that the friends field in the result is an array value due to the corresponding definition in the schema. This query executed without a schema would return the single object value for the field.
An important point about schemas is that the types defined in the schema do not filter the query results. For example, we can define a much simpler humans schema against the Star Wars dataset:
schema {
query: QueryType
}
type QueryType {
Human(id: [Int]): Human
}
type Human {
id: Int!
name: String!
friends: [Human]
}
This schema allows only Human instances to be queried at the top level and declares that the friend of each Human is also a Human. This schema definition is incompatible with the data since humans have droid friends. Stardog does not check if the schema is correct with respect to the data and will not enforce type restrictions in the results. So if we ask for the friends of a human, the droids will also be returned in the results:
Query |
Result |
|
|
Query Variables |
|
|
Auto-generated GraphQL schemas
When the option graphql.auto.schema is set to true in the database configuration, a default GraphQL schema will be generated automatically the first time a query is executed or schemas are listed. By default, the GraphQL schema will be generated from the RDFS/OWL schema in the database, but the graphql.auto.schema.source option can be changed to use SHACL shapes definitions for GraphQL schema generation.
Each RDFS/OWL class will be mapped to a GraphQL type with one built-in field iri. Any property whose rdfs:domain is set to that class will be added as an additional field to the type. If the property is defined to be an owl:FunctionalProperty, then it will be a single-valued field; otherwise it will be defined as a multi-valued field. All the classes in the database will be exposed under the top-level QueryType.
As an example, you can load the Star Wars schema into the database and set the configuration option graphql.auto.schema to true (the database needs to be offline for this option to be set). After that point, all queries will be validated and answered based on this schema if no schema is specified in the query request.
The auto-generated GraphQL schema is not updated automatically when the source schema is updated. The default schema can be manually removed with the following command so that it will be auto-generated next time it is needed:
$ stardog graphql schema --remove default starwars
The auto-generated schema sometimes might require manual tweaks. If that is the case the auto-generated schema can be exported to a local file and manually updated:
$ stardog graphql schema --show default starwars > starwars.graphqls
After the local schema file has been updated it can be registered as usual:
$ stardog graphql schema --add default starwars starwars.graphqls
Introspection
Stardog supports GraphQL introspection, which means GraphQL tooling works out of the box with Stardog. Introspection exposes special schema queries that can be executed to retrieve information about the types and fields defined in a schema. This feature is used in GraphQL tools to support features like autocompletion, query validation, etc.
Stardog supports introspection queries for the GraphQL schemas registered in the system. There is a separate dedicated endpoint for each registered schema, in the form http://HOST:PORT/{db}/graphql/{schema}. The introspection queries executed against this endpoint will be answered using the corresponding schema.

Introspection queries are supported by the default GraphQL endpoint only when auto-generated schemas are enabled by setting the graphql.auto.schema database option to true. Otherwise, there is no dedicated schema associated with the default endpoint and introspection will be disabled.
Implementation
Stardog translates GraphQL queries to SPARQL and SPARQL results to JSON. The CLI command graphql explain can be used to see the generated SPARQL query and the low-level query plan created for the SPARQL query, which is useful for debugging correctness and performance issues:
$ stardog graphql explain starwars "{
Human(id: 1000) {
name
knows: friends {
name
}
}
}"
SPARQL:
SELECT *
FROM <tag:stardog:api:context:all>
{
?0 rdf:type :Human .
?0 :id "1000"^^xsd:integer .
?0 :name ?1 .
?0 :friends ?2 .
?2 :name ?3 .
}
FIELDS:
0 -> {1=name, 2=knows}
2 -> {3=name}
PLAN:
prefix : <http://api.stardog.com/>
From all
Projection(?0, ?1, ?2, ?3) [#3]
`─ MergeJoin(?2) [#3]
+─ Sort(?2) [#3]
│ `─ NaryJoin(?0) [#3]
│ +─ Scan[POSC](?0, :id, "1000"^^<http://www.w3.org/2001/XMLSchema#integer>) [#1]
│ +─ Scan[POSC](?0, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, :Human) [#5]
│ +─ Scan[PSOC](?0, :name, ?1) [#10]
│ `─ Scan[PSOC](?0, :friends, ?2) [#20]
`─ Scan[PSOC](?2, :name, ?3) [#10]
The variables in the SPARQL query will be mapped to objects and field values in the JSON results. The binding for variable 0 will be the root object. The FIELDS output shows that 0 is linked to 1 via the name field and linked to 2 via the knows field (note that knows is an alias and in the actual query we have the pattern ?0 :friends ?2).
The GraphQL query plans can also be retrieved by setting the special query variable @explain to true when executing a query.
Path Queries
Stardog extends SPARQL to find paths between nodes in the RDF graph, which we call path queries. They are similar to SPARQL 1.1 property paths which traverse an RDF graph and find pairs of nodes connected via a complex path of edges. But SPARQL property paths only return the start and end nodes of a path and do not allow variables in property path expressions. Stardog path queries return all intermediate nodes on each path—that is, they return a path from start to end—and allow arbitrary SPARQL graph patterns to be used in the query.
Path Query Syntax
We add path queries as a new top-level query form, i.e. separate from SELECT, CONSTRUCT, or other query types. The syntax is as follows:
PATHS [SHORTEST|ALL] [CYCLIC] [<DATASET>]
START ?s [ = <IRI> | <GRAPH PATTERN> ] END ?e [ = <IRI> | <GRAPH PATTERN> ]
VIA <GRAPH PATTERN> | <VAR> | <PATH>
[MAX LENGTH <int>]
[OFFSET <int>]
[LIMIT <int>]
The graph pattern in the VIA clause must bind both the ?s and ?e variables.
Next we informally present examples of common path queries and finally the formal Path Query Evaluation Semantics.
Shortest Paths
Suppose we have a simple social network where people are connected via different relationships:

If we want to find all the people Alice is connected to and how she is connected to them we can use the following path query:
PATHS START ?x = :Alice END ?y VIA ?p
We specify a start node for the path query but the end node is unrestricted, so all paths starting from Alice will be returned. Note that we use the shortcut VIA ?p instead of a graph pattern to match each edge in the path. This is syntactic sugar for VIA { ?s ?p ?e }. Similarly we could use a predicate, e.g. VIA :knows, or a property path expression, e.g. VIA :knows | :worksWith.
This query is effectively equivalent to the SPARQL property path :Alice :knows+ ?y, but the results will include the nodes in the path(s). The path query results are printed in a tabular format by default:
+----------+------------+----------+
| x | p | y |
+----------+------------+----------+
| :Alice | :knows | :Bob |
| | | |
| :Alice | :knows | :Bob |
| :Bob | :knows | :David |
| | | |
| :Alice | :knows | :Bob |
| :Bob | :worksWith | :Charlie |
| | | |
| :Alice | :knows | :Bob |
| :Bob | :worksWith | :Charlie |
| :Charlie | :parentOf | :Eve |
+----------+------------+----------+
Query returned 4 paths in 00:00:00.055
Each row of the result table shows one edge, and adjacent edges on a path are printed on subsequent rows of the table. Multiple paths in the results are separated by an empty row. We can change the output format to text, which serializes the results in a property-graph-like syntax:
$ stardog query -f text exampleDB "PATHS START ?x = :Alice END ?y VIA ?p"
(:Alice)-[p=:knows]->(:Bob)
(:Alice)-[p=:knows]->(:Bob)-[p=:knows]->(:David)
(:Alice)-[p=:knows]->(:Bob)-[p=:worksWith]->(:Charlie)
(:Alice)-[p=:knows]->(:Bob)-[p=:worksWith]->(:Charlie)-[p=:parentOf]->(:Eve)
Query returned 4 paths in 00:00:00.047
Execution happens by recursively evaluating the graph pattern in the query and replacing the start variable with the binding of the end variable in the previous execution. If the query specifies a start node, that value is used for the first evaluation of the graph pattern. If the query specifies an end node, which our example doesn’t, execution stops when we reach the end node. Only simple cycles, i.e. paths where the start and the end nodes coincide, are allowed in the results.
Note
|
The Stardog optimizer may choose to traverse paths backwards, i.e. from the end node to the start, for performance reasons but it does not affect the results. |
We can specify the end node in the query and restrict the edges in the paths to a specific property, as in the next example, which queries how Alice is connected to David via knows relationships:
PATHS START ?x = :Alice END ?y = :David VIA :knows
This query would return a single path with two edges:
+--------+--------+
| x | y |
+--------+--------+
| :Alice | :Bob |
| :Bob | :David |
+--------+--------+
Complex Paths
Graph patterns inside path queries can be arbitrarily complex. Suppose we want to find undirected paths between Alice and David in this graph. Then we can make the graph pattern match both outgoing and incoming edges:
$ stardog query exampleDB "PATHS START ?x = :Alice END ?y = :David VIA ^:knows | :knows"
+--------+--------+
| x | y |
+--------+--------+
| :Alice | :Bob |
| :Bob | :David |
+--------+--------+
Sometimes a relationship between two nodes might be implicit and there might not be an explicit link between those two nodes in the RDF graph. Consider the following set of triples that show some movies and actors who starred in those movies:
:Apollo_13 a :Film ; :starring :Kevin_Bacon , :Gary_Sinise .
:Spy_Game a :Film ; :starring :Brad_Pitt , :Robert_Redford .
:Sleepers a :Film ; :starring :Kevin_Bacon , :Brad_Pitt .
:A_Few_Good_Men a :Film ; :starring :Kevin_Bacon , :Tom_Cruise .
:Lions_for_Lambs a :Film ; :starring :Robert_Redford , :Tom_Cruise .
:Captain_America a :Film ; :starring :Gary_Sinise , :Robert_Redford .
There is an implicit relationship between actors based on the movies they appeared in together. We can use a basic graph pattern with multiple triple patterns in the path query to extract this information:
PATHS START ?x = :Kevin_Bacon END ?y = :Robert_Redford
VIA { ?movie a :Film ; :starring ?x , ?y }
This query executed against the above set of triples would return three paths:
+--------------+------------------+-----------------+
| x | movie | y |
+--------------+------------------+-----------------+
| :Kevin_Bacon | :Apollo_13 | :Gary_Sinise |
| :Gary_Sinise | :Captain_America | :Robert_Redford |
| | | |
| :Kevin_Bacon | :Sleepers | :Brad_Pitt |
| :Brad_Pitt | :Spy_Game | :Robert_Redford |
| | | |
| :Kevin_Bacon | :A_Few_Good_Men | :Tom_Cruise |
| :Tom_Cruise | :Lions_for_Lambs | :Robert_Redford |
+--------------+------------------+-----------------+
If the movie is irrelevant, then a more concise version can be used:
PATHS START ?x = :Kevin_Bacon END ?y = :Robert_Redford VIA ^:starring/:starring
All Paths
Path queries return only shortest paths by default. We can use the ALL keyword in the query to retrieve all paths between two nodes. For example, the query above returned only one path between Alice and David. We can get all paths as follows:
$ stardog query exampleDB "PATHS ALL START ?x = :Alice END ?y = :David
VIA { {?x ?p ?y} UNION {?y ?p ?x} }"
+----------+------------+----------+
| x | p | y |
+----------+------------+----------+
| :Alice | :knows | :Bob |
| :Bob | :knows | :David |
| | | |
| :Alice | :knows | :Bob |
| :Bob | :worksWith | :Charlie |
| :Charlie | :parentOf | :Eve |
| :Eve | :knows | :David |
+----------+------------+----------+
Caution
|
The ALL qualifier can dramatically increase the number of paths so use with caution.
|
Cyclic Paths
There’s a special keyword CYCLIC to specifically query for cyclic paths in the data. For example, there might be a dependsOn relationship in the database and we might want to query for cyclic dependencies:
PATHS CYCLIC START ?start END ?end VIA :dependsOn
Again, arbitrary cycles in the paths are not allowed to ensure a finite number of results.
Limiting Paths
In a highly connected graph the number of possible paths between two nodes can be impractically
high. There are two different ways we can limit the results of path queries. The first possibility
is to use the LIMIT
keyword just like in other query types. We can ask for at most 2 paths starting
from Alice
as follows:
PATHS START ?x = :Alice END ?y VIA ?p LIMIT 2
This query would return 2 results as expected:
+----------+------------+----------+
| x | p | y |
+----------+------------+----------+
| :Alice | :knows | :Bob |
| | | |
| :Alice | :knows | :Bob |
| :Bob | :knows | :David |
+----------+------------+----------+
Note that the path from Alice to Charlie is not included in this result even though it is not any longer than the path between Alice and David. This is because with LIMIT the query will stop producing results as soon as the maximum number of paths has been returned.
The other alternative for limiting the results is specifying the maximum length of paths that can be returned. The following query shows how to query for paths that are at most 2 edges long:
PATHS START ?x = :Alice END ?y VIA ?p MAX LENGTH 2
This time we will get 3 results:
+----------+------------+----------+
| x | p | y |
+----------+------------+----------+
| :Alice | :knows | :Bob |
| | | |
| :Alice | :knows | :Bob |
| :Bob | :knows | :David |
| | | |
| :Alice | :knows | :Bob |
| :Bob | :worksWith | :Charlie |
+----------+------------+----------+
It is also possible to use both the LIMIT and MAX LENGTH keywords in a single query.
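For example, both restrictions can be combined in one query:
PATHS START ?x = :Alice END ?y VIA ?p MAX LENGTH 2 LIMIT 2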
Path Queries With Start and End Patterns
In all examples presented so far the start and end variables were either free variables or bound to a single IRI. This is insufficient for navigating paths which must begin at multiple nodes satisfying certain conditions and terminate at nodes satisfying some other conditions. Assume the movie and actor data above is extended with information about the date of birth of each actor:
:Kevin_Bacon :birthdate "1958-07-08"^^xsd:date .
:Gary_Sinise :birthdate "1957-03-17"^^xsd:date .
:Brad_Pitt :birthdate "1963-12-18"^^xsd:date .
:Robert_Redford :birthdate "1936-08-18"^^xsd:date .
:Tom_Cruise :birthdate "1962-07-03"^^xsd:date .
Now, having only variables and constants as valid path start and end expressions would make it hard to write a query to find all connections between Kevin Bacon and actors over 80 years old. The following attempt, for example, won’t match any data:
PATHS START ?x = :Kevin_Bacon END ?y VIA {
?movie a :Film ; :starring ?x , ?y .
?y :birthdate ?date .
FILTER (year(now()) - year(?date) >= 80)
}
The problem is that the age filter is applied at each recursive step, i.e. the query is looking for paths where every intermediate actor is over 80, but none of those co-starred with Kevin Bacon (in our toy dataset). Instead we need a query which checks the condition only at candidate end nodes:
PATHS START ?x = :Kevin_Bacon
END ?y { ?y :birthdate ?date .
FILTER (year(now()) - year(?date) >= 80) }
VIA ^:starring/:starring
This query will return the expected results along with the date of birth for end nodes:
+------------------+---------------------+------------------------+
| x | y | date |
+------------------+---------------------+------------------------+
| test:Kevin_Bacon | test:Gary_Sinise | |
| test:Gary_Sinise | test:Robert_Redford | "1936-08-18"^^xsd:date |
| | | |
| test:Kevin_Bacon | test:Brad_Pitt | |
| test:Brad_Pitt | test:Robert_Redford | "1936-08-18"^^xsd:date |
| | | |
| test:Kevin_Bacon | test:Tom_Cruise | |
| test:Tom_Cruise | test:Robert_Redford | "1936-08-18"^^xsd:date |
+------------------+---------------------+------------------------+
Tip
|
The shortest path semantics applies to each pair of start and end nodes independently.
This means that for nodes We recommend using |
Path Queries With Reasoning
As other kinds of queries, path queries can be evaluated with reasoning. If reasoning is enabled, a path query will return paths in the inferred graph, i.e. each edge corresponds to a relationship between the nodes which is inferred from the data based on the schema.
Consider the following example:
:Arlington :partOf :DCArea .
:DCArea :locatedIn :EastCoast .
:EastCoast :partOf :US .
:US :locatedIn :NorthAmerica .
Adding the following rule (or an equivalent OWL sub-property chain axiom) infers :partOf edges based on compositions of :partOf and :locatedIn edges:
IF
{ ?x :partOf ?y . ?y :locatedIn ?z }
THEN
{ ?x :partOf ?z }
Now the following path query will find the inferred path from :Arlington to :NorthAmerica via :DCArea and :US:
PATHS START ?x = :Arlington END ?y = :NorthAmerica VIA {
?x :partOf ?y
}
Note
|
This feature should be used with care. There may be a lot more paths than one expects. Also keep in mind that some patterns are particularly expensive with reasoning, e.g. triple patterns with an unbound predicate variable or with a variable in the object position of rdf:type.
|
Path Query Evaluation Semantics
Given a pair of variable names s and e, a path is a sequence of SPARQL solutions S[1], …, S[n] s.t. S[i](s) = S[i-1](e) for i from 2 to n. We call the S[1](s) and S[n](e) values the start and end nodes of the path, resp. Each solution in the sequence is called an edge.
The evaluation semantics of path queries is based on the following recursive extension of SPARQL solution:
(1) Solution := { (V -> Value)* } // solution: mapping from variables to values (as in SPARQL)
(2) Value := RDF-Term // an RDF term is a value (as in SPARQL)
(3) Value := Solution // a solution is a value (extension)
(4) Value := [ Value* ] // an array of values is a value (extension)
Informally such extensions allow us to represent each path as a single solution where a distinguished variable (in the sequel called path variable) is mapped to an ordered array of solutions representing edges.
We first consider simple path queries for ALL paths with only variables after the START and END keywords, i.e. queries of the form PQ(s, e, p, P), where s and e are start and end variable names, p is a path variable name, and P is a SPARQL graph pattern. Given a dataset D with the active graph G, abbreviated as D(G), we define eval(PQ(s, e, P), D(G)) as the set of all (extended) solutions S such that:
(1) S(p) is a path Sp[1] ... Sp[n] w.r.t. s and e
(2) Sp[1] is in eval(P, D(G))
(3) Sp[i] is a solution to eval(sub(P, s, Sp[i-1](e)), D(G)) for i = 2 ... n
(4) S(s) = Sp[1](s)
(5) S(e) = Sp[n](e)
(6) All terms which s and e bind to in all Sp[i] are unique except that Sp[1](s) could be equal to Sp[n](e)
where sub(P, var, t) is the graph pattern obtained by substituting the variable var by the fixed RDF term t.
Informally, conditions (2) and (3) state that each edge in a path is obtained by evaluating the path pattern with the start variable substituted by the end variable value of the previous edge (to ensure connectedness). Conditions (4) and (5) bind the s and e variables in the top-level solution.
Next we define the semantics of path queries with start and end patterns:
eval(PQ(s, PS, e, PE, P), D(G)) = Join(PS, Join(PE, eval(PQ(s, e, P), D(G))))
where PS and PE are start and end graph patterns which must bind the s and e variables, respectively.
Here Join stands for the standard SPARQL join semantics, which does not require extensions since joins are performed on the variables s and e, which bind to RDF terms only rather than arrays or solutions (conditions (4) and (5) above ensure that).
Finally we note that path queries with start or end constants are a special case of path queries with the corresponding singleton VALUES patterns, e.g.
PATHS START ?s = :Alice END ?e = :Dave VIA :knows
is a syntactic sugar for
PATHS START ?s { VALUES ?s { :Alice } } END ?e { VALUES ?e { :Dave } } VIA :knows
The keywords SHORTEST (the default) and CYCLIC are self-explanatory and place further restrictions on each S(p): the sequence should be the shortest among all results or represent a simple cycle, respectively. The solution modifiers LIMIT and OFFSET have the exact same semantics as in SPARQL 1.1.
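As an informal illustration, consider the earlier query PATHS START ?x = :Alice END ?y = :David VIA :knows, which returns a single two-edge path from Alice to David. Using a made-up name path for the distinguished path variable, that result corresponds to an extended solution of roughly this shape:
{ x -> :Alice,
  y -> :David,
  path -> [ { x -> :Alice, y -> :Bob },
            { x -> :Bob,   y -> :David } ] }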
Geospatial Query
Stardog supports geospatial queries over data encoded using WGS 84 or the OGC’s GeoSPARQL vocabulary. Any RDF data stored in Stardog using one or both of these vocabularies will be automatically indexed for geospatial queries.
Enabling Geospatial Support
To get started using Stardog’s geospatial support, you’ll need to create a database with geospatial support enabled. You can do this by setting the option spatial.enabled to true:
stardog-admin db create -o spatial.enabled=true -n mySpatialDb
Similarly, you can set the option using GeospatialOptions#SPATIAL_ENABLED when creating the database programmatically:
aAdminConnection.disk("mySpatialDb")
.set(GeospatialOptions.SPATIAL_ENABLED, true)
.create()
Precision & Accuracy
When creating a database with geospatial support, you can specify the precision with which the features are indexed. The index precision is controlled by the database property spatial.precision (or programmatically via GeospatialOptions#SPATIAL_PRECISION), which can only be specified when the database is created. The default value is 11, which yields sub-meter precision; a value of 8 gives a precision of +/- 50m. Setting the precision value lower than the default can improve the performance of spatial queries at the cost of accuracy.
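For example, a database with coarser (roughly +/- 50m) indexing could be created as follows, assuming both options are passed to db create in the same way as spatial.enabled in the previous section:
$ stardog-admin db create -o spatial.enabled=true spatial.precision=8 -n mySpatialDb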
Geospatial Data
The WGS84 or OGC vocabularies can be used to encode geospatial features within your dataset. When data is committed, Stardog will look for these vocabularies and automatically extract all features and insert them into the geospatial index. Here is an example of using WKT to define the location of the White House:
:whiteHouse a geo:Feature ;
rdfs:label "White House" ;
geo:hasGeometry :whiteHouseGeo .
:whiteHouseGeo a geo:Geometry ;
geo:asWKT "Point(-77.03653 38.897676 )"^^geo:wktLiteral .
Note that for WKT formatted points, the location is given as <long, lat>. The location of the White House can also be encoded using the WGS 84 vocabulary:
:whiteHouse a :Location ;
rdfs:label "White House" ;
wgs:lat "38.897676"^^xsd:float ;
wgs:long "-77.03653"^^xsd:float .
SPARQL Integration
Once your data has been indexed, you can perform several types of geospatial queries on the data. These are seamlessly integrated into SPARQL, so you can query for non-spatial information about features in your dataset alongside the geospatial queries.
The operators supported by Stardog are geof:relate, geof:distance, geof:within, geof:nearby, and geof:area. The geof namespace is http://www.opengis.net/def/function/geosparql/.
This query gets all features within 2km of Stardog HQ in DC:
select ?name where {
?loc rdfs:label ?name .
?loc geo:hasGeometry ?feature .
?hq geo:hasGeometry ?hqGeo ; rdfs:label "Stardog HQ" .
?feature geof:nearby (?hqGeo 2 <http://qudt.org/vocab/unit#Kilometer>).
}
More query examples can be found on our blog.
Geospatial Datatypes
The QUDT ontology, namespace http://qudt.org/vocab/unit#, is used to specify units for distances: Kilometer, Meter, Centimeter, MileUSStatute, Yard, Foot, Inch. Additionally, the OGC units vocabulary http://www.opengis.net/def/uom/OGC/1.0/ defines degree, radian, and metre.
Enhanced Polygons
Stardog’s geospatial support covers the use of basic WKT formatted shapes, specifically points and rectangles. However, WKT can encode more complex spatial structures, most notably polygons.
To enable support for these more complex shapes, download JTS and include the JAR in Stardog’s classpath by placing it into the server/ext folder of the installation (you may need to create this folder) or into the folder specified by the STARDOG_EXT environment variable. Then set spatial.use.jts=true in your stardog.properties file. When you restart Stardog, it will pick up JTS and you’ll be able to use more complex WKT formatted shapes.
Machine Learning
In this section, you’ll learn how to use Stardog’s machine learning capabilities for the general problem of predictive analytics. We’ll show you how to build a machine learning model and use it for prediction, plus best practices on modelling your data and improving the quality of results.
Predictive Analytics
Suppose you have data about movies, but that data is incomplete; some movies are missing the genre field. Filling out that missing data by hand is time consuming, and you would like to do it automatically using all the information you already have about the movies. This is where Stardog’s predictive analytics comes into play. Using the data you have about movies with a genre, you can create a machine learning model that will predict the genre for the movies that are missing it. Isn’t that sweet?
Supervised learning is the basis of this capability. You give Stardog some data about the domain you’re interested in, and it will learn a model that can be used to make predictions about properties of that data.
Learning a Model
The first step is learning a model by defining which data will be used in the learning and the target that we are actually trying to predict.
With Stardog, all this is naturally done via SPARQL. The best way to understand the syntax is through an example. Here, we learn a model to predict the genre of a movie given its director, year, and studio.
prefix spa: <tag:stardog:api:analytics:>
INSERT {
graph spa:model {
:myModel a spa:ClassificationModel ;
spa:arguments (?director ?year ?studio) ;
spa:predict ?genre .
}
}
WHERE {
?movie :directedBy ?director ;
:year ?year ;
:studio ?studio ;
:genre ?genre .
}
The WHERE clause selects the data, and a special graph, spa:model, is used to specify the parameters of the training. :myModel is the unique identifier given to this model and is described by 3 mandatory properties.
First, we need to define the type of learning we are performing:
-
classification, spa:ClassificationModel, if we are interested in predicting a categorical value that has a limited set of possible values (e.g., genre of a movie)
-
regression, spa:RegressionModel, if we predict a numerical value that can naturally have an unlimited set of values (e.g., box office of a movie)
-
similarity, spa:SimilarityModel, if we want to predict the degree of similarity between two objects (e.g., most similar movies)
The second property, spa:arguments, defines the variables from the WHERE clause that will be used as features when learning the model. Here is where you define the data that you think will help to predict the third property, given by spa:predict. In this case, our model will be trained to predict the value of ?genre based on the values of ?director, ?year, and ?studio.
Properly defining these 3 properties is the main task when creating any model. Using more advanced parameters is covered in the Mastering the Machine section.
Making Predictions
Now that we’ve learned a model, we can move on to more exciting stuff and use it to actually predict things.
prefix spa: <tag:stardog:api:analytics:>
SELECT * WHERE {
graph spa:model {
:myModel spa:arguments (?director ?year ?studio) ;
spa:predict ?predictedGenre .
}
:TheGodfather :directedBy ?director ;
:year ?year ;
:studio ?studio ;
:genre ?originalGenre .
}
We select a movie’s properties and use them as arguments to the model Stardog previously learned. The magic comes with the ?predictedGenre variable; during query execution, its value is not going to come from the data itself (like ?originalGenre), but will instead be predicted by the model, based on the values of the arguments.
The result of the query will look like this:
| director | year | studio | originalGenre | predictedGenre |
| ------------------- | ---- | ------------------ | ------------- | -------------- |
| :FrancisFordCoppola | 1972 | :ParamountPictures | Drama | Drama |
Our model seems to be correctly predicting the genre for The Godfather. Yee!
Query Syntax Restrictions
At this point, only basic graph patterns can be used directly inside the prediction query. If more advanced constructs, like OPTIONAL or FILTER, are necessary, that part of the query needs to be in a sub-query, e.g.:
prefix spa: <tag:stardog:api:analytics:>
SELECT * WHERE {
graph spa:model {
:myModel spa:arguments (?director ?year ?studio) ;
spa:predict ?predictedGenre .
}
{
SELECT * WHERE {
?movie :directedBy ?director ;
:year ?year ;
:genre ?originalGenre .
OPTIONAL { ?movie :studio ?studio }
FILTER (?year > 2000)
}
}
}
Assessing Model Quality
Metrics
We provide some special aggregate operators that help quantify the quality of a model. For classification and similarity problems, one of the most important measures is accuracy, that is, the frequency with which we predict the target variable correctly.
prefix spa: <tag:stardog:api:analytics:>
SELECT (spa:accuracy(?originalGenre, ?predictedGenre) as ?accuracy) WHERE {
graph spa:model {
:myModel spa:arguments (?director ?year ?studio) ;
spa:predict ?predictedGenre .
}
?movie :directedBy ?director ;
:year ?year ;
:studio ?studio ;
:genre ?originalGenre .
}
+---------------------+
| accuracy |
| ------------------- |
| 0.92488254018 |
+---------------------+
For regression, we provide three different measures:
-
Mean absolute error: on average, how far away the prediction is from the real target number:
spa:mae(?originalValue, ?predictedValue)
-
Mean square error: on average, the squared difference between the prediction and the target number:
spa:mse(?originalValue, ?predictedValue)
-
Root mean square error: the square root of the mean square error:
spa:rmse(?originalValue, ?predictedValue)
Automatic Evaluation
Classification and regression models are automatically evaluated with the data used in their training. The score and respective metric can be queried from the spa:model graph.
prefix spa: <tag:stardog:api:analytics:>
SELECT * WHERE {
graph spa:model {
:myModel spa:evaluationMetric ?metric ;
spa:evaluationScore ?score .
}
}
+------------------------------------+-------+
| metric | score |
+------------------------------------+-------+
| tag:stardog:api:analytics:accuracy | 1.0 |
+------------------------------------+-------+
By default, spa:accuracy is used for classification problems and spa:mae for regression. This metric can be changed during model learning by setting the spa:evaluationMetric argument.
prefix spa: <tag:stardog:api:analytics:>
INSERT {
graph spa:model {
:myModel a spa:RegressionModel ;
spa:evaluationMetric spa:rmse ;
...
}
}
...
Cross Validation
The default automatic evaluation technique of measuring the accuracy of the model on the same data used for training is prone to overfitting. The most accurate measure we can get comes from testing on data that the model has never seen before.
We provide a spa:crossValidation property, which will automatically apply K-fold cross validation on the training data, with the number of folds given as an argument.
prefix spa: <tag:stardog:api:analytics:>
INSERT {
graph spa:model {
:myModel a spa:RegressionModel ;
spa:crossValidation 10 ;
spa:evaluationMetric spa:rmse ;
...
}
}
...
prefix spa: <tag:stardog:api:analytics:>
SELECT * WHERE {
graph spa:model {
:myModel spa:evaluation ?validation ;
spa:evaluationMetric ?metric ;
spa:evaluationScore ?score .
}
}
+-------------+------------------------------------+-------+
| validation | metric | score |
+-------------+------------------------------------+-------+
| "KFold=10" | tag:stardog:api:analytics:rmse | 0.812 |
+-------------+------------------------------------+-------+
Modelling Data
The way you input data into Stardog during model learning is of utmost importance in order to achieve good quality predictions.
Data Representation
For better results, each individual you are trying to model should be encoded in a single SPARQL result.
For example, suppose you want to add information about actors to the previous model. The query selecting the data would look as follows:
SELECT * WHERE {
?movie :actor ?actor ;
:directedBy ?director ;
:year ?year ;
:studio ?studio ;
:genre ?genre .
}
| movie | actor | director | year | studio | genre |
| ------------- | ------------- | ------------------- | ---- | ------------------ | ------ |
| :TheGodfather | :MarlonBrando | :FrancisFordCoppola | 1972 | :ParamountPictures | Drama |
| :TheGodfather | :AlPacino | :FrancisFordCoppola | 1972 | :ParamountPictures | Drama |
Due to the nature of relational query languages like SPARQL, results are returned for all the combinations of the values of the selected variables.
In order to properly model relational domains like this, we introduced a special aggregate operator, set. Used in conjunction with GROUP BY, it lets us easily model this kind of data as a single result per individual.
prefix spa: <tag:stardog:api:analytics:>
SELECT ?movie (spa:set(?actor) as ?actors) ?director ?year ?studio ?genre WHERE {
?movie :actor ?actor ;
:directedBy ?director ;
:year ?year ;
:studio ?studio ;
:genre ?genre .
}
GROUP BY ?movie ?director ?year ?studio ?genre
| movie | actors | director | year | studio | genre |
| ------------- | ------------------------- | ------------------- | ---- | ------------------ | ------ |
| :TheGodfather | [:MarlonBrando :AlPacino] | :FrancisFordCoppola | 1972 | :ParamountPictures | Drama |
Data Types
Carefully modelling your data with the correct datatypes can dramatically increase the quality of your model.
As of 7.4.5, Stardog gives special treatment to values of the following types:
-
Numbers, such as
xsd:int
,xsd:short
,xsd:byte
,xsd:float
, andxsd:double
, are treated internally as weights and properly model the difference between values -
Strings,
xsd:string
andrdf:langString
, are tokenized and used in a bag-of-words fashion -
Sets, created with the
spa:set
operator, are interpreted as a bag-of-words of categorical features -
Booleans,
xsd:boolean
, are modeled as binary features
Everything else is modeled as categorical features.
Setting the correct data type for the target variable, given through spa:predict
,
is extremely important:
-
with regression, make sure values are numeric
-
with classification, individuals of the same class should have consistent data types and values
-
with similarity, use values that uniquely identify an object, e.g., an IRI
For everything else, using the datatype closest to its original meaning is a good rule of thumb.
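For example, a numeric regression target should be stored with a numeric datatype rather than as a plain string. A minimal sketch, assuming a hypothetical :boxOffice property used as the spa:predict target:
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
INSERT DATA {
    # typed as xsd:double, so it is modeled as a number rather than a categorical value
    :TheGodfather :boxOffice "100.0"^^xsd:double .
}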
Mastering the Machine
Let’s look at some other issues around the daily care and feeding of predictive analytics and models in Stardog.
Overwriting Models
By default, you cannot create a new model with the same identifier as an already existing one.
If you try to do so, you’ll be greeted with a Model already exists
error.
In order to reuse an existing identifier, users can set the spa:overwrite
property to True
.
This will delete the previous model and save the new one in its place.
prefix spa: <tag:stardog:api:analytics:>
INSERT {
graph spa:model {
:myModel a spa:RegressionModel ;
spa:overwrite True ;
...
}
}
...
Deleting Models
Finding good models is an iterative process, and sometimes you’ll want to delete
your old, now unnecessary models. This can be achieved
with DELETE DATA
and the spa:deleteModel
property applied to the model
identifier.
prefix spa: <tag:stardog:api:analytics:>
DELETE DATA {
graph spa:model {
[] spa:deleteModel :myModel .
}
}
Classification and Similarity with Confidence Levels
Sometimes, besides predicting the most probable value for a property, you will
be interested in knowing the confidence of that prediction. By providing the
spa:confidence
property, you can get confidence levels for all the possible
predictions.
prefix spa: <tag:stardog:api:analytics:>
SELECT * WHERE {
graph spa:model {
:myModel spa:arguments (?director ?year ?studio) ;
spa:confidence ?confidence ;
spa:predict ?predictedGenre .
}
:TheGodfather :directedBy ?director ;
:year ?year ;
:studio ?studio .
}
ORDER BY DESC(?confidence)
LIMIT 3
| director | year | studio | predictedGenre | confidence |
| ------------------- | ---- | ------------------ | -------------- | -------------- |
| :FrancisFordCoppola | 1972 | :ParamountPictures | Drama | 0.649688932 |
| :FrancisFordCoppola | 1972 | :ParamountPictures | Crime | 0.340013045 |
| :FrancisFordCoppola | 1972 | :ParamountPictures | Sci-fi | 0.010298023 |
These values can be interpreted as the probability of the given prediction being the correct one and are useful for tasks like ranking and multi-label classification.
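For instance, for multi-label classification you might keep every genre whose confidence exceeds a threshold instead of taking a fixed number of results. A sketch (the 0.3 cutoff is illustrative, and it assumes the FILTER is applied after ?confidence is bound by the prediction):
prefix spa: <tag:stardog:api:analytics:>
SELECT ?predictedGenre ?confidence WHERE {
    graph spa:model {
        :myModel spa:arguments (?director ?year ?studio) ;
                 spa:confidence ?confidence ;
                 spa:predict ?predictedGenre .
    }
    :TheGodfather :directedBy ?director ;
                  :year ?year ;
                  :studio ?studio .
    FILTER(?confidence > 0.3)
}
ORDER BY DESC(?confidence)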
Tweaking Parameters
Both Vowpal Wabbit and similarity search
can be configured with the spa:parameters
property.
prefix spa: <tag:stardog:api:analytics:>
INSERT {
graph spa:model {
:myModel a spa:ClassificationModel ;
spa:library spa:VowpalWabbit ;
spa:parameters [
spa:learning_rate 0.1 ;
spa:sgd True ;
spa:hash 'all'
] ;
spa:arguments (?director ?year ?studio) ;
spa:predict ?genre .
}
}
...
Parameter names for both libraries are valid properties in the spa
prefix, and their values can be set during model creation.
Vowpal Wabbit
By default, models are learned with [ spa:loss_function "logistic"; spa:probabilities true; spa:oaa true ]
in classification mode,
and [ spa:loss_function "squared" ]
in regression.
Those parameters are overwritten when using the spa:parameters
property with regression, and appended in classification.
Check the official documentation for a full list of parameters. Some tips that might help with your choices:
-
Use cross-validation when tweaking parameters. Otherwise, make sure your testing set is not biased and represents a true sample of the original data.
-
The most important parameter to tweak is the learning rate
spa:l
. Values between 1 and 0.01 usually give the best results. -
To prevent overfitting, set
spa:l1
orspa:l2
parameters, preferably with a very low value (e.g., 0.000001). -
If the number of distinct features is large, make sure to increase the number of bits
spa:b
to a larger value (e.g., 22). -
Each argument given with
spa:arguments
has its own namespace, identified by its numeric position in the list (starting with 0). For example, to create quadratic features between?director
and?studio
, setspa:q "02"
. -
If caching is enabled (e.g., with
spa:passes
), always use the[ spa:k true; spa:cache_file "fname" ]
parameters, wherefname
is a unique filename for that model. -
In regression, the target variable given with
spa:predict
is internally normalized into the[0-1]
range, and denormalized back to its normal range during query execution. For certain problems where numeric arguments have large values, performance might be improved by performing a similar normalization as a pre-processing step.
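Putting a few of these tips together, a tuned classification model might be declared along the following lines (a sketch; the parameter values are illustrative, not recommendations):
prefix spa: <tag:stardog:api:analytics:>
INSERT {
    graph spa:model {
        :myModel a spa:ClassificationModel ;
            spa:library spa:VowpalWabbit ;
            spa:parameters [
                spa:l 0.1 ;        # learning rate
                spa:l2 0.000001 ;  # light regularization to limit overfitting
                spa:b 22 ;         # more bits for many distinct features
                spa:q "02"         # quadratic features between ?director (0) and ?studio (2)
            ] ;
            spa:crossValidation 5 ;
            spa:arguments (?director ?year ?studio) ;
            spa:predict ?genre .
    }
}
...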
Similarity Search
The underlying algorithm is based on cluster pruning, an approximate search algorithm which groups items based on their similarity in order to speed up query performance.
The minimum number of items per cluster can be configured with the spa:minClusterSize
property, which is set to 100 by default.
prefix spa: <tag:stardog:api:analytics:>
INSERT {
graph spa:model {
:myModel a spa:SimilarityModel ;
spa:parameters [
spa:minClusterSize 100 ;
] ;
spa:arguments (?director ?year ?studio) ;
spa:predict ?movie .
}
}
...
This number should be increased with datasets containing many near-duplicate items.
During prediction, there are two parameters available:
-
spa:limit
, which restricts the number of top N items to return; by default, it returns only the top item, or all items if usingspa:confidence
. -
spa:clusters
, which sets the number of similarity clusters used during the search, with a default value of 1. Larger numbers will increase recall, at the expense of slower query time.
For example, the following query will return the top 3 most similar items and their confidence scores, restricting the search to 10 clusters.
prefix spa: <tag:stardog:api:analytics:>
SELECT * WHERE {
graph spa:model {
:myModel spa:parameters [
spa:limit 3 ;
spa:clusters 10
] ;
spa:confidence ?confidence ;
spa:arguments (?director ?year ?studio) ;
spa:predict ?similar .
}
}
...
Hyperparameter Optimization
Finding the best parameters for a model is a time-consuming, laborious process. Stardog helps to ease the pain by performing an exhaustive search through a manually specified subset of parameter values.
prefix spa: <tag:stardog:api:analytics:>
INSERT {
graph spa:model {
:myModel a spa:ClassificationModel ;
spa:library spa:VowpalWabbit ;
spa:parameters [
spa:learning_rate (0.1 1 10) ;
spa:hash ('all' 'strings')
] ;
spa:arguments (?director ?year ?studio) ;
spa:predict ?genre .
}
}
...
All possible sets of parameter configurations that can be built from the given values (spa:learning_rate 0.1 ; spa:hash 'all'
, spa:learning_rate 1 ; spa:hash 'all'
, and so on)
will be evaluated.
The best configuration will be chosen, and its model saved in the database.
Afterwards, parameters are available for querying, just like any other model metadata.
prefix spa: <tag:stardog:api:analytics:>
SELECT * WHERE {
graph spa:model {
:myModel spa:parameters [ ?parameter ?value ]
}
}
+-------------------+-------+
| parameter | value |
+-------------------+-------+
| spa:hash | "all" |
| spa:learning_rate | 1 |
+-------------------+-------+
Native Library Errors
Stardog ships with a pre-compiled version of Vowpal Wabbit (VW) that works out of the box with most macOS/Linux 64-bit distributions.
If you have a 32-bit operating system, or an older version of Linux, you will
be greeted with an Unable to load analytics native library
error when trying to create your
first model.
Exception in thread "main" java.lang.RuntimeException: Unable to load analytics native library. Please refer to http://www.stardog.com/docs/#_native_library_errors
at vowpalWabbit.learner.VWLearners.loadNativeLibrary(VWLearners.java:94)
at vowpalWabbit.learner.VWLearners.initializeVWJni(VWLearners.java:76)
at vowpalWabbit.learner.VWLearners.create(VWLearners.java:44)
...
Caused by: java.lang.RuntimeException: Unable to load vw_jni library for Linux (i386)
In this case, you will need to install VW manually. Fear not! Instructions are easy to follow.
git clone https://github.com/cpdomina/vorpal.git
cd vorpal/build-jni/
./build.sh
sudo cp transient/lib/vw_wrapper/vw_jni.lib /usr/lib/libvw_jni.so
You might need to install some dependencies, namely zlib-devel
, automake
, libtool
, and autoconf
.
After this process is finished, restart the Stardog server and everything should work as expected.
Edge Properties
Note
|
This feature is in beta in Stardog 7.1.
To enable it, create the database with the edge.properties=true option.
|
Stardog 7.1 supports extensions to the RDF data model and SPARQL query engine to store and query properties of RDF statements (edges in the RDF graph, thus the name "edge properties"). Edge properties allow the user to attach specific information to RDF statements by using them as subjects in other RDF statements. They are somewhat similar to named graphs in the sense that they allow adding metadata to existing data but on the statement level, not on the graph level.
Edge properties bridge the gap between the RDF data model and the Property Graph data model.
Example
Common examples of statement metadata include provenance, uncertainty, and time. The two statements below can be annotated to specify where they come from, and in what time period they hold.
:Pete a :Engineer ;
:worksAt :Stardog
as follows:
:Pete a { :since 2010 } :Engineer ;
:worksAt { :source :HR } :Stardog
Note
|
See Syntax(es) on the details of the Stardog syntax for edge properties. |
Motivation
While it is technically possible to maintain statement level metadata in plain RDF,
the approaches are unintuitive and tend to complicate queries for accessing the data.
For example, one way is to treat relations
such as :worksAt
as n-ary predicates, model them as nodes, and link them to metadata
using ordinary RDF statements:
:PeteEmployment :employee :Pete ;
:employer :Stardog ;
:source :HR
A similar, domain-independent approach is to use the RDF reification
vocabulary and turn each edge into a set of rdf:Statement/rdf:subject/rdf:predicate/rdf:object
triples.
Both ways increase the number of triples in the database and the number of triple patterns in SPARQL queries.
That typically leads to performance penalties for both data updates and queries.
Instead, the edge property support in Stardog 7.1 is based on the recent work on RDF*/SPARQL* extensions and includes changes to both the storage layer and the query engine for performance reasons.
Syntax(es)
Stardog supports two syntactic flavors to represent edge properties in RDF and query them in SPARQL. The first notation was originally suggested as a part of the RDF*/SPARQL* proposal. It is based on Turtle and looks as follows:
<< :Pete a :Engineer >> :since 2010 .
<< :Pete :worksAt :Stardog >> :source :HR .
The corresponding triple pattern syntax in SPARQL also uses the << >>
notation:
SELECT * {
<< ?emp a :Engineer >> :since ?year .
<< ?emp :worksAt :Stardog >> :source ?who .
}
It is also possible to use the extended BIND operator in SPARQL to bind variables to RDF edges:
SELECT ?emp ?year {
BIND(<< ?emp a :Engineer >> as ?edge)
?edge :since ?year .
}
Both the Turtle and SPARQL extensions support the standard Turtle shortcuts like implicit bnodes ([]
),
predicate lists (;
), and object lists (,
) outside of << >>
patterns but not inside them. This makes
it verbose when the same subject has multiple outgoing edges, some of which have properties and some of which do not.
That tends to be annoying in SPARQL, where multiple triple patterns with the same subject are common.
To address this issue, Stardog also supports an alternative syntax which puts edge properties next to predicates rather than full triples or triple patterns. This allows all the Turtle-style shortcuts:
:Pete a { :since 2010 ; :until 2018 } :Engineer ;
:worksAt { :source :HR } :Stardog
The same works in SPARQL:
SELECT ?emp ?start ?end ?who {
?emp a { :since ?start ; :until ?end } :Engineer ;
:worksAt { :source ?who } :Stardog .
}
Both notations are equally expressive so that data and queries can be translated back and forth.
Scope of the Support
Stardog supports edge properties across all APIs (Java and HTTP) and types of SPARQL queries.
Specifically, edge property patterns can be used not only in the WHERE
clauses of CONSTRUCT
and
SPARQL Update queries but also in their graph templates.
CONSTRUCT
query example:
CONSTRUCT { << ?emp :worksAt ?org >> :source :HR } WHERE {
?emp a :Engineer ;
:worksAt ?org
}
SPARQL Update query examples:
INSERT DATA { << :Pete :worksAt :Stardog >> :since 2010 }
INSERT { << ?emp :worksAt ?org >> :source :HR } WHERE {
?emp a :Engineer ;
:worksAt ?org
}
Stardog supports RDF* extensions for Turtle, TriG, the Binary RDF format, and JSON-LD.
SPARQL SELECT query results with edge properties can be sent over in the XML, binary, and JSON
format (content types application/sparql-results+xml
, application/x-binary-rdf-results-table
,
and application/sparql-results+json
, respectively).
Details
We have made several decisions regarding edge properties support and how it relates to the RDF*/SPARQL* proposal. They are motivated by performance and ease of use considerations.
-
Only the subject of an RDF triple can itself be a triple (not the predicate or object, as allowed in the original RDF* proposal). Nested edge properties are not allowed.
-
Named Graphs: Edges with properties can be stored in a named graph just as other triples. However, edge properties must be stored in the same named (or default) graph as the corresponding edges.
-
Asserting Edges: The question of whether edges with properties should be asserted in the graph or not has been the subject of a lively debate. In Stardog, similarly to the Property Graph Model, they are implicitly added to the graph as soon as their properties are added. Note that the Stardog syntax makes it obvious in contrast to the RDF* syntax which allows both interpretations.
-
Cascading Deletes: The consequence of asserting edges is that edge properties are deleted in a cascading fashion when edges themselves are deleted; see the sketch after this list.
-
Transactions: Stardog automatically selects the Abort on Conflict strategy for conflict resolution during commits. This is required to guarantee that there are no orphaned edge properties under concurrent transactions.
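For example, given the :Pete data from earlier, removing an edge also removes the property attached to it. A sketch of the cascading-delete behavior:
# removes :Pete :worksAt :Stardog and, with it, its :source :HR edge property
DELETE DATA { :Pete :worksAt :Stardog }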
Cache Management
In the 6.2.0 release, Stardog introduced the notion of a distributed cache. A set of cached datasets can be run in conjunction with a Stardog server or cluster. A dataset can be an entire graph, a virtual graph, or a query result (currently experimental). This feature gives users the following abilities:
-
Reduce Load on Upstream Database Servers When using virtual graphs, the upstream server may be slow, overworked, far away, or lacking operational capacity. This feature addresses that by allowing operators to create a cached dataset running on its own node. In this way the upstream database can be largely avoided, and cache refreshes can be scheduled for times when its workload is lighter.
-
Read Scale-out for a Stardog Cluster Cache nodes allow operators to add read capacity for slowly moving data to a cluster without affecting write capacity. The Stardog cluster is a consistent database that replicates writes to every node, so as you add consistent read nodes, write capacity is strained. However, when serving slowly moving data that doesn’t need to be fully consistent, a cache graph node can be added to provide additional read capacity.
-
Partial Materialization of Slowly Changing Data A cache dataset can be created that contains a portion of a virtual graph that does not update frequently while allowing federated virtual graph queries as needed over portions of the data that update more frequently.
Architecture
Running inside of a Stardog server (either a cluster or single node) is a component called the cache manager. The cache manager is responsible for tracking what caches exist, where they are, and what is in them. The query planner must work with the cache manager to determine whether or not it can use a cache in the plan.
Cache Targets
Cache targets are separate processes that look a lot like a single-node Stardog server on the inside. They contain a single database into which cached information is loaded and updated. Many caches can be on a single cache target. How to balance them is up to each operator as they consider their own resource and locality needs.
The following diagram shows how the distributed cache can be used to answer queries where some of the data is cached and some remains in its original source.

Setting Up A Distributed Cache
To set up Stardog with a distributed cache, first start a Stardog server as described in (Administering Stardog).
For every cache target needed, another Stardog server must be run. Stardog servers are configured to be cache targets with the following options:
# Flag to run this server as a cache target
cache.target.enabled=true
# If using the cache target with a cluster, we need to tell it where
# the cluster's ZooKeeper installation is running
pack.zookeeper.address=196.69.68.1:2180,196.69.68.2:2180,196.69.68.3:2180
# Flag to automatically register this cache target on startup. This is
# only applicable when using a cluster.
cache.target.autoregister=false
# The name to use for this cache target on auto register. This defaults
# to the hostname
cache.target.name = "mycache"
Once both are running we need to register the cache target with Stardog. That is done in the following way:
stardog-admin --server http://<cluster IP>:5820 cache target add <target name> <target hostname>:5820 admin <admin pw>
Once Stardog knows of the cache target, datasets can be cached on it. To cache a graph, run a command similar to the following:
stardog-admin --server http://<cluster ip>:5820 cache create cache://cache1 --graph virtual://dataset --target ctarget --database movies
This will create a cache named cache://cache1
which will hold the contents
of the virtual graph virtual://dataset
which is associated with the database
movies, and store that cache on the target ctarget
.
Kubernetes
Helm Charts
Support for k8s is provided as Helm charts, which makes it easy to deploy and test in k8s. Helm charts describe the services and applications to run in k8s and how they should be deployed, providing a single means for repeatable application deployment.
As of Stardog 7.3.0, the Helm charts are open source and available on GitHub.
Customizing the Stardog Image
To customize the Stardog image in k8s we recommend extending the base Stardog image and pushing your custom image into a container registry, either a local or on-prem registry, or a cloud-based one, such as DockerHub, Elastic Container Registry (ECR), Google Container Registry (GCR), or Azure Container Registry (ACR).
As an example, to configure an image with extra client drivers, create a Dockerfile:
FROM stardog/stardog:latest
RUN mkdir -p /var/opt/drivers/
COPY ./elasticsearch-rest-client-7.4.0.jar /var/opt/drivers/
ENV STARDOG_EXT=/var/opt/drivers/
This defines a simple Docker image, based on Stardog’s official
image hosted on DockerHub, which adds the Elastic Search client
and sets the STARDOG_EXT
environment variable. When this image is
run, the ENTRYPOINT
for the stardog/stardog:latest
image will
be used to start Stardog but with the additions included in the
custom Dockerfile.
After creating the Dockerfile, you need to build, tag, and push it into your registry (authenticating as required). Below we build the image and push it to ECR with a custom tag:
docker build . -t <account-id>.dkr.ecr.us-west-1.amazonaws.com/customstardog:0.0.1
$(aws ecr get-login --region us-west-1 --no-include-email)
docker push <account-id>.dkr.ecr.us-west-1.amazonaws.com/customstardog:0.0.1
Once your image is available, you can set the image.*
parameters for Helm to deploy a Stardog cluster into k8s
with your custom image.
Business Intelligence Tools and SQL Queries
BI & SQL Introduction
Stardog provides a facility to bridge the flexible schema of the graph data model to the traditional relational model required by business intelligence tools using SQL. This enables seamless use of business intelligence and visualization tools such as Tableau and Power BI.
Using this feature requires creating a schema mapping from Stardog’s data model to a relational data model. This mapping will be used to generate a relational schema. After creating the schema, the full power of SQL becomes available to relational clients.
Important
|
Authentication requirements dictate that users created in versions of Stardog prior to 7.0.2 will need to have their password reset before connecting through the BI server. This requirement is not applicable if LDAP authentication is configured. |
Configuring the BI Server
To integrate with business intelligence applications, Stardog includes
a BI Server that makes Stardog communicate like a fully SQL-compliant
RDBMS. The BI Server can be configured to run inside Stardog using the
following configuration options in stardog.properties
:
-
sql.server.enabled
: Turns on the BI Server. Must be set totrue
to use this feature. -
sql.server.port
: Controls the TCP port which the SQL query endpoint listens on. The default port is5806
. -
sql.server.commit.invalidates.schema
: Controls when changes to the schema mappings are visible to new BI connections. The default value istrue
. Iftrue
, changes to the schema mappings will be visible to newly created connections. This can lead to unnecessary load if the mappings rarely change. Iffalse
, changes to the schema mappings will only affect the schema after the database is taken offline. In all cases, users with long running connections will not see schema changes until they reconnect.
The schema mapping is stored in a named graph which can be configured
separately for each database. The sql.schema.graph
database option
should be set to the IRI of the named graph which stores the schema. The
default value is tag:stardog:api:sql:schema
.
If the schema mapping is not
manually added by the user and the database configuration
option sql.schema.auto
is set to true
(which is the default value)
then a schema mapping will be automatically created and used. See
below for the details of
auto-schema generation.
BI Server Supported Clients
Clients connect to the BI Server using the MySQL client/server protocol. This means that a wide range of clients are supported including MySQL ODBC, JDBC and ADO.NET drivers.
Currently the following BI tools are officially supported:
-
Tableau, which requires the MySQL Connector/ODBC driver.
-
Power BI, which requires the MySQL Connector/NET driver.
-
cumul.io, where you can simply select the Stardog connector
-
Apache Superset, which requires
mysqlclient
to be installed in the Dockerfile. -
Siren (Beta), which requires the MySQL JDBC driver.
-
IBM Cognos (Beta), where you can select the MySQL connector
-
Metabase, which supports connecting natively using the MySQL protocol.
-
RapidMiner, which can connect using MySQL’s JDBC driver and the "Read Database" operator.
Once the appropriate client driver is installed, select the option to
connect to a MySQL server, enter your Stardog hostname
(if running locally, use your IP address instead of localhost
)
and the configured BI Server port, and provide credentials for a Stardog user.
Although not officially supported, other visualization and reporting tools should work with the BI Server. Please let us know if you’re using a different tool and have any questions or difficulties.
LDAP Authentication in BI Server
The BI Server uses the same authentication mechanisms configured
globally in the Stardog instance. This is generally transparent but
may require special configuration in the MySQL client. MySQL’s ODBC
driver requires the "cleartext" password authentication to be enabled
when using LDAP authentication. This can be done by setting
ENABLE_CLEARTEXT_PLUGIN=1
when editing the DSN file or by checking the
"Enable Cleartext Authentication" option of the "Connection" tab in
the
GUI
configuration.
SSL Connections to BI Server
The BI Server supports SSL/TLS connections which is especially
important when sending cleartext passwords with LDAP authentication as
described above. The required configuration is to provide a server key
as documented in Configuring Stardog to use SSL. The BI Server
currently only reads the SSL/TLS keyStore property from the JVM
arguments, not from stardog.properties
.
When the server key is provided, the BI Server will advertise SSL/TLS
capability to clients. Preventing the use of unencrypted connections
can be done on the client
side. MySQL’s
ODBC driver can be configured with
SSLMODE=REQUIRED
. MySQL’s
JDBC driver can be configured with requireSSL=true
or
sslMode=REQUIRED
.
SQL Schema Mappings
A schema mapping defines a view over the RDF triples in a
database in terms of tables in a relational schema. A schema mapping is expressed
as RDF and stored in a named graph (as identified by the sql.schema.graph
option
which is by default set to the IRI tag:stardog:api:sql:schema
)
in the same database where the data to query is stored.
The top-level elements in the schema mapping are table mappings. A table
mapping defines the relationship between a subset of triples in the
database and a set of rows conforming to a fixed schema table view of
the data. A table mapping consists of a set of field mappings. Field
mappings define the schema of the table as well as specify which
values are present in each row.
To define a table mapping, you should add an entity of type
tag:stardog:api:sql:TableMapping
. Field mappings can be included in
a table mapping using the tag:stardog:api:sql:hasField
property. The
name of the table can be provided using the
tag:stardog:api:sql:tableName
property. Field mappings can be freely
created outside of table mappings and re-used in multiple table
mappings.
The table mapping is used to match a set of triples in the database
and transform them into rows in the table. Triples are joined together
implicitly with a shared subject. The shared subject is mapped to a
field named id
. For instance, if you map a name
and age
property, this would give the same results as the SPARQL query:
SELECT ?id ?name ?age {
?id :name ?name ; :age ?age
}
Table mappings can include constraints. Constraints provide the
ability to restrict the set of triples present in a mapped
table. Constraints are similar to field mappings in that they require
a property which will be linked to the shared subject. However, the
object is specified as a constant. A common usage is to require the
rows in a table to be mapped from instances of a given class. The
tag:stardog:api:sql:hasConstraint
property can be used to specify
constraints for a table mapping.
Example SQL Schema Mapping
The following example is composed of several elements:
-
A Stardog database
-
A SQL schema mapping defining the schema and its relationship to the triples in the database
-
A SQL table schema which is auto-generated by Stardog
-
SQL queries and results
The contents of the database:
@prefix : <http://example.com/> .
:Alice a :Person ;
:name "Alice" ;
:nationality :USA .
:Bob a :Person ;
:name "Bob" ;
:nationality :UK .
The SQL schema definition:
@prefix sql: <tag:stardog:api:sql:> .
@prefix : <http://example.com/> .
:PersonTableMapping a sql:TableMapping ;
sql:tableName "Person" ;
sql:hasConstraint [ sql:property rdf:type ; sql:object :Person ] ;
sql:hasField [ sql:property :name ; sql:fieldName "person_name" ] ;
sql:hasField [ sql:property :nationality ] .
The generated SQL table schema:
CREATE TABLE Person (
id varchar NOT NULL,
person_name varchar NOT NULL,
nationality varchar NOT NULL,
PRIMARY KEY (id)
)
Example query and result:
SELECT id, person_name, nationality FROM Person
| id | person_name | nationality |
|--------------------------+-------------+-------------|
| http://example.com/Alice | Alice | USA |
| http://example.com/Bob | Bob | UK |
If you have multiple tables, you can use tag:stardog:api:sql:refersTo
to create
a foreign key type relationship. For example, if your database looks like this:
@prefix : <http://example.com/> .
:Alice a :Person ;
:name "Alice" ;
:nationality :USA .
:Bob a :Person ;
:name "Bob" ;
:nationality :UK .
:UK a :Country ;
:name "United Kingdom" .
:USA a :Country ;
:name "United States of America" .
You could create a schema mapping with two tables that looks like this:
@prefix sql: <tag:stardog:api:sql:> .
@prefix : <http://example.com/> .
:PersonTableMapping a sql:TableMapping ;
sql:tableName "Person" ;
sql:hasConstraint [ sql:property rdf:type ; sql:object :Person ] ;
sql:hasField [ sql:property :name ; sql:fieldName "person_name" ] ;
sql:hasField [ sql:property :nationality ; sql:refersTo :CountryTableMapping ] .
:CountryTableMapping a sql:TableMapping ;
sql:tableName "Country" ;
sql:hasConstraint [ sql:property rdf:type ; sql:object :Country ] ;
sql:hasField [ sql:property :name ; sql:fieldName "country_name" ] .
SQL Schema Field Mapping Options
Field mappings may contain the following properties:
-
tag:stardog:api:sql:property
- Specify which RDF property from the database is used to provide data for the field. This property is required for each field mapping. -
tag:stardog:api:sql:fieldName
- Specify the name of the field in the SQL schema. If this property is omitted, the local name of IRI given for the:property
is used. -
tag:stardog:api:sql:inverse
- Whentrue
, the sharedid
field of the triple is assumed to be in the object position of the triple instead of the subject position. An example field mapping sql:hasField [ sql:property :knows ; sql:fieldName "is_known_by" ; sql:inverse true ]
will include values for the subject position in theis_known_by
field and join to the other triple patterns using the value from the object position. The default value isfalse
. -
tag:stardog:api:sql:optional
- Whentrue
, no triples are assumed to be present when querying for a row. If they are not present, a NULL is included in the field. Additionally, the SQL schema will omit theNOT NULL
constraint. The default value isfalse
. -
tag:stardog:api:sql:refersTo
- Optionally specify a reference to another table mapping. This is analogous to defining a foreign key in a SQL schema. While not strictly necessary, this type of relationship can be defined once in the SQL schema mapping and will allow introspection of relationships in query generation tools. -
tag:stardog:api:sql:type
- Optionally specify the type of the field. The value is an XSD datatype. If values cannot be converted to the specified type, a default value will be returned, e.g. 0 for a field specifying an integer type but returning a value which cannot be interpreted as an integer. The default value is xsd:string.
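For instance, a table mapping combining several of these options can be loaded into the default schema graph with a SPARQL Update along these lines (a sketch; the :age property is hypothetical):
prefix sql: <tag:stardog:api:sql:>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix : <http://example.com/>

INSERT DATA {
    graph <tag:stardog:api:sql:schema> {
        :PersonTableMapping a sql:TableMapping ;
            sql:tableName "Person" ;
            sql:hasConstraint [ sql:property rdf:type ; sql:object :Person ] ;
            sql:hasField [ sql:property :name ; sql:fieldName "person_name" ] ;
            # optional integer field: rows get NULL when :age is absent
            sql:hasField [ sql:property :age ; sql:type xsd:integer ; sql:optional true ] .
    }
}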
SQL Schema Constraint Options
Constraints are generally expressed using the
tag:stardog:api:sql:hasConstraint
predicate linking the table
mapping to the constraint. The constraint may contain the following
properties:
-
tag:stardog:api:sql:property
- Specify which RDF property from the database is used to constrain the data in the table. This property is required. -
tag:stardog:api:sql:object
- Specify which constant object (i.e. the IRI or literal) from the database is used to constrain the data in the table. This property is required. To constrain to multiple values, use onetag:stardog:api:sql:object
relationship per value. -
tag:stardog:api:sql:inverse
- Whentrue
, the sharedid
field of the triple is assumed to be in the object position of the triple instead of the subject position. The default value isfalse
.
To constrain table contents to be instances of the Person
class, we
can use the hasConstraint
form as follows.
@prefix sql: <tag:stardog:api:sql:> .
@prefix : <http://example.com/> .
:PersonTableMapping a sql:TableMapping ;
sql:tableName "Person" ;
sql:hasConstraint [ sql:property rdf:type ; sql:object :Person ] ;
sql:hasField [ sql:property :name ; sql:fieldName "person_name" ] .
We can also use a constraint to include only Stardog users via the
isStardogUser
predicate:
@prefix sql: <tag:stardog:api:sql:> .
@prefix : <http://example.com/> .
:PersonTableMapping a sql:TableMapping ;
sql:tableName "StardogUsers" ;
sql:hasConstraint [ sql:property :isStardogUser ; sql:object true ] ;
sql:hasField [ sql:property :name ; sql:fieldName "person_name" ] .
Or a constraint with inverse
set if we want to find people
influenced by Einstein:
@prefix sql: <tag:stardog:api:sql:> .
@prefix : <http://example.com/> .
:PersonTableMapping a sql:TableMapping ;
sql:tableName "EinsteinPupils" ;
sql:hasConstraint [ sql:property :influencedBy ; sql:object :Einstein ; sql:inverse true ] ;
sql:hasField [ sql:property :name ; sql:fieldName "person_name" ] .
Restricting contents of a table to those of a specified class can be
done with the shorthand predicate tag:stardog:api:sql:class
.
When using this, the sql:tableName
property may be omitted
and the table name will default to the name of the class.
Here’s an example of a table mapping which includes only instances of
:Person
.
@prefix sql: <tag:stardog:api:sql:> .
@prefix : <http://example.com/> .
:PersonTableMapping a sql:TableMapping ;
sql:class :Person ;
sql:hasField [ sql:property :name ; sql:fieldName "person_name" ] .
Auto-generated Schema Mappings
If the schema mapping named graph is empty and the sql.schema.auto
database
option is set to true
(which is the default value) a default schema mapping
will be created. By default, the
SQL schema mapping will be generated from the RDFS/OWL schema in the database
but the sql.schema.auto.source
option can
be changed to use SHACL shapes definitions for schema generation.
In the default mapping, each RDFS/OWL class will be mapped to a table.
Any property whose rdfs:domain
is set to that class will be added to the
table as a field. The rdfs:range
defined for the property will be used as
the type of the column which will be a primitive SQL type if the range is
a datatype and a foreign key reference if the range is a class.
If sql.schema.auto.source
is set to shacl
then any node shape with a
sh:targetClass
will be mapped to a table. The property shapes defined
on the shape will be added as a field if the path for the property shape
is a predicate. The range defined for the property via sh:class
, sh:datatype
,
or sh:node
will be used as the type of the field. An example of mappings
generated from SHACL constraints can be found in the
stardog tutorials.
The auto-generated mappings will be updated automatically as the database gets
updated unless sql.server.commit.invalidates.schema
option is set to false
.
In either case, auto-generated mappings are never materialized in the special graph
identified by the sql.schema.graph
option. The schema information will still be
available to the SQL tools. If you would like to inspect the auto-generated mappings
you can generate the mappings using the following command:
$ stardog data model --input owl --output sql DB
If you would like to customize the mappings, you can save the output to a file, make changes to the mappings, and load the file into the BI/SQL named graph.
Mapping SPARQL Queries in a Schema Mapping
In addition to table mappings, stored SPARQL SELECT queries can be treated as tables. They will automatically be added to the schema if visible from the database. The following restrictions apply:
-
The
FROM
andFROM NAMED
clauses, which specify the named graphs to include in the query’s dataset, are ignored -
The reasoning setting that is saved with the query is ignored
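For example, a stored SELECT query like the one below would surface as a table whose columns correspond to its projected variables (a sketch reusing the movie data from the analytics examples; the assumption here is that the table appears under the stored query's name):
SELECT ?movie ?director ?year WHERE {
    ?movie :directedBy ?director ;
           :year ?year .
}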
Reasoning and BI Server Queries
Stardog’s reasoning capabilities are available to BI Server
queries. The reasoning schema (see Reasoning with Multiple Schemas)
can be selected at any time during a connection by setting
the reasoning_schema
session variable. You can use the default reasoning schema by executing
set @@reasoning_schema = 'default', or explicitly set a schema by executing
set @@reasoning_schema = 'other_schema'.
The reasoning schema can be set at the connection level in various ways:
-
The
initstmt
option can be set on the MySQL Connector/ODBC connection. -
The
sessionVariables
option can be set on a MySQL Connector/J (JDBC) connection. This can be added directly to the connection URL, egjdbc:mysql://localhost:5806/db?sessionVariables=reasoning_schema='default'
. -
The Tableau MySQL connection dialog has an option in the bottom left called
Initial SQL…
. Clicking this option will open a dialog where the statement to set the reasoning schema can be provided. -
The Power BI MySQL connection dialog has an "Advanced Options" section which can be expanded to reveal a "SQL statement" option.
Debugging BI Server Queries
To see the SPARQL queries that are generated by the BI server, add this
element to the <Loggers>
section of your log4j2.xml
file and restart
Stardog:
<Logger name="com.complexible.stardog.serf.sql.planner.SparqlEnumerator" level="DEBUG" additivity="false">
<AppenderRef ref="stardogAppender"/>
</Logger>
Security
Stardog’s security model is based on standard role-based access control: users have permissions over resources during sessions; permissions can be grouped into roles; and roles can be assigned to users.
Stardog uses Apache Shiro for authentication, authorization, and session management and jBCrypt for password hashing.
Stardog 7.3.1 contains a preview version of encryption at rest. The preview uses AES 256 bit encryption for writing data to disk.
Resources
A resource is some Stardog entity or service to which access is
controlled. Resources are identified by their type and their name. A
particular resource is denoted as type_prefix:name
. The valid resource
types with their prefixes are shown below.
| Resource | Prefix | Description |
| --- | --- | --- |
| User | user | A user (e.g., user:admin) |
| Role | role | A role assigned to a user (e.g., role:reader) |
| Database | db | A database (e.g., db:myDB) |
| Named Graph | named-graph | A named graph (graph subset) (e.g., named-graph:myDB\http://example.org/g1) |
| Virtual Graph | virtual-graph | A virtual graph |
| Database Metadata | metadata | Metadata of a database |
| Database Admin | admin | Database admin tasks (e.g., admin:myDB) |
| Integrity Constraints | icv-constraints | Integrity constraints associated with a database (e.g., icv-constraints:myDB) |
| Sensitive properties | sensitive-properties | Sensitive properties associated with a database |
Permissions
Permissions are composed of a permission subject, an action, and a permission object, which is interpreted as the subject resource can perform the specified action over the object resource.
Permission subjects can be of type user
or role
only. Permission
objects can be of any valid type.
Note
|
write permission in Stardog refers to graph contents, including mutative
operations performed via SPARQL Update (i.e., INSERT , DELETE , etc.). The
other permissions, i.e., create and delete , apply to resources of the system
itself, i.e., users, databases, database metadata, etc.
|
Valid actions include the following:
read
-
Permits reading the resource properties
write
-
Permits changing the resource properties
create
-
Permits creating new resources
delete
-
Permits deleting a resource
grant
-
Permits granting permissions over a resource
revoke
-
Permits revoking permissions over a resource
execute
-
Permits executing administration actions over a database
all
-
Special action type that permits all previous actions over a resource
Wildcards
Stardog understands the use of wildcards to represent sets of resources. A
wildcard is denoted with the character *
. Wildcards can be used to
create complex permissions; for instance, we can give a user the ability
to create any database by granting it a create
permission over db:*
.
Similarly, wildcards can be used in order to revoke multiple permissions
simultaneously.
Superusers
It is possible at user-creation time to specify that a given user is a
superuser. Being a superuser is equivalent to having been granted an
all
permission over every resource, i.e., *:*
. Therefore, as
expected, superusers are allowed to perform any valid action over any
existing (or future) resource.
Database Owner Default Permissions
When a user creates a resource, it is automatically granted delete
,
write
, read
, grant
, and revoke
permissions over the new
resource. If the new resource is a database, then the user is
additionally granted write
, read
, grant
, and revoke
permissions
over icv-constraints:theDatabase
and execute
permission over
admin:theDatabase
. These latter two permissions give the owner of the
database the ability to administer the ICV constraints for the database
and to administer the database itself.
Default Security Configuration
Warning
|
Out of the box, the Stardog security setup is minimal and
insecure: user:admin with password set to "admin" is a
superuser; user:anonymous with password "anonymous" has the "reader"
role; role:reader allows read of any resource.
|
Do not deploy Stardog in production or in hostile environments with the default security settings.
Setting Password Constraints
To setup the constraints used to validate passwords when adding new users,
configure the following settings in the stardog.properties
configuration file.
-
password.length.min
: Sets the password policy for the minimum length of user passwords, the value can’t be less than 1 or greater thanpassword.length.max
. Default:4
. -
password.length.max
: Sets the password policy for the maximum length of user passwords, the value can’t be greater than 1024 or less than 1. Default:20
. -
password.regex
: Sets the password policy of accepted chars in user passwords, via a Java regular expression. Default:[\w@#$%!&]+
Using a Password File
To avoid putting passwords into scripts or environment variables, you can put them into a suitably secured password file. If no credentials are passed explicitly in CLI invocations, or you do not ask Stardog to prompt you for credentials interactively, then it will look for credentials in a password file.
On a Unix system, Stardog will look for a file called .sdpass
in the
home directory of the user Stardog is running as; on a Windows system,
it will look for sdpass.conf
in Application Data\stardog
in the
home directory of the user Stardog is running as. If the file is not
found in these locations, Stardog will look in the location provided
by the stardog.passwd.file
system property.
Password File Format
The format of the password file is as follows:
-
any line that starts with a
#
is ignored -
each line contains a single password in the format:
hostname:port:database:username:password
. -
wildcards,
*
, are permitted for any field but the password field; colons and backslashes in fields are escaped with\
.
For example,
#this is my password file; there are no others like it and this one is mine anyway...
*:*:*:flannery:aNahthu8
*:*:summercamp:jemima:foh9Moaz
Of course you should secure this file carefully, making sure that only the user that Stardog runs as can read it.
Named Graph Security
Stardog’s security model is based on standard RBAC notions: users have permissions over resources during sessions; permissions can be grouped into roles; and roles can be assigned to users. Stardog defines a database resource type so that users and roles can be given read or write access to a database. With Named Graph Security added in Stardog 3.1, Stardog lets you specify which named graphs a user can read from or write to; that is, named graphs are now an explicit resource type in Stardog’s security model.
Example
To grant a user permissions to a named graph,
$ stardog-admin user grant -a read -o named-graph:myDB\http://example.org/g1 myUser
$ stardog-admin user grant -a write -o named-graph:myDB\http://example.org/g2 myUser
Note the use of "\" to separate the name of the database ("myDB") from the named graph identifier ("http://example.org/g1").
Important
|
Named Graph Security is disabled by default (for backwards
compatibility with the installed base). It can be enabled by setting
security.named.graphs=true , either globally in stardog.properties
or per database.
|
Named Graph Operations
Stardog does not support the notion of an empty named graph; thus, there is no operation to create a named graph. Deleting a named graph is simply removing all the triples in that named graph; so it’s also not a special operation. For this reason, only read and write permissions can be used with named graphs and create and delete permissions cannot be used with named graphs.
How Named Graph Permissions Work
The set of named graphs to which a user has read or write access is the union of named graphs for which it has been given explicit access plus the named graphs for which the user’s roles have been given access.
Querying
An effect of named graph permissions is changing the RDF Dataset associated with a query. The default and named graphs specified for an RDF Dataset will be filtered to match the named graphs that a user has read access to.
Note
|
A read query never triggers a security exception due to named graph permissions. The graphs that a user cannot read from would be silently dropped from the RDF dataset for the query, which may cause the query to return no answers, despite there being matching triples in the database. |
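As an illustration, consider a user who has read access to http://example.org/g1 but not to http://example.org/g2 (a sketch of the silent-filtering behavior described above):
# g2 is silently dropped from this user's dataset,
# so only triples in g1 can be matched
SELECT ?g ?s ?p ?o
FROM NAMED <http://example.org/g1>
FROM NAMED <http://example.org/g2>
WHERE {
    GRAPH ?g { ?s ?p ?o }
}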
The RDF dataset for SPARQL update queries will be modified similarly based on read permissions.
Note
|
The RDF dataset for an update query affects only the WHERE clause.
|
Writing
Write permissions are enforced by throwing a security exception whenever a named graph is being updated by a user that does not have write access to the graph. Adding a triple to an unauthorized named graph will raise an exception even if that triple already exists in the named graph. Similarly trying to remove a non-existent triple from an unauthorized graph raises an error.
Note
|
The unauthorized graph may not exist in the database; any graph that is not explicitly listed in a user’s permissions is unauthorized. |
Updates either succeed as a whole or fail. If an update request tries to modify both an authorized graph and an unauthorized graph, it will fail without making any modifications.
Reasoning
Stardog allows a set of named graphs to be used as the schema for reasoning. The OWL axioms and rules defined in these graphs are extracted and used in the reasoning process. The schema graphs are specified in the database configuration and affect all users running reasoning queries.
Named graph permissions do not affect the schema axioms used in reasoning and every reasoning query will use the same schema axioms even though some users might not have been granted explicit read access to schema graphs. But non-schema axioms in those named graphs would not be visible to users without authorization.
Property-based Data Protection
Note
|
This feature is in Beta. |
In addition to named graph security, Stardog 7.3.2 introduces another way of restricting access to sensitive information:
by indicating that only particular users can read values of particular properties (in what follows, we call those properties "sensitive").
The canonical example of such a property would be :ssn
linking a person to their Social Security Number (or other private information)
so that only specific users, like the HR department, can have access to it. All that is required is to
add :ssn
to the list of IRIs set in the value of the security.properties.sensitive
database option and grant
the READ
permission on the sensitive-properties
resource to the right users. After that the query engine will
ensure that SSN values are masked for all other users when they run queries or make API calls which try to access those.
Semantics and Implementation
So what does "protecting access to property values" mean specifically? There is a single list of sensitive properties, P
,
and all users are split into two categories: those having the READ
permission on the sensitive-properties
resource
(for the right database) and those who do not. For the former, all queries run as usual. For the latter, queries return results
as if the database was pre-processed with the following SPARQL Update query:
DELETE { ?subject ?property ?object }
INSERT { ?subject ?property ?masked }
WHERE {
?subject ?property ?object .
FILTER (?property IN (P)) # that is, if the predicate of the triple is sensitive
BIND(mask(?object) AS ?masked)
}
i.e. the graph data looks as if every triple with a sensitive property was replaced by another triple where the object node is masked (obfuscated) by applying a masking function (see the section below).
One important aspect of this definition is that it prevents graph traversals over nodes having sensitive incoming edges but regular outgoing edges. Consider the following little graph:
:john :account :Acc1 .
:Acc1 :opened "2020-05-06"^^xsd:date .
and assume that :account
is a sensitive property. The update query above would transform it into:
:john :account "...e6ac047..." .
:Acc1 :opened "2020-05-06"^^xsd:date .
i.e. it would break the connection between :john
and the attributes of his account by masking the account node. Thus
the following query (if executed by a user not authorized to see :account
values) will not return the expected results:
SELECT ?name ?openDate {
?name :account/:opened ?openDate
}
That may seem counterintuitive at first but, in fact, it’s the only way to prevent attempts to "guess" values of sensitive nodes by a malicious user via queries like:
SELECT ?name ?ssn ?guessed {
?name :ssn ?ssn
VALUES (?ssn ?guessed) { ( "123-12-1111" "123-12-1111") ( "123-12-1112" "123-12-1112") ... }
}
if the system allowed the join over ?ssn
, it would be too late to mask ?ssn
values since they would be revealed by the ?guessed
variable if the attacker has guessed right. To keep things consistent, Stardog also does not allow traversals over
sensitive nodes in property path and path queries.
Implementation-wise Stardog does not make physical changes to the data to mask values of the sensitive properties. Instead it rewrites queries on-the-fly to apply the configured masking function (but only when the current user lacks the permission). That is, a simple query like
select ?s ?ssn { ?s :ssn ?ssn }
would be processed according to this query plan:
Projection(?s, ?ssn) [#1]
`─ Bind(SHA256(Str(id(?ssn))) AS ?ssn)
`─ Scan[POS](?s, <urn:ssn>, ?ssn)
The Bind
operator masks values of the ?ssn
variable by applying the default masking function (SHA256
).
This query rewriting mechanism supports queries executed with reasoning. In that case, masking is done after reasoning to ensure that inferred values of sensitive properties are protected too.
Configuration
This feature is disabled by default, i.e. the default value of security.properties.sensitive
is empty. If it is
non-empty, it should be a comma-separated list of IRIs specifying the protected properties:
stardog-admin metadata set -o security.properties.sensitive=urn:ssn -- myDB
Once the properties are set, the next step is to grant the READ
permission on the sensitive-properties
resource
to the users which are supposed to see the data, for example, using the following CLI command:
stardog-admin user grant -a read -o sensitive-properties:myDB myUser
Now myUser
can see values of urn:ssn
while other regular users would only see masked strings (SHA256
hashes by default).
Finally, it is possible to use a different masking function to apply to sensitive values.
It can be configured using the security.masking.function
database property, e.g. security.masking.function=replace(str(?object),".+","XXXX")
.
The function can be either a constant or any SPARQL function with zero or one argument.
Current Limitations
As said above, this feature is in beta and should not be considered production ready at this time. The following limitations are expected to be addressed before the final release:
-
Values of sensitive nodes can be revealed through zero-length paths (e.g. queries like
?s :p? ?o
) and full-text search. Technically these are violations of the definition based on the update query. However, it should not be possible to see connections of these values to other nodes via sensitive properties. -
The query plan cache should be disabled when using this feature (by setting
query.plan.reuse=never
for the database). Otherwise it is possible that a query plan cached for a user with the permission to see sensitive values is reused for a user without the permission, thus bypassing the masking. -
Only a single list of sensitive properties is supported. It is not possible to restrict access to different properties for different users.
Enterprise Authentication
Stardog can use an LDAP server to authenticate enterprise
users. Stardog assumes the existence of two different groups to
identify regular and superusers, respectively. Groups must be
identified with the cn
attribute and be instances of the
groupOfNames
object class. Users must be specified using the
member
attribute.
For example,
dn: cn=stardogSuperUsers,ou=group,dc=example,dc=com
cn: stardogSuperUsers
objectclass: groupOfNames
member: uid=superuser,ou=people,dc=example,dc=com
dn: cn=stardogModelers,ou=group,dc=example,dc=com
cn: stardogModelers
objectclass: groupOfNames
member: uid=hank,ou=people,dc=example,dc=com
dn: cn=stardogReaders,ou=group,dc=example,dc=com
cn: stardogReaders
objectclass: groupOfNames
member: uid=beth,ou=contractors,dc=example,dc=com
Credentials and other user information are stored as usual:
dn: uid=superuser,ou=people,dc=example,dc=com
objectClass: inetOrgPerson
cn: superuser
sn: superuser
uid: superuser
userPassword: superpassword
dn: uid=hank,ou=people,dc=example,dc=com
objectClass: inetOrgPerson
cn: hank
sn: hank
uid: hank
userPassword: hankpassword
dn: uid=beth,ou=contractors,dc=example,dc=com
objectClass: inetOrgPerson
cn: beth
sn: beth
uid: beth
userPassword: bethpassword
Configuring LDAP
In order to enable LDAP authentication in Stardog, we need to include
the following properties in stardog.properties
:
-
security.realms
: with a value ofldap
-
ldap.provider.url
: The URL of the LDAP server -
ldap.security.principal
: An LDAP user allowed to retrieve group members from the LDAP server -
ldap.security.credentials
: The principal’s password -
ldap.user.dn.templates
*: A list of templates to form LDAP names from Stardog usernames -
ldap.superusers.group
*: The distinguished name of the group identifying Stardog super users -
ldap.role.mappings
*: A mapping from Stardog roles to LDAP groups -
ldap.member.attributes
: Optional list of LDAP attributes that list a group’s members, default tomember,uniquemember
. -
ldap.user.class.filter
*: Optional LDAP expression used to filter user classes. e.g.(&(|(objectClass=user)(objectClass=inetOrgPerson))(!(objectClass=device)))
-
ldap.consistency.scheduler.expression
: Optional cron expression. See Stale Permissions/Roles -
ldap.cache.invalidate.time
: Optional time duration to invalidate cache entries, default to24h
.
Properties tagged with a * are updatable properties.
Here’s an example properties file:
security.realms = ldap
ldap.provider.url = ldap://localhost:5860
ldap.security.principal = uid=admin,ou=people,dc=example,dc=com
ldap.security.credentials = secret
ldap.user.dn.templates = "uid={0},ou=people,dc=example,dc=com","uid={0},ou=contractors,dc=example,dc=com"
ldap.superusers.group = cn=stardogSuperUsers,ou=group,dc=example,dc=com
ldap.role.mappings = "modelers":"cn=stardogModelers,ou=group,dc=example,dc=com","queryers":["cn=stardogReaders,ou=group,dc=example,dc=com","cn=stardogModelers,ou=group,dc=example,dc=com"]
ldap.member.attributes = member
ldap.cache.invalidate.time = 1h
This properties file, when pointing to an LDAP server with the entries listed above, will
create 3 users: superuser
, hank
and beth
. superuser
will be a super user and will
not be a member of any Stardog role. beth
will be a member of the queryers
role and
hank
will be a member of both the modelers
and queryers
roles (note both modelers
and queryers
have a mapping to cn=stardogModelers,ou=group,dc=example,dc=com
).
Permissions can be granted to these users and roles as usual using the
user grant
and
role grant
admin commands.
User Management
When LDAP authentication is enabled, users cannot be added, removed, or modified via Stardog. User management is delegated
to the LDAP server. Roles are defined by mappings to LDAP groups as defined in the
ldap.role.mappings
property and membership in those roles is controlled by membership
in the mapped LDAP groups.
Permissions continue to be assigned to users and roles through Stardog, though it is recommended to manage permissions at the role level rather than at the individual user level. The benefit to doing so is that once role permissions have been set up, new users can be created and granted the appropriate permissions exclusively through LDAP (via their group memberships), without needing to interact with Stardog.
Authenticated User Cache
Stardog includes a time-constrained cache of authenticated users with a configurable eviction time,
which defaults to 24 hours. To disable the cache, the eviction time must be set to
0ms
.
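For example, to disable the authenticated user cache, set the eviction time to zero in stardog.properties:
ldap.cache.invalidate.time = 0ms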
Authorization
Authorization is controlled by user and role permissions in Stardog, while role membership is controlled by membership in LDAP groups.
Stale Permissions/Roles
Permissions and roles in Stardog might refer to users that no longer exist, i.e.,
those that were deleted from the LDAP server. This is safe as these users will
not be able to authenticate (see above). It is possible to configure Stardog to
periodically clean up the list of permissions and roles according to the latest
users in the LDAP server. In order to do this, we pass a
Quartz
cron expression using the ldap.consistency.scheduler.expression
property:
## Execute the consistency cleanup at 6pm every day
ldap.consistency.scheduler.expression = 0 0 18 * * ?
Managing Stardog Securely
Stardog resources can be managed securely by using the tools included in the admin CLI or by programming against Stardog APIs. In this section we describe the permissions required to manage various Stardog resources either by CLI or API.
Users
- Create a user
-
create
permission overuser:*
. Only superusers can create other superusers. - Delete a user
-
delete
permission over the user. - Enable/Disable a user
-
User must be a superuser.
- Change password of a user
-
User must be a superuser or user must be trying to change its own password.
- Check if a user is a superuser
-
read
permission over the user or user must be trying to get its own info. - Check if a user is enabled
-
read
permission over the user or user must be trying to get its own info. - List users
-
Superusers can see all users. Other users can see only users over which they have a permission.
Roles
- Create a role
-
create
permission overrole:*
. - Delete a role
-
delete
permission over the role. - Assign a role to a user
-
grant
permission over the role and user must have all the permissions associated to the role. - Unassign a role from a user
-
revoke
permission over the role and user must have all the permissions associated to the role. - List roles
-
Superusers can see all roles. Other users can see only roles they have been assigned or over which they have a permission.
Databases
- Create a database
-
create
permission overdb:*
. - Delete a database
-
delete
permission overdb:theDatabase
. - Add/Remove integrity constraints to a database
-
write
permission overicv-constraints:theDatabase
. - Verify a database is valid
-
read
permission overicv-constraints:theDatabase
. - Online/Offline a database
-
execute
permission overadmin:theDatabase
. - Migrate a database
-
execute
permission overadmin:theDatabase
. - Optimize a database
-
execute
permission overadmin:theDatabase
. - List databases
-
Superusers can see all databases. Regular users can see only databases over which they have a permission.
Permissions
- Grant a permission
-
grant
permission over the permission object and user must have the permission that it is trying to grant. - Revoke a permission from a user or role over an object resource
-
revoke
permission over the permission object and user must have the permission that it is trying to revoke. - List user permissions
-
User must be a superuser or user must be trying to get its own info.
- List role permissions
-
User must be a superuser or user must have been assigned the role.
Deploying Stardog Securely
To ensure that Stardog’s RBAC access control implementation will be effective, all non-administrator access to Stardog databases should occur over network (i.e., non-native) database connections.[29]
To ensure the confidentiality of user authentication credentials when using remote connections, the Stardog server should only accept connections that are encrypted with SSL.
Configuring Stardog to use SSL
Stardog HTTP server includes native support for SSL. To enable Stardog to
optionally support SSL connections, just pass --enable-ssl
to the server start
command. If you want to require the server to use SSL only, that is, to reject
any non-SSL connections, then use --require-ssl
.
When starting from the command line, Stardog will use the standard Java properties for specifying keystore information:
-
javax.net.ssl.keyStorePassword
(the password) -
javax.net.ssl.keyStore
(location of the keystore) -
javax.net.ssl.keyStoreType
(type of keystore, defaults to JKS)
These properties are checked first in stardog.properties
; then in JVM args
passed in from the command line, e.g. -Djavax.net.ssl.keyStorePassword=mypwd
. If
you’re creating a Server programmatically via ServerBuilder
, you can specify
values for these properties using the appropriate ServerOptions
when creating
the server. These values will override anything specified in
stardog.properties
or via normal JVM args.
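As a minimal sketch, the server can be started with SSL required and the keystore settings passed as JVM args; the keystore path and password below are placeholders for your own keystore:
$ export STARDOG_SERVER_JAVA_ARGS="-Djavax.net.ssl.keyStore=/path/to/keystore.jks -Djavax.net.ssl.keyStorePassword=mypwd"
$ stardog-admin server start --require-ssl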
Configuring Stardog Client to use SSL
Stardog HTTP client supports SSL when the https:
scheme is used in the
database connection string. For example, the following invocation of the Stardog
command line utility will initiate an SSL connection to a remote database:
$ stardog query https://stardog.example.org/sp2b_10k "ask { ?s ?p ?o }"
If the client is unable to authenticate to the server, then the connection will fail and an error message like the following will be generated.
Error during connect. Cause was SSLPeerUnverifiedException: peer not authenticated
The most common cause of this error is that the server presented a certificate that was not issued by an authority that the client trusts. The Stardog client uses standard Java security components to access a store of trusted certificates. By default, it trusts a list of certificates installed with the Java runtime environment, but it can be configured to use a custom trust store.[30]
The client can be directed to use a specific Java KeyStore file as a
trust store by setting the javax.net.ssl.trustStore
system property. To
address the authentication error above, that trust store should contain the
issuer of the server’s certificate. Standard Java tools can create such a file.
The following invocation of the keytool
utility creates a new trust store
named my-truststore.jks
and initializes it with the certificate in
my-trusted-server.crt
. The tool will prompt for a passphrase to associate with
the trust store. This is not used to encrypt its contents, but can be used to
ensure its integrity.[31]
$ keytool -importcert -keystore my-truststore.jks -alias stardog-server -file my-trusted-server.crt
The following Stardog command line invocation uses the newly created truststore.
$ STARDOG_SERVER_JAVA_ARGS="-Djavax.net.ssl.trustStore=my-truststore.jks"
$ stardog query https://stardog.example.org/sp2b_10k "ask { ?s ?p ?o }"
For custom Java applications that use the Stardog client, the system property can be set programmatically or when the JVM is initialized.
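For example, a custom client application can set the property before opening any connections; the file name here is the trust store created in the keytool example above:
// point the client JVM at the trust store before any Stardog connections are created
System.setProperty("javax.net.ssl.trustStore", "my-truststore.jks");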
The most common deployment approach requiring a custom trust store is
when a self-signed certificate is presented by the Stardog server. For
connections to succeed, the Stardog client must trust the self-signed
certificate. To accomplish this with the examples given above, the
self-signed certificate should be in the my-trusted-server.crt
file in
the keytool invocation.
A client may also fail to authenticate to the server if the hostname in the Stardog database connection string does not match a name contained in the server certificate.[32]
This will cause an error message like the following:
Error during connect. Cause was SSLException: hostname in certificate didn't match
The client does not support connecting when there’s a mismatch; therefore, the only workarounds are to replace the server’s certificate or modify the connection string to use an alias for the same server that matches the certificate.
Encryption at Rest
Warning
|
This feature is not ready to be used in production and is provided as a preview to gather feedback from users. |
Stardog 7.4.1 contains a preview version of encryption at rest. This feature uses AES 256 bit encryption for writing data to disk. The encryption support requires the user’s environment to include the openssl libcrypto library. This library is available and tested for both Linux and OSX operating environments. Windows is not currently supported.
Data Keys
When encryption at rest is enabled, customer data is written to disk
only after it is encrypted by a data key
. Data keys are
AES 256 bit keys that are generated and managed by Stardog upon
the user’s request. A new data key can be requested with the
stardog-admin
tool in the following way:
stardog-admin encryption new-key <key name>
If a data key does not already exist, new-key
will create one. If one already exists, a new key is created and marked for use with
future data. All previously encrypted data is left as is, and
the old key remains in Stardog for use with the data
that it encrypted.
At any time data encryption can be disabled with the following command:
stardog-admin encryption disable
Note that this will only stop data that is added in the future from being encrypted. All of the data that was previously added will remain encrypted by the key that was active at the time of its insertion into Stardog.
Encryption Password
While data keys are a secure means of encrypting the data that is inserted into Stardog, the keys themselves are not encrypted and are thus secrets fully known to and managed by Stardog. To overcome this, Stardog encourages users to associate a password with their encryption keys. This can be done with the following command:
stardog-admin encryption set-password
When a password is set, a new AES 256 bit key is created and encrypted with the given password. Stardog does not manage or retain this password; it is a secret known only to the user and thus should be managed appropriately. The new encrypted key is then used to encrypt all of the existing data keys. In this way, the data keys are protected at rest without disrupting the data that they encrypted.
Once an encryption password is set, it must be provided for every encryption command. Further, it must be passed in every time Stardog is started.
An encryption password can be changed and disabled at any time with the following commands:
stardog-admin encryption change-password
stardog-admin encryption remove-password
Note: Stardog does not retain this password nor does it provide a means to reset it. If you use an encryption password you must remember it.
Example Session
Start encrypting data by adding a new key:
$ stardog-admin encryption new-key first-key
$ stardog-admin encryption list-keys
first-key : 34A6B1DD7C1EA00540C4C0B22A445D2A7C6CE46577BA670BEE942E1BEEB07031
At this point any incoming data will be encrypted with the key named
first-key
. However, the key itself is not encrypted, so we want
to add an encryption password with the following command:
$ stardog-admin encryption set-password
Password:
Password:
Now the data keys are all encrypted and thus to access them we will
require that same password. The -R
option tells the stardog-admin
command to ask for the password:
$ stardog-admin encryption list-keys -R
Current Password:
first-key : CE1991DC7F8AC3D25EFFE992CDEF91017E86175D408D39FEC0F7D3933DD4D4A1
Now we will add a new key:
$ stardog-admin encryption new-key next-key
Invalid argument: The passphrase did not unlock the database
In this case we get an error because we failed to provide the now-required
password, which we now supply with the -R
option:
$ stardog-admin encryption new-key next-key -R
Current Password:
$ stardog-admin encryption list-keys -R
Current Password:
next-key : 0713B222DF761A53DB015DA6D5CE973B2879941C737201C499CB7E41BAE65FD3
first-key : CE1991DC7F8AC3D25EFFE992CDEF91017E86175D408D39FEC0F7D3933DD4D4A1
If we wish to now disable data encryption for all future writes we can do so with the following command:
$ stardog-admin encryption disable -R
Current Password:
$ stardog-admin encryption list-keys -R
Current Password:
Disable write :
next-key : 0713B222DF761A53DB015DA6D5CE973B2879941C737201C499CB7E41BAE65FD3
first-key : CE1991DC7F8AC3D25EFFE992CDEF91017E86175D408D39FEC0F7D3933DD4D4A1
Notice that Disable write
is listed as the current key. This means that
incoming data will not be encrypted.
Additional security notes
-
network communications: The stardog-admin program communicates passwords and passphrases to the server via HTTP headers. These network communications are vulnerable to copy and replay. The use of SSL communications via --require-ssl is strongly recommended.
How to install libcrypto
Apple OSX: brew install openssl
Debian / Ubuntu: sudo apt-get install openssl
CentOS: sudo yum install openssl
Future encryption features
-
encryption of temporary files created during large imports: Stardog often creates "spill files" when ingesting data in bulk. The files are temporary in nature. During the time the files exist on disk an outsider could copy files to gain access to the unprotected data. This vulnerability will be addressed.
-
user configuration properties: the feature preview supports only one possible encryption library and only one of the library’s supported encryption algorithms. These parameters and others identified by users of the feature preview will become configurable properties in a future release.
Apache NiFi
Installation
To install NiFi and the Stardog connector:
-
Go to http://nifi.apache.org/download.html and download the latest binary release (nifi-1.12.0-bin.zip as of this writing).
-
Decompress the zip file to a local folder.
-
Download the Stardog NiFi nar files from http://downloads.stardog.com/extras/stardog-extras-7.4.5.zip into the
lib
folder in the NiFi installation folder.
Running NiFi
Start the NiFi server by running the command bin/nifi.sh start
in the NiFi installation
folder. It takes up to a minute for the NiFi server to start. Once the server is running
you can go to the URL http://localhost:8080/nifi in your browser, which will show the
empty workflow. You can drag the processor icon from the top left to the empty canvas
and add a Stardog processor:

Once the processor is added you can change the parameters to specify the Stardog server to connect to, credentials, etc. See the following example for more details:
Example NiFi Workflow
An example NiFi workflow is provided in the Stardog Examples GitHub repository. The workflow loads the Covid-19 dataset published by the New York Times on GitHub into Stardog. It contains three processors:
-
NiFi built-in processor to retrieve the CSV file from GitHub.
-
StardogPut processor that ingests the CSV file into a staging graph in Stardog using the Stardog mappings available in the examples repository.
-
StardogQuery processor that copies the staging graph to the default graph and updates the last modification time.
Follow these steps to upload this workflow to your NiFi instance (see the screencast below and refer to Apache NiFi user interface for terminology):
-
From the Operate Palette, click the "Upload Template" button and select the
covid19-stardog.xml
file. -
Drag the "Template" icon from the Components Toolbar onto the canvas.
-
Unselect the processors by clicking an empty spot on the canvas and then select the
StardogPut
processor to configure the connection details and point to the correct location for the mappings file nyt-covid.sms
. -
Modify the connection details for the
StardogQuery
processor in a similar way.

The example is configured to run every hour, so if you leave NiFi running, the data will be fetched, transformed, and uploaded into Stardog every hour.
As an alternative to supplying the Stardog URL and credentials in every Stardog processor, you can configure the Stardog Connection Service once and then reference that service in each Stardog processor.
Programming Stardog
Sample Code
There’s a Github repo of example Java code that you can fork and use as the starting point for your Stardog projects. Feel free to add new examples using pull requests in Github.
Java Programming
In the Network Programming section, we look at how to interact with Stardog over a network via HTTP. In this chapter we describe how to program Stardog from Java using SNARL (the Stardog Native API for the RDF Language), Sesame, and Jena. We prefer SNARL to Sesame to Jena and recommend them—all other things being equal—in that order.
If you’re a Spring developer, you might want to read Spring Programming; or if you prefer an ORM-style approach, you might want to check out Empire, an implementation of JPA for RDF that works with Stardog.
Examples
The best way to learn to program Stardog with Java is to study the examples:
We offer some commentary on the interesting parts of these examples below.
Creating & Administering Databases
AdminConnection
provides simple programmatic access to all administrative
functions available in Stardog.
Creating a Database
You can create an empty database with default configuration options in one line of code:
try (AdminConnection aAdminConnection = AdminConnectionConfiguration.toEmbeddedServer().credentials("admin", "admin").connect()) {
aAdminConnection.newDatabase("testConnectionAPI").create();
}
Warning
|
It’s crucially important to always clean up connections to the
database by calling AdminConnection#close(). Using try-with-resources where
possible is a good practice.
|
The
newDatabase
function returns a
DatabaseBuilder
object which you can use to configure the options of the database you’d like to
create. The
create
function takes the list of files to bulk load into the database when you create it
and returns a valid
ConnectionConfiguration
which can be used to create new
Connections
to
your database.
try (AdminConnection aAdminConnection = AdminConnectionConfiguration.toEmbeddedServer().credentials("admin", "admin").connect()) {
aAdminConnection.newDatabase("waldoTest")
.set(SearchOptions.SEARCHABLE, true)
.create();
}
This illustrates how to create a database named waldoTest
which supports full text search via [Searching].
try (AdminConnection aAdminConnection = AdminConnectionConfiguration.toEmbeddedServer().credentials("admin", "admin").connect()) {
aAdminConnection.newDatabase("icvWithGuard") // disk db named 'icvWithGuard'
.set(ICVOptions.ICV_ENABLED, true) // enable icv guard mode
.set(ICVOptions.ICV_REASONING_ENABLED, true) // specify that guard mode should use reasoning
.create(Paths.get("data/sp2b_10k.n3")); // create the db, bulk loading the file(s) to start
}
This illustrates how to create a persistent disk database with ICV guard mode
and reasoning enabled. For more information on what the available
options for set
are and what they mean, see the Database Admin section.
Also note, Stardog database administration can be performed from the CLI.
Creating a Connection String
As you can see, the
ConnectionConfiguration
class in the
com.complexible.stardog.api
package is where the initial action takes place:
Connection aConn = ConnectionConfiguration
.to("exampleDB") // the name of the db to connect to
.credentials("admin", "admin") // credentials to use while connecting
.connect();
The
to
method takes a Database Name
as a string; and then
connect
connects to the database using all specified properties on the configuration.
This class and its constructor methods are used for all of Stardog’s Java
APIs: SNARL native Stardog API, Sesame, Jena, as well as HTTP. In the latter
cases, you must also call
server
and pass it a valid URL to the Stardog server using HTTP.
Without the call to server
, ConnectionConfiguration
will attempt
to connect to a local, embedded version of the Stardog server. The
Connection
still operates in the standard client-server mode, the only
difference is that the server is running in the same JVM as your
application.
Note
|
Whether using SNARL, Sesame, or Jena, most, if not all,
Stardog Java code will use ConnectionConfiguration to get a handle on
a Stardog database—whether embedded or remote—and, after getting that
handle, can use the appropriate API.
|
See the
ConnectionConfiguration
API docs or How to Make a Connection String for more information.
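For example, a client-server connection that calls server looks like the following; the URL is a placeholder for your own deployment (5820 is Stardog’s default port):
Connection aConn = ConnectionConfiguration
.to("exampleDB") // the name of the db to connect to
.server("http://localhost:5820") // URL of the remote Stardog server
.credentials("admin", "admin") // credentials to use while connecting
.connect();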
Managing Security
We discuss the security system in Stardog in Security. When logged into the Stardog DBMS you can access all security related features detailed in the security section using any of the core security interfaces for managing users, roles, and permissions.
Using SNARL
In examples 1 and 4 above, you can see how to use SNARL in Java to interact with Stardog. The SNARL API will give the best performance overall and is the native Stardog API. It uses some Sesame domain classes but is otherwise a clean-sheet API and implementation.
The SNARL API is fluent with the aim of making code written for Stardog easier to write and easier to maintain. Most objects are easily re-used to make basic tasks with SNARL as simple as possible. We are always interested in feedback on the API, so if you have suggestions or comments, please send them to the mailing list.
Let’s take a closer look at some of the interesting parts of SNARL.
Adding Data
aConn.begin();
aConn.add()
.io()
.file(Paths.get("data/test.ttl"));
Collection<Statement> aGraph = Collections.singleton(
Values.statement(Values.iri("urn:subj"),
Values.iri("urn:pred"),
Values.iri("urn:obj")));
Resource aContext = Values.iri("urn:test:context");
aConn.add().graph(aGraph, aContext);
aConn.commit();
You must always enclose changes to a database within a transaction begin and commit or rollback. Changes are local until the transaction is committed or until you try and perform a query operation to inspect the state of the database within the transaction.
By default, RDF added will go into the default context unless specified
otherwise. As shown, you can use
Adder directly
to add statements and graphs to the database; and if you want to add
data from a file or input stream, you use the
io
, format
,
and stream
chain of method invocations.
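For instance, here is a sketch of adding Turtle data from an InputStream using that chain; the file path and format are illustrative:
// add Turtle read from a stream, inside a transaction as usual
aConn.begin();
aConn.add()
.io()
.format(RDFFormats.TURTLE)
.stream(new FileInputStream("data/test.ttl"));
aConn.commit();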
See the SNARL API Javadocs for all the gory details.
Removing Data
// first start a transaction
aConn.begin();
aConn.remove()
.io()
.file(Paths.get("data/remove_data.nt"));
// and commit the change
aConn.commit();
Let’s look at
removing data
via SNARL; in the example above, you can see that file or stream-based
removal is symmetric to file or stream-based addition, i.e., calling
remove
in an io
chain with a file or stream call. See the SNARL
API docs for more details about finer-grained deletes, etc.
Parameterized SPARQL Queries
// A SNARL connection provides parameterized queries which you can use to easily
// build and execute SPARQL queries against the database. First, let's create a
// simple query that will get all of the statements in the database.
SelectQuery aQuery = aConn.select("select * where { ?s ?p ?o }");
// But getting *all* the statements is kind of silly, so let's actually specify a limit, we only want 10 results.
aQuery.limit(10);
// We can go ahead and execute this query which will give us a result set. Once we have our result set, we can do
// something interesting with the results.
// NOTE: We use try-with-resources here to ensure that our results sets are always closed.
try(SelectQueryResult aResult = aQuery.execute()) {
System.out.println("The first ten results...");
QueryResultWriters.write(aResult, System.out, TextTableQueryResultWriter.FORMAT);
}
// Query objects are easily parameterized; so we can bind the "s" variable in the previous query with a specific value.
// Queries should be managed via the parameterized methods, rather than created by concatenating strings together,
// because that is not only more readable, it helps avoid SPARQL injection attacks.
IRI aIRI = Values.iri("http://localhost/publications/articles/Journal1/1940/Article1");
aQuery.parameter("s", aIRI);
// Now that we've bound 's' to a specific value, we're not going to pull down the entire database with our query
// so we can go head and remove the limit and get all the results.
aQuery.limit(SelectQuery.NO_LIMIT);
// We've made our modifications, so we can re-run the query to get a new result set and see the difference in the results.
try(SelectQueryResult aResult = aQuery.execute()) {
System.out.println("\nNow a particular slice...");
QueryResultWriters.write(aResult, System.out, TextTableQueryResultWriter.FORMAT);
}
SNARL also lets us parameterize SPARQL queries. We can make a Query
object by
passing a SPARQL query in the constructor. Simple. Obvious.
Next, let’s set a limit for the results: aQuery.limit(10)
; or if we
want no limit, aQuery.limit(SelectQuery.NO_LIMIT)
. By default, there is no
limit imposed on the query object; we’ll use whatever is specified in
the query. But you can use limit to override any limit specified in the
query, however specifying NO_LIMIT will not remove a limit specified in
a query, it will only remove any limit override you’ve specified,
restoring the state to the default of using whatever is in the query.
We can execute that query with execute
and iterate over the
results. We can also rebind the "?s" variable easily:
aQuery.parameter("s", aIRI)
, which will work for all instances of "?s"
in any BGP in the query, and you can specify null
to remove the
binding.
Query objects are re-usable, so you can create one from your original query string and alter bindings, limit, and offset in any way you see fit and re-execute the query to get the updated results.
We strongly recommend the use of SNARL’s parameterized queries over concatenating strings together in order to build your SPARQL query. This latter approach opens up the possibility for SPARQL injection attacks unless you are very careful in scrubbing your input.[33]
Getter Interface
aConn.get()
.subject(aURI)
.statements()
.forEach(System.out::println);
// `Getter` objects are parameterizable just like `Query`, so you can easily modify and re-use them to change
// what slice of the database you'll retrieve.
Getter aGetter = aConn.get();
// We created a new `Getter`, if we iterated over its results now, we'd iterate over the whole database; not ideal. So
// we will bind the predicate to `rdf:type` and now if we call any of the iteration methods on the `Getter` we'd only
// pull back statements whose predicate is `rdf:type`
aGetter.predicate(RDF.TYPE);
// We can also bind the subject and get a specific type statement, in this case, we'll get all the type triples
// for *this* individual. In our example, that'll be a single triple.
aGetter.subject(aURI);
System.out.println("\nJust a single statement now...");
aGetter.statements()
.forEach(System.out::println);
// `Getter` objects are stateful, so we can remove the filter on the predicate position by setting it back to null.
aGetter.predicate(null);
// Subject is still bound to the value of `aURI` so we can use the `graph` method of `Getter` to get a graph of all
// the triples where `aURI` is the subject, effectively performing a basic describe query.
Stream<Statement> aStatements = aGetter.statements();
System.out.println("\nFinally, the same results as earlier, but as a graph...");
RDFWriters.write(System.out, RDFFormats.TURTLE, aStatements.collect(Collectors.toList()));
SNARL also supports some sugar for the classic statement-level
getSPO
--scars, anyone?--interactions. In the first line of
the snippet above we ask the Stardog connection for the statements with
aURI
in the subject position and then iterate over the results.
You can also parameterize a Getter
by binding different statement positions, which acts like a kind of RDF
statement filter, and then iterate as
usual.
Note
|
Iterators and streams obtained from the connection should be closed, which is important for Stardog databases to
avoid memory leaks. If you need to materialize the results as a graph, you can
do that by calling graph .
|
The snippet doesn’t show object
or context
parameters on a
Getter
, but those work, too, in the obvious way.
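Binding those positions follows the same pattern; the values below are hypothetical:
// filter by object value and by named graph (context)
aGetter.object(Values.literal("some value"));
aGetter.context(Values.iri("urn:test:context"));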
Reasoning
Stardog supports query-time reasoning using a
query rewriting technique. In short, when reasoning is requested, a query
is automatically rewritten to n queries, which are then executed. As
we discuss below in Connection Pooling, reasoning is enabled at the
Connection
layer and then any queries executed over that connection
are executed with reasoning enabled; you don’t need to do anything up
front when you create your database if you want to use reasoning.
ReasoningConnection aReasoningConn = ConnectionConfiguration
.to("reasoningExampleTest")
.credentials("admin", "admin")
.reasoning(true)
.connect()
.as(ReasoningConnection.class);
In this code example, you can see that it’s trivial to enable reasoning
for a Connection
: simply call reasoning
with true
passed in.
Search
Stardog’s search system can be used from Java. The fluent Java API
for searching in SNARL looks a lot like the other search interfaces: We create a
Searcher
instance with a fluent constructor: limit
sets a limit on the
results; query
contains the search query, and threshold
sets a minimum
threshold for the results.
// Let's create a Searcher that we can use to run some full text searches over the database.
// Here we will specify that we only want results over a score of `0.5`, and no more than `50` results
// for things that match the search term `mac`. Stardog's full text search is backed by [Lucene](http://lucene.apache.org)
// so you can use the full Lucene search syntax in your queries.
Searcher aSearch = aSearchConn.search()
.limit(50)
.query("mac")
.threshold(0.5);
// We can run the search and then iterate over the results
SearchResults aSearchResults = aSearch.search();
try (CloseableIterator<SearchResult> resultIt = aSearchResults.iterator()) {
System.out.println("\nAPI results: ");
while (resultIt.hasNext()) {
SearchResult aHit = resultIt.next();
System.out.println(aHit.getHit() + " with a score of: " + aHit.getScore());
}
}
// The `Searcher` can be re-used if we want to find the next set of results. We already found the
// first fifty, so lets grab the next page.
aSearch.offset(50);
aSearchResults = aSearch.search();
Then we call the search
method of our Searcher
instance and
iterate over the results, i.e., SearchResults
. Last, we can use
offset
on an existing Searcher
to grab another page of results.
Stardog also supports performing searches over the full-text index within a
SPARQL query via the LARQ SPARQL
syntax. This provides a powerful mechanism for querying both your RDF index and
full-text index at the same time while also giving you a more performant option
to the SPARQL regex
filter.
User-defined Lucene Analyzer
Stardog’s Semantic Search capability uses Lucene’s
default
text analyzer, which may not be ideal for your data or application. You can
implement a custom analyzer that Stardog will use by implementing
org.apache.lucene.analysis.Analyzer
. That lets you customize Stardog to
support different natural languages, domain-specific stop
word lists, etc.
See Custom Analyzers in the stardog-examples Github repo for a complete description of the API, registry, sample code, etc.
User-defined Functions and Aggregates
Stardog may be extended via Function and Aggregate extensibility APIs, which are fully documented, including sample code, in the stardog-examples Github repo section about function extensibility.
In short you can extend Stardog’s SPARQL query evaluation with custom
functions and aggregates easily. Function extensibility corresponds to
built-in expressions used in FILTER
, BIND
and SELECT
expressions, as well as aggregate operators in a SPARQL query like
COUNT
or SAMPLE
.
SNARL Connection Views
SNARL
Connections
support obtaining a specified type of Connection
. This lets you extend and
enhance the features available to a Connection
while maintaining the standard,
simple Connection API. The Connection
as
method
takes as a parameter the interface, which must be a sub-type of a Connection
,
that you would like to use. as
will either return the Connection
as the view
you’ve specified, or it will throw an exception if the view could not be
obtained for some reason.
An example of obtaining an instance of a
SearchConnection
to use Stardog’s full-text search support would look like this:
SearchConnection aSearchConn = aConn.as(SearchConnection.class);
SNARL API Docs
Please see SNARL API docs for more information.
Using Sesame
Stardog supports the Sesame API; thus, for the most part, using Stardog and Sesame is not much different from using Sesame with other RDF databases. There are, however, at least two differences worth pointing out.
Wrapping connections with StardogRepository
// Create a Sesame Repository from a Stardog ConnectionConfiguration. The configuration will be used
// when creating new RepositoryConnections
Repository aRepo = new StardogRepository(ConnectionConfiguration
.to("testSesame")
.credentials("admin", "admin"));
// init the repo
aRepo.initialize();
// now you can use it like a normal Sesame Repository
RepositoryConnection aRepoConn = aRepo.getConnection();
// always best to turn off auto commit
aRepoConn.setAutoCommit(false);
As you can see from the code snippet, once you’ve created a
ConnectionConfiguration
with all the details for connecting to a
Stardog database, you can wrap that in a StardogRepository
which is a
Stardog-specific implementation of the Sesame Repository
interface. At
this point, you can use the resulting Repository
like any other Sesame
Repository
implementation. Each time you call
Repository.getConnection
, your original ConnectionConfiguration
will
be used to spawn a new connection to the database.
Autocommit
Stardog’s RepositoryConnection
implementation will, by default, disable
autoCommit
status. When enabled, every single statement added or
deleted via the Connection
will incur the cost of a transaction, which
is too heavyweight for most use cases. You can enable
autoCommit
and it will work as expected; but we recommend
leaving it disabled.
Using RDF4J
Stardog also supports RDF4J, the follow-up to Sesame. Its use is nearly identical to the Stardog Sesame API, mostly with package name updates.
Wrapping connections with StardogRepository
The RDF4J API uses com.complexible.stardog.rdf4j.StardogRepository
, which works
the same way as the Sesame StardogRepository
mentioned above. Its constructor will
take either a ConnectionConfiguration
like Sesame’s or a Connection String.
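A minimal sketch, assuming a database named testRDF4J and using the standard RDF4J Repository API, looks much like the Sesame version above:
// create an RDF4J Repository backed by a Stardog ConnectionConfiguration
Repository aRepo = new StardogRepository(ConnectionConfiguration
.to("testRDF4J")
.credentials("admin", "admin"));
aRepo.initialize();
RepositoryConnection aRepoConn = aRepo.getConnection();
// use explicit transactions rather than relying on autoCommit (see below)
aRepoConn.begin();
// ... add or query data ...
aRepoConn.commit();
aRepoConn.close();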
Autocommit
The major difference between the RDF4J and Sesame APIs is that the RDF4J one
will leave the autoCommit
mode ON by default, instead of disabling it. This is because
as of RDF4J’s 2.7.0 release, they have deprecated the setAutoCommit
method in favor
of assuming it to be always on unless begin()/commit()
are used, which we still VERY
highly recommend.
Using Jena
Stardog supports Jena via a Sesame-Jena bridge, so it’s got more overhead than Sesame or SNARL. YMMV. There are two points in the Jena example to emphasize.
Init in Jena
// obtain a Jena model for the specified stardog database connection. Just creating an in-memory
// database; this is roughly equivalent to ModelFactory.createDefaultModel.
Model aModel = SDJenaFactory.createModel(aConn);
The initialization in Jena is a bit different from either SNARL or
Sesame; you can get a Jena Model
instance by passing the Connection
instance returned by ConnectionConfiguration
to the Stardog factory,
SDJenaFactory
.
Add in Jena
// start a transaction before adding the data. This is not required,
// but it is faster to group the entire add into a single transaction rather
// than rely on the auto commit of the underlying stardog connection.
aModel.begin();
// read data into the model. Note: this will add one statement at a time.
// Bulk loading needs to be performed directly with the BulkUpdateHandler provided
// by the underlying graph, or by reading in files in RDF/XML format, which uses the
// bulk loader natively. Alternatively, you can load data into the Stardog
// database using SNARL, or via the command line client.
aModel.getReader("N3").read(aModel, new FileInputStream("data/sp2b_10k.n3"), "");
// done!
aModel.commit();
Jena also wants to add data to a Model
one statement at a time, which
can be less than ideal. To work around this restriction, we recommend
adding data to a Model
in a single Stardog transaction, which is
initiated with aModel.begin
. Then to read data into the model, we
recommend using RDF/XML, since that triggers the BulkUpdateHandler
in
Jena, or grabbing a BulkUpdateHandler
directly from the underlying Jena
graph.
The other options include using the Stardog CLI client to bulk load a Stardog database or to use SNARL for loading and then switch to Jena for other operations, processing, query, etc.
Client-Server Stardog
Using Stardog from Java in either embedded or
client-server mode is very similar--the only visible difference
is the use of url
in a ConnectionConfiguration
: when it’s present,
we’re in client-server mode; else, we’re in embedded mode.
That’s a good and a bad thing: it’s good because the code is symmetric and uniform. It’s bad because it can make reasoning about performance difficult, i.e., it’s not entirely clear in client-server mode which operations trigger or don’t trigger a round trip with the server and, thus, which may be more expensive than they are in embedded mode.
In client-server mode, everything triggers a round trip with these exceptions:
-
closing a connection outside a transaction
-
any parameterizations or other modifications of a query or getter instance
-
any database state mutations in a transaction that don’t need to be immediately visible to the transaction; that is, changes are sent to the server only when they are required, on commit, or on any query or read operation that needs to have the accurate up-to-date state of the data within the transaction.
Stardog generally tries to be as lazy as possible; but in client-server mode, since state is maintained on the client, there are fewer chances to be lazy and more interactions with the server.
Connection Pooling
Stardog supports connection pools for SNARL Connection
objects for
efficiency and programmer sanity. Here’s how they work:
// We need a configuration object for our connections, this is all the information about
// the database that we want to connect to.
ConnectionConfiguration aConnConfig = ConnectionConfiguration
.to("testConnectionPool")
.credentials("admin", "admin");
// We want to create a pool over these objects. See the javadoc for ConnectionPoolConfig for
// more information on the options and information on the defaults.
ConnectionPoolConfig aConfig = ConnectionPoolConfig
.using(aConnConfig) // use my connection configuration to spawn new connections
.minPool(10) // the number of objects to start my pool with
.maxPool(1000) // the maximum number of objects that can be in the pool (leased or idle)
.expiration(1, TimeUnit.HOURS) // Connections can expire after being idle for 1 hr.
.blockAtCapacity(1, TimeUnit.MINUTES); // I want obtain to block for at most 1 min while trying to obtain a connection.
// now i can create my actual connection pool
ConnectionPool aPool = aConfig.create();
// if I want a connection object...
Connection aConn = aPool.obtain();
// now I can feel free to use the connection object as usual...
// and when I'm done with it, instead of closing the connection, I want to return it to the pool instead.
aPool.release(aConn);
// and when I'm done with the pool, shut it down!
aPool.shutdown();
Per standard practice, we first create a connection configuration, in
this case for the testConnectionPool
database. Then we set up a
ConnectionPoolConfig
, using its fluent API, which establishes the parameters
of the pool:
using
|
Sets which ConnectionConfiguration we want to pool; this is what is used to actually create the connections. |
minPool , maxPool
|
Establishes min and max pooled objects; max pooled objects includes both leased and idled objects. |
expiration
|
Sets the idle life of objects; in this case, the pool reclaims objects idled for 1 hour. |
blockAtCapacity
|
Sets the max time in minutes that we’ll block waiting for an object when there aren’t any idle ones in the pool. |
Whew! Next we can create
the pool using this ConnectionPoolConfig
thing.
Finally, we call obtain
on the ConnectionPool
when we need a new
one. And when we’re done with it, we return it to the pool so it can be
re-used, by calling release
. When we’re done, we shutdown
the
pool.
Since reasoning in Stardog is enabled per Connection
, you
can create two pools: one with reasoning connections, one with
non-reasoning connections; and then use the one you need to have
reasoning per query; never pay for more than you need.
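Here is a sketch of that pattern, reusing the pool configuration style above; the second configuration simply enables reasoning, and the pool sizes are illustrative:
// a second configuration identical to the first, but with reasoning enabled
ConnectionConfiguration aReasoningConnConfig = ConnectionConfiguration
.to("testConnectionPool")
.credentials("admin", "admin")
.reasoning(true);
ConnectionPool aReasoningPool = ConnectionPoolConfig
.using(aReasoningConnConfig)
.minPool(10)
.maxPool(100)
.create();
// obtain() from aPool for plain queries and from aReasoningPool when reasoning is needed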
API Deprecation
Methods and classes in SNARL API that are marked with the
com.google.common.annotations.Beta
are subject to change or removal in
any release. We are using this annotation to denote new or experimental
features, the behavior or signature of which may change significantly
before it’s out of "beta".
We will otherwise attempt to keep the public APIs as stable as possible,
and methods will be marked with the standard @Deprecated
annotation
for a least one full revision cycle before their removal from the SNARL
API. See Compatibility Policies for more information about API stability.
Anything marked @VisibleForTesting
is just that, visible as a
consequence of test case requirements; don’t write any important code
that depends on functions with this annotation.
Using Maven
As of Stardog 3.0, we support Maven for both client and server
JARs. The following table summarizes the type of dependencies that you
will have to include in your project, depending on whether the project
is a Stardog client, or server, or both. Additionally, you can also
include the Jena or Sesame bindings if you would like to use them in
your project. The Stardog dependency list below follows the
Gradle convention and is of the form:
groupId:artifactId:VERSION
. Versions 3.0 and higher are supported.
Type |
Stardog Dependency |
Type |
client |
|
|
server |
|
|
rdf4j |
|
|
sesame |
|
|
jena |
|
|
gremlin |
|
|
You can see an example of their usage in our examples repository on Github.
Warning
|
If you’re using Maven as your build tool, then
client-http and server dependencies require that you specify the packaging
type as POM (pom ):
|
<dependency>
<groupId>com.complexible.stardog</groupId>
<artifactId>client-http</artifactId>
<version>$VERSION</version>
<type>pom</type> (1)
</dependency>
-
The dependency type must be set to
pom
.
Note: Though Gradle may work without doing this, it is best practice to specify the dependency type there as well:
compile "com.complexible.stardog:client-http:${VERSION}@pom"
Public Maven Repo
The public Maven repository for the current Stardog release is https://maven.stardog.com. To get started, you need to add the following endpoint to your preferred build system, e.g. in your Gradle build script:
repositories {
maven {
url "https://maven.stardog.com"
}
}
Similarly, if you’re using Maven you’ll need to add the following
to your Maven pom.xml
:
<repositories>
<repository>
<id>stardog-public</id>
<url>https://maven.stardog.com</url>
</repository>
</repositories>
Private Maven Repo
For access to nightly builds, priority bug fixes, priority feature access, hot fixes, etc. Enterprise Premium Support customers have access to their own private Maven repository that is linked to our internal development repository. We provide a private repository which you can either proxy from your preferred Maven repository manager—e.g. Artifactory or Nexus—or add the private endpoint to your build script.
Connecting to Your Private Maven Repo
Similar to our public Maven repo, we will provide you with a private URL and credentials to your private repo, which you will refer to in your Gradle build script like this:
repositories {
maven {
url $yourPrivateUrl
credentials {
username $yourUsername
password $yourPassword
}
}
}
Or if you’re using Maven, add the following to your pom.xml
:
<repositories>
<repository>
<id>stardog-private</id>
<url>$yourPrivateUrl</url>
</repository>
</repositories>
Then in your ~/.m2/settings.xml
add:
<settings>
<servers>
<server>
<id>stardog-private</id>
<username>$yourUsername</username>
<password>$yourPassword</password>
</server>
</servers>
</settings>
Network Programming
In the Java Programming section, we consider interacting with Stardog programmatically from a Java program. In this section we consider interacting with Stardog over HTTP. In some use cases or deployment scenarios, it may be necessary to interact with or control Stardog remotely over an IP-based network.
Stardog supports SPARQL 1.1 HTTP Protocol; the SPARQL 1.1 Graph Store HTTP Protocol; and the Stardog HTTP Protocol.
SPARQL Protocol
Stardog supports the standard SPARQL Protocol HTTP bindings, as well as additional functionality via HTTP. Stardog also supports SPARQL 1.1’s Service Description format. See the spec if you want details.
Stardog HTTP Protocol
The Stardog HTTP Protocol supports SPARQL Protocol 1.1 and additional resource representations and capabilities. The Stardog HTTP API v6 is also available on Apiary: http://docs.stardog.apiary.io/. The Stardog Linked Data API (aka "Annex") is also documented on Apiary: http://docs.annex.apiary.io/.
Generating URLs
Suppose you are running the HTTP server at
http://localhost:12345/
. To form the URI of a particular Stardog Database, the Database Short
Name is the first URL path segment appended to the deployment URI. For
example, for the Database called cytwombly
, deployed in the above
example HTTP server, the Database Network Name might be
http://localhost:12345/cytwombly
All the resources related to this database are identified by URL path segments relative to the Database Network Name; hence:
http://localhost:12345/cytwombly/size
In what follows, we use URI Template
notation to parameterize the actual request URLs, thus: /{db}/size
.
We also abuse notation to show the permissible HTTP request types and default
MIME types in the following way: REQ | REQ /resource/identifier → mime_type |
mime_type
. In a few cases, we use void
as short hand for the case where there
is a response code but the response body may be empty.
HTTP Headers: Content-Type & Accept
All HTTP requests that are mutative (add or remove) must include a valid
Content-Type
header set to the MIME type of the request body, where
"valid" is a valid MIME type for N-Triples, Trig, Trix, Turtle, NQuads,
JSON-LD, or RDF/XML:
RDF/XML |
application/rdf+xml |
Turtle |
text/turtle |
N-Triples |
application/n-triples |
TriG |
application/trig |
TriX |
application/trix |
N-Quads |
application/n-quads |
JSON-LD |
application/ld+json |
SPARQL CONSTRUCT
queries must also include a Accept
header set to one of these RDF serialization types.
When issuing a SELECT
query the Accept
header should be set to one
of the valid MIME types for SELECT
results:
SPARQL XML Results Format |
application/sparql-results+xml |
SPARQL JSON Results Format |
application/sparql-results+json |
SPARQL Boolean Results |
text/boolean |
SPARQL Binary Results |
application/x-binary-rdf-results-table |
Response Codes
Stardog uses the following HTTP response codes:
200 |
Operation has succeeded. |
202 |
Operation was received successfully and will be processed shortly. |
400 |
Indicates parse errors or that the transaction identifier specified for an operation is invalid or does not correspond to a known transaction. |
401 |
Request is unauthorized. |
403 |
User attempting to perform the operation does not exist, their username or password is invalid, or they do not have the proper credentials to perform the action. |
404 |
A resource involved in the request—for example the database or transaction—does not exist. |
409 |
A conflict for some database operations; for example, creating a database that already exists. |
500 |
An unspecified failure in some internal operation…Call your office, Senator! |
There are also Stardog-specific error codes in the SD-Error-Code
header in the
response from the server. These can be used to further clarify the reason for
the failure on the server, especially in cases where it could be ambiguous. For
example, if you received a 404
from the server trying to commit a transaction
denoted by the path /myDb/transaction/commit/293845klf9f934
…it’s probably
not clear what is missing: it’s either the transaction or the database. In this
case, the value of the SD-Error-Code
header will clarify.
The enumeration of SD-Error-Code
values and their meanings are as follows:
0
|
Authentication error |
1
|
Authorization error |
2
|
Query evaluation error |
3
|
Query contained parse errors |
4
|
Query is unknown |
5
|
Transaction not found |
6
|
Database not found |
7
|
Database already exists |
8
|
Database name is invalid |
9
|
Resource (user, role, etc) already exists |
10
|
Invalid connection parameter(s) |
11
|
Invalid database state for the request |
12
|
Resource in use |
13
|
Resource not found |
14
|
Operation not supported by the server |
15
|
Password specified in the request was invalid |
In cases of error, the message body of the result will include any error information provided by the server to indicate the cause of the error.
Stardog Resources
To interact with Stardog over HTTP, use the following resource representations, HTTP response codes, and resource identifiers.
Query Evaluation
GET | POST /{db}/query
The SPARQL Protocol endpoint for read queries against the database. The valid Accept types are listed above in the HTTP Headers section.
To issue SPARQL queries with reasoning over HTTP, see Using Reasoning.
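As an illustrative sketch (any HTTP client will do), the following posts a SPARQL SELECT query to this endpoint using the JDK’s java.net.http client; the server URL http://localhost:5820, the database name myDb, and the admin credentials are placeholders for your deployment:
HttpClient client = HttpClient.newHttpClient();
String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create("http://localhost:5820/myDb/query"))
.header("Authorization", "Basic " + auth)
.header("Content-Type", "application/sparql-query")
.header("Accept", "application/sparql-results+json")
.POST(HttpRequest.BodyPublishers.ofString("select * where { ?s ?p ?o } limit 10"))
.build();
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
// a non-2xx status may also carry an SD-Error-Code header describing the failure
System.out.println(response.statusCode() + "\n" + response.body());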
SPARQL update
GET | POST /{db}/update → text/boolean
The SPARQL Protocol endpoint for updating the database with SPARQL Update. The valid Accept types are
application/sparql-update
or application/x-www-form-urlencoded
. Response is the result of
the update operation as text, e.g. true
or false
.
Query Plan
GET | POST /{db}/explain → text/plain
Returns the explanation for the execution of a query, i.e., a query
plan. All the same arguments as for Query Evaluation are legal here; but
the only MIME type for the Query Plan resource is text/plain
.
Transaction Begin
POST /{db}/transaction/begin → text/plain
Returns a transaction identifier resource as text/plain
, which is
likely to be deprecated in a future release in favor of a hypertext
format. POST
to begin a transaction accepts neither body nor arguments.
Transaction Security Considerations
Warning
|
Stardog’s implementation of transactions with HTTP is vulnerable to man-in-the-middle attacks, which could be used to violate Stardog’s isolation guarantee (among other nasty side effects). |
Stardog’s transaction identifiers are 64-bit GUIDs and, thus, pretty hard to guess; but if you can grab a response in-flight, you can steal the transaction identifier if basic access auth or RFC 2069 digest auth is in use. You’ve been warned.
In a future release, Stardog will use RFC 2617 HTTP Digest Authentication, which is less vulnerable to various attacks and will never ask a client to use a different authentication type, which should lessen the likelihood of MitM attacks for properly restricted Stardog clients—that is, a Stardog client that treats any request by a proxy server or origin server (i.e., Stardog) to use basic access auth or RFC 2069 digest auth as a MitM attack. See RFC 2617 for more information.
Transaction Commit
POST /{db}/transaction/commit/{txId} → void | text/plain
Returns a representation of the committed transaction; 200
means the
commit was successful. Otherwise a 500
error indicates the commit
failed and the text returned in the result is the failure message.
As you might expect, failed commits exit cleanly, rolling back any changes that were made to the database.
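Putting the transaction endpoints together, here is a sketch of a begin/add/commit cycle, reusing the same JDK HTTP client, placeholder server URL, database, and auth header as the query example above:
// begin a transaction; the response body is the transaction id
HttpRequest begin = HttpRequest.newBuilder()
.uri(URI.create("http://localhost:5820/myDb/transaction/begin"))
.header("Authorization", "Basic " + auth)
.POST(HttpRequest.BodyPublishers.noBody())
.build();
String txId = client.send(begin, HttpResponse.BodyHandlers.ofString()).body().trim();
// add some Turtle data within the transaction
HttpRequest add = HttpRequest.newBuilder()
.uri(URI.create("http://localhost:5820/myDb/" + txId + "/add"))
.header("Authorization", "Basic " + auth)
.header("Content-Type", "text/turtle")
.POST(HttpRequest.BodyPublishers.ofString("<urn:subj> <urn:pred> <urn:obj> ."))
.build();
client.send(add, HttpResponse.BodyHandlers.ofString());
// commit the transaction
HttpRequest commit = HttpRequest.newBuilder()
.uri(URI.create("http://localhost:5820/myDb/transaction/commit/" + txId))
.header("Authorization", "Basic " + auth)
.POST(HttpRequest.BodyPublishers.noBody())
.build();
client.send(commit, HttpResponse.BodyHandlers.ofString());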
Transaction Rollback
POST /{db}/transaction/rollback/{txId} → void | text/plain
Returns a representation of the transaction after it’s been rolled back.
200
means the rollback was successful, otherwise 500
indicates the
rollback failed and the text returned in the result is the failure
message.
Querying (Transactionally)
GET | POST /{db}/{txId}/query
Returns a representation of a query executed within the txId
transaction. Queries within transactions will be slower as extra
processing is required to make the changes visible to the query. Again,
the valid Accept types are listed above in the HTTP Headers
section.
GET | POST /{db}/{txId}/update → text/boolean
The SPARQL endpoint for updating the database with SPARQL Update. Update queries
are executed within the specified transaction txId
and are not atomic operations
as with the normal SPARQL update endpoint. The updates are executed when the
transaction is committed like any other change. The valid Accept types are
application/sparql-update
or application/x-www-form-urlencoded
. Response is the result of
the update operation as text, e.g. true
or false
.
Adding Data (Transactionally)
POST /{db}/{txId}/add → void | text/plain
Returns a representation of data added to the database of the specified
transaction. Accepts an optional parameter, graph-uri
, which specifies
the named graph the data should be added to. If a named graph is not
specified, the data is added to the default (i.e., unnamed) context. The
response codes are 200
for success and 500
for failure.
Deleting Data (Transactionally)
POST /{db}/{txId}/remove → void | text/plain
Returns a representation of data removed from the database within the
specified transaction. Also accepts graph-uri
with the analogous
meaning as above--Adding Data (Transactionally). Response codes are also the same.
Clear Database
POST /{db}/{txId}/clear → void | text/plain
Removes all data from the database within the context of the
transaction. 200
indicates success; 500
indicates an error. Also
takes an optional parameter, graph-uri
, which removes data from a
named graph. To clear only the default graph, pass DEFAULT
as the value of graph-uri
.
Export Database
GET /{db}/export → RDF
Exports the default graph in the database in Turtle format. Also
takes an optional parameter, graph-uri
, which selects a
named graph to export. The valid Accept types are the ones defined
above in HTTP Headers for RDF Formats.
Explanation of Inferences
POST /{db}/reasoning/explain → RDF
POST /{db}/reasoning/{txId}/explain → RDF
Returns the explanation of the axiom which is in the body of the POST
request. The request takes the axioms in any supported RDF format and
returns the explanation for why that axiom was inferred as Turtle.
Explanation of Inconsistency
GET | POST /{db}/reasoning/explain/inconsistency → RDF
If the database is logically inconsistent, this returns an explanation for the inconsistency.
Consistency
GET | POST /{db}/reasoning/consistency → text/boolean
Returns whether or not the database is consistent w.r.t. the TBox.
Listing Integrity Constraints
GET /{db}/icv → RDF
Returns the integrity constraints for the specified database serialized in any supported RDF format.
Adding Integrity Constraints
POST /{db}/icv/add
Accepts a set of valid integrity constraints serialized in any RDF format supported by Stardog and adds them to the database in an atomic action. A 200 return code indicates the constraints were added successfully; 500 indicates that the constraints were not valid or could not be added.
Removing Integrity Constraints
POST /{db}/icv/remove
Accepts a set of valid integrity constraints serialized in any RDF
format supported by Stardog and removes them from the database in a
single atomic action. 200
indicates the constraints were successfully
removed; 500
indicates an error.
Clearing Integrity Constraints
POST /{db}/icv/clear
Drops all integrity constraints for a database. 200
indicates all
constraints were successfully dropped; 500
indicates an error.
Validating Constraints
POST /{db}/icv/validate → text/boolean | application/json
Validates that the data in the database conforms to the integrity constraints.
The message body can optionally include the constraints to be validated. If not,
the constraints that have already been added to the database will be used for
validation. Accepts an optional parameter, graph-uri
, which specifies the
named graphs that will be validated. If a named graph is not specified, all the
graphs in the database will be validated.
Getting SHACL Validation Report
POST /{db}/icv/report → RDF
Returns a SHACL validation report
for the database. The message body can optionally include the constraint to be
validated. If not, the SHACL constraints that have already been added to the
database will be used for validation. Accepts an optional parameter, graph-uri
,
which specifies the named graphs that will be validated. If a named graph is not
specified, all the graphs in the database will be validated. The shapes
parameter
can be used to provide a list of shape IRIs to validate only a subset of the SHACL
constraints in the database. The nodes
parameter can be used to specify a subset of
RDF nodes in the database for validation. The countLimit
parameter, if provided,
will limit the number of validation results returned.