Data Quality with ICV


Getting data into the graph is only the first step. Stardog’s Integrity Constraint Validation (ICV) services provide a powerful way to improve data quality in the Knowledge Graph. We’ll learn how this works by looking at some examples from the Knowledge Graph project at NASA that is built on Stardog.

Data Quality in 4 Easy Steps

For a Knowledge Graph to be useful, it’s critical that the data be valid and consistent. Tools to validate and enforce data integrity help to improve its correctness and consistency. Constraints provide one method of implementing business rules. They prevent bad data from entering the system. Strict enforcement of constraints lowers error rates, resulting in time saved troubleshooting and tracing erroneous data.

Stardog provides ICV services to detect, explain, and report invalid or inconsistent data and to check the integrity of the Knowledge Graph. At NASA we’re using ICV to help improve the data quality of a Knowledge Graph containing a number of different kinds of NASA objects sourced from a variety of systems and owners. As the graph has grown, we’ve incrementally built up a set of quality constraints. This allows us to measure the quality of the data, perform verification after an integration, and assist in planning future improvement measures. We previously wrote about the initial construction of the NASA knowledge graph.

What follows is a 4-step process to get started using ICV.

  1. Identify appropriate rules
  2. Encode the rules and build the model
  3. Validate constraints and analyze the report
  4. Exploit the rules for data quality reporting

1. Identify Appropriate Rules

The first step is to identify appropriate constraints to validate the data. Our approach at NASA is to meet with stakeholders and subject matter experts who understand the data intimately and talk about how the data objects relate to each other and what domain-specific rules they follow. To help us with the discussion, we have certain categories to consider.

Rules and Guidelines

  • Quality Control: Are they valid objects? (Rules)
    • e.g. Non-null properties
  • Goodness: Are they good objects? (Best Practices)
    • e.g. Cardinalities aren’t exceeded
  • Structural: Do the relationships between objects follow expected patterns?
    • e.g. Intersections in the graph, valid paths, etc.

We also use this opportunity to identify various queries and metrics to collect, some of which may turn into rules, stored queries, or application integrations.

Queries

  • Gap analysis: “Show me all cases where there is no ‘measurement’ object tied to a ‘sensor’ object.”
  • Status: “Show me the status of all activities tied to this object.”
  • Other Data: “Show me everything related to a particular object.”
  • Collection: “Show me all the objects that occur only during a time frame.”
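The first of these, the gap analysis, can be sketched as a SPARQL query in the same style as the constraint queries later in this post. The :Sensor, :Measurement, and :measurementOf names here are hypothetical stand-ins, not the actual NASA schema:

```sparql
# Find sensors that have no 'measurement' object tied to them
# (class and property names are illustrative).
SELECT ?sensor {
    ?sensor a :Sensor .
    FILTER NOT EXISTS {
        ?measurement a :Measurement ;
                     :measurementOf ?sensor .
    }
}
```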

We collaborate with technical and business stakeholders to capture these rules in a spreadsheet, and then we use Stardog Virtual Graphs (or another transformation) to turn the CSV into a graph. After all, Stardog ICV constraints are just more data in the graph. The spreadsheet also gives us a common point of reference for aligning stakeholders’ business rules with the lower-level constraints.
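As a sketch of what one spreadsheet row might look like once transformed into the graph (the rule: vocabulary below is hypothetical, not a Stardog namespace):

```turtle
@prefix rule: <https://example.org/rules/> .

# One row of the rules spreadsheet, captured as graph data.
rule:SYS-001 a rule:QualityRule ;
    rule:category "Quality Control" ;
    rule:statement "No system can be an orphan." ;
    rule:owner "Systems Engineering" ;
    rule:status "Implemented" .
```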

2. Encode Rules, Build the Model

Once the rules have been identified, they are encoded as ICV constraints. Since data integrated from disparate systems may not have aligned schemas, we like to iterate on each constraint until all stakeholders share a baseline understanding. To move quickly, we typically start with SPARQL or Stardog Rules, whichever best matches how the stakeholders describe the data relationship. ICV constraints can be expressed as OWL axioms, SPARQL queries, or Stardog Rules.

For example, here is one of the business rules we identified: No system can be an orphan; each system should have a connection to at least one other object.

This needs to be translated into a constraint. We create a constraint based on a SPARQL query. This style of constraint is violated if there are any results from the SPARQL query. Here we check for any System that does not have a systemOf relationship to another node:

@prefix icv: <tag:stardog:api:icv:> .

# SPARQL Constraint
[] a icv:Constraint ;
    icv:query """
        SELECT * {
            ?x a :System .
            FILTER NOT EXISTS {
                ?x :systemOf ?system .
            } .
        }
    """ .

This can also be expressed using OWL axioms (under Stardog’s ICV closed world semantics):

:System rdfs:subClassOf
              [ a owl:Restriction ;
                owl:onProperty :systemOf ;
                owl:someValuesFrom :Thing
              ] .

This constraint requires the existence of at least one relationship, i.e., at least one :systemOf edge to another node. Should these relationships be to objects of a particular class? Should we have cardinality constraints? These are the sorts of questions we iterate on as we expand the requirements into a robust set of checks. The data modeling (what kinds of classes, i.e. node types, and properties, i.e. edge types, are required) plus the constraints together make up the model of the application, which in turn captures part of the domain. This model, including the constraints, is reusable both inside and outside NASA.
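For instance, if iteration reveals that every system should connect to at least one object of a specific class, the someValuesFrom restriction could be tightened into a qualified cardinality restriction. This is a sketch; the :Subsystem class is a hypothetical example, not part of the actual model:

```turtle
# Require at least one :systemOf edge specifically to a :Subsystem
# (hypothetical class), rather than to any node at all.
:System rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty :systemOf ;
      owl:minQualifiedCardinality "1"^^xsd:nonNegativeInteger ;
      owl:onClass :Subsystem
    ] .
```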

3. Validate Constraints, Analyze the Report

Once the model is developed, we validate a graph against the constraints using the icv validate command, as shown below. We can also do this over HTTP or via the Java APIs. If any constraints are violated, the output notes it:

$ stardog icv validate myDb constraints.ttl

Data is NOT valid.
The following constraints were violated:
SPARQLConstraint{

        SELECT * {
            ?x a :System .
            FILTER NOT EXISTS {
                ?x :systemOf ?system .
            } .
        }}

We then use the icv explain command to learn more, including which nodes in the graph violate a constraint:

$ stardog icv explain -r myDb constraints.ttl

VIOLATED SPARQLConstraint{

        SELECT * {
            ?x a :System .
            FILTER NOT EXISTS {
                ?x :systemOf ?system .
            } .
        }}
+---------------------------------------+
|                   x                   |
+---------------------------------------+
| https://nasa.gov/system/4             |
| https://nasa.gov/system/204           |
|                   .                   |
|                   .                   |
+---------------------------------------+

Violations are then examined to determine the root cause and fix the data. This lets us develop better procedures and constraints and deploy them iteratively.

Stardog also has the ability to apply constraints as part of its transactional cycle and fail transactions that violate constraints. This is called “guard mode”, and it must be enabled explicitly in the database configuration options; see the Stardog documentation for details.
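As a sketch (option and command names follow Stardog’s documentation, but check them against your version), guard mode is set when the database is created, and the constraints are then stored in the database so they apply to every transaction:

```shell
# Create a database with guard mode enabled (icv.enabled=true),
# then store the constraints so each transaction is validated against them.
$ stardog-admin db create -o icv.enabled=true -n myDb
$ stardog icv add myDb constraints.ttl
```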

When should you use guard mode versus ad hoc validation? If we’re building a Knowledge Graph over many systems, each of which owns its data entry interfaces, then we will run ICV asynchronously on our own auditing schedule. For applications built to store data directly in the graph, guard mode helps move complex business logic from application software to declarative data in the Knowledge Graph.

4. Exploit the Rules

Modeling constraints declaratively and running validation periodically gives us a mechanism to collect data and exploit the results to measure and track the data quality of the Knowledge Graph. As the size and complexity of the data set increase over time, ICV becomes a useful way to track the correctness and consistency of the data and to keep a watchful eye on the progress and maturity of the information. For production Knowledge Graphs, an audit process that runs the queries and creates reports surfaces actionable quality issues to manage as the system evolves.
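One lightweight way to implement such an audit is to schedule the same CLI commands used above and archive the reports. This is a sketch; the reports/ directory and naming scheme are illustrative:

```shell
# Nightly audit sketch: run the explanation report and archive it by date
# so quality trends can be compared across runs.
$ stardog icv explain -r myDb constraints.ttl > reports/icv-$(date +%F).txt
```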

During the development of a new system, we produce valid and invalid data sets for each individual constraint and run the constraint validation during the build process. As models and schemas evolve and the number of applications increases, running ICV during development provides a quick feedback cycle.
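A sketch of that build-time check, with illustrative database and file names: load the known-valid and known-invalid data sets into scratch databases, then assert that validation passes on one and fails on the other.

```shell
# Build-step sketch: the constraint must accept the known-valid data set...
$ stardog icv validate buildDbValid orphan-system-constraint.ttl

# ...and must report violations on the known-invalid one;
# the build fails if either check gives the unexpected result.
$ stardog icv validate buildDbInvalid orphan-system-constraint.ttl
```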

Using Stardog ICV, you’ll be rocketing your way towards data quality.
