Working with Messy Data: You May Have More Flexibility Than You Think

Tim Sedlak

Jun 15, 2022, 4 minute read

“Our data is a mess!” is a common frustration among Stardog prospects. They aren’t alone. The volume of both structured and unstructured data has exploded and shows no sign of letting up, and efforts to find better ways to store and manage that data have grown with it. These are two of the many factors behind significant data proliferation across enterprises. And the data is often not tidy.

Data quality matters because digital transformation has increased the demand for highly curated datasets to support advanced analytics, machine learning, and other applications. That data comes from disparate, disorganized, and rapidly changing sources, and if it isn’t complete or accurate, enterprises lose out.

Data Quality Solutions

Organizations are on the hunt for a magic bullet to cure their data quality woes, but dirty data can crop up in many ways. End-users may encounter invalid fields, duplicate data, missing values, additional values, and consistency issues. Many solutions promise to solve these problems.

Master Data Management (MDM) tools try to establish a single source of truth for data domains across an organization. Data Quality tools aim to identify, understand, and correct issues in the data. Both require significant investments to build and are often one step behind an evolving data management landscape inside the organization.

An enterprise knowledge graph offers a more flexible solution, supporting the dynamic delivery of semantically enriched data.

Enterprise knowledge graphs are vital in transforming data infrastructure into a data fabric. Stardog’s Enterprise Knowledge Graph platform uses a unique combination of virtualization, inference, and data quality validation, making it an excellent choice for dealing with dynamic, disparate, and even messy data.

Let’s walk through each feature and discuss how it helps with ongoing data quality issues.

Virtualization and Data Integrity

Stardog’s virtualization capability provides a cost-effective alternative to traditional, expensive data integration techniques that require data to be replicated, moved, and stored multiple times. Copying data for every new project or use case leads to data drift and errors, and ultimately erodes end-users’ trust in the data.

Because queries run against the source systems, data virtualization ensures that users consume the most current data available. Errors discovered in the data can be directed to the source system owners and subject matter experts for correction. Once fixed, the correct data is immediately reflected in the results seen by end-users, because the data was never persisted in the knowledge graph.
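To make the contrast concrete, here is a minimal sketch in plain Python. It is not Stardog code: the `source_system` dictionary, the `replicated_copy`, and the `virtual_lookup` function are all hypothetical stand-ins, used only to show why a read-through view picks up a fix at the source while a copy does not.

```python
import copy

# Hypothetical source system: a record store owned by another team.
source_system = {"cust-1": {"name": "Acme Corp", "country": "Untied States"}}

# Traditional integration: the data is copied once and then drifts.
replicated_copy = copy.deepcopy(source_system)


def virtual_lookup(customer_id):
    """Read through to the source at query time; nothing is persisted here."""
    return source_system[customer_id]


# The source owner corrects the typo in the system of record.
source_system["cust-1"]["country"] = "United States"

virtual_result = virtual_lookup("cust-1")["country"]
stale_result = replicated_copy["cust-1"]["country"]

print("virtual view:", virtual_result)
print("replicated copy:", stale_result)
```

The virtual lookup returns the corrected value as soon as the source changes, while the replicated copy keeps serving the old typo until someone re-runs the copy job.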

Inferencing and Data Correctness

Stardog’s Inference Engine associates related information stored across disparate sources and applies business rules based on a semantic data model. This combination of relationships and rules is used to discover new connections along with additional insights. Since Stardog infers these connections at query time, the resulting insights are always up-to-date.

Taken together, these inferred relationships and connections between data sources create a richer, more complete, and more accurate view of the data.
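The idea of inferring connections at query time can be sketched in a few lines of plain Python. This is an illustration only, not Stardog’s Inference Engine or its rule syntax: the two fact sets, the predicate names, and the single rule in `query_affiliations` are assumptions invented for the example.

```python
# Facts from two hypothetical sources, as (subject, predicate, object) triples.
hr_facts = {("alice", "worksFor", "acme_labs")}
legal_facts = {("acme_labs", "subsidiaryOf", "acme_corp")}


def query_affiliations(*sources):
    """Derive affiliations at query time; nothing inferred is stored.

    Illustrative rule: X worksFor Y and Y subsidiaryOf Z => X affiliatedWith Z.
    """
    facts = set().union(*sources)
    employers = [(s, o) for (s, p, o) in facts if p == "worksFor"]
    parents = [(s, o) for (s, p, o) in facts if p == "subsidiaryOf"]
    return {
        (person, "affiliatedWith", parent)
        for (person, employer) in employers
        for (child, parent) in parents
        if child == employer
    }


before = query_affiliations(hr_facts, legal_facts)

# The legal source is updated: acme_labs now belongs to a different parent.
legal_facts.remove(("acme_labs", "subsidiaryOf", "acme_corp"))
legal_facts.add(("acme_labs", "subsidiaryOf", "globex"))

after = query_affiliations(hr_facts, legal_facts)

print(before)
print(after)
```

Because the inferred triples are computed on each call rather than stored, the second query reflects the updated ownership without any cleanup of old conclusions.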

Constraint Validation and Data Consistency

Stardog’s Integrity Constraint Validation (“ICV”) enforces data integrity and helps improve the knowledge graph’s correctness and consistency. Stardog validates data stored in a Stardog database against constraints that users define to fit their domain, application, and data.

Constraints can also find inconsistencies across disparate data sources, flag conflicting data, or prevent the knowledge graph from accessing bad data. Constraints support measuring the quality of the data, performing verification after an integration, and assisting in planning future improvement measures. Stardog also offers explanations for constraint violations that provide insights on what the invalid data is and why it is invalid.
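As a rough illustration of validating records against user-defined constraints and explaining each violation, consider the plain-Python sketch below. It does not use Stardog’s ICV API or any constraint language; the `Constraint` class, the two example rules, and the sample records are hypothetical and exist only to show the pattern.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Constraint:
    name: str
    check: Callable[[dict], bool]  # returns True when the record is valid
    explanation: str               # reported when the check fails


# Hypothetical domain rules chosen for the example.
constraints = [
    Constraint(
        "employee-has-manager",
        lambda r: bool(r.get("manager")),
        "every employee must have a manager",
    ),
    Constraint(
        "valid-email",
        lambda r: "@" in (r.get("email") or ""),
        "email must contain '@'",
    ),
]

records = [
    {"id": "e1", "manager": "e9", "email": "ann@example.com"},
    {"id": "e2", "manager": None, "email": "bob.example.com"},
]


def validate(records, constraints):
    """Return (record id, constraint name, explanation) for each violation."""
    violations = []
    for record in records:
        for constraint in constraints:
            if not constraint.check(record):
                violations.append(
                    (record["id"], constraint.name, constraint.explanation)
                )
    return violations


violations = validate(records, constraints)
for record_id, name, explanation in violations:
    print(f"{record_id} violates {name}: {explanation}")
```

A report like this can be counted to measure data quality, re-run after an integration as a verification step, or used to block a load when the list of violations is not empty.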

Good News for Knowledge Graph Users

For a knowledge graph to be useful, the data must be valid and consistent. However, concerns about messy data should not keep anyone from moving forward with a knowledge graph project.

Stardog’s platform, which includes the above-mentioned capabilities, can enforce data integrity, improve data correctness, and ensure data consistency. With these safeguards in place, customers can feel confident that the end-users and consumers of their knowledge graph are receiving complete and accurate results that they can use to solve their business needs.

So what’s next? Try this on-demand webinar from Stardog’s VP of Solutions Consulting and Engineering, “How to Build a Knowledge Graph.”