Stardog is an Enterprise Knowledge Graph (EKG) platform that unifies data based on meaning without requiring data to be copied. While we created our graph-based virtualization solution, Virtual Graphs, to avoid data copying, we also support graph-based storage for those times when materializing data is required, whether due to law, regulation, or simply internal policy.
For Stardog admins, understanding that Stardog has a semantic graph database under the covers provides a helpful mindset for managing Stardog, which I detail below. But first, it’s critical to understand the function that the semantic graph serves for Stardog. Stardog uses the semantic graph as the key to connect data rather than transform it, enabling collaboration between internal and external stakeholders. Connecting data creates a reusable data foundation to power multiple use cases. We believe that the semantic graph is the best way to scalably connect data in the modern enterprise.
However, when we first started building enterprise data management solutions, there wasn’t a suitable open source semantic graph database to use. So we had to build our own.
Our vision was never just to build another graph database, but we did build a really good one. It is customized for the purpose of EKG, including functionality like distributed high availability, ACID transactions, and MVCC semantics. (For background on some of these issues, you might read the blog post introducing Mastiff, the latest storage engine inside Stardog.)
Of course a semantic graph database is also a database, so we had the benefit of designing that system using the wisdom of databases that came before us. This was especially influential for low-level system design, including functions like the persistence layer and the query planner. Each of these systems has to be adapted to the unique requirements of an EKG platform, but each is rooted in sound database theory and computer science research.
Specifically, our template here is PostgreSQL, the leading open source enterprise database. PostgreSQL is an MVCC system and Stardog is, too, and that’s by design. The move to an MVCC architecture provides a fully lock-free transaction system. This has enormous impacts throughout the system, but the main benefits are pretty obvious: multi-writer concurrency is considerably improved and the chance of bugs is reduced, leading to more stability.
How to think about administering Stardog
The big takeaway that we want to leave you with from this blog post can be easily summarized:
The operations, procedures, and guarantees in Stardog with respect to transactional semantics, administration, and management are very similar to PostgreSQL. This means that typical best practices for database management apply to Stardog’s storage functionality.
One of the implications of an MVCC system like Stardog is that reads and writes never block each other, and that this absence of blocking comes without the system taking locks on contended resources. In fact, snapshot isolation is often paired with MVCC since the two go together nicely; that’s Stardog’s default consistency model.
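To make the snapshot-isolation idea concrete, here is a minimal, purely illustrative sketch in Python. This is not Stardog’s implementation; the `MVCCStore` class and its methods are invented for this example. Each write creates a new version stamped with a monotonically increasing transaction id, and a reader sees only versions committed at or before its snapshot, so concurrent writes never block it.

```python
# Illustrative MVCC snapshot-isolation sketch (not Stardog's internals).
import itertools

class MVCCStore:
    def __init__(self):
        self._versions = {}           # key -> list of (txn_id, value)
        self._clock = itertools.count(1)

    def begin(self):
        """Start a read transaction; its snapshot is the current clock value."""
        return next(self._clock)

    def write(self, key, value):
        """Commit a new version without touching or locking older ones."""
        txn = next(self._clock)
        self._versions.setdefault(key, []).append((txn, value))
        return txn

    def read(self, key, snapshot):
        """Return the newest version visible to the given snapshot."""
        visible = [v for t, v in self._versions.get(key, []) if t <= snapshot]
        return visible[-1] if visible else None

store = MVCCStore()
store.write("color", "red")
snap = store.begin()              # reader takes a snapshot
store.write("color", "blue")     # concurrent write is not blocked
print(store.read("color", snap))  # prints "red": the snapshot is stable
```

A new transaction started after the second write would, of course, see `"blue"`; the old version sticks around only so in-flight readers stay consistent, which is exactly what makes cleanup (below) necessary.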
Another implication of MVCC is that deletes are implemented in a soft, or tombstone, fashion: some data in the indexes is marked as deleted but is only physically removed at some later point in time, asynchronously. (Stardog isn’t breaking new ground here, since most of this scheme has been in PostgreSQL for years, as well as in many other systems.)
Finally, an implication of this approach to deletes is that Stardog, exactly like PostgreSQL, requires some periodic maintenance. PostgreSQL calls this process database vacuuming. Stardog performs vacuuming explicitly when a database is optimized; vacuuming also runs periodically as data is written to the database.
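The tombstone-then-vacuum lifecycle can be sketched in a few lines of Python. Again, this is a toy model for intuition, not Stardog’s storage code: a delete merely marks the entry, readers treat marked entries as gone, and a later vacuum pass physically reclaims them.

```python
# Toy tombstone-delete and vacuum sketch (illustrative only).
TOMBSTONE = object()   # sentinel marking a logically deleted entry

class Index:
    def __init__(self):
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = value

    def delete(self, key):
        # Soft delete: cheap marker write, no physical removal yet.
        if key in self._entries:
            self._entries[key] = TOMBSTONE

    def get(self, key):
        # Readers see tombstoned entries as already gone.
        value = self._entries.get(key)
        return None if value is TOMBSTONE else value

    def vacuum(self):
        """Physically remove tombstoned entries; returns how many."""
        dead = [k for k, v in self._entries.items() if v is TOMBSTONE]
        for k in dead:
            del self._entries[k]
        return len(dead)

idx = Index()
idx.put("s1", "triple-1")
idx.delete("s1")
print(idx.get("s1"))   # None: logically gone immediately
print(idx.vacuum())    # 1: physical cleanup happens later, asynchronously
```

The reason vacuuming matters for performance is visible even here: until `vacuum()` runs, every read has to skip over tombstones, so a database that is never optimized slowly accumulates dead weight in its indexes.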
Tips for the working Stardog EKG Admin
If you wouldn’t force kill PostgreSQL, don’t force kill Stardog
Force killing a database process is rarely the right move, especially in an MVCC system, and even more so while it’s vacuuming. It shouldn’t, and in most cases won’t, lead to on-disk corruption, but that doesn’t mean it’s a good idea.
Fine-tune query performance by trusting the query plan explainer
At the core of Stardog lies the query optimizer whose mission is precisely to achieve optimal performance on every input query. The optimizer may rewrite a query into a form which is more efficient to execute but gives all the same answers. Sometimes, however, it may require a little help from the user to find the most efficient query execution plan. Learn more in our blog: 7 Steps to Fast SPARQL Queries.
Use DevOps best practice around PostgreSQL vacuuming as a template
First, like any other MVCC system, Stardog requires periodic (ideally automated) optimization, i.e., cleaning up previously tombstoned deletes.
Second, database optimization in Stardog happens at regular intervals; you can change those intervals, and you can even turn the process off completely. But then, as in PostgreSQL, you are responsible for managing the system yourself. Note: if you are maintaining manually, don’t run optimization in tight loops with unrealistic performance expectations; at the same time, you can’t skip it entirely, because it is key to maintaining great read performance.
Third, running database optimization during quiet periods is the best strategy for achieving optimal performance without causing latency issues around contended resources. Until Stardog offers parallelized database optimization, you will need to plan carefully.
Fourth, other database system best practices apply here, including taking regular backups. Stardog’s features are special, but its low-level operations are pretty ordinary and should remind you a lot of PostgreSQL.
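The quiet-period strategy from the third point above amounts to a simple scheduling check. Here is a hedged sketch: a helper that decides whether the current time falls inside an off-peak maintenance window before you kick off a manual optimize. The window boundaries and the function name are made-up examples for illustration, not Stardog defaults or APIs.

```python
# Illustrative off-peak window check for scheduling manual optimization.
from datetime import datetime, time

def in_maintenance_window(now, start=time(1, 0), end=time(5, 0)):
    """Return True if `now` falls in the [start, end) off-peak window.

    Handles windows that wrap past midnight (e.g. 23:00-03:00).
    The 01:00-05:00 default is an arbitrary example, not a Stardog setting.
    """
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end

# At 02:30 we would run the optimize job; at noon we would skip it.
print(in_maintenance_window(datetime(2023, 5, 1, 2, 30)))   # True
print(in_maintenance_window(datetime(2023, 5, 1, 12, 0)))   # False
```

In practice you would wire a check like this into whatever scheduler you already use (cron, your orchestration tool, etc.) so optimization only fires when user-facing load is low.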
We’re here to help
Hopefully, these guidelines prove helpful in managing your Stardog EKG. Be sure to familiarize yourself with the other features of Stardog’s platform and the resources for implementing and managing those features. Our Customer Success team is standing by for any other questions or guidance you may need.