Making Cassandra Sing

Paul Jackson

Jan 16, 2019, 4 minute read

Stardog is proud to announce support for Apache Cassandra in Stardog 6.1. Now even more of your enterprise’s NoSQL data is accessible to the Knowledge Graph.

Integrating Apache Cassandra

Now that Stardog supports Cassandra, we’re at two NoSQL¹ and six Big Data² platforms, in addition, of course, to all of the real RDBMSes. This is starting to get serious!

While each new addition brings its own set of idiosyncrasies, adding support for Cassandra was especially interesting in several ways. I thought I’d share a few of the highlights.

Scalability First

Cassandra was built with scalability as a first principle and you can see evidence of this reflected in different elements of the design. For example, the choice of the primary key for a Cassandra table has an impact on how the data in the table will be distributed across the cluster. The choice of primary key also affects how efficiently a query will execute or even whether certain queries will execute at all.
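To make the link between key choice and data placement concrete, here is a minimal sketch of rows being routed to nodes by a hash of their partition key. The node names, the sample rows, and the `node_for` helper are all hypothetical, and the md5-modulo scheme is only a stand-in: a real Cassandra cluster assigns rows through a partitioner and a token ring.

```python
import hashlib

# Hypothetical three-node cluster (names are made up for this sketch).
NODES = ["node-a", "node-b", "node-c"]


def node_for(partition_key: str) -> str:
    """Pick a node from a hash of the partition key.

    Stand-in only: this is not Cassandra's partitioner, just an
    illustration that placement depends on the partition key alone.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


# Hypothetical sales rows: (store_id, sale_date, amount).
rows = [
    ("store-1", "2019-01-02", 40),
    ("store-2", "2019-01-02", 15),
    ("store-1", "2019-01-03", 25),
]

# Group rows by the node that would own them when store_id is the partition key.
placement = {}
for store_id, sale_date, amount in rows:
    placement.setdefault(node_for(store_id), []).append((store_id, sale_date, amount))
```

With `store_id` as the partition key, every row for a given store lands on the same node. Choosing a different partition key, say the sale date, would spread one store's rows across the cluster.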

With Cassandra, it is normal, even a best practice, to replicate data into multiple tables where each table is modeled with different keys. The design of each table is optimized for answering different queries. Cassandra is optimized to make writing data cheap. For those of us who were trained to normalize tables for efficient storage, working with Cassandra requires a paradigm shift.

A goal in Cassandra is to design your data model so that a query can be answered by a single node in the cluster. If you need to support a query that sums the sales for a given store, you create a table that is designed so that all the sales for a given store are on a single node and the sales are sorted such that those figures will be in a contiguous block.

Alternatively, you create a table where those totals are precomputed, either in real time as new sales are written or asynchronously as part of some batch update process.
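The two designs can be compared with a small in-memory sketch. The dictionaries below stand in for two Cassandra tables, and `record_sale` and `total_by_scan` are hypothetical helpers invented for this illustration; nothing here talks to a real cluster.

```python
import bisect
from collections import defaultdict

# "Table" 1: one partition per store, rows kept sorted by sale date,
# so a date range for one store is a contiguous block.
sales_by_store = defaultdict(list)

# "Table" 2: a running total per store, maintained as sales are written.
total_by_store = defaultdict(int)


def record_sale(store_id, sale_date, amount):
    """Write the sale to both tables; writes are assumed to be cheap."""
    bisect.insort(sales_by_store[store_id], (sale_date, amount))
    total_by_store[store_id] += amount


def total_by_scan(store_id, first_date, last_date):
    """Sum one store's sales over a date range by reading a contiguous slice."""
    rows = sales_by_store[store_id]
    lo = bisect.bisect_left(rows, (first_date,))
    hi = bisect.bisect_right(rows, (last_date, float("inf")))
    return sum(amount for _, amount in rows[lo:hi])


record_sale("store-1", "2019-01-03", 25)
record_sale("store-1", "2019-01-01", 10)
record_sale("store-1", "2019-01-02", 40)
record_sale("store-2", "2019-01-02", 15)
```

Here `total_by_scan("store-1", "2019-01-02", "2019-01-03")` reads only a slice of one partition, while `total_by_store["store-1"]` is a single lookup paid for at write time.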

Adapting Stardog

At first, adapting our query rewriter to Cassandra felt like an impedance mismatch. From the highest perspective, the Stardog philosophy is to let users declare the mapping from their source system to an RDF representation. Once that is done, you write queries. Then the query planner will rewrite those queries to the query language of the source system (these are called source access queries), and that source, being closest to the data, can use its own native optimizer to choose how best to execute that query.

But Cassandra doesn’t work like that. CQL is a declarative language, but the optimization it does is to check whether your query is optimal to begin with. Say that two times fast.

If the query does not filter on the partition key³ or if it asks for the data in an order that is different from the ordering in the table, Cassandra will raise an error rather than risk tying up resources processing an inefficient query. This is not bad; it’s just different from how declarative queries are traditionally processed.
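The kind of up-front check involved can be sketched as follows. This is a deliberately simplified model, not Cassandra's actual validation logic: the `Table`, `Query`, `InvalidQuery`, and `check` names are hypothetical, and the sketch ignores reversed ordering, secondary indexes, and options that let a user opt in to expensive queries. It assumes a query must restrict every partition key column and may only ask for the stored clustering order.

```python
from dataclasses import dataclass


class InvalidQuery(Exception):
    """Raised instead of running a query the table layout cannot serve."""


@dataclass(frozen=True)
class Table:
    name: str
    partition_key: tuple   # columns that decide which node holds a row
    clustering_key: tuple  # columns that decide row order within a partition


@dataclass(frozen=True)
class Query:
    equality_filters: frozenset  # columns restricted with "="
    order_by: tuple = ()         # requested sort columns, if any


def check(table: Table, query: Query) -> None:
    """Reject the query up front rather than execute it inefficiently."""
    missing = [c for c in table.partition_key if c not in query.equality_filters]
    if missing:
        raise InvalidQuery(f"{table.name}: no filter on partition key {missing}")
    stored_order = table.clustering_key[: len(query.order_by)]
    if query.order_by and query.order_by != stored_order:
        raise InvalidQuery(f"{table.name}: cannot order by {query.order_by}")


sales = Table("sales_by_store", ("store_id",), ("sale_date",))
ok = Query(frozenset({"store_id"}), ("sale_date",))
no_partition = Query(frozenset({"sale_date"}))
wrong_order = Query(frozenset({"store_id"}), ("amount",))
```

Calling `check(sales, ok)` returns quietly, while `check(sales, no_partition)` and `check(sales, wrong_order)` raise `InvalidQuery`.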

So how to adapt? If the Cassandra way is to replicate your data and choose the right table for each query, then we will adapt to that. Stardog permits multiple data sources to provide the same data: this is a natural occurrence when unifying data across silos. We can extend this idea to Cassandra. We can create a mapping to each table and when we answer a query we can select the table that supports the query.
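As a rough illustration of that selection step, the sketch below picks the first mapped table whose key layout can serve a query. It is not how Stardog implements table selection: the table names, the `MAPPED_TABLES` registry, and the `supports` and `select_table` functions are hypothetical, and the support rule is a simplification.

```python
# Hypothetical mappings: table name -> (partition key columns, clustering columns).
# Both tables are assumed to hold the same sales data under different keys.
MAPPED_TABLES = {
    "sales_by_store": (("store_id",), ("sale_date",)),
    "sales_by_product": (("product_id",), ("sale_date",)),
}


def supports(layout, filter_columns, order_by=()):
    """Simplified rule: every partition key column is filtered, order matches."""
    partition_key, clustering_key = layout
    restricts_partition = all(col in filter_columns for col in partition_key)
    order_matches = tuple(order_by) == clustering_key[: len(order_by)]
    return restricts_partition and order_matches


def select_table(filter_columns, order_by=()):
    """Return the first mapped table that can serve the query, or None."""
    for name, layout in MAPPED_TABLES.items():
        if supports(layout, filter_columns, order_by):
            return name
    return None
```

A query filtering on `product_id` would be routed to `sales_by_product`, one filtering on `store_id` to `sales_by_store`, and one filtering only on an unmapped column would find no suitable table.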

For our initial release we’ll do just that, but there’s more we will do in future releases, such as adding hints to the mappings that indicate which data is redundant, to further help Stardog select the optimal combination of tables to use for each query.

Conclusion

Do you have a Cassandra-based data silo that needs unification with the rest of your enterprise data assets? Why not download Stardog and see how it fits your use case? The possibilities are virtually endless once you can access all your data from a single platform. We’re looking forward to hearing from you.

¹ MongoDB and Apache Cassandra.
² MongoDB, CosmosDB, Apache Hive, Impala, Teradata and Cassandra.
³ A Cassandra primary key consists of two parts: a partition key that determines which nodes will hold the data and a clustering key that determines the order of the rows for a given partition key.