Our latest benchmark report, Trillion Edge Knowledge Graph, is the first demonstration of a massive knowledge graph that consists of materialized data and Virtual Graphs spanning hybrid multicloud data sources. In it, we prove it is possible to have a 1 trillion-edge knowledge graph—for comparison purposes, this is twice as large as Google’s knowledge graph, which has 500 billion triples—and deliver sub-second query times while using a distributed infrastructure that provides a 98% cost savings over traditional approaches that store all the data in one location. This has the potential to usher in a new era of data management, one where the knowledge graph is a core component of enterprise data infrastructure that drives profitability and competitive advantage.
In the typical modern enterprise, the data landscape consists of a combination of on-prem and cloud sources distributed across a host of vendors. While this delivers operational and commercial benefits, it also increases the complexity of unifying data and deriving insights from data assets. Stardog’s Enterprise Knowledge Graph platform addresses this issue by connecting data based on what the data means, not simply by colocation. We use Virtual Graphs to connect to data in place, without ingesting it all into a single store.
A common concern we hear about this approach is latency: people worry that virtualized sources will perform poorly. This benchmark demonstrates that using virtualized sources causes no performance degradation; Stardog’s knowledge graph delivers sub-second query times in line with graph databases that use fully materialized data. This holds even at massive scale, and Stardog offers the further benefit of dramatically lowering operational costs by reducing the number of servers you need.
Below you will find a preview of some of the key findings in our benchmark. To read the full report, click here!
Scalable up to 1 trillion triples
The purpose of this demonstration was to create a single unified knowledge graph with 1 trillion edges against which we could run structured graph queries. In reality, the data was spread across multiple heterogeneous data sources, reflecting the common situation in large enterprises. To match this distributed landscape, we created an environment where our 1 trillion-edge graph was spread over three systems: Stardog, Amazon Redshift in AWS, and SQL Server in Azure (see graphic below). Stardog executes SPARQL graph queries by reaching out to Redshift and SQL Server as needed, hiding the complexity of data distribution from end users.
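From the user’s point of view, all of this looks like one graph. As a sketch (using an illustrative Virtual Graph name and BSBM-style vocabulary, not the actual identifiers from the benchmark), a single SPARQL query can combine the materialized graph in Stardog with a virtualized source through Stardog’s `GRAPH <virtual://...>` syntax:

```sparql
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?product ?label ?price
WHERE {
  # Matched against the materialized graph in Stardog
  ?product a bsbm:Product ;
           rdfs:label ?label .
  # Matched against a Virtual Graph; Stardog pushes this part
  # down to the external source as SQL
  GRAPH <virtual://redshift_offers> {
    ?offer bsbm:product ?product ;
           bsbm:price ?price .
  }
}
LIMIT 10
```

The end user writes ordinary SPARQL; which triples come from local storage and which are translated into SQL against Redshift or SQL Server is decided by the query planner.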
In order to demonstrate the scalability of the Stardog system we chose the Berlin SPARQL Benchmark, which is commonly used for measuring the performance of systems that support SPARQL query answering. The benchmark suite is built around an e-commerce use case, where a set of products is offered by different vendors and different consumers have posted reviews about products.
The BSBM dataset contains eight main classes (tables) and can be generated at different scales. The data is product-oriented, with information on products, product features, vendors and their offers, reviews, etc.
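To make the shape of the data concrete, here is a toy Turtle fragment in the spirit of the BSBM schema (the `ex:` identifiers are invented for illustration):

```turtle
@prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

ex:product42 a bsbm:Product ;
    rdfs:label "Example Product" ;
    bsbm:productFeature ex:feature7 .

ex:offer99 a bsbm:Offer ;
    bsbm:product ex:product42 ;
    bsbm:vendor ex:vendor3 ;
    bsbm:price 129.95 .

ex:review5 bsbm:reviewFor ex:product42 ;
    bsbm:rating1 8 .
```

Each such statement is one edge in the graph; generated at scale, records like these add up to the trillion edges in the benchmark.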
Typically, all the data would be loaded into a single storage system, but we partitioned it into three parts and loaded it into Stardog, Amazon Redshift in AWS, and SQL Server in Azure. Stardog supports connectors for over 100 different data stores. This benchmark environment corresponds to how an enterprise might store the data behind a knowledge graph: part in a graph database and the rest in two common databases across two cloud providers.
The BSBM dataset comes in both an RDF graph version and a relational version. We used the RDF representation to materialize graph data in Stardog and the relational version to load data into Redshift and SQL Server. We then defined Virtual Graphs in Stardog, using the Stardog Mapping Syntax to map the relational data sources to RDF. As a result, no data is moved from the relational sources into Stardog; instead, Stardog answers users’ SPARQL queries by automatically converting all or part of a query into SQL for the external data sources.
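As an illustration, a Stardog Mapping Syntax (SMS2) mapping for a relational product table might look like the following sketch (the table and column names are invented; the actual mappings used in the report may differ):

```sparql
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

MAPPING <urn:products>
FROM SQL {
  SELECT product_id, label FROM products
}
TO {
  ?product a bsbm:Product ;
           rdfs:label ?label .
}
WHERE {
  # Build an IRI for each row from its primary key
  BIND(template("http://example.org/product/{product_id}") AS ?product)
}
```

At query time, a match against `?product rdfs:label ?label` in this Virtual Graph is rewritten into a SQL `SELECT` over the `products` table; the rows never have to be copied into Stardog.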
The size of the data distributed over the three data sources looks as follows:
| Data Source | Graph Type | Number of Nodes | Number of Edges | Amount of Data |
| --- | --- | --- | --- | --- |
| Stardog | Materialized | 8.8 billion | 115 billion | 6.1 TB |
| SQL Server | Virtualized | 30 billion | 220 billion | 4.5 TB |
| Redshift | Virtualized | 57 billion | 660 billion | 2.8 TB |
Performant on distributed, real-world data
The following table shows the average execution time for each query. Before running the actual tests, we ran 500 iterations of random query mixes to warm up the system and its caches. We then executed 20 query mixes (each with 25 randomly instantiated queries) without caching and computed the average execution time for each query. The average execution time was below one second for every query but one, showing that performance at this scale is not a problem.
| Queries | Average Number of Results | Average Execution Time (sec) |
| --- | --- | --- |
Our results are the first demonstration of a distributed knowledge graph implementation at the scale of 1 trillion edges. Typical graph database benchmarks are much smaller in scale: the largest reported results for Ontotext GraphDB (17 billion edges), Neo4j (20 billion edges), and TigerGraph (67 billion edges) are at least an order of magnitude smaller.
Several RDF-based systems have published benchmarking results for graphs with 1 trillion triples, for example Cambridge Semantics, Oracle and Cray. There are some key differences between the results we achieved here and what has been published by other vendors.
In all of these previous results, all the data was loaded into a single location. Copying all the data into a single database is practically indistinguishable from data warehousing. In contrast, our setup queries data where it was designed to be stored, without creating new copies, which makes data lineage and traceability straightforward and speeds time to insight for customers.
All of the benchmarks mentioned above use the Lehigh University Benchmark (LUBM), which has 14 fixed queries. The exact same queries are executed multiple times during benchmarking, making it much easier to cache results. In contrast, no query is executed more than once in BSBM, since each query mix is completely randomized; this is a better simulation of an enterprise workload.
98% more cost effective
All of the previous benchmark results rely on a very large number of servers to achieve trillion-triple scale. For example, the Cambridge Semantics benchmark uses a cluster of 200 n1-highmem-32 server instances on Google Cloud Platform. Each server has 208GB of memory and 32 vCPUs, corresponding to 32 Intel hyper-threads on 16 hardware cores. At the time of this writing, the n1 family of Compute Engine instances costs $0.031611 per vCPU-hour plus $0.004237 per GB-hour of memory, so the Cambridge Semantics cluster would cost roughly $378/hour (200 × (32 × $0.031611 + 208 × $0.004237)).
In contrast, we used a single Stardog server with 192GB of memory to load the data and switched to a machine with 976GB for queries. The server we used for Stardog costs $6.60/hour, whereas the AnzoGraph cluster costs $378/hour. Even when the cost of Redshift and SQL Server is taken into account (an additional ~$2/hour), our distributed setup’s operational cost is about 98% lower than the AnzoGraph setup’s ($8.60/hour vs. $378/hour). And this figure still understates the ongoing cost difference between operating a single server and a 200-node cluster with respect to DevOps and other support personnel.
Read the full report
The above is just a preview of what’s contained in this benchmark report. Check out the full report to see all our findings: Trillion Edge Knowledge Graph.