Stardog is the world’s leading Enterprise Knowledge Graph platform. But what is a knowledge graph, and why should you want one?
The Problem with Data
Enterprise data is both the disease and its cure. Data will save us, and data will kill us all. At the same time. Enterprise data is the world’s most strategic asset going forward, while on the ground it’s painful: hybrid, varied, and ever-changing.
My favorite metaphor (to overcome) is the silo, which is where farmers store grain—keeping it safe from blight, pests, and the weather—so that no one starves when winter comes. Real silos have high utility in farming.
And in the enterprise, data silos allow local control and governance in a way that is often valuable. Legal and regulatory considerations may require that some silos remain as silos.
But a silo is a disconnected thing that prevents larger structures from being composed easily. Silos impede everything: app dev, data science and analytics, reporting, compliance, and AI initiatives.
Data silos mean unconnected data—and unconnected data sucks.
Connectedness is the Solution
An Enterprise Knowledge Graph is the only realistic way to manage enterprise data in full generality, at scale, in a world where connectedness is everything. This shouldn’t really surprise us given that graphs are all about connections and connectedness.
The CIO of an enormous American bank told me recently that his organization spends one-third of its IT budget annually on the Enterprise Data Silo Problem. That’s just unacceptable.
Among Stardog’s current customers there are many motivating examples where crucial decisions depend on connecting otherwise unconnected data:
- NASA engineers have saved countless hours assembling the answers they need from interconnected data to safely build rockets to send humans back to the Moon and to Mars.
- Dow Jones built new human-centered products, helping them maintain their leadership as the ultimate source for business news and data.
- Springer Nature launched SpringerMaterials to facilitate and enhance materials science research by providing subscribers on-demand and interconnected information.
- Schneider Electric created a building management platform to digitize power and controls in buildings for better sustainability, efficiency, comfort, and safety.
- A top international pharmaceutical company accelerates its drug target identification and drug repurposing efforts with a central data analytics platform.
Each of these is radically different but, from our point of view, also identical. At the same time.
How is that possible? Two large-scale historical technology trends are crucial here: first, graph is the data model for the next 20+ years; second, virtualization of everything. The value proposition of the Enterprise Knowledge Graph lives at the intersection of these trends. Let’s talk about both of them.
First Trend: The Rise of Graph
Knowledge graphs are already in use at the FAANG companies—Facebook, Amazon, Apple, Netflix, and Google—along with many other tech companies. Their knowledge graphs power recommendation engines, AIs, and search applications. Knowledge graphs help these companies succeed because they turn data into knowledge, creating powerful, user-friendly products and experiences that guide our day-to-day lives.
More powerful still is the application of the knowledge graph to enterprise data management, the Enterprise Knowledge Graph. The value proposition of an Enterprise Knowledge Graph is that all data, data sources, and databases of every type can be represented and connected.
In the context of enterprise data management, it doesn’t count if you only handle some of the data. And it doesn’t help if you can only do basic, low-level, or primitive things with all the data.
In order to create business value within the enterprise, you must be able to connect all the data that matters. Some of this data will be stored in tables, but also in PDFs, webpages, emails, and other semistructured and unstructured sources. Only semantic graph is able to represent data that is natively stored in other structures and connect all relevant metadata and context. So all true Enterprise Knowledge Graphs are backed by semantic graph.
With an Enterprise Knowledge Graph, different data dialects and structures embedded in legacy systems can be represented in the standard language of RDF. This allows for queries across relational databases, NoSQL databases, documents, and even geospatial data—seamlessly.
An EKG seamlessly connects and relates data from different structures, unifying context to turn data into knowledge.
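To make the unification idea concrete, here is a minimal sketch in plain Python. Tuples stand in for RDF triples, and the source systems, entity IDs, and field names are all invented for illustration; a real EKG would use RDF terms and mappings rather than these hypothetical helpers.

```python
# Each fact becomes a (subject, predicate, object) triple -- the shape RDF uses.
# The silos, IDs, and field names below are invented for illustration.

def rows_to_triples(rows):
    """Lower relational rows into triples: one triple per non-key column."""
    triples = set()
    for row in rows:
        subject = f"customer:{row['id']}"
        for column, value in row.items():
            if column != "id":
                triples.add((subject, f"schema:{column}", value))
    return triples

def docs_to_triples(docs):
    """Lower JSON-style documents into the same triple form."""
    triples = set()
    for doc in docs:
        subject = f"customer:{doc['customer_id']}"
        for key, value in doc.items():
            if key != "customer_id":
                triples.add((subject, f"schema:{key}", value))
    return triples

# A relational silo and a document silo describing the same entity:
crm_rows = [{"id": 42, "name": "Acme Corp", "region": "EMEA"}]
support_docs = [{"customer_id": 42, "open_tickets": 3}]

graph = rows_to_triples(crm_rows) | docs_to_triples(support_docs)
# Both silos now contribute edges to one node, customer:42.
```

The point of the sketch: once everything is a triple, facts that originated in different structures become edges on the same node, and one query can reach all of them.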
Semantic graph stands in stark contrast to the relational model. Finding connections between different relational databases requires time-intensive data modeling and query operations. Each new question produces a new dataset with its own schema. That’s not sustainable given the rate of new and unanticipated questions the business wants to ask of its data. Today, data and analytics leaders need to be able to quickly support iterative question-and-answer cycles from the business and easily dig into new territory in their data.
Instead of rows and columns and tables and keys, semantic graph organizes information using nodes and edges to represent entities and the relationships between those entities. This graph data model is fundamentally simpler than the relational model, yet it’s also far more expressive and powerful, easier to modify, and endlessly extensible.
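The “easier to modify, endlessly extensible” claim can be sketched in a few lines, again with plain tuples as a stand-in for a graph store. The entities and relationship names are hypothetical:

```python
# In the relational model, a new relationship usually means a schema migration
# (a new join table or column). In a graph, it is just more edges.
# All names here are illustrative.

graph = {
    ("alice", "works_for", "acme"),
    ("bob", "works_for", "acme"),
}

# A brand-new relationship type needs no migration -- just add triples:
graph |= {("alice", "mentors", "bob")}

def neighbors(graph, node, predicate):
    """Follow one edge type out of a node."""
    return {o for s, p, o in graph if s == node and p == predicate}
```

Note that the `mentors` relationship was never declared anywhere in advance; adding it changed no existing data and broke no existing query.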
While relational ruled for a time, the data model for the next 20+ years will be semantic graph. Increasingly, and for as far out as anyone can see, the world’s largest enterprises need a software platform that will solve the problem of unconnected data once and for all. That platform is an Enterprise Knowledge Graph.
Second Trend: Virtualization
What do I mean by virtualization? Consider, by way of analogy, Storage Area Networks (SAN). A SAN is a lie, a useful fiction that we agree to tell ourselves. A SAN says there is one infinite-sized hard drive that everyone in the enterprise can read from and write to whenever they want without regard for the laws of physics. That’s a useful lie; which is to say, it’s an abstraction.
From one perspective, a SAN is just an illusion. There’s no way to get around the laws of physics. And yet with the proper implementation, technical know-how, and capital investment, we can just act as if there is one big hard drive and…it just works.
Can we generalize this analogy? Yes. The right abstraction often has exactly this kind of “useful fiction” feel—SAN virtualizes storage; Cloud virtualizes compute; DCOS virtualizes infrastructure.
So what virtualizes data? What virtualizes the silos?
How would you build a thing that made it appear as if the data silos didn’t exist but the data—and databases, data sources, data-providing services, and so on—in them still did? Virtualization here doesn’t just mean virtual or federated query. It means the same kind of useful fiction about data as we have with compute, storage, etc. What it really means here in operational terms is connecting data at the compute layer rather than at the storage layer only. Virtualization means the appropriate abstraction for the task and context: using graph to connect and then query all the data, irrespective of where it lives at the storage layer.
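A toy sketch of that “useful fiction”, with the silos, classes, and record shapes all invented for illustration: the caller asks one facade a question, and the facade answers at the compute layer by fanning out to the sources in place, without ever copying their data into a central store.

```python
# A toy "virtualization" facade. The sources and their shapes are hypothetical.

class SqlSilo:
    """Stands in for a relational source queried in place."""
    def __init__(self, rows):
        self.rows = rows
    def lookup(self, customer_id):
        return [r for r in self.rows if r["id"] == customer_id]

class DocSilo:
    """Stands in for a document source queried in place."""
    def __init__(self, docs):
        self.docs = docs
    def lookup(self, customer_id):
        return [d for d in self.docs if d["customer_id"] == customer_id]

class VirtualGraph:
    """The useful fiction: callers see one thing; data stays in the silos."""
    def __init__(self, *silos):
        self.silos = silos
    def facts_about(self, customer_id):
        facts = {}
        for silo in self.silos:
            for record in silo.lookup(customer_id):
                facts.update(record)
        return facts

vg = VirtualGraph(
    SqlSilo([{"id": 7, "name": "Globex"}]),
    DocSilo([{"customer_id": 7, "churn_risk": "low"}]),
)
```

From the caller’s perspective the silos don’t exist; from the operator’s perspective nothing was replicated or moved.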
That’s why we say, in the context of treating enterprise data, all of an enterprise’s data, as an actionable asset and as a whole, the only workable abstraction is the Enterprise Knowledge Graph.
Now, beyond all this, data virtualization has the added benefit of being a proven, cost-effective data integration technique, as it eliminates the expense of replicating, moving, and storing data multiple times. Virtualization connects source data directly, cutting out what would otherwise be a complex and cumbersome ETL system migrating data from dozens or even hundreds of systems and external vendors into a single repository. Copying data for each new analysis leads to human error and data drift. It leads to uncertainty about which data sources are trusted, current, or canonical. Data virtualization provides access to live source data, which means you always get the most up-to-date data every time you ask a question.
So, not only is graph-based virtualization the right abstraction, and not only does it connect data at the compute layer as opposed to the storage layer, it also improves enterprise data management workflows and data quality, critical when managing data at scale.
Graph Database vs Enterprise Knowledge Graph
Graph databases are awesome. We think they are so awesome that we built one, from scratch. But that wasn’t enough to make an Enterprise Knowledge Graph platform.
Graph database vendors, including Neo4j, which is the leading graph database, are busy turning relational silos into graph silos. We applaud that effort. Graph silos are often better than relational silos.
But at the end of the day, a silo is still a silo, whether it’s got tables, key-value pairs, or nodes and edges inside of it. Unconnected data sucks.
Plain Graph Databases
Plain graph data stores like TigerGraph or CosmosDB, both lovely systems with much to admire technically, are really more like data structure servers than databases. They support graph as a data structure. They tightly couple traversal code and the graph itself, with no real abstraction between. Their primary operation is traversing that data structure, that is, a graph traversal API. Traversal is a universal, but low-level, interaction pattern. Imagine giving someone directions across the country without using street or highway names or any cardinal directions.
But a data model is more than a data structure, and a database should support a data model independently of its implementation. Graph databases proper—for example, Neo4j, MarkLogic, or GraphDB—support evaluation of a query language, an interaction pattern that treats the graph as a data model rather than merely as a data structure.
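The contrast between the two interaction patterns can be sketched over one tiny graph. The place names and edge labels are invented, and the declarative half is only a pattern-matcher in the spirit of a SPARQL or Cypher query, not either language itself:

```python
# Same graph, two interaction patterns. Names are illustrative.
edges = {
    ("nyc", "road", "chicago"),
    ("chicago", "road", "denver"),
    ("denver", "road", "sf"),
}

# 1) Graph as data structure: imperative traversal, step by step,
#    with the caller responsible for all the navigation logic.
def reachable(edges, start):
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for s, _, o in edges:
            if s == node and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen

# 2) Graph as data model: a declarative pattern match -- say what you want,
#    not how to walk the structure. None acts as a wildcard.
def match(edges, pattern):
    s_pat, p_pat, o_pat = pattern
    return {(s, p, o) for s, p, o in edges
            if s_pat in (None, s) and p_pat in (None, p) and o_pat in (None, o)}
```

The first function is the traversal API in miniature: the caller encodes the route. The second is the query-language stance: the caller states a pattern and the engine decides how to satisfy it.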
In contrast, an Enterprise Knowledge Graph platform enriches and amplifies graph as a data structure and graph as a data model into something greater than the sum of its parts. And it does this by adding a Knowledge Toolkit to a Graph Database.
What distinguishes an Enterprise Knowledge Graph platform from a plain old graph database? The difference is using graph for data storage versus using graph for data management.
In an Enterprise Knowledge Graph, the trends of graph and data virtualization converge. The data model exists at the compute layer, not at the storage layer only, which means you can modify the schema at any time by adding new nodes and edges. You don’t have to struggle, at a single point in time, to come up with one shared data model covering all current and future enterprise data needs. That’s a fool’s errand.
It also means that the enterprise can have many different, even mutually incompatible, schemas that all apply discretely to the common pool of connected data. And that means you never have to force-fit emerging data sources and use cases to standardized rules set from an already outdated perspective.
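One way to picture “many schemas over one pool of connected data”: each team’s schema is just a projection over the shared facts. The product, teams, and predicates below are hypothetical:

```python
# One pool of connected data; two discrete 'schemas' (here reduced to the sets
# of predicates each team cares about) applied over it. Names are illustrative.
pool = {
    ("product:1", "name", "Widget"),
    ("product:1", "list_price", 99),
    ("product:1", "supplier", "acme"),
    ("product:1", "hs_code", "8473.30"),
}

finance_view = {"name", "list_price"}
logistics_view = {"name", "supplier", "hs_code"}

def apply_schema(pool, predicates):
    """Project the shared pool through one team's schema."""
    return {(s, p, o) for s, p, o in pool if p in predicates}
```

Neither view constrains the other, and neither required reshaping the pool: the schemas live at the compute layer, applied at query time.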
The result? The same data can be reused for new questions, without starting from scratch. Insight is derived at the speed the business requires it.
Norvig’s Other Law
Peter Norvig famously said that more data beats smarter algorithms. We think that’s true. And it’s the other way a knowledge graph beats a plain graph database: a knowledge graph about X knows everything about X that’s worth knowing.
An Enterprise Knowledge Graph platform supports traversals and queries of the graph data structure and data model, respectively, too. But it adds a layer of machine-understandability by supporting a richer semantics for the graph. This is all enhanced by inference: the use of logical reasoning to understand relationships and find implicit facts within the data. Your Enterprise Knowledge Graph knows the difference between graph as data structure—there’s an edge between node A and node B—and graph as something more—for example, a symmetric or reflexive or transitive property between a Person and another entity.
But the power of going beyond data structures or models to graph as knowledge representation is further enhanced by having access to all, or even most, of the relevant data. An Enterprise Knowledge Graph, unlike a plain graph, adds more to the data by turning it into knowledge, and it does that to and with all the data, which, if Norvig’s Other Law is correct, creates another layer of enterprise value.
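A minimal sketch of the kind of inference described above, assuming a single transitivity rule and hypothetical part names; real platforms implement this with standardized semantics (e.g. OWL property characteristics) rather than hand-rolled rules:

```python
# Forward-chaining over one rule: if a property is transitive, derive the
# implicit edges the data never explicitly stated. Names are illustrative.
facts = {
    ("engine", "part_of", "rocket"),
    ("turbopump", "part_of", "engine"),
}
transitive_properties = {"part_of"}

def infer(facts):
    """Apply the transitivity rule until no new facts appear (a fixed point)."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for s1, p1, o1 in list(inferred):
            if p1 not in transitive_properties:
                continue
            for s2, p2, o2 in list(inferred):
                if p2 == p1 and s2 == o1 and (s1, p1, o2) not in inferred:
                    inferred.add((s1, p1, o2))
                    changed = True
    return inferred
```

No one ever recorded that the turbopump is part of the rocket, yet the closure contains that fact: that is the “implicit facts” the graph-as-knowledge layer surfaces.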
Knowledge for Enterprise Data Management
Now, I said before that enterprise data is both the disease and its cure. So in addition to semantic graph, virtualization, and inference, there’s other necessary functionality to support enterprise data management at scale.
Real Enterprise Knowledge Graph platforms require integrated machine learning, data quality management tools, query explanation, and model checking. They also need a suite of tools and Connectors to make it easy to connect, map, and model all the data that matters, regardless of its structure. By deeply integrating all these features with a graph database, the Enterprise Knowledge Graph platform supports a much wider and deeper range of services.
All of this is in service of addressing your organization’s unconnected data. But let’s be clear, the Enterprise Knowledge Graph doesn’t (necessarily) destroy silos. Some silos should be consolidated; some should be left in place. There is no universal answer to that question. It always depends on the silos and the situation. Solutions built on the wrong abstraction inevitably dictate a fixed answer to that question. That’s the wrong kind of disruption.
An Enterprise Knowledge Graph, built on the right abstraction, supports the widest possible means to manage data silos. And it can do that because it can query data silos, or even just parts of data silos, in place or pull their data via ETL in any arbitrary combination that best suits business needs.
An Enterprise Knowledge Graph makes it safe to proceed as if the silos don’t even exist and, thus, lets the enterprise act as if there’s just connected, actionable knowledge.
The Connected Enterprise
The Enterprise Knowledge Graph is the platform to power the connected enterprise. A connected enterprise is one where data, no matter where it is stored, is connected at the compute layer, so that all aspects of the enterprise can make decisions based on knowledge. The key is to build a reusable, resilient data foundation that can keep pace not only with well-understood, scoped projects, but also address unanticipated questions. Knowledge-based enterprises are proactive rather than reactive.
The modern world is complex. Enterprises are inherently networks: loosely connected institutions distributed over space and time. But data management systems haven’t kept pace. They’re mostly still stuck in the old world, where top-down, command-and-control systems consolidated data at the storage layer rather than connecting it at the compute layer.
All the interesting, hard challenges that enterprises face today are horizontal rather than vertical in nature. Conventional enterprise data management is focused on lines-of-business, which are vertical in nature. This data is sales; this other data is marketing; this data is R&D. But the connected enterprise is arranged horizontally, across lines of business, because the problems to be solved don’t care about our org charts! In fact, partial data is more often meaningless or misleading rather than merely being “some of the truth”. To have some of the truth is in fact to have none of it.
The FAANG companies have this figured out already: to successfully leverage data for competitive advantage, you need a knowledge graph. The future of data management and the success of a connected enterprise depends on the Enterprise Knowledge Graph.