How Stardog Uses AI

Kendall Clark

May 23, 2023, 6 minute read

Where We Came From and Where We’re Going

In the early days of Stardog, people assumed we were an AI company because we were AI people: we three founders met and worked together at an AI lab at the University of Maryland. The question then was what three AI people were doing starting a data management company. We answered that Stardog was predicated on a long-term bet that data management and knowledge management were on a collision course, a course set out by Google, and that the Knowledge Graph was the intersection point. From 2023’s perspective, it looks like our long-term bet was correct.

AI has many disciplines and branches; the two big camps of AI have historically been Symbolic AI and Statistical AI, sometimes pithily described as the Neats and the Scruffies, respectively. “Neats” because Symbolic AI is based on formal logic; “Scruffies” because Statistical AI is based on, well, statistics, which is generally far more tolerant of noise than logic is.

The early history of AI was dominated by the Neats, but since the rise of the Web the Scruffies have been very much in charge of AI progress. That’s entirely because the Web made massive amounts of data available, which gave learning and training strategies the fuel they needed to work. The LLM revolution is one result of that process. But developments in Symbolic AI, including business rule engines, automated reasoning, and automated planning, continued apace as well, and some of them are now quite mature and suffused throughout industry.

The other long-term bet that Stardog is predicated on is that real progress in AI will require a hybrid of Symbolic and Statistical AI. Not necessarily an algorithmic hybrid — i.e., neurosymbolic AI — but certainly real progress requires systems that implement both approaches. While Stardog has to date primarily used Symbolic AI and logistic regression, Stardog Voicebox is different; it fully embraces deep learning, that is, neural nets, which are foundational to Statistical AI. It hasn’t been a quick trip here but the AI that Stardog uses is the heir of the Scruffies and the Neats.

That milestone means it’s time for us to set out the principles that inform our product strategy and the role AI plays in that strategy.

Stardog’s AI Strategy: Hybrid, Applied, In-house, & User-focused

Hybrid. The requirements for automated and autonomous data management systems in the hybrid multicloud era include, at one end of the spectrum, crisp, provably correct, trusted answers to questions. And at the other end of the spectrum they include fuzzy, not-terribly-wrong answers to other questions and solutions to other tasks, as well as many intermediate points along the way.

For example, Stardog’s query-time inference engine is powered by decidable fragments of first-order logic with excellent performance properties. That makes querying multiple intersecting hierarchies (of the sort you see in Product 360, financial risk management, and pharma supply chain use cases, to name only three examples) really easy, and it makes the queries immune to (many) changes. In Stardog 9 we released an alpha version of Stardog Stride, our next-generation query-time inference engine. Stride supports Datalog with stratified negation and aggregation, which means it supports non-monotonic rules that go beyond the expressivity typically available in Knowledge Graph systems. This new inference engine is informed by all the commercial use cases we’re focused on; they require reasoning over large amounts of data with rich, logical, user-defined rules.
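To make "stratified negation and aggregation" concrete, here is a minimal sketch in plain Python. It is not Stride's rule syntax or implementation; the category names and the predicates `direct`, `sub`, and `unrelated` are hypothetical and exist only for illustration. The positive recursive rule (the hierarchy closure) is evaluated to a fixpoint first, and only then do the negation and aggregation rules read its finished result.

```python
from itertools import product


def transitive_closure(direct):
    """Stratum 0 (positive, recursive):
    sub(x, z) :- direct(x, z).
    sub(x, z) :- sub(x, y), direct(y, z).
    """
    sub = set(direct)
    while True:
        new = {(x, z) for (x, y) in sub for (y2, z) in direct if y == y2} - sub
        if not new:  # fixpoint reached: no new facts can be derived
            return sub
        sub |= new


def unrelated(categories, sub):
    """Stratum 1 (negation over the completed `sub` relation):
    unrelated(x, y) :- category(x), category(y), x != y,
                       not sub(x, y), not sub(y, x).
    """
    return {
        (x, y)
        for x, y in product(categories, repeat=2)
        if x != y and (x, y) not in sub and (y, x) not in sub
    }


def descendant_counts(categories, sub):
    """Stratum 1 (aggregation): n_desc(c, count(x)) :- sub(x, c)."""
    return {c: sum(1 for (_, parent) in sub if parent == c) for c in categories}


# Hypothetical product hierarchy, stored as (child, parent) edges.
direct = {
    ("laptop", "computer"),
    ("computer", "electronics"),
    ("phone", "electronics"),
}
categories = {c for pair in direct for c in pair}

sub = transitive_closure(direct)
pairs = unrelated(categories, sub)
counts = descendant_counts(categories, sub)
```

The `unrelated` rule is non-monotonic: adding a new `direct` edge can remove pairs that were previously derived, which is why the closure has to be complete before the negated rule is evaluated.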

Stardog supports logic-powered inference rules because that’s how business logic works. Stardog Voicebox, by contrast, is powered by LLMs because real-world input from people in ordinary language is messy. That includes using ordinary language and LLMs to iteratively generate crisp inference rules.

Our AI strategy is hybrid because we are vehement pragmatists. The future of AI is hybrid because human-level tasks are not pure, just like the world is not pure. The world is not one thing; the world is the multiplicity of things.

Applied, In-house. We are not in the game of creating foundational LLMs, NLP, or AI infrastructure because that would distract us from our job. Our job is to connect enterprise data so that decisions are data-powered and our customers still get to go home on time to play with their kids or walk their dogs. But just as our job isn’t to create foundational statistical AI systems, it’s also not our job to take those systems off the shelf and apply them as-is to our customers’ problems. Our job is to build a product that applies foundational AI techniques to the specific challenges our customers face around things like data modeling, data mapping, query generation, rule creation, etc.

Stardog Voicebox is powered by LLMs fine-tuned at Stardog. More bluntly, Stardog does not use OpenAI. Rather, we’re using open-source foundational models and applying (at least) two layers of fine-tuning to them: first, a pre-production fine-tuning stage, which is ongoing; second, a fine-tuning stage that will be continuous, based on customer data and user feedback (i.e., RLHF).

We’ve adopted this strategy for three tactical reasons:

OpenAI models are generic in nature. The right strategy with Voicebox is to align LLMs with specific jobs-to-be-done by users so that those jobs are completed faster, cheaper, and easier with Voicebox. Voicebox isn’t powered by one LLM; it’s powered by several, and the number will grow as Voicebox expands.
Cost control. If we rely directly on OpenAI’s (or any other) LLM platform, we won’t have cost control, and so we can’t pass it on to customers. We can’t create value for our customers or for ourselves by paying retail.
Data privacy, security, and good governance. Our customers in financial services, manufacturing, and life sciences rely on us for world-class data security and privacy. We can’t ensure it by sending their data to any third party; it’s entirely unclear how any enterprise-ready Cloud platform that is SOC 2 compliant can blat customer data around to god-knows-where.

User-focused. We take a strict jobs-to-be-done approach to product development, especially including our use of AI. We don’t need any particular input from customers to know that they’d like every query they send to Stardog to be executed faster in every new release. We just assume everyone wants Stardog to always get faster.

But when it comes to applied AI, if you lose sight of the user’s task, you’re sunk. Our jobs-to-be-done focus is on the following user tasks:

Question answering without any need to write queries or to use a BI tool.
Ordinary language approach to full lifecycle management (creation, maintenance, customization) of
- queries, including debugging, optimization, and repair,
- data models,
- data mappings,
- inference rules, and
- data constraints & data cleansing & quality rules.
Semi-supervised integration of structured, semi-structured, and unstructured enterprise data — this will be a multi-pronged extension and fusion of Voicebox, Virtual Graphs and BITES.
Higher-order operations leading to the world’s first Autonomous Knowledge Graph platform — more about this one in the next few months.

The Big Vision

Our mission is universal self-service analytics. Everyone working at any large enterprise should be able to answer any question, subject to data governance and access control, and get a trusted, timely, and accurate answer based on public and private data without having to learn a query language or a BI tool or to wait on IT to move or copy data into Stardog.

AI plays a key and growing role in our ability to deliver on this vision, and we’re excited to see what the hybrid future and a diversity of AI approaches brings to Stardog’s customers.