The Stardog Voicebox Vision

Kendall Clark

Jul 31, 2024, 13 minute read

What’s the big vision for Stardog Voicebox?

Anyone can ask any question of any data and get an accurate, timely, hallucination-free answer immediately.

Everyone who sees Voicebox has the same response—“How can we get this with our data?” There’s a growing awareness that exploratory analytics and data democratization go hand-in-glove and are key to not only digital transformation but also to turning large enterprises from data laggards into data leaders.

In this post I describe the Voicebox vision in full. We will follow up soon with two companion pieces explaining how Voicebox creates value in two of our key verticals: financial services and defense/intelligence.

TLDR: The Vision in 7 Questions

Let’s think about the Voicebox vision schematically, step by step. It’s like a puzzle made of these seven parts, each of which answers a simple question—

Who can ask Voicebox questions? Anyone can, subject to access controls, of course; but especially knowledge workers in regulated industries with high-stakes use cases.
How do people ask Voicebox questions? With ordinary language that expresses how they think about the world.
What kinds of questions can Voicebox answer? Voicebox answers nine types of questions—
1. Factual, including needle-in-haystack and data path questions that are key to exploratory analytics;
2. general background knowledge via ChatGPT, Claude, or other foundational model integrations;
3. multi-hop questions with reasoning (i.e., the ability to source answers from multiple locations and fuse them dynamically into a coherent answer);
4. metadata control plane & data discovery since Voicebox works in Stardog’s Knowledge Catalog, too;
5. analytic questions about tabular data, especially tabular quantitative data;
6. geospatial questions;
7. predictive questions;
8. anomaly detection questions; and
9. root cause analysis questions, too.
Why should people trust Voicebox’s answers? Because Voicebox answers are derived from your enterprise data and are fully traceable, explainable, integrated with data lineage, and 100% hallucination-free.
Where do people ask Voicebox questions? In Voicebox itself, but also in the places where real work happens, i.e., Microsoft Teams and Slack.
What data can Voicebox answer questions about? Voicebox is the first AI data assistant to include in its scope all enterprise data sources, that is, any database record or document, including metadata sources from data catalogs.
When can Voicebox answer questions? Right now! Because Voicebox doesn’t require data to be moved or copied anywhere to answer questions, Voicebox answers are soft real-time, always current and fresh, and never based on old, stale copies of data.

A multi-hop question to find exposure to California munis in Stardog Voicebox Wealth Adviser.

Obviously there’s a lot more to say in unpacking this schematic, so let’s get into it.

Anyone Means Everyone

The biggest solvable obstacle to data-driven decision making in the enterprise is that knowledge workers can’t get access to data that’s relevant to them. This is exactly why the universal response to Voicebox is to ask for Voicebox with their data. This isn’t primarily a security or governance problem. Data access is mostly a problem of data silos and data integration. Siloed data is the default condition in the enterprise, which leaves data inaccessible to knowledge workers.

Year over year in McKinsey’s work on digital transformation, what distinguishes data leaders from data laggards is the percentage of knowledge workers who are enabled to self-serve analytically with respect to data.

🚨 Of course Voicebox isn’t the only game in town. But typically the other ways of interrogating data to achieve aha! moments require help or investment from IT or data science; or they require someone to write some really advanced queries, say; or they require you to already know exactly what you’re looking for. None of these restrictions is consistent with self-service enablement of the sort that McKinsey calls data leadership.

Source: Rewired and running ahead: Digital and AI leaders are leaving the rest behind

Data leaders have knowledge workers who can ask questions about the biz and then answer those questions without relying on IT, data science, etc. Data laggards also have knowledge workers who ask questions, but those questions only get answered weeks or months later by IT, data science, etc., if they ever get answered at all.

🗣 The Stardog Voicebox vision is to enable anyone to use data to make decisions, create insights, and drive business forward. And by “anyone” we really mean “everyone”, i.e., any worker who needs data to win at work. And that’s everyone in the AI era.

Stardog Voicebox Answers Questions Where Real Work Happens

McKinsey is right. The challenge is to make data easy to consume, and we do that in Stardog Voicebox primarily by making natural language the universal interface to enterprise data. But we also do it by pushing Voicebox capabilities into the digital places where real work happens like Slack and Microsoft Teams.

Stardog Voicebox answering questions in Slack

Can Ask Any Question

Anyone can ask any question? Yeah, that’s the vision in full. At least, any question covered by these fundamental types of analytics questions: factual, including needle-in-haystack and data paths; multi-hop reasoning; metadata control plane (“what data sources contain information about Euro-denominated trades for ESG futures on Borsa Italiana?”); tabular analytics (“what is the relationship between revenue and expenses in the last three FY?”); geospatial (“how many Chinese-registered trawlers are operating right now within 10 nautical miles of Roatan?”); background knowledge; predictive; anomaly detection; and root cause analysis.

Most of these are clear but the last three need a bit more explanation—

What’s going to happen next with account churn? Which product is going to ship late? (Predictive)
Which of these new accounts is unusual or out of compliance? Which shipments from Shenzhen to NYC in April were suspicious? (Anomaly Detection)
Why did batch #03245 of sumatriptan spend 28 hours in quarantine at the Battlefield plant? (Root Cause Analysis)

Sometimes the answer you need is a report!

OpenAI ChatGPT Integration for Background Knowledge Questions

🗣 What is OpenAI really good for? Well, for Voicebox it’s really best for answering a particular sort of user question, namely general or background knowledge questions, in the context of interactions with Voicebox that answer very specific, enterprise-relevant questions that OpenAI will never understand. For example, I want to ask Voicebox how many customers we have in the capital of Nepal, but I can’t remember the name “Kathmandu”, so I ask OpenAI’s ChatGPT “what’s the capital of Nepal?” And then I ask Voicebox “how many customers do we have in Kathmandu?”

Voicebox does tabular reasoning, tabular analytics, and even creates basic visualizations for data that it has connected across otherwise disconnected enterprise data silos.

Voicebox’s Semantic Layer Means You Can (Probably) Skip Finetuning

Most orgs shouldn’t and won’t train a foundational model. This is generally well understood. But most orgs really shouldn’t fine-tune a foundational model either; at least, they don’t have to. Fine-tuning is a kind of model adapter pattern. But there are other ways to provide that capability. Voicebox’s ability to reach into any sort of data and answer a wide range of questions means that it will act as a dynamic fine-tuning layer for enterprise data.

Hint: this is the real enterprise semantic layer and you won’t get this anywhere else since no relational data-based semantic layer can deliver it.

Other hint: all the “semantic layer” offerings are based at best on the relational data model—including AtScale, dbt, Cube, etc. Some of these are merely a bag of loose metrics.

Good products? Yes. But are they gonna remove the need to finetune on your data for GenAI? No chance.

Voicebox operationalizes this semantic layer right out of the box—a unifying fabric that captures broad domain knowledge about your operations and business processes like customers, suppliers, materials, and complex relationships and hierarchies between them. Previously that required expensive application development or integration with some other tool or product.

Of Any Data

No, really. Other data assistants and chatbots say this, but they really mean “all your documents”. We mean all the data including every database record, document, and even metadata, too. That’s the power of backing an AI data assistant with an enterprise knowledge graph.

Structured, Semi-structured, & Unstructured…and Metadata, Too!

Nearly every GenAI app is focused on smart document handling only. All of these RAG apps are search engines in the sense that they do (better) what search engines do now.

Stardog Voicebox isn’t a search engine. It isn’t documents-only and we aren’t looking for doc chunks to cite to answer user questions like “how much vacation leave do I get after my maternity leave if I have 10 years of service?”

🗣 Stardog Voicebox is a question-answering machine. We extract facts from enterprise documents and connect them to facts within enterprise databases to create a complete picture of the enterprise data landscape. That’s the scope of enterprise data that Voicebox answers come from. All the data matters! We call this SafetyRAG since it’s also key to how we eliminate LLM hallucinations.

The vision in full requires full-spectrum connected data and that’s entirely the point of a native enterprise knowledge graph like Stardog. In this view of the world, the AI document assistants like Glean, Writer, and Hebbia (all great tools that we admire, frankly) are perfectly complementary with Stardog Voicebox but they aren’t substitutes for Voicebox since none of them understands or connects to enterprise databases.

Our vision treats documents like unstructured knowledge-containers and treats database records like structured knowledge-containers and then connects and elevates all that knowledge for Voicebox to answer questions with. We are less interested in the perfect document summary or which paragraph of the employee manual indicates the dental copay. Those are important but they aren’t connected to or relevant for knowledge workers achieving aha! moments.

Accurate, Timely, Hallucination-free Answer

It won’t do anyone much good for Voicebox to just answer questions. What creates value, what leads to aha! moments, is when Voicebox provides accurate, timely, and hallucination-free answers to users’ most important questions.

The value of accuracy is obvious; but timeliness is no less important. One reason we’ve built Voicebox is that our customers need to move faster than IT can support. Velocity kills the competition! We don’t move or copy data, so answers are always fresh and current; that is, timely.

Hallucination-free is a new and differentiated requirement for a real AI data assistant like Voicebox. It’s another reason Voicebox can’t be replaced by a “chat with your documents” RAG app that hallucinates answers.

See Safety RAG: Improving AI Safety by Extending AI’s Data Reach for more on our hallucination-free alternative to all these LLM-and-RAG apps proliferating these days.

Traceable & Explainable, Too

Often it’s not enough to get the right answer; you also need to be able to verify that the answer is right. A regulatory authority may need to know the basis of a material decision, and even if the decision is correct, you have to “show your work” to assure them of the integrity of the business process that led to it.

Voicebox is a fast, accurate AI data assistant, but it also has built-in traceability, lineage, and explainability capabilities, since sometimes it’s not enough just to be right: you have to demonstrate that you’re right, too. Why? Because the regulator won’t just take your word for it; you have to show proof of compliance or actually engage with questions in a model review.

The virtuous cycle in regulated industries with Voicebox looks like this:

Ask a question → Get the answer → Browse the answer’s data lineage → Ask another question → Get another answer → virtuous cycle continues until aha! is achieved

That’s why every Voicebox answer comes with embedded data lineage in Stardog Explorer.

Voicebox can provide automatic lineage and traceability for any answer because it answers using enterprise data sources, not black box LLMs. And because Voicebox connects to enterprise data catalogs and governance platforms, it understands the enterprise data landscape and that feeds into its lineage and traceability capabilities.
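To make the idea of an answer that carries its own lineage concrete, here is a minimal Python sketch. It is not Voicebox’s implementation or any Stardog API; the class names, source names, and record identifiers are all hypothetical, chosen only to show an answer value travelling together with the records it was computed from.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class LineageStep:
    source: str  # hypothetical name of the system the record lives in
    record: str  # hypothetical identifier of the contributing record


@dataclass
class TracedAnswer:
    value: float
    lineage: List[LineageStep] = field(default_factory=list)


def total_exposure(positions: List[Tuple[str, str, float]]) -> TracedAnswer:
    """Sum the amounts and note which source record contributed each one."""
    answer = TracedAnswer(0.0)
    for source, record, amount in positions:
        answer.value += amount
        answer.lineage.append(LineageStep(source, record))
    return answer


# Illustrative data only: two positions held in two different systems.
positions = [
    ("custody_db", "pos-17", 250000.0),
    ("trading_db", "pos-42", 100000.0),
]
result = total_exposure(positions)
```

Here `result.value` is the answer and `result.lineage` is the trail a reviewer could browse before asking the next question.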

Immediately

Immediate answers are better than “some time later”-answers. But how is Voicebox answering questions immediately? We do that in two related but distinct ways.

First, Stardog Voicebox is immediate compared to traditional MDM or other data-movement integration approaches. Stardog Voicebox implementation time is fast. Our customers get to value faster because Stardog Voicebox is an all-inclusive cloud offering that takes only a few weeks to implement.

🗣 For our heavily regulated customers who aren’t in the cloud, Stardog Karaoke is our platform offering inclusive of all hardware and software—CPUs, GPUs, platform, and Voicebox agents, APIs, etc.—in an on-premise appliance. Check it out!

Second, at question time, when a knowledge worker is poised for an aha! moment, Voicebox talks to a family of LLMs and SLMs via Voicebox agents to (1) determine human intent and (2) convert natural language to one or more queries. Those queries are then executed in Stardog Core against trusted enterprise data sources using our unique data federation capabilities that eliminate costly, slow data movement.

So Voicebox multi-turn conversation performance is a function of two subsystems, both of which are fully engineered by us: GenAI and Knowledge Graph interacting together to give answers to questions immediately.
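The question-time flow described above (determine intent, convert the question to one or more queries, run them against the sources where the data lives) can be sketched in a few lines of Python. This is a toy under loud assumptions: the intent and translation steps are simple string checks standing in for language models, the sources are in-memory lists, and every name here is hypothetical rather than a Stardog or Voicebox API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Row = Dict[str, str]

# Hypothetical sources, queried in place; nothing is copied into a new store.
SOURCES: Dict[str, List[Row]] = {
    "crm": [
        {"customer": "Acme", "city": "Kathmandu"},
        {"customer": "Globex", "city": "Pokhara"},
    ],
    "billing": [{"customer": "Initech", "city": "Kathmandu"}],
}


@dataclass
class Query:
    source: str
    matches: Callable[[Row], bool]


def determine_intent(question: str) -> str:
    # Step 1 stand-in: a real system would ask a language model.
    return "count" if question.lower().startswith("how many") else "list"


def to_queries(question: str) -> List[Query]:
    # Step 2 stand-in: spot a known city in the question, then fan out
    # one query per source.
    cities = {row["city"] for rows in SOURCES.values() for row in rows}
    city = next((c for c in sorted(cities) if c.lower() in question.lower()), None)
    if city is None:
        return []
    return [Query(name, lambda row, c=city: row["city"] == c) for name in SOURCES]


def answer(question: str) -> Union[int, List[str]]:
    # Execute every query against its source and fuse the results.
    intent = determine_intent(question)
    hits = [
        row["customer"]
        for query in to_queries(question)
        for row in SOURCES[query.source]
        if query.matches(row)
    ]
    return len(hits) if intent == "count" else sorted(hits)
```

Calling `answer("How many customers do we have in Kathmandu?")` counts matches across both sources, while a question that does not start with “how many” returns the matching customer names instead.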

Data leader orgs empower knowledge workers to self-serve analytically, and that means not sending every new question to the data science team to answer six weeks later.

Does Voicebox Only Answer Questions?

No. While question answering to achieve aha! moments is the primary value of Stardog Voicebox, it’s not the only value that Voicebox provides. We extend Voicebox’s LLM-powered natural language interface to data modeling, mapping, business rules, and data quality, too.

Why Data Modeling & Mapping Matter

All of the arguments for Voicebox for end-users—immediacy, speed to insight, self-service, democratization—apply to implementing Stardog, too. That includes using natural language as the interface to every Stardog job-to-be-done including data modeling and data mapping.

Business Rules & Data Quality, Too

Business rules and data quality constraints are both implementation and usage jobs-to-be-done; sometimes the right way to get a multi-hop reasoning question answered is to tell Voicebox about some business rule that is specific to your use case or enterprise. Voicebox includes the ability to add business rules and data quality constraints to Stardog on the fly.
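As a rough illustration of how a business rule can turn a multi-hop question into a simple lookup, and how a data quality constraint can flag gaps, consider this Python sketch. The rule, the constraint, and the entity names are invented for the example, and the code does not use Stardog’s rule or constraint syntax.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Illustrative facts only: a two-hop ownership chain.
facts: Set[Triple] = {
    ("HoldCo", "owns", "MidCo"),
    ("MidCo", "owns", "OpCo"),
}


def apply_rule(triples: Set[Triple]) -> Set[Triple]:
    """Hypothetical business rule: if A owns B and B owns C, then A controls C."""
    owns = [(s, o) for s, p, o in triples if p == "owns"]
    derived = {(a, "controls", c) for a, b in owns for b2, c in owns if b == b2}
    return set(triples) | derived


def violations(triples: Set[Triple], known_entities: Set[str]) -> List[str]:
    """Hypothetical data quality constraint: every owned entity must be registered."""
    return sorted({o for s, p, o in triples if p == "owns" and o not in known_entities})


enriched = apply_rule(facts)
unregistered = violations(facts, {"HoldCo", "MidCo"})
```

After the rule is applied, “does HoldCo control OpCo?” is a single lookup in `enriched`, and `unregistered` lists the owned entities that fail the constraint.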

Where Do We Go from Here?

Notice what I haven’t talked about in this blog post? ETL, query syntax, skills acquisition, special training, GPUs, LLM quantization, parameter size, etc. We live, love, eat and breathe all that tech stuff so our users and customers can focus only on what matters to them: having aha! moments and moving their business forward at breakneck speed.

Stardog Voicebox is hallucination-free and shovel-ready GenAI that can transform data leaders into data overachievers and data laggards into data leaders. LET’S GO!

Vision → Roadmap

Nearly everything in the Stardog Voicebox vision is already shipping in production in Stardog Cloud, in Stardog Karaoke, and to on-prem deployments with financial services, life sciences, manufacturing, and defense customers globally. But some parts of the vision are on the near-term roadmap.

Shipping Q3 2024

Microsoft Teams and Slack plugins for Voicebox to run as an agent in those tools.
Voicebox for data modeling and data mapping.
Tabular Analytics (beta); i.e., Voicebox answers “what’s happened with my business?” questions about performance and metrics.

Shipping Q4 2024

SafetyRAG and automated Knowledge Graph Construction including LLM-powered Named Entity, Relationship, and Event Extraction.
Stardog Cloud-hosted public API for customers to build their own agents, assistants, and data-intensive GenAI apps.
Predictive AI (beta); i.e., Voicebox answers “what’s going to happen next?” questions.

Shipping Q1 2025

Voicebox support for business rules and data quality.
Anomaly Detection (beta); i.e., Voicebox answers “what do I really need to pay attention to?” questions.
Root Cause Analysis (beta); i.e., Voicebox answers “why did that thing happen?” questions.
