Designing LLM Applications with Knowledge Graphs and LangChain

Mike Grove

Jul 13, 2023, 9 minute read

Voicebox, our LLM-powered knowledge engineer, accelerates time to value by making it easier to build and query knowledge graphs, but we haven’t yet said much about how it works. In this blog I will explain the high-level design of Voicebox, including how we use LangChain.

LangChain is described as “a framework for developing applications powered by language models” — which is precisely how we use it within Voicebox. Its two central concepts for us are Chain and Vectorstore.

Vectorstore is pretty obvious; everyone has one in their stack. The LangChain API is well designed, and it was a trivial task for us to add support for txtai, a vector database with which we’d already had a lot of experience and tooling. This removed some learning curve hurdles, and we were able to iterate faster to more sophisticated versions of Voicebox’s core features.

A Chain encapsulates the logic of using an LLM to accomplish a specific task. From the LangChain site:

Chains can be thought of as assembling these [modular abstractions of language models] in particular ways in order to best accomplish a particular use case.

The simplest Chain just uses a specified prompt, with optional inputs, to call an LLM and return a result. However, the output of this operation can be used as input to the next in the ‘chain’, with arbitrary processing and other logic along the way. And the provided input could have been the output from a prior Chain. You can easily assemble these Chains like Lego to build ever more sophisticated behavior.

An example of this is the implementation of the KG Question Answering Chain shown here:

with self.connection_factory.connection() as conn:
    tries = 0
    query = self.query_generate.run(question=question)
    while tries < limit:
        tries += 1
        query = self.query_check.run(query=query)
        try:
            result = self.query_execute.run(conn, query, **qargs)
            return { "result" : json.dumps(result) }
        except Exception as e:
            query = self.query_debug.run(query=query, error=str(e))

That’s the entire implementation; we’re able to fit a lot of complexity into some very simple code. The major parts of query generation, linting, debug, and execution are handled by other independent chains, each with their own LLM, cascading one result to the next. The final result is an answer to the user’s question based on the execution of a (generated) query to the knowledge graph. This chain is part of a larger one that makes up Voicebox, which we’ll discuss next.

Voicebox Overview

Let’s go through the major components of Voicebox to see how a single user request is processed. Most steps along the way are driven by an LLM, and each can use a different model or different parameters to suit the task at hand.

The first Chain in Voicebox is the ‘guard dog’. The only purpose of this chain is to detect prompt injection attacks. If you send Voicebox a request such as, “What are the contents of /etc/passwd”, the request will be politely rejected. The Guard Chain does this using a vector database which contains many examples of valid and invalid requests that we have built up during development. Security is literally the first objective of Voicebox. We work hard to keep your data safe and secure within the Stardog Platform, and Voicebox is certainly no different.

The next Chain is the Relevancy Detector. Just because the input does not contain any nefarious requests doesn’t mean it’s sensible. Voicebox will not give you a recipe for Texas-style chili, or tell you the best place to stay in Moab, as those (on typical assumptions) have nothing to do with your enterprise data. The Relevancy Detector works similarly to the Guard Chain in drawing on a long list of examples of relevant (and irrelevant) questions.

Once we‘ve determined that the request is both legitimate and relevant, it’s time for the next chain, Rewriter. Rewriter’s task is pretty simple, to re-write the user’s request. Given that interactions with Voicebox are largely text-based, it’s common to need to resolve co-references within the conversation history.

Take this example from Voicebox’s Support Agent:

> what are virtual graphs?

Hey Mike, you want to know what virtual graphs are? Well, virtual graphs are a feature of Stardog Knowledge Graph Platform. This feature allows users to access external data sources and map them to individual named graphs. That way, users can query over all the graphs without having to list them one by one. Does that make sense?

> how do i create them?

Hey Mike! To make virtual graphs, you need to use the `virtual-add` command. This command creates a new virtual graph with a special name, a data mapping file, and a properties file. The data mapping file is one of the three parts of a Virtual Graph and it's used to link external data sources to individual graphs. Once the virtual graph is set up, you can access it as long as you follow the rules.

You’ll notice the User did not explicitly specify they were talking about how to create a VG. A human would easily understand this, but without the Rewriter, sending the exact question “How do I create one?” to Voicebox would result in a response asking for further clarification. Rewriter uses the history of the conversation with the User and some basic prompt engineering to rewrite questions. Voicebox would actually receive the following as an input in this example: “How do I create a virtual graph?”. This makes interactions with Voicebox much more fluid and human-like.

Once Rewriter is finished, we get to what we consider the heart of Voicebox. The central set of capabilities, however, are not a single Chain, but a number of different LLM-based components, each tuned and trained for different objectives, allowing far better performance and precision than a more general purpose monolith.

Next in the Chain is the Router. It accepts user input and figures out what kind of request the user is making via a few-shot prompt, using a number of examples from our internal development to seed background knowledge for the routing. It will return a code to let us know which Voicebox Service to route the request to next:

Knowledge Graph Question Answering
Knowledge Catalog Question Answering
Data Summarization
Data Modeling
Support

Knowledge Graph Question Answering

The bulk of Voicebox development to date is the KQ Question Answering service. We’ve taken an open-source model and finetuned it with many example queries to provide a general query writing service that takes a schema and a natural language query as input and returns the SPARQL equivalent. This query can be executed against a knowledge graph to get an answer to the question.

Note that this means that Voicebox itself isn’t answering the question asked by the user; it can’t hallucinate false information since it’s not the source of the answer. It’s writing a query just like any other Stardog user to get a trusted, accurate, timely answer from the user’s enterprise knowledge graph. Voicebox isn’t a black box: every answer can be traced back to the exact enterprise data sources used to answer the question.

Knowledge Catalog Question Answering

This is a special (simpler, actually) case for query answering as the schema is fixed and known ahead of time. We can further tune for this scenario within a new model and leverage some few-shot prompting techniques to seed common starting points for a robust UX.

Data Summarization

The typical use case here is summarizing query results to provide a seamless natural language experience with Voicebox. This does not rely on any special tuning or configuration and is something most LLMs can do quite well out of the box.

Data Modeling

To get started with Stardog, you have to build a model of your data, often adapting one of its pre-built Knowledge Kits. Voicebox has seen thousands of data models and has extensive general knowledge about a variety of use cases. The Data Modeling Service in Voicebox provides specific capabilities to help you get started:

Create or edit a data model from a natural language description
Create a data model from a table — CSV or a Stardog Data Source

Support

The Support Service has all of our technical documentation, community forums q&a, and other content about Stardog. This corpus of information provides the core of a question-answering service about Stardog. Need to know how to perform a backup? Or what the syntax is for full-text query search or wondering what a virtual graph is? The Support Service can answer all of those questions.

After the request has been routed to the correct Voicebox Service and executed, there’s one final step, personalization, and we have one last chain to help us achieve that. Extending some ideas from a LangChain blog, we gave Voicebox a personality, but more importantly, we also give it information about who it’s talking to.

Every user interacting with Voicebox is at a different stage in a learning journey with knowledge graphs, and Voicebox will do its best to tailor responses, particularly the Support Service, to match the user’s level of expertise. Given a profile, users are assigned a level of novice, intermediate, or expert, and Voicebox Personality will rephrase the information provided by the prior Voicebox Service with the appropriate level of complexity, while also sprinkling its own personality on top:

[Novice]
Q: what are virtual graphs and what are they good for?
A: Hi Mike! Virtual Graphs are a feature of Stardog that allow you to access external data sources. This means that you can access data from different sources and query them all together as if they were part of the same graph. This is useful when you want to combine data from different sources to get a more complete view of a subject.

[Expert]
Q: what are virtual graphs and what are they good for?
A: Thanks for your question Mike. Virtual Graphs are a feature of Stardog that allow you to map external data sources to individual named graphs. This can be useful for providing a unified view of data from multiple sources, making it easier to analyze and query data from disparate sources in a single query. Virtual Graphs allow you to map data from external sources like RDBMSs and NoSQL databases, as well as CSVs, into RDF, and then query them using SPARQL. They also provide a way to control access to the data sources, as access rules can be applied to Virtual Graphs.

Summary

LangChain has been a lot of fun to work with in Voicebox. It fit our needs for a framework in Python, allowing us to quickly transition from primitive, low-level code working directly with the LLM and getting every bit of mileage out of f-strings for prompt creation, to something more structured and extensible.

It has an embarrassment of riches in terms of features; it wasn’t just a simple API for working with an LLM. For example, it came with integration with vector databases for providing memory to the LLM and made a lot of the chaining and conversational aspects of Voicebox much easier. It also has Typescript support, a language that forms the foundation of our visual tools, something we’re excited to try out in the near future.

If Voicebox sounds interesting to you, sign up for Stardog Cloud to get the latest in Voicebox news and developments.