Solving Four LLM Design Problems
LLM-based systems have four inherent obstacles that limit their utility in high-stakes use cases, especially in regulated industries: the Hallucination Problem, the Staleness Problem, the Generality Problem, and the Database Record Problem.
The first three problems are mentioned frequently in the AI literature; the fourth isn’t.
The RAG design pattern addresses the first three to various degrees, but the Semantic Parsing (SP) design pattern addresses all four better. Let’s compare these two ways of dealing with the four LLM design problems.
LLM outputs are factually incorrect in particularly “inventive” ways, in some cases perhaps as often as 60% of the time. In short, LLMs lie to people. Of course, what matters more than LLMs lying is system design that exposes users to those lies. See AI Design Needs More Systems Thinking for more, or, for a more systematic consideration of hallucinations, the LLM Hallucination FAQ is a good resource.
The basic intuition of RAG, which is sound as far as it goes, is to steer the LLM’s response away from hallucination by prompt-stuffing, that is, augmenting a user’s query (“the prompt”) with additional, out-of-band, and ideally contextually relevant information. In principle, this information can be derived from any source, but in practice most RAG apps use the embedding-vectors pattern: pre-indexing some documents, building vector embeddings from parts of those docs, and then prompt-stuffing after looking up some context in a vector database. Even if a RAG system is very au courant and is doing Graph RAG or using a knowledge graph, the relevant architectural posture is the same.
In that way, or so goes the theory, the LLM is steered toward truth and away from hallucination. It doesn’t really work very well.
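To make that architectural posture concrete, here’s a minimal sketch of the embedding-vectors pattern. The embed(), similarity(), and generate() functions are toy stand-ins for a real embedding model, vector database, and LLM; no particular library’s API is implied, and the sample documents are made up.

```python
"""Minimal sketch of the embedding-vectors RAG pattern described above.
Only the control flow matters here; everything else is a stand-in."""

def embed(text: str) -> set[str]:
    # Toy "embedding": a bag of lowercase tokens. A real system would call
    # an embedding model and store dense vectors in a vector database.
    return set(text.lower().split())

def similarity(a: set[str], b: set[str]) -> float:
    # Toy stand-in for cosine similarity over dense vectors.
    return len(a & b) / (len(a | b) or 1)

# 1. Pre-index: chunk documents and store their "embeddings".
documents = [
    "Acme Bank's overdraft fee is $30 per item as of June 2024.",
    "Acme Bank branches are open 9am-5pm on weekdays.",
]
index = [(embed(doc), doc) for doc in documents]

def retrieve(question: str, k: int = 1) -> list[str]:
    # 2. Look up the chunks most relevant to the user's question.
    q = embed(question)
    ranked = sorted(index, key=lambda item: similarity(q, item[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

def generate(prompt: str) -> str:
    # Stand-in for the LLM call; whatever it returns goes straight to the user.
    return f"<LLM completion for: {prompt[:60]}...>"

def rag_answer(question: str) -> str:
    # 3. Prompt-stuff: augment the question with retrieved context,
    #    then let the LLM "speak last".
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)  # the user sees this output directly

print(rag_answer("What is Acme Bank's overdraft fee?"))
```

Note that rag_answer() returns the LLM’s completion directly; that handoff is where the trouble starts.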
The key inadequacy of RAG? Hope is not a strategy for protecting users!
Since RAG apps let the LLM speak last (an essential element of RAG; the only way to fix it is not to do RAG!), they are in effect trusting that prompt-stuffing will steer the LLM away from hallucination. When that doesn’t work, the user is exposed to the hallucination.
RAG trusts LLM not to lie to users. LLMs lie anyway. Oops! 🆘
The basic intuition of SP is that an LLM should be used to determine human intent but not trusted to answer factual questions about the world or the user’s interests. See Safety RAG for more. Rather, SP assumes that the best answer to a user’s question comes from trusted data sources, not from an LLM.
An SP system solves the hallucination problem by not trusting LLM outputs and by not showing LLM outputs to the user directly or without (re)mediation.
Of course, as with any design pattern, SP has a failure mode, but it is a well-understood one: an SP system in failure mode is unable to answer a user’s question and responds, instead, with some version of “I don’t know”.
SP assumes that LLM will lie to users and doesn’t expose users to those lies. 🏁✅
🤨 We’ve all had two types of colleague who react in very different ways when posed a question. The first type “hallucinates”, that is, makes shit up when faced with uncertainty or ignorance and then hopes for the best. The second type says “I don’t know” when, in fact, they don’t know, but then takes steps to turn “I don’t know” into a true answer. RAG is a type 1 colleague; SP is a type 2 colleague.
Many problems with thinking about LLMs come from assuming that LLMs are a kind of database. This is incredibly misleading. The cost of database writes ranges from trivial to intensive, but nothing in database design or operations comes close to the resource intensity of LLM training, that is, of LLM “writes”. Further, LLMs are stateless functions that are never updated (directly); even in continuous fine-tuning architectures, the time between fine-tunings can be lengthy. Because LLMs are trained extensively for months and then frozen in time, they decay continuously from the very first moment of their release as the world moves on from their last input.
As a result of these two facts, LLMs contain stale data and lack critical new facts about the world.
RAG is often pitched as a hallucination mitigation, but it’s more credible as a response to the Staleness Problem. In the worst case, RAG prompt-stuffs with data that is inimical to users’ interests; in the common case, RAG in fact provides updated information that the LLM lacks, and the LLM’s outputs are better as a result.
The issue is that prompting is a contested channel of input since it’s equally available to the system and to system users; to the extent that user inputs may be adversarial, the LLM must be immune to steering via prompt inputs. But it’s exactly that immunity that degrades the channel for RAG inputs via prompt-stuffing!
Indeed, there is good evidence to suggest that LLMs resist RAG steering attempts. However, as RLHF and instruction-tuning techniques improve, the RAG approach to steering will likely become more effective in the common case.
It’s fair to assume, then, that RAG will increase in effectiveness in solving the Staleness Problem as LLMs become more steerable. Of course there’s a tension here since aligned LLMs are supposed to be immune to steering away from safe values.
An SP system provides answers to user questions by a three-step process (sketched in code below):

1. Use an LLM to parse the user’s question into a structured query, that is, to determine the user’s intent.
2. Execute that query against trusted, current data sources.
3. Return the query results to the user, or respond with some version of “I don’t know” when no answer can be found.
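Here’s a minimal sketch of that three-step flow. The parse_intent() function is a toy keyword matcher standing in for the LLM call that interprets the question, and the trusted source is a hard-coded dictionary; this is the shape of the pattern, not Stardog Voicebox’s implementation.

```python
"""Minimal sketch of the three-step SP flow described above."""

TRUSTED_FACTS = {
    # Stand-in for a governed, current data source (database, knowledge graph, ...).
    ("overdraft_fee", "acme_bank"): "$30 per item",
}

def parse_intent(question: str) -> tuple[str, str] | None:
    # Step 1: use the LLM to turn the question into a structured query.
    # A toy keyword parser stands in for that call; its output is never
    # shown to the user.
    if "overdraft" in question.lower():
        return ("overdraft_fee", "acme_bank")
    return None

def execute(query: tuple[str, str]) -> str | None:
    # Step 2: run the structured query against the trusted source.
    return TRUSTED_FACTS.get(query)

def sp_answer(question: str) -> str:
    # Step 3: render the result, or fail closed with "I don't know".
    query = parse_intent(question)
    result = execute(query) if query else None
    return result if result is not None else "I don't know."

print(sp_answer("What is Acme Bank's overdraft fee?"))  # -> $30 per item
print(sp_answer("Who won the World Cup?"))              # -> I don't know.
```

The design choice that matters is that no LLM-generated text ever reaches the user; the LLM’s only job is step 1.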
Because SP systems don’t generally treat LLMs as sources of truth, they circumvent the Staleness Problem by deriving answers to user questions from trusted, safe, current data sources.
SP is a general solution to the Staleness Problem. 🏁✅
Imagine you’re using an LLM system at a large bank. One of the things that differentiates your bank from all the others is of course you and your colleagues, that is, the bank’s human capital. But another thing is the bank’s intellectual property, that is, its data.
Now there’s a problem: even if you’ve solved the Hallucination and Staleness Problems, an LLM, unless it’s explicitly trained or fine-tuned on your bank’s data, won’t know anything very useful about your bank’s data. It will know general and perhaps true facts about banking, but what you will often need is specific and true facts about your bank.
So LLMs have a Generality Problem since, in the common case, it’s general data that’s in-distribution for them. This makes sense since, to a first approximation, LLMs are token generators for general-purpose knowledge. Perhaps the ideal solution to the Generality Problem is to train, or at least fine-tune, a foundational model on the data of your organization. But most organizations aren’t going to do that.
RAG’s response to the Generality Problem is the same as its response to the others: to prompt-stuff with specific information from some pre-indexed data sources. In this case the idea is to pull specific information from enterprise documents to supplement the LLM’s general knowledge with organization-specific knowledge.
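In practice that usually means scoping the retrieval step to the organization’s own documents, for example with a metadata filter. A tiny sketch, with made-up fields and data, and with the embedding-based ranking elided:

```python
"""Sketch of how a RAG app typically handles the Generality Problem:
retrieval is scoped to the organization's own documents, so the prompt is
stuffed with bank-specific rather than generic text."""

corpus = [
    {"org": "acme_bank", "text": "Acme's wire-transfer cutoff is 4pm ET."},
    {"org": "acme_bank", "text": "Acme's KYC refresh cycle is 24 months."},
    {"org": "public",    "text": "US banks are regulated by the OCC and the Fed."},
]

def retrieve_for_org(question: str, org: str, k: int = 2) -> list[str]:
    # Filter to the organization's documents before ranking; the ranking
    # itself (embeddings, vector search) is elided here.
    own_docs = [d["text"] for d in corpus if d["org"] == org]
    return own_docs[:k]

prompt_context = "\n".join(retrieve_for_org("When do wires cut off?", "acme_bank"))
print(prompt_context)
# The LLM still "speaks last" over this context, so the concerns raised in
# RAG's Response to the Hallucination Problem apply unchanged.
```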
See RAG’s Response above for the issues here, which are the same. Generality and Staleness are specific instances of a more general problem, namely, that LLMs are frozen and the world very much isn’t.
SP avoids the Generality Problem by deriving answers from trusted data sources rather than from LLMs. 🏁✅
But in fairness SP arguably has the inverse of the Generality Problem, what we might call the No Background Problem.
I’m not sure a blog post that can’t finish itself without spawning new problems to discuss is that useful, but it’s the one I’ve written for you today, apparently.
So, the No Background Problem, which may be specific to the SP design pattern, is the inverse of the Generality Problem: since SP doesn’t trust LLM outputs or show them to the user directly, the single best source of background knowledge, namely a cutting-edge frontier model, isn’t easily available to an SP application beyond determining human intent.
This is unfortunate since a user’s question may require background knowledge. Since Stardog Voicebox is based on the SP design pattern, it’s important for us to address the No Background Problem in a way that preserves Voicebox’s commitment to 100% hallucination-free user interactions.
This problem is the least studied in the AI research community but deserves mention here. The problem is easy enough to see: database records are an important source of enterprise knowledge, perhaps the most important, but LLMs don’t understand database records very well. In fact, there aren’t any foundational model types that optimally accept database records as input. Tabular Models are probably the closest, but tables in documents aren’t really database tables or collections of tables.
There is, then, a Database Record Problem owing to the general semantic mismatch between model types and an important source of enterprise knowledge.
RAG is ignorant of database-resident knowledge, that is, database records cannot be inputs to vector embeddings. Much like physics for a dog or ultraviolet light for a chimpanzee, a RAG system simply cannot apprehend database records as input. Setting aside advances in building foundational database models, for which I won’t be holding my breath, the only move to play is for RAG to query database records and then stuff something into the prompt. But stuff what? It’s entirely unclear. 🆘
SP is immune to this problem since querying current, specific database records of any type (relational, key-value, document, graph, time series) just is what SP does. 🏁✅
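As an illustration only, here’s what answering from live database records looks like under the SP pattern. SQLite, the accounts table, and the toy parse_to_sql() function are stand-ins, not how any particular SP system is built; an SP system could target graph, document, or time-series stores the same way.

```python
"""Minimal sketch of the SP response to the Database Record Problem:
the user's question becomes a structured query over live database records,
and the query results (not LLM text) are what the user sees."""

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, holder TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "Ada", 1200.50), (2, "Grace", 830.00)],
)

def parse_to_sql(question: str) -> str | None:
    # Stand-in for the semantic-parsing step (LLM -> structured query).
    if "balance" in question.lower() and "ada" in question.lower():
        return "SELECT balance FROM accounts WHERE holder = 'Ada'"
    return None

def answer(question: str) -> str:
    sql = parse_to_sql(question)
    if sql is None:
        return "I don't know."
    rows = conn.execute(sql).fetchall()
    # The user sees current database records, not an LLM completion.
    return f"{rows[0][0]:.2f}" if rows else "I don't know."

print(answer("What is Ada's balance?"))  # -> 1200.50
```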
LLMs hold out great promise for changing how many jobs-to-be-done are, in fact, done. But they also contain inherent obstacles that must be overcome in order to achieve that promise. Stardog Voicebox is an SP system and inherits all of the pattern’s advantages and disadvantages. It’s important for consumers to be aware of these obstacles and of the solutions provided by RAG and SP.