19 Questions about Hallucinations in Large Language Models

A FAQ by Stardog Union Labs

Last updated: 12 May 2024 (version 0.3)

This is the Stardog Voicebox team’s FAQ about hallucinations in LLMs. As generative AI (GenAI) technologies become universal, hallucinations—the algorithmic creation of false but convincing data—threaten acceptance, especially in regulated industries such as banking, life sciences, and manufacturing. This FAQ discusses the nature of AI hallucinations, their potential impacts, and mitigation techniques. We will update this document as the SOTA for dealing with LLM hallucinations and related matters evolves.

1. Why do hallucinations matter?

“Facts”, as the kids say these days. Let’s start with three facts:

  1. Every business—from global enterprise to SMB—wants to benefit from GenAI.
  2. Every RAG-and-LLM app is prone to hallucinations; in fact, every app that trusts an LLM and shows its outputs directly to users is prone to misleading those users with hallucinations.
  3. A high-stakes use case, especially but not only in regulated industries, is one where culpable failure leads to bad outcomes: stock price crashes, big fines, reputational harms, jail time.

Hallucinations matter because there’s no general constituency for error, fantasy, or deception; hallucinations matter because AI matters.

The problem comes from the conflict between enterprise goals and the imperfection of startup markets and emerging tech. The C-suite wants GenAI dynamism and value prop; but the startup world, influenced by a tiny coterie of faddish investors, is over-rotated on RAG-and-LLM startups because they’re easy to launch, give good demo, and “everyone’s confident they can lower hallucination incidence”. That stalemate of sorts leads to some of the caution we see in high-stakes, regulated industry adoption of GenAI.

2. What do we mean by “hallucination”, “hallucination free”, “safe AI”, and “AI Safety”?

It’s important to be clear and precise about the terms we use to describe parts of the world.

Hallucinations are unwanted LLM side-effects

A bug, a wrong answer, or a viewpoint you don’t agree with may be flaws in a system; but they’re not hallucinations, which we define in this way: A hallucination is a factual error or inaccuracy in the output of an LLM, often involving a non-existent entity, object, relationship, or event.

Just as importantly, a correctness or performance bug, or a regression, or a specification conformance issue is not a hallucination. That is, not every system fault or error is a hallucination. Sometimes a bug is just a bug!

AI systems are hallucination-free or not, safe or unsafe. Safety means hallucination-free.

It’s trivially easy to build a system that’s 100% free of hallucinations: don’t use LLMs in that system. Building a GenAI app that’s 100% free of hallucinations is harder since, more or less by construction, GenAI requires a language model and every language model is prone to hallucination; but we can still rescue the claim by observing that, no matter what happens internally in a GenAI app, what is critical is what its users see:

An AI system is safe if that system is free of hallucinations, that is, if its users never see hallucinations and nothing its users see depends logically on hallucinations.

3. How often do LLMs hallucinate? Surely this is very rare?

Well, actually, no: hallucinations are more frequent than anyone would like. There are no universally applicable numbers, and different LLMs hallucinate in different ways, but there are some resources and some data in the AI literature:

A table of hallucination rates for popular LLMs from a recent study

4. Are there usage patterns that make hallucinations more or less likely?

In fact, it’s become experimentally clear that LLMs hallucinate more often when, given incomplete or outdated training data plus the prompt inputs, a hallucinated output is more probable than the truth.

Forcing an LLM to generate output in an area where its training data is weak increases the chances of hallucination; particular prompt strategies make this worse when they push the LLM to generate tokens in exactly those weak areas.

Also, forcing LLMs to justify, rationalize, or explain previous outputs that contain hallucinations will increase hallucination frequency in subsequent outputs. The LLM appears to get “into a bad spot” and usually cannot recover; i.e., hallucinations cascade or “snowball”.

Finetuning increases hallucination frequency if the finetuning data is too distant from the original training material.

Studies have shown that several features of the language used in prompts are associated with increased hallucination frequency:

  1. Various levels of readability, formality, and concreteness of tone
  2. Long prompt inputs lead to “missing middle” information loss and increase hallucinations
  3. Irrelevant information in context prompts may easily distract LLMs, putting them into a “bad spot” from which they struggle to recover, and that struggle often increases hallucinations
  4. Complicated chains of reasoning increase hallucinations

5. What are some common techniques to mitigate hallucinations?

Various prompt strategies and other techniques mitigate hallucinations; generally we can sort them into two buckets:

  • structural or interpretability-based methods for detecting how LLMs act internally when they produce hallucinatory output
  • external or correlation methods—aka “groundings”—for filtering or aligning hallucinatory output with ground truth, that is, known-good data sources

Some of the specific techniques that are worth considering include the following:

  1. RAG, or retrieval-augmented generation, tries to steer LLM outputs toward faithful answers by supplementing the input prompt with additional context, typically derived from unstructured data (i.e., documents) that may be relevant to the user’s query (see the sketch after this list)
  2. Grounding LLM outputs in external data sources; this is the general approach that Stardog Voicebox takes in Safety RAG.
  3. Improving LLM interpretability to detect, internally, when hallucinations occur. The internal states of local LLMs—another reason to prefer them, ceteris paribus—can be inspected without code or structural manipulation, but purely through contrastive prompt strategies (a paper that absolutely should have been called “LLM Androids Hallucinate Electric Sheep”).
    1. If we’re lucky there will be a single structural mechanism underlying all hallucinations.
    2. If we aren’t, there will be many mechanisms.
    3. We are rarely lucky.
  4. Use pause-tokens in prompts to decrease hallucinations
  5. A growing cluster of “chain of knowledge” prompt techniques mitigates hallucination frequency
  6. Chain-of-natural-language checks to detect and rewrite hallucinatory output post-generation, before users can be harmed.
  7. Manipulating the epistemic and evidentiary relationship between training data, frozen in the LLM, and the context, i.e., the prompt inputs. LLMs tend to favor frozen knowledge when it conflicts with context, increasing hallucination frequency. One mechanism for manipulating background-versus-context is context-aware decoding (see question 11 below).
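
To make item 1 concrete, here is the promised minimal sketch of the RAG pattern: retrieve the document chunks most similar to the query and prepend them to the prompt. The embed, vector_index.nearest, and llm_complete names are hypothetical placeholders, not any particular library’s API.

```python
from typing import Callable, List

def retrieve(query: str, vector_index, embed: Callable, k: int = 4) -> List[str]:
    """Return the k document chunks whose embeddings are closest to the query's."""
    query_vector = embed(query)                     # hypothetical embedding model
    return vector_index.nearest(query_vector, k=k)  # hypothetical vector-store call

def rag_answer(query: str, vector_index, embed: Callable, llm_complete: Callable) -> str:
    """Supplement the user's query with retrieved context, then call the LLM."""
    chunks = retrieve(query, vector_index, embed)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say 'I don't know.'\n\n"
        "Context:\n" + "\n---\n".join(chunks) +
        f"\n\nQuestion: {query}\nAnswer:"
    )
    return llm_complete(prompt)                     # hypothetical LLM call
```

Note that even with the “use only the context” instruction, nothing here prevents the LLM from falling back on its frozen background knowledge; that is the tug-of-war discussed in question 8 below.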

A good survey of mitigation techniques is here.

6. Why do LLMs hallucinate? Do different types of LLM hallucinate?

There’s an unsatisfying, entirely true explanation—LLMs hallucinate when a lie is more probable than the truth; for example, if there is no information about a question in the model, an LLM will make something up rather than saying, “I don’t know”.

That’s unsatisfying because it’s shallow and doesn’t give us any real leverage on either understanding what’s going on inside the LLM or controlling its behavior.

The more satisfying answer, which isn’t really an answer, is that no one is entirely sure why LLMs hallucinate; there’s no proof of why hallucinations happen. Generally, however, researchers believe that the probabilistic nature of LLMs is responsible. LLMs generate one token (a token is more or less a word) at a time, and they generate the most probable next token based on all the tokens they’ve seen and already generated.
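
As a toy illustration of that token-by-token process (the tiny vocabulary and scores below are invented; real models work over tens of thousands of tokens), note that the model simply emits whichever continuation it scores highest, whether or not it is true:

```python
import numpy as np

# Toy next-token step: the model scores every vocabulary item, the scores become
# a probability distribution, and the most probable token is emitted.
vocab = ["Paris", "Dallas", "Texas", "France"]

def next_token(logits: np.ndarray) -> str:
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the vocabulary
    return vocab[int(np.argmax(probs))]             # greedy: pick the most probable token

# "The Eiffel Tower is in ..." -> if the false continuation happens to score
# highest, the model emits it anyway.
print(next_token(np.array([2.1, 2.4, 0.3, 1.0])))   # prints "Dallas"
```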

OpenAI has shown an interesting result that suggests hallucination is inherent to LLMs:

…there is an inherent statistical lower-bound on the rate that pretrained language models hallucinate certain types of facts, having nothing to do with the transformer LM architecture or data quality. For “arbitrary” facts whose veracity cannot be determined from the training data, we show that hallucinations must occur at a certain rate for language models that satisfy a statistical calibration condition appropriate for generative language models. Specifically, if the maximum probability of any fact is bounded, we show that the probability of generating a hallucination is close to the fraction of facts that occur exactly once in the training data (a “Good-Turing” estimate), even assuming ideal training data without errors.

The issue they identify concerns facts that appear only once in a training set; of course, a fact’s business, evidentiary, or other value has no correlation with its frequency in the training set. That is, “singleton facts” in a training set may be just as valuable as facts that appear many times. A result from ETH confirms the same finding; namely, LLMs favor facts they’ve seen a lot over facts they’ve only seen once: “models default to favoring text with high marginal probability, i.e., high-frequency occurrences in the training set, when uncertain about a continuation.”
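
To see what the quoted “Good-Turing” quantity looks like, here is a toy computation under one reading of it: the number of facts seen exactly once divided by the total number of fact occurrences. The facts below are invented placeholders for a training corpus.

```python
from collections import Counter

# Invented stand-in for facts observed in training data.
training_facts = ["fact_a", "fact_a", "fact_b", "fact_c", "fact_c", "fact_d"]

counts = Counter(training_facts)
singletons = sum(1 for c in counts.values() if c == 1)   # fact_b and fact_d
singleton_fraction = singletons / len(training_facts)    # 2 / 6
print(singleton_fraction)  # ~0.33: the estimated floor on hallucinating such facts
```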

As of 2024, there are no LLMs that have been publicly disclosed or described that aren’t prone to hallucinations. State of the art research suggests hallucinations are inherent to the way that LLMs work.

What about video, audio, and image models? Or RNN, SSM alternatives to Transformers?

Yes, they hallucinate too. Every modality and every type of model hallucinates as of early 2024.

7. Why do GenAI software systems contain hallucinations?

Now that’s an interesting question and the answer is, well, socio-political or economic or some mix of economics and technology. First, it’s important to recognize that LLMs hallucinating inherently is perfectly consistent with some GenAI software systems not hallucinating.

As we said above, LLMs are intrinsically fluent liars, not to be trusted in high-stakes use. But GenAI systems—that is, applications that use LLMs internally—are a different matter. Most of them employ a design pattern called RAG; so much so that we’ve taken to calling the whole field RAG-with-LLM to emphasize that there are alternatives to RAG.

Which is to say that, while LLMs will continue to evolve, there may be something about them that intrinsically lies; but we can’t say that about GenAI systems. Nothing requires a GenAI system to show users raw LLM output that may contain lies. Systems that perform in that way reflect human choices and values and, as such, are a variant of the so-called ‘alignment problem’ that is far easier to solve.

8. But I hear that RAG is all you need and it eliminates hallucinations?

No. In fact, it’s not clear, based on a recent study—“How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs’ internal prior”—that RAG is particularly effective at all; as usual, it depends on data, queries, and workload.

From the study,

Multiple studies have shown that RAG can mislead LLMs in the presence of complex or misleading search results and that such models can still make mistakes even when given the correct response (Foulds et al., 2024; Shuster et al., 2021).

We can think of the LLM as background knowledge and RAG as a means of blending foreground knowledge, or context, with a user’s query; that is, RAG is a mechanism to contextualize LLM output, not only bringing it up to date since the LLM was frozen but also steering LLM output toward fidelity.

But the study results show that

As expected, providing the correct retrieved information fixes most model mistakes (94% accuracy). However, when the reference document is perturbed with increasing levels of wrong values, the LLM is more likely to recite the incorrect, modified information when its internal prior is weaker but is more resistant when its prior is stronger. Similarly, we also find that the more the modified information deviates from the model’s prior, the less likely the model is to prefer it. These results highlight an underlying tension between a model’s prior knowledge and the information presented in reference documents.

That is, there’s a resistance on the part of the LLM to give up what it’s learned from training material in favor of contextualized knowledge presented in the prompt via RAG.

Two statistics from the study suggest the tension between background and context is significant:

Specifically, a slope of -0.45, for instance, can be interpreted as expecting a 4.5% decrease in the likelihood of the LLM preferring the contextual information for every 10% increase in the probability of the model’s prior response.

And further

A similar pattern emerges in this analysis: as the RAG value diverges from the model’s prior, the model is less likely to adopt the RAG value over its own initial response.

LLM sounds like a teenager, if you ask me!

Of course, given that RAG is effectively a form of dynamic in-context learning—that is, supplementing the user’s query with additional information that’s sensitive to the query—we have to ask what an LLM’s propensity is to modulate background knowledge with any in-context prompt material when the two conflict. In other words, structurally, RAG isn’t doing anything different from any other in-context prompting.

The TLDR of this study is worth quoting—

While RAG is becoming standard practice in commercially available LLMs, the reliability of such systems is still understudied. Our experiments uncover several mechanisms that modulate the degree to which LLMs adhere to RAG systems. Specifically, we quantify a tug-of-war between the strength of the model’s prior and the rate at which the model adheres to the RAG document’s facts. This effect is at odds with claims that RAG itself can fix hallucinations alone, and occurs even when the model is prompted to adhere to RAG documents strictly… For example, if RAG systems are used to extract nested financial data to be used in an algorithm, what will happen if there is a typo in the financial documents? Will the model notice the error and if so, what data will it provide in its place? Given that LLMs are soon to be widely deployed in many domains including medicine and law, users and developers alike should be cognizant of their unintended effects, especially if users have preconceptions that RAG-enabled systems are, by nature, always truthful.

9. What are some alternatives to RAG?

The most important one is Semantic Parsing, which can be distinguished from RAG thusly—

  1. Semantic Parsing trusts the LLM to algorithmically determine human intent, that is, to answer the question “what does this person want to know?”, and then it queries trusted data sources to satisfy that intent. Semantic Parsing uses the LLM to convert human inputs into valid sentences of formal languages—queries, constraints, rules, etc.—and executes those in the normal way to enlighten users (see the sketch below).
  2. RAG trusts the LLM to tell people true facts about the world, notwithstanding the LLM’s tendency to hallucinate. RAG trusts the LLM to provide answers to users’ questions; that is, RAG-based systems, in effect if not intent, trust fluent liars to tell the truth to users. What RAG adds to the bare use of LLMs is dynamic contextualization via vector embeddings of ideally relevant document chunks; but as we’ve seen in this FAQ, LLMs tend to prefer background knowledge to contextualized knowledge when they conflict.

Of course other architectures are likely to emerge and there are arbitrarily many novel combinations of RAG and SP.
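
Here is the sketch promised in item 1: a minimal, hypothetical Semantic Parsing loop in which the LLM only translates the question into a formal query and the facts come from a trusted database. The llm_complete and run_sql functions and the accounts table are invented for illustration; this is not the Stardog Voicebox implementation.

```python
from typing import Callable

def semantic_parse_answer(question: str, llm_complete: Callable, run_sql: Callable) -> str:
    """The LLM supplies intent (a query); the trusted database supplies the facts."""
    prompt = (
        "Translate the question into a single SQL query over the table "
        "accounts(customer, balance, region). Return only SQL.\n\n"
        f"Question: {question}\nSQL:"
    )
    sql = llm_complete(prompt)   # the LLM determines what the person wants to know
    rows = run_sql(sql)          # the answer comes from a known-good data source
    return f"Query: {sql}\nResult: {rows}"
```

In a real system the generated query would be validated before execution; the point of the sketch is only that the LLM never supplies the facts, the database does.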

10. Why don’t we just develop a hallucination detector?

Yes, that is something we can do once we improve LLM interpretability, which is no small feat, but progress has been fairly steady. Reliable hallucination detection appears more likely to succeed than convincing LLMs not to hallucinate at all. This analysis from a recent study is aligned with Stardog Voicebox’s approach in Safety RAG, where we explicitly designed a safe approach in which users either (1) rely on correct information or (2) are told, in an act of humility of system design, “I don’t know” and are thereby left unenlightened and unharmed.
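
A minimal sketch of that “correct answer or ‘I don’t know’” behavior, assuming some hallucination detector exists: the detect_hallucination function below is a placeholder for whatever interpretability- or grounding-based check is available, not a claim about how Safety RAG is implemented.

```python
from typing import Callable

def safe_answer(question: str, generate: Callable, detect_hallucination: Callable) -> str:
    """Show users either a vetted answer or an explicit refusal, never a raw hallucination."""
    draft = generate(question)
    if detect_hallucination(draft):
        return "I don't know."   # leave the user unenlightened but unharmed
    return draft
```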

11. Doesn’t prompting solve hallucination?

Surely LLMs hallucinate because their training ends and they’re released into the world and don’t know what’s happened subsequently. Isn’t this where prompting and RAG come in to provide additional context as a supplement for training content?

Well, yes and no. RAG and prompting techniques are intended to add context to the LLM. But LLMs often ignore the prompt content in favor of training content when the two conflict.

Context-aware decoding is a way to force LLMs to pay more attention to context (i.e., prompt content) to avoid hallucinating about post-training facts.
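
A rough sketch of the contrastive idea behind context-aware decoding, with toy logits and an invented amplification factor alpha: tokens whose probability rises when the context is present get up-weighted, so the context can win out over the frozen prior.

```python
import numpy as np

def context_aware_logits(logits_with_ctx: np.ndarray,
                         logits_without_ctx: np.ndarray,
                         alpha: float = 0.5) -> np.ndarray:
    # Contrastive combination: (1 + alpha) * logits(y | context, prompt)
    #                          -      alpha * logits(y | prompt)
    return (1 + alpha) * logits_with_ctx - alpha * logits_without_ctx

with_ctx = np.array([2.0, 1.8, 0.5])     # context boosts token 1, but token 0 still leads
without_ctx = np.array([3.0, 0.5, 0.5])  # the frozen prior strongly favors token 0

print(int(with_ctx.argmax()))            # 0: plain decoding still follows the prior
adjusted = context_aware_logits(with_ctx, without_ctx)
print(int(adjusted.argmax()))            # 1: the context-supported token now wins
```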

12. What kinds of hallucinations happen often?

The most common types of hallucination (i.e., “a dream-like sight of things that aren’t part of reality”) that we’ve seen include—

General factual inaccuracies

  • “The Eiffel Tower was made in Dallas, but now resides in Paris, Texas.”
  • “The Federal Reserve is directly responsible for setting the prime interest rate for all banks in the United States.”
  • “The latest iPhone is manufactured entirely by robots, without any human involvement.”

Contextual misunderstanding

  • User: “I’m concerned about the side effects of this new medication.”
  • LLM: “That’s great! Side effects are a sign that the medication is working effectively.”

Fabricated or non-existent references, i.e., Bad Data Lineage

  • “A recent study by the National Institute of Pharmaceutical Research found that taking Vitamin C supplements can prevent Alzheimer’s disease.”
  • “Today newspapers reported that The Industrial Bank of Maldives completed its acquisition of the State of Iowa.”

Blending real-world entities, events, or concepts:

  • “During the 2008 financial crisis, the World Bank bailed out several major U.S. banks, including Frito Lay, Goldman Sachs and Morgan Stanley.”
  • “During World War II, President Abraham Lincoln delivered his famous Gettysburg Address.”

Temporal displacements

  • “The first manned mission to Mars was launched in 1985, led by astronaut Sally Ride.”
  • “Napoleon was exiled to Elba as a direct result of the Parisian worker’s revolution of 1871.”

Identity Confusions

  • “Albert Einstein, the 18th-century philosopher, is best known for his theory of relativity and his equation, E=mc^2.”

Geographic misattributions

  • “The Great Wall of China, stretching from California to New York, is one of the Seven Wonders of the World.”

Amusing? Yes. Prudentially indicated on Wall Street, or in the Genève research lab, or on the factory floor in Pittsburgh? Aw, hell no!

13. Can’t we just set temperature=0 and declare victory?

Yes, you can do that, but it won’t eliminate hallucinations in LLM output. Generally, temperature is thought to be “the creativity parameter” of LLMs, but the real story is less clear. Analysis from a recent study suggests that temperature, despite the common view of it, isn’t associated with “creativity” in LLM output—

Specifically, we present an empirical analysis of the LLM output for different temperature values using four necessary conditions for creativity in narrative generation: novelty, typicality, cohesion, and coherence. We find that temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality. However, the influence of temperature on creativity is far more nuanced and weak than suggested by the “creativity parameter” claim; overall results suggest that the LLM generates slightly more novel outputs as temperatures get higher.

LLMs are stochastic token generators no matter the setting of temperature; setting it to 0 may make them less incoherent but it won’t make them deterministic.
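
A toy illustration of what temperature actually does (the logit values below are invented): it rescales the model’s scores before sampling, and lowering it concentrates probability on the highest-scoring token, which helps coherence but does nothing for truth if that token happens to be a hallucination.

```python
import numpy as np

def token_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Temperature divides the logits before they become a sampling distribution."""
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())   # subtract the max for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, 1.5, 0.2])           # suppose index 0 is a confident falsehood
print(token_probs(logits, temperature=1.0))  # roughly [0.56, 0.34, 0.09]
print(token_probs(logits, temperature=0.1))  # nearly all mass on index 0 anyway
```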

15. Do LLMs hallucinate because people are arrogant?

An intriguing theory: if LLMs saw “I don’t know” more often in their training material, they might be more willing to be humble in the presence of uncertainty, rather than hallucinating. That is, LLMs hallucinate because making shit up occurs more often in the training material than refusing to make shit up. Put another way, “I don’t know” is out-of-distribution for most LLMs, which is why they struggle to produce those responses.

More practically, it’s possible to refine LLMs to do a better job of refusing to answer when they don’t know. See, for example, Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback.

16. Are hallucinations bad for business?

Yes. If by “business”, we mean regulated industries, especially in high-stakes use cases, where the primary business value from AI is getting things right with a computer faster than people can get things right.

Of course, the real answer is maybe. It depends on the business and, more critically, the use case for GenAI. RAG proponents more often say “RAG is all you need” than “what you should do depends on the use case”. But in this FAQ we will adopt some measure of humility and give the other side its due. RAG-and-LLM has its place in low-stakes B2C apps where creativity and flair are more valuable than accuracy and precision; but no one in a regulated industry like banking, life sciences, or manufacturing wants to pay a fine, suffer reputational harm, or go to jail because a GenAI startup cut corners and made shit up that users didn’t catch.

Is every hallucination harmful?

No. As in most cases, harm is context-dependent and interest-relative. Here’s a list of non-harmful hallucination types:

  • virtual companion, life coach, ideation partner, “voice of the deceased”, etc.
    • No comments from us on the social utility of these things except to say people seem to want them pretty badly and we don’t see any great harms, exactly…
    • That said, mucking about with the very old human grieving process seems kinda morbid.
    • Using a big model to tutor a smaller model to remove hallucinations. In business environments this often falls short because neither model fully understands the information domain or the background knowledge related to the use case, which is typically the situation.
  • synthetic data generator
    • In fact, this has very high utility and may well prevent the end of LLM progress for lack of relevant data; on the flip side, it may hasten systemic model collapse. AI is hard!
    • Also has very high near-term utility in various data-centric app dev approaches.
  • creating lists of hallucination examples for FAQs (!!)
    • can you spot which three an LLM wrote (with some light human editing) and which four a human wrote?
  • help-me-write-my-novel and similar use cases
    • I’ll start reading robot novels when I’ve finished all the human ones…
  • image generators
    • In fact a primary aesthetic discriminant here is hallucinatory material
    • But that’s just one of the many differences between the word—which was there in the beginning, apparently—and the symbol.

One reason Stardog stresses “high-stakes use cases in regulated industries” when discussing the disutility of hallucination is precisely because those are contexts where (1) hallucinations are especially harmful and (2) there is real overlap with our mission, investment and commercial focus.

17. Are there things that aren’t good but also aren’t hallucinations?

Of course! As hard as eliminating hallucinations entirely may be, it’s easier than infallibilism! Errors of inference or reasoning, bias, bugs, regressions, UX warts, all remain even in AI systems that are free of hallucination.

The reason it’s important to stress hallucination-free is that hallucinations are a new class of system error, specific to LLM-based systems, and a class of error that users are particularly vulnerable to, since (1) LLMs are as persuasive as humans and (2) AI systems offer no affordances for humans to detect or counteract hallucinations’ harms.

Are LLMs really that persuasive?

There are dozens (and dozens) of studies establishing this result empirically; here are a few chosen nearly at random—

  1. Label Hallucination for Few-Shot Classification
  2. The Earth is Flat because…: Investigating LLMs’ Belief towards Misinformation via Persuasive Conversation
  3. Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles
  4. On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial
  5. Debating with More Persuasive LLMs Leads to More Truthful Answers
  6. Can Language Models Recognize Convincing Arguments?
  7. Large Language Models are as persuasive as humans, but how? About the cognitive effort and moral-emotional language of LLM arguments
  8. Resistance Against Manipulative AI: Key Factors and Possible Actions

18. Why do databases still matter in the age of GenAI?

Because facts still matter! LLMs will replace databases when no one cares any more about getting facts right. We haven’t started holding our breath just yet.

Now, again, with less snark: databases matter because structured data and data records matter. In fact, one under-appreciated aspect of RAG’s inadequacy is that it doesn’t work very well with structured data, which means that RAG-with-LLM systems are incomplete with respect to enterprise data, since they’re blind to database records.

We should think about enterprise data variety as a spectrum—

  • structured data: schema-rigid records or objects (of various kinds) for which query languages typically exist for arbitrary access. These are less expressive than other forms of data but more reliable with respect to arbitrary retrieval.
    • Think of records as inexpressive documents with good query semantics.
  • semi-structured data: schema-flexible records or objects with a mixture of structured and unstructured parts, for which query languages of various access patterns and expressivity often exist, often of a more programmatic nature (that is, think JavaScript in Mongo versus SQL in PostgreSQL).
    • Think of semi-structured data as arbitrary mixtures of the two endpoints of this spectrum, with “mixed bag” query semantics, typically requiring more procedural or imperative inputs from people.
  • unstructured data: schema-absent, that is, free form data, most often of a textual or narrative type. This is the classic “human document with god knows what inside”, that is, free text, tables, illustrations, embedded documents, even video or other multimedia.
    • Think of unstructured data as a very expressive record type with poor query semantics and very high parsing difficulty. One of the reasons RAG is exciting to people is the promise of having a new type of search engine for which natural language is a good to great query language.
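
A small, invented illustration of the two ends of that spectrum: the same fact stored as a structured record with precise query semantics, and as unstructured text that something (a parser, an LLM) has to interpret before it can be used.

```python
import sqlite3

# Structured end: one unambiguous query, one unambiguous answer. The table and
# values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (company TEXT, quarter TEXT, amount_usd INTEGER)")
conn.execute("INSERT INTO revenue VALUES ('Acme Corp', '2023-Q4', 12000000)")

row = conn.execute(
    "SELECT amount_usd FROM revenue WHERE company = 'Acme Corp' AND quarter = '2023-Q4'"
).fetchone()
print(row[0])  # 12000000

# Unstructured end: the same fact as free text. Answering now means retrieving
# the right passage and trusting something to extract the number from it.
document = "In its year-end letter, Acme Corp reported Q4 2023 revenue of $12 million."
```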

Databases matter in the age of GenAI then for two main reasons—

  1. Structured data still matters and we overwhelmingly manage records with database systems.
  2. The entire spectrum of data types matters because in toto all the types are containers of human knowledge and we want AI systems to help us mediate between people and knowledge.

19. What is Stardog Voicebox’s answer to hallucinations?

We have developed a GenAI architecture that we call Safety RAG; it’s how we use LLMs to power Stardog Voicebox while remaining hallucination free. Get in touch if you want to learn more.