GenAI Cost Problems Are Not Model Problems

Jun 29, 2026, 9 minute read

Most conversations about GenAI costs start in the wrong place. The discussion usually centers on foundation model pricing, inference optimization, and model choice. The underlying assumption is that the model itself is a cost problem.

It usually isn’t.

In most enterprise AI systems, the largest cost leak happens before the model is even invoked. A major contributor is the retrieval layer. Here, pipelines often inject excessive amounts of text as enterprise knowledge into the model’s context window. By the time the model starts generating a response, the system has already multiplied the cost of the interaction due to this bloated context.

This explains why two enterprise AI systems can use the exact same foundation model yet incur very different costs per request. The difference lies in how they construct the context.

To build cost-efficient systems, you need to provide just the right amount of context, understand the user’s intent, and map that intent precisely to how enterprise systems and data relate, without burning unnecessary tokens. Token economics are dominated by how you construct context, not which model you call.

The Two Cost-Amplifying Patterns

Across enterprise AI systems, two patterns dominate: chunk-based retrieval systems built on retrieval-augmented generation (RAG), and Text-to-SQL pipelines that translate natural language into database queries.

The two differ in implementation, but the inefficiency is the same. RAG relies on retrieved text passages, while Text-to-SQL uses a long schema as context within the prompt. In both cases, the LLM is responsible for interpreting business concepts that were never defined upstream.

This inefficiency becomes evident when you look at how both patterns behave in production.

Why RAG Burns Tokens

In RAG systems, documents are split into chunks, converted into vector embeddings, and retrieved using a top-k similarity search at runtime. The retrieved passages are then inserted into the prompt as context for the LLM.

The problem is that retrieval relies on statistical similarity rather than a true understanding of context.

A RAG pipeline does not know which piece of information precisely answers the question. Instead, it retrieves candidate passages that seem related to the query. This creates uncertainty, and teams compensate for it by increasing the top-k value to avoid missing relevant context. They also enlarge the chunk size and add overlap so that each chunk carries sufficient context.

Each of these adjustments increases the number of tokens sent to the LLM.
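To make the effect of these knobs concrete, here is a minimal sketch of how retrieved-context size grows with top-k, chunk size, and overlap. The function name, the parameter values, and the worst-case assumption about overlap (retrieved chunks are adjacent in the source document) are illustrative assumptions, not measurements from any deployment.

```python
def rag_context_tokens(top_k: int, chunk_tokens: int, overlap_tokens: int = 0) -> dict:
    """Estimate the retrieved context injected into one prompt.

    Assumption: every retrieved chunk is `chunk_tokens` long. In the worst
    case the chunks are adjacent in the source document, so each chunk after
    the first repeats `overlap_tokens` tokens of its predecessor.
    """
    if overlap_tokens >= chunk_tokens:
        raise ValueError("overlap must be smaller than the chunk size")
    total = top_k * chunk_tokens
    redundant = max(top_k - 1, 0) * overlap_tokens
    return {"total": total, "redundant_worst_case": redundant}


# Hypothetical settings before and after "tuning for recall".
baseline = rag_context_tokens(top_k=3, chunk_tokens=256)
tuned = rag_context_tokens(top_k=10, chunk_tokens=512, overlap_tokens=64)

print("baseline:", baseline)
print("tuned:   ", tuned)
print("growth factor:", round(tuned["total"] / baseline["total"], 2))
```

The sketch counts only retrieved context. Instructions, the user question, and any conversation history add to the total on every call.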

The result is token bloat and redundant context. In many enterprise deployments, the model spends more time filtering irrelevant information than reasoning about the actual task. Hallucination risks increase because the system asks the LLM to resolve ambiguities that the system design never resolved upstream.

This is the core inefficiency in most RAG systems: the LLM pays for disambiguation in tokens.

Why Text-to-SQL Doesn’t Resolve the Problem

Text-to-SQL refers to the task of converting a natural language query into an SQL query. To accomplish this, a model must understand the user’s intent and then map it onto the database schema, including tables, columns, and relationships.

This approach does not resolve the cost problem.

The primary inefficiency is in the size of the input context that must be included within the model’s context window. Enterprise schemas can be extensive. In many Text-to-SQL approaches, relevant portions of the schema must be included in the prompt. This can lead to significant token consumption.

Even with the complete schema context, business meaning still goes missing. Business definitions, such as “active customer,” may vary depending on factors like time windows or activity criteria. These definitions are typically not encoded in schema metadata and vary across different business domains and teams.
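The “active customer” ambiguity can be sketched in a few lines. The records, the two definitions, and the window lengths below are hypothetical. The point is only that both definitions are valid against the same data and return different answers, and nothing in the schema says which one the user means.

```python
from datetime import date, timedelta

# Hypothetical activity records: customer id -> (last login, last purchase).
customers = {
    "c1": (date(2026, 6, 20), date(2026, 1, 5)),
    "c2": (date(2026, 3, 1), date(2026, 6, 1)),
    "c3": (date(2025, 12, 15), date(2025, 11, 30)),
}
today = date(2026, 6, 29)


def active_by_login(window_days: int) -> set[str]:
    """One team's definition: logged in within the window."""
    cutoff = today - timedelta(days=window_days)
    return {cid for cid, (login, _) in customers.items() if login >= cutoff}


def active_by_purchase(window_days: int) -> set[str]:
    """Another team's definition: purchased within the window."""
    cutoff = today - timedelta(days=window_days)
    return {cid for cid, (_, purchase) in customers.items() if purchase >= cutoff}


print("active (login, 30 days):   ", sorted(active_by_login(30)))
print("active (purchase, 90 days):", sorted(active_by_purchase(90)))
```

A model that sees only table and column names has to guess between such definitions, and a wrong guess surfaces later as a retry or a rewrite.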

This results in execution inefficiency. Queries frequently require retries and rewrites. Data is often over-fetched and then reprocessed by the model for interpretation and formatting, leading to additional inference cycles and ultimately increased token consumption. This shifts the cost instead of reducing it.

The Common Failure in Both Patterns

RAG and Text-to-SQL follow the same pattern, where meaning is not explicitly defined and must be inferred at runtime by the LLM.

As a result, the LLM is compelled to assume multiple roles simultaneously. It acts as a query planner, interpreting the user’s natural language query and translating it into an SQL query. It also acts as a semantic resolver, inferring business definitions that are not explicitly specified. And it acts as a relevance engine, filtering and reconciling incomplete or noisy context.

This multi-role responsibility does not scale efficiently in production environments. Each additional role the LLM takes on increases token usage, inference time, and output variability.

The core issue is not the model’s capability or pricing. It is the lack of structured meaning before the model is invoked. The missing semantic layer is what drives both token usage and costs higher.

Results from Stardog Customer Deployments

4x reduction in prompt token volume/cost vs. Text-to-SQL pipelines

20–40% improvement in retrieval precision vs. Text-to-SQL pipelines

30% faster response times for answers vs. Text-to-SQL pipelines

Source: Stardog customer deployments

How Ontologies Optimize Token Economics

Organizations can introduce ontologies into their data stack to establish a structured representation of enterprise semantics. These ontologies can be positioned above the data and retrieval layers.

An ontology defines entities such as customers, products, and accounts, along with the relationships between them. It provides a structured layer where business concepts are formally represented.

Teams can use ontologies as a predefined layer to enable reusable and consistent business definitions across an organization.

The impact of ontologies on token economics can be analyzed through several core system behaviors.

Deterministic Retrieval

Ontologies enable deterministic retrieval rather than probabilistic retrieval. Systems retrieve information as structured, meaningful data, without relying on similarity search, top-k approximation, or probabilistic guessing. The system becomes aware of how entities and relationships are defined.

It does not need to search through large sets of documents to understand relationships. For example, a query for “customer purchase history” can directly retrieve the exact customer and order entities instead of searching through documents.
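As a rough illustration of the “customer purchase history” example, the sketch below stores a handful of hypothetical facts as subject-predicate-object triples and answers the question with an exact relationship lookup. The identifiers, the predicate names, and the in-memory index are assumptions made for illustration. This is not how any particular graph database is implemented.

```python
from collections import defaultdict

# Hypothetical facts as (subject, predicate, object) triples.
TRIPLES = [
    ("customer:42", "type", "Customer"),
    ("customer:42", "name", "Acme Corp"),
    ("order:1001", "type", "Order"),
    ("order:1001", "placedBy", "customer:42"),
    ("order:1001", "total", 250.0),
    ("order:1002", "type", "Order"),
    ("order:1002", "placedBy", "customer:42"),
    ("order:1002", "total", 99.5),
    ("order:2001", "type", "Order"),
    ("order:2001", "placedBy", "customer:7"),
    ("order:2001", "total", 10.0),
]

# Index by (predicate, object) so a relationship lookup is exact, not ranked.
subjects_by_pred_obj = defaultdict(list)
properties = defaultdict(dict)
for s, p, o in TRIPLES:
    subjects_by_pred_obj[(p, o)].append(s)
    properties[s][p] = o


def purchase_history(customer_id: str) -> list[dict]:
    """Return exactly the orders linked to the customer via `placedBy`."""
    orders = sorted(subjects_by_pred_obj.get(("placedBy", customer_id), []))
    return [{"order": o, "total": properties[o]["total"]} for o in orders]


print(purchase_history("customer:42"))
```

There is no top-k parameter to tune here. The lookup returns the orders that are linked to the customer and nothing else, so the other customer’s order never reaches the prompt.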

This eliminates the need to retrieve documents in large chunks and reduces the inclusion of irrelevant context in the prompt.

Context Compression

Ontologies condense context by replacing raw text and schema dumps with entities and relationships. Instead of passing full documents or database schemas into the LLM context window, the system uses structured meaning.

This eliminates the need to include large volumes of reference material in every prompt.

As a result, each query requires less information while preserving the full business meaning needed to produce correct, consistent outputs across downstream tasks.
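The sketch below compares one hypothetical retrieved passage with the same facts serialized as entity-relationship triples. Everything in it is an assumption for illustration: the passage, the identifiers, and especially the size proxy, which counts whitespace-separated words rather than real tokens. An actual tokenizer would split identifiers such as `customer:42` into several tokens, so treat the comparison as directional only.

```python
def rough_tokens(text: str) -> int:
    """Crude size proxy: whitespace-separated words, not a real tokenizer."""
    return len(text.split())


# Hypothetical passage of the kind a chunk-based pipeline might inject.
raw_chunk = (
    "Acme Corp has been a customer since 2019 and is managed by the "
    "enterprise sales team. In the second quarter the account placed two "
    "orders, the first for 250.00 USD and the second for 99.50 USD, and "
    "the account manager noted that renewal discussions are ongoing."
)

# Only the facts the question needs, as entity-relationship triples.
facts = [
    ("customer:42", "name", "Acme Corp"),
    ("order:1001", "placedBy", "customer:42"),
    ("order:1001", "totalUSD", 250.0),
    ("order:1002", "placedBy", "customer:42"),
    ("order:1002", "totalUSD", 99.5),
]


def facts_to_context(triples) -> str:
    """Serialize triples one per line for inclusion in a prompt."""
    return "\n".join(f"{s} {p} {o}" for s, p, o in triples)


structured = facts_to_context(facts)
print("raw chunk size:  ", rough_tokens(raw_chunk))
print("structured size: ", rough_tokens(structured))
print(structured)
```

In a chunk-based pipeline the raw side is multiplied by top-k, while the structured side grows only with the number of facts the question actually needs.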

No Schema Prompting

The LLM does not need to interpret tables, joins, or database structures. Prompts no longer contain schema explanations.

Instead, the system relies on the ontology to provide that structure in advance. This eliminates the need to include table definitions, column mappings, or relationship logic in the prompt each time.

The LLM no longer needs to reconstruct or infer how the underlying data is organized before executing a task.

Fewer Retries

An ontology layer makes semantics explicit and consistent throughout the system, which reduces ambiguity at the source. Because business definitions are standardized, the model has less room to misinterpret intent or produce incorrect outputs.

The result is fewer failed queries and fewer regeneration loops. The system does not need to repeatedly regenerate SQL or re-run interpretations to correct mistakes.

Outputs become more consistent across repeated tasks, reducing correction cycles and making production workflows more stable.

The Real Cost Equation

A common assumption in GenAI systems is that cost is primarily a function of model pricing. In practice, cost is dominated by system-level token consumption across retrieval, prompting, and retries.

Measured in tokens, cost can be approximated as:

Cost = (Tokens per prompt × Prompts per task) × Tasks

Different retrieval patterns influence this equation in distinct ways. RAG increases tokens per prompt due to redundant context injection. Text-to-SQL increases both tokens per prompt and, through retries caused by schema and semantic ambiguity, prompts per task.

Ontology-based systems reduce both components by minimizing context size and eliminating repeated inference cycles. This lowers token usage per request and creates a more stable cost structure at scale.
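The approximation can be turned into a small calculator. The sketch below uses made-up inputs for three scenarios and adds one assumption that is not part of the formula above: a flat price per million tokens to convert token volume into money. None of the numbers are measurements. They exist only to show how the terms multiply.

```python
def total_tokens(tokens_per_prompt: int, prompts_per_task: int, tasks: int) -> int:
    """Token volume: (tokens per prompt x prompts per task) x tasks."""
    return tokens_per_prompt * prompts_per_task * tasks


def dollars(tokens: int, usd_per_million_tokens: float) -> float:
    """Assumed extension: convert token volume to money at a flat price."""
    return tokens / 1_000_000 * usd_per_million_tokens


TASKS = 100_000          # hypothetical monthly task volume
USD_PER_MILLION = 2.0    # hypothetical price, not any vendor's rate

# Hypothetical (tokens per prompt, prompts per task) for each pattern.
scenarios = {
    "chunk-based RAG": (6_000, 1),
    "Text-to-SQL": (4_000, 2),   # one retry on average
    "semantic layer": (1_500, 1),
}

for name, (tokens_per_prompt, prompts_per_task) in scenarios.items():
    tokens = total_tokens(tokens_per_prompt, prompts_per_task, TASKS)
    cost = dollars(tokens, USD_PER_MILLION)
    print(f"{name}: {tokens:,} tokens, ${cost:,.2f}")
```

Because the terms multiply, a retry doubles the bill for that task no matter how carefully the prompt itself was trimmed.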

How Stardog’s Semantic Layer Operationalizes Meaning

Stardog turns ontologies into an operational semantic layer that shapes how data is queried and interpreted.

Stardog’s virtual graphs are at the center of this layer. They connect distributed sources like Databricks, Snowflake, and PostgreSQL and present them as a unified knowledge graph.

The LLM does not need to interpret raw database structures because the interpretation already exists in the semantic layer.

The application layer first translates the user’s intent into a graph query, using Stardog’s knowledge graph as the semantic model. Stardog then converts that graph query into source-specific SQL and executes it directly across live operational data systems when a request is made. The LLM does not process raw schemas or tables. It receives data that has already been mapped to business concepts through the semantic layer.

This is the important shift. The ontology stops functioning as a diagram and becomes an operational layer.

LLMs no longer need to perform inference over business rules themselves. As business rules and definitions are encoded in the knowledge graph, Stardog’s inference engine applies them at query time. There is no need to materialize them in advance or reconstruct them in every prompt.

The semantic layer acts as an orchestration layer between user intent and enterprise data. The LLM receives only the structured context. This resolves ambiguity at both the data and ontology levels before the LLM is invoked. AI systems work with precise context, reducing the number of tokens per prompt.

Moving in the Right Direction

Model improvements alone do not resolve structural inefficiencies in enterprise AI systems. The main cost driver is the way systems handle data and meaning before the model is ever called.

When an LLM works with approximate retrieval and ambiguous business logic, it consumes runtime tokens to infer missing meaning. This approach does not scale effectively in production environments.

The deeper question is who is responsible for handling the meaning of business data. RAG and Text-to-SQL systems delegate semantic resolution to the prompt window, so the LLM must interpret the data’s meaning while generating the answer.

Ontology-based systems change this. Business definitions, rules, and relationships are predefined within a structured layer before the LLM is invoked.

The winning pattern for token economics is to add a semantic layer, reduce ambiguity upstream, send less context, and pay less per task.

If you want to reduce GenAI costs using a semantic layer, contact us and we’ll connect you with an expert to get started right away.
