April 1, 2026

AI Thinking

Why Enterprise AI Projects Blow Budgets: The Hidden Cost of ReAct Architecture

80% of companies exceed their AI cost forecasts by 25% or more. The primary driver is not model pricing. It is architectural waste: ReAct agents that retry failed actions, reload context at every step, and burn tokens on redundant reasoning loops. Compiled execution eliminates this waste by building the execution plan before processing begins.

AI budgets are spiraling. Not because models are expensive, but because most agent architectures are structurally wasteful. Every reasoning loop, every context reload, every retry burns tokens that produce no business value. The enterprises that control AI costs in 2026 will not be the ones negotiating better model pricing. They will be the ones that eliminate architectural waste from their execution pipelines.

The 2026 AI Cost Crisis

Global AI operational expenditure will exceed $500 billion in 2026. That number includes training, inference, infrastructure, and the integration costs that nobody talks about until the invoices arrive. For AI-native startups, the picture is even starker: some spend 60 to 80 cents of every revenue dollar on inference alone. The unit economics of "call an LLM for everything" do not work at scale.

Integration and maintenance account for 40 to 60% of total AI deployment costs. The model API bill is the visible expense. The invisible expenses are the engineering hours spent debugging stochastic failures, rebuilding prompts when outputs drift, and maintaining fragile tool-calling chains that break when upstream APIs change. These costs compound over time. The model bill stays flat. Everything around it grows.

Aaron Levie, CEO of Box, noted on LinkedIn that enterprise leaders are now asking "how we will budget for tokens across use-cases and teams." This is no longer an engineering conversation. It is a C-level conversation about operational expenditure that scales with usage, has no natural ceiling, and resists traditional forecasting methods.

Gartner projects that by 2028, 33% of enterprise software applications will include agentic AI, up from less than 1% in 2024. Every one of those deployments will face the same question: how do we control inference costs at scale? The platforms that answer that question architecturally, not just commercially, will win.

Where the Tokens Go

To understand why AI agent costs blow budgets, you need to trace where tokens are actually consumed. Consider a ReAct agent processing a 30-page commercial loan document. The agent needs to extract borrower information, financial data, collateral descriptions, covenant terms, and compliance certifications from five different document types within the package.

Context loading: At each step, the ReAct agent reloads the full context. The document content, the instructions, the prior extraction results, and the conversation history all go back into the prompt. A 30-page document produces roughly 15,000 tokens of raw text. By the time the agent completes its fifth reasoning cycle, it has loaded that context five times: 75,000 tokens consumed just on context, most of it redundant.
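The redundancy is easy to quantify. A minimal sketch of the reload arithmetic, using the figures above:

```python
# Illustration of redundant context loading under ReAct: the same
# 15,000-token document re-enters the prompt on every reasoning cycle.
DOC_TOKENS = 15_000                  # raw text of a 30-page document
cycles_completed = 5

context_consumed = DOC_TOKENS * cycles_completed
redundant = context_consumed - DOC_TOKENS   # everything past the first load

print(context_consumed, redundant)   # 75000 60000
```

Four of the five loads carry no new information; they exist only because the architecture has no shared state between cycles.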

Reasoning loops: The ReAct pattern follows a strict observe-think-act cycle. Each cycle, the agent produces a reasoning trace explaining what it sees, what it plans to do, and why. These traces are useful for debugging. They are also expensive. A typical reasoning trace runs 200 to 500 tokens per cycle. Across 8 to 12 cycles for a complex document, that is 2,400 to 6,000 tokens spent on reasoning that produces no extractable output. The agent is thinking out loud, and you are paying for every word.

Retries: When extraction fails or the output does not match the expected format, the agent retries. It reloads the context (again), adds error information to the prompt, and tries a different approach. A single failed extraction can trigger 2 to 3 retries, each consuming the full context window plus additional diagnostic tokens. In production, retry rates of 15 to 25% are common for complex documents. Each retry roughly doubles the token cost for that extraction step.

Sequential processing: ReAct agents process documents one section at a time. The borrower information extraction runs, completes, and returns results. Then the financial data extraction starts, loading the full context fresh. Then collateral. Then covenants. Five document types processed sequentially means five independent context loads, five independent reasoning chains, and no shared state between them.

The total for a single 30-page loan document: approximately 120,000 to 180,000 tokens. At current pricing for frontier models, that is $0.60 to $1.80 per document. Process 500 loan packages per day and you are spending $300 to $900 daily on a single workflow. Multiply across 10 or 20 workflows and the annual cost reaches seven figures for inference alone.
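The estimate above can be reproduced with a back-of-envelope model. Every constant below is an assumption drawn from the ranges in this section, not a measurement of any particular system:

```python
# Back-of-envelope model of where a ReAct agent's tokens go on one
# 30-page loan document. Constants are illustrative assumptions.
doc_tokens = 15_000        # raw document text
cycles = 9                 # observe-think-act cycles (8 to 12 typical)
trace_tokens = 350         # reasoning trace per cycle (200 to 500 typical)
retry_rate = 0.20          # common retry rate for complex documents

context = doc_tokens * cycles          # full context reloaded every cycle
reasoning = trace_tokens * cycles      # thinking out loud
retries = context * retry_rate         # each retry reloads context again

total = context + reasoning + retries
print(f"{total:,.0f} tokens")          # lands in the 120,000-180,000 range
```

Vary the cycle count and retry rate within their typical ranges and the total moves across the 120,000 to 180,000 band; it never approaches the single-load floor of 15,000.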

How Compiled Execution Eliminates Waste

Compiled execution takes a fundamentally different approach. Instead of giving an agent a goal and letting it figure out the steps at inference time, a policy engine compiles plain English rules into an execution plan before any document is processed. The plan specifies exactly which fields to extract, from which document types, using which methods, and in what order.
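As a rough sketch, the artifact a policy compiler emits might look like the structure below. The class and field names are hypothetical, chosen to illustrate the shape of a plan rather than any real platform's API:

```python
# Hypothetical shape of a compiled execution plan: which fields to
# extract, from which document types, by which method, in what order.
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    name: str
    method: str              # "llm" or "code"
    doc_type: str
    depends_on: tuple = ()   # step names that must finish first

PLAN = (
    Step("borrower_info", "llm", "loan_application"),
    Step("financial_data", "llm", "financial_statement"),
    Step("collateral", "llm", "collateral_schedule"),
    Step("covenants", "llm", "credit_agreement"),
    Step("compliance_certs", "llm", "certification"),
    # Deterministic check over already-extracted values: runs as code.
    Step("ltv_check", "code", "-", depends_on=("financial_data", "collateral")),
)

# Steps with no dependencies are independent and can run in parallel.
parallel = [s.name for s in PLAN if not s.depends_on]
print(parallel)
```

Because the plan exists before processing begins, properties like "which steps are independent" and "which steps cost zero tokens" are known up front rather than discovered mid-reasoning.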

Context loads once. The compiled plan ingests the document content a single time. Extracted data is stored in a shared context that all subsequent steps can reference. No reloading. No redundant token consumption. The 15,000-token document enters the pipeline once, not five or twelve times.

Deterministic operations compile to code. Calculations, comparisons, format validations, cross-field consistency checks: these operations do not require an LLM. A compiled plan identifies which steps require genuine language understanding and which can be executed as deterministic code. Code execution costs zero tokens. In a typical document processing workflow, 40 to 60% of operations are deterministic. That is 40 to 60% of the workflow running at zero inference cost.
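A concrete example of a zero-token step, using a retainage calculation from the construction-draw domain. The field names are hypothetical; in a compiled plan this step would read values already extracted into shared context:

```python
# A deterministic policy check that needs no LLM call: verify that
# retainage held matches the contractual rate. Zero tokens consumed.
def check_retainage(completed_to_date: float, retainage_held: float,
                    retainage_rate: float = 0.10) -> dict:
    expected = round(completed_to_date * retainage_rate, 2)
    return {
        "rule": "retainage_rate",
        "expected": expected,
        "actual": retainage_held,
        "passed": abs(expected - retainage_held) < 0.01,
    }

retainage = check_retainage(completed_to_date=250_000.00,
                            retainage_held=25_000.00)
print(retainage["passed"])   # True: 10% of $250,000 is $25,000
```

Budget math, threshold comparisons, date validations, and cross-field consistency checks all follow this pattern: plain code, exact answers, no inference cost.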

Structured LLM calls replace reasoning chains. When the plan does invoke an LLM, it uses constrained outputs against a pre-built schema. The model does not need to reason about what to extract or how to format the output. It receives a specific extraction target ("extract the aggregate coverage limit from the declarations page") with a typed output schema. No reasoning trace. No observe-think-act cycle. One call, one structured result.
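A sketch of what a constrained call looks like in practice. The JSON Schema format is standard; the extraction function below is a placeholder for whatever structured-output API your model provider exposes, and the hard-coded response stands in for the model's answer:

```python
# Constrained extraction: one field, one typed schema, one call.
# No reasoning trace, no observe-think-act cycle.
import json

COVERAGE_SCHEMA = {
    "type": "object",
    "properties": {
        "aggregate_coverage_limit": {"type": "number"},
        "currency": {"type": "string"},
        "source_page": {"type": "integer"},
    },
    "required": ["aggregate_coverage_limit", "currency", "source_page"],
}

def extract_coverage_limit(declarations_text: str) -> dict:
    # Placeholder: a real implementation passes COVERAGE_SCHEMA to the
    # model's structured-output mode and validates the response.
    raw = '{"aggregate_coverage_limit": 2000000, "currency": "USD", "source_page": 3}'
    extracted = json.loads(raw)
    missing = [k for k in COVERAGE_SCHEMA["required"] if k not in extracted]
    if missing:
        raise ValueError(f"schema violation: missing {missing}")
    return extracted

limit = extract_coverage_limit("...declarations page text...")
print(limit["aggregate_coverage_limit"])
```

The schema does double duty: it constrains the model's output at call time and validates it afterward, so malformed results fail loudly instead of propagating downstream.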

No retries. The plan is deterministic. If an extraction fails, it fails for a structural reason (the field does not exist in the document, the document is unreadable) rather than because the agent guessed wrong about which tool to call. Structural failures route to human review immediately instead of burning tokens on retry loops that will not succeed. The result: 3 to 5x fewer tokens for the same workflow, with higher accuracy and consistent outputs.
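The routing logic can be sketched in a few lines. The outcome categories are illustrative, but they capture the distinction: structural failures escalate immediately rather than entering a retry loop:

```python
# Structural-failure routing: a failed extraction goes straight to
# human review instead of burning tokens on retries that cannot succeed.
from enum import Enum

class Outcome(Enum):
    EXTRACTED = "extracted"
    FIELD_ABSENT = "field_absent"    # structural: field not in document
    UNREADABLE = "unreadable"        # structural: scan/OCR failure

def route(outcome: Outcome) -> str:
    if outcome is Outcome.EXTRACTED:
        return "continue_pipeline"
    # A missing field or unreadable page will be missing or unreadable
    # on every retry; escalate immediately.
    return "human_review"

print(route(Outcome.FIELD_ABSENT))
```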

The Parallelism Dividend

ReAct agents process sequentially by design. Each step depends on the reasoning output of the previous step. The agent cannot start extracting financial data until it finishes reasoning about borrower information, because the reasoning chain is continuous. This sequential dependency is architectural, not incidental. It is how the ReAct pattern works.

Compiled execution plans identify independent steps at compile time. Borrower extraction and collateral extraction do not depend on each other. Financial data extraction and covenant identification do not depend on each other. The compiled plan runs all independent steps in parallel. A workflow that processes five document types sequentially takes five times the wall-clock time of one that processes all five simultaneously.
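A minimal demonstration of the dividend, with a `time.sleep` standing in for a structured LLM call and step names mirroring the loan-document example:

```python
# Independent steps identified at compile time run concurrently.
# Five 0.1-second steps finish in roughly 0.1 seconds, not 0.5.
import time
from concurrent.futures import ThreadPoolExecutor

STEPS = ["borrower", "financials", "collateral", "covenants", "compliance"]

def extract(step: str) -> str:
    time.sleep(0.1)          # stand-in for one structured LLM call
    return f"{step}: done"

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(STEPS)) as pool:
    results = list(pool.map(extract, STEPS))
elapsed = time.perf_counter() - start

print(f"{len(results)} steps in {elapsed:.2f}s")
```

In a real deployment the parallel calls are network-bound LLM requests, which is exactly the workload that thread-based concurrency handles well.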

The cost implication is significant. Cloud compute is billed by time. A workflow that takes 30 seconds to complete consumes compute resources for 30 seconds. The same workflow completing in 6 seconds through parallelism consumes resources for 6 seconds. Speed and cost improve together. You do not choose between fast and cheap. Compiled execution delivers both.

For high-volume deployments, the parallelism dividend compounds. A lender processing 500 loan packages per day with sequential execution needs enough compute capacity to handle the queue without falling behind. With parallel execution, the same throughput requires a fraction of the compute. The infrastructure scales with the workflow's parallelizable surface area, not with the total number of sequential steps.

Token Budgeting for the Enterprise

Levie's observation deserves a closer look: token costs "probably won't make sense to be considered an IT budget anymore." He is right. Traditional IT budgets assume predictable, fixed costs: licenses, seats, infrastructure contracts. AI inference costs scale with usage in ways that resist traditional budgeting. A single workflow change can 3x token consumption overnight. A new document type can introduce retry patterns that double costs for an entire pipeline.

The answer is to treat AI compute like cloud compute. Allocate by workflow. Measure per-transaction. Optimize the highest-volume paths first. This requires a platform that makes per-transaction costs visible and predictable. ReAct agents cannot provide this because their token consumption varies with document complexity, retry rates, and reasoning depth. You can measure average costs, but you cannot predict them for a specific transaction.

Compiled execution makes per-transaction costs deterministic. The execution plan is fixed. The same document type, processed by the same compiled plan, consumes the same number of tokens every time (plus or minus minor variation in LLM output length). Finance teams can forecast AI costs the same way they forecast cloud compute: by multiplying transaction volume by per-transaction cost. No surprise spikes. No budget overruns from stochastic retry storms.

The practical implication for budget planning: start with your highest-volume workflow. Measure its current per-transaction token cost. Identify the waste (context reloads, reasoning traces, retries). Calculate the cost under a compiled execution model. The difference is your savings opportunity. For most enterprises, this single workflow optimization pays for the platform migration.
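The budgeting exercise reduces to multiplication once per-transaction costs are deterministic. A worked example using the per-document token figures from this article (illustrative estimates, not benchmarks):

```python
# Forecasting AI inference like cloud compute: volume x per-transaction
# cost. Token figures are the article's estimates for one workflow.
PRICE_PER_M_TOKENS = 3.00            # frontier-model input pricing

def monthly_cost(docs_per_day: int, tokens_per_doc: int,
                 business_days: int = 22) -> float:
    return docs_per_day * business_days * tokens_per_doc / 1_000_000 * PRICE_PER_M_TOKENS

react_cost = monthly_cost(docs_per_day=500, tokens_per_doc=150_000)
compiled_cost = monthly_cost(docs_per_day=500, tokens_per_doc=38_000)
savings = react_cost - compiled_cost

print(f"ReAct ${react_cost:,.0f}/mo vs compiled ${compiled_cost:,.0f}/mo")
```

The same spreadsheet-style model, applied workflow by workflow, turns an unpredictable operating expense into a line item finance teams can actually forecast.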

The Math: ReAct vs. Compiled for a Real Workflow

Let's make this concrete with a construction draw review. The workflow: process a draw package containing 30 pages across 5 document types (AIA G702/G703 forms, lien waivers, inspection reports, insurance certificates, and change orders). Apply 12 policy rules covering budget compliance, lien waiver completeness, insurance coverage minimums, retainage calculations, and change order authorization.

ReAct approach:

Context loading: 15,000 tokens per document, reloaded across approximately 12 reasoning cycles. Total context tokens: 180,000.
Reasoning traces: 400 tokens average per cycle across 12 cycles. Total reasoning tokens: 4,800.
Tool calls and observations: 2,000 tokens across 12 cycles.
Retries (estimated 20% retry rate): an additional 40,000 tokens.
Policy evaluation: 12 rules checked sequentially, each reloading extraction results. An additional 24,000 tokens.
Grand total: approximately 250,000 tokens per draw review. At $3 per million input tokens and $15 per million output tokens (frontier model pricing), that is roughly $1.50 per review.

Compiled approach:

Context loading: 15,000 tokens, loaded once.
Parallel extraction across 5 document types: 5 structured LLM calls at approximately 3,000 tokens each (constrained output, no reasoning trace). Total extraction tokens: 15,000.
Deterministic policy checks: 8 of 12 rules execute as code (budget math, threshold comparisons, date validations). Zero tokens.
LLM-based policy checks: 4 rules requiring judgment at approximately 2,000 tokens each. Total policy tokens: 8,000.
Grand total: approximately 38,000 tokens per draw review. Cost: roughly $0.25 per review.
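The arithmetic for both approaches, spelled out so the totals can be checked. Every constant comes from the estimates in this section; treat the results as illustrative, not measured:

```python
# Draw-review token math: ReAct vs compiled, using the article's estimates.
react = {
    "context": 15_000 * 12,      # full reload across ~12 reasoning cycles
    "reasoning": 400 * 12,       # reasoning traces
    "tools": 2_000,              # tool calls and observations
    "retries": 40_000,           # ~20% retry rate
    "policy": 24_000,            # 12 rules, sequential, reloading results
}
compiled = {
    "context": 15_000,           # loaded once
    "extraction": 5 * 3_000,     # 5 parallel structured calls
    "policy_code": 0,            # 8 of 12 rules run as code
    "policy_llm": 4 * 2_000,     # 4 rules requiring judgment
}

react_total = sum(react.values())        # ~250,000
compiled_total = sum(compiled.values())  # 38,000
ratio = react_total / compiled_total

print(f"{react_total:,} vs {compiled_total:,} tokens ({ratio:.1f}x)")
```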

The ratio: 6.5x fewer tokens. The cost difference: $1.25 per review. At 200 draw reviews per month, that is $250 in monthly savings on a single workflow. Scale to 20 workflows and the annual savings exceed $50,000. For large enterprises with hundreds of workflows processing thousands of documents daily, the savings reach millions.

These are conservative estimates. The gap widens with more complex documents, higher retry rates, and longer reasoning chains. The worst-performing ReAct workflows, the ones with 30%+ retry rates on complex multi-document packages, show token ratios of 8x to 10x compared to compiled equivalents.

Beyond Tokens: Total Cost of Ownership

Token cost is the most visible expense. It is not the largest. The total cost of ownership for an AI agent deployment includes three categories that compiled execution reduces simultaneously.

Workflow maintenance. ReAct agents are governed by prompts. When a business rule changes, someone must update the prompt, test it across edge cases, verify that the change did not break adjacent behaviors, and deploy it. This is engineering work. For a policy-driven platform, a rule change is a policy edit in plain English. The platform recompiles the execution plan automatically. The compliance officer who understands the regulation makes the change directly. No engineering ticket. No two-week sprint cycle. No regression testing against prompt fragility.

Debugging costs. When a ReAct agent produces an incorrect output, debugging requires reconstructing the reasoning chain to find where it went wrong. Was it a hallucinated tool call? A retry that corrupted state? A reasoning trace that drifted off-task? These failures are stochastic. The same input might produce the correct output on a rerun, making the bug unreproducible. Compiled execution produces structural failures. If the plan is wrong, it is wrong consistently. You can identify the exact step that failed, trace it to the policy rule that governs it, and fix it. Debugging time drops from hours to minutes.

Compliance costs. In regulated industries, every automated decision must be explainable. ReAct agents produce reasoning traces that read like stream-of-consciousness notes: "I think the coverage amount is $2M based on what I see on page 3, but let me check again..." This is not auditable evidence. Compiled execution produces structured why-trails: policy version applied, data extracted with source pointers, each condition evaluated with pass/fail results. Auditors and examiners can review decisions without interpreting an AI's thought process. The compliance cost of maintaining audit readiness drops significantly when every decision is structurally documented.
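The shape of a structured why-trail can be sketched as a plain record. The field names below are hypothetical, but they show the essentials: policy version, extracted values with source pointers, and per-rule pass/fail results:

```python
# Sketch of a structured "why-trail": the audit record a compiled plan
# can emit for each decision, in place of a free-form reasoning trace.
from dataclasses import dataclass, asdict

@dataclass
class RuleResult:
    rule_id: str
    value: str
    source: str          # pointer to where the data came from
    passed: bool

@dataclass
class WhyTrail:
    policy_version: str
    document_id: str
    results: list

trail = WhyTrail(
    policy_version="ins-coverage-v14",
    document_id="cert-2026-0412",
    results=[
        RuleResult("min_aggregate_coverage", "$2,000,000",
                   "page 3, declarations", True),
        RuleResult("policy_not_expired", "2026-12-31",
                   "page 1, header", True),
    ],
)

# Serializes to a reviewable record: nothing for an auditor to interpret.
record = asdict(trail)
print(record["policy_version"])
```

An examiner reviewing this record sees which policy version applied, where each value came from, and how each condition evaluated, with no stream-of-consciousness narration to decode.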

When you add maintenance, debugging, and compliance costs to token costs, the total cost advantage of compiled execution over traditional workflow approaches is not 3 to 5x. It is closer to 5 to 10x over a three-year deployment horizon. The token savings pay for the migration. The operational savings justify the platform.

Frequently Asked Questions

Why do enterprise AI projects exceed their budgets?

The primary driver is architectural waste, not model pricing. Most enterprise AI agents use ReAct-style execution that reloads context at every step, burns tokens on reasoning traces that produce no output, and triggers retry loops when extractions fail. These patterns make token consumption unpredictable and inflate costs by 3 to 10x compared to optimized architectures. Integration and maintenance costs, which account for 40 to 60% of total deployment costs, compound the problem.

What is compiled execution and how does it reduce AI costs?

Compiled execution is an approach where plain English business policies are compiled into deterministic execution plans before any document is processed. The plan specifies exactly which fields to extract, which steps run as code (zero tokens), and which steps require LLM calls (with constrained outputs). Context loads once, independent steps run in parallel, and there are no retry loops. The result is 3 to 5x fewer tokens for the same workflow compared to ReAct-style agents.

How many fewer tokens does compiled execution use?

In production workloads, compiled execution uses 3 to 6.5x fewer tokens than equivalent ReAct-based agents. The savings come from three sources: eliminating redundant context reloads (the largest source of waste), replacing reasoning chains with constrained structured outputs, and compiling deterministic operations to code that consumes zero tokens. For complex multi-document workflows with high retry rates, the ratio can exceed 8x.

How should enterprises budget for AI agent token costs?

Treat AI inference like cloud compute: allocate by workflow, measure per-transaction costs, and optimize the highest-volume paths first. Compiled execution makes this practical because per-transaction costs are deterministic. The same document type processed by the same compiled plan consumes the same number of tokens every time. Finance teams can forecast costs by multiplying transaction volume by per-transaction cost, the same methodology they use for cloud infrastructure.

What is the total cost of ownership for an AI agent platform?

Token costs are the visible expense, but they are typically less than half of total cost of ownership. The full picture includes workflow maintenance (updating prompts or policies when rules change), debugging (diagnosing stochastic failures vs. structural ones), and compliance (producing auditable evidence for every automated decision). Compiled execution reduces costs across all four dimensions. Over a three-year deployment horizon, the total cost advantage over ReAct-based architectures is 5 to 10x.

MightyBot compiles plain English policies into hybrid execution plans that eliminate architectural waste. Learn how compiled execution can reduce your AI agent costs.
