April 1, 2026 • AI Thinking
Summary: Most AI agent frameworks use ReAct loops that hallucinate tool calls, burn tokens, and produce non-deterministic results. MightyBot compiles plain English policies into hybrid execution plans that combine LLM reasoning with deterministic code paths: fewer tokens, no retries, and inspectable behavior before a single action runs.
Every production AI agent failure you have seen shares a root cause: the agent was guessing what to do next.
The dominant paradigm for building AI agents today is the ReAct pattern (Reasoning + Acting), popularized by Yao et al. in 2022 and adopted by nearly every major framework. LangChain, CrewAI, AutoGen, Semantic Kernel: they all implement some version of the same loop. The agent receives a task, reasons about what tool to call, calls it, observes the result, reasons again, and repeats until it reaches an answer or hits a retry limit. This works well in demos. It falls apart in production.
The core problem is not the LLM itself. It is the execution model. When you give an agent a goal and a set of tools with no pre-compiled plan, every step is an inference-time decision. The agent must decide which tool to use, what arguments to pass, how to interpret the result, and whether to retry. Each decision is a chance to hallucinate. Multiply that across a five-step workflow running thousands of times per day, and you get a system that is fundamentally non-deterministic in a context where your business requires determinism.
MightyBot takes a different approach. Instead of giving an agent tools and hoping it figures out the right sequence, MightyBot compiles plain English policies into execution plans before any action runs. The result is a hybrid system: LLM-based reasoning where judgment is genuinely needed, deterministic code paths everywhere else. No drag-and-drop workflow builders. No ReAct loops. No retries.
The ReAct pattern is elegant in theory. An agent interleaves reasoning traces ("I need to find the customer's invoice") with actions ("call the search_invoices tool with query='customer ABC'") and observations ("returned 3 results"). The reasoning step is a chain-of-thought prompt that asks the LLM to plan its next move. The action step calls a tool. The observation step feeds the tool output back into the LLM for the next reasoning cycle.
Here is what a ReAct loop looks like for a simple document processing task. The agent needs to extract a vendor name, invoice number, and total amount from a PDF.
Cycle 1: The agent reasons that it needs to read the document. It calls an OCR tool. The tool returns raw text.
Cycle 2: The agent reasons that it needs to find the vendor name. It scans the text, identifies a candidate, and stores it. But it is not confident, so it reasons that it should try a different extraction approach.
Cycle 3: The agent calls a structured extraction tool with a prompt it generates on the fly. The prompt is slightly different from what a human would write. The tool returns a partial result.
Cycle 4: The agent notices the invoice number is missing. It reasons that the OCR might have failed on that region. It retries OCR with different parameters.
Cycle 5: The agent finally assembles a result. Four LLM calls, two OCR calls, one structured extraction call. Total tokens consumed: 12,000+. Time elapsed: 8 seconds. And the result may differ if you run it again on the same document.
This is not a pathological case. This is the normal behavior of a ReAct agent handling a straightforward task.
Three failure modes dominate ReAct agents in production environments.
Hallucinated tool calls. The agent invents tool names that do not exist, passes malformed arguments, or calls tools in sequences that violate business logic. A ReAct agent processing insurance documents might call a "verify_policy" function that was never registered, then fail silently or retry with a different (also hallucinated) function name. Guardrails can catch some of these, but the fundamental problem remains: the agent is improvising its execution path.
Token burn from retries. When a tool call fails, the agent reasons about why it failed and tries again. This retry loop is unbounded in many frameworks. A single failed extraction can consume 5x the tokens of a successful one. At scale, this makes cost modeling nearly impossible. You cannot predict what a ReAct agent will cost to run because you cannot predict how many retries it will need.
Non-deterministic outputs. Run the same ReAct agent on the same input twice and you may get different results. The reasoning trace depends on the LLM's sampling, which varies between calls. For a chatbot, this is acceptable. For a system processing loan applications, insurance claims, or compliance documents, it is not. Your business logic should not depend on whether the LLM happened to reason well on this particular invocation.
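The token burn from retries is easy to model: because each retry resends the full conversation history, cumulative cost grows quadratically with the retry count. A toy calculation, with the prompt and step sizes as assumed round numbers:

```python
def react_retry_tokens(base_prompt, step_tokens, retries):
    """Cumulative tokens when each retry resends the full history.

    Toy model: attempt i sends the base prompt plus all i prior failed
    steps as input, then emits one new step of output. The sizes are
    illustrative assumptions, not measurements.
    """
    total = 0
    history = base_prompt
    for _ in range(retries + 1):
        total += history + step_tokens   # full context in, one step out
        history += step_tokens           # the failed attempt joins the history
    return total
```

With a 1,500-token prompt and 500-token steps, a clean pass costs 2,000 tokens while two retries already cost 7,500, nearly a 4x multiplier on a single failed extraction.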
MightyBot replaces the ReAct loop with a compiled execution model. The process has three phases.
Phase 1: Policy definition. You write a policy in plain English describing what the agent should do. For the document processing example: "Extract vendor name, invoice number, and total amount from the uploaded invoice. Vendor name is in the header area. Invoice number follows the label 'Invoice #' or 'Inv No.' Total amount is the last dollar figure in the totals section."
Phase 2: Schema compilation. The platform analyzes your policy and builds a typed schema. Each field gets a name, a type, extraction hints (where to look, what format to expect), and validation rules. This schema is compiled once, not generated at inference time. The schema becomes a contract: the execution plan knows exactly what it is looking for before it processes a single document.
Phase 3: Hybrid execution. The compiled plan runs as a mix of deterministic code and targeted LLM calls. OCR runs once. Field extraction uses structured output against the pre-built schema. Validation is pure code: type checks, format checks, cross-field consistency checks. The LLM is only invoked where genuine judgment is required, such as disambiguating a vendor name that appears in multiple locations.
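Phase 3 can be sketched as a single-pass function: deterministic steps run as plain code, and the LLM is consulted only where the plan marks a field as needing judgment. Every callable here is an illustrative stand-in, not a real MightyBot API:

```python
def run_compiled_plan(document, plan):
    """One-pass hybrid execution over a compiled plan (illustrative sketch).

    `plan` bundles the steps: an OCR callable, one structured-output
    extraction call, pure-code validators, and a disambiguation callable
    used only for fields the plan marks as needing judgment.
    """
    text = plan["ocr"](document)          # deterministic: OCR runs exactly once
    fields = plan["extract"](text)        # single structured-output call
    errors = []
    for name, check in plan["validators"].items():   # pure code: no LLM
        if name not in fields or not check(fields[name]):
            errors.append(name)
    # Targeted LLM judgment, only where the plan allows it
    for name in plan.get("needs_judgment", []):
        if name in errors:
            fields[name] = plan["disambiguate"](name, text)
            errors.remove(name)
    return fields, errors
```

There is no loop back to the model: a field either passes its validator, is resolved by a single deliberate judgment call, or surfaces as an error for human review.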
The same document that took a ReAct agent five cycles and 12,000+ tokens processes in one pass with under 3,000 tokens. The output is identical every time for the same input. There are no retries because there is nothing to retry: the plan knows exactly what to extract and how to validate it.
Consider a production invoice processing workflow handling 10,000 documents per day.
ReAct approach: Each document triggers 3 to 7 LLM calls depending on complexity and retry behavior. Average token consumption: 8,000 per document. At GPT-4 pricing, that is roughly $0.24 per document, or $2,400 per day. Processing time: 6 to 15 seconds per document. Accuracy: 89 to 94%, varying by run. Failed extractions require human review, adding labor cost.
Compiled approach: Each document triggers 1 to 2 LLM calls (OCR + structured extraction against the compiled schema). Average token consumption: 2,500 per document. Cost: roughly $0.075 per document, or $750 per day. Processing time: 2 to 3 seconds per document. Accuracy: 97%+, consistent across runs. Human review is only triggered by genuinely ambiguous documents, not by agent confusion.
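The per-document and per-day figures above follow from straightforward arithmetic, assuming a blended rate of roughly $0.03 per 1,000 tokens:

```python
def daily_cost(docs_per_day, avg_tokens_per_doc, usd_per_1k_tokens=0.03):
    """Token cost at volume. The $0.03/1K blended rate is an illustrative
    assumption; substitute your provider's actual pricing."""
    per_doc = avg_tokens_per_doc / 1000 * usd_per_1k_tokens
    return per_doc, per_doc * docs_per_day

react_per_doc, react_per_day = daily_cost(10_000, 8_000)        # ReAct
compiled_per_doc, compiled_per_day = daily_cost(10_000, 2_500)  # compiled
```

At these assumptions the ReAct path comes to $0.24 per document ($2,400/day) and the compiled path to $0.075 per document ($750/day), matching the figures above.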
The cost difference is significant. The consistency difference is what matters for production. When your downstream systems depend on extracted data, non-deterministic outputs create cascading failures that are expensive to debug and expensive to fix.
One of the most underappreciated advantages of compiled execution is pre-execution observability. A compiled execution plan is an artifact you can inspect, test, and version before it processes a single real document.
With a ReAct agent, you cannot know what it will do until it runs. The reasoning trace is generated at inference time. You can add logging, but you are logging what already happened. If the agent hallucinated a tool call on step 3, you find out after step 3 ran and potentially caused side effects.
With a compiled plan, the entire execution graph is visible before runtime. You can see which tools will be called, in what order, with what arguments. You can see where LLM judgment is required and where deterministic code handles the logic. You can run the plan against test documents and verify behavior before deploying to production. You can diff two versions of a plan to see exactly what changed when a policy was updated.
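Because the plan is data rather than inference-time reasoning, inspecting and diffing it is ordinary programming. A sketch, assuming a simple step-list representation of a plan (not MightyBot's actual format):

```python
import difflib
import json

def plan_summary(plan):
    """Render a compiled plan as stable text, one step per line."""
    return [f"{i}: {step['op']} {json.dumps(step.get('args', {}), sort_keys=True)}"
            for i, step in enumerate(plan["steps"])]

def diff_plans(old, new):
    """Show exactly what changed between two plan versions, before either runs."""
    return list(difflib.unified_diff(plan_summary(old), plan_summary(new),
                                     "v1", "v2", lineterm=""))

# Two versions of a plan: v2's policy update added invoice_number
v1 = {"steps": [{"op": "ocr"},
                {"op": "extract",
                 "args": {"fields": ["vendor_name", "total_amount"]}}]}
v2 = {"steps": [{"op": "ocr"},
                {"op": "extract",
                 "args": {"fields": ["vendor_name", "invoice_number",
                                     "total_amount"]}}]}
```

Running `diff_plans(v1, v2)` surfaces the changed extraction step as a one-line diff, the kind of pre-deployment review that is impossible when the execution path is generated at inference time.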
This is the same principle that makes compiled programming languages easier to reason about than dynamically typed, interpreted ones. The compilation step catches errors early. The execution step is predictable.
Compiled execution is not the right tool for every AI agent use case. ReAct loops have genuine strengths in scenarios where the execution path cannot be known in advance.
Open-ended research. An agent exploring a codebase to answer a question about architecture cannot have a pre-compiled plan because the path depends on what it finds. Each file it reads changes what it needs to read next. ReAct's adaptive reasoning is well-suited here.
Conversational agents. A chatbot responding to unpredictable user questions needs the flexibility to reason about each turn independently. The "plan" is the conversation itself.
Exploration and discovery. An agent tasked with "find all the ways this API is used across our codebase" needs to follow leads, backtrack, and adapt. Pre-compilation would over-constrain it.
The pattern is clear: if the task is repeatable and the desired output structure is known, compiled execution wins. If the task is genuinely open-ended and the output structure depends on what the agent discovers, ReAct is appropriate. Most business processes fall squarely in the first category. Invoice processing, compliance checks, data extraction, report generation, onboarding workflows: these are all structured, repeatable tasks that should not be left to inference-time improvisation.
If you are evaluating agent platforms or building agents in-house, the architecture decision between ReAct and compiled execution will define your production experience. Ask these questions:
Can you predict the output schema? If yes, compile it. Do not let the agent invent the schema at runtime.
Do you need consistent results? If the same input must produce the same output every time, ReAct's non-determinism is a liability. Compiled plans with deterministic code paths give you reproducibility.
Can you afford retries? Calculate the token cost of your worst-case ReAct loop. Multiply by your daily volume. Compare that to a single-pass compiled execution. The cost difference at scale is often 3x to 5x.
Do you need auditability? Compiled plans produce why-trails: complete records of what was extracted, from where, under which policy version, with what confidence. ReAct traces are useful for debugging but are not structured audit artifacts.
The AI agent space is moving fast. But the fundamentals of software engineering have not changed. Compiled beats interpreted for production workloads. Deterministic beats probabilistic for business-critical processes. Pre-planned beats improvised for systems you need to trust.
What is the ReAct pattern in AI agents?
ReAct (Reasoning + Acting) is an agent architecture where the LLM alternates between reasoning about what to do next and taking actions like calling tools or APIs. The agent observes the result of each action and reasons again, creating a loop that continues until the task is complete. Most popular frameworks, including LangChain, CrewAI, and AutoGen, implement variants of this pattern.
Why do ReAct agents hallucinate tool calls?
ReAct agents decide which tool to call at inference time, meaning the LLM generates the tool name and arguments as part of its reasoning trace. If the LLM has imperfect knowledge of available tools, or if the tool descriptions are ambiguous, it may generate calls to tools that do not exist or pass arguments in the wrong format. This is a structural problem with the pattern, not a model quality issue. Better models reduce the frequency but do not eliminate it.
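One structural mitigation is to validate every generated call against the tool registry before dispatch, so a hallucinated name or malformed arguments fails fast instead of executing or retrying blindly. A minimal sketch; the registry shape is an assumption, not a specific framework's API:

```python
import inspect

def validate_tool_call(registry, name, args):
    """Reject a generated tool call before dispatch (illustrative sketch).

    `registry` maps tool names to callables. The name is checked against
    the registry and the arguments against the callable's signature.
    Returns an error string, or None if the call is well-formed.
    """
    if name not in registry:
        return f"unknown tool {name!r}; available: {sorted(registry)}"
    try:
        inspect.signature(registry[name]).bind(**args)
    except TypeError as exc:
        return f"bad arguments for {name!r}: {exc}"
    return None
```

Guards like this catch the malformed call, but as noted above they do not remove the underlying improvisation; they only convert silent failures into explicit ones.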
How does compiled execution reduce token costs?
Compiled execution eliminates retry loops and redundant reasoning steps. In a ReAct loop, every retry requires the full conversation history to be sent back to the LLM, including previous failed attempts. A compiled plan runs each step once against a pre-built schema, using structured output to get the result in a single call. For document processing workloads, this typically reduces token consumption by 60 to 70% compared to ReAct.
Can you combine compiled execution with ReAct in the same system?
Yes. MightyBot's hybrid execution model uses deterministic code paths for structured, predictable steps and LLM-based reasoning only where genuine judgment is required. For example, a document processing agent might use compiled extraction for standard fields and an LLM call to resolve ambiguous cases like a vendor name that appears in multiple locations. The key is that the LLM is invoked deliberately, at specific points in the plan, rather than controlling the entire execution flow.