Enterprises Are Rebuilding Their AI Agents From Scratch — Because They Skipped the Plumbing

A wave of first-generation agentic deployments is failing in production. The culprit isn't the models. It's the infrastructure underneath them.

Written by Lena Armitage · Bureau Tech · May 31, 2026

The agents are failing. The models are fine.

Here is the uncomfortable finding from the first wave of enterprise AI agent deployments: the language models mostly worked. The infrastructure around them mostly didn't.

That's the pattern Preeti Somal, Senior VP of Engineering at Temporal Technologies, described at a recent AI Impact Series event in New York. "We do have a lot of customers that come to us where they're building version 2.0 of the same agent," she said. "They had to move really fast, but they didn't take care of the plumbing. Things crash and burn, and then they're back to rebuilding with the reliable foundation."

Temporal, whose workflow orchestration infrastructure predates the current agentic AI wave, has become a reference point for enterprises trying to understand why their agents keep breaking. The answer, Somal argues, is that production AI systems require capabilities that model benchmarks don't measure: durable execution, state management, failure recovery, and cost visibility across multi-step workflows.

What 'long-running' actually means — and why it changes everything

Most early AI deployments were stateless: a user sends a prompt, a model returns a response, the interaction ends. Agentic systems are different. A single enterprise workflow might call several large language models (LLMs — the large neural networks that power tools like ChatGPT and Claude), access retrieval databases, trigger external APIs, and persist over hours or days.

That duration creates failure surfaces that don't exist in simple chatbot interactions. "People will write agents but haven't thought about what happens if the agent crashes," Somal said. "Am I going to need to run the entire agent flow again?"

For enterprises operating under cost constraints, that question has a dollar sign attached. Every model call consumes tokens — the units of text that LLM providers charge for. Restarting a failed workflow from the beginning means paying for every prior step again, with nothing to show for it.

State vs. memory: a distinction that matters more than it sounds

One source of architectural confusion in agentic AI is the conflation of two related but distinct concepts: state and memory.

Somal draws a clear line between them. State concerns workflow execution — where an agent is in a process, which actions have already completed, and where recovery should resume after a failure. Memory, or context, captures information an agent carries forward across interactions or tasks.

"The state of the agent is around what step and what actions have been performed, and if something crashes, where do you want to recover from, versus the context and memory piece," she explained.

The distinction becomes load-bearing when workflows grow complex. Somal pointed to Abridge, a healthcare company and Temporal customer, whose workflows process physician visits through multiple sequential stages: audio processing, summarization, multiple LLM calls, and after-visit summary generation. Each stage has its own state. Losing track of which stages completed — and which didn't — means either rerunning the entire pipeline or producing incomplete outputs.
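One way to picture the split is as two separate data structures. The sketch below is illustrative only: the class names, the stage identifiers, and the `notes` field are assumptions made for this example, loosely modeled on the stages listed above, and do not describe Abridge's or Temporal's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical stage names, loosely based on the stages named in the article.
STAGES = ["audio_processing", "summarization", "llm_calls", "after_visit_summary"]


@dataclass
class WorkflowState:
    """Execution state: which stages finished, so recovery knows where to resume."""
    completed: List[str] = field(default_factory=list)

    def mark_done(self, stage: str) -> None:
        if stage not in self.completed:
            self.completed.append(stage)

    def next_stage(self) -> Optional[str]:
        # The first stage not yet completed is where a recovery would resume.
        for stage in STAGES:
            if stage not in self.completed:
                return stage
        return None


@dataclass
class AgentMemory:
    """Context the agent carries forward; it says nothing about progress."""
    notes: Dict[str, str] = field(default_factory=dict)


state = WorkflowState()
memory = AgentMemory()

state.mark_done("audio_processing")
memory.notes["transcript"] = "placeholder transcript text"
state.mark_done("summarization")

# After a crash at this point, only the state object can answer "where do I resume?"
resume_at = state.next_stage()
```

Keeping the two objects apart means a recovery routine can consult `WorkflowState` without having to parse whatever the agent happens to remember.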

The deterministic spine

Somal uses the phrase "deterministic spine" to describe how Temporal positions itself relative to the probabilistic outputs of language models.

LLMs are non-deterministic: given the same input, they may produce different outputs. That's a feature for creative tasks and a liability for enterprise processes that require consistency. Model calls can also time out or fail outright, and a procurement workflow, a compliance check, or a patient summary cannot be allowed to fail silently when that happens.

The deterministic spine is the orchestration layer that wraps around model calls and enforces reliability regardless of what the model does. "It is denoting the path you want to take," Somal said. "It is calling the brain, but if the brain doesn't respond, it will call it again. If the brain responds but the next step is going to fail, it will pick up from where that failure happened."

In this architecture, the model handles reasoning; the orchestration layer handles execution guarantees. Neither substitutes for the other.
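The retry-and-resume behavior Somal describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Temporal's API: `run_workflow`, the in-memory `checkpoint` dictionary, and the step functions are all hypothetical, and a production system would persist the checkpoint durably rather than hold it in process memory.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], str]]


def run_workflow(steps: List[Step], checkpoint: Dict[str, str],
                 max_attempts: int = 3) -> Dict[str, str]:
    """Run steps in a fixed order, skipping any step already in the checkpoint."""
    for name, action in steps:
        if name in checkpoint:
            continue  # completed in an earlier run; do not execute it again
        last_error = None
        for _attempt in range(max_attempts):
            try:
                checkpoint[name] = action()  # record the result once it succeeds
                break
            except Exception as exc:
                last_error = exc  # "if the brain doesn't respond, call it again"
        else:
            raise RuntimeError(
                f"step {name!r} failed after {max_attempts} attempts"
            ) from last_error
    return checkpoint


# Hypothetical steps: a lookup that always works and a model call that times out once.
calls = {"fetch": 0, "model": 0}


def fetch() -> str:
    calls["fetch"] += 1
    return "record"


def flaky_model() -> str:
    calls["model"] += 1
    if calls["model"] < 2:
        raise TimeoutError("model did not respond")
    return "summary"


checkpoint: Dict[str, str] = {}
steps = [("fetch", fetch), ("model", flaky_model)]
result = run_workflow(steps, checkpoint)

# Running again with the same checkpoint repeats no completed step.
run_workflow(steps, checkpoint)
```

The order of steps is fixed by the orchestration code, while the content of each result is left to whatever the step returns. That division is the point of the "spine" metaphor.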

The token tax and the economics of failure

Cost visibility has emerged as a distinct concern as enterprises try to calculate ROI on agentic deployments. Long-running agents make multiple model calls across complex workflows, and without observability into each step, spending patterns become opaque.

Somal described one operational advantage of durable orchestration as granular visibility into where tokens are consumed. "You've got visibility into that entire flow in a single pane of glass," she said. "You can now see where you're spending the tokens in an agent that is multiple steps and calling multiple different systems."
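Per-step token visibility can be approximated with a simple ledger. The sketch below is an assumption-laden illustration: the `TokenLedger` class, the step names, and the token counts are invented for this example and are not drawn from any Temporal product or customer data.

```python
from collections import defaultdict
from typing import Dict


class TokenLedger:
    """Accumulates token usage per workflow step (hypothetical helper)."""

    def __init__(self) -> None:
        self._by_step: Dict[str, int] = defaultdict(int)

    def record(self, step: str, prompt_tokens: int, completion_tokens: int) -> None:
        # A step may call a model more than once; usage adds up under one key.
        self._by_step[step] += prompt_tokens + completion_tokens

    def total(self) -> int:
        return sum(self._by_step.values())

    def report(self) -> Dict[str, int]:
        """Steps ordered by tokens consumed, most expensive first."""
        ordered = sorted(self._by_step.items(), key=lambda kv: kv[1], reverse=True)
        return dict(ordered)


ledger = TokenLedger()
ledger.record("summarization", prompt_tokens=1200, completion_tokens=300)
ledger.record("after_visit_summary", prompt_tokens=800, completion_tokens=200)
ledger.record("summarization", prompt_tokens=400, completion_tokens=100)

for step, tokens in ledger.report().items():
    print(f"{step}: {tokens} tokens")
```

Attributing usage to named steps, rather than to the agent as a whole, is what makes it possible to see which stage of a multi-step workflow drives the bill.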

Workflow recovery also shapes cost efficiency directly. Without it, a failure at step seven of a ten-step process means rerunning steps one through six — all billable. "You pick up from where the crash happened," Somal said. "We save you the cost of running the agent from step one again."

The magnitude of that saving depends on workflow complexity and failure frequency, and Somal did not offer specific figures. But the directional logic is straightforward: the longer the workflow and the more expensive the model calls, the higher the cost of unrecovered failures.
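The step-seven example can be worked through with placeholder numbers. The figures below are purely hypothetical (a flat 1,000 tokens per step) and the function name `rerun_cost` is an assumption for this sketch; Somal gave no figures.

```python
from typing import List


def rerun_cost(step_costs: List[int], failed_index: int, resume: bool) -> int:
    """Tokens needed to finish a workflow after a failure at failed_index (0-based).

    resume=True  -> pay only for the failed step and the steps after it.
    resume=False -> restart from the beginning and pay for every step again.
    """
    start = failed_index if resume else 0
    return sum(step_costs[start:])


# Hypothetical ten-step workflow at 1,000 tokens per step, failing at step seven.
step_costs = [1_000] * 10
failed_index = 6  # step seven, counting from zero

restart = rerun_cost(step_costs, failed_index, resume=False)
resumed = rerun_cost(step_costs, failed_index, resume=True)
wasted = restart - resumed  # tokens spent re-running steps one through six

print(f"restart: {restart}, resume: {resumed}, wasted: {wasted}")
```

Under these assumptions a restart costs 10,000 tokens against 4,000 for a resume, so 6,000 tokens go to repeating work that had already succeeded. Uneven step costs or a later failure point would widen the gap.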

The cloud migration parallel — and its limits

Somal compares the moment to early cloud adoption, and the comparison is instructive up to a point. The "lift and shift" pattern she describes, in which workloads move to cloud infrastructure without any redesign of the underlying architecture, did produce real cost overruns for many enterprises. The analogy maps reasonably well: in both cases, organizations adopted new infrastructure faster than they adapted their engineering practices to it.

The difference is that cloud infrastructure failures are generally well understood and recoverable. LLM-based agent failures can be subtler: a model that returns a plausible but incorrect output, for instance, may not trigger any error handling at all. The deterministic spine doesn't fully address that failure mode; Somal's framing focuses on execution reliability rather than output correctness.
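To make the gap concrete, here is a sketch of the kind of separate output check that sits outside execution reliability. Everything in it is hypothetical: the field names, the `validate_summary` function, and the sample data are assumptions for illustration, and a structural check like this catches only missing or empty fields, not content that is plausible but wrong.

```python
from typing import Dict, List

# Hypothetical schema for a generated summary; not taken from any real system.
REQUIRED_FIELDS = ["diagnosis", "medications", "follow_up"]


def validate_summary(summary: Dict[str, str]) -> List[str]:
    """Return a list of structural problems; an empty list means the checks passed."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = summary.get(name, "")
        if not value.strip():
            problems.append(f"missing or empty field: {name}")
    return problems


# The model call "succeeded" from the orchestrator's point of view,
# yet the output is incomplete.
model_output = {"diagnosis": "placeholder text", "medications": "  "}
problems = validate_summary(model_output)

if problems:
    print("output rejected:", problems)
```

A step that returns without raising an exception looks healthy to a retry loop, which is why checks of this kind have to be designed in as their own stage.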

'Paved paths' over off-the-shelf platforms

As enterprises revisit first-generation deployments, Somal said a clear preference is emerging: rather than adopting fully managed agent platforms wholesale, organizations want internal frameworks that embed governance, model selection policies, identity systems, cost management, and observability from the start.

"The enterprises are looking at building these paved paths," she said. "Taking something off the shelf is maybe not going to work because there are all of these other requirements."

That preference reflects a broader maturation in how enterprises think about AI infrastructure — less as a product to be purchased and more as a capability to be engineered. Whether most organizations have the internal talent to build and maintain those frameworks is a separate question, and one the current conversation doesn't fully resolve.

What does seem clear is that the first generation of enterprise AI agents was built for speed. The second generation is being built for survival.

Key takeaways

Enterprise AI agent failures in production are mostly infrastructure problems, not model problems: crashes, lost state, and uncontrolled token spend.
State and memory are distinct concerns: state tracks where a workflow is in execution; memory tracks what context an agent carries forward. Conflating them leads to bad architectural decisions.
Without durable orchestration, a late-stage workflow failure forces a full restart — including every prior model call — multiplying inference costs with no business value delivered.
Enterprises are increasingly rejecting off-the-shelf agent platforms in favor of internal 'paved paths' that embed governance, cost controls, identity, and observability from the start.
The current moment echoes early cloud adoption, when organizations lifted and shifted workloads without redesigning architectures — and ended up spending more for less.

FAQ

What is workflow orchestration, and why does it matter for AI agents?

Workflow orchestration is software that manages the sequencing, execution, and recovery of multi-step processes across systems and services. For AI agents, it provides the reliability layer that language models themselves don't offer — ensuring that if a model call fails or a downstream system crashes, the workflow can resume from the point of failure rather than restarting from scratch.

What's the difference between agent state and agent memory?

State refers to where an agent is in a workflow — which steps have completed and where execution should resume after a failure. Memory or context refers to information the agent carries forward across interactions or tasks. The distinction matters architecturally: state management is an orchestration problem; memory management is a model and retrieval problem. Conflating them leads to systems that handle neither well.

Why are enterprises rebuilding first-generation AI agents rather than patching them?

First-generation deployments were often built quickly without durable execution, state management, or observability. Those aren't features that can be easily bolted on after the fact — they require architectural decisions made at the foundation. Patching a system that wasn't designed for failure recovery typically produces fragile results, which is why many teams are starting over.

What is the 'token tax' in the context of AI agent failures?

LLM providers charge for inference based on tokens, the units of text processed by the model. When a long-running agent workflow fails and must restart from the beginning, every prior model call is re-executed and re-billed. The 'token tax' refers to this cost multiplication from unrecovered failures. Durable orchestration that resumes from the point of failure avoids paying again for steps that already completed.

Does better orchestration fix the problem of AI agents producing incorrect outputs?

Not directly. Orchestration addresses execution reliability — crashes, state loss, failure recovery, and cost visibility. It does not address output correctness: a model that returns a plausible but wrong answer may not trigger any error handling at all. Enterprises need both reliable execution and separate mechanisms for validating model outputs, and these are distinct engineering problems.

Citations

"AI agents are entering their rebuild era as enterprises confront the reliability problem." Enterprises are rebuilding first-generation AI agent deployments due to reliability failures rooted in workflow orchestration, state management, and failure recovery gaps, not model performance.
"AI agents are entering their rebuild era as enterprises confront the reliability problem." Preeti Somal, Senior VP of Engineering at Temporal Technologies, described the 'deterministic spine' concept and the state-vs-memory distinction at the AI Impact Series event in New York.
"AI agents are entering their rebuild era as enterprises confront the reliability problem." Abridge, a healthcare company, uses Temporal to orchestrate multi-stage physician visit workflows involving audio processing, summarization, LLM calls, and after-visit summary generation.