The agents are failing. The models are fine.

Here is the uncomfortable finding from the first wave of enterprise AI agent deployments: the language models mostly worked. The infrastructure around them mostly didn't.

That's the pattern Preeti Somal, Senior VP of Engineering at Temporal Technologies, described at a recent AI Impact Series event in New York. "We do have a lot of customers that come to us where they're building version 2.0 of the same agent," she said. "They had to move really fast, but they didn't take care of the plumbing. Things crash and burn, and then they're back to rebuilding with the reliable foundation."

Temporal, whose workflow orchestration infrastructure predates the current agentic AI wave, has become a reference point for enterprises trying to understand why their agents keep breaking. The answer, Somal argues, is that production AI systems require capabilities that model benchmarks don't measure: durable execution, state management, failure recovery, and cost visibility across multi-step workflows.

What 'long-running' actually means — and why it changes everything

Most early AI deployments were stateless: a user sends a prompt, a model returns a response, the interaction ends. Agentic systems are different. A single enterprise workflow might call several large language models (LLMs — the large neural networks that power tools like ChatGPT and Claude), access retrieval databases, trigger external APIs, and persist over hours or days.

That duration creates failure surfaces that don't exist in simple chatbot interactions. "People will write agents but haven't thought about what happens if the agent crashes," Somal said. "Am I going to need to run the entire agent flow again?"

For enterprises operating under cost constraints, that question has a dollar sign attached. Every model call consumes tokens — the units of text that LLM providers charge for. Restarting a failed workflow from the beginning means paying for every prior step again, with nothing to show for it.

State vs. memory: a distinction that matters more than it sounds

One source of architectural confusion in agentic AI is the conflation of two related but distinct concepts: state and memory.

Somal draws a clear line between them. State concerns workflow execution — where an agent is in a process, which actions have already completed, and where recovery should resume after a failure. Memory, or context, captures information an agent carries forward across interactions or tasks.

"The state of the agent is around what step and what actions have been performed, and if something crashes, where do you want to recover from, versus the context and memory piece," she explained.
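One way to picture the split Somal describes is as two separate data structures. The sketch below is illustrative only: `WorkflowState`, `AgentMemory`, and the step names are hypothetical and do not come from Temporal or any other product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WorkflowState:
    """Execution state: which steps finished and where to resume."""
    steps: List[str]
    completed: List[str] = field(default_factory=list)

    def mark_done(self, step: str) -> None:
        self.completed.append(step)

    def resume_point(self) -> Optional[str]:
        # First step that has not completed, or None if all are done.
        for step in self.steps:
            if step not in self.completed:
                return step
        return None


@dataclass
class AgentMemory:
    """Context the agent carries forward; says nothing about progress."""
    facts: Dict[str, str] = field(default_factory=dict)

    def remember(self, key: str, value: str) -> None:
        self.facts[key] = value


# Hypothetical three-step workflow.
state = WorkflowState(steps=["transcribe", "summarize", "draft_summary"])
memory = AgentMemory()

state.mark_done("transcribe")            # progress belongs to state
memory.remember("visit_language", "en")  # context belongs to memory
```

After a crash, `state.resume_point()` answers "where do I restart?", while `memory.facts` answers "what did the agent already know?". Losing one does not imply losing the other, which is why the two are worth storing separately.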

The distinction becomes load-bearing when workflows grow complex. Somal pointed to Abridge, a healthcare company and Temporal customer, whose workflows process physician visits through multiple sequential stages: audio processing, summarization, multiple LLM calls, and after-visit summary generation. Each stage has its own state. Losing track of which stages completed — and which didn't — means either rerunning the entire pipeline or producing incomplete outputs.

The deterministic spine

Somal uses the phrase "deterministic spine" to describe how Temporal positions itself relative to the probabilistic outputs of language models.

LLMs are non-deterministic: given the same input, they may produce different outputs. That's a feature for creative tasks and a liability for enterprise processes that require consistency. A procurement workflow, a compliance check, or a patient summary cannot simply fail silently because a model call timed out.

The deterministic spine is the orchestration layer that wraps around model calls and enforces reliability regardless of what the model does. "It is denoting the path you want to take," Somal said. "It is calling the brain, but if the brain doesn't respond, it will call it again. If the brain responds but the next step is going to fail, it will pick up from where that failure happened."
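The behavior Somal describes (call the model again if it does not respond, and pick up from the failed step rather than the first one) can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not Temporal's API: `run_workflow` is a hypothetical helper, and the in-memory `checkpoint` dictionary stands in for the durable store a real orchestration layer would provide.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], str]]


def run_workflow(
    steps: List[Step],
    checkpoint: Dict[str, str],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Run steps in a fixed order, skipping any already in the checkpoint."""
    for name, action in steps:
        if name in checkpoint:
            continue  # completed in an earlier run; do not pay for it again
        last_error = None
        for _attempt in range(max_attempts):
            try:
                checkpoint[name] = action()
                break  # step succeeded; move to the next one
            except Exception as exc:  # broad on purpose for the sketch
                last_error = exc
        else:
            # Every attempt failed; earlier results stay in the checkpoint.
            raise RuntimeError(
                f"step {name!r} failed after {max_attempts} attempts"
            ) from last_error
    return checkpoint


# Hypothetical steps: a cheap planner and a model call that times out twice.
calls = {"plan": 0, "model": 0}


def plan() -> str:
    calls["plan"] += 1
    return "plan-ready"


def flaky_model() -> str:
    calls["model"] += 1
    if calls["model"] < 3:
        raise TimeoutError("model did not respond")
    return "model-answer"


checkpoint: Dict[str, str] = {}
steps = [("plan", plan), ("model", flaky_model)]
run_workflow(steps, checkpoint)  # model call succeeds on its third attempt
run_workflow(steps, checkpoint)  # second run: every step is skipped
```

The fixed step order is the deterministic part; the model call inside a step can still return anything it likes. A production system would persist the checkpoint outside the process so that it survives the crash itself.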

In this architecture, the model handles reasoning; the orchestration layer handles execution guarantees. Neither substitutes for the other.

The token tax and the economics of failure

Cost visibility has emerged as a distinct concern as enterprises try to calculate ROI on agentic deployments. Long-running agents make multiple model calls across complex workflows, and without observability into each step, spending patterns become opaque.

Somal described granular visibility into where tokens are consumed as one operational advantage of durable orchestration. "You've got visibility into that entire flow in a single pane of glass," she said. "You can now see where you're spending the tokens in an agent that is multiple steps and calling multiple different systems."
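At its simplest, per-step visibility means attributing every model call's token usage to the workflow step that made it. The sketch below assumes a hypothetical `TokenLedger` class and invented token counts; it illustrates the bookkeeping and does not represent any vendor's dashboard.

```python
from collections import defaultdict
from typing import Dict


class TokenLedger:
    """Accumulates token usage per workflow step (hypothetical helper)."""

    def __init__(self) -> None:
        self._by_step: Dict[str, int] = defaultdict(int)

    def record(self, step: str, prompt_tokens: int, completion_tokens: int) -> None:
        self._by_step[step] += prompt_tokens + completion_tokens

    def total(self) -> int:
        return sum(self._by_step.values())

    def report(self) -> Dict[str, int]:
        # Largest consumers first.
        ranked = sorted(self._by_step.items(), key=lambda kv: kv[1], reverse=True)
        return dict(ranked)


ledger = TokenLedger()
ledger.record("retrieve", prompt_tokens=200, completion_tokens=0)
ledger.record("summarize", prompt_tokens=1200, completion_tokens=300)
ledger.record("summarize", prompt_tokens=800, completion_tokens=200)

for step, tokens in ledger.report().items():
    print(f"{step}: {tokens} tokens")
```

With the invented numbers above, the report would show the `summarize` step dominating spend, which is the kind of signal that is invisible when only a single aggregate bill is available.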

Workflow recovery also shapes cost efficiency directly. Without it, a failure at step seven of a ten-step process means rerunning steps one through six — all billable. "You pick up from where the crash happened," Somal said. "We save you the cost of running the agent from step one again."

The magnitude of that saving depends on workflow complexity and failure frequency, and Somal did not offer specific figures. But the directional logic is straightforward: the longer the workflow and the more expensive the model calls, the higher the cost of unrecovered failures.
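The directional logic can be made concrete with invented numbers. The sketch below assumes a ten-step workflow in which every step costs 10 cents; those figures are purely hypothetical and were not provided by Somal or Temporal.

```python
from typing import List


def rerun_waste_cents(step_costs_cents: List[int], failed_step: int) -> int:
    """Cost of repeating already-completed steps after a restart from step one.

    failed_step is 1-indexed: a failure at step 7 means steps 1-6 had finished.
    """
    return sum(step_costs_cents[: failed_step - 1])


# Hypothetical workflow: ten steps at 10 cents each (integers avoid float drift).
step_costs_cents = [10] * 10

waste_without_recovery = rerun_waste_cents(step_costs_cents, failed_step=7)
waste_with_recovery = 0  # resuming at step 7 repeats none of steps 1-6

print(waste_without_recovery, waste_with_recovery)
```

Under these assumptions a single failure at step seven wastes 60 cents when the workflow restarts from scratch and nothing when it resumes. The gap grows with the number of completed steps, the price of each step, and how often failures occur.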

The cloud migration parallel — and its limits

Somal compares the pattern to early cloud adoption, and the comparison is instructive within limits. The "lift and shift" pattern she describes, in which workloads were migrated to cloud infrastructure without redesigning the underlying architectures, did produce real cost overruns for many enterprises. The analogy maps reasonably well: in both cases, organizations adopted new infrastructure faster than they adapted their engineering practices to it.

The difference is that cloud infrastructure failures are generally well understood and recoverable. LLM-based agent failures can be subtler: a model that returns a plausible but incorrect output, for instance, may not trigger any error handling at all. The deterministic spine does not fully address that failure mode, and Somal's framing focuses on execution reliability rather than output correctness.

'Paved paths' over off-the-shelf platforms

As enterprises revisit first-generation deployments, Somal said a clear preference is emerging: rather than adopting fully managed agent platforms wholesale, organizations want internal frameworks that embed governance, model selection policies, identity systems, cost management, and observability from the start.

"The enterprises are looking at building these paved paths," she said. "Taking something off the shelf is maybe not going to work because there are all of these other requirements."

That preference reflects a broader maturation in how enterprises think about AI infrastructure — less as a product to be purchased and more as a capability to be engineered. Whether most organizations have the internal talent to build and maintain those frameworks is a separate question, and one the current conversation doesn't fully resolve.

What does seem clear is that the first generation of enterprise AI agents was built for speed. The second generation is being built for survival.