The runtime problem: why your AI agents fail in production, and why it isn’t the model

2026 is the year agentic AI crossed from pilot to production. It is also the year a great many of those agents quietly died on day two.

By Antony Coppellotti, founder and fractional CTO, Gordion Solutions

The pattern is now well documented. Gartner forecasts that more than 40% of agentic AI projects will be scrapped by the end of 2027, citing escalating costs, unclear business value and inadequate risk controls. MIT’s NANDA study found that 95% of enterprise generative AI pilots deliver no measurable return, and was explicit that the cause is flawed integration rather than model quality. Different studies, different scopes, same finding.

The constraint is rarely model capability. It is operational fit: the unglamorous business of getting a non-deterministic system to behave reliably inside fragmented workflows, legacy estates and approval layers that were never designed to accommodate it.

The clever demo survives the boardroom. It does not survive contact with production.

There is a tidy way to state the lesson the industry has spent eighteen months learning the hard way. The failure point is not the reasoning. It is the runtime.

Spine versus brain

The reflex, when an agent disappointed, was to reach for a bigger brain. The output was wrong, so the next model would fix it. More parameters, longer context, better reasoning traces. The model was treated as the system, and everything around it as plumbing.

The data has turned that assumption over. When enterprise teams are asked where their agents actually break, the answers cluster not around hallucination but around the infrastructure’s inability to hold state, survive failure and coordinate execution across steps.

The industry has started calling this the spine versus the brain. The brain is the model’s capacity to reason. The spine is everything that keeps a multi-step process upright while it runs: memory, durability, recovery, orchestration.

What “state” actually means for an agent

It helps to be precise about what a multi-step agent needs to remember, because it is a good deal more than a transcript.

A single-shot model needs the conversation so far. An agent acting over many steps needs something closer to a working model of the world it is operating on: the entities it is reasoning about, the facts it currently believes, what it has already done, what it attempted and failed, and crucially, why it decided each of those things.

That state is not a convenience. It is the substrate the agent reasons against. Corrupt it, lose it, or let it drift out of sync with reality, and the reasoning on top is worthless no matter how capable the model.

Now look at how most agents are actually built: stateless orchestration, scripts wiring one model call to the next, context reconstructed on the fly and held in memory for the duration of a run. It works beautifully in a notebook. Then it meets production, and the failure modes are grimly predictable. A container restarts and the context is gone. A reasoning error in step three compounds silently until it becomes catastrophic by step ten. And the failures are invisible by design: nothing was durably recording what the agent believed at the moment it went wrong, so there is nothing to inspect afterwards.

You cannot patch your way out of this with retries and more elaborate prompting. Those are treatments for the symptom. The underlying condition is that the system has no durable, trustworthy place to keep its mind.
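To make the contrast with in-memory context concrete, here is a minimal sketch of step-level checkpointing, assuming a local SQLite file stands in for the durable store. The table layout and the names open_store, save_step and resume are hypothetical, chosen for illustration rather than taken from any particular framework.

```python
import json
import os
import sqlite3
import tempfile


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the checkpoint store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        " run_id TEXT NOT NULL,"
        " step INTEGER NOT NULL,"
        " beliefs TEXT NOT NULL,"    # JSON snapshot of what the agent believed
        " rationale TEXT NOT NULL,"  # why it took this step
        " PRIMARY KEY (run_id, step))"
    )
    return conn


def save_step(conn, run_id, step, beliefs, rationale):
    """Commit the step before the next one runs, so a crash loses at most the step in flight."""
    with conn:
        conn.execute(
            "INSERT INTO checkpoints (run_id, step, beliefs, rationale) VALUES (?, ?, ?, ?)",
            (run_id, step, json.dumps(beliefs), rationale),
        )


def resume(conn, run_id):
    """Return (next_step, beliefs) from the last committed checkpoint, or a fresh start."""
    row = conn.execute(
        "SELECT step, beliefs FROM checkpoints WHERE run_id = ? ORDER BY step DESC LIMIT 1",
        (run_id,),
    ).fetchone()
    if row is None:
        return 0, {}
    return row[0] + 1, json.loads(row[1])


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "agent_state.db")
        conn = open_store(path)
        save_step(conn, "run-1", 0, {"customer": "C-17", "limit_checked": False}, "start review")
        save_step(conn, "run-1", 1, {"customer": "C-17", "limit_checked": True}, "limit within policy")
        conn.close()  # simulate the container restarting

        conn = open_store(path)
        next_step, beliefs = resume(conn, "run-1")
        conn.close()
        return next_step, beliefs
```

After the simulated restart, resume picks up at step 2 with the beliefs recorded at step 1, and the rationale for each earlier step is still there to inspect. A production store would need far more than this, but the shape of the fix is the same: the agent’s mind lives outside the process that happens to be running it.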

The missing primitive

Software has solved a version of this before. Applications did not always have databases. For a while they had flat files, and they got along, until concurrency, scale and the need to answer questions about the data made an actual database non-negotiable. The database was not a feature bolted onto applications. It was the primitive that made serious applications possible.

Agentic systems are at the equivalent moment, and most of them are still on flat files.

What an agent needs is a first-class state store, and the requirements are not exotic:

Durable, surviving restarts and crashes without losing the thread.
Queryable, so the agent and the humans supervising it can interrogate the current state rather than reconstruct it from logs.
Consistent when several agents act against it at once.
Temporally honest, answering a question most data stores quietly ignore: not only what is true now, but what was true at the moment a given decision was taken.

That last requirement is where this gets genuinely hard, and it is where the interesting engineering lives.
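By contrast, the consistency requirement is comparatively well-trodden ground. One common approach, sketched here under the assumption of a single version counter per fact, is an optimistic compare-and-set: a write succeeds only if the fact has not changed since the agent read it. The table, the key and the function names are hypothetical.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """In-memory store with one seeded fact; the schema is hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE facts (key TEXT PRIMARY KEY, value TEXT NOT NULL, version INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO facts VALUES ('order-42.status', 'pending', 1)")
    conn.commit()
    return conn


def read_fact(conn, key):
    """Return (value, version) as the calling agent saw them."""
    return conn.execute(
        "SELECT value, version FROM facts WHERE key = ?", (key,)
    ).fetchone()


def write_if_unchanged(conn, key, new_value, seen_version):
    """Compare-and-set: apply the write only if nobody has written since the read."""
    with conn:
        cur = conn.execute(
            "UPDATE facts SET value = ?, version = version + 1 WHERE key = ? AND version = ?",
            (new_value, key, seen_version),
        )
    return cur.rowcount == 1


def demo():
    conn = make_store()
    # Two agents read the same fact before either writes.
    _, seen_by_a = read_fact(conn, "order-42.status")
    _, seen_by_b = read_fact(conn, "order-42.status")
    a_ok = write_if_unchanged(conn, "order-42.status", "approved", seen_by_a)
    b_ok = write_if_unchanged(conn, "order-42.status", "rejected", seen_by_b)  # stale read
    final = read_fact(conn, "order-42.status")
    conn.close()
    return a_ok, b_ok, final
```

In the demo, agent A’s write lands and agent B’s is refused because it was based on a stale read. B then has to re-read and reconsider rather than silently overwrite a decision it never saw.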

Why temporality is the hard part

A grounded state store has to track time along more than one axis, and the distinction matters more than it first appears.

There is the time at which something was true in the world, which is not the same as the time the system found out about it, which is not the same as the moment an agent made a decision on the basis of what it then believed. A record that collapses all of these into a single “last updated” timestamp throws away precisely the information you need to understand the system’s behaviour after the fact.

Reconstruct an agent’s decision a week later and you do not want today’s state. You want the exact state it was reasoning against at the instant it acted, including the things it has since learned were wrong.
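A small sketch makes the distinction concrete. It assumes each fact carries two timestamps, when it became true in the world and when the system recorded it, and treats the moment of decision as the instant you query against. The Fact type, the as_of function and the credit-rating example are hypothetical illustrations, not the schema of any particular engine.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    valid_from: datetime   # when it became true in the world
    recorded_at: datetime  # when the system found out


def as_of(facts, key, valid_at, known_at):
    """Value of `key` at `valid_at`, using only what had been recorded by `known_at`."""
    candidates = [
        f for f in facts
        if f.key == key and f.valid_from <= valid_at and f.recorded_at <= known_at
    ]
    if not candidates:
        return None
    # Latest applicable fact wins; a later recording of the same period overrides an earlier one.
    return max(candidates, key=lambda f: (f.valid_from, f.recorded_at)).value


def demo():
    facts = [
        # Rating A from 1 March, known the same day.
        Fact("acme.rating", "A", datetime(2026, 3, 1), datetime(2026, 3, 1)),
        # Downgrade took effect on 3 March, but the system only learned of it on 10 March.
        Fact("acme.rating", "B", datetime(2026, 3, 3), datetime(2026, 3, 10)),
    ]
    decided_at = datetime(2026, 3, 5)  # the moment the agent acted
    today = datetime(2026, 3, 12)

    believed_then = as_of(facts, "acme.rating", valid_at=decided_at, known_at=decided_at)
    true_in_hindsight = as_of(facts, "acme.rating", valid_at=decided_at, known_at=today)
    return believed_then, true_in_hindsight
```

Here the agent acted on 5 March believing the rating was A, while the record as it stands today says the rating on that date was B. A single “last updated” column could only ever return the second answer; explaining the decision requires the first.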

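Tracking the first two of these axes, when something was true and when the system recorded it, is what the database literature calls bitemporality; adding the moment of decision as a third axis is what makes a store tri-temporal. In graph-shaped domains, where the relationships between entities are themselves the thing that changes over time, it gets harder still. It is exacting to model, and exacting to make fast, which is most of the reason general-purpose tooling tends to skip it.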

It is also why we built Aevum, a Postgres-native, tri-temporal graph engine, when the obvious off-the-shelf options would not give us a state substrate with these properties. I mention it not as the hero of this argument but as evidence for it. The primitive is missing often enough, and badly enough, that building it from the foundations up is a defensible decision rather than an indulgent one.

The lesson generalises whether or not you ever touch our engine. If your agents are making consequential decisions, the question of where their state lives, and whether it remembers time properly, deserves to be answered on purpose rather than by default.

What this changes about how you architect

The practical shift is straightforward to state and uncomfortable to act on. Design the process assuming an agent is in the loop and a durable state store sits underneath it, from the first whiteboard sketch rather than as a retrofit once the pilot wobbles. Treat runtime durability as an engineering concern that somebody owns, with the same seriousness you would give to a database choice, because that is effectively what it is.

There is a governance dividend too. A recurring finding in this year’s research is the gap between the AI governance org chart organisations have drawn and the control layer they have actually built underneath it. A queryable, durable state store is part of how you close that gap. You cannot govern what you cannot inspect, and you cannot inspect a system whose memory evaporated when the container recycled.

The payoff in regulated industries

For most of our clients this is where the argument stops being architectural and starts being existential.

In capital markets, banking and insurance, “the agent did something” is not an acceptable answer to a regulator, an auditor or a customer who has been treated unfairly. You need to reconstruct exactly what the system believed and why it acted as it did, at the moment it acted, not an approximation assembled from whatever logs happened to survive.

A properly temporal state store turns that reconstruction from an act of archaeology into a query. That is, quite literally, the difference between an agent you can responsibly place near a regulated workflow and one you cannot. With obligations like the FCA’s Consumer Duty now framing how firms have to evidence good outcomes, reproducibility stops being a nice-to-have.

It is also why the more regulated a sector is, the slower and more careful its agentic adoption has been. That caution is not a lack of sophistication. It is an accurate read of the risk of putting a forgetful system near a process you have to answer for.

The reckoning, and the way through

The shape of the next eighteen months is becoming clear. The teams that get value from agents at scale will be the ones who treated the spine as seriously as the brain: who built durable, queryable, temporally honest state into the foundations, and who resisted the temptation to paper over a structural problem with one more layer of prompting.

The rest will rediscover what the last wave of automation already taught us. A graveyard of clever pilots is what you get when the cleverness was all in the demo and none of it in the runtime.

If you are building toward production agents and you have not yet decided where their state lives, that is the conversation worth having now, before day two arrives and decides it for you.

Where does your agent’s state live today, and would it survive a restart, an audit, or both?

Gordion Solutions works with technology leaders building AI-first systems that have to stand up in production and in front of a regulator. If the runtime problem is one you are wrestling with, I would be glad to talk.

Sources

Gartner, “Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027” (June 2025). gartner.com
MIT Project NANDA, “The GenAI Divide: State of AI in Business 2025” (July 2025), as reported by Fortune. fortune.com