Agentic AI on the Plant Floor: What the Demos Skip
Six claims that ride in on the agentic AI procurement deck, and why each one breaks against a deterministic control room.
The agentic AI deck has reached the plant. It usually arrives between the historian upgrade quote and the next turnaround, and it promises something the older analytics pitches never did: not a dashboard, not an alert, but an actor. An agent that reads your data, decides what to do, and does it. The demo is genuinely impressive. A language model takes a plain-English request, calls a few tools, queries a database, drafts a work order, and reports back in a paragraph that reads like it came from a competent engineer.
Strip out the marketing and an "agent" is a large language model wired to tools and run in a loop: it plans, calls a function, reads the result, plans again, and stops when it thinks it's done. That loop is the whole idea, and it's also where most of the trouble lives. None of this is fake — the technology is real and some of it earns its place. But the claims that ride in on the procurement slides need dismantling one at a time, because the gap between a polished demo and a running unit is exactly the gap that hurts you. Below are the claims we hear most, and why each is wrong in the way that matters on a plant floor.
"Agentic AI can run the plant autonomously"
This is the headline claim, and it collapses two layers that the process industry has spent decades keeping apart. Control is deterministic. A PID loop, an interlock, a sequence in a batch phase: these execute the same way every scan, and that repeatability is the property you are paying for. A language-model agent does not have it. τ-bench, a 2024 benchmark that tested function-calling agents on realistic tool-use tasks against a simulated user and a written policy, found that even a strong model like GPT-4o succeeded on fewer than half the tasks, and that it was "quite inconsistent": in the retail domain it scored below 25% on pass^8, a metric that runs the same task eight times and counts it as passed only if every run succeeds [1]. Sit with that number. A controller that delivered the right action seven times out of eight on an unchanged setpoint would be torn out before the next shift.
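The pass^8 figure is easier to weigh with the arithmetic in front of you. The sketch below is illustrative only: the function name, the per-task counts, and the eight-trial setup are assumptions made for the example, not data from the benchmark. It uses the combinatorial estimate C(c, k) / C(n, k), which is our reading of how the τ-bench paper defines pass^k for a task with c successes in n trials.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the chance that k independent runs of a task ALL succeed.

    successes_per_task holds one success count c per task, each out of n_trials.
    Each task contributes C(c, k) / C(n_trials, k); the result is the task average.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    total = comb(n_trials, k)
    per_task_sum = sum(comb(c, k) for c in successes_per_task)  # comb(c, k) is 0 when c < k
    return per_task_sum / len(successes_per_task) / total


# Hypothetical results: 4 tasks, 8 trials each (made-up numbers, not benchmark data).
counts = [8, 7, 4, 0]
pass_1 = pass_hat_k(counts, n_trials=8, k=1)  # average single-run success rate
pass_8 = pass_hat_k(counts, n_trials=8, k=8)  # share of tasks that never failed

# A component that is right 7 times in 8 per attempt, run 8 times independently.
all_eight = (7 / 8) ** 8

print(f"pass^1 = {pass_1:.3f}, pass^8 = {pass_8:.3f}, (7/8)^8 = {all_eight:.3f}")
```

With these made-up counts the average single-run success rate is about 59%, yet only one task in four passes all eight runs. The last line shows the same effect for a single component: right seven times in eight per attempt, it gets through eight independent attempts clean only about 34% of the time.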
There's a deeper reason this fails, and NIST named it. In its Generative AI Profile, published in July 2024, NIST lists confabulation — confidently stated content that is simply wrong — as one of twelve risk categories specific to or amplified by generative models [2]. Confabulation is not an occasional glitch you can patch out; it's a property of how these models generate text. And a confident, wrong number is the worst possible input to a control action, because nothing downstream flags it as suspect. A bad sensor reading violates a range check. A confabulated recommendation arrives in fluent prose with a plausible justification attached.
The safety layer settles the argument on its own. IEC 61511 requires that a safety instrumented system be a separate, independent combination of sensors, logic solver, and final elements, designed and managed to achieve a specified safety integrity level [3]. Independence is the entire point of a protection layer. You don't get to put a non-deterministic component in the path of a protective function, and an agent that can be argued out of its own instructions by the contents of its context window is not independent of anything. Regulators have drawn the same line from the other direction. The EU AI Act, in force since August 2024, treats AI used as a safety component in the operation of critical infrastructure as high-risk, and requires effective human oversight, explicitly including the ability to decide not to use the system, or to override and reverse its output [4]. Its high-risk obligations are still phasing in, with the first applying from August 2026 under the regulation as published, but the design intent is already clear: "autonomous" and "overridable at all times" describe two different plants.
There's a workable pattern hiding inside the failed claim, and it's the one mature operations already use for any new automation: agent proposes, human disposes. Let the agent assemble the case — the trend, the candidate cause, the suggested move — and let a person, or a qualified deterministic controller, commit it. That keeps the auditable decision with something accountable, and keeps the execution layer, the world of IEC 61131 logic and validated sequences, free of a component that behaves differently depending on what it just read. The agent adds reach and speed at the analysis stage. It does not get the keys to the final element.
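The propose-and-dispose split can be sketched in a few lines. Everything named here is hypothetical: the Proposal and ApprovalGate classes, the tag TIC_1001.SP, and the commit callback that stands in for whatever validated write path a site already has. The point is only the shape: the agent can build a proposal, but nothing reaches the commit path without a recorded decision by a named reviewer.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Proposal:
    """What the agent is allowed to produce: a case, not an action."""
    tag: str          # hypothetical tag name
    current: float
    suggested: float
    rationale: str


@dataclass
class ApprovalGate:
    """The only path from a proposal to the control layer."""
    commit: Callable[[Proposal], None]  # stand-in for a validated write path
    audit_log: List[str] = field(default_factory=list)

    def review(self, proposal: Proposal, reviewer: str, approved: bool) -> bool:
        # Record the decision and who made it before anything is committed.
        decision = "APPROVED" if approved else "REJECTED"
        self.audit_log.append(
            f"{decision} by {reviewer}: {proposal.tag} "
            f"{proposal.current} -> {proposal.suggested}"
        )
        if approved:
            self.commit(proposal)
        return approved


committed: List[Proposal] = []
gate = ApprovalGate(commit=committed.append)

move = Proposal("TIC_1001.SP", current=182.0, suggested=185.0,
                rationale="Tray temperature trending low against spec.")
gate.review(move, reviewer="shift engineer", approved=False)  # nothing is written
gate.review(move, reviewer="shift engineer", approved=True)   # one move is written
```

The agent never holds a reference to the commit function, so a rejected or unreviewed proposal has no route to the plant, and the log shows who decided what.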
So where can autonomy honestly live? In the monitoring-and-optimization layer, not the control layer. The NAMUR Open Architecture makes the split concrete: it adds a second data channel out of the field, dedicated to monitoring, optimization, and predictive maintenance, deliberately separated from the core process control system so that Industry 4.0 workloads cannot disrupt control [5]. That is the right home for an agent. It can watch, correlate, advise, and even prepare an action there all day. The closed loop stays deterministic, and a human or a conventional controller remains the thing that actually moves the plant.
"Just point it at the historian and it'll optimize the process"
This claim survives because the demo looks like it's doing exactly that. Ask the agent why throughput dropped, and it pulls a few tags, notices a correlation, and writes a tidy paragraph. The trouble starts with what "optimize" actually means. Process optimization is a numerical problem: constraints, objectives, gradients, and a model of how the unit responds to a move. Language models don't do that arithmetic reliably; they generate plausible text, and when the text contains a number, nothing in the output tells you whether that number was computed or confabulated [2]. The work that genuinely moves a process (advanced process control, model-predictive setpoint computation, the soft sensors that infer a quality you can't measure online) runs on math the agent can describe in English but cannot replace.
The "just point it at the historian" half is the bigger fiction. Raw historian tags are not a usable picture of a plant. A tag named FIC_2207.PV means nothing on its own without the asset hierarchy it belongs to, the units, the engineering range, the operating mode it was recorded in, and the relationships that ISA-95 exists to formalize between the control layer and the systems above it [6]. Hand an agent uncontextualized tags and it will confidently misread them — pairing a flow in one unit with a temperature in another, then narrating a story that is internally coherent and physically false. There's nothing in the model that knows those two tags belong to different equipment unless the data model told it so.
What makes an agent useful here is the work nobody puts in a demo: contextualizing the data, defining what each tag means, encoding the relationships between assets, and constraining what the agent is allowed to conclude. That contextualization is the foundation of any serious deployment, and it's the layer our own edge telemetry and analytics platform is built to handle before a model ever sees a value. Get it wrong and the agent's fluency works against you, because a wrong answer in good prose is harder to catch than a wrong answer in a broken chart.
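A minimal sketch of what that context layer might look like, under loud assumptions: the tag FIC_2207.PV is the example used above, but the other tag names, the asset paths, the units, and the ranges are invented for illustration, and a real deployment would draw them from the site's asset model rather than a hard-coded dictionary. The idea is that the check for "do these two tags belong to the same equipment" lives in data, not in the model's judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagContext:
    name: str
    asset: str   # equipment the tag belongs to in the asset hierarchy
    units: str
    lo: float    # engineering range, low end
    hi: float    # engineering range, high end


# Hypothetical catalog entries; every value here is made up for the example.
CATALOG = {
    "FIC_2207.PV": TagContext("FIC_2207.PV", "Unit2/Column-C201", "m3/h", 0.0, 120.0),
    "TI_2211.PV": TagContext("TI_2211.PV", "Unit2/Column-C201", "degC", 0.0, 250.0),
    "TI_3104.PV": TagContext("TI_3104.PV", "Unit3/Reactor-R301", "degC", 0.0, 400.0),
}


def can_correlate(tag_a, tag_b, catalog=CATALOG):
    """Allow a pairing only when both tags are known and sit on the same asset."""
    a, b = catalog.get(tag_a), catalog.get(tag_b)
    if a is None or b is None:
        return False, "missing context"
    if a.asset != b.asset:
        return False, f"different assets: {a.asset} vs {b.asset}"
    return True, "same asset"


def in_range(tag, value, catalog=CATALOG):
    """Reject values outside the tag's engineering range before the agent sees them."""
    ctx = catalog[tag]
    return ctx.lo <= value <= ctx.hi


print(can_correlate("FIC_2207.PV", "TI_3104.PV"))
print(can_correlate("FIC_2207.PV", "TI_2211.PV"))
```

A same-asset rule is deliberately crude; real units have upstream and downstream relationships worth encoding too. The design choice that matters is that the agent's tools refuse a pairing the data model cannot justify.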
It's worth being concrete about where the real optimization lives, because the agent often gets credit for it. On a distillation column or a kiln, the thing computing the move is a model-predictive controller solving an objective against constraints every few seconds, or a soft sensor inferring a lab value between samples. Those are deterministic, validated, and already in service on many units. An agent can sit beside them usefully — translating an engineer's question into the right query, explaining why the controller backed off a constraint, drafting the shift report — but that's orchestration and interpretation, not optimization. Confusing the two leads plants to expect a language model to deliver gains that only a control project can produce, then to blame the wrong tool when the gains don't appear.
The honest version of this claim is narrower and still worth having. An agent pointed at well-modeled, contextualized data can surface a candidate explanation faster than an engineer paging through trends, draft the first version of a report, or translate a plain-English question into the right query against the right tool. It accelerates the search and removes drudgery. It does not close the loop, and it does not optimize anything by itself — it hands a hypothesis to someone who can.
"It'll replace your process engineers"
The replacement story assumes the hard part of the job is producing an answer. On a plant, the hard part is being accountable for one. When a setpoint move costs a batch or trips a unit, a named person signs for that decision, and an agent cannot hold that signature. The EU AI Act codifies the principle directly: human oversight must be assigned to people with the competence, training, and authority to understand the system's limits, interpret its output, and override it — and, tellingly, it requires those people to stay aware of the tendency to over-rely on a confident automated answer [7]. Automation bias is a documented failure mode, not a personality flaw, and it gets worse, not better, as the output gets more fluent.
Governance pushes the same direction. ISO/IEC 42001, published in December 2023 as the first management-system standard for AI, frames responsible AI use as an organizational obligation: impact assessment, lifecycle management, oversight of third-party suppliers, and clearly assigned accountability [8]. None of that workload disappears when the model gets smarter; it grows. Someone has to own the agent the way someone owns a control narrative or a SIL verification — its scope, its limits, the data it's allowed to touch, and the review cadence that keeps it honest.
What actually changes is where the engineer spends the day. Less time goes to pulling trends and writing the first draft of an analysis; more goes to verifying machine-generated work, because a fluent wrong answer is more expensive to catch than an obviously broken one. That's a real shift and it can be a productive one — fewer hours on the boring middle of a task, more on judgment at the ends. But it raises the bar on the people in the room rather than removing them. The plants that get hurt are the ones that cut the expertise on the theory that the agent now supplies it, then discover, a year in, that nobody left on the team can tell when the agent is confidently wrong. Deskilling is the quiet cost of the replacement pitch, and it shows up exactly when you can least afford it.
"It learns your plant over time"
This one sounds like the natural endpoint of the others, and it's mostly a misunderstanding of how these systems work. A deployed base model is static. Its weights are frozen at training time; it does not absorb your last six months of operation just by running against your data. When a vendor says the agent "learns your plant," they almost always mean one of two things, and neither is learning in the sense an operator would assume.
The first is retrieval: the agent is given access to your documents and recent data at query time and pulls relevant pieces into its context. That's useful, but it's lookup, not learning — and it inherits every weakness of the underlying model, including confabulation when the retrieved context is thin or contradictory [2]. The second is fine-tuning or building a custom model on plant data, which is a real engineering project: curated datasets, held-out evaluation, validation against ground truth, and a release process. ISO/IEC 42001 treats exactly this as lifecycle management that has to be planned and governed, not a side effect of normal use [8]. Done without that discipline, "continuous learning" is an unmonitored model drifting on uncurated inputs — which is a liability, not a feature. If a system genuinely improves with your data, ask to see the evaluation set and the validation results. If those don't exist, what you have is a static model with a good memory of recent documents.
"It plugs into your plant out of the box"
The model is the easy ten percent. Everything around it is the project. A plant agent has to reach data that lives behind the OT/IT boundary, and that boundary exists for reasons a SaaS connector doesn't get to wish away. NIST's Guide to Operational Technology Security, SP 800-82 Revision 3, published in September 2023, is built around this very point: OT carries performance, reliability, and safety requirements that general IT security guidance doesn't address, and it expects network segmentation with controlled conduits between zones rather than flat, open access [9]. An agent that wants live process data has to cross that line through a defined, controlled path, and crossing it casually is precisely the mistake the guide is written to prevent.
NAMUR's architecture is, once again, the clean pattern. Pull the data the agent needs through the second monitoring-and-optimization channel, so the analytics workload never touches the core control system [5]. On legacy instruments that can mean tapping the existing 4–20 mA and HART signals through isolators without interrupting the live connection to the control system. That's an architecture decision, made per site, with hardware and a documented data path — not a checkbox in an onboarding wizard.
Layer the contextualization problem on top of all that. The historian's tag conventions, the units, the asset model, the access controls, the question of which data even leaves the site and which has to stay on the edge for latency or confidentiality — each of these is real work before the agent does anything useful. "Plugs in out of the box" describes the part you can finish before lunch, not the part that decides whether the result is trustworthy. The realistic budget puts most of the cost in integration, data engineering, and a segmentation review, with the model license as one of the cheaper lines. Treat it the other way around and the project stalls in the place every plant IT/OT project stalls: at the boundary.
There's also a maintenance question the demo never raises. An agent's behavior depends on a model, a set of tool definitions, a system prompt, and the data it's allowed to read — change any of those and the behavior can shift in ways that aren't obvious until something acts oddly on shift. That's a configuration-management problem, and a plant already knows how to handle those: version what the agent can do, review changes, and keep a record of what it was permitted to touch and when. ISO/IEC 42001's lifecycle expectations point in the same direction [8]. "Out of the box" implies it stays the box you bought; in practice it's a system you now own and have to keep current.
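One way to make that configuration visible, sketched under assumptions: the field names, the model identifier, and the tool names below are hypothetical, and hashing a config dictionary is just one simple technique, not a substitute for a site's change-management process. The sketch fingerprints the four things that shape the agent's behavior so that any change shows up as a different hash to review.

```python
import hashlib
import json


def fingerprint(agent_config: dict) -> str:
    """Return a stable SHA-256 hash of the config, independent of key order."""
    canonical = json.dumps(agent_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical approved configuration (all names are illustrative).
baseline = {
    "model": "example-model-v1",
    "system_prompt": "You advise operators. You never write setpoints.",
    "tools": ["query_historian", "draft_work_order"],
    "readable_sources": ["historian", "cmms"],
}
approved = fingerprint(baseline)

# Someone adds a tool: the fingerprint no longer matches the approved one.
changed = dict(baseline, tools=baseline["tools"] + ["write_setpoint"])

# The same settings written in a different key order hash identically.
reordered = dict(reversed(list(baseline.items())))

print(fingerprint(changed) == approved)    # False
print(fingerprint(reordered) == approved)  # True
```

Storing the approved fingerprint alongside the change record gives a cheap startup check: if the running configuration hashes differently, something changed outside review.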
"Adding an agent doesn't change your security posture"
It changes your posture more than almost anything else you could install, and this is the myth to end on because it's the one that gets waved through. An agent is, by construction, a component that takes instructions in natural language and acts on them with tools. That is a new and awkward attack surface, and it doesn't behave like the ones your OT security program already covers. The OWASP Top 10 for LLM Applications, in its 2025 edition, ranks prompt injection as the number-one risk, and the reason is structural rather than incidental: a language model processes instructions and data in the same channel, with no clean separation, so content it merely reads can be interpreted as a command it must obey [10].
On a plant, the data an agent reads is not all trusted, and that's the part operators underestimate. A maintenance PDF, a vendor work order, an operator's free-text note, an email the agent is asked to triage — any of these can carry an indirect injection: instructions hidden inside content that seize the session when the model parses them, with no human in the loop having typed anything malicious [10]. So ask the obvious question: what happens when the document your "autonomous" agent reads tells it to do something its operator never would? The same OWASP list names the agentic failure directly — excessive agency — where an LLM holding too much functionality, too many permissions, or too much autonomy is driven into damaging actions by hallucination or injection [11].
The recommended mitigations are ones a control engineer will recognize on sight, because they're least-privilege thinking applied to a new component: minimize the tools and permissions to only what the task needs, prefer granular, specific functions over open-ended shell access, and require explicit human approval before any high-impact action [11]. That sits on top of the OT obligations you already carry, not beside them. IEC 62443 and NIST SP 800-82 expect zoning, conduits, and least-privilege access for anything that can reach control systems [9], and an agent with credentials, tool access, and a known tendency to be talked into things is exactly the kind of component those controls were written for. NIST's Generative AI Profile lists information security among its twelve risk categories for the same reason [2].
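Those three mitigations translate almost directly into code. The sketch below is an assumption-laden illustration, not any agent framework's API: ToolRegistry, Tool, ToolDenied, and the two example tools are all invented names. It shows an allowlist of narrow functions, a flag for high-impact tools, and a refusal to run a flagged tool without a named approver.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Tool:
    func: Callable[..., str]
    high_impact: bool = False  # True means a human must approve each call


class ToolDenied(Exception):
    """Raised when the agent asks for something it is not permitted to do."""


class ToolRegistry:
    def __init__(self, tools: Dict[str, Tool]):
        self._tools = dict(tools)  # the allowlist; nothing else is callable

    def call(self, name: str, *, approved_by: Optional[str] = None, **kwargs) -> str:
        tool = self._tools.get(name)
        if tool is None:
            raise ToolDenied(f"{name} is not on the allowlist")
        if tool.high_impact and not approved_by:
            raise ToolDenied(f"{name} needs explicit human approval")
        return tool.func(**kwargs)


# Hypothetical narrow tools; there is deliberately no shell or generic "run" tool.
def read_trend(tag: str) -> str:
    return f"trend for {tag}"


def create_work_order(asset: str) -> str:
    return f"work order drafted for {asset}"


registry = ToolRegistry({
    "read_trend": Tool(read_trend),
    "create_work_order": Tool(create_work_order, high_impact=True),
})

print(registry.call("read_trend", tag="FIC_2207.PV"))
print(registry.call("create_work_order", approved_by="maintenance lead", asset="P-101"))
```

An injected instruction can still ask for anything, but the request fails at the registry rather than at the plant: an unlisted tool raises ToolDenied, and so does a high-impact tool called without an approver.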
Two more items on the OWASP list deserve a plant engineer's attention because they don't show up until you're in production. Supply-chain risk covers the model itself, its training data, and the third-party components an agent depends on — provenance you rarely control and often can't inspect [10]. Unbounded consumption covers an agent that, prompted into a loop, runs up tool calls, queries, and cost with no ceiling — a denial-of-wallet that becomes a denial-of-service if those calls touch shared plant systems. Neither is exotic. Both are the predictable consequence of putting a non-deterministic planner in front of real tools, and both are why "monitor activity, log everything, rate-limit sensitive operations" reads less like AI advice and more like the OT hygiene you already practice.
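A ceiling on tool calls is the simplest defense against the runaway loop described above. The sketch is a plain sliding-window counter; the class name CallBudget, the limit of three calls per sixty seconds, and the injectable clock used to make the example repeatable are all assumptions for illustration. A production limiter would also need to be shared across agent instances and logged.

```python
import time


class CallBudget:
    """Allow at most max_calls within any window of window_s seconds."""

    def __init__(self, max_calls: int, window_s: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self._clock = clock
        self._stamps = []  # times of the calls that were allowed

    def allow(self) -> bool:
        now = self._clock()
        # Forget calls that have aged out of the window.
        self._stamps = [t for t in self._stamps if now - t < self.window_s]
        if len(self._stamps) >= self.max_calls:
            return False  # over budget: the caller should stop and escalate
        self._stamps.append(now)
        return True


# A fake clock keeps the example deterministic; real use would take the default.
fake_now = [0.0]
budget = CallBudget(max_calls=3, window_s=60.0, clock=lambda: fake_now[0])

results = [budget.allow() for _ in range(5)]  # an agent stuck in a loop
fake_now[0] = 61.0                            # a minute later the window has cleared
after_window = budget.allow()

print(results, after_window)
```

The first three calls pass and the next two are refused; once the window clears, calls are allowed again. What the agent does on refusal is the real design decision, and stopping to hand the task to a person is usually the safer default than retrying.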
So the question to put to the vendor is not whether the agent is secure. It's narrower and harder: which tools can it call, with what permissions, on which network segment, reading whose data, and who signs off before it acts? If those answers aren't already on the slide, the agent isn't ready for your plant. That — not the demo — is the measure that decides whether agentic AI belongs on your floor or in the conference room.
References
- [1] τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (arXiv:2406.12045, 2024)
- [2] NIST AI 600-1: Generative Artificial Intelligence Profile (July 2024)
- [3] ANSI/ISA-61511-1-2018 / IEC 61511-1: Functional Safety, Safety Instrumented Systems for the Process Industry
- [4] Regulation (EU) 2024/1689: Artificial Intelligence Act
- [5] NAMUR Open Architecture (NOA)
- [6] ISA-95: Enterprise-Control System Integration
- [7] EU AI Act: Article 14, Human Oversight
- [8] ISO/IEC 42001:2023: AI Management System
- [9] NIST SP 800-82 Rev. 3: Guide to Operational Technology (OT) Security (Sept 2023)
- [10] OWASP LLM01:2025 Prompt Injection, Top 10 for LLM Applications
- [11] OWASP LLM06:2025 Excessive Agency, Top 10 for LLM Applications
Reuse & license
This article is published by Zoniax Innovations LLC under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt it for any purpose, including commercially, as long as you give appropriate credit to Zoniax and link back to the original article.
Disclaimer
These Field Notes are general technical information, published as-is for industry peers. They are not professional, engineering, safety, legal, or financial advice, and nothing here is a recommendation to buy, sell, or act. Figures are cited from public sources believed reliable but are not independently guaranteed — verify them against the primary sources and your own plant conditions before acting. Zoniax Innovations LLC and the author accept no liability for decisions made from this content. Naming a standard, product, or vendor is not an endorsement.
Cite this article
Nõmm, A. (2025). Agentic AI on the Plant Floor: What the Demos Skip. Zoniax. https://zoniax.com/blog/posts/agentic-ai-plant-operations
Permalink: https://zoniax.com/blog/posts/agentic-ai-plant-operations