Running LLMs at the Edge Inside the Plant

The box sits in a 19-inch rack in the MCC room, two slots up from a managed switch. Fanless. About the size of a thick hardback. It draws less than a desk lamp. Inside it is a language model that never talks to the internet. The operators call it "the assistant." It reads batch logs, drafts shift handover notes, and answers questions about the asset history in plain language. It does all of that without a single packet leaving the plant.

Two years ago that sentence would have been marketing. It isn't anymore. The reason is boring and physical: the silicon got small enough, the models got light enough, and the people who run plants got tired of sending their process data to someone else's data centre. So this is a field note about what running a large language model inside the fence actually looks like — the enclosure, the bottleneck, the heat, and the thing that surprised us most.

Why the model moved into the plant

Start with the pull factor. Centralised AI is getting expensive to power. The International Energy Agency puts global data-centre electricity use at around 415 terawatt-hours in 2024 — roughly 1.5% of the world's consumption — and projects it to more than double to about 945 TWh by 2030, with AI named as the main driver. The agency notes that a single AI-focused data centre can draw as much electricity as 100,000 households, and the largest ones now under construction will draw twenty times that. [IEA, 2025] When the cloud bill tracks that curve, doing the inference yourself stops looking exotic.

Then the push factors. Process data is operational property. Recipes, yields, downtime causes, and quality deviations are the things a plant least wants on a third-party server. Latency matters too: a model that answers in 300 ms from the rack beats one that answers in two seconds over a congested WAN, and it keeps answering when the link drops. And the regulatory ground has shifted under everyone's feet. The EU AI Act entered into force on 1 August 2024; obligations for general-purpose AI models began applying on 2 August 2025, and the bulk of the high-risk rules and transparency duties apply from 2 August 2026. [European Commission] Alongside it, ISO/IEC 42001:2023, published in December 2023, gives the first auditable management-system standard for AI — the kind of framework an internal AI deployment now has to answer to. So keeping the model on-prem isn't just a performance choice. It makes governing the model, and proving you govern it, a great deal simpler.

What does the plant actually get out of it? Not a chatbot for its own sake. The honest use cases are narrow and dull and valuable: turning a wall of alarms into a ranked summary, drafting the handover so the night shift inherits context instead of a log dump, and letting a maintenance tech ask "what changed on line 3 last shift" and get an answer drawn from records that already exist. None of that needs a frontier model. All of it needs the records to be local, clean, and readable.

What the box actually is

Forget the rack of accelerators you picture when someone says "AI." Edge inference hardware for a plant is a small, sealed compute module. The common reference point is NVIDIA's Jetson Orin family. The AGX Orin module delivers up to 275 TOPS, the Orin NX up to 157 TOPS, and the Orin Nano up to 67 TOPS, with the Nano configurable between roughly 7 W and 25 W. Industrial integrators wrap these in fanless, wide-temperature enclosures and feed them 24 VDC off the panel. The whole thing looks like another I/O module, not a server.

The number that matters for language models is not TOPS, though. It's memory — both how much and how fast. A model has to fit in the module's RAM, and during generation it has to be read out of that RAM token by token. That second part is the whole game, and most people get it wrong on the first specification pass.

The decode wall

Generating text from an LLM has two phases. The prefill phase reads your prompt and is compute-bound — it likes TOPS. The decode phase produces the answer one token at a time, and each token requires sweeping the model's weights and the growing key-value cache out of memory. Decode is memory-bandwidth-bound, not compute-bound. A recent edge-inference study makes the point directly, proposing a "memory bandwidth utilisation" metric precisely because raw FLOPS mispredicts how these models behave on small hardware. [arXiv 2508.11269, 2025] Buy a module for its TOPS rating and you can still end up bandwidth-starved on the only workload you care about.
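A rough way to see the decode wall on paper: the ceiling on decode speed is memory bandwidth divided by the bytes read per token. The sketch below is a back-of-envelope bound, not a model of any real runtime. It assumes every weight and the whole key-value cache are read exactly once per token, and the function name and example figures (1 GB of weights, 68 GB/s of bandwidth) are illustrative assumptions rather than numbers from the cited study or a vendor spec.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float,
                                weight_bytes: float,
                                kv_cache_bytes: float = 0.0) -> float:
    """Upper bound on decode rate when memory bandwidth is the limit.

    Assumes each generated token reads all weights plus the current
    KV cache exactly once. Real throughput will be lower.
    """
    bytes_per_token = weight_bytes + kv_cache_bytes
    if bytes_per_token <= 0:
        raise ValueError("bytes read per token must be positive")
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical figures, chosen only to show the shape of the bound.
BANDWIDTH_GB_S = 68.0
WEIGHT_BYTES = 1.0e9

no_cache = decode_ceiling_tokens_per_s(BANDWIDTH_GB_S, WEIGHT_BYTES)
with_cache = decode_ceiling_tokens_per_s(BANDWIDTH_GB_S, WEIGHT_BYTES,
                                         kv_cache_bytes=0.25e9)

print(f"ceiling with empty KV cache:   {no_cache:.1f} tokens/s")
print(f"ceiling with 0.25 GB KV cache: {with_cache:.1f} tokens/s")
```

TOPS does not appear anywhere in that function, which is the point. The ceiling also falls as the cache grows, so long prompts cost decode speed as well as prefill time.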

This is why quantisation is mandatory, not optional. Storing weights at 4 bits instead of 16 cuts both the footprint and the bytes that have to move per token. A 4-bit quantised model in the 1-to-3-billion-parameter class is the realistic target for a sealed edge module. That sounds small next to the headline cloud models, and it is. But it's enough to summarise a log, classify an alarm, or answer a question against documents you feed it — which is most of what a plant actually wants. Pair the small model with retrieval over the plant's own records, and the answers get specific without the model needing to be large.
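The footprint arithmetic is simple enough to check by hand. The sketch below counts weight storage only, as parameters times bits per weight. It ignores the scale and metadata overhead that real quantised formats carry, and the 8 GB module with 3 GB held back for the OS, runtime, and key-value cache is a hypothetical budget, not a recommendation.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: int) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: int,
         ram_gb: float, reserve_gb: float) -> bool:
    """True if the weights fit after reserving RAM for everything else."""
    return weight_footprint_gb(n_params, bits_per_weight) <= ram_gb - reserve_gb


# Hypothetical budget: 8 GB module, 3 GB reserved for OS, runtime, KV cache.
RAM_GB, RESERVE_GB = 8.0, 3.0

for n_params in (1e9, 3e9):
    for bits in (16, 4):
        gb = weight_footprint_gb(n_params, bits)
        ok = fits(n_params, bits, RAM_GB, RESERVE_GB)
        print(f"{n_params / 1e9:.0f}B params at {bits:>2} bits: "
              f"{gb:.2f} GB, fits={ok}")
```

The same factor of four applies to the bytes moved per token, so the 4-bit model both fits where the 16-bit one does not and decodes against a proportionally higher bandwidth ceiling.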

That pairing is the part worth dwelling on, because it's where most edge deployments either earn their keep or quietly fail. A small model on its own knows nothing about your plant. It knows language. The value comes from putting your records in front of it at query time — the P&IDs, the maintenance logs, the SOPs, the last six months of batch data — and asking it to answer only from what it was handed. Done right, the model becomes a fast index over documents the plant already trusts, and the size of the model matters far less than the quality of what you feed it. So the engineering effort moves off the model and onto the data pipeline: getting records out of a dozen historians and PDFs into something the model can read, keeping that store current as the plant changes, and tagging it so retrieval returns the right page and not a plausible-sounding wrong one. Get the retrieval layer wrong and a bigger model won't save you; get it right and a 2-billion-parameter model in a sealed box answers better than a frontier model that's never seen your tags.
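To make the pattern concrete, here is a minimal sketch of retrieval followed by a grounded prompt. It is not how any particular product does it: the scoring is plain word overlap standing in for a real retrieval layer, and the record contents, source names, and function names are all invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str  # where the text came from, e.g. a log or SOP identifier
    text: str


def tokens(text: str) -> set[str]:
    """Lowercase word set; keeps tag-like tokens such as 'p-301' together."""
    return set(re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower()))


def retrieve(question: str, records: list[Record], k: int = 2) -> list[Record]:
    """Rank records by words shared with the question; drop zero-overlap records."""
    q = tokens(question)
    scored = [(len(q & tokens(r.text)), r) for r in records]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:k]]


def build_prompt(question: str, hits: list[Record]) -> str:
    """Hand the model only the retrieved text and tell it to stay inside it."""
    context = "\n".join(f"[{r.source}] {r.text}" for r in hits)
    return (
        "Answer only from the records below. "
        "If they do not contain the answer, say so.\n\n"
        f"Records:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical records, invented for illustration.
RECORDS = [
    Record("maint-log", "Line 3 filler pump P-301 seal replaced during night shift."),
    Record("sop-12", "Lockout procedure for the dryer hall conveyor."),
    Record("batch-log", "Line 3 batch changeover completed, recipe updated."),
]

QUESTION = "What changed on line 3 last shift?"
hits = retrieve(QUESTION, RECORDS)
prompt = build_prompt(QUESTION, hits)
print(prompt)
```

Each retrieved line carries its source tag into the prompt, so a reviewer can trace an answer back to the record it came from. Swapping the overlap score for a proper index changes `retrieve` and nothing else.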

Here's what surprised us. A 2026 study that ran the same 4-bit model under sustained load across very different edge silicon found that an RTX 4050 laptop GPU produced about 131.7 tokens/second at 34.1 W, while a dedicated Hailo-10H neural processing unit produced about 6.9 tokens/second at under 2 W. [arXiv 2603.23640, 2026] Nineteen times slower, but at roughly a seventeenth of the power or less, and with near-zero variance. And on a plant floor, predictable and cool often beats fast and hot. Seven tokens a second is faster than anyone reads a handover note, and the slow part will hold that rate in a sealed box at 50 °C ambient where a hungry GPU would throttle. The fastest part on the bench is rarely the right part in the cabinet.
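The ratios in that comparison follow from the four quoted figures. The sketch below works them out, along with joules per token (watts divided by tokens per second). One assumption is labelled in the code: the NPU's draw is quoted only as "under 2 W", so 2.0 W is used as an upper bound, which makes the power ratio a minimum and the NPU's energy per token a maximum.

```python
# Figures as quoted from the study.
GPU_TOKENS_PER_S, GPU_WATTS = 131.7, 34.1
NPU_TOKENS_PER_S = 6.9
NPU_WATTS_MAX = 2.0  # assumption: "under 2 W" treated as an upper bound

speed_ratio = GPU_TOKENS_PER_S / NPU_TOKENS_PER_S       # how much faster the GPU is
power_ratio_min = GPU_WATTS / NPU_WATTS_MAX             # at least this much more power

# Energy per token in joules: watts / (tokens per second).
gpu_j_per_token = GPU_WATTS / GPU_TOKENS_PER_S
npu_j_per_token_max = NPU_WATTS_MAX / NPU_TOKENS_PER_S  # upper bound

print(f"GPU is {speed_ratio:.1f}x faster")
print(f"GPU draws at least {power_ratio_min:.2f}x the power")
print(f"GPU energy per token: {gpu_j_per_token:.3f} J")
print(f"NPU energy per token: at most {npu_j_per_token_max:.3f} J")
```

On these figures the energy per token comes out in the same range for both parts. What differs is the rate at which heat has to leave the enclosure, and that rate is what a sealed cabinet has to be designed around.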

Heat, dust, and the thing that fails first

The same study is a useful warning about thermal behaviour. Under sustained inference, an iPhone 16 Pro lost nearly half its throughput within two iterations, and a Galaxy S24 Ultra hit an OS-enforced frequency floor that ended inference entirely. [arXiv 2603.23640, 2026] The lesson transfers cleanly: peak compute is not the constraint, sustained thermal headroom is. Consumer devices benchmark beautifully and collapse under a continuous load. A plant model runs continuously.
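A sustained-load check is easy to sketch. The harness below runs a generation call back to back, records tokens per second for each iteration, and reports the first iteration that falls below a chosen fraction of the first one. The `generate` callable is a placeholder for whatever runtime you actually use, and the 0.8 floor and the example rates are assumptions for illustration, not values from the study.

```python
import time
from typing import Callable, Optional


def measure_rates(generate: Callable[[int], int],
                  n_tokens: int, iterations: int) -> list[float]:
    """Run `generate` back to back; return tokens/second for each iteration.

    `generate(n_tokens)` must return the number of tokens it produced.
    """
    rates = []
    for _ in range(iterations):
        start = time.perf_counter()
        produced = generate(n_tokens)
        elapsed = time.perf_counter() - start
        rates.append(produced / elapsed if elapsed > 0 else float("inf"))
    return rates


def first_throttled_iteration(rates: list[float],
                              floor_fraction: float = 0.8) -> Optional[int]:
    """Index of the first iteration slower than floor_fraction x the first, else None."""
    if not rates:
        return None
    baseline = rates[0]
    for i, rate in enumerate(rates):
        if rate < floor_fraction * baseline:
            return i
    return None


# Hypothetical rate traces: one part that collapses, one that holds.
collapsing = [20.0, 19.5, 11.0, 10.8]
steady = [6.9, 6.9, 6.8, 6.9]

print("collapsing part throttles at iteration:", first_throttled_iteration(collapsing))
print("steady part throttles at iteration:", first_throttled_iteration(steady))
```

Run it for hours, not minutes, and inside the closed cabinet. A trace that looks flat on an open bench tells you little about the part in the enclosure.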

So the enclosure design is the deployment. We size for the worst-case ambient inside the cabinet, not the room, and that delta can be 15 to 20 °C once you account for VFDs and power supplies sharing the steel. We prefer modules rated for the full industrial temperature band and a passive heat path to the enclosure wall — no fan to clog with the fine particulate that gets everywhere in a mill or a dryer hall. What fails first on a poorly specified edge box? Never the model. It's the fan, the thermal paste, or a connector that worked loose on a vibrating skid. Choose the hardware the way you'd choose any other panel component: by what survives the environment, not by what wins the benchmark.
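The sizing rule reduces to one subtraction, sketched here so the margin is explicit. The 15 and 20 °C rises are the range given above. The 35 °C room and the 60 °C module rating are hypothetical values, so substitute the figures from your own site survey and datasheet.

```python
def cabinet_headroom_c(room_ambient_c: float,
                       cabinet_rise_c: float,
                       module_max_ambient_c: float) -> float:
    """Margin between the module's rated ambient and the air inside the cabinet."""
    internal_ambient_c = room_ambient_c + cabinet_rise_c
    return module_max_ambient_c - internal_ambient_c


# Hypothetical site: 35 C room, module rated to 60 C ambient.
ROOM_C, MODULE_MAX_C = 35.0, 60.0

low_rise = cabinet_headroom_c(ROOM_C, 15.0, MODULE_MAX_C)
high_rise = cabinet_headroom_c(ROOM_C, 20.0, MODULE_MAX_C)

print(f"headroom with 15 C cabinet rise: {low_rise:.0f} C")
print(f"headroom with 20 C cabinet rise: {high_rise:.0f} C")
```

A negative result means the module is out of its rated band before it has done any work.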

Confabulation on the plant floor

A language model will state a wrong answer with the same confidence as a right one. NIST has a precise word for it. In its Generative AI Profile, published in July 2024, NIST defines confabulation as "the production of confidently stated but erroneous or false content (known colloquially as 'hallucinations' or 'fabrications') by which users may be misled or deceived," and notes these have "been shown to be pervasive in current state-of-the-art LLMs." [NIST AI 600-1, 2024] That is not a reason to avoid the technology. It is a reason to scope it correctly.

The rule we hold to is simple: the model never closes a loop. It does not write a setpoint, trip an interlock, or acknowledge an alarm. It reads, summarises, retrieves, and drafts — and a human signs off on anything that acts on the process. The model is an interface to information the plant already has, not a controller. Control logic stays in the PLC and the DCS, governed by the standards it has always been governed by. The LLM sits beside that, on the information side of the wall, and every output it produces is treated as a draft a person reviews. NIST's profile also flags information security and information integrity as core generative-AI risk areas, which is exactly the framing a plant should keep: the model is a source of drafts to verify, not facts to trust. Scope it as decision support and the confabulation risk is bounded by the human reading the screen. Wire it into control and that same risk becomes a hazard.
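One way to make that rule hard to break is to encode it at the interface. This is a sketch of the idea under our own assumptions, not a description of any product: the action names, the `Draft` type, and the sign-off function are all hypothetical. The model-facing entry point accepts only information-side actions, and every result is a draft that stays unreleased until a named person approves it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    SUMMARISE = "summarise"
    RETRIEVE = "retrieve"
    DRAFT = "draft"
    WRITE_SETPOINT = "write_setpoint"
    ACK_ALARM = "ack_alarm"
    TRIP_INTERLOCK = "trip_interlock"


# The model may only read, summarise, retrieve, and draft.
MODEL_ALLOWED = frozenset({Action.SUMMARISE, Action.RETRIEVE, Action.DRAFT})


@dataclass
class Draft:
    text: str
    approved_by: Optional[str] = None  # set only through sign_off()

    @property
    def released(self) -> bool:
        return self.approved_by is not None


def model_request(action: Action, text: str) -> Draft:
    """Accept information-side requests; refuse anything that acts on the process."""
    if action not in MODEL_ALLOWED:
        raise PermissionError(f"model may not perform {action.value}")
    return Draft(text=text)


def sign_off(draft: Draft, reviewer: str) -> Draft:
    """A named person releases the draft."""
    if not reviewer.strip():
        raise ValueError("a named reviewer is required")
    draft.approved_by = reviewer
    return draft


handover = model_request(Action.DRAFT, "Night shift handover: ...")
print("released before review:", handover.released)
sign_off(handover, "shift supervisor")
print("released after review:", handover.released)
```

The refusal is a default-deny check against an allowlist, so a control-side action added later stays blocked unless someone deliberately permits it.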

Security: the model is an OT asset

Putting the model on-prem is a security decision before it is a performance one, and it has to be treated with the same discipline as any other OT asset. IEC 62443, the industrial-security standard, organises systems into zones with shared security requirements and connects them through conduits, the controlled communication paths between zones. It rates the required strength of protection in security levels from SL 1 (casual exposure) up to SL 4 (nation-state-grade threats). An edge inference box belongs in that model explicitly. It sits in a zone of its own. The data feeds into it and the query traffic in and out of it are conduits, and they get the same firewalling, monitoring, and least-privilege treatment as any other crossing.
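In practice that means writing the conduits down as an explicit allowlist and denying everything else. The sketch below shows the policy as data. It is not firewall configuration, and the zone names, protocols, and ports are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conduit:
    source_zone: str
    dest_zone: str
    protocol: str
    port: int


# Hypothetical conduits for an edge inference zone. Nothing else may cross.
ALLOWED_CONDUITS = frozenset({
    Conduit("historian", "edge-llm", "tcp", 5432),     # records flow in
    Conduit("operator-hmi", "edge-llm", "tcp", 8443),  # operator queries
})


def is_permitted(source_zone: str, dest_zone: str,
                 protocol: str, port: int) -> bool:
    """Default deny: a crossing is allowed only if it is a listed conduit."""
    return Conduit(source_zone, dest_zone, protocol, port) in ALLOWED_CONDUITS


checks = [
    ("historian", "edge-llm", "tcp", 5432),
    ("edge-llm", "internet", "tcp", 443),
    ("edge-llm", "control", "tcp", 502),
]
for crossing in checks:
    print(crossing, "->", "allow" if is_permitted(*crossing) else "deny")
```

The list contains no conduit that starts at the inference zone. Anything the box tries to send outward, or toward the control zone, fails the check by construction.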

The air gap is the payoff. A model that runs locally needs no outbound connection to a model provider, which removes an entire class of exposure: no prompt data crossing the boundary, no dependency on a vendor's uptime or terms, no exfiltration path through an API. But that doesn't make the box safe by default. The weights are an asset to protect, the query interface is an attack surface to harden, and the module needs patching like any industrial computer. Still, "the process data physically cannot leave" is a property you can put in front of an auditor, and it is a far stronger one than a contractual promise from an off-site provider. For plants in regulated sectors, that is frequently the deciding factor.

What it costs to run

The economics are not the headline, but they're cleaner than expected. The capital cost is a sealed compute module and its enclosure — the price of a decent industrial PC, not a server room. There is no per-token meter and no egress bill, so the marginal cost of a query is essentially the electricity to compute it. At the power figures above, that is watts, not dollars. The operating cost is the part people forget: someone has to own the model — keep it patched, watch what it's asked and what it answers, and re-validate it when the underlying weights are updated. That governance work shows up as staff hours, not licence fees, and it is the line item that gets left off the first budget. The hardware is cheap. The discipline is not.
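For a sense of scale, the marginal electricity cost of one answer can be worked out from power, answer length, and decode rate. The sketch below reuses the NPU figures quoted earlier (2 W taken as an upper bound, 6.9 tokens per second). The 300-token answer and the price of 0.20 per kWh are hypothetical, and the result covers generation only, not idle draw or prefill.

```python
JOULES_PER_KWH = 3.6e6


def query_energy_joules(power_w: float, answer_tokens: int,
                        tokens_per_s: float) -> float:
    """Energy to generate one answer: power x time spent decoding."""
    return power_w * answer_tokens / tokens_per_s


def query_cost(power_w: float, answer_tokens: int,
               tokens_per_s: float, price_per_kwh: float) -> float:
    """Electricity cost of one answer, in the currency of price_per_kwh."""
    kwh = query_energy_joules(power_w, answer_tokens, tokens_per_s) / JOULES_PER_KWH
    return kwh * price_per_kwh


# Hypothetical answer length and electricity price.
ANSWER_TOKENS = 300
PRICE_PER_KWH = 0.20

energy_j = query_energy_joules(2.0, ANSWER_TOKENS, 6.9)
cost = query_cost(2.0, ANSWER_TOKENS, 6.9, PRICE_PER_KWH)

print(f"energy per answer: {energy_j:.1f} J")
print(f"cost per answer:   {cost:.2e}")
print(f"answers per unit of currency: {1 / cost:,.0f}")
```

Under those assumptions the electricity for a single answer is a negligible fraction of a currency unit, so staff hours dominate the budget.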

Against the off-site alternative, the trade is straightforward. You give up the largest frontier models and you take on the maintenance. In return you get data that stays put, latency measured in milliseconds, a flat and predictable cost, and operation that survives a dropped WAN link. For a plant whose data is its competitive position, that trade usually favours the box in the rack — but only if you're honest about the maintenance you're signing up for.

Where this applies

Don't begin with the model. Begin with one question worth answering and the data that answers it — shift handovers, alarm rationalisation, maintenance history, batch records. Pick a workload where a confident-but-wrong answer is an inconvenience, not a hazard, because the human is reading the output anyway. Size the hardware to the sustained workload and the cabinet's worst-case temperature, not the spec sheet. Then put the box in a 62443 zone and treat it like the OT asset it is.

That is the work we do at Zoniax: instrumenting plants and standing up the edge telemetry and analytics platform that gives a local model something true to read, and the industrial AI deployment discipline to keep it governed once it's running. The model in the rack is the easy part. Feeding it real plant data, keeping it out of the control loop, and proving to an auditor that none of it ever left the fence — that's the engineering. Done that way, the small box two slots up from the switch earns its rack space.

Notes and limits

A few caveats worth stating plainly. The throughput and power numbers above come from controlled studies on specific silicon and a single small model; treat them as the shape of the trade-off, not as a spec for your hardware. Your tokens-per-second will depend on the model, the quantisation, the prompt length, and the thermal envelope of your enclosure — benchmark on the part you intend to ship, in the cabinet it will live in. The regulatory dates are the EU AI Act's phased schedule as it stands; how the high-risk and transparency rules land on a specific plant deployment depends on the use case and is a question for your compliance function, not a blog. And "air-gapped" is a claim you have to keep earning: a local model only stays local if nobody quietly adds an outbound update path later. None of this is a reason to wait. It's the checklist for doing it properly.
