Predictive Maintenance for Processing Plants

From sensor data to avoided downtime: how to choose where inference runs, how readings travel, and whether alarms come from rules or learned models.

A bearing rarely fails all at once. It warns you for weeks. A vibration peak climbs at a defect frequency, the housing runs a little hotter, the motor current drifts off its baseline. Predictive maintenance, usually shortened to PdM, is the discipline of catching those early signals and acting on your own schedule instead of the failure's. Done well, it turns a 3 a.m. line stop into a planned ten-minute swap during a changeover.

That much is settled. What isn't settled, and what this piece is about, is how you build the signal chain that gets you there. Where does inference run, at the machine or in a data center? How do the readings travel, over wire or over radio? And does the alarm come from a fixed threshold or a learned model? Each of those is a real fork, with real cost and reliability consequences, and the right answer depends on the plant, not on the brochure.

So let's set up the comparison properly. First the economics, because they decide whether any of this is worth doing. Then the standard signal chain every PdM system shares. Then the three architecture decisions, judged against the criteria that actually matter on a plant floor: cost, latency, reliability, and the maintenance burden of the monitoring kit itself.

The economics, and why the ladder is worth climbing

Maintenance strategies sit on a ladder. At the bottom is reactive, or run-to-failure: cheapest until something breaks, then ruinous. Above it is preventive, where you service on a fixed calendar or runtime interval whether the machine needs it or not. At the top is predictive, where condition data tells you when intervention is genuinely due.

The U.S. Department of Energy's Operations & Maintenance Best Practices Guide puts numbers on the climb. According to the DOE guide, moving from reactive to preventive maintenance yields cost savings of 12% to 18% on average.

Predictive monitoring then adds another increment on top. Based on the same DOE guide, layering condition-based predictive maintenance over a preventive program is worth a further 8% to 12%. The gains compound because each rung removes a different kind of waste: preventive kills the catastrophic surprise, predictive kills the unnecessary calendar service on a machine that was still healthy.

The bigger jump is for plants starting low. Against a facility still running mostly to failure, the DOE guide notes the savings opportunity from predictive maintenance can reach 30% to 40%. Those are program-level averages, not guarantees for any one asset, but the direction is consistent across the literature, and it tracks with what we see when a site moves its worst-offender machines onto continuous monitoring first.
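To see how the two increments stack, here is a rough worked example. It assumes each percentage applies to whatever cost remains after the previous rung. That multiplicative reading is our assumption for illustration; the DOE guide reports the figures as separate averages and does not prescribe how to combine them.

```python
def combined_saving(first: float, second: float) -> float:
    """Total fractional saving when `second` applies to the cost left after `first`."""
    remaining = (1.0 - first) * (1.0 - second)
    return 1.0 - remaining

# DOE ranges quoted above: 12% to 18% (reactive to preventive),
# then 8% to 12% (preventive to predictive).
low = combined_saving(0.12, 0.08)
high = combined_saving(0.18, 0.12)
print(f"combined saving: {low:.1%} to {high:.1%}")
```

Under that assumption the two rungs together come to roughly 19% to 28%. That sits below the 30% to 40% quoted for mostly reactive plants, which suggests the larger figure is a separately reported estimate rather than the product of the other two.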

The same DOE source lists the industrial averages it attributes to a functioning predictive maintenance program:

  • Return on investment: about 10 times
  • Reduction in maintenance costs: 25% to 30%
  • Elimination of breakdowns: 70% to 75%
  • Reduction in downtime: 35% to 45%
  • Increase in production: 20% to 25%

Treat those as a ceiling for a mature program on well-instrumented rotating equipment, not a forecast for week one. But they explain why the question is rarely "should we?" and almost always "how, and on which machines?" The honest version of the business case starts by ranking assets on consequence of failure times probability of failure, then instruments the short head of that list first.
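That ranking is simple enough to sketch. The example below scores each asset as failure cost times annual failure probability and sorts the list. The `Asset` class, its field names, and every number are hypothetical, chosen only to show the shape of the calculation.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    failure_cost: float         # hypothetical cost of one failure, currency units
    annual_failure_prob: float  # hypothetical probability of failure per year, 0..1

    @property
    def risk(self) -> float:
        """Expected annual loss: consequence times probability."""
        return self.failure_cost * self.annual_failure_prob

def rank_by_risk(assets: list[Asset]) -> list[Asset]:
    """Highest expected loss first; instrument from the top down."""
    return sorted(assets, key=lambda a: a.risk, reverse=True)

# Illustrative values only, not plant data.
fleet = [
    Asset("redundant transfer pump", 2_000, 0.50),
    Asset("main extruder gearbox", 250_000, 0.10),
    Asset("cooling tower fan", 40_000, 0.20),
]
ranked = rank_by_risk(fleet)
for a in ranked:
    print(f"{a.name:28s} expected loss/yr: {a.risk:>10,.0f}")
```

Note how the pump, despite failing most often, lands at the bottom: a cheap failure on a redundant asset carries little risk, which is the same logic that later argues for leaving such assets on run-to-failure.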

There's a second, quieter return that rarely makes the slide deck: spares and labor planning. When you know a gearbox has roughly six weeks of life left, rather than finding out when it fails, you order the part on a normal lead time instead of paying for expedited freight, and you fold the repair into a planned outage instead of pulling a crew onto overtime at night. That smoothing of the maintenance workload is often worth as much as the avoided downtime itself, and it's the part operators feel first.

The signal chain every PdM system shares

Before arguing about architecture, it helps to agree on the stages, because they're standardized. ISO 13374, the international standard for data processing in machine condition monitoring, defines an open six-block reference architecture. The same six blocks underpin the MIMOSA OSA-CBM open specification, which implements the ISO functional model with concrete data structures and interfaces. Every PdM deployment is some arrangement of these blocks, whether the vendor names them or not:

  1. Data acquisition — read the installed sensors and digitize the raw signal.
  2. Data manipulation — filter, transform, and extract features (an FFT spectrum, an RMS velocity band, a temperature trend).
  3. State detection — compare features against expected values or limits and raise condition indicators or alarms.
  4. Health assessment — rate current health and diagnose the likely fault.
  5. Prognostic assessment — estimate remaining useful life before the next significant state change.
  6. Advisory generation — turn the diagnosis and forecast into a recommended action.

Why does this matter for an architecture decision? Because the forks below are really questions about where each block runs and how smart it is. Edge versus cloud is a question of which blocks execute at the machine. Rule-based versus learned is a question of how blocks three through five make their calls. Keeping the ISO 13374 model in mind stops you from comparing a vendor's slick advisory dashboard against another's raw acquisition hardware as if they were the same thing.
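To make the six blocks concrete, here is a toy pipeline with one function per block. It is a sketch under heavy assumptions, not an OSA-CBM implementation: the function names, the synthetic sine-wave signal, the 4.5 mm/s limit, the linear health index, and the 30-day planning horizon are all illustrative choices of ours.

```python
import math

def acquire() -> list[float]:
    """Block 1: stand-in for a sensor read; velocity samples in mm/s."""
    # Synthetic 50 Hz sine, 4 mm/s amplitude, 1 kHz sampling, 1 second.
    return [4.0 * math.sin(2 * math.pi * 50 * n / 1000) for n in range(1000)]

def manipulate(samples: list[float]) -> dict:
    """Block 2: extract a feature, here overall RMS velocity."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return {"rms_mm_s": rms}

def detect_state(features: dict, alarm_limit: float = 4.5) -> dict:
    """Block 3: compare the feature against a limit."""
    return {**features, "alarm": features["rms_mm_s"] >= alarm_limit}

def assess_health(state: dict, alarm_limit: float = 4.5) -> dict:
    """Block 4: crude health index, 1.0 far from the limit, 0.0 at or past it."""
    health = max(0.0, 1.0 - state["rms_mm_s"] / alarm_limit)
    return {**state, "health": health}

def prognose(health_now: float, health_prev: float, days_between: float) -> float:
    """Block 5: linear extrapolation of days until the health index hits zero."""
    rate = (health_prev - health_now) / days_between  # health lost per day
    return math.inf if rate <= 0 else health_now / rate

def advise(assessment: dict, days_left: float) -> str:
    """Block 6: turn the numbers into a recommended action."""
    if assessment["alarm"]:
        return "Inspect now"
    if days_left < 30:
        return f"Plan repair within {days_left:.0f} days"
    return "Continue monitoring"

assessment = assess_health(detect_state(manipulate(acquire())))
# health_prev is a made-up reading from a week earlier.
days_left = prognose(assessment["health"], health_prev=0.47, days_between=7)
print(assessment, advise(assessment, days_left))
```

Seen this way, the architecture forks become placement questions: `acquire` through `detect_state` could run on a gateway while `assess_health` and `prognose` run centrally, or all six could sit in one box.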

The program around the chain is standardized too. ISO 17359 gives the general procedure for setting up condition monitoring: identify the critical machines, choose the measurement parameters that map to their real failure modes, set baselines and alarm criteria, then review. It explicitly lists the usual parameters — vibration, temperature, tribology, flow, contamination, power, and speed. The discipline it enforces is choosing parameters from failure modes backward, not bolting on whatever sensor a supplier happens to stock.

That backward reasoning has practical bite. A slow-speed agitator and a high-speed centrifugal pump are both rotating machines, but their failure modes and useful parameters barely overlap. The pump tells its story in vibration velocity and bearing-frequency acceleration; the agitator may reveal more through motor current and lubricant condition. Pick the sensor first and you end up monitoring what's easy to measure instead of what actually predicts the failure you're trying to avoid. Pick the failure mode first, as ISO 17359 directs, and the sensor list almost writes itself.

Decision one: edge or cloud inference

The first fork is where state detection, health assessment, and prognostics actually compute. Push them to a gateway or smart sensor at the machine, and you're doing edge inference. Stream features or raw data to a central server or cloud, and the heavy thinking happens there.

On latency, the edge wins by construction. A vibration protection function that has to trip a press in milliseconds can't wait on a round trip to a regional data center. If an alarm needs to drive a local interlock, it belongs at the edge. Cloud latency is fine for trends measured over hours and days, which is most of condition monitoring, but useless for anything in a control loop.

On cost, the picture inverts as you scale. Edge nodes cost more per point because compute and storage ride on every device, but they ship far less data upstream. Cloud concentrates the compute, so the marginal cost of one more analytic is low, while continuous high-rate streaming runs up bandwidth and ingest bills. A handful of critical machines often pencils out cheaper at the edge; a thousand low-criticality points usually favor centralizing the analytics and sending only features.
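A back-of-envelope model shows why the answer flips with scale. Everything in this sketch is an assumption: the linear cost shapes, the function names, and all the currency figures are invented for illustration and should be replaced with real quotes.

```python
def edge_cost(points: int, node_cost: float, upstream_per_point: float) -> float:
    """Annualized cost when every point carries its own compute."""
    return points * (node_cost + upstream_per_point)

def cloud_cost(points: int, platform_fixed: float, sensor_cost: float,
               streaming_per_point: float) -> float:
    """Annualized cost when compute is centralized and points stream data up."""
    return platform_fixed + points * (sensor_cost + streaming_per_point)

def break_even_points(node_cost: float, upstream_per_point: float,
                      platform_fixed: float, sensor_cost: float,
                      streaming_per_point: float) -> float:
    """Point count at which both architectures cost the same."""
    gap = (node_cost + upstream_per_point) - (sensor_cost + streaming_per_point)
    if gap <= 0:
        return float("inf")  # edge never costs more per point: no crossover
    return platform_fixed / gap

# Hypothetical annualized figures, arbitrary currency units.
NODE, UPSTREAM = 1200.0, 20.0                     # smart node, small feature uplink
PLATFORM, SENSOR, STREAM = 30000.0, 400.0, 220.0  # fixed platform, plain sensor, streaming

crossover = break_even_points(NODE, UPSTREAM, PLATFORM, SENSOR, STREAM)
for n in (10, 1000):
    e = edge_cost(n, NODE, UPSTREAM)
    c = cloud_cost(n, PLATFORM, SENSOR, STREAM)
    print(f"{n:5d} points: edge {e:>12,.0f}  cloud {c:>12,.0f}")
print(f"break-even near {crossover:.0f} points")
```

With these made-up inputs the edge is cheaper at ten points and the cloud is cheaper at a thousand. The useful output is not the specific break-even but the habit of computing one from your own numbers.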

On reliability, the edge degrades gracefully. A gateway that keeps watching a pump through a network outage is worth a lot in a plant where the link to the enterprise data center is the least reliable thing in the building. Cloud buys you managed uptime, redundancy, and someone else's patching, which a small maintenance team rarely matches on its own. The trade is dependence on connectivity you don't fully control.

On the maintenance burden of the monitoring kit itself — the part teams forget — edge means more devices to power, mount, calibrate, and eventually replace in a hot, wet, vibrating environment. Centralized analytics means fewer field assets but a heavier integration and data-pipeline job. Neither is free. The realistic pattern most plants land on is hybrid: fast state detection at the edge for protection and first alarms, with feature streams sent up for the model-heavy health assessment and prognostics that benefit from fleet-wide history.
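The hybrid pattern can be sketched in a few lines: the gateway makes the trip decision locally and forwards features when the link allows, buffering them when it doesn't. The `EdgeGateway` class, its method names, and the `send` callback are hypothetical stand-ins, not any vendor's API.

```python
from collections import deque

class EdgeGateway:
    """Local state detection plus store-and-forward of features."""

    def __init__(self, trip_limit: float, buffer_size: int = 1000):
        self.trip_limit = trip_limit
        # If an outage outlasts the buffer, the oldest readings are dropped first.
        self.buffer = deque(maxlen=buffer_size)

    def on_reading(self, rms_mm_s: float, link_up: bool, send) -> bool:
        """Evaluate locally, then forward or buffer. Returns True if tripped."""
        tripped = rms_mm_s >= self.trip_limit  # decided without the network
        self.buffer.append({"rms_mm_s": rms_mm_s, "tripped": tripped})
        if link_up:
            while self.buffer:                 # flush backlog in arrival order
                send(self.buffer.popleft())
        return tripped

uploaded = []
gw = EdgeGateway(trip_limit=11.2)
gw.on_reading(3.1, link_up=False, send=uploaded.append)             # outage: buffered
tripped = gw.on_reading(12.0, link_up=False, send=uploaded.append)  # trips locally
gw.on_reading(3.0, link_up=True, send=uploaded.append)              # link back: flushed
print(tripped, len(uploaded))
```

The design choice worth noticing is that the trip decision never touches the network path. A real gateway would also need persistent storage and retry handling, which this sketch leaves out.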

Decision two: wired or wireless telemetry

The second fork is how readings travel from sensor to analytic. Wired telemetry — 4-20 mA loops, fieldbus, or industrial Ethernet running OPC UA — gives you deterministic delivery and steady power. OPC UA, published by the OPC Foundation, added a publish/subscribe model in its 2018 release (standardized as IEC 62541-14) that suits many-to-many telemetry and, over Time-Sensitive Networking, can hit deterministic timing for the fast loops. Wire is the default for anything safety-related or anything inside a control loop.

Wireless earns its place on retrofit and reach. Running conduit to a motor on the far side of a process unit, or to a rotating asset, can cost more than the sensor it serves. Battery-powered wireless vibration and temperature nodes turn a multi-day cable pull into an afternoon. The cost you accept is power management and a duty cycle: a battery node that wakes every few minutes to push an RMS value and a spectrum will run for years, but it can't stream continuously, so it's poorly suited to high-rate transient capture.
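The duty-cycle trade is easy to put numbers on. The sketch below estimates battery life from a time-weighted average current. The cell capacity, sleep and active currents, and wake interval are hypothetical, and the model ignores self-discharge, temperature, and capacity derating, all of which shorten real-world life.

```python
def battery_life_years(capacity_mah: float, sleep_ma: float, active_ma: float,
                       active_s: float, interval_s: float) -> float:
    """Estimate battery life from the time-weighted average current draw."""
    if not 0 < active_s <= interval_s:
        raise ValueError("active_s must be positive and no longer than interval_s")
    duty = active_s / interval_s
    avg_ma = active_ma * duty + sleep_ma * (1.0 - duty)
    hours = capacity_mah / avg_ma
    return hours / (24 * 365)

# Hypothetical node: 19,000 mAh cell, 0.01 mA asleep, 40 mA while measuring/sending.
duty_cycled = battery_life_years(19000, 0.01, 40, active_s=2, interval_s=300)
continuous = battery_life_years(19000, 0.01, 40, active_s=300, interval_s=300)
print(f"wake 2 s every 5 min: {duty_cycled:.1f} years")
print(f"streaming nonstop:    {continuous * 365:.0f} days")
```

With these assumed figures the duty-cycled node lasts years while the same hardware streaming continuously lasts a matter of weeks, which is the arithmetic behind keeping battery nodes off high-rate transient capture.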

On reliability, wire still leads where it's installed. A radio link contends with metal, moving equipment, and a 2.4 GHz band that plant Wi-Fi and everything else already crowd. Mesh protocols handle this better than they used to, but one fact remains: the more your alarm path depends on radio, the more your monitoring system inherits the radio's bad days. (We see this most on metals and heavy-process sites, where structural steel is everywhere and reflective.) Wireless lowers installation cost and raises operating attention; wire does the reverse.

Decision three: rule-based or learned models

The third fork is how the system decides something is wrong. A rule-based approach compares a measured feature to a fixed limit. The classic example is ISO 10816, the standard for evaluating machine vibration by measurements on non-rotating parts, which sorts broadband velocity into severity zones A through D in mm/s RMS. The zone limits depend on machine class and foundation type; for one commonly cited class, the alarm-worthy boundary sits around 4.5 mm/s RMS and the shutdown region begins above roughly 11.2 mm/s RMS. Rules like this are transparent and need no training data, and a vibration analyst can defend every one to an auditor.
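A zone rule of this kind reduces to a lookup. The sketch below uses the 4.5 and 11.2 mm/s boundaries quoted above; the 1.8 mm/s boundary between zones A and B is a placeholder we assumed so the example has all three limits. Take the real values from the part of the standard that matches your machine class and foundation.

```python
from bisect import bisect_right

# Zone boundaries in mm/s RMS. 4.5 and 11.2 come from the text above;
# 1.8 (A/B) is an assumed placeholder, not a quoted figure.
DEFAULT_BOUNDARIES = (1.8, 4.5, 11.2)

def severity_zone(rms_mm_s: float, boundaries=DEFAULT_BOUNDARIES) -> str:
    """Map broadband RMS velocity to zone A, B, C, or D.

    A reading exactly on a boundary is placed in the more severe zone.
    """
    if rms_mm_s < 0:
        raise ValueError("RMS velocity cannot be negative")
    return "ABCD"[bisect_right(boundaries, rms_mm_s)]

for reading in (0.9, 3.0, 4.5, 7.0, 12.5):
    print(f"{reading:5.1f} mm/s -> zone {severity_zone(reading)}")
```

Passing the boundaries as a parameter keeps the rule auditable: the limits live in one visible tuple per machine class rather than being scattered through the logic.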

Rules have a ceiling, though. A single broadband limit tells you something changed, not what or how soon. It struggles with machines whose "normal" shifts by load and speed, and it can miss an early-stage defect whose energy hides in a narrow frequency band well below the overall alarm. That's where frequency-domain rules — tracking the specific bearing defect frequencies, BPFO and BPFI — and learned models come in.
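The two defect frequencies follow from bearing geometry and shaft speed. The sketch below uses the standard kinematic formulas, assuming a stationary outer race, an inner race turning with the shaft, and no slip. The function name and the example geometry are illustrative; use the dimensions from your own bearing's datasheet.

```python
import math

def bearing_defect_frequencies(shaft_hz: float, n_balls: int, ball_d: float,
                               pitch_d: float, contact_deg: float = 0.0):
    """Return (BPFO, BPFI) in Hz.

    BPFO = (n/2) * f_shaft * (1 - (d/D) * cos(phi))   outer-race defect
    BPFI = (n/2) * f_shaft * (1 + (d/D) * cos(phi))   inner-race defect
    ball_d and pitch_d must share the same unit.
    """
    ratio = (ball_d / pitch_d) * math.cos(math.radians(contact_deg))
    bpfo = 0.5 * n_balls * shaft_hz * (1.0 - ratio)
    bpfi = 0.5 * n_balls * shaft_hz * (1.0 + ratio)
    return bpfo, bpfi

# Illustrative geometry: 9 balls, 7.94 mm ball, 39.04 mm pitch diameter, 30 Hz shaft.
bpfo, bpfi = bearing_defect_frequencies(30.0, 9, 7.94, 39.04)
print(f"BPFO = {bpfo:.1f} Hz, BPFI = {bpfi:.1f} Hz")
```

A frequency-domain rule then watches narrow bands around these frequencies and their harmonics instead of the overall level. In practice slip shifts the observed peaks slightly, so the bands need some tolerance.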

Learned models infer the boundary from data instead of a standard. The public benchmarks that made this credible are worth knowing because they show both the promise and the prerequisites. NASA's Prognostics Center of Excellence publishes a data repository built precisely for remaining-useful-life work: turbofan run-to-failure simulations, bearing, battery, milling, and IGBT sets, all tracking degradation from healthy to failed. Its turbofan C-MAPSS data, introduced for the PHM08 prognostics challenge (Saxena, Goebel, Simon, and Eklund, Denver, October 2008), is the reference RUL problem: four sub-datasets, FD001 through FD004, of run-to-failure engine cycles where the task is to estimate cycles remaining. On the diagnostic side, the Case Western Reserve University bearing dataset, sampled at 12 kHz with single faults seeded by electro-discharge machining at 0.007, 0.014, and 0.021 inch diameters, is the canonical fault-classification set.

Here's the catch those datasets quietly teach. Both depend on labeled faults: C-MAPSS runs its simulated engines all the way to failure, and the Case Western set seeds known defects on a test rig. A learned prognostic model needs examples of the machine actually failing, and most plants, sensibly, don't run critical assets to destruction to generate training data. So the practical entry point is usually anomaly detection against a learned healthy baseline, which needs only normal-operation data, with full RUL prognostics reserved for fleets large enough that failures accumulate. A learned model also costs more to keep honest: it drifts as the process changes, and it needs a feedback loop from work orders to confirm whether its calls were right.
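The simplest form of a learned healthy baseline is a per-feature mean and standard deviation, with new readings scored by how many standard deviations they sit from normal. The sketch below is one minimal way to do that, not a recommended model: the `HealthyBaseline` class, the feature names, the z-limit of 4, and the synthetic data are all assumptions.

```python
import statistics

class HealthyBaseline:
    """Per-feature mean and standard deviation learned from normal operation only."""

    def fit(self, healthy_rows: list[dict]) -> "HealthyBaseline":
        names = list(healthy_rows[0].keys())
        self.mean = {k: statistics.fmean([r[k] for r in healthy_rows]) for k in names}
        # Floor the spread so a constant feature cannot cause division by zero.
        self.std = {k: max(statistics.stdev([r[k] for r in healthy_rows]), 1e-9)
                    for k in names}
        return self

    def score(self, row: dict) -> float:
        """Largest absolute z-score across features; higher means more unusual."""
        return max(abs(row[k] - self.mean[k]) / self.std[k] for k in self.mean)

    def is_anomalous(self, row: dict, z_limit: float = 4.0) -> bool:
        return self.score(row) > z_limit

# Synthetic "healthy" history: small scatter around 2.0 mm/s and 60 degrees C.
healthy = [{"rms_mm_s": 2.0 + 0.1 * (i % 5 - 2), "temp_c": 60.0 + 0.5 * (i % 3 - 1)}
           for i in range(30)]
baseline = HealthyBaseline().fit(healthy)

normal = {"rms_mm_s": 2.1, "temp_c": 60.2}
suspect = {"rms_mm_s": 3.2, "temp_c": 61.0}
print(baseline.is_anomalous(normal), baseline.is_anomalous(suspect))
```

Notice that the suspect reading of 3.2 mm/s would pass a broadband 4.5 mm/s limit untouched; it is flagged only because it is far from what this machine normally does. The same property makes the model sensitive to legitimate process changes, which is where the retraining and work-order feedback come in.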

Putting the criteria side by side

The three forks are separable, but they share the same evaluation criteria. This is the comparison in one view, rated qualitatively rather than with invented numbers:

| Criterion | Edge inference | Cloud inference | Wireless telemetry | Learned models |
| --- | --- | --- | --- | --- |
| Latency | Lowest; suits interlocks | Fine for trends, not loops | Duty-cycled, not real-time | Adds compute, but offline-tolerant |
| Cost shape | High per node, low bandwidth | Low per analytic, high data egress | Low install, ongoing battery/attention | High to build and maintain |
| Reliability | Survives link loss | Managed uptime, needs connectivity | Inherits the radio's bad days | Drifts; needs feedback loop |
| Maintenance burden | More field devices | Heavier data pipeline | Battery and RF management | Retraining and labeling |
| Best fit | Few critical, fast-acting assets | Many points, fleet analytics | Retrofit and hard-to-reach assets | Large fleets with failure history |

Read down the columns and the pairings fall out. Edge plus wired plus rule-based is the conservative, defensible build for a small set of high-consequence machines. Cloud plus wireless plus learned is the scalable build for blanketing a large, lower-criticality population where the value is in catching the few surprises across thousands of points. Most real plants run both, on different tiers of equipment.

Measuring whether it worked

Whatever you build, instrument the outcome, not just the machines. The cleanest yardstick is the one operations already trusts: Overall Equipment Effectiveness. ISO 22400-2, the international standard for manufacturing operations KPIs, defines OEE as the product of availability, performance, and quality. Predictive maintenance acts almost entirely on the availability term — fewer unplanned stops, shorter ones — so track availability before and after, by line, and you'll know in a quarter whether the program is paying. Tie every avoided failure back to a work order and a saved downtime estimate, and the business case stops being a slide and becomes a ledger.
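The before-and-after comparison is a short calculation. The sketch below computes OEE as the product of the three terms and derives availability from planned hours and unplanned downtime. The simple availability ratio is a simplification of ours; the standard defines each term against specific time categories. All hours and rates are hypothetical.

```python
def availability(planned_h: float, unplanned_down_h: float) -> float:
    """Share of planned production time the line actually ran."""
    if planned_h <= 0 or not 0 <= unplanned_down_h <= planned_h:
        raise ValueError("need planned_h > 0 and 0 <= downtime <= planned_h")
    return (planned_h - unplanned_down_h) / planned_h

def oee(avail: float, performance: float, quality: float) -> float:
    """Overall Equipment Effectiveness as the product of its three terms."""
    for name, value in (("availability", avail), ("performance", performance),
                        ("quality", quality)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return avail * performance * quality

# Hypothetical quarter on one line: 2000 planned hours, performance and quality unchanged.
avail_before = availability(2000, 120)   # 120 h of unplanned stops before the program
avail_after = availability(2000, 70)     # 70 h after
oee_before = oee(avail_before, 0.90, 0.98)
oee_after = oee(avail_after, 0.90, 0.98)
print(f"availability: {avail_before:.1%} -> {avail_after:.1%}")
print(f"OEE:          {oee_before:.1%} -> {oee_after:.1%}")
```

Holding performance and quality fixed, as here, isolates the availability effect, which is the term the program should be judged on.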

This is also where a tidy data path pays off. State detection at the edge, feature streams normalized to a common model, and a place to hold both the live signals and the history a learned model trains on — that's the spine of any serious deployment, and it's what an edge telemetry and analytics platform is for.

Where predictive maintenance doesn't pay off

It's not universal, and pretending otherwise is how programs lose credibility. Some honest limitations:

On cheap, redundant, or fast-to-replace assets, the monitoring costs more than the failures. A run-to-failure policy is the correct engineering choice for a redundant pump you can swap in twenty minutes from stores. Instrument it and you've spent money to watch something you'd happily let break.

On failure modes with no useful lead time, prediction has nothing to read. A bearing degrades over weeks and announces itself; a brittle fracture or an electronic latch-up can go from fine to failed faster than any sample interval will catch. PdM rewards gradual, observable degradation. Where the physics doesn't cooperate, no model saves you.

And the program can fail on its own overhead. Sensors drift and need calibration. Alarm thresholds set too tight bury the control room in false positives until people stop trusting them, which is worse than no system at all. A learned model with no feedback from the work-order system slowly goes stale. None of these are reasons to skip PdM; they're reasons to scope it to the assets and failure modes where the lead time is real and the consequence justifies the attention.

Security deserves a line, even though it isn't the focus here. Connecting formerly isolated machines to gateways and networks widens the attack surface, and the IEC 62443 series is the reference framework for industrial automation and control system security. Treat the monitoring network as part of the OT estate it touches, segment it, and don't let a vibration sensor become the soft way into a control system.

Which fits your plant

Start from the assets, not the architecture. Rank equipment by consequence of failure and by whether the failure mode gives you observable lead time. For the short list of critical, fast-acting machines, build edge-first: wired sensors where you can run them, rule-based protection you can defend, and frequency-domain diagnostics on top. For the long tail of lower-criticality equipment, lean toward centralized analytics, wireless retrofit sensors, and learned anomaly detection that earns its keep by catching the rare surprise across many points.

Then prove it on the availability term of OEE before you scale. The technology choices matter, but they matter less than picking the right machines and closing the loop from alarm to work order to verified save. Get that discipline right and the architecture debate becomes what it should be: an engineering trade-off with a clear answer for each tier of your plant, rather than a religious one.

References

  1. Operations & Maintenance Best Practices: A Guide to Achieving Operational Efficiency (PNNL-14788) — Maintenance Approaches
  2. ISO 13374-1:2003 — Condition monitoring and diagnostics of machines: data processing, communication and presentation
  3. MIMOSA OSA-CBM — open architecture for condition-based maintenance
  4. ISO 17359:2018 — Condition monitoring and diagnostics of machines: general guidelines
  5. ISO 10816 — Mechanical vibration: evaluation of machine vibration by measurements on non-rotating parts
  6. NASA Prognostics Center of Excellence — Prognostics Data Repository
  7. Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation (PHM08, C-MAPSS), Saxena et al., 2008
  8. Case Western Reserve University Bearing Data Center
  9. ISO 22400-2:2014 — Manufacturing operations management KPIs (OEE)
  10. OPC Foundation — OPC Unified Architecture (OPC UA), including PubSub / IEC 62541-14
  11. IEC 62443 — Security for industrial automation and control systems

Reuse & license

This article is published by Zoniax Innovations LLC under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt it for any purpose, including commercially, as long as you give appropriate credit to Zoniax and link back to the original article.

Disclaimer

These Field Notes are general technical information, published as-is for industry peers. They are not professional, engineering, safety, legal, or financial advice, and nothing here is a recommendation to buy, sell, or act. Figures are cited from public sources believed reliable but are not independently guaranteed — verify them against the primary sources and your own plant conditions before acting. Zoniax Innovations LLC and the author accept no liability for decisions made from this content. Naming a standard, product, or vendor is not an endorsement.

Cite this article

Nõmm, A. (2020). Predictive Maintenance for Processing Plants. Zoniax. https://zoniax.com/blog/posts/industrial-predictive-maintenance