Anomaly Detection on Process Data: Before the Alarm Trips

Every plant runs on limits. A bearing temperature trips at 95 °C, a header pressure annunciates at 9 bar, a flow drops below setpoint and the DCS lights up amber. Those limits work, and they will keep working. The problem is what they miss: the slow drift, the correlation that breaks before any single tag leaves its band, the pump that sounds wrong three shifts before it actually fails. By the time a fixed limit trips, the fault is usually well underway. This memo is about the layer that sits below the alarm rack (learned anomaly detection on raw process data), and specifically about the part most teams get wrong: how you decide something is anomalous when nothing has crossed a line yet.

I'll keep this narrow. Not a survey of every algorithm, but the working mechanics of detecting anomalies in multivariate time-series from process sensors: what kind of model produces the score, how you threshold that score without labelled faults, and what the public benchmarks tell you to expect before you trust any of it on a live unit. So the order here follows the order you'd build it in.

Out-of-limits is a coarse filter

The dominant form of monitoring on real plants is still the out-of-limits (OOL) check: a value is compared against high and low thresholds, and an alarm fires when it strays outside. NASA's own assessment of spacecraft monitoring made the same observation about telemetry. Despite decades of more sophisticated research, OOL approaches remain the most widely used form of anomaly detection, kept popular by low computational cost and ease of understanding, but prone to missing anomalies that occur within defined limits [1]. A processing plant is no different. A fixed band catches the gross excursion and ignores everything subtler.

It helps to be precise about what subtler means. The time-series literature splits anomalies into three kinds. Point anomalies are single values sitting in a low-density region. Collective anomalies are whole sequences that are wrong even though no single sample is. Contextual anomalies are values that look perfectly normal in isolation but are wrong for their local context [1]. An OOL limit is built to catch point anomalies and little else. In the labelled NASA telemetry set, which holds 105 real, expert-confirmed anomaly sequences drawn from the Soil Moisture Active Passive satellite and the Mars Science Laboratory rover, 41% were contextual, the kind a fixed band cannot see [1]. And there's no reason to expect process data to be cleaner. A discharge temperature that's normal at full rate is anomalous at 40% rate; the number never leaves its band.
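The discharge-temperature case can be put in a few lines. This is a minimal sketch with made-up numbers: the 95 °C limit, the 5 °C tolerance and the linear rate-to-temperature relationship in `expected_temp_c` are all assumptions for illustration, not values from any real unit.

```python
# Hypothetical numbers throughout: a discharge temperature that tracks throughput.
HIGH_LIMIT_C = 95.0   # fixed out-of-limits alarm (assumed)
TOLERANCE_C = 5.0     # allowed deviation from the rate-conditioned expectation (assumed)

def expected_temp_c(rate_pct: float) -> float:
    """Assumed healthy relationship: 40 C at zero rate, rising 0.45 C per % of rate."""
    return 40.0 + 0.45 * rate_pct

def ool_alarm(temp_c: float) -> bool:
    """Classic out-of-limits check: looks at the value alone."""
    return temp_c > HIGH_LIMIT_C

def contextual_alarm(temp_c: float, rate_pct: float) -> bool:
    """Context-aware check: compares the value with what this rate should produce."""
    return abs(temp_c - expected_temp_c(rate_pct)) > TOLERANCE_C

readings = [
    (100.0, 85.0),  # full rate, 85 C: matches the expected 85 C, healthy
    (40.0, 85.0),   # 40% rate, 85 C: expected 58 C, so 27 C too hot
]
# Each entry is (ool_alarm, contextual_alarm) for one reading.
results = [(ool_alarm(temp), contextual_alarm(temp, rate)) for rate, temp in readings]
```

The same 85 °C reading passes the fixed limit both times; only the check that knows the rate flags the second one. A learned detector plays the role of `expected_temp_c` without anyone having to write the relationship down.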

The other reason to push detection below the alarm layer is operator load. The alarm management standard ANSI/ISA-18.2-2016 organises the whole alarm system around a lifecycle, with rationalisation (the systematic justification, documentation, prioritisation and classification of every alarm against a written philosophy) as its core discipline [2]. Rationalisation exists because annunciator racks tend toward flood, so adding more fixed limits to catch subtle faults only makes the flood worse. A good anomaly detector does the opposite. It watches hundreds of tags, scores the joint behaviour, and raises one signal when the pattern, not a single tag, goes wrong.

Two ways to turn sensors into a score

Practically every useful detector reduces a wall of sensor streams to a single anomaly score per timestep, then thresholds that score. There are two families worth knowing, and they differ in what they predict.

The first is the forecast-residual model. You train a model to predict the next value of a stream from its recent history, then watch the residual, the gap between what the model expected and what the sensor actually reported. When the process behaves the way it did during training, residuals stay small. When it drifts, they grow. The NASA work built one Long Short-Term Memory (LSTM) network per telemetry channel, each shallow, with two hidden layers of 80 units, dropout of 0.3, trained on input sequences of length 250, and used the prediction-error stream as the raw anomaly signal [1]. One model per channel sounds heavy. But it keeps each model interpretable and sidesteps the problem that you cannot feed thousands of correlated streams into a single network.
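To make the residual idea concrete, here is a minimal sketch that swaps the LSTM for something much simpler: a two-term linear autoregressive predictor fitted by least squares. It is not the NASA architecture, and the synthetic oscillating signal, noise level and the "fault" (a shift in oscillation period) are all assumptions chosen so the effect is easy to see.

```python
import math
import random

def fit_ar2(series):
    """Least-squares fit of x[t] ~ a*x[t-1] + b*x[t-2] via the 2x2 normal equations."""
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(2, len(series)):
        x1, x2, y = series[t - 1], series[t - 2], series[t]
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        r1 += x1 * y
        r2 += x2 * y
    det = s11 * s22 - s12 * s12
    a = (r1 * s22 - r2 * s12) / det
    b = (r2 * s11 - r1 * s12) / det
    return a, b

def residuals(series, a, b):
    """Absolute one-step-ahead forecast error at each timestep from t=2 on."""
    return [abs(series[t] - (a * series[t - 1] + b * series[t - 2]))
            for t in range(2, len(series))]

rng = random.Random(7)
NOISE = 0.005  # assumed sensor noise (standard deviation)

def healthy(n):
    """Synthetic normal running: a steady oscillation plus noise."""
    return [math.sin(0.2 * t) + rng.gauss(0.0, NOISE) for t in range(n)]

# Train on normal running only.
a, b = fit_ar2(healthy(500))

# Live stream: 200 healthy samples, then the oscillation period shifts.
live = healthy(200)
phase = 0.2 * 200
for k in range(100):
    live.append(math.sin(phase + 0.5 * k) + rng.gauss(0.0, NOISE))

res = residuals(live, a, b)
normal_mean = sum(res[:190]) / 190               # residual while behaviour matches training
fault_mean = sum(res[205:]) / len(res[205:])     # residual after the dynamics change
```

The signal never leaves its amplitude band after the change, so a fixed limit would see nothing, yet the mean residual rises by roughly an order of magnitude because the predictor's learned dynamics no longer fit.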

The second is the reconstruction model. Here you train an autoencoder, a network that compresses a window of multivariate input down to a small latent vector and then rebuilds it, using only normal-operation data. It learns to reconstruct normal patterns accurately. Feed it an abnormal window and the reconstruction error spikes, and that error becomes the score. The approach models many sensors jointly, so it captures the inter-tag correlations that contextual anomalies live in. A 2020 method called USAD (UnSupervised Anomaly Detection) sharpened the idea by training two autoencoders adversarially, so one learns to amplify the reconstruction error of inputs the other rebuilds too easily; it was published at KDD 2020 and evaluated across five public datasets [3].
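A reconstruction score can be sketched without a neural network at all. The example below uses a one-component linear projection (the leading principal direction, found by power iteration) as a stand-in for an autoencoder; it is not USAD, and the three-tag plant model in `plant_sample` and the suspect reading are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(11)

def plant_sample(load):
    """Hypothetical healthy relationship between three tags driven by one load variable."""
    flow = 100.0 * load + rng.gauss(0.0, 1.0)           # m3/h
    pressure = 5.0 + 4.0 * load + rng.gauss(0.0, 0.05)  # bar
    temp = 40.0 + 45.0 * load + rng.gauss(0.0, 0.5)     # degC
    return [flow, pressure, temp]

# Normal-operation training data only, loads between 40% and 100%.
train = [plant_sample(rng.uniform(0.4, 1.0)) for _ in range(500)]
columns = list(zip(*train))
means = [statistics.fmean(c) for c in columns]
stds = [statistics.pstdev(c) for c in columns]

def standardise(row):
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

def top_component(rows, iters=100):
    """Leading principal direction by power iteration on the covariance matrix."""
    n = len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / len(rows) for j in range(n)]
           for i in range(n)]
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

direction = top_component([standardise(r) for r in train])

def reconstruction_error(row):
    scaled = standardise(row)
    latent = sum(x * d for x, d in zip(scaled, direction))  # encode: 3 tags -> 1 number
    rebuilt = [latent * d for d in direction]               # decode: 1 number -> 3 tags
    return sum((x - y) ** 2 for x, y in zip(scaled, rebuilt))

normal_scores = [reconstruction_error(r) for r in train]

# Every tag is inside its training range, but the combination is wrong:
# low-load flow and pressure with a near-full-load temperature.
suspect = [45.0, 6.8, 83.0]
suspect_score = reconstruction_error(suspect)
```

No single value in `suspect` would trip a band drawn around the training data, but the joint pattern cannot be rebuilt from one latent number, so its score sits far above anything seen in normal running.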

Both families share the property that matters for a plant. They train on normal running and need no catalogue of labelled faults. That's not a nicety. Most units have run for years without anyone tagging exactly when each incipient fault began, and the faults you most want to catch are the ones you haven't seen yet. So a detector that demands labelled examples of every failure mode is a detector you can't actually deploy.

Both families also share two knobs that decide everything downstream: the window and the bottleneck. Window length sets how much history the model sees at once. Too short and a slow drift never accumulates inside the window, so the model treats each step as locally normal and the residual stays flat. Too long and the model smears across operating states, blurring the transition you wanted to flag. The 250-step input window in the NASA forecasting work was a deliberate balance of context against responsiveness [1], and the same judgement applies to an autoencoder's window. The bottleneck, the size of the latent vector, sets how much an autoencoder is forced to compress. Make it too wide and the network learns an identity map that reconstructs anomalies as faithfully as normal data, killing the signal. Make it too narrow and even normal variation reconstructs badly, raising the noise floor. Neither knob has a universal value; both are set against your own data and revisited when the process changes.
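The window-length effect is easy to see with arithmetic alone. A rough sketch, assuming a drift of 0.01 units per sample against a normal scatter of about 0.5 units (both numbers invented for illustration):

```python
def sliding_windows(series, length, stride=1):
    """Return overlapping windows of `length` samples, as a model would see them."""
    return [series[i:i + length] for i in range(0, len(series) - length + 1, stride)]

def max_in_window_change(series, length):
    """Largest first-to-last change visible inside any single window."""
    return max(abs(w[-1] - w[0]) for w in sliding_windows(series, length))

DRIFT_PER_STEP = 0.01   # assumed slow drift, engineering units per sample
NOISE_BAND = 0.5        # assumed normal sample-to-sample scatter
drifting = [50.0 + DRIFT_PER_STEP * t for t in range(1000)]

short_view = max_in_window_change(drifting, 10)    # about 0.09: buried in the noise band
long_view = max_in_window_change(drifting, 250)    # about 2.49: well clear of it
```

A 10-sample window only ever sees a change far smaller than ordinary scatter, so each window looks locally normal. The 250-sample window sees the same drift accumulate to several times the noise band.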

The hard part is the threshold, not the model

A reconstruction error or a forecast residual is just a number that wiggles. Turning it into an alarm means drawing a line, and drawing that line is where most deployments quietly fail. What happens if you set it tight? You flood the operator with false positives until they mute the system. Set it loose and you miss the fault you built the thing to catch.

The naive approach (assume the errors are Gaussian and flag anything beyond a few standard deviations) breaks because smoothed prediction errors are usually not Gaussian, and a normality test will say so [1]. The NASA team's answer is worth copying. First, smooth the raw error with an exponentially-weighted moving average so a single noisy sample doesn't spike the score. Then set the threshold nonparametrically: sweep candidate thresholds over a trailing window of errors and pick the one for which removing the errors above it most reduces the mean and standard deviation of what remains, rather than assuming any distribution at all [1].
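The two steps can be sketched as follows. This is loosely modelled on the procedure described in [1], not a reproduction of it: the smoothing factor, the candidate multipliers in `z_candidates`, and the choice to divide the score by the number of flagged points are assumptions made for illustration, and the error stream is synthetic.

```python
import statistics

def ewma(values, alpha=0.3):
    """Exponentially-weighted moving average; alpha is an assumed smoothing factor."""
    smoothed, level = [], values[0]
    for v in values:
        level = alpha * v + (1 - alpha) * level
        smoothed.append(level)
    return smoothed

def nonparametric_threshold(errors, z_candidates=(2, 3, 4, 5, 6, 8, 10)):
    """Pick the candidate mean + z*std whose flagged points, once removed,
    most reduce the mean and std of the rest, per point flagged."""
    mu = statistics.fmean(errors)
    sigma = statistics.pstdev(errors)
    if sigma == 0:
        return mu
    best_score, best_threshold = 0.0, mu + max(z_candidates) * sigma
    for z in z_candidates:
        threshold = mu + z * sigma
        kept = [e for e in errors if e <= threshold]
        n_flagged = len(errors) - len(kept)
        if n_flagged == 0 or not kept:
            continue
        drop_mu = (mu - statistics.fmean(kept)) / mu
        drop_sigma = (sigma - statistics.pstdev(kept)) / sigma
        score = (drop_mu + drop_sigma) / n_flagged  # penalise flagging many points
        if score > best_score:
            best_score, best_threshold = score, threshold
    return best_threshold

# Synthetic raw error stream: a small repeating ripple with one burst at t=200..219.
pattern = [0.10, 0.12, 0.08, 0.11, 0.09]
raw = [pattern[t % 5] for t in range(300)]
for t in range(200, 220):
    raw[t] = 1.0

smoothed = ewma(raw)
threshold = nonparametric_threshold(smoothed)
flagged = [t for t, e in enumerate(smoothed) if e > threshold]
```

No distribution is assumed anywhere: the threshold falls out of the error window itself, landing above the normal ripple and below the burst peak. In practice the window would be a trailing slice of recent errors rather than the whole series.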

The numbers from that work are the part to internalise. With nonparametric thresholding alone, the system reached a precision of about 48.9%, meaning roughly one alert in two was noise [1]. Then they added a simple pruning step, which discards a flagged sequence whose error isn't sufficiently larger than the next-highest. Pruning raised precision to 87.5% while costing only 4.8 percentage points of recall, dropping it from 84.8% to 80.0% [1]. That single result is the whole engineering tradeoff in one line. A modest, deliberate sacrifice in recall bought a 38.6-point jump in precision, and precision is what keeps operators from muting you.
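A pruning step of this kind might look like the sketch below. It ranks flagged sequences by their peak smoothed error and keeps only those separated from the next-highest value by a minimum relative drop. The function name, the use of peak error per sequence, the `min_decrease` value and the example peaks are assumptions for illustration; the precision figures fed to the second helper are the ones quoted above.

```python
def prune(sequence_peaks, max_normal_error, min_decrease=0.13):
    """Keep only flagged sequences whose peak error stands clear of the next-highest.

    sequence_peaks: {sequence_id: peak smoothed error} for each flagged sequence
    max_normal_error: highest smoothed error among unflagged samples
    min_decrease: assumed minimum relative drop between consecutive ranked peaks
    """
    ranked = sorted(sequence_peaks.items(), key=lambda kv: kv[1], reverse=True)
    peaks = [p for _, p in ranked] + [max_normal_error]
    keep_upto = 0
    for i in range(1, len(peaks)):
        decrease = (peaks[i - 1] - peaks[i]) / peaks[i - 1]
        if decrease >= min_decrease:
            keep_upto = i  # everything ranked above this gap stays anomalous
    return [seq_id for seq_id, _ in ranked[:keep_upto]]

def false_alerts_per_true_alert(precision):
    """How many noise alerts the operator sees for each real one."""
    return (1.0 - precision) / precision

# Hypothetical flagged sequences: C barely clears the normal error level.
kept = prune({"A": 0.90, "B": 0.50, "C": 0.21}, max_normal_error=0.20)

before = false_alerts_per_true_alert(0.489)  # roughly one false alert per true one
after = false_alerts_per_true_alert(0.875)   # roughly one false alert per seven true ones
```

Sequence C is dropped because its peak is less than 13% above the highest normal error, while A and B survive. The second helper restates the precision jump in the terms an operator feels: false alerts per real alert fall from about one to about one in seven.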

Recall also splits hard by anomaly type, which is the other thing to plan for. On the same data, recall on point anomalies was substantially higher than on contextual ones [1]. The anomalies a fixed limit already half-handles are the ones the learned detector finds easiest, and the genuinely novel contextual faults stay hardest for everybody. So don't promise the plant a detector that catches everything. Promise one that catches a class of faults the alarm rack structurally cannot, and be honest about its blind spots.

What the public benchmarks actually contain

Before any of this touches a live unit, validate it on data where the faults are known. Four public datasets carry most of the load in industrial anomaly-detection work, and their structure tells you a lot about what a fair test looks like.

Dataset	Domain	Channels	Sampling	Origin
C-MAPSS turbofan	Gas-turbine degradation	21 sensors + 3 op. settings	per flight cycle	NASA / Saxena et al., 2008 [4]
SWaT	Water treatment	51 sensors + actuators	1 s	SUTD iTrust, 2015 [5]
Tennessee Eastman	Chemical process	41 measured + 12 manipulated	3 min (default)	Downs & Vogel, 1993 [6]
SMAP / MSL telemetry	Spacecraft	82 unique channels	1 min batches	NASA / Hundman et al., 2018 [1]

NASA's Prognostics Center of Excellence repository hosts the C-MAPSS turbofan set: four subsets (FD001 through FD004) of simulated run-to-failure trajectories under different combinations of operating conditions and fault modes, each cycle recording operational settings and sensor channels of temperatures, pressures and shaft speeds [4]. The same repository carries bearing run-to-failure data from the University of Cincinnati's Center for Intelligent Maintenance Systems and the FEMTO-ST accelerated-life tests [4]. Those are the closest public analogues to rotating-equipment monitoring on a real plant, and worth pulling before you simulate anything yourself.

The Secure Water Treatment (SWaT) testbed is the one I point process engineers at, because it's a physical six-stage plant, not a simulation. It processes roughly 19 litres of water per minute, fully instrumented and logged at one-second resolution; the public dataset covers seven days of normal operation followed by four days during which 41 distinct attacks were staged [5]. The one-second sampling matters. It's the regime where reconstruction models earn their keep, and it's representative of fast process loops rather than the per-cycle sampling of the turbofan data. It also makes the dataset large enough that the difference between a real method and an overfit one shows up clearly.

The Tennessee Eastman process, defined by Downs and Vogel in 1993, remains the canonical chemical-process benchmark: a simulated plant of reactor, condenser, compressor, separator and stripper, with twenty preprogrammed process disturbances to detect [6]. Its default three-minute composition sampling is a useful reminder that not every tag arrives at the same rate. Analyser data is slow and laggy, and your feature pipeline has to handle mixed sampling without quietly forward-filling a fault into invisibility.

Two cautions about benchmark scores. First, a detector tuned to SWaT's one-second water-treatment dynamics will not transfer unchanged to per-cycle turbine data; the sampling rate and the physics both change what normal looks like. Second, point-adjusted scoring, where flagging any single sample inside a labelled anomaly window counts the whole window as caught, inflates results, and the field leans on it heavily. So read benchmark precision and recall the way you'd read a vendor's efficiency curve. Useful, and measured under conditions that flatter it.
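The inflation from point-adjusted scoring is easy to demonstrate. A small sketch with a hypothetical 20-sample stream, one labelled anomaly window of 10 samples, one hit inside it and one false alarm outside it:

```python
def point_adjust(predictions, labels):
    """If any sample inside a labelled anomaly segment is flagged,
    mark every sample in that segment as flagged."""
    adjusted = list(predictions)
    i = 0
    while i < len(labels):
        if labels[i]:
            j = i
            while j < len(labels) and labels[j]:
                j += 1
            if any(predictions[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

def precision_recall(predictions, labels):
    """Sample-level precision and recall."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = [0] * 5 + [1] * 10 + [0] * 5   # anomaly window at samples 5..14
predictions = [0] * 20
predictions[9] = 1    # a single hit inside the window
predictions[17] = 1   # a single false alarm outside it

raw = precision_recall(predictions, labels)                             # (0.5, 0.1)
adjusted = precision_recall(point_adjust(predictions, labels), labels)  # (10/11, 1.0)
```

The detector's output did not change, yet recall goes from 10% to 100% and precision from 50% to about 91%. When comparing published numbers, check which convention was used before treating them as comparable.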

Getting it onto the unit

The model is the easy 20%. Deployment is where a detector lives or dies, and the constraints are ones every controls engineer already knows.

Before any of it, you need a clean baseline of normal. Both model families learn what normal looks like from the data you give them, so a training set quietly contaminated with an undiagnosed fault (a fouled exchanger, a sticking valve, a sensor reading low) teaches the model that the fault is normal, and it will never flag the recurrence. So curating that baseline is real work, not a data dump: pull a span the operators agree was a good run, exclude commissioning and trip-recovery transients unless you want them treated as normal, and check that every tag you intend to score was actually healthy across the window. Garbage-normal in, blind detector out. This is the least glamorous step and the one that most often decides whether the thing works.

Start with sampling and acquisition. Decide the rate per signal before anything else. Fast loops, meaning flow, pressure and vibration, want sub-second to one-second data, the SWaT regime. Thermal and degradation signals can live at minutes. Pull tags off the DCS or PLCs over OPC-UA or Modbus, timestamp at the source, and resist the urge to resample everything to a single clock. A window that aligns a one-second flow with a three-minute analyser by forward-filling has erased the very residual you're hunting. Mixed-rate handling is a feature, not a nuisance.
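One way to handle mixed rates without hiding staleness is to carry the age of the slow sample alongside its value, so downstream code knows when an analyser reading is new and when it is minutes old. The sketch below is one option among several; the function name, the timestamps and the analyser values are hypothetical.

```python
from bisect import bisect_right

def align_with_age(fast_times, slow_times, slow_values):
    """For each fast timestamp, return (last slow value, seconds since it was sampled).

    Timestamps are seconds and slow_times must be sorted ascending.
    Returns (None, None) where no slow sample has arrived yet.
    """
    aligned = []
    for t in fast_times:
        k = bisect_right(slow_times, t) - 1  # index of the latest slow sample at or before t
        if k < 0:
            aligned.append((None, None))
        else:
            aligned.append((slow_values[k], t - slow_times[k]))
    return aligned

fast_times = list(range(360))   # a one-second flow tag over six minutes
slow_times = [0, 180]           # a three-minute analyser
slow_values = [2.1, 2.4]        # hypothetical composition readings

aligned = align_with_age(fast_times, slow_times, slow_values)
fresh_rows = sum(1 for _, age in aligned if age == 0)  # rows carrying new analyser information
```

Of 360 aligned rows, only two carry genuinely new analyser information; the other 358 repeat an old value. Keeping the age explicit lets the model, or the scoring logic, treat a 179-second-old reading differently from a fresh one instead of mistaking a held value for a steady process.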

Then decide where the model runs. The shallow per-channel networks in the NASA work were deliberately small, two layers and 80 units, and that's the right instinct for the edge [1]. Reconstruction and forecast scoring at one-second cadence is cheap enough to run on an industrial gateway next to the line, which keeps the detection loop local even when the network back to the historian drops. And keeping it local is the difference between a detector that works during an upset and one that goes dark exactly when you need it.

Plan for drift. A detector trained on last quarter's normal will start crying wolf after a feedstock change, a catalyst swap, or a seasonal cooling-water shift, because the plant's normal moved and the model didn't. So plan retraining as routine maintenance, validate each retrain against held-out faults before it goes live, and never let a model silently absorb a developing fault into its definition of normal. That's how you accidentally train a detector to ignore the exact degradation it exists to catch.
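The validation step can be written as a simple gate that a retrained model must pass before it replaces the live one. A minimal sketch, assuming you keep a replay set with known fault windows; the function name, the false-alarm limit and the example scores are all hypothetical.

```python
def passes_retrain_gate(scores, threshold, fault_windows, max_false_alarm_rate=0.01):
    """Accept a retrained model only if it still catches every held-out fault
    and stays quiet on the normal stretches of the replay set.

    scores: anomaly score per timestep from the candidate model on the replay set
    fault_windows: list of (start, end) index pairs, end exclusive, for known faults
    """
    in_fault = [False] * len(scores)
    for start, end in fault_windows:
        for t in range(start, end):
            in_fault[t] = True

    every_fault_caught = all(
        any(scores[t] > threshold for t in range(start, end))
        for start, end in fault_windows
    )
    normal_scores = [s for s, f in zip(scores, in_fault) if not f]
    false_alarms = sum(1 for s in normal_scores if s > threshold)
    false_alarm_rate = false_alarms / len(normal_scores) if normal_scores else 0.0
    return every_fault_caught and false_alarm_rate <= max_false_alarm_rate

fault_windows = [(50, 60)]

# A healthy retrain still scores the held-out fault high.
good_scores = [0.9 if 50 <= t < 60 else 0.1 for t in range(100)]
# A retrain that absorbed the fault into "normal" scores it like everything else.
absorbed_scores = [0.1] * 100

good_ok = passes_retrain_gate(good_scores, 0.5, fault_windows)
absorbed_ok = passes_retrain_gate(absorbed_scores, 0.5, fault_windows)
```

The second model would look excellent on a false-alarm count alone, since it never alarms at all. Requiring every held-out fault to be caught is what exposes it.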

And keep the operator in the loop. The NASA system was built around expert review and labelling of what it flagged, feeding human judgement back in [1]. Do the same. Route anomaly scores to the people who run the unit, give them a way to mark a flag right or wrong, and use that feedback to tune thresholds. An anomaly detector that bypasses the control room and emails a dashboard nobody owns will be muted within a month. One that hands the panel operator an early, trustworthy "something's drifting on K-201" earns its place. That handoff, from model to operator with honest precision and a known blind spot for contextual faults, is the whole point, and it's the part of any real industrial AI deployment that takes the most care to get right.

None of this replaces the alarm rack. Fixed limits are the safety backstop and they stay. What learned anomaly detection adds is lead time, a class of early, correlation-level warnings the OOL layer was never built to give. Score it honestly, threshold it carefully, keep the operator in the loop, and you catch the fault while it's still a drift instead of a trip.
