From Historian to Lakehouse: The OT Data Pipeline

How years of plant tag history move out of the historian into an open, queryable lakehouse, walked layer by layer from sensor to served model.

Start at the finish line. What a historian-to-lakehouse build gets you is a single queryable store holding years of tag history in an open file format, sitting on cheap object storage, readable by SQL and Python at the same time, contextualized against your asset hierarchy, and feeding dashboards and models without anyone touching the DCS. Your historian keeps doing its job at the edge. The lakehouse becomes the place where analytics, reporting, and machine learning actually run. This piece walks the path from the sensor to that store, layer by layer, and flags the one spot where most rollouts quietly fail.

What the lakehouse adds that the historian can't

A process historian is built for one thing and does it well: swallow a firehose of timestamped tags off the control system and keep it for years without dropping points. It compresses aggressively, answers trend queries fast, and survives bad network days. None of that is in question. Where it hits a wall is analytical. Joining tag history against batch records, lab results, maintenance logs, and energy meters in one query is awkward inside a historian's own data model. So why not just point the data scientists at the live system for a year of training data? Because that read is slow, it competes with the capture workload, and in most plants it is politically a non-starter. And the per-tag licensing model gets expensive once you decide you want to store everything rather than only what an engineer flagged.

The lakehouse pattern was defined to close exactly this gap. A 2023 survey of the architecture describes it as breaking data warehousing into independent components built on open standards, so one store serves both BI-style queries and data-science workloads instead of copying data into a separate warehouse first [1]. You keep raw data in open files on object storage, then add a table layer on top that supplies the transactions, schema control, and metadata a warehouse would. Your historian stays the operational system of record. The lakehouse sits next to it as the analytical plane. This is an addition, not a rip-and-replace, and reading it any other way is how these projects pick unnecessary fights with the controls team.

| Concern | Process historian | Lakehouse |
| --- | --- | --- |
| Primary job | Capture and retain live OT tags | Analytics, reporting, ML over combined data |
| Storage | Proprietary, often per-tag licensed | Open files on commodity object storage |
| Query reach | Tag trends, within its own model | SQL and Python joins across all sources |
| Schema changes | Limited, vendor-dependent | In-place evolution via table format |
| Best placement | At or near the plant edge | IT side, behind the OT boundary |

Step 1 — Extract without disturbing the control loop

Rule one is that getting data out must not put the process at risk. The cleanest way to honor that is a read-only path running parallel to the control system rather than through it. NAMUR formalized this as its Open Architecture: a second channel that carries field and process data to monitoring and optimization applications in a non-reactive way, over OPC UA, without changing the existing automation network [2]. Your closed-loop world stays intact. Everything pulled for analytics travels a separate, lower-trust route, and a failure on the analytics side can never reach back into control.

In practice the source is one of three things: the historian's own export or replication interface, a direct OPC UA subscription to the PLCs and DCS, or an edge gateway that already aggregates both. OPC UA earns its place here for more than transport. Its address space carries the semantics of the data, not just values, and companion specifications let industry groups publish standard information models for machine tools, robots, AutoID, and the like, so a client can discover what a server exposes at connection time [3]. Capture that structure at extraction and you save yourself a contextualization fight later. Flatten everything to bare tag names and you inherit that fight downstream, after the data has already piled up.
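To make the difference between capturing structure and flattening it concrete, here is a minimal sketch of the record an extractor might emit. The field names (`node_id`, `browse_path`, `engineering_unit`, and so on) and the example node are assumptions for illustration only; this is not an OPC UA client API, just the shape of the data worth keeping.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TagSample:
    """One extracted sample plus the context read from the server (hypothetical schema)."""
    node_id: str            # server-side identifier, kept verbatim
    browse_path: tuple      # position in the address space, root first
    engineering_unit: str   # unit as exposed by the server, not guessed later
    source_ts: datetime     # timestamp assigned at the source
    status_code: int        # quality as reported; 0 means Good
    value: float


def flatten(sample: TagSample) -> dict:
    """What a bare-tag-name export keeps: the leaf name, the time, and the value."""
    return {"tag": sample.browse_path[-1], "ts": sample.source_ts, "value": sample.value}


sample = TagSample(
    node_id="ns=2;s=Line2.Dryer1.TT101",               # illustrative NodeId
    browse_path=("Plant", "Line2", "Dryer1", "TT101"),  # illustrative hierarchy
    engineering_unit="degC",
    source_ts=datetime(2025, 1, 15, 8, 0, tzinfo=timezone.utc),
    status_code=0,
    value=81.4,
)

# The flattened form has no unit, no quality, and no position in the plant.
print(flatten(sample))
```

Everything the flattened dictionary drops is what a later contextualization step would have to reconstruct by hand.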

Extraction runs continuously, not as a nightly dump. An edge agent subscribes to changes, typically report-by-exception so a steady signal doesn't generate redundant traffic, buffers locally when the link drops, and forwards to the landing zone when it recovers. Buffering is not optional. The OT-to-IT hop is the least reliable link in the chain, and a writer that loses points during a network blip produces gaps that look like process events and aren't. Decide early how much local buffer the edge holds, because that number sets how long an outage you can ride out before history actually goes missing.
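The behaviour described above can be sketched in a few lines. This is a toy model under stated assumptions, not a real edge agent: the class name `EdgeBuffer`, the simple absolute deadband, and the drop-oldest overflow policy are all choices made for illustration, and a production agent would persist the queue to disk.

```python
from collections import deque


class EdgeBuffer:
    """Report-by-exception filter feeding a bounded store-and-forward queue."""

    def __init__(self, deadband: float, capacity: int):
        self.deadband = deadband
        self.capacity = capacity
        self.queue = deque()
        self.last_sent = {}   # tag -> last value that was queued
        self.dropped = 0      # samples lost because the buffer overflowed

    def offer(self, tag: str, ts: float, value: float) -> bool:
        """Queue the sample only if it moved by more than the deadband."""
        prev = self.last_sent.get(tag)
        if prev is not None and abs(value - prev) <= self.deadband:
            return False      # steady signal, nothing worth sending
        if len(self.queue) >= self.capacity:
            self.queue.popleft()   # assumed policy: the oldest sample is sacrificed
            self.dropped += 1
        self.queue.append((tag, ts, value))
        self.last_sent[tag] = value
        return True

    def flush(self, send) -> int:
        """Forward queued samples in order; stop at the first failed send."""
        sent = 0
        while self.queue:
            if not send(self.queue[0]):
                break         # link is down, keep the sample for the next attempt
            self.queue.popleft()
            sent += 1
        return sent


def ride_through_seconds(capacity: int, samples_per_second: float) -> float:
    """How long an outage the buffer covers before it starts dropping."""
    return capacity / samples_per_second


# A buffer of one million samples at 500 changes per second covers about 33 minutes.
print(ride_through_seconds(1_000_000, 500) / 60)
```

The `dropped` counter is the number to alarm on: once it is non-zero, the gaps downstream are pipeline artifacts rather than process events.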

Step 2 — Land it raw on object storage (bronze)

Data arrives first as raw files, untouched, in an object store. This is the bronze layer, and the discipline is to write what you received before transforming anything. Keep the original timestamps, the quality flags, and the source tag identity. If a transform later turns out wrong, you reprocess from bronze instead of re-extracting from a plant you may not be able to re-query.

Land it as Apache Parquet. Parquet is a columnar file format: it splits a table into row groups and stores each column separately, with its own encoding and compression, and a reader pulls only the columns it needs after consulting the file metadata [4]. For process data, where a query typically wants three tags out of four thousand across a date range, columnar layout is the difference between scanning gigabytes and scanning megabytes. Time-series values compress well column by column too, because neighbouring readings are similar, so storage stays cheap even when you stop discarding data.
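A rough back-of-envelope calculation shows where the gigabytes-versus-megabytes gap comes from. The assumptions are mine, not measurements: a wide table with one column per tag, one sample per second, 8-byte values, and no encoding or compression. The timestamp column and file metadata are ignored.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_VALUE = 8  # assumed float64, before any encoding or compression


def scan_bytes(columns_read: int, sample_hz: float = 1.0, days: int = 1) -> int:
    """Uncompressed bytes touched when reading `columns_read` value columns."""
    rows = int(SECONDS_PER_DAY * sample_hz * days)
    return rows * columns_read * BYTES_PER_VALUE


all_tags = scan_bytes(4000)   # a row-oriented read has to pass over every tag
three_tags = scan_bytes(3)    # a columnar read touches only the requested ones

print(f"all columns:   {all_tags / 1e9:.2f} GB per day")    # 2.76 GB
print(f"three columns: {three_tags / 1e6:.2f} MB per day")  # 2.07 MB
```

Real files will be smaller on both sides once compression applies, but the ratio between the two reads is the part that holds.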

That last point is the quiet economic shift. Historians compress with exception and swinging-door algorithms because their storage was expensive and on-prem. Object storage is cheap enough now that you can keep full-fidelity raw history and treat compression as a column-encoding concern rather than a decision about which points to throw away at the source. You stop pre-deciding what future analysis will need, which matters because the question a model wants to ask in two years is rarely the question you optimized capture for today.

Partition the bronze data the way it will be read. For plant time-series that almost always means partitioning by date, and often by area or line under that, so a query for one unit over one week touches a handful of files rather than the whole history. Getting the partition layout roughly right at landing time saves expensive reorganization later, though, as the next step shows, the table format can soften that mistake if you make it.
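One common way to express that layout is the `key=value` directory convention. The sketch below assumes a date-then-area scheme, UTC as the partitioning clock, and hypothetical names such as `bronze/tags` and `line2`; it only builds paths and does not touch any storage.

```python
from datetime import date, datetime, timedelta, timezone


def partition_path(root: str, ts: datetime, area: str) -> str:
    """Directory for one sample: UTC date first, then area. `ts` must be timezone-aware."""
    day = ts.astimezone(timezone.utc).date().isoformat()
    return f"{root}/date={day}/area={area}"


def partitions_for_query(root: str, start: date, end: date, area: str) -> list:
    """Directories a reader must list for one area over an inclusive date range."""
    days = (end - start).days + 1
    return [
        f"{root}/date={(start + timedelta(days=i)).isoformat()}/area={area}"
        for i in range(days)
    ]


# One unit over one week resolves to seven directories, however long the history is.
for path in partitions_for_query("bronze/tags", date(2025, 3, 1), date(2025, 3, 7), "line2"):
    print(path)
```

Partitioning on the UTC date rather than plant-local time is a design choice: it avoids a short or doubled partition at daylight-saving changes, at the cost of shift boundaries not lining up with partition boundaries.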

Step 3 — Put a table format over the files (silver)

Raw Parquet on object storage is a pile of files, not a table. Without a layer above it, concurrent writes corrupt each other, there are no transactions, and a half-finished job leaves readers seeing partial data. A table format is what turns the pile into something a query engine can trust, and two open ones dominate: Apache Iceberg and Delta Lake.

Delta Lake, published at VLDB in 2020, keeps an ordered transaction log alongside the Parquet data. That log records which files belong to the table, and it is periodically checkpointed back into Parquet so metadata reads stay fast; it is what delivers ACID guarantees, time travel to earlier table versions, and schema handling on top of plain object storage [5]. Iceberg solves the same problem with a tree of metadata and manifest files tracking snapshots of the table state. Either way you can query the table as it stood at any past commit that is still within the table's retention window, which for an operations team means you can reproduce exactly the data a model trained on or a report was built from when someone challenges the result. Set that retention deliberately, because once old versions are cleaned up they can no longer be queried.
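The idea of a log that defines the table can be shown with a toy model. This is not the Delta Lake protocol or the Iceberg metadata layout; `ToyTableLog` and the file names are hypothetical, and the sketch keeps only the core mechanism: the table is whatever you get by replaying commits up to a chosen version.

```python
class ToyTableLog:
    """Ordered commit log: each commit adds and removes data files as one unit."""

    def __init__(self):
        self.commits = []  # position in this list is the table version

    def commit(self, add=(), remove=()) -> int:
        """Append one commit and return its version number."""
        self.commits.append({"add": tuple(add), "remove": tuple(remove)})
        return len(self.commits) - 1

    def files_at(self, version=None) -> set:
        """Replay the log up to `version` to get the files that make up the table."""
        if version is None:
            version = len(self.commits) - 1
        live = set()
        for entry in self.commits[: version + 1]:
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return live


log = ToyTableLog()
v0 = log.commit(add=["part-000.parquet"])
v1 = log.commit(add=["part-001.parquet"])
# A rewrite job replaces two files with one in a single commit.
v2 = log.commit(add=["part-002.parquet"], remove=["part-000.parquet", "part-001.parquet"])

print(sorted(log.files_at(v1)))  # the table as the model training run saw it
print(sorted(log.files_at()))    # the table now
```

Because a reader only ever sees whole commits, a half-finished job contributes nothing until its commit lands, which is the property raw files on object storage lack.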

The feature that earns its keep over the life of a plant is schema evolution. Instruments get added, tags get renamed, a new line comes online. Iceberg supports adding, dropping, renaming, and reordering columns, plus evolving the partition scheme, as metadata-only operations that don't rewrite existing data [6]. So when the plant changes, the table changes with it, and last year's history stays readable through the change. The format keeps moving as well: Iceberg's v3 specification, ratified in mid-2025, added binary deletion vectors, which make row-level corrections and merges cheaper than rewriting whole data files, and row-level lineage, which makes it easier to track which rows changed between versions [7].
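Why a rename can be metadata-only is easier to see in miniature. Iceberg tracks each column by a unique ID rather than by name; the toy below borrows that idea and nothing else. `ToySchema`, the column names, and the row representation are all assumptions for illustration.

```python
class ToySchema:
    """Columns are tracked by a stable numeric ID; names are only labels."""

    def __init__(self):
        self.names = {}    # field ID -> current column name
        self.next_id = 1

    def add_column(self, name: str) -> int:
        """Register a new column and return its permanent ID."""
        field_id = self.next_id
        self.names[field_id] = name
        self.next_id += 1
        return field_id

    def rename_column(self, old: str, new: str) -> None:
        """Metadata-only change: the ID, and so the stored data, is untouched."""
        field_id = next(fid for fid, name in self.names.items() if name == old)
        self.names[field_id] = new

    def read(self, stored_row: dict) -> dict:
        """Project a row stored by field ID into the current column names."""
        return {name: stored_row.get(fid) for fid, name in self.names.items()}


schema = ToySchema()
ts_id = schema.add_column("ts")
temp_id = schema.add_column("TT101")

# A row written last year, keyed by field ID rather than by name.
old_row = {ts_id: "2024-06-01T00:00:00Z", temp_id: 81.4}

schema.rename_column("TT101", "dryer1_outlet_temp_c")  # tag renamed
schema.add_column("dryer1_feed_flow")                  # new instrument added

print(schema.read(old_row))
```

The old row is never rewritten: it reads back under the new name, and the column that did not exist when it was written comes back empty.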

Silver is also where you clean and conform. Bad-quality samples get flagged using the OPC UA quality codes you preserved, duplicate points from buffer replays get deduplicated, units get normalized, and tags get parsed into consistent identifiers. What comes out is trustworthy, typed, queryable plant data. It is not yet organized around how the business thinks about the plant, though. That's the next step.
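A minimal sketch of that conformance pass follows. The quality check relies on the OPC UA convention that the top two bits of the 32-bit StatusCode carry severity (good, uncertain, bad); everything else, including the record keys, the unit labels, and the upper-casing rule for tag identifiers, is an assumption made for the example.

```python
GOOD, UNCERTAIN, BAD = "good", "uncertain", "bad"


def quality_class(status_code: int) -> str:
    """Classify an OPC UA StatusCode by its severity bits (the top two of 32)."""
    severity = (status_code >> 30) & 0b11
    if severity == 0:
        return GOOD
    if severity == 1:
        return UNCERTAIN
    return BAD


def to_celsius(value: float, unit: str) -> float:
    """Normalize a temperature to degC; the unit labels are illustrative."""
    if unit == "degC":
        return value
    if unit == "degF":
        return (value - 32.0) * 5.0 / 9.0
    if unit == "K":
        return value - 273.15
    raise ValueError(f"unknown unit: {unit}")


def conform(samples: list) -> list:
    """Deduplicate on (tag, ts), flag quality, and normalize units."""
    seen = set()
    out = []
    for s in samples:
        tag = s["tag"].strip().upper()   # assumed identifier convention
        key = (tag, s["ts"])
        if key in seen:
            continue                     # replayed by the edge buffer after a reconnect
        seen.add(key)
        out.append({
            "tag": tag,
            "ts": s["ts"],
            "value_c": to_celsius(s["value"], s["unit"]),
            "quality": quality_class(s["status_code"]),
        })
    return out
```

Bad samples are flagged rather than dropped, so a downstream query can decide for itself whether to exclude them.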

Reading it back: engines and the catalog

One property makes the open formats worth the trouble: storage and compute are separated. The data sits in your object store in Parquet under an Iceberg or Delta table, and any engine that understands the format can read it without importing a copy. A SQL engine drives the reporting dashboards, a Spark or Python job builds model features off the same tables, and an analyst can point an ad-hoc query at last quarter's history, all against one set of files. No engine owns the data, so you can change engines later without migrating anything.

Holding that together is a catalog: the service that tracks which tables exist and where their current metadata lives. It is also the natural place to attach access control and lineage. Treat the catalog as part of the design rather than an afterthought, because a lakehouse without a governed catalog drifts into a directory of mystery files within a year, and nobody can say which table is authoritative for a given measurement.
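The two jobs a catalog does, pointing at current metadata and gating access, fit in a small toy. `ToyCatalog`, the table name, and the metadata paths are hypothetical, and this single-process sketch only imitates the compare-and-swap that a real catalog service has to perform atomically.

```python
class ToyCatalog:
    """Maps table names to the location of their current metadata."""

    def __init__(self):
        self.pointers = {}  # table name -> current metadata location
        self.grants = {}    # table name -> set of principals allowed to read

    def commit(self, table: str, expected, new_location: str) -> bool:
        """Compare-and-swap: succeed only if nobody committed since `expected`."""
        if self.pointers.get(table) != expected:
            return False    # a concurrent writer won; the caller must retry
        self.pointers[table] = new_location
        return True

    def grant_read(self, table: str, principal: str) -> None:
        """Allow one principal to resolve one table."""
        self.grants.setdefault(table, set()).add(principal)

    def load(self, table: str, principal: str) -> str:
        """Resolve the table for a reader, enforcing access at the catalog."""
        if principal not in self.grants.get(table, set()):
            raise PermissionError(f"{principal} may not read {table}")
        return self.pointers[table]


catalog = ToyCatalog()
catalog.commit("silver.tags", None, "meta/v1.json")            # table created
stale = catalog.commit("silver.tags", None, "meta/v1b.json")   # writer with a stale view
catalog.commit("silver.tags", "meta/v1.json", "meta/v2.json")  # normal update
catalog.grant_read("silver.tags", "analyst")

print(stale)                                    # the stale writer was rejected
print(catalog.load("silver.tags", "analyst"))   # every engine resolves the same pointer
```

Because every engine resolves tables through the same pointer, there is exactly one answer to which version of a table is current.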

Separation of storage and compute changes the economics too. You size compute to the query you're running right now and shut it down afterward, while the data sits at rest on object storage costing the same whether anyone queries it or not. A heavy model-training job spins up its own engine, runs against the gold tables, and releases the capacity; the reporting dashboard reads through a small always-on engine; neither contends with the other or with the edge capture. That elasticity is hard to get from a historian, where the same server both captures and answers every query, and a demanding analytical request can compete with the job that keeps the data flowing in.

Step 4 — Contextualize and serve (gold)

A timestamp and a value mean little on their own. The gold layer attaches them to an asset model, so a reading belongs to a specific sensor, on a specific unit, in a specific area, on a specific line. The natural frame for that model is ISA-95, the standard (published internationally as IEC 62264) that defines the equipment hierarchy and the levels running from physical process up through control, manufacturing operations, and enterprise planning [8]. Model your gold tables against that hierarchy and a query can ask for energy per tonne on Line 2 last quarter without an engineer hand-mapping forty tag names first.
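Here is a small illustration of what the asset model buys you, using an in-memory SQLite database as a stand-in for the lakehouse engine. The table names, tag names, hierarchy columns (a simplified site, area, line, unit chain loosely following the equipment hierarchy), and all the numbers are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE asset (
    tag TEXT PRIMARY KEY, site TEXT, area TEXT, line TEXT, unit TEXT, measure TEXT
);
CREATE TABLE reading (tag TEXT, day TEXT, value REAL);
INSERT INTO asset VALUES
    ('EM201', 'PlantA', 'Drying', 'Line 2', 'Dryer 1', 'energy_kwh'),
    ('EM202', 'PlantA', 'Drying', 'Line 2', 'Dryer 2', 'energy_kwh'),
    ('WT210', 'PlantA', 'Drying', 'Line 2', 'Outfeed', 'output_tonnes'),
    ('EM101', 'PlantA', 'Drying', 'Line 1', 'Dryer 1', 'energy_kwh');
INSERT INTO reading VALUES
    ('EM201', '2025-01-01', 1200.0),
    ('EM202', '2025-01-01', 800.0),
    ('WT210', '2025-01-01', 40.0),
    ('EM101', '2025-01-01', 999.0);
""")


def energy_per_tonne(line: str) -> float:
    """kWh per tonne for one line, resolved through the asset table, not tag names."""
    row = con.execute(
        """
        SELECT SUM(CASE WHEN a.measure = 'energy_kwh' THEN r.value END)
             / SUM(CASE WHEN a.measure = 'output_tonnes' THEN r.value END)
        FROM reading r
        JOIN asset a ON a.tag = r.tag
        WHERE a.line = ?
        """,
        (line,),
    ).fetchone()
    return row[0]


print(energy_per_tonne("Line 2"))  # 2000 kWh over 40 tonnes
```

The query names a line and two kinds of measurement; which tags those resolve to is the asset table's problem, so adding a third energy meter to Line 2 changes a row of metadata rather than the query.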

Gold tables are purpose-shaped: a feature table for a predictive-maintenance model, a daily energy-and-throughput rollup for reporting, an OEE table joining downtime events to production counts. Because everything sits in one open store, these tables join freely against lab data, batch records, and maintenance history that never lived in the historian at all. This is the combined-source query a historian alone could never give you, and it is where instrumented plant data starts paying back. It is the layer your live models read from and the surface a maintenance engineer's report is built on, both drawing on the same governed tables rather than four exports that disagree.
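As one worked example of a purpose-shaped table, OEE is conventionally the product of availability, performance, and quality. The sketch below computes one shift's row from downtime events and production counts; the function signature, the event representation as (start, end) minutes, and the figures are assumptions for illustration, and inputs are not guarded against zero planned time or zero count.

```python
def oee_row(planned_min: float, downtime_events: list,
            ideal_rate_per_min: float, total_count: int, good_count: int) -> dict:
    """One gold-table row: OEE and its three factors for a single shift."""
    downtime_min = sum(end - start for start, end in downtime_events)
    run_min = planned_min - downtime_min

    availability = run_min / planned_min                        # share of planned time running
    performance = total_count / (ideal_rate_per_min * run_min)  # actual vs ideal output
    quality = good_count / total_count                          # share of output that is good

    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }


# An 8-hour shift with two stops, an ideal rate of 10 units per minute,
# 3780 units made and 3591 of them good.
row = oee_row(
    planned_min=480,
    downtime_events=[(100, 140), (300, 320)],  # minutes into the shift
    ideal_rate_per_min=10,
    total_count=3780,
    good_count=3591,
)
print({k: round(v, 4) for k, v in row.items()})
```

Downtime typically comes from maintenance or MES records and counts from the line, which is why this row only exists once both sources sit in the same store.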

Where most rollouts go wrong

Here's the failure I see most: teams treat the lakehouse as a dumping ground and skip the contract. Tags land with no asset context, no schema enforcement, no agreed identity, and within a year there are six versions of the same measurement under five names, plus millions of tiny files because the ingest wrote one object per message. What happens then? Every query becomes an archaeology project, the store technically holds everything and is useful for nothing, and the people who were promised insight go back to spreadsheets.

The fix is boring and it works. Define the schema and the asset model before you bulk-load, not after. Enforce them at the silver boundary so malformed data is quarantined, not silently absorbed. Compact small files on a schedule so the metadata layer isn't tracking millions of kilobyte objects, since the small-files problem will throttle query performance long before storage cost becomes the issue. And treat the medallion layers as a contract: bronze is immutable raw, silver is conformed and trustworthy, gold is business-shaped. The reason the lakehouse pattern bundles a table format in the first place is to give object storage the transactions and schema discipline a warehouse always had [1]. Skip that discipline and you've built an expensive data swamp with a fashionable name.
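The compaction step in that list is mostly a grouping problem. The following is a sketch of the planning logic only, under my own assumptions: a greedy pass over file names in sorted order, a hypothetical `plan_compaction` helper, and sizes passed in as a plain dictionary. Table formats provide their own compaction operations, and those are what you would run in practice.

```python
def plan_compaction(file_sizes: dict, target_bytes: int) -> list:
    """Greedily group files smaller than the target into batches to rewrite together."""
    small = sorted(name for name, size in file_sizes.items() if size < target_bytes)

    batches, current, current_size = [], [], 0
    for name in small:
        size = file_sizes[name]
        if current and current_size + size > target_bytes:
            batches.append(current)   # this batch is full, start another
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)

    # Rewriting a single file on its own gains nothing, so drop those batches.
    return [batch for batch in batches if len(batch) > 1]


sizes = {"a": 10, "b": 20, "c": 30, "d": 200, "e": 50}
print(plan_compaction(sizes, target_bytes=64))
```

Files already at or above the target are left alone, and each batch would be committed as a single table change that removes the small files and adds the merged one.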

Security and ownership along the path

Moving plant data toward IT crosses a trust boundary, and the architecture has to respect it. NIST's Guide to Operational Technology Security, SP 800-82 Revision 3, published in 2023, is the reference: it expanded its scope from industrial control systems to OT broadly and added an OT overlay that tailors security control baselines for low-, moderate-, and high-impact systems [9]. The implications for this pipeline are direct. Extraction is read-only and one-directional, so a compromise of the analytics side can't write back into control. The data path crosses a segmented boundary, ideally through a DMZ, consistent with the layered defence the guide describes. And access to the lakehouse is governed through the catalog, independently of access to the control system, so granting an analyst a year of history is never the same act as granting reach into the DCS.

Ownership is the other half. The historian is the operational system of record; the lakehouse is a downstream copy for analytics, and it should be labelled and treated as such so no one mistakes a reprocessed gold table for the authoritative measurement. Time travel helps here too, because the table format's versioning gives you an audit trail of what the analytical data looked like at any point, which matters when a model's decision gets questioned months later [5].

What to keep and what to retire

Nothing on this path retires the historian. It stays at the edge as the resilient, real-time capture and operational-trend system it was built to be, close to the process and able to ride out network outages. What you retire are the workarounds: the nightly CSV exports to a shared drive, the analyst with a standing query against production, the separate warehouse you were copying data into for reporting. Those collapse into one open store that BI and data science read from directly.

The sequence to build it is the sequence you just read: extract on a non-reactive read-only channel, land raw Parquet in bronze, impose a table format and conform to silver, contextualize against your asset model into gold, and govern the whole path under an OT security baseline. Each layer is independently replaceable because each rests on an open format rather than a vendor's proprietary one. That is the real payoff of the lakehouse approach for a plant: not a single product, but a decomposed, open pipeline you can evolve one layer at a time as the operation, and the formats, keep changing [1].

If you'd rather not assemble all of that yourself, this is the kind of build our industrial AI deployment work handles end to end, and the gold layer is exactly what the Zoniax edge telemetry and analytics platform reads from to drive live models. Either way, the architecture above is the part worth getting right first.

References

  1. The Data Lakehouse: Data Warehousing and More (arXiv:2310.08697, 2023)
  2. NAMUR Open Architecture (NOA) — NAMUR
  3. UA Companion Specifications — OPC Foundation
  4. Apache Parquet — File Format — Apache Software Foundation
  5. Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores — PVLDB Vol. 13 No. 12, 2020
  6. Evolution — Apache Iceberg documentation
  7. Iceberg Table Spec — Apache Software Foundation
  8. ISA-95 / IEC 62264 — Enterprise-Control System Integration — ISA
  9. Guide to Operational Technology (OT) Security — NIST SP 800-82 Rev. 3, 2023

Reuse & license

This article is published by Zoniax Innovations LLC under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt it for any purpose, including commercially, as long as you give appropriate credit to Zoniax and link back to the original article.

Disclaimer

These Field Notes are general technical information, published as-is for industry peers. They are not professional, engineering, safety, legal, or financial advice, and nothing here is a recommendation to buy, sell, or act. Figures are cited from public sources believed reliable but are not independently guaranteed — verify them against the primary sources and your own plant conditions before acting. Zoniax Innovations LLC and the author accept no liability for decisions made from this content. Naming a standard, product, or vendor is not an endorsement.

Cite this article

Nõmm, A. (2026). From Historian to Lakehouse: The OT Data Pipeline. Zoniax. https://zoniax.com/blog/posts/historian-to-lakehouse