### Winds of Change

> Copyright (c) 2024 by George L. Willis. All rights reserved.

When I first started to work with a large eastern US utility over a decade ago, there was a large gathering where I asked a member of senior management whether the utility intended to internally develop any of the SCADA or Grid Control ICS software they wanted to deploy, as my background was in large-scale software development and architecture. The response was simple -- "We are not a software company. We prefer buying over building." An understandable policy, and one adopted by many utilities. So we bought a DSCADA and ADMS system for two different regions from a large vendor, went through a disappointing FAT where system scale and performance fell below the promised SLAs, and I remember saying to the OT lead on the project, who was very upset, that given the timeframe for federal reimbursement, victory would be declared in the face of disappointment.

Recently, in dealing with another major utility in the Northeast US, I learned that they were unhappy with their current ADMS system and were moving away from their vendor. They had looked at other offerings, but were not impressed with any, given the issues that distributed solar generation and energy storage were bringing to the forefront. The age of large-scale DERMS had brought another option -- "why don't we build our own?" As daunting as the task may seem, there are many good reasons to consider developing an open architecture on standards that favor integration of components, or as Gartner has named it, "Composable Business Architecture". I wish to discuss SmartGrid Industrial Control Systems (SG-ICS), where past solutions have missed the boat, and what enablers we have a decade later that are leading to the option of integrating best-of-breed offerings with internally developed components to tackle DERMS at scale.

### "Are we there yet?"

Nothing was more annoying than a long family car ride where entertainment for the kids was "I Spy", "99 bottles of beer on the wall", and the constant polling by the children with one question -- "Are we there yet?" Thank heavens for iPads!

In North America, DNP3 is the primary protocol for OT devices in the substation, and these devices are polled by a local and/or remote RTU (RTAC if you like). You get lots of data for each point, like voltage: a monotonous stream of nearly duplicate readings like 124.3v, 124.2v, 124.4v, 124.5v... you get the idea. But do I care about these minor fluctuations? The answer is that 99.9% of the time, I do not. Yet these readings are sent through the network, up through banks of RTUs, onto headend systems, and then replicated by Realtime Databases to ICCP servers and beyond. This monotonous data leads to problems in performance and scale, and it is an architectural design flaw.

Now don't get me wrong. Polling is not bad, since it allows for fault tolerance and scaling via pools of RTUs, as well as buffering data in the origin device to take advantage of the bandwidth gains Ethernet has made over the past decade.

> Most don't understand the difference between network bandwidth and latency. Imagine a mule is carrying your data over a mountain. Over the last decade, the mule has gotten stronger and can carry more data (bandwidth), but the mule hasn't gotten any faster (latency). In fact, the slowest part of distributed systems is networking, which is why you want to DRY up networks -- "Don't Repeat Yourself".

This is one reason why Event-Driven Architecture (EDA) and Streaming Data are all the rage. The issue is that 99.9% of the time, we care whether a point has "significantly changed", and can treat minor fluctuations as "unchanged". What we need is a way to gather, store, and filter for significance, and only send significant change further upstream. The solution: employ time-series analysis at the substation level that processes the monotonous data into time-interval data, expressed in the form "from this timestamp to this timestamp, the voltage was 124.4v, plus or minus 0.2v". At the same time, go ahead and cache recent actual readings, so that in the rare instance a headend cares to use the actual readings, it can retrieve them. With this solution, you reduce the dataflow to the data you care about, while preserving the ability to retrieve actual values in the exceptional case. The best of both worlds.

If a significant change to the data occurs outside the "guardrails" that define the insignificant, then we capture that as a separate discrete value or time-series value, depending on whether multiple values occur within the significant "event". Events are what we are shooting for, and events are defined by changes to near-static values that "trend" outside of the guardrails which define significance.

Time-Series Databases are an enabling, mature technology in this arena. They consolidate the storage of repetitious streaming data into more concise time-series data, and allow queries to be reconstituted into time intervals specified in the query. They capture "trends" of data over time.
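To make the guardrail idea concrete, here is a minimal sketch in Elixir (my language preference, for reasons I'll get to below). The module name, the shape of the readings, and the fixed deadband are all mine for illustration -- a real substation filter would be configured per point type.

```elixir
defmodule Deadband do
  # Collapse a monotonous stream of {timestamp, value} readings into
  # interval summaries, closing an interval only when a reading trends
  # outside the guardrails (the configured deadband, e.g. +/- 0.2v).
  def filter(readings, deadband) do
    Enum.chunk_while(
      readings,
      nil,
      fn reading, interval -> step(reading, interval, deadband) end,
      fn
        nil -> {:cont, nil}
        interval -> {:cont, summarize(interval), nil}
      end
    )
  end

  # The first reading opens an interval and sets its baseline.
  defp step({ts, v}, nil, _deadband),
    do: {:cont, %{baseline: v, from: ts, to: ts, min: v, max: v}}

  defp step({ts, v}, interval, deadband) do
    if abs(v - interval.baseline) <= deadband do
      # Insignificant fluctuation: extend the current interval.
      {:cont, %{interval | to: ts, min: min(v, interval.min), max: max(v, interval.max)}}
    else
      # Significant change: emit the closed interval, open a new one.
      {:cont, summarize(interval), %{baseline: v, from: ts, to: ts, min: v, max: v}}
    end
  end

  defp summarize(i),
    do: %{from: i.from, to: i.to, value: i.baseline, spread: i.max - i.min}
end

# Deadband.filter([{0, 124.3}, {1, 124.2}, {2, 124.4}, {3, 126.1}], 0.2)
# => two summaries: one for the quiet stretch, one opened by the jump to 126.1
```

Pair this with a local cache of the raw readings, and the headend gets the intervals by default and the raw values only on request.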
### "I can either teach, or I can do. What's your pleasure?" -- Robert Duvall in *Deep Impact*

Another source of bad design is the replication of data from a Realtime (Database) Server to an ICCP Gateway Server. The Realtime Server (RTS) is where all data ends up in realtime, and this is a scaling issue, especially if you ask the RTS not only to store incoming data, but to replicate it to another server.

Distributed Log Technology (like Kafka, Pulsar, Beam, Object Storage internals, etc.) allows data to be streamed to many servers in parallel for processing. This technology is mature, and its underpinnings of replication and consensus algorithms are found in many advanced distributed databases. It's highly performant because it is focused on solving the critical issue of replicating a streaming log of data to a cluster of servers to achieve both performance and data safety (redundant copies). It yields a highly scalable architecture where multiple servers can work in parallel on the same or different tasks. It scales beautifully. This is what Event-Driven Architecture (EDA) is all about.

So rather than ask the RTS to both teach and do -- just let it do storage. An ICCP server can subscribe to the same stream. Even better, we can let this archaic protocol die an ignominious death, and use modern VPN or other encryption-in-motion and encryption-at-rest services to secure the transmission and consumption of the distributed log, like large financial houses do (something I know a little about). Employ Distributed Ledger Technology (DLT) if you are sharing outside enterprise walls where immutable trust is the concern. Anyone who has ever had to configure bidirectional tables or tried to integrate OMS and DMS systems via ICCP tags understands that ICCP is a protocol that has no place in modern IT. Most vendors do not even implement all the ICCP blocks!

Of course, you still need to get security on board, but that's why Penetration Tests are mandated at least annually, and every 6 months in the first year.
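To illustrate the subscription model, here is a toy in-process version of that fan-out in Elixir, built on the standard-library Registry. In production this role belongs to Kafka, Pulsar, or similar; the module below is only a sketch of the decoupling, and all names are hypothetical.

```elixir
defmodule Telemetry.Bus do
  # One producer appends readings; any number of consumers (realtime
  # storage, an ICCP-replacement gateway, analytics) subscribe
  # independently, so the RTS never has to re-publish data itself.

  def start_link do
    # :duplicate keys allow many subscribers per topic.
    Registry.start_link(keys: :duplicate, name: __MODULE__)
  end

  # The calling process will receive every event on this topic.
  def subscribe(topic), do: Registry.register(__MODULE__, topic, [])

  # Fan the event out to every registered subscriber in parallel.
  def publish(topic, event) do
    Registry.dispatch(__MODULE__, topic, fn subscribers ->
      for {pid, _} <- subscribers, do: send(pid, {:event, topic, event})
    end)
  end
end

# {:ok, _} = Telemetry.Bus.start_link()
# Telemetry.Bus.subscribe("substation-7/voltage")   # from an RTS storage process
# Telemetry.Bus.subscribe("substation-7/voltage")   # from a gateway process
# Telemetry.Bus.publish("substation-7/voltage", %{value: 124.4, at: DateTime.utc_now()})
```

The point is the shape of the architecture: the producer knows nothing about its consumers, so adding a new downstream system costs nothing upstream.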
### "Not your Father's Oldsmobile"

There is a wide range of computer languages to choose from. Java is mainstream, but not a great choice for containers due to the JVM needing to be embedded in every container produced. Java predated containers, and despite marketing that would say otherwise, it's not a great choice for microservice architectures. Garbage-collector interrupts lead to non-deterministic performance, and frankly I am surprised JVM-based languages are employed in performance-critical systems. Of course, there is Golang, which is much more container-friendly, but it too leverages garbage collection, so it suffers from non-deterministic issues that need to be architected around, usually by turning off the GC and running pods of Golang containers that are periodically respawned. You can try Rust as I did, but you will find that the tradeoff for no garbage collection and memory safety is language complexity and immature implementation of basic elements like ranges for standard datatypes. Get ready to pay premiums for development and technical debt.

Perhaps we need to learn from history. When Ericsson wanted to ensure the performance and reliability of the phone system in Europe, they developed a new language called Erlang. Now wait! Hear me out. The only language I have ever found difficult to reason about is Erlang. It'll make you wonder how any code ever got written. But what has been written is performant, distributed, fault-tolerant, and scalable. If only somebody would invent a new language that sat on top of Erlang, and that was as friendly to developers as Ruby. Ruby on Rails took over web application development because of the velocity of development and prototyping, but fell short in scaling and performance. Enter ***Elixir***, and this is not your father's Erlang. Speed of development and maintainability of the codebase akin to Ruby -- on top of all that Erlang goodness. Frankly, it's the only language I would consider for developing performant, large-scale systems. But don't take my word for it, google it for yourself. Yes, it runs on BEAM, which is the Erlang virtual machine; but it's so performant due to its maturity and small size that I'll trade a small amount of non-deterministic latency for all this goodness. Anywhere it becomes an issue, you can always employ small amounts of Rust to overcome it. It's been a long time since I've been excited about a computer language.
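To give a flavor of "all that Erlang goodness", here is a minimal supervised worker in Elixir. If the process crashes mid-poll, the supervisor restarts it with clean state -- the "let it crash" fault-tolerance model the BEAM was built around. The device name and poll stub are placeholders, not a real DNP3 driver.

```elixir
defmodule Feeder do
  use GenServer

  # A supervised worker that polls one device on a fixed schedule.
  def start_link(device), do: GenServer.start_link(__MODULE__, device)

  def init(device) do
    schedule_poll()
    {:ok, device}
  end

  def handle_info(:poll, device) do
    # poll_device/1 stands in for a real DNP3 read; if it raises,
    # the process dies and the supervisor respawns it.
    _reading = poll_device(device)
    schedule_poll()
    {:noreply, device}
  end

  defp schedule_poll, do: Process.send_after(self(), :poll, 1_000)
  defp poll_device(_device), do: {:ok, 124.3}
end

# One line of supervision buys automatic restarts:
# Supervisor.start_link([{Feeder, "feeder-12"}], strategy: :one_for_one)
```

No defensive try/rescue scaffolding around every call; recovery is a property of the process tree, not of the code inside it.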
### "The nice thing about standards is that there are so many to choose from."

Event-Driven Architecture is not without its complexities. There are no generic APIs that allow you to quickly switch between the various streaming solutions, with the exception of Apache Beam, which has limited support. With few exceptions, every technology has its own APIs, making for code migration issues should you decide to switch infrastructure. And worse, they don't all handle the semantics of topics the same. Some will create topics on the fly; others insist they must be declared before use. Some aggregate subtopics into a parent topic; others view a parent topic as an independent topic. Then you get into the whole taxonomy of topics, which can cause migration issues as your early designs evolve into more mature topical hierarchies. It's a painful place to learn, and the cost of change is high.

If only we could have a simple way to reason about domain data, as in Domain-Driven Design (DDD). Yea, SQL is easy! If you can't handle SQL, you had better get out of IT. That's why, early on, so many tried to use the database to integrate applications. This proved to be a major anti-pattern for database integrity, but the motivation was clear enough -- the governance of a unified, simple data model. If only there were a way to share the structure of a database without allowing the data to be changed. An immutable SQL database that versioned every change?

What if we could have tables like Employees, and when somebody inserted a record, we could use a "trigger" to process that new record? And when you fired an Employee, that would be a delete, but you would just create a new version of the record, now marked as deleted. Same for updates: a new version of the record containing the updated values. Suddenly, you could have an API as simple as SQL to develop against, where events carry CRUD operations akin to REST verbs, and triggers fire event processing. All that is required is a distributed database that applies triggers to tables as topics!

I've just described Change Data Capture (CDC), and why projects like Debezium are gaining traction: the simplicity of SQL semantics leveraged to bring sanity to Event-Driven topics and schemas. Of course, you have to use a database that can store CRUD operations as new, immutable versions of a record. After all, you need the ability to change both the plumbing and the data model -- to achieve evolutionary development. The only other alternative is to use Avro Schemas, which have become the de facto approach to versioning event schemas.
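As a sketch of that versioned-record idea (not Debezium's actual API -- the module and field names here are invented), every CRUD operation below appends a new immutable version and fires a trigger function, which is exactly the table-as-topic behavior CDC gives you.

```elixir
defmodule Versioned do
  # An append-only log of record versions, newest first. Nothing is
  # ever mutated or removed; every operation adds a new version.

  def insert(log, id, fields, trigger) do
    change(log, %{id: id, version: 1, op: :insert, deleted: false, fields: fields}, trigger)
  end

  def update(log, id, fields, trigger) do
    prev = latest(log, id)
    change(
      log,
      %{prev | version: prev.version + 1, op: :update, fields: Map.merge(prev.fields, fields)},
      trigger
    )
  end

  # A delete is just a new version marked as deleted; history survives.
  def delete(log, id, trigger) do
    prev = latest(log, id)
    change(log, %{prev | version: prev.version + 1, op: :delete, deleted: true}, trigger)
  end

  # The "trigger" fires downstream event processing on every change.
  defp change(log, record, trigger) do
    trigger.(record)
    [record | log]
  end

  defp latest(log, id), do: Enum.find(log, &(&1.id == id))
end

# log = Versioned.insert([], 1, %{name: "Ada"}, &IO.inspect/1)
# log = Versioned.update(log, 1, %{title: "Engineer"}, &IO.inspect/1)
# _log = Versioned.delete(log, 1, &IO.inspect/1)
```

Swap the trigger function for a publish onto the distributed log from earlier, and the event plumbing disappears behind plain CRUD semantics.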