
Grey Wolf Optimization and Deep Belief Networks for Data-Efficient Forecasting in Smart Renewable Energy Systems

Today's article comes from the Journal of Universal Computer Science. The authors are Altherwi et al., from Jazan University, in Saudi Arabia. In this paper they're combining a Deep Belief Network (DBN) with Grey Wolf Optimization (GWO) to create a pipeline that can better predict the output of Hybrid Renewable Energy Systems (HRES).

DOI: 10.3897/jucs.160204


Ludwig Boltzmann was a physicist living in 19th century Austria. He's remembered for being the person who laid the foundations of Statistical Mechanics, the field that connects microscopic particle behavior to macroscopic properties like temperature and energy. His key idea was that complex physical systems can be described probabilistically, with states distributed according to what is now called the Boltzmann distribution. Instead of tracking every particle, you assign probabilities to the possible configurations (or "states") the system is likely to occupy. And the probability of each state depends on that state's energy.

  • Lower-energy states are more stable, so they're more likely.
  • Higher-energy states are less stable, and are less likely.

We call that the "energy-based probability" rule.
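To make that rule concrete, here's a tiny numeric sketch (with arbitrary example energies and temperature) of how the Boltzmann distribution turns state energies into probabilities:

```python
import numpy as np

energies = np.array([0.5, 1.0, 2.0])   # three possible states, arbitrary units
kT = 1.0                                # temperature term (Boltzmann constant x T)

weights = np.exp(-energies / kT)        # unnormalized "energy-based probability"
probs = weights / weights.sum()         # normalize so the probabilities sum to 1
print(probs)                            # the lowest-energy state is the most likely
```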

Now fast forward to the 1980s, nearly eight decades after Boltzmann's death. Geoffrey Hinton and colleagues have invented a new type of neural network: the Boltzmann Machine. In this model each possible configuration of neurons is assigned an energy, and the network learns by adjusting those energies so that desirable patterns, the ones that match the data, become low-energy and therefore more likely, while the less desirable ones that don't match the data become high-energy and therefore less likely. Instead of directly predicting outputs from inputs (like a standard feedforward neural network would), the model learns a probability distribution over all possible input patterns. This allows it to model the underlying structure of the data and generate new samples without explicit labels or direct supervision. It was a major leap forward in unsupervised learning and probabilistic modeling, one that laid the groundwork for the deep learning architectures, and particularly the generative models, that we use today.
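Here's a rough sketch of the energy such a model assigns to a single configuration of neurons. For simplicity it uses the restricted variant (the RBM that comes up in a moment), with random placeholder weights and biases; training would adjust these so that data-like configurations end up with low energy:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=6)                  # a binary visible configuration
h = rng.integers(0, 2, size=4)                  # a binary hidden configuration
W = rng.normal(size=(6, 4))                     # visible-to-hidden weights
b, c = rng.normal(size=6), rng.normal(size=4)   # biases

# E(v, h) = -b.v - c.h - v.W.h ; lower energy means a higher
# unnormalized probability, exp(-E)
energy = -(b @ v) - (c @ h) - (v @ W @ h)
print(energy)
```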

And as it turns out, Boltzmann Machines are stackable. If you take multiple layers of these models and train them one at a time, feeding the learned representation from one layer into the next, you gradually build up more abstract features at each level. The lower layers capture simple patterns, and higher layers capture more complex relationships. The result is what we call a Deep Belief Network (DBN). It might sound like a CNN, but it's different. A DBN is essentially a stack of Restricted Boltzmann Machines (RBMs) trained layer-by-layer, followed by a fine-tuning phase that adjusts the whole network (there's a minimal sketch of this layer-wise training after the comparison below).

  • While a CNN also builds hierarchical representations, it usually does so through a fully supervised, end-to-end process, where all layers are optimized simultaneously using backpropagation and labeled data. And the structure of that network is explicitly designed to exploit spatial locality through convolution and pooling.
  • In a DBN, each layer is trained independently, largely unsupervised, and learns to model the distribution of its inputs before passing a transformed representation up to the next layer. Only after this layer-wise pretraining is complete is the entire network fine-tuned. This makes DBNs less dependent on large labeled datasets and more focused on capturing the underlying structure of the data itself.
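As promised, here's a minimal sketch of that layer-wise pretraining, using scikit-learn's BernoulliRBM. The data, layer sizes, and hyperparameters are illustrative placeholders, not the paper's configuration; supervised fine-tuning of the full stack would come afterward:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
X = rng.random((500, 32))        # stand-in for sensor features scaled to [0, 1]

layer_sizes = [64, 32, 16]       # hidden units per RBM, chosen arbitrarily
rbms, layer_input = [], X
for n_hidden in layer_sizes:
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                       n_iter=20, random_state=0)
    rbm.fit(layer_input)         # unsupervised: model this layer's input distribution
    layer_input = rbm.transform(layer_input)   # hidden activations feed the next RBM
    rbms.append(rbm)
# `rbms` now holds layer-wise pretrained weights; fine-tuning the whole
# stack end-to-end would follow.
```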

Now, why did I just tell you all that? Because DBNs are core to what the authors are doing in today's paper. They're taking a deep belief network and pairing it with an optimization strategy (GWO) that selects which inputs the model should pay attention to. And all of this in service of the task of accurately forecasting the performance of renewable energy systems. On today's episode, we'll walk through how their pipeline works, and what it's actually useful for. Let's dive in.

HRES, Hybrid Renewable Energy System, is the generic name we give to any installation or infrastructure setup that can generate electricity from multiple sources (like solar and wind) or balance supply and demand through storage and control. And as a group, HRES systems generate a ton of sensor data. You have solar irradiance, wind speed and direction, atmospheric pressure, humidity, panel tilt angles, grid load, and more. And all of that data has some relationship with the eventual output of the system. The trouble is, these variables aren't equally predictive. Some of them are statistically independent of the system's output, others are highly correlated. Some are unique perspectives on particular phenomena, others are highly coupled with each other in redundant ways. If you dump all of these variables into a deep learning model, you run into the "curse of dimensionality": training slows down, the model can overfit on noise, and generalization can suffer. You can end up with something that's computationally expensive but doesn't actually predict anything particularly well.

This is what we call a "feature selection" problem. If you want to train a model that can make accurate predictions, you've first got to find a smaller subset of input variables that preserves as much predictive power as possible. Strip out the redundant ones, strip out the noisy ones, and feed the model only what it needs. Done well, this speeds up training and improves accuracy. Done poorly it just throws away useful signal and makes the model even worse.

Most approaches to feature selection are either brute-force (try every combination) or heuristic (apply some statistical filter). The authors here use a metaheuristic optimization algorithm instead. It's called GWO, Grey Wolf Optimization, and it sits in the middle-ground between those two options. It searches the possible feature space, guided by a multi-objective function that balances predictive accuracy, feature redundancy, and computational cost. But it does this without needing to exhaustively evaluate every possible combination. The inspiration for GWO comes from the hunting behavior of grey wolves. An alpha leads, the beta and delta wolves hold the second and third best positions, and all other wolves adjust their movements relative to those three leaders in a way that progressively closes in on the prey. In GWO, this translates into a search where candidate solutions (feature subsets) move through the search-space, with the best-performing subsets acting as leaders that pull the rest toward high-fitness regions. At the tail end of this process you're left with a reduced set of features that the system thinks are most likely to yield high predictive performance.
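Here's a toy sketch of that search loop applied to feature selection. The fitness function, penalty weight, and population settings are illustrative stand-ins; the paper's actual multi-objective function balances accuracy, redundancy, and computational cost:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_wolves, n_iters = 12, 8, 50

def fitness(mask):
    # Placeholder objective: in practice you'd train/validate a model on the
    # masked features. Here features 0 and 3 carry the (dummy) signal, and
    # the second term penalizes larger subsets.
    error = 1.0 - 0.4 * mask[0] - 0.4 * mask[3]
    return error + 0.01 * mask.sum()

# Each wolf is a continuous position in [0, 1]^n; thresholding at 0.5
# turns a position into a binary feature mask.
positions = rng.random((n_wolves, n_features))

for t in range(n_iters):
    masks = (positions > 0.5).astype(int)
    scores = np.array([fitness(m) for m in masks])
    # The three best wolves (alpha, beta, delta) lead the pack
    alpha, beta, delta = positions[np.argsort(scores)[:3]]
    a = 2 - 2 * t / n_iters                      # exploration decays over time
    new_positions = np.zeros_like(positions)
    for leader in (alpha, beta, delta):
        A = a * (2 * rng.random(positions.shape) - 1)
        C = 2 * rng.random(positions.shape)
        D = np.abs(C * leader - positions)       # distance to this leader
        new_positions += (leader - A * D) / 3    # average of the three pulls
    positions = np.clip(new_positions, 0, 1)

final_masks = (positions > 0.5).astype(int)
best = final_masks[np.argmin([fitness(m) for m in final_masks])]
print("selected features:", np.flatnonzero(best))
```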

Now what does all that have to do with a DBN? Well, those selected features are what's fed into the DBN. This happens in two phases (with a small sketch of the fine-tuning loss after the list below).

  • The first is an unsupervised pre-training phase, where each RBM layer (each Restricted Boltzmann Machine) is trained individually, from the bottom up. This initializes the weights of the entire network based on the structure of the data, before any labels are introduced. This gives the network a meaningful starting point: rather than beginning from random weights, the DBN enters the second phase already having internalized something about the shape of the data.
  • That second phase is supervised fine-tuning, where the entire network is trained end-to-end using backpropagation on labeled training examples. The loss function being minimized here measures the deviation between the model's predicted power output and the actual recorded values, with a regularization term added to penalize unnecessarily large weights and keep the model from overfitting.
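To make that fine-tuning objective concrete, here's a minimal sketch of a loss of that shape: mean squared error between predicted and recorded power output, plus an L2 penalty on the weights. The penalty weight `lam` is an illustrative placeholder, not the paper's value:

```python
import numpy as np

def finetune_loss(y_pred, y_true, weight_matrices, lam=1e-4):
    # Deviation between predicted and actual recorded power output
    mse = np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)
    # Regularization term: penalize unnecessarily large weights
    l2 = lam * sum(np.sum(W ** 2) for W in weight_matrices)
    return mse + l2
```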

When all is said and done, the output layer produces predictions for the overall performance or power output of the system. So, to recap everything going on here, there are four key ideas:

  1. The whole point is to take in environmental and system variables from HRES sensors and get the system's power generation as output.
  2. These systems generate so many readings that you can't train a deep learning model effectively without doing feature selection. And that's what GWO is for.
  3. Those pruned features then go through the DBN's two training phases: first to learn hierarchical representations, then to train the system to associate a target power value with a given set of selected inputs.
  4. The final result is a model that can take a new set of sensor readings and infer a power estimate for the rig that produced them.

The question is, does any of this actually work? To find out, the authors validated the pipeline against two real-world datasets. The first was collected from sensors at a university campus in Turkey, and the second was a French national electricity grid dataset covering hourly wind and solar production records. Using two geographically and climatically distinct datasets is important because it tests whether the framework generalizes across different environmental conditions rather than just fitting tightly to one region's particular weather patterns. The results were positive across virtually every metric. On Mean Absolute Error, competing models all produced errors many times higher than the new system's. Root Mean Square Error showed the same trend, as did the coefficient of determination. The authors' system also trained faster, finishing in well under a second on both datasets, while the competing models all took multiple seconds. So overall we're talking about a system that can predict power output and generalize across different environments more effectively than its peers, without requiring large amounts of labeled data, exhaustive feature search, or excessive computational cost. Not bad at all.
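For reference, those three headline metrics are straightforward to compute. A quick sketch with scikit-learn, using dummy values rather than the paper's data:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.1, 2.8, 4.0, 3.5]    # recorded power output (dummy values)
y_pred = [3.0, 2.9, 3.8, 3.6]    # model predictions (dummy values)

mae = mean_absolute_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5   # RMSE = sqrt(MSE)
r2 = r2_score(y_true, y_pred)                      # coefficient of determination
print(mae, rmse, r2)
```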

If you'd like to go deeper into the DBN architecture, the hyperparameter configurations used in each model, or the mathematical treatment of GWO's position updates, I'd highly recommend that you download the paper.