Today's article comes from the journal Machine Learning and Knowledge Extraction. The authors are Kamal et al., from the German University in Cairo (GUC). In this paper they build an intrusion detection system (IDS) that can operate across multiple network environments at once. How? It uses a shared autoencoder with projection layers that map each environment's data into a common latent space.
DOI: 10.3390/make8020053
If you work at a large company, there's a good chance that your IT department is maintaining more than one network at once. There's the enterprise network that you and your coworkers connect your machines to; there's the IoT network that connects the company's physical systems so your properties can be monitored, secured, and automated; and there are the cloud networks, where your servers, applications, and data live.
These networks are heterogeneous by nature. There's no need, for example, for your cloud network to expose low-level packet timing features, or for your IoT network to generate rich application-layer logs. Different use cases, different data representations, and different protocols. But also: different attack surfaces.
If a bad actor is trying to compromise the system as a whole, the entry points and techniques they'll use will depend on which sub-network they're trying to attack. Their intention might be consistent, but their actual tactics will be network-dependent. And that creates a rather subtle problem. If we collect access data from these environments, we won't get a single unified view of what an "attack" actually is. We'll get multiple, incompatible representations of it instead. Different events captured at different levels of the stack, with different feature sets, in different schemas with different statistical structures.
Now, what happens if we try to train an intrusion detection system on that kind of setup? Exactly what you'd expect: a model trained on one network's data performs well there, but breaks when deployed somewhere else. And when you try to merge the datasets, you run into mismatched features and inconsistent distributions. This leaves you with two bad options: train a separate model for each environment and give up on any shared learning, or force all the data into a single schema and destroy the very signals that make each environment distinct.
The authors of today's paper believe there's a better way. A way to help our models learn a unified notion of attack behavior without forcing all the data into the same schema. A way to generalize across environments without losing the unique signals each one provides. Their solution is a shared autoencoder with projection layers that map each dataset into a common latent space. That space is designed to capture the underlying semantics of attack behavior, independent of the original feature format. In this paper they stand up a CNN-DNN classifier that operates on this shared representation, extracts local patterns, and performs global classification. A single model that can learn from diverse network environments and generalize across them, rather than overfitting to any one dataset.
On today's episode we'll dig into how the autoencoder is designed, what it actually means for a latent space to capture behavioral semantics, and how the hybrid classifier was built to take advantage of that space. But first, we need to wrap our heads around the core issue here: why combining heterogeneous datasets is such a hard problem in the first place.
At first glance, this whole thing might seem like a straightforward data engineering problem. Just align the columns, normalize the values, and move on. But the difficulty runs much deeper than that. These datasets are not just formatted differently, they encode fundamentally different views of reality. One might describe traffic at the packet level, capturing fine-grained timing and header information. Another might aggregate flows over time, collapsing thousands of packets into a handful of summary statistics. A third might log high-level device behavior or application events. When you try to bring these together, you are not just reconciling column names, you are attempting to map between entirely different abstractions. Features that appear similar might carry different meanings, and related behaviors may be expressed through completely different signals. Any naive alignment process will either destroy useful information or introduce false correspondences. And the model will be left trying to learn from a representation that is internally inconsistent. Neural networks assume that patterns are consistent across their training corpus. That a given feature or combination of features carries a stable semantic meaning. In a heterogeneous setting, that assumption doesn't hold. So the optimizer is pulled in conflicting directions as it tries to fit the incompatible distributions. This can lead it to converge on shortcuts that rely on dataset-specific artifacts rather than true attack behavior. The result is brittle models that memorize statistical quirks, overfit to dominant datasets, or ignore minority patterns entirely.
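To make that concrete, here's a toy illustration. The feature names below are invented for this example, not taken from the paper's actual datasets, but they capture the shape of the problem: three capture levels, three schemas, and nothing that naively lines up.

```python
# Purely illustrative feature schemas; these names are invented,
# not taken from the paper's datasets.
packet_level = ["inter_arrival_ms", "ttl", "tcp_flags", "payload_len"]
flow_level   = ["duration_s", "total_bytes", "pkt_count", "bytes_per_sec"]
device_level = ["event_type", "cpu_load", "failed_logins"]

# A naive column alignment has nothing to work with:
print(set(packet_level) & set(flow_level) & set(device_level))  # set()
```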
So the core challenge is not just combining the data, it is generating a shared representation where different datasets can become comparable. And doing so without erasing what makes that data distinct. That tension between alignment and preservation is what makes this problem fundamentally hard. And that is the constraint the authors are trying to resolve.
Their solution is built around an autoencoder, a neural network that is trained to compress data into a much smaller representation and then reconstruct the original data from that compression. Why? Because compression strips away redundancy and noise, and reconstruction forces the model to recover the original structure from that compact signal. Learning to do this makes the model very good at capturing the essential patterns that define the data. And this skill is particularly valuable when you're trying to align data from different sources. The compression process is called "encoding", the compressed form is the "latent space", and the expansion back to the original form is called "decoding". To compress data well and reconstruct it accurately, the system has to learn which aspects of the data matter most for representing it compactly. As a result, the latent space ends up capturing structure that supports both reconstruction and generalization across datasets.
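Here's a minimal sketch of that idea in PyTorch. All the layer sizes are illustrative assumptions, not the paper's actual architecture, but the moving parts are the same: an encoder compresses each record into a small latent vector, a decoder rebuilds the original, and the reconstruction error is the training signal.

```python
import torch
import torch.nn as nn

# A minimal autoencoder: compress 64 input features down to an
# 8-dimensional latent vector, then reconstruct the original input.
# (Dimensions are illustrative, not the paper's settings.)
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=64, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 32), nn.ReLU(),
            nn.Linear(32, latent_dim),          # "encoding" -> latent space
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, input_dim),           # "decoding" -> reconstruction
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = AutoEncoder()
x = torch.randn(16, 64)                  # a dummy batch of 16 records
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # reconstruction loss drives training
```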
The key idea here is that a single encoder with shared weights is trained on all three datasets at the same time. So every dataset gets compressed by the same function. This is what forces the model to generalize. If the encoder only had to compress one dataset, it could get away with learning the quirks specific to that environment. When it has to compress three structurally different datasets, it's forced to find what they have in common at a fundamental level. And here, that common ground ends up being the behavioral mechanics of attack (and normal) traffic, abstracted away from any particular protocol, naming convention, or network environment.
Before reaching that shared encoder, each dataset passes through its own projection layer. This is the part of the architecture that handles the incompatibility in feature counts. Each projection layer maps its dataset from its original dimensionality into a standardized intermediate representation. So by the time the data reaches the shared encoder, every dataset arrives in the same shape. By performing this kind of 'dimensional alignment', the projection layers ensure that the encoder's weights can be meaningfully optimized across all three domains.
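A rough sketch of how that might look, again with hypothetical dimensions: one linear projection per dataset maps its native width into a common width, and only then does the data hit the single shared encoder.

```python
import torch
import torch.nn as nn

# Hypothetical per-dataset feature counts; the common width (64) and
# latent size (8) are our assumptions, not the paper's numbers.
input_dims = {"enterprise": 78, "iot": 42, "cloud": 115}

# One projection layer per dataset: original width -> shared width.
projections = nn.ModuleDict({
    name: nn.Linear(dim, 64) for name, dim in input_dims.items()
})

# A single shared encoder consumes the aligned representation, so its
# weights are optimized across all three domains at once.
shared_encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 8))

def encode(name, x):
    return shared_encoder(projections[name](x))

z = encode("iot", torch.randn(16, 42))  # every dataset lands in the same 8-dim space
```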
On the other end (after the latent space is generated), each dataset gets its own dedicated decoder that reconstructs the original input. These decoders serve a quality-enforcement role. If the encoder compressed everything into representations that were too abstract, the decoders would fail to reconstruct the original inputs accurately and the reconstruction loss would rise. This creates a tension that the researchers call "Structural Dualism". The encoder is being pulled toward generality by the shared weights, which must serve all three datasets at once, and toward specificity by the decoders, which each demand enough detail to rebuild their own dataset faithfully.
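Putting those pieces together, a simplified training step might look like the sketch below. The dimensions and the MSE loss are our assumptions rather than the paper's exact choices, but they show how a single summed loss pulls the shared encoder in every direction at once.

```python
import torch
import torch.nn as nn

input_dims = {"enterprise": 78, "iot": 42, "cloud": 115}   # hypothetical widths
projections = nn.ModuleDict({n: nn.Linear(d, 64) for n, d in input_dims.items()})
shared_encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 8))
decoders = nn.ModuleDict({n: nn.Linear(8, d) for n, d in input_dims.items()})

def training_step(batches):
    """batches maps each dataset's name to a tensor of its original width."""
    loss = torch.tensor(0.0)
    for name, x in batches.items():
        z = shared_encoder(projections[name](x))  # same weights for every dataset
        x_hat = decoders[name](z)                 # dataset-specific reconstruction
        loss = loss + nn.functional.mse_loss(x_hat, x)
    return loss

loss = training_step({n: torch.randn(16, d) for n, d in input_dims.items()})
loss.backward()  # gradients reach the shared encoder from all three losses
```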
The latent space that emerges from that tension is one that has internalized both universal attack patterns and domain-specific nuance. And that unified data layer is what the authors' hybrid classifier operates on. It's a combination of a CNN (a convolutional neural network) and a DNN (a deep neural network).
The CNN does the fine-grained pattern extraction; the DNN does the global relational mapping.
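As a rough sketch (with guessed layer sizes, not the paper's configuration), the hybrid might treat each latent vector as a short 1-D signal: convolutional filters slide over it to pick out local patterns, and dense layers then relate everything globally to produce the classification.

```python
import torch
import torch.nn as nn

# A sketch of a hybrid classifier over latent vectors; the layer sizes
# here are our own guesses, not the paper's exact architecture.
class HybridClassifier(nn.Module):
    def __init__(self, latent_dim=8, n_classes=2):
        super().__init__()
        # CNN branch: treat the latent vector as a 1-D signal and slide
        # small filters over it to extract local patterns.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        # DNN head: dense layers relate the extracted features globally
        # and perform the final classification.
        self.dnn = nn.Sequential(
            nn.Linear(16 * latent_dim, 32), nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, z):
        h = self.cnn(z.unsqueeze(1))  # (batch, 1, latent_dim) for Conv1d
        return self.dnn(h)

clf = HybridClassifier()
logits = clf(torch.randn(16, 8))  # latent vectors from the shared encoder
```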
And importantly: it works! Ablation experiments confirmed that this division of labor is effective. The authors benchmarked the hybrid against a standalone CNN and a standalone DNN, and their new system came out on top. That being said, there are some significant limitations here. Extending the framework to more than three datasets, for example, will introduce a lot of additional engineering overhead. More projection layers, more decoders, and more training coordination. The optimal latent dimensionality needs to be tuned empirically for each new combination of datasets rather than derived analytically. And hyperparameter selection for the classifier is iterative, which creates friction for anyone deploying this in a new context. These aren't fatal limitations, but they're real costs that would need to be weighed against the system's benefits.
If you want to go deeper, make sure you download the full paper. It includes a semantic analysis of each latent dimension, full per-class confusion matrices across all the evaluated datasets, and more detailed results from their benchmarking of each classifier.