Today's article comes from the journal Cybersecurity. The authors are El-Sherif et al., from Cairo University, in Egypt. In this paper, they're showcasing a new mechanism for intrusion detection. It's called SPHBI: the Single Packet Header Binary Image.
DOI: 10.1186/s42400-025-00441-x
The year was 1973, and a group of ARPANET researchers had a problem. Data could be sent between machines, but there was no general way to guarantee correct, ordered delivery across networks with different characteristics. Packets were dropped, duplicated, or reordered, and applications had no consistent abstraction to rely on. Existing protocols assumed reliable, homogeneous networks and broke down as soon as those assumptions failed. What was needed was a general-purpose mechanism that could sit above the network itself and guarantee end-to-end reliability. That need led Vint Cerf and Bob Kahn to design what would become TCP: the Transmission Control Protocol. A way to break data into packets, number them, send them, retransmit anything that was lost, and reassemble everything correctly on the other side.
TCP's job was not to understand applications, users, or intent. It existed to solve a narrow engineering problem. And the header fields that define TCP reflect that original scope. Sequence numbers, acknowledgments, window sizes, and flags. These are purely mechanical tools, meant to keep conversations between cooperative endpoints synchronized and robust. Nothing more than that.
But over time, TCP escaped that original context and (somehow) became the default transport layer for the modern internet. Today, TCP carries web traffic, financial transactions, industrial control signals, medical telemetry, and the constant chatter of IoT devices. Sensors, cameras, appliances, and embedded controllers most often speak TCP, but they do so in environments that are resource-constrained and poorly monitored. Contexts that were never envisioned by the protocol's designers. This has turned TCP itself into a massive attack surface. Malicious behavior now hides inside floods of legitimate traffic. Attackers exploit protocol semantics and stateful assumptions. They abuse connection setup and teardown. They manipulate congestion control, retransmission, and flow control. This results in attacks that are difficult to distinguish from normal traffic and hard to stop without accidentally breaking legitimate communication.
But, the same parts of the protocol that expose IoT systems to attacks may also offer a way to defend against them. Today's paper starts from the observation that TCP headers, despite their age, still encode rich, low-level behavioral signals about how a device is communicating. Instead of reconstructing sessions or inspecting payloads, the authors treat individual TCP/IP packet headers as data, transforming them into compact binary representations that can be classified in real time. In doing so, they turn the basic mechanics of TCP into a detection signal. On today's episode, we'll walk through how this idea works, and why it matters for IoT security. Let's dive in.
The core insight here is quite simple: treat each packet header as a tiny image, and let a convolutional neural network (CNN) classify it. CNNs are extremely good at spotting spatial patterns, but they're built (largely) to work on images, not other types of data. So, to use them here, you'll need to transform your inputs (the TCP packet headers) into images. And that's essentially what the authors did in this paper. They call their invention the SPHBI: the Single Packet Header Binary Image.
To be fair, they're not the first people to think of representing network traffic as images. Other researchers have done this before, converting packet data into grayscale images, or RGB images, or various other visual representations. But, since previous approaches used grayscale or color, each pixel could take on hundreds or millions of different values. That requires more computation, more memory, more everything. The SPHBI approach uses pure binary. Black or white. One or zero. Each pixel is literally a single bit (like in a QR code). This creates maximum contrast, which makes patterns easier for the CNN to detect. It also makes the whole system dramatically more efficient.
So how do they construct these images? Well, they start with a TCP/IP packet header. Not the whole packet, just the header. That choice matters: it keeps the representation tiny, avoids inspecting (possibly encrypted) payloads, and forces the model to learn from protocol-level behavior alone.
The authors had to choose which header fields to include and which to exclude. They ignore IP source and destination addresses because including them would prevent the model from generalizing. They also ignore checksums from both the IP header and TCP header because these don't provide useful information for classification. They're just error-checking values. In the TCP header specifically, they ignore the sequence number and acknowledgment number because these are only meaningful when you're looking at multiple packets in a session. Since this system is designed to make decisions on single packets, that sequential information isn't available anyway.
So what do they keep? From the IP header, they extract things like header length, service type, total packet length, packet identification, fragmentation flags, time to live, and protocol type. From the TCP header, they extract source and destination ports, TCP-specific flags, and window size. All of this adds up to just 18 bytes total.
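As a rough sketch, those 18 bytes can be sliced straight out of the two headers. The exact byte offsets below are our reading of the standard IPv4 and TCP layouts, not a detail confirmed against the paper's code:

```python
def extract_sphbi_bytes(ip_header: bytes, tcp_header: bytes) -> bytes:
    """Slice out the 18 bytes SPHBI keeps (offsets assume standard
    IPv4/TCP layouts; the paper's exact selection may differ)."""
    # IPv4 bytes 0-9: version/IHL, service type, total length,
    # identification, flags/fragment offset, TTL, protocol.
    # Skips the checksum (bytes 10-11) and both addresses (12-19).
    ip_part = ip_header[0:10]
    # TCP bytes 0-3: source and destination ports.
    # TCP bytes 12-15: data offset/reserved/flags and window size.
    # Skips sequence (4-7) and acknowledgment (8-11) numbers.
    tcp_part = tcp_header[0:4] + tcp_header[12:16]
    return ip_part + tcp_part  # 10 + 8 = 18 bytes = 144 bits
```

Notice how the exclusions from the previous paragraph fall out naturally: the skipped byte ranges are exactly the checksums, addresses, and sequence/acknowledgment numbers.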
Those eighteen bytes translate to 144 bits that they arrange into a 12x12 matrix. No padding needed, no extra bits added, just a clean square grid where each cell is true/false, 1 or 0, either black or white. That is to say: each bit becomes a pixel. If the bit is 1, the pixel is white. If the bit is 0, the pixel is black. And boom, that's your image.
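Packing those bits into a picture is nearly a one-liner with numpy. This is our illustrative reconstruction, not the authors' code:

```python
import numpy as np

def to_binary_image(header_bytes: bytes) -> np.ndarray:
    """Turn 18 header bytes into a 12x12 array of 0/1 'pixels'."""
    assert len(header_bytes) == 18, "SPHBI expects exactly 18 bytes"
    # unpackbits expands each byte into its 8 bits, MSB first
    bits = np.unpackbits(np.frombuffer(header_bytes, dtype=np.uint8))
    return bits.reshape(12, 12)  # 144 bits, no padding required
```

The 144 bits tile the 12x12 grid exactly, which is presumably why the authors landed on this particular field selection.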
Because of the high contrast, the images of malicious traffic look dramatically different from images of normal traffic. The patterns just pop out. It's almost like the difference between a barcode and random noise. Your naked eye can't necessarily see the pattern, but a CNN absolutely can. And since you're working with pure binary images, input normalization is essentially built-in. The pixel values can only be zero or one, so you don't need to rescale the input distribution. The data is already perfectly normalized. This eliminates an entire preprocessing step and reduces computational overhead during both training and inference.
The authors built two different CNNs to learn these patterns.
The first one is called Binary IDS, and it answers a single question: is this packet benign or malicious? That's it, very minimal scope. It starts with the binary image as input. The first convolutional layer scans across the image looking for patterns. Then comes a max pooling layer that reduces the spatial dimensions, keeping only the most important features. A second convolutional layer scans again, this time looking for higher-level patterns in the features the first layer found. Another pooling layer reduces dimensions further. Then you flatten everything down to a single vector and feed it into an output layer that makes the final classification decision.
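To make that pipeline concrete, here's a minimal numpy forward pass with the same shape: conv, pool, conv, pool, flatten, dense. The filter sizes here (one 3x3 filter per layer) are illustrative guesses rather than the paper's exact configuration:

```python
import numpy as np

def conv2d(x, k, b=0.0):
    """Valid 2-D convolution with a single filter, bias, and ReLU."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return np.maximum(out, 0.0)

def maxpool(x, s=2):
    """Non-overlapping s x s max pooling (ragged edges dropped)."""
    h, w = x.shape[0] // s * s, x.shape[1] // s * s
    return x[:h, :w].reshape(h // s, s, w // s, s).max(axis=(1, 3))

def binary_ids_forward(img, k1, k2, w_out, b_out):
    """12x12 binary image -> benign/malicious score in (0, 1)."""
    x = maxpool(conv2d(img, k1))   # 12x12 -> 10x10 -> 5x5
    x = maxpool(conv2d(x, k2))     # 5x5  -> 3x3  -> 1x1
    z = x.flatten() @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid score
```

With one 3x3 filter per conv layer this sketch has 10 + 10 + 2 = 22 weights, so the paper's reported count evidently comes from slightly different layer sizes. The point is the order of magnitude: tens of parameters, not millions.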
The entire model has only 35 trainable parameters. Not 35,000; 35 total. For context, a typical CNN might have millions of parameters. Even lightweight mobile-optimized models usually have tens of thousands. Then again, binary classification is the easy case. You're just asking "attack or not attack." What about actually identifying which type of attack it is? What about multiclass classification?
That's what their second model is for. They call it Multi-class IDS. This one is more sophisticated, but still remarkably lightweight by deep learning standards.
The architecture follows a similar pattern, but deeper. Instead of two convolutional layers, it has four. Each one is followed by pooling. The conv layers use multiple filters instead of just one, which lets them detect multiple different patterns simultaneously. After all the convolution and pooling, the feature maps get flattened and fed into an output layer with one neuron for each possible class. For training, they use Stochastic Gradient Descent instead of Adam, with momentum to help it converge faster and avoid getting stuck in local minima. And they train for more epochs than the binary classifier because multiclass classification is a harder problem that requires more time to learn.
So how well does all this actually work?
The authors tested both models on two public, packet-level IoT intrusion datasets: Edge-IIoTset and MQTTset. These datasets are important because they provide raw TCP packet captures with labels assigned at the individual packet level, rather than aggregated flows or sessions. That makes them a direct test of the paper's central claim: that a single TCP/IP packet header, in isolation, contains enough information to distinguish benign from malicious behavior. For evaluation, the authors filtered the traffic to include only TCP packets, then they excluded non-data protocols, and capped class sizes to control imbalance. Performance was measured using accuracy, precision, recall, F1-score, and false positive rate, and results were reported on held-out test sets and via cross-validation.
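For reference, all of the reported metrics fall out of the four confusion-matrix counts. This helper is ours, not from the paper:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # a.k.a. detection rate
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)               # benign packets flagged as attacks
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}
```

Note that "zero false positives for normal traffic," which comes up in the results, just means fp = 0, so the FPR term vanishes.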
The results were striking. On both datasets, the binary classifier achieved perfect or near-perfect detection performance on unseen data, with zero false positives for normal traffic. This means the system never misclassified benign packets as attacks. In multiclass settings, the more complex model achieved very high accuracy on Edge-IIoTset and perfect classification on MQTTset. Where errors did occur, they were concentrated between attacks within the same broad category, rather than between benign and malicious traffic. Across both datasets, the proposed models matched or outperformed significantly heavier baselines, including deep neural networks, recurrent models, and random forests, despite operating on a single packet at a time.
So what can we learn from this paper?
Well, the key takeaway here is that TCP's lowest-level mechanics do still encode meaningful behavioral signals, even in modern, encrypted, and heterogeneous IoT environments. This study shows that intrusion detection does not necessarily require payload inspection, flow reconstruction, or even large models with long temporal context. Instead, carefully chosen header-level representations can expose attack behavior directly and reliably. By reducing the problem to single packets and minimal models, the authors demonstrate a path toward intrusion detection systems that are not only accurate, but fast, privacy-preserving, and realistic for deployment on constrained devices. To dive deeper into their architecture diagrams, dataset construction, or per-class evaluation results, make sure you download the paper.