Today's article comes from the IEEE Open Journal of the Industrial Electronics Society. The authors are Shahri et al., from the University of Aveiro, in Portugal. In this paper they argue that the MQTT protocol is not suitable for industrial applications because it lacks timeliness guarantees. They propose a new system to overcome these limitations. Let's see what they came up with.
DOI: 10.1109/OJIES.2024.3373232
Let’s say you and I are Oompa Loompas in a chocolate factory. We have various jobs, and we do all the meaningful work, while the crazy guy in the hat gets all the credit. But that’s fine, it’s what we signed up for. Your job is to take caramel squares and dip them in chocolate. Then you hand them to me. I sprinkle a little salt on top and carefully wrap each chocolate-covered caramel in a cellophane wrapper, twist the ends, and then place that carefully in a box lined with tissue paper. You dip, I wrap.
Everything’s fine except you can dip way faster than I can wrap. It takes you 2 seconds to dunk a caramel, and it takes me 10 seconds to sprinkle, wrap, and put it in the box. So what do we do? If you dunk as fast as you can, you’re going to be sending me chocolates 5X faster than I can wrap them, and within a few minutes we’re going to have an “I Love Lucy” situation where I’m stuffing chocolates in my pockets to keep up with your pace. Nobody wants that.
So we have at least two options. We can, for example, install one of those fancy conveyor belts between us, set to deliver me a new chocolate only every 10 seconds, or we can set up cooling racks next to my station, where your dipped chocolates can sit and wait until I'm ready to wrap the next one.
In these two examples, the fancy conveyor belt and the cooling racks are both queues. And you and I, the Oompa Loompas, are components or "workers" in a distributed system (a factory, in this case). The queues in our factory, these special pieces of equipment that let us buffer the work that needs to be done between different components, are the core of all distributed systems. Whether those distributed systems are web applications, car assembly lines, or long-running MapReduce jobs, odds are there is some kind of queuing mechanism at their core. Queues are the fundamental data structure, the skeleton that makes giant, unwieldy, unpredictable systems possible. In my opinion, this was best articulated in 2014 in the Reactive Manifesto. If you've never read it, check out reactivemanifesto.org; it only takes a couple of minutes to read the whole thing.
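If you want to see the cooling-rack idea in code, here's a tiny illustrative sketch of my own (not anything from the paper) using Python's built-in bounded queue. The sizes and timings are made up; the point is that the queue buffers the fast producer and blocks it when the rack fills up.

```python
import queue
import threading
import time

# The "cooling rack" between the dipper and the wrapper: a bounded queue.
# When the rack is full, put() blocks, so the fast dipper is forced to
# slow down instead of burying the slow wrapper (backpressure).
rack = queue.Queue(maxsize=10)

def wrapper():
    while True:
        item = rack.get()           # blocks while the rack is empty
        time.sleep(10)              # sprinkling, wrapping, boxing is slow
        print(f"wrapped {item}")
        rack.task_done()

threading.Thread(target=wrapper, daemon=True).start()

for i in range(20):
    time.sleep(2)                   # dipping is fast
    rack.put(f"chocolate-{i}")      # blocks if the rack is full
    print(f"dipped chocolate-{i}")

rack.join()                         # wait until every chocolate is wrapped
```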
There are a number of commercial queuing products on the market. AWS SQS is probably the biggest name there; Google Cloud Pub/Sub is also very popular. But those kinds of cloud service offerings aren't always the right form factor for the system you're building.
By the way, don’t get hung up on the difference between a “pub/sub” and a “queue”. There are certainly differences but for the purpose of this episode we’re treating them as one family of products and using the terms interchangeably.
Anyway, sometimes it makes sense to use the cloud queues, and sometimes it makes sense to operate your own queue. Many of the open-source options aren't just applications; they're a novel queuing protocol plus an application built on top of it. So if you use RabbitMQ, that's all based on the AMQP protocol. If you use ZeroMQ, it's based on ZMTP.
Side note about ZeroMQ. If you want to go down a rabbit hole on queuing, there is a YouTube video from a decade ago called “Pieter Hintjens - Distribution, Scale and Flexibility with ZeroMQ” that only has, I think, 12k total views. It is in my top 5 best tech talks I’ve ever seen. If you’re in management or leadership in any way and you’ve never heard of Conway's Law, stop listening to this episode right now and go watch that instead. RIP to Pieter.
I digress. In addition to the AMQP protocol and the ZMTP protocol, there's a protocol called MQTT: the Message Queuing Telemetry Transport protocol. Like some of the others, it's an application-layer protocol built on top of TCP, so in the OSI model it lives in the same layer as things like HTTP, FTP, and SSH. MQTT has been around since the late 90s; it's big and open, it has an ISO standard, a governing body, and all that. It's not a small one-person-in-a-garage kind of thing. It's a big, legitimate protocol, and changes to it take time; there's a process and a bureaucracy to navigate. MQTT doesn't change on a whim.
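If you've never touched MQTT, here's roughly what it looks like from the application side: a publisher and a subscriber both connect to a broker and exchange messages on a topic. This is my own minimal sketch using the paho-mqtt client library (1.x-style API); the broker address and topic name are made up.

```python
# Minimal MQTT publish/subscribe sketch using paho-mqtt (1.x-style API).
# Broker address and topic name are made up for illustration.
import time
import paho.mqtt.client as mqtt
import paho.mqtt.publish as publish

BROKER = "broker.example.com"   # hypothetical broker
TOPIC = "factory/chocolates"

def on_message(client, userdata, msg):
    print(f"received on {msg.topic}: {msg.payload.decode()}")

# Subscriber: connect to the broker and listen on the topic.
sub = mqtt.Client()
sub.on_message = on_message
sub.connect(BROKER, 1883)
sub.subscribe(TOPIC, qos=1)     # QoS 1: at-least-once delivery
sub.loop_start()

# Publisher: fire a single message at the same topic.
publish.single(TOPIC, "caramel-42 dipped", qos=1, hostname=BROKER)

time.sleep(2)                   # give the message a moment to arrive
sub.loop_stop()
```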
So that brings us to today's paper, which, by my count, is at least the 6th paper by these authors on this topic. Since, I believe, 2021, they have been slowly building the case in a series of articles that MQTT isn't currently suitable for industrial applications, and they have a proposal for how exactly we can overcome that.
In previous work, they defined the problem space, mapped out possible options, and architected a solution; this paper enhances that solution and then runs a battery of benchmarking and analysis on it to show that it does what they claim it can do. Based on some of the hints they dropped in the paper, I'm going to guess that they have at least 5-6 more papers to publish on this, which will probably take them several more years, so this one comes right in the middle of the giant decade-long case they're making. Since we've never talked about their research at all, I figured now would be a good time to catch up on the problem they defined, the solution they've architected so far, and then give a preview of where this research is going over the next couple of years.
The problem with MQTT:
Think of MQTT as offering 3 types of message delivery: unicast (one publisher to a single subscriber), multicast (one publisher to a specific group of subscribers), and broadcast (one publisher to every subscriber).
Within those delivery types, MQTT offers different QoS (quality of service) guarantees, named QoS 0, 1, and 2. QoS 0 is "at most once" (fire and forget, no acknowledgment), QoS 1 is "at least once" (delivery is acknowledged, but duplicates are possible), and QoS 2 is "exactly once" (a heavier handshake that guarantees a single delivery).
The issue with these guarantees (QoS 0, 1, and 2) is not about what they offer, it's about what they don't. None of these guarantees says anything about timeliness. They describe what is going to happen, but not when it's going to happen. There's no way for me to say, for example: "I need at-least-once delivery within this specified time frame." Why does this matter? Well, let's go back to the chocolate factory. Let's say the caramels you're dipping right now are a special order for Violet's birthday, tomorrow. They're special caramels; they've got her name stamped right into the chocolate coating. That order needs to be boxed and packaged to go out by this afternoon. As you load up the cooling racks with chocolates, there's no way for you to communicate to me that not only should I prioritize these, but there is a specific deadline I need to hit.
This lack of temporal / timeliness / "predictable execution" guarantees, the authors argue, makes MQTT unsuitable for industrial environments. And that should make some kind of intuitive sense. If you're running a factory, you need to be able to assign work different priority levels and attach specific deadlines to its delivery and execution. Without that, your job is a lot harder.
What the authors propose:
Starting in MQTT version 5, which was released in 2019, developers can add what are called "user properties" to a message. Instead of the message having just a fixed header and a payload, there's a fixed header, a variable header, and a payload, and within the variable header you can attach arbitrary key-value pairs of data: the user properties. So in their proposal, they add a series of key-value pairs to the user properties describing each message's real-time requirements.
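Mechanically, attaching user properties looks something like the sketch below, written with the paho-mqtt library. To be clear, the property names here ("deadline-ms", "priority") are hypothetical placeholders of mine; the paper defines its own set of keys.

```python
# Sketch of attaching MQTT 5 user properties to a PUBLISH, using paho-mqtt.
# The property names below are hypothetical; the paper defines its own keys.
import paho.mqtt.client as mqtt
from paho.mqtt.properties import Properties
from paho.mqtt.packettypes import PacketTypes

client = mqtt.Client(protocol=mqtt.MQTTv5)   # user properties require MQTT v5
client.connect("broker.example.com", 1883)   # hypothetical broker
client.loop_start()

props = Properties(PacketTypes.PUBLISH)
props.UserProperty = ("deadline-ms", "50")   # made-up timeliness hint
props.UserProperty = ("priority", "high")    # made-up priority hint

client.publish("factory/chocolates/violet-order",
               "caramel-43 dipped", qos=1, properties=props)

client.loop_stop()
```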
But all of those new key-value pairs in the user properties don't magically do anything by themselves, right? It's like, if I drop a postcard in the mail and write on it "Hey USPS, deliver this by Thursday," that's not going to have any effect on anything at all. Conveying my timeliness desire doesn't matter if there isn't also the infrastructure in place to make good on my request. With USPS, that means buying the class of mail that guarantees delivery by a certain date. With MQTT, we need some kind of system that can look at the requests in the header and make good on them. That is where SDN comes in.
Very briefly, SDN (software-defined networking) is an approach where the forwarding behavior of switches, routers, and other network infrastructure is programmed by a central, software-based controller, so packets can be routed based on application-level logic rather than each device's fixed, locally configured rules. We covered SDN in more detail in the episode on October 9th titled "Traffic Classification in Software-Defined Networking Using Genetic Programming Tools." I'd recommend reviewing that episode for more details. And another sidenote: this is a good time to mention that, as of yesterday afternoon, the episode archive is up at JournalClub.io/archive. So if there are any episodes from the past that you missed and you want access to, like the SDN episode I just mentioned, go to the archive, find the episode name, and then send me a note with the episode name so I can re-trigger the email send for you. At some point, re-sends will be fully automated, but, baby steps.
Anyway, back to the story. When MQTT is running, and its publishers are sending unicasts, multicasts, and broadcasts through the MQTT broker and out to the subscribers, the packets involved may be getting routed through SDN (if you’ve set up that kind of system). So in the authors’ proposal, the SDN looks at the headers, sees the timeliness requests, and then performs the routing in a way that maximizes the chances of those timeliness requests being honored.
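To make that a little more concrete, here's a rough sketch of the kind of thing an SDN controller can do, written as a Ryu OpenFlow 1.3 app. This is not the authors' controller logic; it simply installs a rule that puts MQTT traffic (TCP port 1883) onto a switch egress queue we assume has been configured as high priority, which is the basic mechanism a controller can use to favor time-critical flows.

```python
# Rough sketch: a Ryu OpenFlow 1.3 app that steers MQTT traffic (TCP 1883)
# onto egress queue 1, assumed to be configured on the switch as the
# high-priority queue. Not the authors' actual controller.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class MqttPriority(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def on_switch_connect(self, ev):
        dp = ev.msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser

        # Match IPv4/TCP traffic headed for the MQTT broker port.
        match = parser.OFPMatch(eth_type=0x0800, ip_proto=6, tcp_dst=1883)

        # Put matching packets on the high-priority queue. Flooding is a
        # simplification; a real app would forward to the learned port.
        actions = [parser.OFPActionSetQueue(1),
                   parser.OFPActionOutput(ofp.OFPP_FLOOD)]
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]

        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=100,
                                      match=match, instructions=inst))
```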
And that's basically it, in a nutshell. The systems they defined are called RT-MQTT (Real-Time MQTT) and MRT-MQTT (Multicast Real-Time MQTT); the latter is, for all intents and purposes, a superset of RT-MQTT with extra logic for multicast use cases.
Most of the paper is spent defining the architecture, governing logic, and algorithms for how the SDN side would operate, but then the authors spend a considerable amount of time simulating and analyzing the performance of their new system. They set up emulations in Mininet, which is a network emulator, and used the Ryu OpenFlow controller to implement the SDN control plane. They ran the system under low, medium, and high loads and focused their analysis on WCRT.
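For reference, here's roughly what the skeleton of that kind of emulation looks like: a Mininet topology whose switch is handed to a remote OpenFlow controller such as Ryu. This is a generic sketch of mine, not the authors' actual test topology, and the host names are made up.

```python
# Generic sketch of a Mininet emulation driven by a remote OpenFlow
# controller (e.g. Ryu on localhost). Not the authors' test setup.
from mininet.net import Mininet
from mininet.node import RemoteController, OVSSwitch
from mininet.topo import Topo
from mininet.cli import CLI


class BrokerTopo(Topo):
    def build(self):
        s1 = self.addSwitch('s1')
        # One MQTT broker, one publisher, one subscriber, all on one switch.
        for name in ('broker', 'pub', 'sub'):
            self.addLink(self.addHost(name), s1)


if __name__ == '__main__':
    net = Mininet(topo=BrokerTopo(), switch=OVSSwitch,
                  controller=lambda name: RemoteController(
                      name, ip='127.0.0.1', port=6633))
    net.start()
    CLI(net)   # drop into the Mininet CLI to generate traffic by hand
    net.stop()
```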
WCRT stands for Worst-Case Response Time. If you're familiar with Big O notation for algorithms, it's similar in spirit: both represent upper bounds, except WCRT bounds the response time of a distributed system rather than the complexity of an algorithm. To bound the WCRT, they used two analysis methods, HA and TA. HA (Holistic Approach) adds up the cumulative delays across all nodes and switches on the path, while TA (Trajectory Approach) calculates the latest time a message could start at its final destination by working backward through the network path. HA tends to be more pessimistic, while TA requires more computation and is more complex to implement, so they considered both. Here are the broad strokes of their results:
The road ahead:
The results so far are promising, but the authors aren't done yet. Over the next few years, we can expect more papers from them that will go further and deeper on this. Future research is likely to focus on improving scalability across even larger multi-edge networks, refining the precision of the real-time schedulability analyses, and exploring advanced techniques for handling mixed traffic loads with minimal impact on non-real-time data. They may also investigate integrating more robust security mechanisms that don't compromise the timeliness guarantees.
We’ll be keeping an eye on them as they continue to publish. And when there’s a big announcement, you’ll hear about it here on Journal Club.
If you’d like to view the architecture diagrams they created for this system, or the formulas that govern its execution, please do download the paper. If you’d like to read the previous papers this team authored on this subject, they’re listed in the reference section, as reference numbers 12-16.