Free Sample Episode

Rate-adaptive RDMA congestion control for AI clusters

Today's article comes from the Journal of Cloud Computing. The authors are He et al., from the Nanjing University of Posts and Telecommunications, in China. In this paper, they showcase a congestion-control solution for networks that support large-scale GPU training.

DOI: 10.1186/s13677-025-00830-0

Download the Audio (Right-click, Save-As)

When you drive from Oakland to San Francisco you usually cross the Bay Bridge. Highways from north, south and east all converge into the toll plaza that sits just before the bridge itself. That's where you pay the crossing fee. At that point you've got 18 lanes, all packed with cars. But here's the thing. The bridge itself is only 5 lanes wide.

So the engineers who built this system had a problem: how do you get 18 lanes of traffic to merge into 5 in a very short distance without causing traffic jams, bottlenecks, or accidents? More to the point: how do you construct this system so that it's as efficient as possible, so that the maximum number of cars can enter and cross the bridge per minute? Not so many cars that the traffic slows, but not so few cars that you're holding people back for no reason. Their solution? Metering lights. The 18 toll booth lanes merge into 12 lanes that each have their own stoplight. The stoplights are set up to let just the right number of cars from each lane enter the bridge at just the right time. If the engineers want to modify the logic, it's as simple as changing the timing or order of the lights. No need to re-mark the lanes or add any extra systems to the bridge.

And it works! The system easily moves over a quarter-million cars across the bridge every day, many of them crossing during the rush hour commute.

So why am I telling you all this? Well, it turns out that this concept: metering the entry of participants into a system, and optimizing that logic for maximum throughput...that's applicable to more than just roads and bridges. In the paper we're looking at today, the authors are applying it to RDMA traffic inside of data centers, and specifically to the problem of congestion control in the networks that support large-scale GPU training.

Instead of cars entering a bridge, the participants here are the thousands of concurrent data flows generated during ML training runs. Just like the Bay Bridge, the bottleneck is right at the point where everything converges. Modern clusters can unknowingly inject far more traffic into a switch than the downstream links can carry, causing backups, head-of-line blocking, and Priority Flow Control pauses. There are existing congestion control schemes, for sure. But they often react too slowly. By the time they adjust their sending rates, the queues have already formed, flow control has kicked in, and throughput has collapsed.

In this paper they propose a potential solution: a switch-driven, rate-adaptive congestion control mechanism that meters traffic at the point of convergence, adjusting flow rates in real time so the network stays fast, stable, and fully utilized. On today's episode we'll explore how their system works, and how well it performed when they put it to the test. Let's dive in.

Let's start with some context. Modern data centers use something called RDMA, Remote Direct Memory Access. It lets one machine read from or write to another machine's memory directly, without involving the remote CPU or operating system. The result is much lower CPU overhead and significantly reduced network latency. For AI workloads these kinds of gains are critical.

RDMA typically runs over something called RoCEv2, RDMA over Converged Ethernet, version 2. It's a protocol that uses a mechanism called PFC: Priority-based Flow Control, to ensure lossless transmission. PFC is essentially a hop-by-hop control mechanism. When a downstream switch's queue exceeds a threshold, it sends a pause signal to the upstream switch, telling it to temporarily stop sending data. This does prevent packet loss, but it isn't without its drawbacks. The pause mechanism is coarse, meaning it stops all traffic on a priority class, not just the problematic flows. This can cause PFC deadlock and head-of-line blocking, where innocent flows get paused along with the congested ones. This of course degrades the overall network performance.
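To make the hop-by-hop idea concrete, here's a minimal sketch of the pause logic in Python. The threshold values and return strings are illustrative assumptions, not anything specified by RoCEv2 or by the paper:

```python
# Toy model of PFC's hop-by-hop behavior: when a downstream queue for a
# priority class crosses a threshold, the switch pauses ALL upstream traffic
# in that class, including flows that aren't causing the congestion.
PAUSE_THRESHOLD = 100   # queue depth (packets) that triggers a pause frame (assumed)
RESUME_THRESHOLD = 60   # queue depth at which the upstream may resume (assumed)

def pfc_action(queue_depth: int, currently_paused: bool) -> str:
    """What the downstream switch signals to its upstream neighbor."""
    if not currently_paused and queue_depth >= PAUSE_THRESHOLD:
        # Note the coarseness: this pauses EVERY flow in the priority class,
        # not just the flows that filled the queue.
        return "PAUSE"
    if currently_paused and queue_depth <= RESUME_THRESHOLD:
        return "RESUME"
    return "NO_CHANGE"
```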

To address this, a number of enhanced congestion control protocols have been developed: DCQCN, TIMELY, HPCC, and others. They use different signals to detect congestion:

  • DCQCN uses Explicit Congestion Notification marks on packets.
  • TIMELY measures Round-Trip Time.
  • HPCC uses In-Network Telemetry to estimate congestion.

The problem is, these end-to-end approaches still have a fundamental limitation: they require at least one full Round-Trip Time (RTT) to respond to congestion.

That's a problem. An enormous amount of data can flood the network within the span of a single RTT. So severe congestion can pop up before the sender even realizes there's a problem. And if you have bursty traffic (as you do with AI training workloads), this is especially problematic. Both for the health of the network and for the time it takes you to actually train.

So what can we do about it? The authors' solution is called FACC: Fast and Accurate Congestion Control. It takes a three-pronged approach: fine-grained congestion feedback, adaptive rate adjustment, and rapid convergence. Let's walk through each of them.

First there's the congestion feedback. Rather than waiting for signals to travel from the switch to the receiver and back to the sender, FACC enables switches to send congestion notifications directly to the sender as soon as they detect a problem. This shortens the congestion control loop. But making feedback faster isn't actually enough. You also need to know which flows are actually causing the congestion. For that, FACC classifies each switch port into three states: non-congested, congested, and uncertain.

  • A port is congested when it's transmitting packets at line rate and its queue length exceeds a threshold.
  • A port is uncertain when its queue is above the threshold but transmission exhibits a periodic pause-and-resume pattern, which is typically caused by PFC.
  • Otherwise, the port is treated as non-congested (a rough sketch of this classification follows below).
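Here's that classification sketched in Python. The threshold value, and the exact way a switch would detect "transmitting at line rate" or "paused by PFC," are assumptions for illustration; the paper defines the real conditions:

```python
from enum import Enum

class PortState(Enum):
    NON_CONGESTED = 0
    CONGESTED = 1
    UNCERTAIN = 2

QUEUE_THRESHOLD = 50  # packets; an illustrative value, not the paper's

def classify_port(queue_len: int, tx_at_line_rate: bool, paused_by_pfc: bool) -> PortState:
    """Classify an egress port from locally observable state."""
    if queue_len > QUEUE_THRESHOLD and tx_at_line_rate:
        # Draining as fast as the link allows yet still building a queue:
        # the flows through this port are genuinely oversubscribing it.
        return PortState.CONGESTED
    if queue_len > QUEUE_THRESHOLD and paused_by_pfc:
        # Long queue, but transmission keeps stopping and resuming: the port
        # is a victim of downstream PFC pauses, not the root cause.
        return PortState.UNCERTAIN
    return PortState.NON_CONGESTED
```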

When a packet passes through a congested port, FACC generates a Congestion Notification Packet with a congestion state marker. When it passes through an uncertain port, it gets an uncertain state marker.

These markers use specific bits in the Type of Service field of the IP header: one pattern means uncertain, another means congested. But how does the notification get back to where it needs to go? Well, the notification packet has its addressing reversed, so the destination becomes the source and the source becomes the destination. This means it gets forwarded back to the sender through the ingress port, and the sender can identify which flows are responsible for congestion versus which flows are innocent victims of PFC pauses.
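As a rough illustration, here's how that marking and address reversal might look in Python. The ToS bit patterns and the header fields shown are assumptions made for the sketch; the paper specifies the actual encodings:

```python
from dataclasses import dataclass

# Placeholder bit patterns for the ToS field -- the paper defines the real ones.
TOS_CONGESTED = 0b01
TOS_UNCERTAIN = 0b10

@dataclass
class PacketHeader:
    src: str   # sender address
    dst: str   # receiver address
    tos: int = 0

def make_cnp(data_pkt: PacketHeader, port_state: str) -> PacketHeader:
    """Build a congestion notification that travels back to the sender."""
    tos = TOS_CONGESTED if port_state == "congested" else TOS_UNCERTAIN
    # Reverse the addressing: the original sender becomes the destination,
    # so ordinary forwarding carries the notification back out the ingress port.
    return PacketHeader(src=data_pkt.dst, dst=data_pkt.src, tos=tos)
```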

But how does the sender actually measure the severity of congestion? Well, FACC leverages a property called packet conservation. In a lossless Ethernet network with hop-by-hop flow control, packets don't get dropped. Every injected packet is either in flight in the network, queued at a switch port, or delivered and acknowledged. Because of this, FACC tracks two things during each time interval: the number of packets sent and the number of congestion notifications received. The congestion level is simply the ratio of notifications received to packets sent.
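In code, the sender-side measurement is about as simple as it sounds. A minimal sketch, assuming the congestion level is the per-interval ratio of notifications to packets sent:

```python
def congestion_level(packets_sent: int, notifications_received: int) -> float:
    """Fraction of this interval's packets that traversed a congested or uncertain port."""
    if packets_sent == 0:
        return 0.0
    return notifications_received / packets_sent
```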

It's worth noting that a packet might traverse multiple bottleneck links and trigger multiple notifications with different congestion signals. FACC handles this by assigning higher priority to notifications carrying the congestion state mark than those carrying the uncertain state mark. A table maintains the mapping between packet sequence numbers and congestion signals for each flow. When the same flow triggers multiple notifications, the congestion signal updates only if the new signal has higher priority.
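A small sketch of that bookkeeping, with assumed priority values and table layout:

```python
# Higher number = higher priority; "congested" overrides "uncertain".
SIGNAL_PRIORITY = {"uncertain": 1, "congested": 2}

# (flow_id, packet_sequence_number) -> recorded congestion signal
signal_table: dict[tuple[str, int], str] = {}

def record_signal(flow_id: str, seq: int, signal: str) -> None:
    """Keep only the highest-priority signal seen for a given packet."""
    key = (flow_id, seq)
    existing = signal_table.get(key)
    if existing is None or SIGNAL_PRIORITY[signal] > SIGNAL_PRIORITY[existing]:
        signal_table[key] = signal
```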

For the rate adjustment, the system employs a Proportional-Integral (PI) controller, borrowed from control theory. The authors wrote a utility function that combines two components. The first is the proportional component, which measures the difference between the current congestion degree and the desired value. This enables a prompt response to deviations from steady state. The second is the integral component, which measures the difference between the current congestion degree and its value in the previous cycle. This is what facilitates rapid convergence toward equilibrium. The utility function is kept non-negative to avoid unreasonable rate adjustments, and the sending rate gets calculated from its value. In an RDMA data center, the target sending rate is the line rate. There's also a threshold that distinguishes mild from severe congestion: when the utility value exceeds this boundary, gradual rate reduction becomes insufficient and FACC triggers an immediate flow halt.
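Here's a loose sketch of what such an update loop could look like. The gains, the target congestion level, and the severe-congestion threshold below are all placeholder values chosen for illustration; the paper derives the actual utility function and its constants:

```python
LINE_RATE = 100e9         # target rate, e.g. a 100 Gbps NIC, in bits/second (assumed)
TARGET_CONGESTION = 0.05  # desired steady-state congestion degree (assumed)
KP, KI = 0.8, 0.2         # proportional / integral gains (assumed)
SEVERE_THRESHOLD = 1.0    # utility value beyond which the flow is halted (assumed)

def update_rate(rate: float, congestion: float, prev_congestion: float):
    """Return (new_rate, should_pause) for the next control interval."""
    # Proportional term: distance from the desired congestion degree.
    proportional = congestion - TARGET_CONGESTION
    # Integral-style term: how the congestion degree moved since the last cycle.
    integral = congestion - prev_congestion
    # Keep the utility non-negative to avoid unreasonable adjustments.
    utility = max(0.0, KP * proportional + KI * integral)

    if utility > SEVERE_THRESHOLD:
        # Gradual reduction won't drain the queue in time: halt the flow briefly.
        return rate, True

    if utility == 0.0:
        # No congestion pressure: probe back up toward line rate.
        return min(LINE_RATE, rate * 1.05), False

    # Otherwise back off in proportion to the utility value.
    return max(rate * (1.0 - utility), 0.01 * LINE_RATE), False
```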

When a pause is necessary, FACC calculates a halt duration to avoid indefinite suspension. During each cycle, the sender monitors the number of packets sent and the number of notifications received. If a pause is needed, the difference between packets sent in the previous cycle and notifications received in the current cycle estimates the number of packets queued at the congested switch. This value divided by the notification reception rate determines how long to pause.
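A tiny sketch of that estimate, with assumed variable names:

```python
def pause_duration(sent_prev_cycle: int, notifs_this_cycle: int,
                   notif_rate_per_sec: float) -> float:
    """Estimate how long to halt the flow, in seconds."""
    # Packets sent last cycle minus notifications seen this cycle approximates
    # how many of our packets are still queued at the congested switch.
    backlog_estimate = max(0, sent_prev_cycle - notifs_this_cycle)
    if notif_rate_per_sec <= 0:
        return 0.0
    # Pause long enough for that backlog to drain at the observed rate.
    return backlog_estimate / notif_rate_per_sec
```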

The question is: does any of this actually work? To find out, the authors constructed a simulation of a fat-tree topology with hundreds of servers. Each group of servers connects to a Top-of-Rack (ToR) switch. Above that, the switches are organized into pods, with each ToR connecting to multiple aggregation switches through high-speed uplinks. At the top, core switches interconnect all the pods, with each aggregation switch connecting to every core switch in a full mesh. This structure provides equal-cost multipath connectivity between any pair of servers.
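To make the wiring concrete, here's a rough sketch that builds that style of topology as a list of links. The pod and switch counts are arbitrary illustrative numbers, not the authors' exact simulation parameters:

```python
def build_fat_tree(num_pods: int = 4, tors_per_pod: int = 4,
                   aggs_per_pod: int = 4, num_core: int = 8) -> list[tuple[str, str]]:
    """Return (lower_switch, upper_switch) link pairs for the described topology."""
    links = []
    for pod in range(num_pods):
        for tor in range(tors_per_pod):
            # Each ToR connects to every aggregation switch in its pod.
            for agg in range(aggs_per_pod):
                links.append((f"pod{pod}-tor{tor}", f"pod{pod}-agg{agg}"))
        for agg in range(aggs_per_pod):
            # Each aggregation switch connects to every core switch (full mesh).
            for core in range(num_core):
                links.append((f"pod{pod}-agg{agg}", f"core{core}"))
    return links

# Example: a small instance has 4 * (4*4 + 4*8) = 192 links.
# print(len(build_fat_tree()))
```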

Then they compared FACC against six other congestion control protocols. All the benchmarks were implemented using the same framework, and traffic was generated based on realistic workloads from production data centers. Some were skewed toward small flows, while others had a more balanced mix of small, medium, and large flows.

In the end, the results were pretty impressive. With burst flows, FACC dramatically reduced both the maximum and average queue length at the congested switch port compared to other protocols. And more importantly, congestion duration dropped substantially. For convergence, FACC reached stable rates quickly with minimal overshoot, and approached the fair-share rate rapidly.

That might all sound great, but how would you actually implement any of this? Well, FACC only requires a few minor modifications on both the switch and host sides. The basic functions can be implemented on any programmable switch that supports the P4 language. At the sender, the system needs a timer, counters for transmitted packets and received notifications, and logic to calculate the congestion level and adjust rates through the PI controller. Overall, it's a fairly manageable implementation burden.

If you're working on data center networking, RDMA systems, or AI infrastructure, I'd encourage you to download the full paper. The authors provide mathematical derivations of the control system stability analysis, complete diagrams of the system architecture, ablation studies testing individual components, and additional experimental results across different workload mixes. There's a ton of technical depth and an exploration of various edge cases that we just didn't have time to cover here.