Free Sample Episode

Application of Proximal Policy Optimization for Resource Orchestration in Serverless Edge Computing

Today's article comes from Computers, an MDPI journal. The authors are Femminella et al., from the University of Perugia in Italy. This paper takes on the serverless "cold start" problem and proposes a fix for it. If you've ever used AWS Lambda, Google Cloud Functions, or Azure Functions, I think you'll appreciate this research.

DOI: 10.3390/computers13090224

Download this Episode (Right-click, Save-As)

Serverless is great. In a number of ways it's truly a transformative technology. And it's spreading: every year more hosts jump into this space to offer their own Function as a Service (FaaS) platforms: from AWS, GCP and Azure, to the edge offerings available on Cloudflare, Fastly and Akamai, to the open-source options like OpenFaaS.

But as with anything, serverless is not without its drawbacks. From the "uncapped costs" problem, to the "how do I run this locally?" problem, there are obviously a few areas where the developer-experience could be improved. And that's to be expected given how nascent the technology still is. But there is one issue that's a deal-breaker for many devs. An issue so off-putting that it actually prevents them from adopting serverless at all. That issue is the "cold start" problem.

In the tech world there are actually two things referred to as the “cold start” problem. One has to do with starting marketplaces; that usage of the term was popularized by Andrew Chen in his book of the same name (which is a great book, btw). The other use of the term refers to the fact that serverless functions have a cold state and a warm state. When they’re warm, a request sent to a serverless function can return a response very quickly. For most cloud offerings that means as little as single-digit milliseconds. But in a cold state there is a noticeable delay in the request processing. Cold start delays can range from hundreds of milliseconds to multiple seconds. So maybe "noticeable" is an understatement.

In this paper the authors take on the serverless cold start problem. They examine why it happens, then they spin up an OpenFaaS system and augment its internal scaling function with a machine learning model. That model is optimized to minimize cold starts while also minimizing overall costs. First, a little background:

Cold starts happen for a fairly simple reason: someone has made a request to a container that doesn’t exist yet. You see, under the hood serverless functions are just lightweight applications running within a container. The FaaS platform abstracts all this away, so that you as the FaaS developer don’t have to worry about having your application listen on a port or do any bootstrapping. You just write a function, and the platform turns it into an application. But in order for your application to process an incoming request, it still needs to exist and be running somewhere. In most cases a gateway (usually provided by the FaaS infrastructure) sits in front of all the functions, accepts the incoming requests, and attempts to relay the request to an available container that has your application running inside of it. Now let’s say a request comes in but your system only has one container running your application and that instance of your application is busy processing a different request. Well the system has two options: it can either wait for the busy container to finish, or it can spin up a new container and route the request there. Neither one of these options is instantaneous. Waiting takes time, as does spinning up a new container. This time delay between the request coming in and it getting routed to an available container is the cold-start.
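To put some rough numbers on that tradeoff, here's a toy latency model comparing the warm path, the wait-behind-a-busy-container path, and the cold-start path. The figures are my own illustrative assumptions, not measurements from the paper:

  # Toy model of per-request latency; every number here is an illustrative assumption.
  WARM_SERVICE_MS = 5      # handled by an idle, already-running container
  QUEUE_WAIT_MS = 300      # average wait if we queue behind a busy container
  COLD_START_MS = 1500     # time to pull and boot a new container before it can serve

  def expected_latency_ms(idle_containers: int, strategy: str = "spin_up") -> float:
      """Rough latency estimate for a single incoming request."""
      if idle_containers > 0:
          return WARM_SERVICE_MS                  # warm path: route immediately
      if strategy == "wait":
          return QUEUE_WAIT_MS + WARM_SERVICE_MS  # wait for a busy container to free up
      return COLD_START_MS + WARM_SERVICE_MS      # cold path: boot a new container first

  print(expected_latency_ms(idle_containers=1))   # ~5 ms, warm
  print(expected_latency_ms(idle_containers=0))   # ~1505 ms, cold start
  print(expected_latency_ms(0, strategy="wait"))  # ~305 ms, queued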

The key thing to know is that when the FaaS infrastructure spins up a new container to process an incoming request, it doesn't immediately terminate that container afterwards. That container sticks around for anywhere from a few seconds to a few minutes, making itself available to process other new requests. When the system has at least one extra container sitting around waiting to process the next request, we say the system is "warm". If you have very consistent load, that is, you receive roughly the same number of requests every minute all day, your system will likely stay warm all the time, and after the initial ramp-up you'll never experience a cold start.

But in startup-land especially, most applications don't experience consistent load; they experience peaky load: a bunch of hits all at once, then none for a while, then another spike, then a plateau, up and down, up and down. And in those types of scenarios, your users will be hitting cold starts all the time. Parts of your system that are supposed to be fast and snappy will be unacceptably slow, in an intermittent way that will make for a very inconsistent and frustrating user experience.

To try to avoid cold-starts, most FaaS platforms have some kind of decision-logic built into an auto-scaling function that preemptively starts to scale the number of containers according to different thresholds and trip-wires. The problem is: these systems are largely just not very good. They're either too aggressive (meaning they spin up extra containers too often which wastes money), or they're too conservative (meaning they don't respond quickly enough to load spikes). What's needed is a robust solution that can adapt to the particular load patterns of the application, and teach itself to spin up just the right amount of containers at the right time.
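For context, the stock Kubernetes HPA decision is essentially a proportional rule around a target metric (typically CPU utilization): scale so that average utilization lands back on the target. A simplified sketch of that threshold logic, with made-up numbers, looks like this:

  import math

  def desired_replicas(current_replicas: int, current_cpu_util: float,
                       target_cpu_util: float = 0.5,
                       min_replicas: int = 1, max_replicas: int = 20) -> int:
      """Roughly how a threshold-based autoscaler like the k8s HPA picks a replica count."""
      desired = math.ceil(current_replicas * (current_cpu_util / target_cpu_util))
      return max(min_replicas, min(max_replicas, desired))

  # If 3 replicas are running at 90% CPU against a 50% target, scale up to 6.
  print(desired_replicas(current_replicas=3, current_cpu_util=0.9))  # -> 6

The catch is that a proportional rule like this only reacts to utilization it has already observed, which is exactly why it tends to lag behind sudden spikes.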

Enter: this research paper.

The authors decided to build a proof-of-concept of an auto-scaling function that uses Reinforcement Learning instead of thresholds and trip-wires. They built it on top of OpenFaaS, which itself runs on Kubernetes under the hood. So the autoscaling function they needed to extend was actually the Kubernetes Horizontal Pod Autoscaler (HPA), the same autoscaling logic that expands and shrinks deployments in any other k8s cluster in response to demand. In fact, side note: the default autoscaler in OpenFaaS Community Edition has scaling limits not found in Kubernetes, so the authors swapped it for the actual HPA from k8s.

The scaling algorithm is a control loop. It runs the same logic over and over, at a default interval of 60 seconds. For this research the authors lowered it to 15-second intervals and modified it to run inference on their model on each iteration. At runtime, that model takes in all the contextual information about the state of the system and outputs a scaling decision.
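Conceptually, the modified loop looks something like the sketch below: every 15 seconds, collect the current system state, ask the trained policy for an action, and apply the resulting replica count. The metric-gathering stub, the kubectl-based scaling call, the model path, and the three-action mapping are my own stand-ins, not the authors' implementation:

  import subprocess
  import time

  import numpy as np
  from stable_baselines3 import PPO

  INTERVAL_S = 15  # the paper shortens the control loop from 60 s to 15 s

  def observe_cluster() -> np.ndarray:
      """Stand-in: a real version would scrape the latency, CPU, RAM, request-rate,
      success-rate and time-of-day metrics described later in this article."""
      return np.zeros(7, dtype=np.float32)

  def apply_replicas(fn: str, replicas: int) -> None:
      """Stand-in scaling call; shelling out to kubectl is just one possible mechanism."""
      subprocess.run(["kubectl", "scale", f"deployment/{fn}",
                      f"--replicas={replicas}", "-n", "openfaas-fn"], check=True)

  def control_loop(fn: str, model_path: str = "autoscaler_policy.zip") -> None:
      model = PPO.load(model_path)  # hypothetical path to the trained policy
      replicas = 1
      while True:
          obs = observe_cluster()
          action, _ = model.predict(obs, deterministic=True)  # 0/1/2 = down/hold/up
          replicas = max(1, replicas + int(action) - 1)
          apply_replicas(fn, replicas)
          time.sleep(INTERVAL_S)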

Now they needed a model that was optimized to:

  1. Minimize the number of functions (containers) instantiated
  2. Maintain a given SLA (hard limit) on response times

So how did they train it?

They started by downloading the Azure Functions Trace 2019 dataset. This is a dataset of invocation profiles and usage patterns that captures real-life demand on Azure Functions in the wild. They set up a cluster of OpenFaaS functions and used a load tester called “hey” to replay the load patterns found in that dataset. While the system ran, they collected a number of metrics at specific intervals (a sketch of what that replay-and-sample loop might look like follows the list). Namely:

  1. Mean response latency
  2. The CPU threshold in the HPA
  3. Total and average CPU usage
  4. Total and average RAM usage
  5. Number of requests received
  6. Success rate of requests
  7. The time of day
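Here is that replay-and-sample sketch: a loop that plays back each minute of the trace with hey and then pulls the metrics above. The pre-aggregated CSV file, the gateway URL, and the Prometheus queries are assumptions for illustration, not the paper's actual tooling:

  import csv
  import subprocess

  import requests

  GATEWAY_URL = "http://127.0.0.1:8080/function/my-function"  # assumed OpenFaaS route
  PROMETHEUS_URL = "http://127.0.0.1:9090/api/v1/query"       # assumed Prometheus endpoint

  def replay_minute(invocations: int) -> None:
      """Fire one minute's worth of invocations from the trace using the hey load tester."""
      subprocess.run(["hey", "-n", str(invocations), "-c", "20", GATEWAY_URL], check=True)

  def sample_metric(promql: str) -> float:
      """Pull a single scalar from Prometheus; the query strings are illustrative."""
      resp = requests.get(PROMETHEUS_URL, params={"query": promql}, timeout=5)
      return float(resp.json()["data"]["result"][0]["value"][1])

  with open("azure_trace_minutely_counts.csv") as f:  # assumed pre-aggregated trace file
      for row in csv.DictReader(f):
          replay_minute(int(row["invocations"]))
          mean_latency = sample_metric(
              "rate(gateway_functions_seconds_sum[1m])"
              " / rate(gateway_functions_seconds_count[1m])")
          # ...sample CPU, RAM, request counts and success rate the same way, then log them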

This collected data became the dataset they would use to train their model. They trained the model using PyTorch and Gymnasium, and implemented PPO (Proximal Policy Optimization) using the Stable-Baselines3 package. The PPO algorithm (a Reinforcement Learning algorithm) paid attention to rewards associated with each incoming request. During training, the system assigned negative rewards to failed requests and high-latency requests, and positive rewards to requests that were successful and processed within the SLA. So the process of training was really just the PPO algorithm replaying the load tests and adjusting its policy parameters in order to maximize the rewards. After training, the authors plugged their model into the autoscaling logic they had implemented, then ran additional simulations to validate the system.
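Here's a minimal sketch of what that training setup might look like with Gymnasium and Stable-Baselines3. The environment is a stub: the SLA threshold, reward values, and observation bounds are illustrative assumptions, and a real version would drive the actual load replay (or a simulator built from the collected data) instead of returning placeholders:

  import gymnasium as gym
  import numpy as np
  from gymnasium import spaces
  from stable_baselines3 import PPO

  SLA_MS = 500.0  # illustrative response-time limit; the paper defines its own SLA

  class AutoscalerEnv(gym.Env):
      """Stub env: observations are the seven collected metrics, actions adjust replicas."""

      def __init__(self):
          super().__init__()
          # latency, HPA CPU threshold, CPU use, RAM use, requests, success rate, time of day
          self.observation_space = spaces.Box(0.0, np.inf, shape=(7,), dtype=np.float32)
          self.action_space = spaces.Discrete(3)  # 0 = scale down, 1 = hold, 2 = scale up

      def reset(self, *, seed=None, options=None):
          super().reset(seed=seed)
          return np.zeros(7, dtype=np.float32), {}

      def step(self, action):
          # A real version would apply `action`, replay the next slice of the trace,
          # and read back the resulting metrics. Here we fake a measurement.
          latency_ms, success_rate, replicas = 120.0, 1.0, 2
          obs = np.zeros(7, dtype=np.float32)
          # Reward shaping in the spirit of the paper: penalize failures and SLA misses,
          # reward fast successful requests, and discourage over-provisioning.
          if success_rate < 1.0 or latency_ms > SLA_MS:
              reward = -1.0
          else:
              reward = 1.0 - 0.05 * replicas
          return obs, reward, False, False, {}

  model = PPO("MlpPolicy", AutoscalerEnv(), verbose=1)
  model.learn(total_timesteps=10_000)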

The results were interesting: at small amounts of load, the model is actually far more aggressive than the normal HPA: it proactively warms up containers and operates practically on a hair-trigger. But as load grows it gets more and more efficient. At significant load the new model was able to not only maintain the SLA but do so with a 100% improvement in resource utilization, meaning it was allocating the right number of containers just in time.

As someone who has worked with FaaS systems, I think this research is a big win. I’ve personally struggled with the cold-start problem, and I've resorted to naive workarounds like pinging all my functions several times a second just to make sure there’s always more than enough capacity available. I’ve hoped that a better, more elegant solution would come around eventually and I think this is it. Or at least this is the right direction.

If you're using a hosted FaaS (like AWS Lambda), a solution like this isn't something you can necessarily implement yourself; you don't have access to the autoscaling function. But it does shine a light on possible mechanisms that the cloud providers could use to improve the cold start issue for their clients. I for one would love to be given the option to opt my functions into a service like this. While it is more aggressive during low load, the fact that it's so much more efficient during higher load means that, in theory, a model like this could save you money while improving the user experience of your application. Win-win.

If you’re running your own FaaS cluster and want to try augmenting the HPA algorithm yourself, definitely download the paper to read more details about how they trained the model and integrated it into OpenFaaS. And for the rest of us, just be on the lookout for cold-start optimization options coming down the pike from the cloud providers sometime in the future. Fingers crossed.