
On the Optimization of Kubernetes toward the Enhancement of Cloud Computing

Today's article comes from MDPI's Journal of Mathematics. The authors are Mondal et al., from the Macau University of Science and Technology. In this paper, the authors answer a simple question: What can you do when you outgrow Kubernetes? They point out six specific areas where k8s struggles to scale, and develop a suite of solutions to break through those bottlenecks. Let's dive in.

DOI: 10.3390/math12162476


Way back in 2014, Google released an open-source version of Borg, their in-house cluster management system. Borg was immense and highly specific to Google’s systems and servers, but this new open-source version would be different: It would be smaller, portable to different types of systems and hardware, focused narrowly on container orchestration, and generally useful for a number of workloads. They called the new project: Kubernetes.

Kubernetes came out at exactly the right time. Adoption of Docker had spiked in the previous years, microservices were all the rage, and companies of all sizes were grappling with a non-trivial task: container orchestration. With no robust turnkey solutions available, developers were either building one-off systems from scratch, or struggling to port their existing CI/CD systems to work with containers. Google was uniquely positioned to step in. They had been early adopters of containers, and had spent a decade learning how to run them at scale. Kubernetes (or k8s as it came to be known) was the distillation of everything they’d learned...just scaled down.

It’s now ten years later. Kubernetes is as popular as ever, and is still gaining market share. The enterprise adoption of Kubernetes (and the workloads those enterprises are running) has meant that the framework is being pushed beyond its original scope. Remember, k8s was supposed to be a scaled-down version of Borg, not a fully-featured replacement. And now k8s is being pushed to levels of scale and complexity that it, arguably, was never intended to facilitate.

As k8s users have scaled their deployments, they’ve hit serious growing pains, and have had to figure out workarounds for the bottlenecks. The authors of this study are among those users. This paper details the playbook they've developed to scale k8s clusters beyond the framework's original limits.

Before we jump into the list of modifications they made, let’s review the high-level architecture of Kubernetes as a whole. K8s has a distributed Primary-Secondary architecture (what we used to refer to as "Master-Slave"). The brain of the system is the Primary node. It’s responsible for scheduling, coordinating and assigning tasks to the secondary nodes, as well as storing configuration data. The Secondary nodes receive and process task instructions from the Primary node, and are responsible for governing the full lifecycle of the containers they carry. There are also a number of standalone CLI tools and SDKs/APIs that let a user interact with and configure any of the servers in their system. The CLI abstracts away the Primary-Secondary convention, opting instead to present Pods, Labels, Deployments, Services and other high-level concepts to the user. As a whole, the k8s system is designed to optimize for three goals:

  1. Automation: Kubernetes configuration is declarative, so all the plumbing and logic necessary to manifest that configuration is automated and abstracted away. For example: if a user declares that three instances of a given container should always be running, the system will figure out how to make that true, regardless of updates, crashes, failures, or anything else that would throw off the state of the system. (A short sketch of this declarative style follows this list.)
  2. Service Oriented Architecture: Kubernetes is service-centric. You can try to run a monolith on k8s if you really want to, but you’ll be fighting against the conventions that the system wants you to adopt. To use k8s in an idiomatic way is to build containerized, horizontally replicable services and then to let k8s scale them up and down for you independently based on service-specific demand.
  3. High Availability: Kubernetes is built to maintain SLAs at scale. It does this by allowing its own secondary nodes to scale horizontally, by handling many different types of failure-conditions and corner-cases, and by ensuring that rolling updates can be performed without experiencing significant downtime.
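
To make that "declarative desired state" idea concrete, here is a minimal sketch using the upstream Go API types (k8s.io/api). The "web" name and nginx image are placeholders, not anything from the paper; the point is that you describe only the end state (three replicas) and the control plane reconciles toward it after crashes, node failures, or rolling updates.

```go
// A minimal sketch (not from the paper) of declaring desired state with the
// upstream Go API types. The control plane's job is to make reality match
// Spec.Replicas, no matter what crashes or gets rescheduled along the way.
package main

import (
	"encoding/json"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	labels := map[string]string{"app": "web"} // placeholder labels

	deploy := appsv1.Deployment{
		TypeMeta:   metav1.TypeMeta{APIVersion: "apps/v1", Kind: "Deployment"},
		ObjectMeta: metav1.ObjectMeta{Name: "web", Namespace: "default"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(3), // desired state: three instances, always
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "web", Image: "nginx:1.27"}},
				},
			},
		},
	}

	// Emit the manifest as JSON so the sketch runs without a cluster;
	// pipe the output to `kubectl apply -f -` to actually submit it.
	out, _ := json.MarshalIndent(deploy, "", "  ")
	fmt.Println(string(out))
}
```

Printing the object keeps the sketch runnable without a cluster; in practice you would apply the manifest (or create the object directly via client-go) and let the controllers do the rest.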

So what’s wrong with this architecture? Nothing, really. It’s been suitable for hundreds of thousands of systems, big and small. Right out-of-the-box it offers more power and configuration options than most users will ever need. The authors of this paper, however, are part of the tiny slice of users who have managed to outgrow it. As their systems grew in scale and complexity, they ran into six distinct problems:

  1. Etcd, the distributed key-value store that keeps the k8s configuration data, struggled to scale horizontally. As network latency and disk I/O latency rose, etcd’s heartbeats would time out and reset its consensus mechanism. As this happened more and more often, performance suffered.
  2. Etcd’s auto-backup mechanism was missing key k8s objects like Type, Namespace and Label. This made it unsuitable for backing up certain resources.
  3. Corner-case issues began appearing in the rolling update procedure. First, the system seemed unable to handle a TERM signal received mid-update. Second, if certain pods were not running just before an update, they would fail to be restarted during the update.
  4. As more services were added, the ingress configuration file grew larger. And the larger it became, the longer the Ingress Controller took to reload after a change. At a certain update frequency, these long reloads could cause a significant quality-of-service degradation, or even a service interruption.
  5. The k8s autoscaling mechanism relies on an internal tool called cAdvisor (container advisor), but as the overall system scales up, cAdvisor’s indicators and metrics start to lag. That lag means autoscaling decisions are being made on stale data.
  6. The scheduling algorithm (the algorithm that decides which container should live on which machine) doesn’t take into account all of the server metrics that may be relevant in a scaled-up environment, such as network I/O. This can lead to the scheduler making suboptimal decisions and choosing the wrong machines to allocate containers to.

The authors proposed a multi-part solution that addresses each of the above issues.

  1. They provisioned dedicated SSD drives for etcd, and gave it higher disk I/O priority.
  2. They swapped the default backup mechanism for Velero, a disaster recovery and migration tool for Kubernetes resources.
  3. They used lifecycle hooks, readiness probes, and Pod Disruption Budgets to enable more aggressive zero-downtime procedures and smooth out the rolling updates (a rough sketch follows this list).
  4. They swapped the default ingress controller for Traefik, which supports continuous configuration updates without restarts.
  5. They swapped the default cAdvisor metrics for custom monitoring via Prometheus, which, fittingly, traces its lineage to Borgmon, the monitoring system Google built alongside Borg. These improved metrics allowed for better-informed autoscaling decisions (a sketch of pulling such a signal from Prometheus also follows this list).
  6. They wrote an extension of the k8s scheduler. Kubernetes offers a Scheduler Extender mechanism that lets users bolt custom logic and arbitrary data sources onto the scheduler, rather than rewriting the scheduler itself (see the extender sketch below).
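
The rolling-update changes (item 3) combine three stock Kubernetes features. The paper doesn't publish its exact manifests, so the sketch below uses illustrative names and values, built with a recent k8s.io/api: a readiness probe so traffic only reaches pods that can actually serve it, a preStop hook so a pod can drain in-flight requests before it receives SIGTERM, and a PodDisruptionBudget so voluntary disruptions never take down too many replicas at once.

```go
// Sketch only: illustrative container settings and a PodDisruptionBudget for
// smoother zero-downtime rolling updates. Field names assume a recent k8s.io/api.
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Container snippet for the Deployment's pod template.
	container := corev1.Container{
		Name:  "web",
		Image: "example/web:1.0", // placeholder image
		ReadinessProbe: &corev1.Probe{
			// Only route traffic to pods that answer the health check.
			ProbeHandler: corev1.ProbeHandler{
				HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8080)},
			},
			PeriodSeconds: 5,
		},
		Lifecycle: &corev1.Lifecycle{
			// Give load balancers time to stop sending traffic before SIGTERM arrives.
			PreStop: &corev1.LifecycleHandler{
				Exec: &corev1.ExecAction{Command: []string{"sh", "-c", "sleep 10"}},
			},
		},
	}

	// Keep at least two replicas available during drains and rolling updates.
	minAvailable := intstr.FromInt(2)
	pdb := policyv1.PodDisruptionBudget{
		TypeMeta:   metav1.TypeMeta{APIVersion: "policy/v1", Kind: "PodDisruptionBudget"},
		ObjectMeta: metav1.ObjectMeta{Name: "web-pdb", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
		},
	}

	// Print both objects; in practice they'd be applied to the cluster.
	for _, obj := range []interface{}{container, pdb} {
		out, _ := json.MarshalIndent(obj, "", "  ")
		fmt.Println(string(out))
	}
}
```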
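
For item 5, the paper doesn't reproduce its Prometheus queries, so here is a generic sketch of pulling a per-pod autoscaling signal over Prometheus's HTTP API (the /api/v1/query endpoint). The Prometheus address and the PromQL expression are placeholders; a custom scaler, or the HPA via a metrics adapter, would compare these numbers against a target and adjust replica counts.

```go
// Generic sketch of querying Prometheus for an autoscaling signal.
// The address and PromQL query are placeholders, not the authors' setup.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// Minimal slice of Prometheus's instant-query response format.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [unix timestamp, string value]
		} `json:"result"`
	} `json:"data"`
}

func main() {
	// Placeholder PromQL: per-pod CPU usage rate, as exported by the kubelet/cAdvisor.
	query := `sum(rate(container_cpu_usage_seconds_total{namespace="default"}[2m])) by (pod)`
	endpoint := "http://prometheus.monitoring.svc:9090/api/v1/query?query=" + url.QueryEscape(query)

	resp, err := http.Get(endpoint)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var pr promResponse
	if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
		panic(err)
	}

	// A scaling loop would compare these values against a target utilization.
	for _, r := range pr.Data.Result {
		fmt.Printf("%s -> %v cores\n", r.Metric["pod"], r.Value[1])
	}
}
```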
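
Finally, here is roughly what the Scheduler Extender hook from item 6 looks like in its simplest form. kube-scheduler can be pointed (via the extenders section of its configuration) at an HTTP endpoint that scores candidate nodes for each pending pod. This is not the authors' extender: the request/response structs are hand-written stand-ins for the upstream types in k8s.io/kube-scheduler/extender/v1, and nodeNetworkScore is a placeholder for whatever live metric source (for example, network I/O figures from Prometheus) you would actually plug in.

```go
// Sketch of a "prioritize" webhook for the Scheduler Extender mechanism.
// The payload structs are simplified stand-ins for the upstream extender types,
// and the scoring function is a placeholder for a real metric source.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Reduced request payload: the scheduler sends the candidate node names
// (along with the pod, omitted here) for each pod it is trying to place.
type extenderArgs struct {
	NodeNames *[]string `json:"nodenames,omitempty"`
}

// One score per candidate node; higher means more preferred.
type hostPriority struct {
	Host  string `json:"host"`
	Score int64  `json:"score"`
}

// Placeholder: a real extender would look up live node metrics here,
// e.g. spare network bandwidth for the given node.
func nodeNetworkScore(node string) int64 {
	return 10
}

func prioritize(w http.ResponseWriter, r *http.Request) {
	var args extenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	priorities := []hostPriority{}
	if args.NodeNames != nil {
		for _, name := range *args.NodeNames {
			priorities = append(priorities, hostPriority{Host: name, Score: nodeNetworkScore(name)})
		}
	}
	json.NewEncoder(w).Encode(priorities)
}

func main() {
	http.HandleFunc("/prioritize", prioritize)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```

The appeal of this mechanism is exactly what the authors describe: the extender is wired in through the scheduler's configuration, so custom placement logic rides alongside the stock scheduler instead of replacing it.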

Once their suite of solutions was operational, the authors deployed their cluster and performed benchmarking and analysis on it using the load-testing tools Fortio and Gatling. The results were promising. When the number of concurrent requests scaled beyond 2,000, their system was able to shave 7.6% off the average request time, and reduce the minimum number of request failures by 32%. While CPU and memory usage wasn’t reduced by a significant margin, it also didn’t increase, which is definitely a win.

So what can we take away from this research? Narrowly, if you’re scaling a Kubernetes cluster and are pushing it to the limits of its performance, the tips and tweaks in this paper should help you break through some of those bottlenecks.

But more generally, I think the authors’ methodical approach to scaling is broadly applicable. That is: as you push systems to their limit, your instinct may be to tear everything down and start over; to re-architect, building a completely different system around a different design pattern. While that approach may be satisfying, it’s fraught with risk, and rarely solves the problem without creating far more. A significantly more conservative approach (slowly and deliberately modifying key pieces of the infrastructure one little bit at a time) may be a better option. The key is to focus on one bottleneck at a time, solve it, then move on to the next one.

If you’re interested in reading more of the details of exactly how they implemented their solutions, I’d encourage you to download the paper.