Kubernetes Autoscaling Overview
December 15, 2020
Kubernetes is often touted as the solution for modern cloud-native deployments of applications. It provides a vendor-agnostic platform for running code in a fashion that can tick all the “-ility” boxes (e.g. reliability, scalability, observability, and manageability).
However, when working with Kubernetes, nothing autoscales by default!
Autoscaling is a common tool when considering the scalability of your application. After all, scaling up a production application manually is time-consuming and error-prone. In Kubernetes you can specify a deployment with multiple replicas, which helps with high availability, but that doesn’t address autoscaling. Autoscaling can help your application keep up with increased load and save you money in times of reduced load.
While none of them are set up by default, there are three main types of autoscaling that you can use in Kubernetes. They are:
- Horizontal Pod Autoscaler (HPA)
- Vertical Pod Autoscaler (VPA)
- Cluster Autoscaler
As implied by the names, HPA and VPA work at the pod level inside the cluster, whereas Cluster Autoscaler adds or removes entire nodes in your worker fleet, dynamically scaling the cluster itself. Note that both HPA and VPA require Metrics Server to be installed in your cluster in order to function.
HPA and VPA should not be used together on CPU/Memory for the same deployment, since both would be reacting to the same signal. You can, however, combine them if your HPA definition uses custom metrics. That would allow you to scale up (VPA) if you hit a CPU threshold, and scale out (HPA) if you hit another limit, like request throughput, or queue length if you are autoscaling some kind of task-processing pod.
Horizontal Pod Autoscaler is part of the “autoscaling” API group in Kubernetes. As of right now, the v1 version of this API only supports autoscaling based on CPU, but the beta version supports autoscaling based on memory or custom metrics.
HPA is a resource in your cluster that you can create with a definition like so:
    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-app
    spec:
      maxReplicas: 10
      minReplicas: 1
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      targetCPUUtilizationPercentage: 80
It behaves pretty simply. You set a target utilization, minimum and maximum number of pods, and what deployment to target, and HPA does the rest. It will scale your deployment up/down based on the CPU utilization and try to keep the CPU metric as close to the target as possible, respecting the min and max replica values. If overnight you get a lot less traffic, now your deployment can scale down so as not to be using as many resources during that time. Then in the morning when traffic picks back up, it will automatically scale up to keep up with the increased load.
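For the memory and custom-metric scaling that the beta API supports, a definition might look roughly like the sketch below. It assumes the `autoscaling/v2beta2` API (newer clusters serve the same shape as `autoscaling/v2`) and, for the second metric, a custom metrics adapter already installed in the cluster. The metric name `http_requests_per_second` and both target values are made up for illustration.

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    # Memory utilization is measured against the pods' memory requests,
    # so the containers must declare requests for this to work.
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Per-pod custom metric (hypothetical name), served by a metrics adapter.
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
```

When several metrics are listed, HPA works out a desired replica count for each one and uses the highest. If you also plan to run VPA against the same deployment, you would drop the memory entry and keep only the custom metric.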
Whereas HPA scales horizontally by adding or removing pods, Vertical Pod Autoscaler scales vertically by adjusting the CPU/Memory allocated to the pods themselves.
Also unlike HPA, VPA is not baked into one of the default APIs, and must be installed into the cluster with a Custom Resource Definition (CRD).
A sample VPA definition might look like this:
    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: my-app
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      updatePolicy:
        updateMode: "Auto"
VPA watches the resource usage of your pods, and if they are maxing out their resources, it will update the pods to give them more. There’s some nuance to the “updateMode”, in that “Auto” currently evicts running pods in order to adjust their resources. This gives you more dynamic scaling, but it also means that pods can be evicted by VPA, potentially causing downtime if your configuration doesn’t have multiple replicas and/or you haven’t configured Pod Disruption Budgets.
At the time of writing, there is no way for VPA to update resources on a pod in place, but if/when that becomes available, it is likely to become the default behavior when selecting “Auto”.
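To limit how many pods VPA can evict at once, a Pod Disruption Budget along these lines is one option. This is a minimal sketch: it assumes the pods of `my-app` carry an `app: my-app` label (not shown in the definitions above) and that your cluster serves the `policy/v1` API; older clusters use `policy/v1beta1` with the same fields.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  # Keep at least one pod running while others are being evicted.
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app   # assumed label on the deployment's pods
```

Note that with a single replica and `minAvailable: 1`, evictions would be blocked entirely, so this only makes sense alongside two or more replicas.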
Cluster Autoscaler behaves much like HPA, but instead of scaling pods it adds/removes worker nodes in your cluster. When combined with HPA and/or VPA, this allows for a truly dynamic cluster experience, allowing you to go from a single worker node with a single pod all the way up to a large cluster with tons of pods without having to scale anything manually.
Installing and using Cluster Autoscaler takes a little more work than HPA though, since it needs to be aware of where and how you are running your cluster. For example, if you’re running a cluster in AWS, Cluster Autoscaler needs IAM permissions to manage the autoscaling groups for your worker nodes, whereas if you’re running in Azure, you’ll need to create a service principal in your Subscription with at least the “Contributor” role.
Cluster autoscaler works by looking at the resources of all the nodes and all the pods, and determining if nodes need to be added to support the current number of pods (or unschedulable pods because of a lack of resources), or if entire nodes can be removed because they are either unused or their workload could be moved to other nodes.
Cluster Autoscaler is powerful because it doesn’t simply scale on the CPU Utilization % on the worker nodes, but performs pod scheduling simulations to see where pods could be scheduled if it were to add or remove nodes.
Since your infrastructure usage can fluctuate day to day or even minute to minute, your costs can vary greatly. If your company already runs applications in the cloud, you are likely used to the pay-as-you-go model, but it’s still something to take into consideration. Month to month, your cloud bill could look quite different.
When setting up autoscaling, you are likely operating in an elastic cloud environment that could scale up “infinitely”. Because of this, an issue with your autoscaling definitions, a bug, or a force majeure could cause your cloud bill to skyrocket unexpectedly. To prevent this, always place reasonable maximums on your autoscaling policies. What is “reasonable” will differ between companies or even between applications at the same company, but don’t get caught with thousands of dollars in extra cloud spend because you let your cluster scale up to 100 nodes by accident!
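HPA already has its ceiling in `maxReplicas`. For VPA, the equivalent is a resource policy with `maxAllowed`, sketched below. The specific CPU and memory bounds are placeholder values, not recommendations.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      # "*" applies the bounds to every container in the pod.
      - containerName: "*"
        minAllowed:
          cpu: 100m       # placeholder floor
          memory: 128Mi
        maxAllowed:
          cpu: "2"        # placeholder ceiling
          memory: 2Gi
```

Cluster Autoscaler has its own ceiling as well, in the form of the minimum and maximum size of each node group, which is configured differently per cloud provider.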
Above, I put “infinitely” in quotes. This is more than me just saying that obviously there isn’t infinite computing power. Within your cloud provider, there will be limits that could prevent you from autoscaling, even if you wanted to. I speak mostly in an AWS context, but similar limits are sure to apply to other clouds. Here is a non-comprehensive list of some limits you could run into:
IP Address exhaustion (Make sure your VPC/subnet CIDRs have enough IPs available!)
- Every pod in Kubernetes gets its own IP address. Depending on your networking setup, a cluster can easily eat up all the available IPs in your CIDR.
Availability Zone EC2 Capacity
- As hinted at before, the number of EC2 instances available in a given Availability Zone is not infinite. It is possible to try to spin up an EC2 instance and have AWS reject your request because there isn’t capacity at that moment.
- It can be useful to allow multiple instance types, since sometimes only one specific instance type is at capacity.
AWS Service limits
- In part to protect you from unexpected spend, and in part to protect them from people eating up a lot of resources unnecessarily, AWS has Service Quotas. If you hit limits on things like Network Interfaces or EC2 Instances, you will have to request a quota increase from AWS.
- If you can’t or don’t autoscale your cluster, but do autoscale applications within the cluster, you can quickly exhaust the available cluster resources. In that scenario you might be unable to schedule new pods, and/or lower-priority pods will be killed so that higher-priority pods can continue to run smoothly.
- If your application scales up and is serving more and more traffic, you will need to ensure that the path the traffic takes is also elastic enough to keep up with demand. This might mean needing beefier instances, scaling up load balancers, etc., so you don’t flood your own network.
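On the pod priority point in the list above: which pods survive a full cluster is something you can influence with a PriorityClass. The sketch below is illustrative only; the class name `business-critical` and the value `100000` are made up, and the second document is just the relevant fragment of a deployment, not a complete manifest.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical   # hypothetical name
value: 100000               # higher value means higher priority
globalDefault: false
description: "Pods that should keep running when the cluster is full."
---
# Fragment of the my-app Deployment: reference the class in the pod template.
spec:
  template:
    spec:
      priorityClassName: business-critical
```

Pods without a priority class default to priority 0 (unless a global default is defined), so they would be the first candidates for preemption when a `business-critical` pod cannot be scheduled.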
I’m sure there are other nuances and caveats that I haven’t addressed, but these are some major ones that I am aware of or have personally experienced in the past. Keep them in mind when planning a Kubernetes cluster that will involve autoscaling!