Scale a specific Workload

Before getting into the specifics of how Service Mesh Manager helps you detect scaling issues, here is a summary of how scaling works in Kubernetes. For Pods, two basic resources (leaving storage aside for now) can be requested and limited: memory and CPU. The CPU and memory usage of a Pod is the sum of the resources used by the containers inside the Pod.

Detecting insufficient memory is usually easy: if a container tries to use more memory than its limit allows, the operating system cannot give it more memory, and the container is terminated with the OOMKilled status. To get the container running again, first increase the memory available to it so that it can start. You can continue the investigation once the system is operational again.

CPU resources behave differently from memory: instead of terminating a container, the kernel’s scheduler can give it fewer CPU cycles. The amount of work that a container wanted to execute but could not, because it did not get the CPU shares, is called CPU throttle. If a container’s CPU throttle value is greater than zero, CPU starvation is causing the container to process requests and data more slowly than it otherwise could.
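
As a rough illustration of where throttle data comes from, the following Python sketch parses the contents of a cgroup v2 cpu.stat file and computes the fraction of scheduler periods in which the container was throttled. The sample values are made up, and the helper names (parse_cpu_stat, throttled_fraction) are hypothetical; this is not how Service Mesh Manager computes its graphs.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats: dict[str, int]) -> float:
    """Fraction of enforcement periods in which the container was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Made-up sample contents of a cgroup v2 cpu.stat file.
SAMPLE = """\
usage_usec 8250000
user_usec 6100000
system_usec 2150000
nr_periods 1000
nr_throttled 250
throttled_usec 4200000
"""

stats = parse_cpu_stat(SAMPLE)
print(f"throttled in {throttled_fraction(stats):.0%} of periods")
```

Any value above zero means that the container was held back by its CPU limit at some point during the measured interval.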

Horizontal and vertical scalability

A workload scales horizontally when adding more Pods to the workload increases its total performance, that is, the bandwidth or the number of requests it can serve. Stateless API servers are an example: adding one more Pod increases the throughput almost linearly, as more “workers” serve the same incoming traffic. Horizontal scalability is desirable in Kubernetes environments because most Kubernetes deployments can dynamically provision new worker nodes, allowing more Pods to run on the cluster.

A workload scales vertically when its performance can only be increased by allocating more resources (memory, CPU) to the given workload. This is usually the case with database engines (such as MySQL, PostgreSQL, or even Prometheus). For example, for MySQL it is easier to increase the CPU and memory available to the database workload than to set up multi-master replication to allow for some horizontal scalability.

The main issue with vertical scalability is that every Kubernetes node has a hard limit on the CPU and memory it can provide. If a workload requires more resources than any node can provide, the workload is never scheduled, causing an outage. That’s why horizontal scalability is preferred over vertical scalability.

Scale a specific workload

To find the right strategy for scaling a workload, you must first understand whether the workload scales horizontally or vertically.

Note: Usually, workloads exhibit both vertical and horizontal scaling properties. For an example of a horizontally scalable service, see Scale the Istio Control Plane. A typical vertically scaling service is Prometheus, see Scale the built-in Prometheus.

Determine memory requirements

If a Workload runs out of memory, it is immediately visible because the Pod is killed. In such cases, before analyzing the memory usage patterns, try increasing the requests and limits of the Workload significantly (by 50%). This might recover the system (at least temporarily), minimizing the outage perceived by end users. It also shows whether the incident is caused by insufficient resource allocation or by a systematic issue.
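
The 50% increase can be worked out as in the following minimal Python sketch. It assumes memory quantities written as whole numbers with a Ki, Mi, or Gi suffix (or plain bytes); the helper names (parse_memory, bump_memory) are hypothetical and only illustrate the arithmetic.

```python
import math

# Binary suffixes used by Kubernetes memory quantities (subset).
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' into bytes; plain integers are bytes.

    Only whole-number quantities are handled in this sketch.
    """
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def bump_memory(quantity: str, factor: float = 1.5) -> str:
    """Return the quantity scaled by `factor`, rounded up to whole Mi."""
    scaled_bytes = parse_memory(quantity) * factor
    return f"{math.ceil(scaled_bytes / 2**20)}Mi"


print(bump_memory("512Mi"))  # 768Mi
print(bump_memory("1Gi"))    # 1536Mi
```

Apply the resulting value to both the memory request and the memory limit of the container, then observe whether the Pod stays up.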

The graph showing the memory usage of the Pods can also be used to determine the minimum amount of memory the Workload requires to operate properly:

Figure: Memory saturation

Since this graph only shows data for the last few hours, if you don’t know the scaling behavior of the service, complete the following tests to better understand it:

  • Generate synthetic load on the service to see how the Pod’s memory utilization changes over time
  • Upscale the Workload to have more replicas, and then apply the same synthetic load to see how memory utilization scales with the number of replicas

Determine CPU requirements

In case of CPU, the kernel of the node can decide not to give CPU cycles to a container that is over its limit. Because of this starvation, every operation the container executes runs slower, resulting in longer processing and response times.

To check if this is happening, navigate to the Workload and check HEALTH > SATURATION: CPU.

Figure: CPU saturation

In this example, the database container is most likely responding more slowly than it would if it could use the additional 0.6 vCPU to serve requests. In such cases, increase the CPU limit of the container by at least the amount shown in the throttled seconds graph to avoid performance issues.
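
As a worked example of this sizing rule, the following Python sketch adds the throttled amount to the current CPU limit. The current limit of 1000 millicores is an assumed value for illustration (the actual limit of the container in the graph is not stated here), and the function name is hypothetical.

```python
def suggested_cpu_limit_millicores(current_limit_m: int, throttled_vcpu: float) -> int:
    """Current CPU limit plus the throttled amount, both expressed in millicores."""
    return current_limit_m + round(throttled_vcpu * 1000)


# Assumed current limit: 1000m (1 vCPU). Throttled amount read from the graph: 0.6 vCPU.
new_limit = suggested_cpu_limit_millicores(1000, 0.6)
print(f"{new_limit}m")  # 1600m
```

Treat the result as a lower bound: once the limit is raised, the container may try to use even more CPU, so check the graph again after the change.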

Auto-scaling a workload vertically

If a Workload only scales vertically, the usual recommendation is to find a safe limit at which the Workload performs well enough, and to set up alerts that fire when the given metric (memory or CPU) gets close to that limit.

Another solution is to use a Vertical Pod Autoscaler (VPA). However, this has some limitations, and it does not solve the issue of Kubernetes nodes having a finite set of resources.

The most important limitation is that changing the resource limits requires restarting the Pods, and vertically scalable services are usually stateful services such as databases. If the database in question cannot handle this failover, a short outage occurs.

Auto-scaling a workload horizontally

It is generally recommended to run multiple instances of horizontally scalable Workloads with small CPU and Memory usage. Kubernetes provides built-in support for automatically adjusting the number of the Pods serving the Workloads based on the CPU or Memory metrics.

Using Horizontal Pod Autoscaler

To use the Horizontal Pod Autoscaling support in Kubernetes, complete the following steps.

  1. Install the metrics-server, which is responsible for providing resource usage metrics to the autoscaler.

    NOTE: Service Mesh Manager does not install the metrics-server by default, as this is a cluster-global component.

    To verify that the underlying Kubernetes cluster has metrics-server installed, check if it is running in the kube-system namespace (assuming default installation):

    kubectl get pods -n kube-system | grep metrics-server
    metrics-server-588cd8ddb5-k7cz5                                        1/1     Running   0          4d7h

    If the cluster does not have metrics-server running, you can install it by running the following command:

    kubectl apply -f
  2. If metrics-server is available, the Horizontal Pod Autoscaler (HPA) tries to set the number of replicas for Deployments and StatefulSets so that the actual average memory or CPU utilization is a given percentage of the requests on the Pods of the Workload.

    For this to work, set the requests and limits for memory and CPU for the Pods of the Workload based on the previous sections.

  3. Afterward, use the HorizontalPodAutoscaler resource to configure the scaling parameters. The following example configures the HPA to dynamically scale the istiod-cp-v113x Deployment in the istio-system namespace, so that the average CPU utilization of all Pods compared to their resource requests is at 80%.

    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: istiod-autoscaler-cp-v113x
      namespace: istio-system
    spec:
      maxReplicas: 5
      minReplicas: 1
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: istiod-cp-v113x
      targetCPUUtilizationPercentage: 80

NOTE: If the Workload scales up slowly (for example, because of slow startup times), you can increase the limits to allow some additional headroom until the scaling is complete.
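
To get a feel for how the HPA picks a replica count, the following Python sketch applies the scaling rule described in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of current to target utilization, rounded up, and no change is made when the ratio is within a tolerance (0.1 by default). The function name and the sample utilization values are hypothetical, and the real controller also accounts for details such as Pod readiness and missing metrics, which are ignored here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,   # average CPU utilization in % of requests
    target_utilization: float = 80.0,
    min_replicas: int = 1,
    max_replicas: int = 5,
    tolerance: float = 0.1,
) -> int:
    """Approximate the replica count the HPA would aim for."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured minReplicas and maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 120.0))  # 3: scale up, 2 * 1.5
print(desired_replicas(4, 160.0))  # 5: 8 would be needed, capped by maxReplicas
print(desired_replicas(3, 84.0))   # 3: within tolerance, no change
print(desired_replicas(4, 20.0))   # 1: scale down
```

The defaults mirror the example resource above (target of 80%, 1 to 5 replicas). The second case shows why maxReplicas matters: when the cap is reached, the Pods stay above the target utilization until the cap is raised.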