Finding bottlenecks

Except for out-of-memory errors, most performance issues manifest as either increased response times (or processing times for queue-based systems) or increased error rates. Service Mesh Manager provides built-in observability features that help you locate these issues.

Usually, investigations for these bottlenecks are triggered from two primary sources:

  • an alert fires on a specific service (see Service Level Objectives for setting this up), or
  • some end-to-end measurement (such as measuring response times at the ingress gateway side) shows increased values.

The first case is easier to deal with, as the faulty service is (most likely) already identified. If this is the case, refer to the scaling a specific Workload section.
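The second case, an end-to-end measurement showing increased values, can be illustrated with a simple check: compare a high percentile of ingress-side response times against a threshold. The sketch below is a minimal illustration; the threshold, percentile, and sample data are assumptions for demonstration, not Service Mesh Manager output.

```python
# Sketch: detect increased end-to-end response times at the ingress gateway.
# The samples and threshold below are illustrative assumptions.

def percentile(samples, pct):
    """Return the pct-th percentile (nearest-rank method) of the samples."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

def ingress_alert(response_times_ms, threshold_ms=250, pct=95):
    """Fire when the pct-th percentile of response times exceeds the threshold."""
    return percentile(response_times_ms, pct) > threshold_ms

normal = [40, 55, 60, 70, 80, 90, 110, 120, 130, 140]
degraded = [40, 55, 60, 70, 80, 90, 110, 300, 450, 600]

print(ingress_alert(normal))    # False: p95 is within the threshold
print(ingress_alert(degraded))  # True: the tail latencies exceed it
```

In practice this is what an alerting rule on the ingress gateway's latency metric does; once it fires, the investigation below starts.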

Finding the root cause of an issue detected at the ingress side

Service Mesh Manager provides two major features to find these bottlenecks in your system: the health subsystem and traces.

Using the health subsystem

The health subsystem provides automated outlier analysis for your Workloads and Services by learning their previous behavior and highlighting any diverging patterns. When trying to find the root cause of a bottleneck, pull up the topology view of your solution to pinpoint the issue:

Topology view highlighting an error

In this example, the catalog's v1 Workload is most likely the culprit. Of course, this is a simple setup, where the problematic service is visible directly from the topology view.

In microservice environments, failures usually propagate either downstream or upstream, as in the following topology:

Topology view multiple issues

You can see that the following services might be affected by the issue:

  • frontpage
  • bookings
  • analytics
  • catalog
  • movies

Given that the demo application is relatively small compared to a real-life solution consisting of hundreds of microservices, it is possible to check the details of each of those services and execute the Workload-specific scaling tasks on them. However, the best practice is to check only the services that are either at the top of the call-chain or at the end of the call-chain.

In this case, this means checking the frontpage, analytics, and movies services. The reason is that either one of the microservices deeper in the call-tree is behaving differently (for example, the analytics service is too slow and slows down the whole transaction processing), or the frontpage's own behavior has changed.
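The "top and end of the call-chain" heuristic can be sketched programmatically: given the edges of the topology, the services worth checking first are those with no callers (roots) and those with no callees (leaves). The edge list below is an assumption inferred from the demo topology described above, not data exported from Service Mesh Manager.

```python
# Sketch: pick the services to inspect first from a call-graph topology.
# The edge list is an assumed approximation of the demo topology.

edges = [
    ("frontpage", "bookings"),
    ("frontpage", "catalog"),
    ("bookings", "analytics"),
    ("catalog", "movies"),
]

callers = {src for src, _ in edges}
callees = {dst for _, dst in edges}

tops = callers - callees   # nothing calls them: top of the call-chain
ends = callees - callers   # they call nothing: end of the call-chain

print(sorted(tops))  # ['frontpage']
print(sorted(ends))  # ['analytics', 'movies']
```

This yields exactly the services named above, and it scales to larger topologies where eyeballing the chart is no longer practical.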

The health framework has, however, a significant limitation: it can only detect outliers that are new. If an issue has been present for the last week, the health system has already learned that behavior and reports it as normal.
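This limitation can be illustrated with a toy baseline model. The z-score detector below is an illustrative stand-in for the actual health-subsystem algorithm (whose internals are not described here): a detector that compares new samples against learned history stops flagging a degradation once the degraded period itself becomes the history.

```python
# Sketch: why a long-standing issue stops looking like an outlier.
# This z-score baseline is a stand-in, not the real health model.
from statistics import mean, stdev

def is_outlier(history, sample, z_threshold=3.0):
    """Flag sample if it deviates more than z_threshold stddevs from history."""
    mu, sigma = mean(history), stdev(history)
    return abs(sample - mu) > z_threshold * sigma

healthy_week = [100, 105, 95, 110, 102, 98, 104]     # ms, normal baseline
print(is_outlier(healthy_week, 300))   # True: a fresh regression stands out

degraded_week = [300, 310, 290, 305, 295, 302, 298]  # issue ran all week
print(is_outlier(degraded_week, 300))  # False: the baseline absorbed it
```

This is why the tracing-based approach below remains useful for issues that predate the learning window.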

Using Traces

Service Mesh Manager has built-in support for tracing, with minimal changes required at the application level to enable this feature. If the health subsystem does not surface the culprit, use the traffic tap feature and set its filter to the namespace or Workload that handles ingress for the cluster.

Given that tracing is already integrated into the system, operators can always look for slow requests that have traces available, and find the service causing the bottleneck there.

This is especially useful when the performance degradation is caused not by a single service being too slow to respond, but by some downstream service being called too many times, which is obvious from the traces.
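Such a fan-out pattern can be spotted by counting spans per downstream service within a single trace: a service called dozens of times in one request is a likely bottleneck even if each individual call is fast. The span records below are a simplified assumption for illustration, not Service Mesh Manager's actual trace schema.

```python
# Sketch: spot a downstream service called too many times in one trace.
# The span records are simplified assumptions, not the real trace format.
from collections import Counter

def excessive_calls(spans, max_calls=10):
    """Return services invoked more than max_calls times within one trace."""
    counts = Counter(span["service"] for span in spans)
    return {svc: n for svc, n in counts.items() if n > max_calls}

trace = (
    [{"service": "frontpage", "duration_ms": 480}]
    + [{"service": "catalog", "duration_ms": 8}] * 25  # 25 fast calls
    + [{"service": "movies", "duration_ms": 12}] * 3
)

print(excessive_calls(trace))  # {'catalog': 25}
```

Here no single catalog call is slow, yet the 25 repeated calls dominate the request's total duration, which is exactly the shape that jumps out when reading a trace waterfall.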