Service Level Indicator template

Service Level Indicators (SLIs) in Service Mesh Manager are based on the ServiceLevelIndicatorTemplate custom resource. This resource describes the PromQL queries that calculate the number of goodEvents and totalEvents as Go templated strings. Note that ServiceLevelIndicatorTemplate is a namespaced custom resource: to use a Service Level Indicator in a Service Level Objective, you must reference both the name and the namespace of the Service Level Indicator in the ServiceLevelObjective custom resource.
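For example, a ServiceLevelObjective resource could reference the template roughly like this (a minimal sketch; the exact field names under spec, such as sli.templateRef, are illustrative assumptions, not taken from this document):

```yaml
apiVersion: sre.smm.cisco.com/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: http-availability-demo
  namespace: smm-demo
spec:
  # Reference both the name and the namespace of the SLI template.
  # The field names below are illustrative assumptions.
  sli:
    templateRef:
      name: http-requests-success-rate-demo
      namespace: smm-system
```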

apiVersion: sre.smm.cisco.com/v1alpha1
kind: ServiceLevelIndicatorTemplate
metadata:
  name: <name of the SLI Template>
  namespace: <namespace of the SLI Template>
spec:
  goodEvents: |
        <PromQL query that returns the number of good events (for example, successful HTTP requests)>
  totalEvents: |
        <PromQL query that returns the total number of events (for example, every HTTP request)>
  kind: availability
  description: |
        <Description of the SLI. This is displayed on the Service Mesh Manager UI as well.>
  parameters:
  - default: source
    description: the Envoy proxy that reports the metric (source | destination)
    name: reporter

For example:

apiVersion: sre.smm.cisco.com/v1alpha1
kind: ServiceLevelIndicatorTemplate
metadata:
  name: http-requests-success-rate-demo
  namespace: smm-system
spec:
  goodEvents: |
    sum(rate(
      istio_requests_total{reporter="{{ .Params.reporter }}", destination_service_namespace="{{ .Service.Namespace }}", destination_service_name="{{ .Service.Name }}",response_code!~"5[0-9]{2}|0"}[{{ .SLO.Period }}]
    ))    
  totalEvents: |
    sum(rate(
      istio_requests_total{reporter="{{ .Params.reporter }}", destination_service_namespace="{{ .Service.Namespace }}", destination_service_name="{{ .Service.Name }}"}[{{ .SLO.Period }}]
    ))    
  kind: availability
  description: |
        Indicates the percentages of successful (non 5xx) HTTP responses compared to all requests
  parameters:
  - default: source
    description: the Envoy proxy that reports the metric (source | destination)
    name: reporter

To make a custom SLI available in Service Mesh Manager, create a custom ServiceLevelIndicatorTemplate resource, then apply it to your cluster. For example:

kubectl apply --namespace smm-demo -f sli-example.yaml
servicelevelindicatortemplate.sre.smm.cisco.com/http-requests-success-rate-demo created

For a more detailed tutorial, see the Defining application level SLOs using Service Mesh Manager blog post.

ServiceLevelIndicatorTemplate CR reference

This section describes the fields of the ServiceLevelIndicatorTemplate custom resource.

apiVersion (string)

Must be sre.smm.cisco.com/v1alpha1

kind (string)

Must be ServiceLevelIndicatorTemplate

spec (object)

The configuration and parameters of the Service Level Indicator.

spec.description (string)

A human-readable description of the SLI. This text appears on the Service Mesh Manager web interface as well. For example:

spec:
  description: |
        Indicates the percentages of successful (non 5xx) HTTP responses compared to all requests

spec.goodEvents and spec.totalEvents (string)

These fields specify the Go templated PromQL queries that return the number of good events and the total number of events, for example, the number of successful HTTP requests and the total number of HTTP requests.

The following variables are available for interpolation:

  • .Params.<parameter name>: The value associated with each parameter defined in spec.parameters, as specified when a new ServiceLevelObjective is created.
  • .Service.Namespace: The namespace of the service this ServiceLevelObjective is created on.
  • .Service.Name: Name of the service this ServiceLevelObjective is created on.
  • .SLO.Period: The period on which the SLO is defined.
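To see how these variables interpolate, here is a minimal sketch that renders a totalEvents-style query with Go's text/template package. The sliContext structure below is a hypothetical stand-in modeled on the variable names above, not the actual Service Mesh Manager implementation:

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// sliContext is a hypothetical data structure mirroring the variables
// available for interpolation; it is NOT the real implementation.
type sliContext struct {
	Params  map[string]string
	Service struct{ Namespace, Name string }
	SLO     struct{ Period string }
}

// render executes a goodEvents/totalEvents-style Go template against ctx.
func render(query string, ctx sliContext) string {
	var b strings.Builder
	if err := template.Must(template.New("sli").Parse(query)).Execute(&b, ctx); err != nil {
		panic(err)
	}
	return b.String()
}

func main() {
	query := `sum(rate(istio_requests_total{reporter="{{ .Params.reporter }}", destination_service_namespace="{{ .Service.Namespace }}", destination_service_name="{{ .Service.Name }}"}[{{ .SLO.Period }}]))`

	ctx := sliContext{Params: map[string]string{"reporter": "source"}}
	ctx.Service.Namespace = "smm-demo"
	ctx.Service.Name = "frontpage"
	ctx.SLO.Period = "30d"

	// Prints the concrete PromQL query with all placeholders substituted.
	fmt.Println(render(query, ctx))
}
```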

Service Mesh Manager enforces the best practice of formulating the SLI as the ratio of two numbers: the good events divided by the total events. This way the SLI value will be between 0 and 1 (or 0% and 100%), and it’s easily matched to the SLO value that’s usually defined as a target percentage over a given timeframe. (For further details and examples on defining SLIs, see our Tracking and enforcing SLOs blog post.)

goodEvents: |
  sum(rate(
    istio_requests_total{reporter="{{ .Params.reporter }}", destination_service_namespace="{{ .Service.Namespace }}", destination_service_name="{{ .Service.Name }}",response_code!~"5[0-9]{2}|0"}[{{ .SLO.Period }}]
  ))  

totalEvents: |
  sum(rate(
    istio_requests_total{reporter="{{ .Params.reporter }}", destination_service_namespace="{{ .Service.Namespace }}", destination_service_name="{{ .Service.Name }}"}[{{ .SLO.Period }}]
  ))  
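The good-events/total-events ratio that these queries feed into can be sketched as simple arithmetic (illustrative only; Service Mesh Manager evaluates this in Prometheus, not in client code, and the zero-traffic convention below is an assumption):

```go
package main

import "fmt"

// sli returns the indicator value as a ratio in [0, 1].
func sli(goodEvents, totalEvents float64) float64 {
	if totalEvents == 0 {
		return 1 // assumption: no traffic means nothing violated the objective
	}
	return goodEvents / totalEvents
}

func main() {
	good, total := 99700.0, 100000.0
	value := sli(good, total)
	target := 0.995 // e.g. a 99.5% availability SLO
	fmt.Printf("SLI = %.4f, meets %.1f%% target: %v\n", value, target*100, value >= target)
}
```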

spec.kind (string)

Specifies the type of the Service Level Indicator. This value also determines which other fields must be set. Possible values:

  • availability: Used for error-rate characteristics.
  • latency: Used for measuring microservice latencies.

spec.parameters (object)

Contains parameters specific to the SLI. These parameter values are set in the ServiceLevelObjective CR and substituted into the goodEvents and totalEvents queries when they are evaluated. For example:

  parameters:
    - default: source
      description: the Envoy proxy that reports the metric (source | destination)
      name: reporter

If you specify advanced: true for a parameter, it appears on the Service Mesh Manager web interface only when the SHOW ADVANCED PARAMETERS option is selected.

You can declare any parameter that your Service Level Indicator depends on.

The reason behind this implementation is that Prometheus only provides the histogram_quantile function, which yields a given latency percentile. We could have created an SLI Template on top of that, but it felt more natural to express latency SLOs by specifying a simple threshold.
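For reference, a percentile-based latency query with histogram_quantile looks roughly like this (a sketch using Istio's standard istio_request_duration_milliseconds histogram; this is not the query Service Mesh Manager uses):

```promql
# 95th-percentile request latency over 5 minutes, per destination service
histogram_quantile(
  0.95,
  sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_service_name)
)
```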

Example: specifying bucket ranges

The buckets parameter specifies a comma-separated list of histogram buckets used to collect time series for the corresponding Prometheus metric. For example:

  parameters:
    - advanced: true
      default: '0.5, 1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000, 300000, 600000, 1800000, 3600000'
      description: comma separated list of histogram buckets used to collect timeseries for a corresponding Prometheus metric
      name: buckets

To help interpolate values from histograms, you can use the .FloorBucket and .CeilBucket functions in the goodEvents and totalEvents queries to determine which buckets are closest to a given threshold. For example, the Duration (latency) SLI Template uses these functions to process histograms in a way that lets you specify any latency threshold, not just the ones defined as buckets in the envoy-proxy exporter.
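The document does not define .FloorBucket and .CeilBucket precisely, but based on the description their behavior can be sketched as follows (a hypothetical implementation, not Service Mesh Manager's):

```go
package main

import "fmt"

// floorBucket returns the largest bucket boundary at or below threshold.
// buckets must be sorted in ascending order.
func floorBucket(buckets []float64, threshold float64) float64 {
	result := buckets[0]
	for _, b := range buckets {
		if b <= threshold {
			result = b
		}
	}
	return result
}

// ceilBucket returns the smallest bucket boundary at or above threshold.
func ceilBucket(buckets []float64, threshold float64) float64 {
	for _, b := range buckets {
		if b >= threshold {
			return b
		}
	}
	return buckets[len(buckets)-1]
}

func main() {
	// A prefix of the default bucket list shown above (milliseconds).
	buckets := []float64{0.5, 1, 5, 10, 25, 50, 100, 250, 500, 1000}
	fmt.Println(floorBucket(buckets, 120)) // closest bucket at or below 120 ms
	fmt.Println(ceilBucket(buckets, 120))  // closest bucket at or above 120 ms
}
```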

Example: threshold parameter

To define the default acceptance threshold for latency values in milliseconds, Service Mesh Manager uses the following parameter definition.

  parameters:
    - default: '100'
      description: acceptance threshold for HTTP latency values in milliseconds
      name: threshold
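Such a threshold parameter could then appear in a goodEvents query roughly like this (an illustrative sketch; the exact invocation syntax of .CeilBucket and the metric and labels used are assumptions, not the actual Duration SLI Template):

```yaml
goodEvents: |
  sum(rate(
    istio_request_duration_milliseconds_bucket{le="{{ .CeilBucket .Params.threshold }}", destination_service_name="{{ .Service.Name }}"}[{{ .SLO.Period }}]
  ))
```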