SLO-based alerting in Production

Introduction

In a production system, the monitoring solution should be the most reliable component of the whole environment, in terms of both data persistence and operational reliability. The monitoring solution must work with reliable data so that engineers can trust its notifications.

When using Service Mesh Manager to measure and monitor the Service Level Objectives (SLOs) of your systems, configure the following items to achieve this level of reliability.

  1. To achieve data persistence, the Prometheus deployment that collects the metrics and measures the SLOs should use persistent volumes, so that collected metrics survive pod restarts. For details, see Set up Persistent Volumes for Prometheus.

  2. To achieve operational reliability, configure Service Mesh Manager to use Prometheus in a highly available (HA) mode. For details, see Set up High Availability for the Monitoring Stack.

  3. Configure Service Mesh Manager to use an Alertmanager deployment.

    • If you already have an Alertmanager deployment, you can configure Service Mesh Manager to use it. For details on setting up the connection to your existing Alertmanager, see Use an existing Alertmanager.
    • If you don’t have an existing Alertmanager deployment, or you want to use a separate deployment, you can use the prometheus-operator that is built into Service Mesh Manager. For details, see Deploy a new Alertmanager.
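
As a rough illustration of how the three items above fit together, the following is a minimal sketch of a generic prometheus-operator Prometheus resource that combines persistent storage, two replicas, and an Alertmanager endpoint. It is not the Service Mesh Manager configuration itself: the resource names, the namespace, the storage size, and the port name are all assumptions chosen for the example, and the linked guides describe the settings that Service Mesh Manager actually expects.

```yaml
# Illustrative sketch only: a generic prometheus-operator resource,
# not the Service Mesh Manager configuration.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example-prometheus          # hypothetical name
  namespace: monitoring             # hypothetical namespace
spec:
  # Item 2: run more than one replica for high availability.
  replicas: 2
  # Item 1: keep metric data on a persistent volume so it survives restarts.
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi           # assumed size, adjust to your retention needs
  # Item 3: send alerts to an Alertmanager service.
  alerting:
    alertmanagers:
      - namespace: monitoring       # hypothetical namespace
        name: example-alertmanager  # hypothetical service name
        port: web                   # assumed name of the service port
```

Whether you point the alerting section at an existing Alertmanager or at a newly deployed one, the structure of the reference stays the same; only the namespace and name of the target change.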