Reporting SLA in Prometheus

I had a client requirement recently for creating good old SLA reports out of the monitoring metrics we’re receiving from Prometheus. I’m going to show you how I’ve done that. Bear in mind this post assumes a certain level of familiarity with Prometheus; however, I’ll do my best to link you to the right resources.

The idea behind our implementation is to have a way of telling both the state of our resources and whether they are under maintenance. To achieve this we will use two exporters: prometheus-mock-exporter & alertmanager-silences-exporter. The former is a small tool I wrote for mocking any metric and exposing it in a Prometheus-friendly format. The latter is an exporter that connects to Alertmanager and tells us if a resource is under maintenance. We’ll be exposing two metrics with the mock exporter:

1. virtual_machine_up -> a gauge metric representing the state of our mocked resource, taking a value of either 0 or 1
2. mock_tag_info -> also a gauge, with a fixed value of 1. It represents what the Prometheus world calls an “info metric”: “Info metrics are useful for annotations such as version numbers and other build information that would be useful to query on, but it doesn’t make sense to use them as target labels” - taken directly from the [Prometheus Bible](http://shop.oreilly.com/product/0636920147343.do). (See the sketch just after this list for how such a metric is typically used.)
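If you haven’t come across info metrics before, the usual pattern (and the one we’ll end up using at the end of this post) is to multiply the “real” metric by the info metric and pull its extra labels across with group_left. A rough PromQL sketch, with made-up metric names:

# made-up metric names, just to illustrate the info-metric join pattern
some_real_metric
* on (resource_name) group_left(tag_client_name)
  some_tag_info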

I’m using the following config to define these metrics:

---
label_metrics:
  - {
    "resource_name": "fake-vm-01",
    "resource_group": "fake-rg-01",
    "resource_type": "Microsoft.Compute/virtualMachines",
    "tag_client_name": "Big Bucks Corp",
    "tag_monitoring": "enabled",
    "instance": "prometheus-mock-exporter"
    }
  - {
    "resource_name": "fake-vm-02",
    "resource_group": "fake-rg-02",
    "resource_type": "Microsoft.Compute/virtualMachines",
    "tag_client_name": "The Laundry Guys",
    "tag_monitoring": "enabled",
    "instance": "prometheus-mock-exporter"
    }
  - {
    "resource_name": "fake-vm-03",
    "resource_group": "fake-rg-03",
    "resource_type": "Microsoft.Compute/virtualMachines",
    "tag_client_name": "Sugar Shack 2000",
    "tag_monitoring": "enabled",
    "instance": "prometheus-mock-exporter"
    }

mock_metrics:
  - name: "virtual_machine_up"
    type: "gauge"
    value: 1
    labels: {"resource_name": "fake-vm-01", "resource_group": "fake-rg-01"}

  - name: "virtual_machine_up"
    type: "gauge"
    value: 0
    labels: {"resource_name": "fake-vm-02", "resource_group": "fake-rg-02"}

  - name: "virtual_machine_up"
    type: "gauge"
    value: 0
    labels: {"resource_name": "fake-vm-03", "resource_group": "fake-rg-03"}

If we go to the exporter’s metrics page, we can see both metrics available:

# HELP mock_tag_info mock_tag_info
# TYPE mock_tag_info gauge
mock_tag_info{instance="prometheus-mock-exporter",resource_group="fake-rg-01",resource_name="fake-vm-01",resource_type="Microsoft.Compute/virtualMachines",tag_client_name="Big Bucks Corp",tag_monitoring="enabled"} 1
mock_tag_info{instance="prometheus-mock-exporter",resource_group="fake-rg-02",resource_name="fake-vm-02",resource_type="Microsoft.Compute/virtualMachines",tag_client_name="The Laundry Guys",tag_monitoring="enabled"} 1
mock_tag_info{instance="prometheus-mock-exporter",resource_group="fake-rg-03",resource_name="fake-vm-03",resource_type="Microsoft.Compute/virtualMachines",tag_client_name="Sugar Shack 2000",tag_monitoring="enabled"} 1
# HELP virtual_machine_up virtual_machine_up
# TYPE virtual_machine_up gauge
virtual_machine_up{resource_group="fake-rg-01",resource_name="fake-vm-01"} 1
virtual_machine_up{resource_group="fake-rg-02",resource_name="fake-vm-02"} 0
virtual_machine_up{resource_group="fake-rg-03",resource_name="fake-vm-03"} 1

Next up we’re going to configure the alertmanager-silences-exporter. This is pretty straightforward, as I’m only going to point it at my local Alertmanager instance:

---
alertmanager_url: "http://localhost:9093/"

The exporter exposes three new metrics, but the one we’re interested in is alertmanager_silence_info. Even though it’s an info metric, it gives us a value of 1 if a silence is active and 0 in all other cases. The next step is to tell Prometheus about our exporters.

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "prom-alerting.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'prometheus-mock-exporter'
    static_configs:
    - targets: ['localhost:2112']

  - job_name: 'alertmanager-silences-exporter'
    static_configs:
    - targets: ['localhost:9666']
    metric_relabel_configs:
      - source_labels: [matcher_resource_group]
        regex: '(.*)'
        replacement: '$1'
        target_label: resource_group
      - source_labels: [matcher_resource_name]
        regex: '(.*)'
        replacement: '$1'
        target_label: resource_name

You might have noticed the extra 'metric_relabel_configs' section here. It’s necessary because the alertmanager-silences-exporter prefixes every label that was entered during the creation of a silence with 'matcher_'. Without the relabelling we wouldn’t be able to compare these labels with the ones on our other metrics in queries. But more on that in a bit.
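A quick ad-hoc way to check that the relabelling did what we want is to query for silence series that now carry the copied labels:

# after relabelling, silence series should also carry resource_name/resource_group
alertmanager_silence_info{resource_name=~".+"}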

As you can see, this exporter is set up now too.

# HELP alertmanager_silence_end_seconds Alertmanager silence end time, elapsed seconds since epoch
# TYPE alertmanager_silence_end_seconds gauge
alertmanager_silence_end_seconds{id="7b7fc8ea-23f3-49fb-8851-8f8c13b9b584"} 1.5854976e+09
alertmanager_silence_end_seconds{id="e412ea8c-a9f4-4683-9d70-3719a8a90729"} 1.586218042e+09
# HELP alertmanager_silence_info Alertmanager silence info metric
# TYPE alertmanager_silence_info gauge
alertmanager_silence_info{comment="This is going to be long...",createdBy="codedumpster.io",id="e412ea8c-a9f4-4683-9d70-3719a8a90729",matcher_resource_group="fake-rg-02",matcher_resource_name="fake-vm-02",status="active"} 1
alertmanager_silence_info{comment="none",createdBy="me",id="7b7fc8ea-23f3-49fb-8851-8f8c13b9b584",matcher_resource_group="fake-rg-05",matcher_resource_name="fake-vm-05",status="expired"} 0
# HELP alertmanager_silence_start_seconds Alertmanager silence start time, elapsed seconds since epoch
# TYPE alertmanager_silence_start_seconds gauge
alertmanager_silence_start_seconds{id="7b7fc8ea-23f3-49fb-8851-8f8c13b9b584"} 1.5848928e+09
alertmanager_silence_start_seconds{id="e412ea8c-a9f4-4683-9d70-3719a8a90729"} 1.585613414e+09
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary

Now to the fun part! Our calculation needs to look at the state of each resource, but also take into account whether maintenance is in progress. If I look at the current list of resources, I can see that two of them are showing as down:

promql=# virtual_machine_up
virtual_machine_up{instance="localhost:2112",job="prometheus-mock-exporter",resource_group="fake-rg-01",resource_name="fake-vm-01"} 1
virtual_machine_up{instance="localhost:2112",job="prometheus-mock-exporter",resource_group="fake-rg-02",resource_name="fake-vm-02"} 0
virtual_machine_up{instance="localhost:2112",job="prometheus-mock-exporter",resource_group="fake-rg-03",resource_name="fake-vm-03"} 0

For the purpose of this demonstration let’s create an active silence for fake-vm-02 in Alertmanager. Silences are a great way to create maintenance windows on our resources.
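You can create the silence through the Alertmanager UI, or from the command line with amtool. A rough sketch (the matchers just have to line up with the labels on our metric; the duration and comment are up to you):

# sketch only: adjust the Alertmanager URL, duration and comment to taste
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --comment="maintenance window for fake-vm-02" \
  --duration=2h \
  resource_name=fake-vm-02 resource_group=fake-rg-02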

With that done we can see a new silence metric right away:

alertmanager_silence_info{comment="this is going to be long...", createdBy="codedumpster.io",
id="e412ea8c-a9f4-4683-9d70-3719a8a90729", instance="localhost:9666",
job="alertmanager-silences-exporter", matcher_resource_group="fake-rg-02",
matcher_resource_name="fake-vm-02", resource_group="fake-rg-02",
resource_name="fake-vm-02", status="active"}

So how can we use all this to create a real representation of our SLA? We’ll work this out in PromQL next.

First we want to know if a resource is up or down. However, even if it’s down it could be under maintenance, so we want to cover those cases as well. To achieve this we use the following query:

sum by(resource_group, resource_name)
 (virtual_machine_up)
+
sum by(resource_group, resource_name)
 (alertmanager_silence_info)
OR
sum by(resource_group, resource_name)
 (virtual_machine_up)

We wrap both sides in sum by so that the two vectors end up with exactly the same label set (resource_group and resource_name); PromQL’s vector matching needs matching labels before it will add two series together. The query gives us the sum of a machine’s state and any corresponding silence, and if there’s no silence with the given labels the or clause simply falls back to the state of the resource. The output shows us just that:

{resource_group="fake-rg-02", resource_name="fake-vm-02"} 1
{resource_group="fake-rg-01", resource_name="fake-vm-01"} 1
{resource_group="fake-rg-03", resource_name="fake-vm-03"} 0

Instead of a 0 we’re getting a 1 for the state of resource fake-vm-02.

So what happens if the machine is up and there’s also a maintenance window active? Glad you asked! By default this query would return a value of 2. That would complicate our calculations, so instead we’re going to make the query return a boolean value representing the state of our resource.

((
sum by(resource_group, resource_name)
 (virtual_machine_up)
+
sum by(resource_group, resource_name)
 (alertmanager_silence_info)
OR
sum by(resource_group, resource_name)
 (virtual_machine_up)
) >= bool 1)

Awesome! Now all that’s left is to “join” the info metric onto our results, so we get some more valuable labels added to our metric.

((
sum by(resource_group, resource_name)
 (virtual_machine_up)
+
sum by(resource_group, resource_name)
 (alertmanager_silence_info)
OR
sum by(resource_group, resource_name)
 (virtual_machine_up)
) >= bool 1) * on (resource_group, resource_name) group_left(resource_type, tag_client_name, tag_monitoring)
  mock_tag_info{resource_type="Microsoft.Compute/virtualMachines", exported_instance="prometheus-mock-exporter"}

And as my French neighbour would say: Voilà! We have some reporting data.

{resource_group="fake-rg-02", resource_name="fake-vm-02", resource_type="Microsoft.Compute/virtualMachines", tag_client_name="The Laundry Guys", tag_monitoring="enabled"} 1
{resource_group="fake-rg-01", resource_name="fake-vm-01", resource_type="Microsoft.Compute/virtualMachines", tag_client_name="Big Bucks Corp", tag_monitoring="enabled"} 1
{resource_group="fake-rg-03", resource_name="fake-vm-03", resource_type="Microsoft.Compute/virtualMachines", tag_client_name="Sugar Shack 2000", tag_monitoring="enabled"} 0

We’re going to save this query as a recording rule, and add another one that simply ranges over it and takes the average for the last 30 days (our SLA percentage). So add the following to the rule file referenced in the Prometheus config above (prom-alerting.yml):

groups:
- name: SLA Recording Rules
  rules:
    - record: with_silence:virtual_machine_up:bool_30d
      expr: sum(avg_over_time(with_silence:virtual_machine_up:bool{tag_monitoring="enabled"}[30d])) / count(avg_over_time(with_silence:virtual_machine_up:bool{tag_monitoring="enabled"}[30d]))
    - record: with_silence:virtual_machine_up:bool
      expr: >
        ((
        sum by(resource_group, resource_name)
         (virtual_machine_up)
        +
        sum by(resource_group, resource_name)
         (alertmanager_silence_info)
        OR
        sum by(resource_group, resource_name)
         (virtual_machine_up)
        ) >= bool 1) * on (resource_group, resource_name) group_left(resource_type, tag_client_name, tag_monitoring)
          mock_tag_info{resource_type="Microsoft.Compute/virtualMachines", exported_instance="prometheus-mock-exporter"}        
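If you want to make sure the rule file parses before reloading Prometheus, promtool can check it for you:

# validate the rule file (prom-alerting.yml, as referenced in the config above)
promtool check rules prom-alerting.yml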

We can refer to these new rules directly and use them in our reports for our clients.

promql=# with_silence:virtual_machine_up:bool
with_silence:virtual_machine_up:bool{resource_group="fake-rg-01", resource_name="fake-vm-01", resource_type="Microsoft.Compute/virtualMachines", tag_client_name="Big Bucks Corp", tag_monitoring="enabled"} 1
with_silence:virtual_machine_up:bool{resource_group="fake-rg-02", resource_name="fake-vm-02", resource_type="Microsoft.Compute/virtualMachines", tag_client_name="The Laundry Guys", tag_monitoring="enabled"} 1
with_silence:virtual_machine_up:bool{resource_group="fake-rg-03", resource_name="fake-vm-03", resource_type="Microsoft.Compute/virtualMachines", tag_client_name="Sugar Shack 2000", tag_monitoring="enabled"} 0

promql=# with_silence:virtual_machine_up:bool_30d * 100
{} 65.47619047619048
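The number above is a single fleet-wide figure. If you’d rather report the SLA per resource (or per client), a variation along these lines should work, since all the interesting labels are already on the recorded series:

# 30-day availability per resource, as a percentage
avg_over_time(with_silence:virtual_machine_up:bool{tag_monitoring="enabled"}[30d]) * 100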

I hope this overview gave you some good insights.