Working with Reliability metrics

Release version: Washingtondc

Updated February 1, 2024

Use the SRM metrics features to define service level indicators (SLIs), service level objectives (SLOs), and error budget policies so that you and your team can track your services and take action when required.

High-level workflow

SRM leverages integrations for signal aggregation.
Reliability indicators containing SLIs and SLOs are created for the service in SRM.
When a qualified alert is generated for a service, the cumulative breach and the error budget values are updated for the reliability indicators in SRM.
An error budget policy is created for the service to trigger actions such as creating an incident or sending an email to remediate service issues. Error budgets are constrained by Category.
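
The last two workflow steps can be sketched in a few lines of Python. This is an illustrative model only, not ServiceNow code: the class names, the minute-based units, and the idea that a policy fires once the remaining budget falls to a fraction of the total are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReliabilityIndicator:
    """Hypothetical stand-in for an SRM reliability indicator."""
    error_budget_minutes: float              # total budget for the compliance period
    cumulative_breach_minutes: float = 0.0   # breach time accumulated so far

    @property
    def remaining_budget_minutes(self) -> float:
        # The remaining budget never goes below zero in this sketch.
        return max(self.error_budget_minutes - self.cumulative_breach_minutes, 0.0)

    def record_qualified_alert(self, breach_minutes: float) -> None:
        # A qualified alert adds to the cumulative breach, which in turn
        # reduces the remaining error budget.
        self.cumulative_breach_minutes += breach_minutes


@dataclass
class ErrorBudgetPolicy:
    """Hypothetical policy; the threshold-based trigger is an assumption."""
    threshold_fraction: float   # fire when remaining/total is at or below this value
    action: str                 # for example "create_incident" or "send_email"

    def evaluate(self, indicator: ReliabilityIndicator) -> Optional[str]:
        if indicator.error_budget_minutes <= 0:
            return self.action
        fraction_left = indicator.remaining_budget_minutes / indicator.error_budget_minutes
        return self.action if fraction_left <= self.threshold_fraction else None
```

In this model, an indicator with a 43.2-minute budget that records a 30-minute breach has less than half of its budget left, so a policy with `threshold_fraction=0.5` returns its action.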

The key features of the SRM metrics are:

SLI signal aggregation
Create duration and count based service level objectives
Calculate error budgets (EB)
Error budget policies
Error budget visualization

Navigate to the Services > Overview tab to view all associated critical data for Reliability and Error Budget metrics. See Working with SRM services for more information.

Note:

Scores are visible only after SLIs, SLOs, and error budgets have been created and are in effect. See Create an SLO, an SLI, and Error budget policies for SRM for more detailed information.

Reliability metrics tab

The Service > Reliability metrics tab lists the service level objectives (SLOs) for the service.

Note:

Updating an SLO changes its state: the existing SLO record is retired and a new copy is created so that monitoring remains accurate.
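
The retire-and-copy behavior in this note can be illustrated with a small sketch. It is a simplified model rather than the platform's implementation: `SloRecord`, `edit_slo`, and the `objective_pct` field are hypothetical names, and the assumption that the new copy stays in the Running state is not confirmed by this page.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import List


class SloState(Enum):
    DRAFT = "Draft"
    RUNNING = "Running"
    RETIRED = "Retired"


@dataclass(frozen=True)
class SloRecord:
    """Hypothetical, immutable view of one SLO record."""
    name: str
    objective_pct: float
    state: SloState


def edit_slo(record: SloRecord, new_objective_pct: float) -> List[SloRecord]:
    """Return the records that exist after editing the objective."""
    if record.state is SloState.RUNNING:
        # Editing a running SLO retires the old record and creates a copy
        # carrying the change, so history is measured against the old values.
        retired = replace(record, state=SloState.RETIRED)
        # Assumption: the copy continues in the Running state.
        new_copy = replace(record, objective_pct=new_objective_pct)
        return [retired, new_copy]
    if record.state is SloState.DRAFT:
        # Assumption: a draft is not running yet, so it is changed in place.
        return [replace(record, objective_pct=new_objective_pct)]
    # Editing a retired SLO is outside the scope of this sketch.
    raise ValueError("retired SLOs are not handled in this sketch")
```

Keeping the old record unchanged apart from its state is what lets earlier measurements stay tied to the objective that was in force when they were taken.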

Reliability metrics

Each service level objective shows the following details:

Service Level Objective: Name of the SLO.
SLO type: Duration, Count, Count by periods, Count by occurrences.
Service: Service the SLO is set on.
Compliance period:
How long the SLO is set to last.
- Month: The duration is considered to be the current month. For example, if the current date is 26th January, the duration will be considered from 1st January till 31st January.
- Rolling 7 days: The duration is the 7 days up to the current date.
- Rolling 30 days: The duration is the 30 days up to the current date. For example, if the current date is 26th January, the duration will be considered from 25th December.
- Rolling 90 days: The duration is the 90 days up to the current date. For example, if the current date is 26th January, the duration will be considered from 25th October.
State:
State of the SLO. Choices are:
- Draft: The SLO is not running in your instance yet. You can add new SLIs or update existing SLIs and you can delete the SLO.
- Running: The SLO is active in your instance. You can edit, retire, or delete the SLO.
  Note:
  Editing an SLO in the running state retires it and a new copy is created. See Working with Reliability metrics.
- Retired: The SLO is no longer running in your instance. You can reactivate it.
Objective (percentage): Percentage of the desired SLI performance.
Limit (occurrences): Number of limit breaches that have occurred. (Used by Count SLO types.)
Service Level Indicator: Service Level Indicator (SLI) associated with this SLO.
Error budget: Displays, in days and time, how much error budget there is.
Error budget is calculated based on the provided Compliance period and Objective (percentage) when creating an SLO.
Remaining error budget: Displays, in days and time, how much error budget is left.
Remaining breach occurrences: Number of breaches left before the limit is reached.
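
The budget fields above can be made concrete with a short worked example. The page states only that the error budget is derived from the Compliance period and the Objective (percentage). The formula below, period multiplied by (100 − objective) / 100, is the conventional definition and is assumed here rather than confirmed for SRM. The function names are illustrative, and treating Remaining breach occurrences as the limit minus the breaches counted so far is likewise an assumption.

```python
from datetime import timedelta


def error_budget(compliance_period: timedelta, objective_pct: float) -> timedelta:
    """Allowed time outside the objective over one compliance period."""
    if not 0 < objective_pct <= 100:
        raise ValueError("objective_pct must be in the range (0, 100]")
    return compliance_period * (1 - objective_pct / 100)


def remaining_error_budget(budget: timedelta, cumulative_breach: timedelta) -> timedelta:
    """Budget left after subtracting breach time; never negative in this sketch."""
    return max(budget - cumulative_breach, timedelta(0))


def remaining_breach_occurrences(limit: int, breaches_so_far: int) -> int:
    """For count-based SLOs: breaches left before the limit is reached."""
    return max(limit - breaches_so_far, 0)


# Example: a rolling 30-day period with a 99.9% objective.
budget = error_budget(timedelta(days=30), 99.9)
left = remaining_error_budget(budget, timedelta(minutes=30))
```

Under these assumptions, a 30-day period with a 99.9% objective gives a budget of roughly 43 minutes, and a 30-minute breach leaves roughly 13 minutes.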

Note:

Service level objective history [sn_sow_srm_slo_history] and Service level indicator metric [sn_sow_srm_sli_metric] records are archived after one year and destroyed five years after that. Archiving is expected to improve performance while keeping the data for the same overall retention period. No queries are run against archived tables.
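
To see how this retention schedule plays out for a single record, the sketch below computes the two milestone dates. It assumes the one-year clock starts at the record's creation date and that years are calendar years. Neither detail is stated on this page, and the helper names are made up for the example.

```python
from datetime import date
from typing import Tuple


def add_years(d: date, years: int) -> date:
    """Shift a date by whole calendar years, mapping 29 Feb to 28 Feb if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # 29 February does not exist in the target year.
        return d.replace(year=d.year + years, day=28)


def retention_milestones(created: date) -> Tuple[date, date]:
    """Return (archive_date, destroy_date) for a record created on `created`."""
    archive_date = add_years(created, 1)        # archived after one year
    destroy_date = add_years(archive_date, 5)   # destroyed five years after that
    return archive_date, destroy_date


archive_on, destroy_on = retention_milestones(date(2024, 2, 1))
```

With these assumptions, a record created on 1 February 2024 would be archived on 1 February 2025 and destroyed on 1 February 2030, six years after creation.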