Working with Reliability metrics
The SRM metrics features enable you to define service level indicators (SLIs), service level objectives (SLOs), and error budget policies, allowing you to effectively track service performance and initiate necessary actions when issues arise. This process includes integrating signal aggregation and updating reliability indicators upon the generation of qualified alerts.
Key Features
- SLI Signal Aggregation: Aggregate signals for service level indicators.
- Create Duration and Count Based SLOs: Set up SLOs based on specific metrics.
- Calculate Error Budgets: Determine available error budgets for services.
- Error Budget Policies: Establish policies to manage service disruptions.
- Error Budget Visualization: Visualize error budgets for better tracking.
Key Outcomes
By utilizing the SRM reliability metrics, customers can view critical data regarding reliability and error budgets via the Services > Overview tab. The reliability metrics tab provides comprehensive details about each SLO, including its state, compliance period, performance objectives, and remaining error budgets. Changes to SLOs result in the creation of new records for accurate monitoring. Note that historical data is archived after one year, enhancing performance while maintaining data longevity.
Use the SRM metrics features to define service level indicators (SLIs), service level objectives (SLOs), and error budget policies so that you and your team can track your services and take action when required.
High-level workflow
- SRM leverages integrations for signal aggregation.
- Reliability indicators containing SLIs and SLOs are created for the service in SRM.
- When a qualified alert is generated for a service, the cumulative breach and the error budget values are updated for the reliability indicators in SRM.
- An error budget policy is created for the service to trigger actions such as creating an incident or sending an email to remediate service issues. Error budgets are constrained by Category.
- SLI signal aggregation
- Create duration and count based service level objectives
- Calculate error budgets (EB)
- Error budget policies
- Error budget visualization
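The workflow above can be sketched in a few lines of Python. This is only an illustration of the described flow, not SRM code: the class names, the `record_qualified_alert` function, and the idea that a policy fires at a consumed-budget threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ReliabilityIndicator:
    """Hypothetical stand-in for an SLO record; not an SRM table or API."""
    name: str
    error_budget: timedelta                      # total budget for the compliance period
    cumulative_breach: timedelta = timedelta(0)  # breach time accumulated so far

    @property
    def remaining_error_budget(self) -> timedelta:
        return self.error_budget - self.cumulative_breach

    @property
    def consumed_fraction(self) -> float:
        # Dividing two timedeltas yields a plain float ratio.
        return self.cumulative_breach / self.error_budget


@dataclass
class ErrorBudgetPolicy:
    """Hypothetical policy: fire an action once a share of the budget is used."""
    threshold: float  # assumed trigger, as a fraction of the budget consumed
    action: str       # for example "create_incident" or "send_email"


def record_qualified_alert(indicator, breach, policies):
    """Add the breach to the indicator and return the actions now triggered."""
    indicator.cumulative_breach += breach
    return [p.action for p in policies
            if indicator.consumed_fraction >= p.threshold]


slo = ReliabilityIndicator("checkout-availability",
                           error_budget=timedelta(hours=7, minutes=12))
policies = [ErrorBudgetPolicy(0.5, "send_email"),
            ErrorBudgetPolicy(0.9, "create_incident")]

# A 4-hour breach uses about 56% of the budget, so only the email policy fires.
print(record_qualified_alert(slo, timedelta(hours=4), policies))  # ['send_email']
```

Keeping the cumulative breach on the indicator and deriving the remaining budget from it mirrors the order of the workflow: the alert updates the values first, and the policy is evaluated against the updated values.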
Reliability metrics tab
The tab lists the service level objectives (SLOs).
Reliability metrics
Each service level objective shows the following details:
- Service Level Objective: Name of the SLO.
- SLO type: Duration, Count, Count by periods, Count by occurrences.
- Service: Service the SLO is set on.
- Compliance period: How long the SLO is set to last.
  - Month: The duration is the current calendar month. For example, if the current date is 26th January, the duration runs from 1st January to 31st January.
  - Rolling 7 days: The duration is the 7 days leading up to the current date.
  - Rolling 30 days: The duration is the 30 days leading up to the current date. For example, if the current date is 26th January, the duration will be considered from 25th December.
  - Rolling 90 days: The duration is the 90 days leading up to the current date. For example, if the current date is 26th January, the duration will be considered from 25th October.
- State: State of the SLO. Choices are:
  - Draft: The SLO is not running in your instance yet. You can add new SLIs, update existing SLIs, or delete the SLO.
  - Running: The SLO is active in your instance. You can edit, retire, or delete the SLO. Note: Editing an SLO in the Running state retires it and creates a new copy. See Working with Reliability metrics.
  - Retired: The SLO is no longer running in your instance. You can reactivate it.
- Objective (percentage): Percentage of the desired SLI performance.
- Limit (occurrences): Number of limit breaches that have occurred. (Used by Count SLO types.)
- Service Level Indicator: Service Level Indicator (SLI) associated with this SLO.
- Error budget: Displays, in days and time, the total error budget. The error budget is calculated from the Compliance period and Objective (percentage) provided when the SLO is created.
- Remaining error budget: Displays, in days and time, how much error budget is left.
- Remaining breach occurrences: Number of breaches left before the limit is reached.
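As a rough illustration of how an error budget relates to the Compliance period and Objective (percentage), the sketch below uses the common convention that the budget is the share of the period not covered by the objective. The exact formula SRM applies is not stated here, so treat this as an assumption; the function names are hypothetical.

```python
from datetime import timedelta


def error_budget(compliance_period: timedelta, objective_pct: float) -> timedelta:
    """Assumed formula: budget = period * (100 - objective) / 100."""
    if not 0 < objective_pct <= 100:
        raise ValueError("objective_pct must be in the range (0, 100]")
    return compliance_period * (100 - objective_pct) / 100


def remaining_error_budget(compliance_period: timedelta,
                           objective_pct: float,
                           cumulative_breach: timedelta) -> timedelta:
    """Budget left after subtracting the breach time accumulated so far."""
    return error_budget(compliance_period, objective_pct) - cumulative_breach


# Rolling 30 days with a 99% objective leaves 1% of the period as budget.
budget = error_budget(timedelta(days=30), 99)
print(budget)  # 7:12:00

# After 2 hours of cumulative breach, 5 hours 12 minutes remain.
print(remaining_error_budget(timedelta(days=30), 99, timedelta(hours=2)))  # 5:12:00
```

Under this assumption, tightening the objective from 99% to 99.9% over the same 30 days shrinks the budget from 7 hours 12 minutes to about 43 minutes, which is why the objective and compliance period together determine how much breach time a service can absorb.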