Working with Reliability metrics

  • Release version: Washingtondc
  • Updated February 1, 2024

    Use the SRM metrics features to define service level indicators (SLIs), service level objectives (SLOs), and error budget policies so that you and your team can track your services and take action when required.

    High-level workflow

    1. SRM leverages integrations for signal aggregation.
    2. Reliability indicators containing SLIs and SLOs are created for the service in SRM.
    3. When a qualified alert is generated for a service, the cumulative breach and the error budget values are updated for the reliability indicators in SRM.
    4. An error budget policy is created for the service to trigger actions such as creating an incident or sending an email to remediate service issues. Error budgets are constrained by Category.
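Steps 3 and 4 of the workflow can be pictured with a small model. This is only an illustrative sketch, not SRM's implementation: the class names, the `consumed_threshold` field, and the rule that a policy fires when a share of the budget has been consumed are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class ErrorBudgetPolicy:
    # Hypothetical rule: fire `action` once this share of the budget is consumed.
    consumed_threshold: float  # 0.5 means 50% of the error budget is used
    action: str                # for example "create_incident" or "send_email"


@dataclass
class ReliabilityIndicator:
    error_budget: timedelta
    cumulative_breach: timedelta = timedelta(0)
    policies: list[ErrorBudgetPolicy] = field(default_factory=list)

    @property
    def remaining_budget(self) -> timedelta:
        return self.error_budget - self.cumulative_breach

    def record_alert(self, breach: timedelta) -> list[str]:
        """Add a breach (step 3) and return the actions whose
        threshold was crossed by this alert (step 4)."""
        before = self.cumulative_breach / self.error_budget
        self.cumulative_breach += breach
        after = self.cumulative_breach / self.error_budget
        return [p.action for p in self.policies
                if before < p.consumed_threshold <= after]


indicator = ReliabilityIndicator(
    error_budget=timedelta(minutes=40),
    policies=[ErrorBudgetPolicy(0.5, "send_email"),
              ErrorBudgetPolicy(1.0, "create_incident")],
)
first = indicator.record_alert(timedelta(minutes=10))   # 25% consumed, no action
second = indicator.record_alert(timedelta(minutes=15))  # 62.5% consumed
print(first, second, indicator.remaining_budget)
```

In this sketch each policy fires once, at the alert that pushes consumption across its threshold. How SRM evaluates policies in practice may differ.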
    The key features of the SRM metrics are:
    • SLI signal aggregation
    • Create duration-based and count-based service level objectives
    • Calculate error budgets (EB)
    • Error budget policies
    • Error budget visualization
    Navigate to the Services > Overview tab to view all associated critical data for Reliability and Error Budget metrics. See Working with SRM services for more information.
    Note:
    Scores are visible only when SLIs, SLOs, and error budgets have been created and are affected. See Create an SLO, an SLI, and Error budget policies for SRM for more detailed information.

    Reliability metrics tab

    The Service > Reliability metrics tab contains the service level objectives (SLOs).

    Figure 1. SRM reliability metrics list view
    The reliability metrics tab shows a list of the service level objectives for a selected service.
    Note:
    Updating an SLO changes its state: the existing SLO record is retired and a new copy is created so that monitoring stays accurate.
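The retire-and-copy behavior described in the note can be sketched as follows. The `SloRecord` type and `edit_running_slo` function are hypothetical names used only for illustration, and the state of the new copy is an assumption, since the note does not say which state it starts in.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SloRecord:
    name: str
    objective_pct: float
    state: str  # "Draft", "Running", or "Retired"


def edit_running_slo(slo: SloRecord, **changes) -> tuple[SloRecord, SloRecord]:
    """Return (retired original, new copy carrying the changes)."""
    if slo.state != "Running":
        raise ValueError("this sketch only models edits to a Running SLO")
    retired = replace(slo, state="Retired")
    # Assumption: the new copy keeps the Running state of the original.
    new_copy = replace(slo, **changes)
    return retired, new_copy


original = SloRecord(name="Checkout availability", objective_pct=99.9, state="Running")
retired, current = edit_running_slo(original, objective_pct=99.5)
print(retired.state, current.state, current.objective_pct)
```

Keeping the retired record unchanged means measurements taken against the old objective are never reinterpreted under the new one.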

    Reliability metrics

    Each service level objective shows the following details:

    • Service Level Objective: Name of the SLO.
    • SLO type: Duration, Count, Count by periods, Count by occurrences.
    • Service: Service the SLO is set on.
    • Compliance period:
      How long the SLO is set to last.
      • Month: The duration is considered to be the current month. For example, if the current date is 26th January, the duration will be considered from 1st January till 31st January.
      • Rolling 7 days: The duration is considered to be the 7 days leading up to the current date.
      • Rolling 30 days: The duration is considered to be the 30 days leading up to the current date. For example, if the current date is 26th January, the duration will be considered from 25th December.
      • Rolling 90 days: The duration is considered to be the 90 days leading up to the current date. For example, if the current date is 26th January, the duration will be considered from 25th October.
    • State:
      State of the SLO. Choices are:
      • Draft: The SLO is not running in your instance yet. You can add new SLIs or update existing SLIs and you can delete the SLO.
      • Running: The SLO is active in your instance. You can edit, retire, or delete the SLO.
        Note:
        Editing an SLO in the running state retires it and a new copy is created. See Working with Reliability metrics.
      • Retired: The SLO is no longer running in your instance. You can reactivate it.
    • Objective (percentage): Percentage of the desired SLI performance.
    • Limit (occurrences): Number of limit breaches that have occurred. (Used by Count SLO types.)
    • Service Level Indicator: Service Level Indicator (SLI) associated with this SLO.
    • Error budget: Displays, in days and time, how much error budget there is.

      Error budget is calculated based on the provided Compliance period and Objective (percentage) when creating an SLO.

    • Remaining error budget: Displays, in days and time, how much error budget is left.
    • Remaining breach occurrences: Number of breaches left before the limit is reached.
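The compliance period and error budget fields above lend themselves to a short worked example. The sketch below is not SRM's code: the function names are hypothetical, the rolling windows use plain calendar subtraction, and the budget uses the common formula of period length multiplied by (100% minus the objective), which is assumed here rather than confirmed by this page. Plain subtraction gives 27 December 2023 as the start of a rolling 30-day window ending on 26 January 2024, so the product's exact boundary handling may differ from this arithmetic.

```python
import calendar
from datetime import date, timedelta
from fractions import Fraction

ROLLING_DAYS = {"Rolling 7 days": 7, "Rolling 30 days": 30, "Rolling 90 days": 90}


def compliance_window(period: str, today: date) -> tuple[date, date]:
    """Return (start, end) of the compliance period that contains `today`."""
    if period == "Month":
        last_day = calendar.monthrange(today.year, today.month)[1]
        return today.replace(day=1), today.replace(day=last_day)
    # Assumption: a rolling window is simply N calendar days back from today.
    return today - timedelta(days=ROLLING_DAYS[period]), today


def error_budget(period_length: timedelta, objective_pct: str) -> timedelta:
    """Assumed formula: budget = period length * (1 - objective / 100)."""
    allowed = 1 - Fraction(objective_pct) / 100   # exact, avoids float rounding
    total_us = period_length // timedelta(microseconds=1)
    return timedelta(microseconds=round(total_us * allowed))


def remaining_occurrences(limit: int, breaches: int) -> int:
    """Assumption: breaches left before a count-based limit is reached."""
    return limit - breaches


today = date(2024, 1, 26)
month = compliance_window("Month", today)              # 1 Jan to 31 Jan 2024
rolling_30 = compliance_window("Rolling 30 days", today)
budget = error_budget(timedelta(days=30), "99.9")      # 43 minutes 12 seconds
remaining = budget - timedelta(minutes=10)             # after 10 minutes of breach
print(month, rolling_30, budget, remaining, remaining_occurrences(5, 2))
```

The objective is passed as a string so that `Fraction` can represent values such as 99.9 exactly.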
    Note:
    Service level objective history [sn_sow_srm_slo_history] and Service level indicator metric [sn_sow_srm_sli_metric] records are archived after one year and destroyed five years after that. Archiving is expected to improve performance while keeping the same overall data retention period. No queries are run against archived tables.
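The retention timeline in the note can be worked through for a single record. This sketch assumes the one-year clock starts on the record's creation date, which the note does not state, and the helper names are hypothetical.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 February to 28 February
    when the target year is not a leap year."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def retention_dates(created: date) -> tuple[date, date]:
    """Return (archive date, destroy date) for a record.
    Assumption: the one-year period is counted from creation."""
    archived = add_years(created, 1)     # archived after one year
    destroyed = add_years(archived, 5)   # destroyed five years after that
    return archived, destroyed


archived, destroyed = retention_dates(date(2024, 2, 1))
print(archived, destroyed)  # 2025-02-01 2030-02-01
```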