Working with reliability metrics

High-level workflow

Service Reliability Management (SRM) uses integrations to aggregate signals.
Reliability metrics containing service level indicators (SLIs) and service level objectives (SLOs) are created for the service in SRM.
When a qualified alert is generated for a service, the cumulative breach and the error budget values are updated for the reliability metrics in SRM.
An error budget policy is created for the service to trigger actions such as creating an incident or sending an email to remediate service issues. Error budgets are constrained by Category.
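
The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how SRM is implemented: the class names `ReliabilityMetric` and `ErrorBudgetPolicy`, the `threshold_pct` trigger, and the action string are all hypothetical. It shows a qualified alert adding to the cumulative breach, which reduces the remaining error budget, and a policy that returns an action once the remaining budget falls to an assumed threshold.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class ReliabilityMetric:
    """Hypothetical stand-in for a duration-based reliability metric."""
    error_budget: timedelta                      # allowable failure time for the period
    cumulative_breach: timedelta = timedelta(0)  # failure time consumed so far

    @property
    def remaining(self) -> timedelta:
        return self.error_budget - self.cumulative_breach

    def record_alert(self, breach: timedelta) -> None:
        # A qualified alert adds its breach duration to the running total.
        self.cumulative_breach += breach


@dataclass
class ErrorBudgetPolicy:
    """Hypothetical policy: act when the remaining budget drops to a threshold."""
    threshold_pct: float  # assumption: trigger expressed as % of budget remaining
    action: str           # for example "create_incident" or "send_email"

    def evaluate(self, metric: ReliabilityMetric) -> Optional[str]:
        remaining_pct = 100 * (metric.remaining / metric.error_budget)
        return self.action if remaining_pct <= self.threshold_pct else None


metric = ReliabilityMetric(error_budget=timedelta(minutes=40))
policy = ErrorBudgetPolicy(threshold_pct=25.0, action="create_incident")

metric.record_alert(timedelta(minutes=10))
first = policy.evaluate(metric)   # budget is still mostly intact

metric.record_alert(timedelta(minutes=25))
second = policy.evaluate(metric)  # only 5 of 40 minutes remain
```

In this sketch the first evaluation returns nothing because 75% of the budget remains, and the second returns the configured action because the remaining budget has dropped below the assumed 25% threshold.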

The key features of SRM metrics are:

SLI signals aggregation
Duration- and count-based SLO creation
Error budget creation
Error budget policies definition
Error budget visualization

Navigate to the Services > Overview tab to view all associated critical data for Reliability and Error Budget metrics. See Working with SRM services for more information.

Note:

Scores are visible only after SLIs, SLOs, and error budgets have been created and have been affected. See Create SLOs, SLIs, and error budget policies for more detailed information.

Service reliability dashboard

The Service reliability dashboard displays a customizable, high-level view of service performance. It helps you monitor and manage reliability using visualizations that track service states, error budgets, and service level objectives (SLOs) over time.

The dashboard displays information about all services in Service Reliability Management (SRM). You can access the dashboard in Service Operations Workspace in the following ways:

Navigate to Services > Service reliability.
Navigate to Home > Service reliability.

For more details, see Visualizations in the Service reliability dashboard.

Note:

You can also view SLO information for all services on the Services Overview tab. See Working with SRM services for more information.

Notification destinations

Notification destinations help keep teams informed about service reliability. Attach them to error budget policies to send notifications when a policy is breached.

To view and manage notification destinations in Service Operations Workspace, navigate to Teams > [Your team] > SLO Notification destinations.

Visit the following links to learn more about creating and working with notification destinations:

Reliability metrics tab

The Reliability metrics tab shows how well a specific service is meeting its reliability goals. Use it to track SLOs, service level indicators (SLIs), and error budgets for a service.

To view the Reliability metrics tab in Service Operations Workspace, navigate to Services > [Your service] > Reliability Metrics.

See these links to learn more about what you can do in the Reliability metrics tab:

Service level objectives table

On the Reliability metrics tab, the Service level objectives table includes the following details about the service SLOs:

Name - Name of the SLO. You can select the name to view the SLO record.
Reliability - Current state of the SLO. For example, stable, at risk, or critical.
% Error budget remaining - Percentage of the error budget still available in the current compliance period.
Compliance period - Time window used to calculate performance:
- Month - Current calendar month. For example, if the current date is January 26, the month is January 1 through January 31.
- Rolling 7, 30, or 90 days - Number of days from the current date. For example, for rolling 7 days, the duration is 7 days back from the current date.
SLI type - Performance category being measured:
- Availability - Percentage of time your service or configuration item is available, also known as uptime.
- Errors - Frequency of your service errors.
- Latency - Time that it takes to service a request.
- Saturation - Fullness of your system, focusing on resource usage.
Source type - Origin of the data used to calculate the SLIs for this SLO:
- Alert - Uses alerts from integrated monitoring tools.
- Outage - Uses outages detected by monitoring tools and reported by users. An outage indicates when the service was unavailable. This source type excludes planned outages, such as scheduled maintenance.
Updated - The date and time the SLO was last edited.
Updated by - The user name of the person who last edited the SLO.
State - Status of the SLO. For example, running or retired.
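
The two kinds of compliance period described in the table can be expressed as date ranges. The following is a minimal sketch, not SRM code: the function names `month_window` and `rolling_window` are hypothetical, and treating both endpoints as inclusive dates is an assumption.

```python
import calendar
from datetime import date, timedelta
from typing import Tuple


def month_window(today: date) -> Tuple[date, date]:
    """Return the first and last day of the calendar month containing `today`."""
    last_day = calendar.monthrange(today.year, today.month)[1]
    return today.replace(day=1), today.replace(day=last_day)


def rolling_window(today: date, days: int) -> Tuple[date, date]:
    """Return a window that reaches `days` days back from `today`."""
    if days not in (7, 30, 90):
        raise ValueError("rolling compliance periods are 7, 30, or 90 days")
    return today - timedelta(days=days), today


current = date(2024, 1, 26)
month = month_window(current)         # January 1 through January 31
rolling = rolling_window(current, 7)  # 7 days back from January 26
```

A month period stays fixed until the calendar month ends, whereas a rolling period moves forward each day with the current date.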

Service level objective: Name of the SLO. The SLO is a target value or the objective that your team must reach to meet your service level agreement (SLA).
SLI type: Performance category being measured:
- Availability: Percentage of time your service or configuration item is available, also known as uptime.
- Errors: Frequency of your service errors.
- Latency: Time that it takes to service a request.
- Saturation: Fullness of your system, focusing on resource usage.
Compliance period: Time window used to calculate performance:
- Month: Current calendar month. For example, if the current date is January 26, the month is January 1 through January 31.
- Rolling 7, 30, or 90 days: Number of days from the current date. For example, for rolling 7 days, the duration is 7 days back from the current date.
State: Status of the SLO, such as draft, running, or retired.
Objective (percentage): Target percentage of SLI performance.
Limit occurrences: Number of limit breaches that have occurred. Used by count-based SLOs only.
Service level indicator: SLI associated with the SLO.
Error budget: Allowable failure time for the compliance period, calculated using the compliance period and objective (percentage).
Remaining error budget: Error budget still available.
Remaining breach occurrences: Number of breaches still available before the limit is reached.
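
As a worked example of the error budget arithmetic, the sketch below assumes the common SRE formula in which the error budget is the share of the compliance period not covered by the objective. The text above does not spell out the exact calculation, so the formula and the function names `error_budget` and `remaining_budget_pct` are assumptions for illustration.

```python
from datetime import timedelta


def error_budget(period: timedelta, objective_pct: float) -> timedelta:
    """Allowable failure time, assuming budget = period * (1 - objective / 100)."""
    if not 0 <= objective_pct <= 100:
        raise ValueError("objective_pct must be between 0 and 100")
    return period * (1 - objective_pct / 100)


def remaining_budget_pct(budget: timedelta, cumulative_breach: timedelta) -> float:
    """Percentage of the error budget still available (negative if overspent)."""
    if budget == timedelta(0):
        return 0.0
    return 100 * ((budget - cumulative_breach) / budget)


# A 99.9% objective over a rolling 30-day period allows roughly 43 minutes of failure.
budget = error_budget(timedelta(days=30), 99.9)

# After 10 minutes of cumulative breach, most of the budget is still available.
pct_left = remaining_budget_pct(budget, timedelta(minutes=10))
```

Under this assumed formula, a stricter objective shrinks the budget quickly: moving from 99.9% to 99.99% over the same 30 days leaves a little over 4 minutes.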

Note:

For performance purposes, SLO and SLI records ([sn_sow_srm_slo_history] and [sn_sow_srm_sli_metric]) are archived after one year and deleted five years later. Archived data is omitted from tables and visualizations.