Working with SRM services
Service Reliability Management (SRM) enables teams to manage services that deliver functional outcomes, such as networking or HR services. Each service may consist of various technical components and relies on integrations to effectively route alerts to the appropriate responders, ensuring timely acknowledgment and follow-ups on alerts.
Key Features
- Service Integration: Add integrations via the Services module to monitor technical services and receive events.
- Reliability Metrics: Create metrics that help track the reliability and performance of each service.
- Service Management: Services are categorized by various metrics, including active incidents, critical alerts, open changes, and error budget status.
- Error Budget: Represents the amount of unreliability that the Service Level Objective (SLO) allows over a specified time, aiding in release management.
- Customizable Views: The services landing page allows filtering, grouping, and sorting of service lists for better management visibility.
Key Outcomes
By utilizing SRM, customers can effectively manage their services, automate response routines, and gain insights into service performance through detailed metrics. This structured approach enables teams to respond promptly to incidents, prioritize critical services, and maintain operational reliability.
A service represents a functional outcome, such as networking, payments, or HR services, that is owned by a team. To deliver that outcome, a service can contain one or more technical components, such as a user authentication service or a piece of shared infrastructure like a database.
You might want multiple tool integrations to monitor each technical service and receive events from those tools. Add an integration to SRM using the Services module. See Working with SRM integrations.
In addition, you can create reliability metrics for the service. See Working with Reliability metrics.
Tying a team and policies to that service makes it easier to divide responsibilities and track technical outcomes. It also makes it easier to automate response routines and focus on who you notify and when.
The state of an existing service is inherited. The state of a service created in SRM is None.
Services
- Your Services: Count of all the services you or your team manages and monitors for reliability.
- Services with active incidents: Services with one or more open incidents, sorted first by business criticality, most critical at the top; then sorted by number of active incidents, highest number at the top; and finally sorted by % of error budget remaining, lowest at the top.
- Services with critical alerts: Services with open alerts, sorted first by business criticality, most critical at the top; then sorted by number of alerts, highest number at the top; and finally sorted by % of error budget remaining, lowest at the top.
- Services with open changes: Services your team manages and monitors for reliability that have one or more open changes.
- Services with low error budget: Services with less than 25% of their error budget remaining.
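The three-level sort order described for services with active incidents, and the 25% low-error-budget cutoff, can be sketched in a few lines. This is only an illustration: the `Service` record, its field names, and the helper functions are hypothetical and do not reflect SRM's actual data model or API.

```python
from dataclasses import dataclass


@dataclass
class Service:
    # Hypothetical record; field names are illustrative, not SRM's schema.
    name: str
    business_criticality: int      # 1 = most critical ... 4 = not critical
    open_incidents: int
    error_budget_remaining: float  # percentage, 0-100


def sort_services_with_active_incidents(services):
    """Keep services with at least one open incident, in the documented order."""
    active = [s for s in services if s.open_incidents > 0]
    return sorted(
        active,
        key=lambda s: (
            s.business_criticality,    # most critical (1) at the top
            -s.open_incidents,         # highest number of incidents next
            s.error_budget_remaining,  # lowest remaining budget last tie-breaker
        ),
    )


def services_with_low_error_budget(services, threshold=25.0):
    """Services whose remaining error budget is below the threshold."""
    return [s for s in services if s.error_budget_remaining < threshold]


services = [
    Service("payments", 1, 2, 40.0),
    Service("hr-portal", 3, 5, 10.0),
    Service("networking", 1, 2, 15.0),
    Service("auth", 1, 4, 80.0),
    Service("wiki", 4, 0, 5.0),
]

ordered_names = [s.name for s in sort_services_with_active_incidents(services)]
low_budget_names = [s.name for s in services_with_low_error_budget(services)]
print(ordered_names)
print(low_budget_names)
```

With this sample data, `auth` comes first (criticality 1, most incidents), `networking` precedes `payments` because both have the same criticality and incident count but `networking` has less error budget left, and `wiki` is excluded because it has no open incidents.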
The error budget is the amount of unreliability that the service's SLO allows you to spend over a specified time. It can be used to manage release velocity.
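As a worked example of how an error budget relates to an SLO, the sketch below assumes a simple time-based availability SLO. The function names and the downtime-based calculation are assumptions made for illustration; SRM may derive the value differently, for example from request counts.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Total allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def error_budget_remaining_percent(
    slo_percent: float, window_days: int, downtime_minutes: float
) -> float:
    """Share of the error budget still unspent, clamped at 0%."""
    budget = error_budget_minutes(slo_percent, window_days)
    remaining = (budget - downtime_minutes) / budget * 100
    return max(remaining, 0.0)


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(99.9, 30)
# After 10.8 minutes of downtime, about 75% of the budget is left.
remaining = error_budget_remaining_percent(99.9, 30, 10.8)
print(round(budget, 1), round(remaining, 1))
```

Under these assumptions, a team with most of its budget left can afford riskier or more frequent releases, while a service approaching 0% would typically slow down changes until the window rolls over.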
Each column in the list can be grouped or filtered.
Each list can be edited, sorted or exported.
For more detailed information on individual services, see View an SRM service.
Services list view metric definitions
- Service: Name of the service.
- Class: Application or Technical service.
- Business criticality: How important this service is to the business. Choices are:
  - 1 - most critical (default)
  - 2 - somewhat critical
  - 3 - less critical
  - 4 - not critical
- Open alerts: Number of open alerts assigned to the service.
- Open incidents: Number of open incidents assigned to the service.
- Error budget remaining: Percentage of error budget remaining for the service.
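The column definitions above can be pictured as a small record type, with grouping by a column as a simple bucketing step. The sketch below is a hypothetical model for illustration only: `ServiceRow`, the enum names, and `group_by` are assumptions, not SRM's schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum, IntEnum


class ServiceClass(Enum):
    APPLICATION = "Application"
    TECHNICAL = "Technical service"


class BusinessCriticality(IntEnum):
    MOST_CRITICAL = 1      # default choice in the list above
    SOMEWHAT_CRITICAL = 2
    LESS_CRITICAL = 3
    NOT_CRITICAL = 4


@dataclass
class ServiceRow:
    # Hypothetical row of the services list; one field per documented column.
    service: str
    service_class: ServiceClass
    business_criticality: BusinessCriticality = BusinessCriticality.MOST_CRITICAL
    open_alerts: int = 0
    open_incidents: int = 0
    error_budget_remaining: float = 100.0  # percentage


def group_by(rows, column):
    """Bucket service names by the value of the given column."""
    groups = defaultdict(list)
    for row in rows:
        groups[getattr(row, column)].append(row.service)
    return dict(groups)


rows = [
    ServiceRow("payments", ServiceClass.APPLICATION, open_incidents=2),
    ServiceRow("database", ServiceClass.TECHNICAL, BusinessCriticality.SOMEWHAT_CRITICAL),
    ServiceRow("auth", ServiceClass.TECHNICAL, open_alerts=3),
]

by_class = group_by(rows, "service_class")
by_criticality = group_by(rows, "business_criticality")
print(by_class)
print(by_criticality)
```

Modeling business criticality as an integer enum keeps the "1 is most critical" ordering explicit, so rows can be sorted or compared directly on that column.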