Exploring Service Reliability Management

Release version: Yokohama

Updated January 30, 2025

Service Reliability Management (SRM) provides a self-serve, guided experience for teams to manage service health. The experience is built using the Service Operations Workspace application and combines ITOM and ITSM capabilities into a single workflow.

SRM overview

Optimize service health with site reliability engineering (SRE) practices. SRM is a single operations workspace that empowers teams to improve the reliability of digital services with SRE.

Use on-call escalations to respond to issues in a timely manner.
Reduce setup friction with guided self-service to onboard distributed teams with separated data, empowered access, and minimal governance from central IT.

When SRM is installed, several plugins and applications are also activated. For more information, see Plugins or applications installed with ITOM AIOps.

SRM users

Table 1. Users
Users	Description	Contains Roles
admin	A ServiceNow administrator is responsible for the administration, development, operation, education, and maintenance of the ServiceNow platform. The admin installs SRM and can configure it in the Service Operations Workspace Admin Center.	All
Administrator [srm_admin] (Note: This role differs from the ServiceNow admin role.)	SRM administrators can manage account settings, configurations, and users. Administrators can perform the following actions: access, create, edit, or delete all SRM configurations; add or manage integrations; create integrations with Application Performance Monitoring (APM) tools; set up and maintain reliability metrics; set up and maintain error budget policies.	Manager, Responder
Manager [srm_manager]	Managers oversee a team of SREs. Managers assign SREs to the team on-call schedule, monitor their performance, and create procedures to handle incidents and develop solutions. Managers promote resilience across all systems and DevOps workflows. Managers can perform the following actions within the context of their teams: define and set up teams, on-call schedules, and services; add and delete users, such as responders and managers, for the teams they're a part of; add or manage integrations; create integrations with Application Performance Monitoring (APM) tools; set up and maintain reliability metrics; set up and maintain error budget policies.	Responder
Responder [srm_responder]	A site reliability engineer (SRE) who uses SRM to perform everyday tasks. Responders are the individuals who are on call and who diagnose and remediate incidents. Responders can access only the configurations that they’re a part of and only the alerts or incidents for which they have permissions. SREs can perform the following actions within the context of their teams: set up services, teams, and integrations; confirm their on-call schedules; manage incident and alert records; update teams that they’ve created; add other responders; create integrations with Application Performance Monitoring (APM) tools; set up and maintain reliability metrics; set up and maintain error budget actions.	Inherits 17 roles, including the following: cmdb_read, sn_sow.sow_user, sn_sow_srm.srm_responder, workspace_user, slo_operator

For more information, see SRM roles and responsibilities.
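
The containment relationships in Table 1 can be pictured with a small model. The following is a minimal sketch in Python, not a ServiceNow API: the role names come from the table, while the `CONTAINS` dictionary and the `effective_roles` and `can_act_as` helpers are hypothetical names introduced only for illustration. The platform admin role and the inherited platform roles of the responder are left out for brevity.

```python
# Illustrative model of the role containment in Table 1; not a ServiceNow API.
# Role names come from the table; the dictionary and helpers are hypothetical.
CONTAINS = {
    "srm_admin": {"srm_manager", "srm_responder"},
    "srm_manager": {"srm_responder"},
    "srm_responder": set(),  # inherited platform roles omitted in this sketch
}


def effective_roles(role: str) -> set[str]:
    """Return the role plus every SRM role it contains, directly or transitively."""
    seen: set[str] = set()
    stack = [role]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(CONTAINS.get(current, ()))
    return seen


def can_act_as(user_role: str, required_role: str) -> bool:
    """True when user_role is, or contains, required_role."""
    return required_role in effective_roles(user_role)
```

Under this model, `can_act_as("srm_admin", "srm_responder")` is true because the administrator role contains the responder role, while `can_act_as("srm_responder", "srm_manager")` is false.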

SRM workflow

Infographic showing how responders, managers, and administrators manage teams, register services, define SLO, monitor integrations, respond to notifications, and remediate incidents. For details, refer to the following description.

Product teams in IT or Lines of Business continuously deliver new technical and application services. Example: New customer billing portal.
Along with SLO Management, teams can register services and define service level objectives (SLOs), helping them reach business outcomes. Example: 95% monthly availability for the billing portal.
Monitoring integrations are set up by the teams to collect the real-time health of these services. Example: Cloud Observability.
Monitoring creates service level indicators (SLIs) that trigger alerts when services are underperforming. Automation groups and enriches these alerts. Example: Billing portal latency exceeds 7 s.
When the alerts indicate an outage or customer-impacting degradation, incidents are created and on-call notifications are sent to the appropriate team members. Example: A Billing SRE team is notified via phone of a latency issue on the billing portal.
After teams collaboratively diagnose and remediate incidents, they identify action items for improving the system's resilience. Example: The Billing team decides to add web server capacity.
Management continually reviews SLO performance, helps to prevent changes when the error budget is exhausted, and prioritizes improvement initiatives for underperforming services.
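
The arithmetic behind the workflow's examples can be sketched in a few lines. The following is a rough Python illustration, not how SRM computes its metrics: it assumes a 30-day window, assumes availability is measured in minutes of downtime, and uses hypothetical names (`AvailabilitySlo`, `latency_breached`). Only the 95% target and the 7 s latency threshold come from the examples above.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySlo:
    """Hypothetical availability SLO measured over a fixed window."""

    target: float        # for example, 0.95 for 95% availability
    window_minutes: int  # length of the SLO window in minutes

    def budget_minutes(self) -> float:
        """Downtime the SLO tolerates over the whole window."""
        return (1.0 - self.target) * self.window_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        """Error budget left after the downtime observed so far."""
        return self.budget_minutes() - downtime_minutes

    def changes_allowed(self, downtime_minutes: float) -> bool:
        """Assumed policy: hold changes once the budget is used up."""
        return self.remaining_minutes(downtime_minutes) > 0


def latency_breached(latency_seconds: float, threshold_seconds: float = 7.0) -> bool:
    """True when a latency SLI exceeds the threshold (7 s in the example)."""
    return latency_seconds > threshold_seconds


# Billing portal example: 95% availability over an assumed 30-day month.
billing_slo = AvailabilitySlo(target=0.95, window_minutes=30 * 24 * 60)
```

With these assumptions, the billing portal's monthly error budget is about 2,160 minutes (36 hours) of downtime. After 2,000 minutes of downtime, `changes_allowed` still returns true; after 2,200 minutes the budget is exhausted and it returns false, which corresponds to the point where management would hold back further changes.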

SRM benefits


Benefit	Feature	Users
Team-based experience	Working with SRM teams	SRM administrators, managers, and responders
Service registration	Working with SRM services	SRM administrators, managers, and responders
Prebuilt integrations	Working with SRM integrations	SRM administrators, managers, and responders
Measure service health	Working with reliability metrics	SRM administrators, managers, and responders
On-call coverage	Create an SRM on-call schedule	SRM administrators, managers, and responders
Remediate high severity alerts and incidents	Working with SRM reliability tasks	SRM administrators, managers, and responders

What to explore next

To learn more about configuring and using SRM, see: