Exploring Service Reliability Management

  • Release version: Xanadu
  • Updated August 1, 2024
  • 3 minutes to read
    Summarized using AI
    This content was generated using new OpenAI-powered functionality. Results are provided on an as-is basis and are not guaranteed to be accurate or complete.

    Summary of Exploring Service Reliability Management

    Service Reliability Management (SRM) in ServiceNow offers a guided, self-serve experience designed for teams to autonomously manage the health of their technical services. Built within the Service Operations Workspace, SRM integrates IT Operations Management (ITOM) and IT Service Management (ITSM) capabilities into a unified service operations workflow. It supports IT teams adopting Site Reliability Engineering (SRE) practices by enabling improved digital service reliability, rapid response to issues through on-call escalations, and streamlined onboarding for distributed teams with minimal central IT governance.


    Activating SRM automatically installs related plugins and applications essential for its functionality.

    Key Features

    • User Roles and Responsibilities: SRM defines distinct roles tailored to operational needs:
      • Administrators: Manage SRM configurations, integrations, and user access across the platform.
      • Managers: Oversee SRE teams, manage on-call schedules, monitor performance, and maintain resilience workflows.
      • Responders (SREs): Handle daily incident diagnosis and remediation, and manage their team-related configurations and schedules.
    • Service and Team Management: Teams can register services, define service level objectives (SLOs), and set up on-call schedules to ensure business outcome alignment.
    • Monitoring and Alerts: Integration with monitoring tools (such as Application Performance Monitoring) enables real-time tracking of service health through Service Level Indicators (SLIs) and triggers alerts for service degradations.
    • Incident and Alert Handling: Automated incident creation and on-call notifications ensure timely response to outages or performance issues, supported by collaborative diagnosis and remediation workflows.
    • Reliability Metrics and Error Budgets: Teams can establish and maintain reliability indicators and error budget policies to monitor and improve service resilience continuously.
    • SRM Workspaces: Dedicated alert and incident workspaces provide detailed views and actionable panels for efficient incident management.

    Key Outcomes

    • Empowers IT and DevOps teams to proactively monitor and improve service reliability aligned with business goals.
    • Facilitates rapid incident response through structured on-call management and alert escalation.
    • Reduces setup complexity with guided onboarding and role-based access control, supporting scalable team operations.
    • Enables continuous service improvement by capturing post-incident action items and managing service health through SLOs and error budgets.
    • Supports a team-based approach to service reliability, fostering collaboration between responders, managers, and administrators.

    Next Steps for ServiceNow Customers

    To implement and leverage SRM effectively, customers should explore configuration and usage guides specific to Service Reliability Management, including:

    • Configuring Service Reliability Management
    • Using Service Reliability Management
    • Service Reliability Management reference materials
    • Getting started with Service Reliability Management

    These resources will help teams accelerate their ability to view service health in the context of SLOs and streamline incident resolution workflows, driving agility, performance, and uptime.

    Service Reliability Management (SRM) provides a self-serve, guided experience for teams to autonomously manage the health of their technical services. The experience is built using the Service Operations Workspace application and combines ITOM and ITSM capabilities into a single service operations workflow.

    SRM overview

    Optimize service health with Service Reliability Management (SRM) for teams in IT adopting site reliability engineering (SRE) practices. SRM is a single operations workspace that empowers teams to improve the reliability of digital services with SRE.
    • Use on-call escalations to respond to issues identified by your monitoring and ITOM alerts in a timely manner.
    • Reduce setup friction with guided self-service to onboard distributed teams with separated data, empowered access, and minimal governance from central IT.

    When SRM is activated, several plugins and applications are also installed. For more information, see Plugins or applications installed with ITOM Health.

    SRM users

    Table 1. Users

    User: admin
    Description:
    A ServiceNow administrator is responsible for the administration, development, operation, education, and maintenance of the ServiceNow platform.

    Responsible for installation and can perform Service Operations Workspace Admin Center configuration of SRM.
    Contains roles: All

    User: Administrator [srm_admin]
    Note: Not the ServiceNow admin role.
    Description:
    SRM Administrators can manage account settings, configurations, and users.

    Administrators can perform the following actions:
    • Access, create, edit, or delete all SRM configurations.
    • Add or manage integrations.
    • Create integrations with Application Performance Monitoring (APM) tools.
    • Set up and maintain Reliability Indicators.
    • Set up and maintain Error Budget Policies.
    Contains roles:
    • Manager
    • Responder

    User: Manager [srm_manager]
    Description:
    Managers oversee a team of SREs. Managers assign SREs to the team on-call schedule, monitor their performance, create procedures to deal with incidents, and develop solutions. Managers ensure resilience across all the systems and the DevOps workflows.

    Managers can perform the following actions within the context of their teams:
    • Define and set up teams, on-call schedules, and services.
    • Add and delete users, such as responders and managers, for the teams they are a part of.
    • Add or manage integrations.
    • Create integrations with Application Performance Monitoring (APM) tools.
    • Set up and maintain Reliability Indicators.
    • Set up and maintain Error Budget Policies.
    Contains roles:
    • Responder

    User: Responder [srm_responder]
    Description:
    A Service Reliability Engineer (SRE) who uses SRM to perform everyday tasks. Responders are the individuals who are on call and who diagnose and remediate incidents.

    Responders can only access configurations that they’re a part of. They can only access the alerts or incidents for which they have permissions.

    SREs can perform the following actions within the context of their teams:
    • Set up services, teams, and integrations.
    • Confirm their on-call schedules.
    • Manage incident and alert records.
    • Update teams that they’ve created.
    • Add other responders.
    • Create integrations with Application Performance Monitoring (APM) tools.
    • Set up and maintain reliability metrics.
    • Set up and maintain error budget actions.
    Contains roles: Inherits 17 roles, including the following:
    • cmdb_read
    • sn_sow.sow_user
    • sn_sow_srm.srm_responder
    • workspace_user
    • slo_operator

    For more information, see SRM roles and responsibilities.
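
    The "contains roles" relationships in Table 1 form a small hierarchy: Administrator contains Manager and Responder, and Manager contains Responder. The sketch below shows one way to reason about what a user with a given role effectively holds. It is an illustration only and not a ServiceNow API: the CONTAINED_ROLES mapping and the effective_roles function are hypothetical names, and the responder entry lists only the five inherited roles that the table names out of 17.

```python
# Hypothetical model of the role containment described in Table 1.
# This is not a ServiceNow API; names and structure are assumptions.
CONTAINED_ROLES = {
    "srm_admin": ["srm_manager", "srm_responder"],
    "srm_manager": ["srm_responder"],
    # Table 1 says the responder inherits 17 roles; only the named ones appear here.
    "srm_responder": [
        "cmdb_read",
        "sn_sow.sow_user",
        "sn_sow_srm.srm_responder",
        "workspace_user",
        "slo_operator",
    ],
}


def effective_roles(role):
    """Return the role plus every role it contains, directly or indirectly."""
    seen = set()
    stack = [role]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # Roles without an entry in the mapping contain nothing further.
        stack.extend(CONTAINED_ROLES.get(current, []))
    return seen


if __name__ == "__main__":
    for name in ("srm_admin", "srm_manager", "srm_responder"):
        print(name, "->", sorted(effective_roles(name)))
```

    Under this model, a manager holds everything a responder holds, and an administrator holds everything a manager holds, which matches the nesting shown in the table.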

    SRM workflow

    Infographic showing how responders, managers, and administrators manage teams, register services, define SLOs, monitor integrations, respond to notifications, and remediate incidents. For details, refer to the following description.
    1. Product teams in IT or Lines of Business continuously deliver new technical and application services. Example: New customer billing portal.
    2. Along with SLO Management, teams can onboard themselves to SRM to register these services and define service level objectives (SLOs) that support business outcomes. Example: 95% monthly availability for the billing portal.
    3. Monitoring integrations are set up by the teams to collect the real-time health of these services. Example: Cloud Observability.
    4. Monitoring creates alerts that impact service level indicators (SLIs) when services are under-performing. Automation groups and enriches these alerts. Example: Billing portal latency is exceeding 7 s.
    5. When the alerts indicate an outage or customer-impacting degradation, incidents are created and on-call notifications alert the appropriate team resources. Example: A Billing SRE team is notified via phone call of a latency issue on the billing portal.
    6. After incidents are collaboratively diagnosed and remediated, action items for improved resilience are captured. Example: The Billing team decides to add web server capacity.
    7. Management continually reviews SLO performance, helps to prevent changes when the error budget is exhausted, and prioritizes improvement initiatives for under-performing services.
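
    To make the SLO in step 2 and the error budget in step 7 concrete, the sketch below works out how much downtime a 95% monthly availability target allows and how much of that budget remains after some downtime. It assumes a 30-day month and that availability is measured as uptime divided by total time in the window. The function names are illustrative and are not part of SRM.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Downtime (in minutes) that an availability SLO allows in the window.

    slo_target is a fraction, for example 0.95 for 95% availability.
    The 30-day default window is an assumption for this example.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent.

    A value of 0.0 or below means the budget is exhausted.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # 95% over 30 days leaves 5% of 720 hours, which is 36 hours.
    print(round(error_budget_minutes(0.95) / 60), "hours of allowed downtime")
    # After 9 hours (540 minutes) of downtime, about three quarters remain.
    print(f"{budget_remaining(0.95, 540):.2f} of the budget remains")
```

    With these assumptions, a 95% monthly target allows about 36 hours of downtime. Once the remaining fraction reaches zero, the error budget is exhausted, which is the point at which step 7 describes management helping to prevent further changes.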

    SRM benefits

    Table 2. SRM benefits
    Benefit | Feature | Users
    Team-based experience | Working with SRM teams | Service Reliability Responder, Manager, and Administrator
    Service registration | Working with SRM services | Service Reliability Responder, Manager, and Administrator
    Prebuilt integrations | Working with SRM integrations | Service Reliability Responder, Manager, and Administrator
    Measure service health | Working with Reliability metrics | Service Reliability Responder, Manager, and Administrator
    On-call coverage | Create your SRM On-call schedule | Service Reliability Responder, Manager, and Administrator
    Remediate high-severity alerts and incidents | Working with SRM reliability tasks | Service Reliability Responder, Manager, and Administrator