Service Reliability Management

Release version: Washingtondc

Updated February 1, 2024

Summary of Service Reliability Management

Service Reliability Management (SRM) enhances your organization’s ability to respond to alerts and incidents efficiently. It is available with ITSM Professional and ITOM Operator Professional subscriptions, ensuring improved service resilience through effective incident response management.

Key Features

- Incident Response Process: SRM integrates with existing infrastructure to streamline incident management, allowing teams to track alerts and incidents with service level objectives (SLOs) and indicators (SLIs).
- Team Management: Administrators can create teams, add members, set on-call schedules, and define escalation policies to ensure timely incident response.
- Service Integration: SRM supports integration with various monitoring systems to detect service issues early and maintain service health.
- Alert Handling: Alerts from multiple sources can trigger automated actions and notifications to relevant team members, facilitating quick responses.
- Postmortem Analysis: After resolving incidents, teams can conduct reviews to document root causes and lessons learned, promoting continuous improvement.

Key Outcomes

By implementing SRM, your organization can expect:

- Enhanced collaboration among teams during incident resolution.
- Improved ability to track service performance and alert management.
- More effective problem-solving with clear insights into service health and past incidents.
- Streamlined workflows and reduced downtime, ultimately leading to better service quality.

Service Reliability Management (SRM) provides an easier way for your organization to respond, collaborate, track, and self-remediate when working on alerts and incidents. SRM is available when you have both ITSM Professional and ITOM Operator Professional subscriptions.

Service Reliability Management Process flow

SRM integrates with your processes and existing infrastructure so that you can manage your incident response process effectively, resulting in improved service resilience. When services are disrupted, restoring service is a top priority for your business. Track the volume, performance, and progress of alerts and incidents in your instance using service level objectives (SLOs), service level indicators (SLIs), and error budgets. From initial analysis and detection to resolution, you can filter reports by source, service, priority, assignment group, and assignee.

With SRM, you can schedule who is on call and when, and escalate as needed. Teams can work together to solve problems and then analyze why they happened so that you can take steps to avoid the same issues in the future. Postmortems can be stored and easily retrieved for reference. With SRM, your teams can find and fix service degradations and other issues in your data center or cloud infrastructure and applications. SRM empowers your development and operations teams to track the health of services in the context of service level objectives (SLOs) and efficiently resolve incidents.

Set up a team

The Service Operations Workspace administrator for SRM can add internal users, create teams, and define services.

SRM administrators, managers, and responders can also create their own teams. As part of team creation, they can add team members, create on-call schedules, add escalation policies, and assign services that have associated integrations. The team is responsible for handling issues related to its associated services.

Set policies and escalations

You can set up an escalation policy for your team so that at least one team member is engaged in incident response.

Add Services

Services are integrated with various observability or monitoring systems. These monitoring systems continuously track the status of the service to give the earliest warning of failures, defects, or problems. A system such as a server, app, microservice, or database can contribute to service degradation.

Set up Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for your services

Establish goals for how well your service should operate. Also specify the maximum amount of time that a technical system can fail without contractual consequences.
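As a worked illustration of how an availability target translates into an error budget, the following minimal sketch computes the downtime a target allows over a time window and how much of that budget remains. The function names, the 99.9% target, and the 30-day window are assumptions chosen for illustration; they are not SRM settings or APIs.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime allowed in `window` for an availability target.

    `slo_target` is a fraction such as 0.999 (99.9% availability).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta,
                     downtime: timedelta) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = error_budget(slo_target, window)
    if budget == timedelta(0):
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - downtime / budget


if __name__ == "__main__":
    window = timedelta(days=30)
    budget = error_budget(0.999, window)
    print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
    remaining = budget_remaining(0.999, window, timedelta(minutes=10))
    print(f"Budget remaining: {remaining:.0%}")
```

Under these assumptions, a 99.9% target over 30 days allows 0.1% of the window, or 43.2 minutes, of downtime before the budget is exhausted.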

Add Integrations

Monitoring systems are configured to report system health metrics such as network traffic, latency, saturation, and errors. When a monitoring system detects a spike in a metric, it sends an alert to SRM.
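To make the idea of a metric spike concrete, here is a minimal sketch of how a monitoring system might compare the latest sample against a recent baseline and build an alert payload. The threshold rule (twice the recent average), the function names, and the payload fields are all hypothetical; they do not describe how any listed monitoring system works or the schema that SRM expects.

```python
import json
from statistics import mean
from typing import Optional, Sequence


def detect_spike(history: Sequence[float], latest: float,
                 factor: float = 2.0) -> bool:
    """Report a spike when `latest` exceeds `factor` times the recent average."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return latest > factor * mean(history)


def build_alert(service: str, metric: str, value: float) -> str:
    """Serialize a hypothetical alert payload; the field names are illustrative."""
    return json.dumps({"service": service, "metric": metric, "value": value})


def check_metric(service: str, metric: str, history: Sequence[float],
                 latest: float) -> Optional[str]:
    """Return an alert payload if the latest sample is a spike, else None."""
    if detect_spike(history, latest):
        return build_alert(service, metric, latest)
    return None


if __name__ == "__main__":
    latency_ms = [100.0, 110.0, 90.0]
    print(check_metric("checkout", "latency_ms", latency_ms, 250.0))
    print(check_metric("checkout", "latency_ms", latency_ms, 120.0))
```

The sketch only builds the payload. How an alert is actually delivered depends on the integration you configure.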

The following are the monitoring systems available in SRM.

- Custom Connector (Webhook)
- Amazon Web Services
- Catchpoint
- Datadog
- Dynatrace Monitor
- Google Cloud Platform
- Grafana
- Honeycomb
- Instana
- ServiceNow Cloud Observability
- Logic Monitor
- Microsoft Azure
- New Relic
- Oracle Cloud
- Prometheus
- Scout APM
- Sentry
- Sumologic
- ThousandEyes
- Transform Generic Events Instance

Collaboration Integrations

- Receive notifications in a Slack channel for specific events in a service. These notifications are targeted to a specific team.
- Receive notifications in a Microsoft Teams channel for specific events in a service. These notifications are targeted to a specific team.

Keep up with your collaborators on a timeline with updates from responders. Engage in real-time discussions with integrated Slack or MS Teams chat-ops that are linked directly to your response alerts.

View Alerts and incidents

Alerts come from the pre-built monitoring integrations, a generic REST API, a CLI, or email. The alert includes information on the type of issue and the affected component or system.

When you receive an alert and an error budget policy is triggered, you can notify a team member by email or manually create an incident.

When you receive an alert, response rules are triggered to find out if any post-processing is needed or if another automated action needs to be performed. An automated action can route an alert to a team, contact an on-call team member, or notify a team member by email. You can also manually create an incident. The system determines which team member to notify based on who is currently on call, using the schedules and shifts that are defined when the team is created.
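The on-call lookup described above can be pictured with a minimal sketch: given a team's shifts, find the member whose shift covers the current time. The `Shift` class, the `on_call_member` function, and the half-open interval convention are assumptions made for illustration; they are not SRM data structures.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Shift:
    member: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def on_call_member(shifts: Sequence[Shift], now: datetime) -> Optional[str]:
    """Return the member whose shift covers `now`, or None if nobody is scheduled."""
    for shift in shifts:
        if shift.start <= now < shift.end:
            return shift.member
    return None


if __name__ == "__main__":
    schedule = [
        Shift("alice", datetime(2024, 2, 1, 0, 0), datetime(2024, 2, 1, 12, 0)),
        Shift("bob", datetime(2024, 2, 1, 12, 0), datetime(2024, 2, 2, 0, 0)),
    ]
    print(on_call_member(schedule, datetime(2024, 2, 1, 9, 30)))
    print(on_call_member(schedule, datetime(2024, 2, 1, 12, 0)))
```

Treating the end of a shift as exclusive means that a handover time belongs to exactly one member, so no alert is routed to two people or to nobody at the boundary.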

The on-call team member can acknowledge the alert using the mobile app, the desktop, an SMS, the CLI, a collaboration integration, or an email notification. If the team member does not acknowledge the alert within a specific time, the escalation policy associated with the team notifies the next level of recipients. If an incident is required, the on-call team member can promote an alert to an incident. Once an alert is assigned, the responder can look through logs, debug, log in to a remote system, or take other actions to find the root cause and resolve the issue. If additional expertise is needed, the on-call team member can collaborate with cross-functional teams.

When an incident is created, you can associate alerts with it.

Consolidate alerts and incidents into a single view to better manage and allocate resources and improve the performance of your teams. You have a single reliable source of information across teams and systems.
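The escalation behavior can be sketched as an ordered list of levels, each with its own acknowledgment timeout. The `EscalationLevel` class, the `notified_recipients` function, and the timeout values below are hypothetical and only illustrate the idea; they are not how SRM stores or evaluates escalation policies.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EscalationLevel:
    recipients: Sequence[str]
    ack_timeout_minutes: int  # how long to wait for an acknowledgment before escalating


def notified_recipients(policy: Sequence[EscalationLevel],
                        minutes_unacknowledged: int) -> List[str]:
    """Return everyone notified so far for an alert that is still unacknowledged."""
    notified: List[str] = []
    deadline = 0  # minutes after the alert at which the current level is notified
    for level in policy:
        if minutes_unacknowledged < deadline:
            break
        notified.extend(level.recipients)
        deadline += level.ack_timeout_minutes
    return notified


if __name__ == "__main__":
    policy = [
        EscalationLevel(["alice"], 15),
        EscalationLevel(["bob"], 15),
        EscalationLevel(["manager"], 30),
    ]
    for minutes in (0, 14, 15, 40):
        print(minutes, notified_recipients(policy, minutes))
```

In this sketch the first level is notified immediately, and each later level is notified only after every earlier level's timeout has elapsed without an acknowledgment.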

Alert automation rules

Alert automation rules are available for SRM in the Alert Automation app, which is available upon request from the ServiceNow Store.

Resolution and postmortem

Once a root cause is identified, the on-call team member works to remediate the issue and restore service levels. This could involve rolling back a change or reconfiguring a system. While the service is degraded or suffering an outage, responders are kept updated on the progress of the alert or incident. When the service is restored, a postmortem review meeting can walk through what happened and capture lessons learned. The team can create a postmortem document to formally track the root cause analysis of the issue and the actions required to prevent the issue from recurring.

Visit our community

For thought leadership, prescriptive guidance, and to interact with the product team and other customers using Service Operations Workspace, visit the Service Operations Workspace community.