An overview of alerts for Event Management operators
As an Event Management operator, you need to understand how an alert is generated from an event, what to look for in an alert, and how alerts can be grouped together.
This is the first lesson in the Event Management tutorial.
| Lesson 1 | An overview of events and alerts |
| Lesson 2 | |
| Lesson 3 | |
| Lesson 4 | |
Your organization already has an event monitoring tool in place, such as Microsoft System Center Operations Manager (SCOM), Nagios, or SolarWinds. When an issue occurs on your network, such as a computer going down or a database failure, the event monitoring tool sends events to your ServiceNow instance. The Event Management application processes the events according to the settings that your administrator configured, and then generates alerts. An alert is an indicator that the issue requires some type of action.
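The flow from events to alerts can be pictured with a small sketch. This is only an illustration under stated assumptions, not how Event Management is implemented: the `Event` and `Alert` classes, their field names, and the rule of grouping events by node and issue type are all hypothetical stand-ins for the settings that an administrator configures.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    source: str    # monitoring tool that sent the event, e.g. "Nagios"
    node: str      # affected computer or device (hypothetical names below)
    kind: str      # what went wrong, e.g. "database failure"
    severity: str  # severity label sent by the monitoring tool


@dataclass
class Alert:
    node: str
    kind: str
    severity: str
    events: list = field(default_factory=list)


def process_events(events):
    """Fold events into alerts: one alert per (node, kind) pair.

    Assumption: grouping by (node, kind) stands in for the administrator's
    configured settings; it is not the product's real processing logic.
    """
    alerts = {}
    for event in events:
        key = (event.node, event.kind)
        alert = alerts.get(key)
        if alert is None:
            alert = Alert(event.node, event.kind, event.severity)
            alerts[key] = alert
        # The severity sent with the latest event is carried over to the alert.
        alert.severity = event.severity
        alert.events.append(event)
    return list(alerts.values())


incoming = [
    Event("Nagios", "db-01", "database failure", "Critical"),
    Event("SCOM", "web-02", "computer down", "Major"),
    Event("Nagios", "db-01", "database failure", "Critical"),
]
alerts = process_events(incoming)
for alert in alerts:
    print(alert.node, alert.kind, alert.severity, len(alert.events))
```

In this sketch, the two database failure events on the same node produce a single alert, so an operator sees one item to act on instead of one per event.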
As an Event Management operator, your role is to view alerts and, depending on how Event Management is implemented in your organization, take an action to help resolve the underlying issue or notify someone who can. Later in this tutorial, you will see the phases of a typical alert management process.
Alert priority and severity
- The priority of an alert is a score that helps you determine how significant the impact on application services is. Multiple factors determine the alert priority score. Your Event Management administrator can configure the algorithm that the Event Management application uses to calculate priority.
- The severity of an alert is an indicator of how serious the underlying issue is. The event monitoring tool in your organization usually sends severity values with the event, and these values are then carried over to the alert. These are the default severity types that you will see in this tutorial:

| Severity | Description |
|---|---|
| Critical | The resource is either not functional or critical problems are imminent. |
| Major | Major functionality is severely impaired or performance has degraded. |
| Minor | Partial, non-critical loss of functionality or performance degradation occurred. |
| Warning | Attention is required, even though the resource is still functional. |
| OK | No severity. An alert is created. The resource is still functional. |
| Clear | The alert no longer needs action. |
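To see how the severity types can guide the order in which you look at alerts, here is a minimal sketch. The numeric ranks are an assumption chosen for illustration (lower means more serious) and are not necessarily the values that the instance stores. The alert names are hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Assumed ranking for illustration only: lower number = more serious.
    CRITICAL = 1
    MAJOR = 2
    MINOR = 3
    WARNING = 4
    OK = 5
    CLEAR = 6


def needs_action(severity):
    """Clear is the only severity that means the alert no longer needs action."""
    return severity is not Severity.CLEAR


def triage_order(alerts):
    """Sort (name, severity) pairs so the most serious alerts come first."""
    return sorted(alerts, key=lambda pair: pair[1])


alerts = [
    ("disk-warning", Severity.WARNING),
    ("router-down", Severity.CRITICAL),
    ("old-alert", Severity.CLEAR),
    ("slow-db", Severity.MAJOR),
]
open_alerts = [a for a in triage_order(alerts) if needs_action(a[1])]
for name, severity in open_alerts:
    print(severity.name, name)
```

Severity alone does not decide what to work on first. The priority score described above also reflects the impact on application services, so treat this ordering as one input among several.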
Correlated alerts
Some alerts are related to each other. For example, if a router goes down, several separate alerts could be generated, one for each server connected to the router. All of these alerts are related, or correlated. To help you manage correlated alerts, Event Management can automatically group them and establish a two-level hierarchy with one root alert, called the primary alert, at the top, and other related alerts, called secondary alerts, under the primary alert. When you view alerts, primary alerts stand out by default so you know which alert to focus on without being distracted by the secondary alerts.
In our example, if a router goes down on your network, network communication is also affected for connected servers, assuming they cannot reach any other routers. The router outage becomes the primary alert, and the alerts generated on the servers are secondary alerts that are correlated under the router alert.
Depending on your organization's Event Management implementation, alerts might be grouped automatically based on correlation rules that your administrator sets up. Your instance can also learn how to improve the way it correlates alerts based on these rules. As an operator, you should still verify the accuracy of the correlation and, if necessary, manually correlate additional alerts with the primary alert. Later in this tutorial, you will learn how to manually correlate alerts.
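The two-level hierarchy of primary and secondary alerts can be sketched as a simple data structure. This is an illustrative model only: the `Alert` class, the `parent` field, the alert numbers, and the `correlate` helper are hypothetical and do not reflect how the instance stores or correlates alerts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    number: str                   # hypothetical alert identifier
    description: str
    parent: Optional[str] = None  # number of the primary alert, if any


def group_alerts(alerts):
    """Return {primary number: [secondary numbers]} for a two-level hierarchy."""
    groups = {a.number: [] for a in alerts if a.parent is None}
    for a in alerts:
        if a.parent is not None:
            if a.parent not in groups:
                raise ValueError(f"{a.number} points at a non-primary alert")
            groups[a.parent].append(a.number)
    return groups


def correlate(alerts, secondary_number, primary_number):
    """Manually place one alert under a primary alert, keeping two levels."""
    by_number = {a.number: a for a in alerts}
    if by_number[primary_number].parent is not None:
        raise ValueError("a secondary alert cannot act as a primary alert")
    if any(a.parent == secondary_number for a in alerts):
        raise ValueError("an alert with its own secondary alerts cannot be nested")
    by_number[secondary_number].parent = primary_number


alerts = [
    Alert("ALERT-1", "Router down"),
    Alert("ALERT-2", "Server A unreachable", parent="ALERT-1"),
    Alert("ALERT-3", "Server B unreachable"),
]
# An operator notices that the Server B alert belongs to the same outage.
correlate(alerts, "ALERT-3", "ALERT-1")
groups = group_alerts(alerts)
print(groups)
```

After the manual correlation, the router alert is the only primary alert in the sketch, with both server alerts listed beneath it.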
Alert flapping
An alert can flap, meaning that it gets multiple open-close events in rapid succession. Flapping indicates that Event Management does not know whether the underlying events are genuine or not. The events could indicate small issues with the way CIs are configured, or larger issues, like network outages.
For example, if a server that hosts a web service has too many active processes, it might trigger an event about excessive CPU usage. Since CPU usage can fluctuate rapidly depending on web service requests, several events might be triggered, leading to the alert being put in the flapping state. As an operator, you might need to create an incident to have the server restarted, or someone might have to reconfigure the CPU, or possibly make a hardware change on the device.
As another example, consider a loose network cable that causes momentary, repeated network outages. The thresholds that your administrator configures might not be optimal for this kind of alert, and Event Management considers it a flapping alert.
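One way to picture flapping is to count open-close cycles inside a time window. The sketch below assumes hypothetical thresholds (`window_seconds` and `min_cycles`) and a hypothetical list of timestamped state changes. The actual thresholds and detection logic are whatever your administrator has configured in Event Management.

```python
def open_close_cycles(transitions):
    """Pair each "open" with the next "close" and return (opened, closed) times.

    transitions: list of (timestamp_seconds, state) tuples sorted by time,
    where state is "open" or "close".
    """
    cycles = []
    opened_at = None
    for timestamp, state in transitions:
        if state == "open" and opened_at is None:
            opened_at = timestamp
        elif state == "close" and opened_at is not None:
            cycles.append((opened_at, timestamp))
            opened_at = None
    return cycles


def is_flapping(transitions, window_seconds=120, min_cycles=3):
    """True if min_cycles consecutive open-close cycles fit in one window.

    Both thresholds are assumed values for illustration only.
    """
    cycles = open_close_cycles(transitions)
    for i in range(len(cycles) - min_cycles + 1):
        first_open = cycles[i][0]
        last_close = cycles[i + min_cycles - 1][1]
        if last_close - first_open <= window_seconds:
            return True
    return False


# CPU usage that crosses its limit three times in about a minute and a half.
cpu_events = [
    (0, "open"), (20, "close"),
    (35, "open"), (50, "close"),
    (70, "open"), (95, "close"),
]
# A single outage that stays open for ten minutes before closing.
steady = [(0, "open"), (600, "close")]

print(is_flapping(cpu_events))
print(is_flapping(steady))
```

With these assumed thresholds, the rapidly fluctuating CPU events count as flapping and the single long outage does not. Changing `window_seconds` or `min_cycles` changes which alerts are caught, which is why poorly tuned thresholds can misclassify alerts.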
Continue the tutorial
Proceed to the next lesson: Application services for Event Management operators.