Understanding Metric Intelligence

Release version: Xanadu

Updated August 1, 2024

Summary of Understanding Metric Intelligence

Metric Intelligence in ServiceNow IT Operations Management (ITOM) Health leverages historical metric data to detect anomalies in Configuration Items (CIs) that traditional event monitoring might miss. This capability helps identify potential service outages early, enabling preventive actions through anomaly alerts that can be promoted to IT alerts and viewed in the Service Operations Workspace and service health dashboards.

Metric Intelligence integrates with monitoring systems such as SCOM, SolarWinds, and Nagios XI to capture raw metric data. It maps this data to CIs using event rules and the CMDB identification engine, then applies statistical models to detect deviations from expected behavior.

Key Features

Anomaly Detection: Uses statistical models built from up to 32 days of historical data to project expected metric values and control bounds, identifying statistical outliers and scoring anomalies from 0 to 10.
Data Sources and Metrics: Supports multiple monitoring systems and allows selection of relevant metric types per data source for processing.
Insights and Visualization: Provides metric statistics and anomaly visualization through the Insights Explorer and Anomaly Map, showing correlated high anomaly scores across CIs over time.
Configurable Anomaly Detection: Enables disabling anomaly detection during system maintenance via a system property, and customization for metrics with near-constant values through customer support.
Plugin Activation: Metric Intelligence functionality is available by activating the Metric Intelligence (com.snc.sa.metric) plugin.

Statistical Models and Classifiers

Metric Intelligence employs various statistical models to characterize metric data patterns and detect anomalies:

Time Series Models: Include weekly and daily seasonal patterns, linear trends, noisy data, and several classifiers for handling different data behaviors (e.g., positive clipped noisy, skewed noisy, accumulator, near constant, multinomial, corrupt).
Kalman Filter Model: Enhances time series models, particularly for noisy data, by adapting to level changes in metric data over time.
Non-Parametric Model: Applies to positive noisy data where noise distribution is asymmetric and no seasonal pattern exists, producing control bounds fitting actual data.
Median Absolute Deviation (MAD) Model: Used with skewed noisy data exhibiting heavy-tailed distributions, improving anomaly detection efficiency by about 30%.

These models learn from historical metrics to establish control bounds; values outside these bounds trigger anomaly alerts. Some data patterns do not support anomaly detection due to insufficient data or lack of identifiable patterns.

Practical Implications for ServiceNow Customers

Enable Metric Intelligence to proactively detect CI anomalies that may lead to service disruptions, enhancing service reliability.
Leverage integration with existing monitoring tools to enrich your ITOM Health insights without duplicating data collection efforts.
Use anomaly scores and visualizations to prioritize investigation and preventive actions on at-risk CIs.
Adjust anomaly detection settings during maintenance windows to avoid irrelevant alerts.
Understand the nature of metric data patterns through statistical models to interpret anomaly alerts accurately.

Use Metric Intelligence to identify and prevent potential service outages. Metric Intelligence, based on historical metric data, indicates anomalous behavior of CIs which events might not capture. Anomaly alerts can be promoted to regular IT alerts and appear on the Service Operations Workspace and service health dashboard for preventive actions.

Starting with the New York release, Metric Intelligence is part of ITOM Health in the IT Operations Management product.

Anomaly detection

Metric data is collected by various data sources such as SCOM, SolarWinds monitoring system, or Nagios XI server (some partially configured for metric collection by default). These monitoring systems collect metric data from the source environment regularly. Metric Intelligence captures the raw data from these monitoring systems, and uses event rules and the CMDB identification engine to map data to existing CIs and their resources. The data is then analyzed to detect anomalies and to provide other statistical scores.

Metric Intelligence uses historical metric data to build statistical models. These models facilitate projection of expected metric values along with upper and lower bounds. Metric Intelligence then uses these projections to detect statistical outliers and to calculate anomaly scores. Anomalies are scored on a range of 0-10. High anomaly scores for CI metrics can indicate that a CI is at risk of causing a service outage.
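The idea of control bounds and a 0-10 score can be illustrated with a minimal Python sketch. This is not the product's algorithm: the bounds here are assumed to be the mean plus or minus `k` standard deviations, and the scoring rule (saturating at 10 once a value is one half-width beyond a bound) is a hypothetical choice made only for illustration. The function names `control_bounds` and `anomaly_score` are likewise invented.

```python
from statistics import mean, stdev


def control_bounds(history, k=3.0):
    """Return (lower, upper) bounds as mean +/- k sample standard deviations.

    Assumption: a simple symmetric band; the real models are more elaborate.
    """
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread


def anomaly_score(value, lower, upper):
    """Map the distance outside the bounds to a 0-10 score (illustrative rule)."""
    half_width = (upper - lower) / 2.0
    if half_width <= 0 or lower <= value <= upper:
        return 0.0
    excess = (lower - value) if value < lower else (value - upper)
    # Hypothetical scaling: one half-width beyond a bound saturates at 10.
    return min(10.0, 10.0 * excess / half_width)


history = [10, 12, 11, 9, 10, 11, 12, 10]
lower, upper = control_bounds(history)
in_bounds_score = anomaly_score(10.5, lower, upper)
far_out_score = anomaly_score(1000, lower, upper)
```

Values inside the band score 0, and the score grows with the distance from the nearest bound until it is capped at 10.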

After processing, the Insights Explorer shows metric statistics and charts, and the Anomaly Map shows correlated scores for CIs with the highest anomaly scores, across a timeline.

You may want to disable anomaly detection during system maintenance, as anomalies may be irrelevant when detected while maintenance is in progress. To do so, set the mid.mi.anomaly_detection.disable property to true.

To customize displaying anomalies for metrics classified as near-constant, contact customer support.

Metric Intelligence is available when you activate the Metric Intelligence (com.snc.sa.metric) plugin.

Terms used with Metric Intelligence

Source metric type: A metric such as '% Free Space' or 'Current Bandwidth' that can be measured by a data source for a CI. For each data source, you can choose which of all possible source metric types are processed. For example, there are about 380 source metric types that are active by default for the SCOM data source.
Anomaly: Data that is outside the control bounds is considered a statistical outlier. These outliers are used to compute an anomaly score, which is a value from 0 to 10 that indicates the degree to which the metric value appears unlikely. When an anomaly score is above a threshold, an anomaly alert is generated. Anomaly alerts are reported separately from regular IT alerts.
Resource: A component of a CI that consists of multiple individual components of similar type, where each subcomponent can be monitored separately. For example, individual Web pages, or specific disks such as 'Disk C:' and 'Disk D:'.
Time series: A series of values (such as metric values) over a time range, associated with a CI and a metric type. Because an anomaly score is evaluated for each metric, the series of anomaly scores over a period of time is also a time series. Time series are computed by the statistical model built for a metric data series, and are used with metric data values, anomaly scores, and upper and lower control bounds.

Statistical models

Metric Intelligence jobs learn from past metric data (up to 32 days old). A model training process analyzes historical data to construct a model that projects future values. Typically, models are in effect until the next time the model learning process runs. These models are used to calculate upper and lower bounds. Incoming values that are beyond those bounds, and that deviate with statistical significance from expected values, generate anomalies. Each model is uniquely patterned and is labeled with a classifier that illustrates the general behavior of the model. This classification determines if anomaly detection can be applied. For most models, it is possible to project which future values deviate from expected values. Such models are associated with control bounds and anomaly detection can be applied (if enabled).

However, for some models, there is insufficient data to determine which values are anomalous and anomaly detection cannot be applied without additional information (even if anomaly detection is enabled).

The learned data models are stored in the Metric Time Series Models [sa_time_series] table.

The following statistical models and classifiers are used in anomaly detection:

Time Series statistical model

After it is established, a time series model does not adjust to changes in the incoming metric data. Therefore, if the pattern of incoming data changes, those changes are likely to be identified as anomalous. Upper and lower control bounds, after they are learned, persist until the next time the learning process runs (the learning process runs every day).

Weekly

Data with a pattern that repeats itself over weekly intervals (seasonal model).

Requires a minimum of 15 days of data in the series, as set by the weekly_model_min_days configuration setting.

Weekly classifier

Daily

Data with a pattern that repeats itself over a daily interval (seasonal model).

Requires a minimum of 3 days of data in the series, as set by the daily_model_min_days configuration setting.

Daily classifier

Trendy

Data that has a linear trend with some slope and with some noise.

Requires a minimum of 30 data points in the series, as set by the corrupt_data_count_threshold configuration setting.

Trendy classifier

Noisy

Typical noisy data that is a basic pattern classification in a data model. The pattern cannot be identified with a specific trend or seasonality.

Requires a minimum of 30 data points in the series, as set by the corrupt_data_count_threshold configuration setting.

Noisy classifier

Positive clipped noisy

Similar to the noisy classifier, except that the lower bound is fixed at 0.

Requires a minimum of 30 data points in the series, as set by the corrupt_data_count_threshold configuration setting.

Positive clipped noisy classifier

Centered noisy

Noisy data that typically spreads symmetrically between user-specified upper and lower bounds. The formula that sets the bounds and width values ignores the statistical data, and the lower and upper widths have an identical value.

Requires that the number of data points in the series is zero.

See Specify custom upper and lower metric bounds for more information.

Centered noisy classifier

Skewed noisy

Noisy data that is not evenly spread between user-specified upper and lower bounds, but instead tends to concentrate closer to one of the bounds. The median of the data is used to separately compute an upper width and a lower width.

Requires a minimum of one data point in the series.

See Specify custom upper and lower metric bounds for more information.

Skewed noisy classifier

Skewed noisy - Generalized Extreme Value (GEV) Distribution

Noisy data that is spread unevenly between user-specified upper and lower bounds, and concentrates closer to one of the bounds. In addition, the data distribution demonstrates a long or heavy tail. The median of data derived from the tail of the distribution is used to separately compute an upper width and a lower width. There must be at least one data point in the series.

Accumulator

Data pattern similar to the trendy classifier but with a monotonic increase and without noise. For this classifier, there is no data model and no anomaly detection.

Requires a minimum of 30 data points in the series, as set by the corrupt_data_count_threshold configuration setting.

Diagram of the Accumulator classifier.

Near Constant

Nearly constant data, in which most values are a specific constant value. For this classifier, there is no data model and no anomaly detection.

Requires a minimum of 30 data points in the series, as set by the corrupt_data_count_threshold configuration setting.

Diagram of the Near Constant classifier.

Multinomial

Data pattern in which all values are one of a relatively small number of values. For example, values are always 100 or 99.9. For this classifier, there is no data model and no anomaly detection.

Requires a minimum of 400 data points in the series, calculated as 10 times the value of the multinomial_count_threshold configuration setting.

Multinomial classifier

Corrupt

The series has too few data points to identify a pattern. For this classifier, there is no data model and no anomaly detection.

Requires that the number of data points in the series is less than the value of the corrupt_data_count_threshold configuration setting (30 by default).
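The minimum data requirements listed above can be collected into one check. The following Python sketch is hypothetical: the constant and function names are invented, and it only reports which classifiers have enough data to be considered. It does not classify the pattern itself, and it leaves out the centered noisy and skewed noisy classifiers, which depend on user-specified bounds.

```python
# Minimums as stated in the classifier descriptions above.
CORRUPT_DATA_COUNT_THRESHOLD = 30  # corrupt_data_count_threshold (default)
WEEKLY_MODEL_MIN_DAYS = 15         # weekly_model_min_days
DAILY_MODEL_MIN_DAYS = 3           # daily_model_min_days
MULTINOMIAL_MIN_POINTS = 400       # 10 x multinomial_count_threshold


def candidate_classifiers(days_of_data, data_points):
    """List classifiers whose stated minimum data requirement is met."""
    if data_points < CORRUPT_DATA_COUNT_THRESHOLD:
        # Too few points to identify any pattern.
        return ["corrupt"]
    candidates = [
        "trendy",
        "noisy",
        "positive clipped noisy",
        "accumulator",
        "near constant",
    ]
    if days_of_data >= DAILY_MODEL_MIN_DAYS:
        candidates.append("daily")
    if days_of_data >= WEEKLY_MODEL_MIN_DAYS:
        candidates.append("weekly")
    if data_points >= MULTINOMIAL_MIN_POINTS:
        candidates.append("multinomial")
    return candidates


short_series = candidate_classifiers(days_of_data=1, data_points=10)
long_series = candidate_classifiers(days_of_data=20, data_points=2000)
```

Under these assumptions, a series with 10 points is only eligible for the corrupt classifier, while 20 days of data with 2000 points meets the stated minimum for every classifier in the list.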

Kalman Filter statistical model

An add-on to the time series statistical model, applicable only to the noisy and positive noisy classifiers. This model is a general method of estimating model parameters from a stream of data, where the level is the only parameter in the model. Unlike the time series model, the Kalman Filter model can adjust to new values in incoming metric data. When there are no clear patterns in the noise, or if there is too much noise, the Kalman Filter model is not used.

Local level

When incoming data clusters around a new value according to the current control bounds, the Learner adjusts the data model to accommodate a permanent change. This clustering is detected as a new value in the data model so that most incoming data is again within the control bounds rather than anomalous. Such change detection is useful when, for example, cores or memory are added to a server, which affects the baselines.

Requires a minimum of 30 data points in the series, as set by the corrupt_data_count_threshold configuration setting.

Diagram of the Kalman Filter Local Level classifier.

Unrecognized

When data does not fit the local level classifier, time series classifiers are used. This happens when it is not possible to adjust the variance ratio in a learned local level model to reasonable values.
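A textbook local-level Kalman filter shows how a level estimate can follow a permanent shift in the data. The Python sketch below is a generic illustration, not the product's implementation: the function name `local_level_filter` and the default variances are assumptions. In this kind of filter, the ratio between the process variance and the observation variance controls how quickly the level adapts.

```python
def local_level_filter(values, process_var=0.01, obs_var=1.0):
    """Track the level of a series with a scalar local-level Kalman filter.

    Assumptions: the level follows a random walk with variance `process_var`,
    and each observation is the level plus noise with variance `obs_var`.
    """
    level = values[0]
    variance = obs_var  # initial uncertainty about the level
    levels = []
    for y in values:
        # Predict: a random-walk level keeps its value, uncertainty grows.
        variance += process_var
        # Update: move the level toward the observation by the Kalman gain.
        gain = variance / (variance + obs_var)
        level += gain * (y - level)
        variance *= (1.0 - gain)
        levels.append(level)
    return levels


# A series whose baseline shifts from 10 to 20, for example after a
# hardware change on a server.
series = [10.0] * 50 + [20.0] * 200
levels = local_level_filter(series)
```

Right after the shift the estimated level moves only part of the way toward 20, and after enough observations it settles near the new baseline.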

Non-Parametric statistical model

An add-on to the positive noisy classifier. In the non-parametric model, the noise distribution is not symmetrical and does not fit any seasonal pattern. The non-parametric model creates control bounds that better fit the actual data, and once learned, the control bounds persist until the next learning cycle. This model does not adjust itself to changes in the data, and it takes longer for a deviation to be identified as an anomaly.

Stationary Non-Parametric

Data that is not time-dependent, meaning that there is no significant shift in parameters such as mean and variance when the data is shifted in time.

Requires a minimum of 5000 data points in the series, as set by the snpm_minimum_data_count configuration setting.

Diagram of the Non-Parametric Stationary classifier.

Unrecognized

When data does not fit the stationary classifier, time series classifiers are used.
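One common non-parametric way to obtain bounds that fit asymmetric data is to read them from empirical percentiles instead of assuming a distribution. The Python sketch below illustrates that general idea only. The function name `quantile_bounds` and the 1st/99th percentile choice are assumptions, and the source does not state how the product computes its non-parametric bounds.

```python
from statistics import quantiles


def quantile_bounds(history, lower_pct=1, upper_pct=99):
    """Return (lower, upper) bounds taken from empirical percentiles.

    Assumption: percentiles stand in for control bounds; no distribution
    shape (such as symmetry) is assumed.
    """
    # n=100 yields 99 cut points; cuts[i - 1] is the i-th percentile.
    cuts = quantiles(history, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


# Positive, right-skewed sample data: squares of 1..1000.
history = [x * x for x in range(1, 1001)]
lower, upper = quantile_bounds(history)
```

For skewed data like this, the upper bound lies much farther from the median than the lower bound, which a symmetric band around the mean would not capture.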

Median Absolute Deviation (MAD) statistical model

An add-on to the skewed noisy classifier. In this type of data, the noise distribution is not symmetrical and does not fit any seasonal pattern. In addition, the data reflects a heavy-tailed or long-tailed distribution. The MAD statistical model creates control bounds that better fit the data, and once learned, the control bounds persist until the next learning cycle. Using this model makes interpreting the collected data approximately 30% more efficient.
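The median absolute deviation can be sketched in a few lines of Python. The variant below computes separate lower and upper widths from the points on each side of the median, in the spirit of the skewed noisy description above. The function name `mad_bounds`, the multiplier `k`, and the two-sided variant itself are assumptions for illustration, not the product's documented formula.

```python
from statistics import median


def mad_bounds(history, k=3.0):
    """Return (lower, upper) bounds from one-sided median absolute deviations.

    Assumption: each width is k times the median distance from the overall
    median, computed separately for points below and above it.
    """
    center = median(history)
    below = [center - x for x in history if x <= center]
    above = [x - center for x in history if x >= center]
    lower_width = k * median(below)
    upper_width = k * median(above)
    return center - lower_width, center + upper_width


# Right-skewed sample with a long upper tail.
history = [1, 1, 2, 2, 3, 3, 4, 10, 20, 40]
lower, upper = mad_bounds(history)
```

Because medians are insensitive to a few extreme values, the long tail widens the upper bound without inflating the lower one.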