MID Server resource threshold alerts
Summary of MID Server resource threshold alerts
ServiceNow instances monitor MID Server CPU and JVM memory usage, generating warnings when resource thresholds are exceeded. These alerts help administrators proactively manage MID Server performance, preventing downtime caused by resource exhaustion. Alerts are recorded in the MID Server Issue [ecc_agent_issue] table and can trigger email notifications or custom scripts for response automation. By default, resource threshold alerts are disabled and must be enabled via system properties.
Key Features
- Resource Monitoring and Alerting: Every 10 minutes, MID Servers report CPU and memory metrics to the instance, which averages these over configurable intervals to detect threshold breaches.
- Configurable Thresholds and Sampling Intervals: Administrators can customize CPU and memory usage thresholds and sampling periods globally or per MID Server using system properties or configuration parameters.
- Issue Tracking: Breaches are logged in the MID Server Issue [ecc_agent_issue] table with states such as New, Acknowledged, and Resolved, allowing administrators to track and manage ongoing resource issues without duplicate records.
- Automated Notifications: The system triggers the mid.threshold.resource.breach event on breaches, supporting creation of email alerts or custom workflows for timely incident management.
- Resource Usage Reporting: MID Server dashboards provide 30-day trending reports on average CPU usage and max memory usage, aiding capacity planning and resource allocation.
Practical Guidance for ServiceNow Customers
- Enabling Alerts: Set the system properties mid.threshold.resource.breach.enable.cpu.alerts and mid.threshold.resource.breach.enable.memory.alerts to true to activate alerting.
- Responding to Alerts: When alerted, administrators should review issue states, acknowledge ongoing breaches, and resolve them by increasing MID Server JVM memory, adding MID Servers to share the workload, or reducing concurrent processing.
- CPU Optimization: Reduce host load by migrating MID Servers to dedicated machines or upgrading CPU resources, especially if high CPU usage occurs during Discovery activities.
- Managing Issue Records: Monitor the breach count field to assess the effectiveness of remediation; when increments cease, resource levels are likely sufficient.
- Maintenance: The system automatically deletes unresolved issues older than 30 days to keep the issue table current.
Tables and Business Rules Involved
- MID Server Issue [ecc_agent_issue]: Stores breach records with details such as count, last detected time, message, and MID Server name.
- ECC Agent Scalar Metric [ecc_agent_scalar_metric]: Records CPU usage metrics every 10 minutes.
- ECC Agent Memory Metric [ecc_agent_memory_metric]: Stores memory usage data every 10 minutes.
- Business Rules: After metric insertion, business rules trigger script includes to evaluate CPU and memory usage against thresholds and update issue records accordingly.
The instance displays warnings when a MID Server breaches its resource thresholds for CPU and JVM memory usage, enabling users to create email notifications or custom scripts when a breach occurs.
The MID Server Issue [ecc_agent_issue] table warns users when a MID Server exceeds configured thresholds of its allocated CPU and memory resources. These warnings are published before the MID Server experiences performance degradation or an out-of-memory error, enabling the administrator to increase resources and avoid downtime. Administrators can use a registered event to send email notification to selected recipients, advising them of any threshold breaches, or to create a custom script to do some other type of work. The instance continues to update the MID Server Issue [ecc_agent_issue] table to keep unresolved issues current.
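Resource threshold alerts are disabled by default. To enable them, set the following system properties to true:
- mid.threshold.resource.breach.enable.cpu.alerts
- mid.threshold.resource.breach.enable.memory.alerts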
Evaluation process
- Every 10 minutes, each MID Server transmits its CPU and memory consumption metrics to the instance. The instance inserts CPU metrics into the Mean CPU used % field of the ECC Agent Scalar Metrics [ecc_agent_scalar_metric] table and memory metrics into the Max memory used % field of the ECC Agent Memory Metrics [ecc_agent_memory_metric] table.
- After a successful insert, the following business rules run on each table, invoking a script include that calls an appropriate function. Each function takes an average of the metric sets inserted into the tables, based on the configured sampling intervals.
- Update cpu mean on MID Server Status: Calls the checkCpuUsage() function of the MIDResourceThresholdBreach script include.
- Update max memory on MID Server Status: Calls the checkMemoryUsage() function of the MIDResourceThresholdBreach script include.
The thresholds and sampling intervals used by each function are resolved in order of precedence. The instance first looks at each MID Server for configuration parameters that set custom threshold values or sampling intervals for that MID Server. If no configuration parameters for these attributes are found, the instance looks in the System Properties [sys_properties] table for custom values to use. If no properties are found, the instance uses the default threshold and interval values from the code.
Note: Both the threshold percentages and the sampling intervals are configurable. See Configuring thresholds and sampling intervals for details.
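The lookup order described above can be illustrated with a minimal Python sketch. It is not the instance's implementation: the function name `resolve_setting` and the dictionaries standing in for MID Server configuration parameters and the System Properties table are hypothetical. The default values are the ones listed under Configuring thresholds and sampling intervals.

```python
from typing import Mapping

# Built-in defaults, as listed in "Configuring thresholds and sampling intervals".
DEFAULTS = {
    "mid.threshold.mean_cpu.percent": 95,
    "mid.threshold.mean_cpu.aggregate_interval_span": 3,
    "mid.threshold.mean_max_memory.percent": 95,
    "mid.threshold.mean_max_memory.aggregate_interval_span": 3,
}


def resolve_setting(
    name: str,
    mid_config: Mapping[str, str],
    sys_properties: Mapping[str, str],
) -> int:
    """Return the effective value for one threshold or interval setting.

    Precedence (hypothetical model of the documented behavior):
    1. configuration parameter on the individual MID Server
    2. system property on the instance
    3. built-in default
    """
    for source in (mid_config, sys_properties, DEFAULTS):
        if name in source:
            return int(source[name])
    raise KeyError(f"Unknown setting: {name}")


# Usage: a per-MID Server parameter overrides an instance-wide property.
instance_props = {"mid.threshold.mean_cpu.percent": "90"}
mid_params = {"mid.threshold.mean_cpu.percent": "85"}
effective = resolve_setting("mid.threshold.mean_cpu.percent", mid_params, instance_props)
```

In this sketch a MID Server without its own parameter falls back to the instance-wide value of 90, and an instance without the property falls back to the default of 95.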
Alerting process
- If the aggregated average metric value equals or exceeds the configured percent threshold, the instance triggers the mid.threshold.resource.breach event. Administrators can use this event to create email notifications for threshold breach alerts or to create a custom script.
- The instance inserts a record of the breach into the MID Server Issue [ecc_agent_issue] table with a State value of New and a Count of 1, and then publishes a message containing all the pertinent details of the breach. An example of this message is: Mean CPU used % has exceeded threshold (96>=91) for a 40 minute interval span, occurring after start date 2017-01-11 14:25:19. This message appears in the Short description field of the MID Server Issue form and in the event. You can copy any part of the message into your email notifications.
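The breach check and issue handling can be sketched as follows. This is an illustrative Python model under stated assumptions, not the MIDResourceThresholdBreach script include: the names `check_breach`, `record_breach`, and `Issue` are hypothetical, the message format is modeled on the example above, and the behavior of incrementing Count on an existing unresolved issue instead of inserting a duplicate is inferred from the description of issue tracking.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Optional, Sequence, Tuple

REPORT_PERIOD_MINUTES = 10  # each MID Server reports metrics every 10 minutes


@dataclass
class Issue:
    mid_server: str
    short_description: str
    state: str = "New"
    count: int = 1


def check_breach(
    samples: Sequence[float],
    threshold_pct: int,
    interval_span: int,
    start_date: str,
    label: str = "Mean CPU used %",
) -> Optional[str]:
    """Return a breach message if the average of the last `interval_span`
    samples equals or exceeds the threshold, otherwise None."""
    window = samples[-interval_span:]
    if len(window) < interval_span:
        return None  # assumption: wait until a full interval has been sampled
    average = mean(window)
    if average < threshold_pct:
        return None
    minutes = interval_span * REPORT_PERIOD_MINUTES
    return (
        f"{label} has exceeded threshold ({round(average)}>={threshold_pct}) "
        f"for a {minutes} minute interval span, "
        f"occurring after start date {start_date}"
    )


def record_breach(
    issues: Dict[Tuple[str, str], Issue], mid_server: str, label: str, message: str
) -> Issue:
    """Insert a New issue with Count 1, or update the existing unresolved one."""
    key = (mid_server, label)
    existing = issues.get(key)
    if existing is not None and existing.state != "Resolved":
        existing.count += 1  # assumption: repeated breaches increment Count
        existing.short_description = message
        return existing
    issues[key] = Issue(mid_server, message)
    return issues[key]


# Usage: four samples over a 40 minute interval against a 91% threshold.
msg = check_breach([94, 96, 97, 97], 91, 4, "2017-01-11 14:25:19")
issues: Dict[Tuple[str, str], Issue] = {}
if msg:
    record_breach(issues, "mid-01", "Mean CPU used %", msg)
```

The comparison uses the unrounded average so that a value just below the threshold does not trigger an alert. Rounding is applied only to the number shown in the message.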
MID Server issue states
Recommendations for resolving resource issues
- JVM memory:
- Allocate more max memory to the MID Server. For more information, see Set the MID Server JVM memory size.
- Add additional MID Servers to share the workload. For more information, see MID Server clusters.
- Reduce the amount of concurrent processing for the MID Server. This includes segmenting IP Address ranges into smaller segments for a Discovery schedule or loading smaller segments of data within an import job.
- CPU: Reduce the activity on the host or migrate the MID Server to a new host with more available resources.
Note: A MID Server can create a resource usage spike during Discovery, especially when discovering a large number of targets or executing multiple PowerShell sessions concurrently. The MID Server host's resource utilization automatically returns to normal after the Discovery execution stops. To decrease CPU utilization, host the MID Server on a dedicated machine. If you encounter resource usage issues, make sure that only one MID Server runs on each dedicated host machine. If the MID Server is hosted on a public cloud, add more CPU resources and avoid the noisy neighbor issue. For more information, see High CPU Usage on Host with MID Server(s) [KB0597639].
Tables used for resource threshold evaluation
| Table | Description |
|---|---|
| MID Server Issue [ecc_agent_issue] | Stores data on various types of MID Server issues, including breaches of configured CPU and memory thresholds. Fields used for resource threshold breaches are: |
| MID Server Status [ecc_agent_status] | Stores the percentages used for the CPU and memory resources, averaged over configurable intervals for each resource. The fields used are: |
| ECC Agent Scalar Metric [ecc_agent_scalar_metric] | Stores the CPU usage data inserted by each MID Server every 10 minutes. The table field used by resource threshold alerting is mean. |
| ECC Agent Memory Metric [ecc_agent_memory_metric] | Stores the memory usage data inserted by each MID Server every 10 minutes. The table field used by resource threshold alerting is max_used_pct. |
Business rules that check for threshold breaches
| Business rule | Description |
|---|---|
| Update cpu mean on MID Server Status | Runs after the MID Server inserts a record into the ECC Agent Scalar Metric [ecc_agent_scalar_metric] table. This business rule triggers the MIDResourceThresholdBreach script include function that evaluates threshold settings to determine if the MID Server has breached its configured CPU resource thresholds. |
| Update max memory on MID Server Status | Runs after the MID Server inserts a record into the ECC Agent Memory Metric [ecc_agent_memory_metric] table. This business rule triggers the MIDResourceThresholdBreach script include function that evaluates threshold settings to determine if the MID Server has breached its configured memory resource thresholds. |
Configuring thresholds and sampling intervals
- Add system properties to the instance and change the default values for all MID Servers.
- Add configuration parameters to change the default resource values for individual MID Servers.
| Property/configuration parameter | Description |
|---|---|
| mid.threshold.mean_cpu.aggregate_interval_span | Number of 10-minute units in the interval for sampling CPU usage data. The default interval is 30 minutes (3 x 10 min.). Default: 3 |
| mid.threshold.mean_cpu.percent | Usage percentage of the total CPU resources that initiates a threshold breach alert. Default: 95 |
| mid.threshold.mean_max_memory.aggregate_interval_span | Number of 10-minute units in the interval for sampling memory usage data. The default interval is 30 minutes (3 x 10 min.). Default: 3 |
| mid.threshold.mean_max_memory.percent | Usage percentage of the total memory resources that initiates a threshold breach alert. Default: 95 |
MID Server resource reporting
- Avg Percentage of CPU Used: Trending the daily average on CPU usage helps illustrate the amount of CPU processing that the MID Server host consumes. MID Servers deployed on the same host will report the same CPU usage.
- Avg Percentage of Max Memory Used: The maximum used percentage (max_used_pct) is a useful metric for determining if the MID Server has enough memory resources. This metric is a percentage of the max used memory over the total available memory. Trending this over time provides a visualization of how much memory is needed by the MID Server.
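As a rough illustration of how these report metrics relate to the raw samples, the following Python sketch computes a max-used percentage and groups 10-minute samples into daily averages. The function names and the input shape (timestamp and percentage pairs) are hypothetical. The built-in reports may aggregate differently.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def max_used_pct(max_used_bytes: int, total_available_bytes: int) -> float:
    """Max used memory as a percentage of the total available memory."""
    return 100.0 * max_used_bytes / total_available_bytes


def daily_average(samples: Iterable[Tuple[datetime, float]]) -> Dict[date, float]:
    """Average the percentage samples per calendar day, in date order."""
    by_day: Dict[date, List[float]] = defaultdict(list)
    for timestamp, pct in samples:
        by_day[timestamp.date()].append(pct)
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Usage: three samples spread over two days.
trend = daily_average([
    (datetime(2017, 1, 11, 14, 20), 60.0),
    (datetime(2017, 1, 11, 14, 30), 80.0),
    (datetime(2017, 1, 12, 9, 0), 50.0),
])
```

A daily average that climbs steadily toward the configured threshold suggests allocating more JVM memory or spreading the workload before a breach occurs.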