RabbitMQ default checks and policies

  • Release version: Yokohama
  • Updated January 30, 2025

    Agent Client Collector provides the following default checks and policies for RabbitMQ health monitoring. You must perform RabbitMQ discovery before executing the checks. RabbitMQ checks are available only in a Windows environment.

    Table 1. RabbitMQ Events policy
    Type | Check | Description | Command
    Event | check-rabbitmq-alive | Verifies whether the RabbitMQ server is alive, using the REST API. If the server is down, an alert triggers. | check-rabbitmq-alive.rb --host {{.labels.params_host}} --port {{.labels.params_port}} -v {{.labels.params_vhost}}
    Event | check-rabbitmq-cluster-health | Verifies whether the RabbitMQ server's cluster nodes are running. If nodes are down, an alert triggers. | check-rabbitmq-cluster-health.rb --host {{.labels.params_host}} --port {{.labels.params_port}}
    Event | check-rabbitmq-consumers | Verifies the number of consumers on the RabbitMQ server and triggers an alert based on the configured thresholds. | check-rabbitmq-consumers.rb {{if .labels.params_warn}} --warn {{.labels.params_warn}} {{end}} {{if .labels.params_critical}} --critical {{.labels.params_critical}} {{end}} --host {{.labels.params_host}} --port {{.labels.params_port}}
    Event | check-rabbitmq-messages | Verifies the total number of messages queued on the RabbitMQ server and triggers an alert based on the configured thresholds. | check-rabbitmq-messages.rb --critical {{.labels.params_critical}} --port {{.labels.params_port}} --warn {{.labels.params_warn}} --host {{.labels.params_host}}
    Event | check-rabbitmq-network-partitions | Verifies whether a RabbitMQ network partition has occurred and, if so, triggers an alert. | check-rabbitmq-network-partitions.rb --host {{.labels.params_host}} --port {{.labels.params_port}}
    Event | check-rabbitmq-node-health | Verifies whether the RabbitMQ server node is in a running state. | check-rabbitmq-node-health.rb --host {{.labels.params_host}} {{if .labels.params_watchalarms}} --alarms {{.labels.params_watchalarms}} {{end}} {{if .labels.params_socketwarn}} --swarn {{.labels.params_socketwarn}} {{end}} {{if .labels.params_memcrit}} --mcrit {{.labels.params_memcrit}} {{end}} {{if .labels.params_fdcrit}} --fcrit {{.labels.params_fdcrit}} {{end}} {{if .labels.params_socketcrit}} --scrit {{.labels.params_socketcrit}} {{end}} --port {{.labels.params_port}} {{if .labels.params_memwarn}} --mwarn {{.labels.params_memwarn}} {{end}} {{if .labels.params_fdwarn}} --fwarn {{.labels.params_fdwarn}} {{end}}
    Event | check-rabbitmq-node-usage | Checks and displays usage of the RabbitMQ server node. | check-rabbitmq-node-usage.rb {{if .labels.params_procwarn}} --pwarn {{.labels.params_procwarn}} {{end}} --port {{.labels.params_port}} {{if .labels.params_socketwarn}} --swarn {{.labels.params_socketwarn}} {{end}} --type {{.labels.params_type}} {{if .labels.params_diskcrit}} --dcrit {{.labels.params_diskcrit}} {{end}} {{if .labels.params_fdcrit}} --fcrit {{.labels.params_fdcrit}} {{end}} {{if .labels.params_proccrit}} --pcrit {{.labels.params_proccrit}} {{end}} {{if .labels.params_diskwarn}} --dwarn {{.labels.params_diskwarn}} {{end}} {{if .labels.params_socketcrit}} --scrit {{.labels.params_socketcrit}} {{end}} --host {{.labels.params_host}} {{if .labels.params_memcrit}} --mcrit {{.labels.params_memcrit}} {{end}} {{if .labels.params_fdwarn}} --fwarn {{.labels.params_fdwarn}} {{end}} {{if .labels.params_memwarn}} --mwarn {{.labels.params_memwarn}} {{end}}
    Event | check-rabbitmq-queue-drain-time | Verifies the time it will take for each queue on the RabbitMQ server to drain, based on the current message exit rate. For example, if a queue holds 1,000 messages but only 1 message exits per second, the estimated drain time is 1,000 seconds; an alert is generated because this exceeds the default critical level of 360 seconds. | check-rabbitmq-queue-drain-time.rb --host {{.labels.params_host}} --port {{.labels.params_port}} --warn {{.labels.params_warn}} --critical {{.labels.params_critical}}
    Event | check-rabbitmq-queues-synchronised | Verifies that all mirrored queues with secondary queues are synchronised. | check-rabbitmq-queues-synchronised.rb --host {{.labels.params_host}} --port {{.labels.params_port}}
    Event | check-rabbitmq-stomp-alive | Verifies whether the RabbitMQ server is alive and responding to STOMP. | check-rabbitmq-stomp-alive.rb --host {{.labels.params_host}} --queue {{.labels.params_queue}} --port {{.labels.params_port}}
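
The drain-time estimate described for check-rabbitmq-queue-drain-time can be illustrated with a little arithmetic. The following is a minimal Python sketch, not the check's actual implementation: the function name drain_time_status, the warning default of 180 seconds, and the handling of a zero exit rate are assumptions made for illustration. Only the critical default of 360 seconds comes from the description above.

```python
def drain_time_status(messages, exit_rate_per_sec, warn=180, critical=360):
    """Estimate how long a queue needs to drain and classify the result.

    warn=180 is an assumed value for illustration; the documentation only
    states the default critical level of 360 seconds.
    """
    if messages == 0:
        # An empty queue has nothing to drain.
        return "ok", 0.0
    if exit_rate_per_sec <= 0:
        # Assumption of this sketch: with no messages leaving, the drain
        # time cannot be estimated, so report it as unknown.
        return "unknown", float("inf")

    seconds = messages / exit_rate_per_sec
    if seconds > critical:
        return "critical", seconds
    if seconds > warn:
        return "warning", seconds
    return "ok", seconds


# The documented example: 1,000 messages leaving at 1 message per second.
status, seconds = drain_time_status(1000, 1)
print(status, seconds)
```

With the documented example, the estimate is 1,000 seconds, which is above the 360-second critical level, so the sketch reports a critical state.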
    Table 2. RabbitMQ Metrics policy
    Type | Check | Description | Command
    Metric | metrics-rabbitmq-overview | Provides RabbitMQ overview statistics. | metrics-rabbitmq-overview.rb --port {{.labels.params_port}} --host {{.labels.params_host}}
    Metric | metrics-rabbitmq-queue | Provides RabbitMQ metrics per queue. | metrics-rabbitmq-queue.rb --port {{.labels.params_port}} --host {{.labels.params_host}} {{if .labels.params_vhost}} --vhost {{.labels.params_vhost}} {{end}}
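
Several commands wrap a flag in {{if ...}} ... {{end}}, which reads as "include this flag only when the parameter is set". The Python sketch below mimics that behavior for the metrics-rabbitmq-queue command, assuming the placeholders are substituted this way. The function build_queue_metrics_command is hypothetical and is not part of Agent Client Collector; the host, port, and vhost values are example inputs.

```python
def build_queue_metrics_command(host, port, vhost=None):
    """Build the argument list for metrics-rabbitmq-queue.rb.

    Hypothetical helper for illustration: --vhost is appended only when a
    vhost value is provided, mirroring the {{if .labels.params_vhost}} block.
    """
    args = ["metrics-rabbitmq-queue.rb", "--port", str(port), "--host", str(host)]
    if vhost:
        # Optional flag: emitted only when the parameter is set.
        args += ["--vhost", str(vhost)]
    return args


# Without a vhost, the optional flag is omitted entirely.
print(" ".join(build_queue_metrics_command("localhost", 15672)))
# With a vhost, the flag and its value are appended.
print(" ".join(build_queue_metrics_command("localhost", 15672, vhost="orders")))
```

The same pattern applies to the optional threshold flags in Table 1, such as --warn and --critical for check-rabbitmq-consumers: leaving a parameter unset drops the flag from the command rather than passing an empty value.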