Apache Kafka default checks and policies
Summary of Apache Kafka default checks and policies
ServiceNow’s Agent Client Collector offers a set of default policies and checks specifically designed for Apache Kafka health monitoring. These policies are compatible with both Windows and Linux environments and help you monitor Kafka components such as Zookeeper, brokers, topics, and related metrics to ensure optimal cluster health and performance.
Key Features
- Zookeeper Status Check: Detects if the Kafka Zookeeper instance is down and raises critical events accordingly. Parameters allow specifying the Zookeeper port.
- Topic Replica Checks: Identifies partitions with unknown replicas and allows filtering topics by include or exclude lists, supporting wildcards for flexible targeting.
- Replication Factor Verification: Ensures that topics maintain the expected replication factor, alerting if any topic’s replication is above or below the configured threshold.
- Topic Leader Validation: Flags partitions with unknown leaders or unpreferred replicas acting as leaders, with options to include detailed partition-level information.
- Partition Count Monitoring: Checks if topics have fewer partitions than a specified minimum, alerting when partition counts are insufficient for workload distribution.
- Broker Status Check: Monitors Kafka brokers at the host level, raising critical alerts if brokers are down, with configurable broker ports.
- Broker Metrics Collection: Gathers detailed Kafka broker performance metrics via JMX, such as request rates and leader election statistics, customizable with Java executable path and JMX port.
- Zookeeper Metrics Collection: Collects Zookeeper performance metrics including outstanding requests, latency, connection counts, and file descriptor usage, configurable by admin server port.
Practical Use and Configuration
Each check can be invoked using the commonchecks command-line utility with flags to customize parameters such as ports, topic filters (include/exclude with wildcard support), detailed output, and threshold values like replication factor or minimum partitions. Examples illustrate how to tailor checks to specific Kafka cluster configurations and monitoring needs.
For instance, to verify topic replication factors with detailed output while excluding certain topics, you can run:
commonchecks check-kafka-rf -H localhost -p 2181 -r 2 -i "accMetrics,Topic" -e "testTopic" -d
Similarly, broker and Zookeeper metric collections help you proactively monitor cluster health and performance trends.
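
The include and exclude topic filters mentioned above can be pictured with a short Python sketch. It is illustrative only and is not the Agent Client Collector implementation: the function names are hypothetical, and it assumes shell-style wildcard matching, that an empty include list means "all topics", and that exclude takes precedence over include. The real `-i` and `-e` flags may behave differently.

```python
from fnmatch import fnmatchcase


def split_patterns(spec):
    """Split a comma-separated pattern list such as "accMetrics,*Topic"."""
    return [p.strip() for p in spec.split(",") if p.strip()]


def select_topics(topics, include="", exclude=""):
    """Return topics that match any include pattern and no exclude pattern.

    Assumptions (not confirmed by the documentation): an empty include
    list selects every topic, and exclude wins over include.
    """
    includes = split_patterns(include) or ["*"]
    excludes = split_patterns(exclude)
    selected = []
    for topic in topics:
        if not any(fnmatchcase(topic, p) for p in includes):
            continue  # not requested by any include pattern
        if any(fnmatchcase(topic, p) for p in excludes):
            continue  # explicitly excluded
        selected.append(topic)
    return selected


if __name__ == "__main__":
    topics = ["accMetrics", "testTopic", "ordersTopic", "audit"]
    print(select_topics(topics, include="accMetrics,*Topic", exclude="testTopic"))
```

With these assumptions, the include list `accMetrics,*Topic` combined with the exclude list `testTopic` selects `accMetrics` and `ordersTopic` from the sample topic names.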
Benefits for ServiceNow Customers
- Enables proactive detection of Kafka cluster issues such as broker downtime, replication inconsistencies, and partition problems.
- Facilitates targeted monitoring through flexible topic filtering and detailed reporting.
- Supports both Windows and Linux platforms, ensuring broad applicability.
- Integrates seamlessly into ServiceNow’s health monitoring workflows, improving operational visibility and incident response.
Agent Client Collector provides the following policies for Apache Kafka health monitoring. Policies come with the checks specified in the indicated table. Policies and checks are available for both Windows and Linux.
| Check | Description | Usage | Output |
|---|---|---|---|
| kafka.check-zookeeper-status | Raises a critical event if the hosted Kafka Zookeeper is down. | commonchecks check-kafka-zk-status [flags] Where the flags are: -p, --port = Zookeeper Port (default "2181"). Usage example: | Kafka Zookeeper Status OK: Kafka Zookeeper is Up! |
| kafka.check-topic-replicas | Raises a critical event if any topic has partitions with unknown replicas. | commonchecks check-kafka-replicas [flags] Where the flags are: | <topic> has partitions with unknown replicas. Unknown replicas are: {"0":["0"],"1":["0"],"2":["0"]}. <topic> has partitions with unknown replicas. Unknown replicas are: {"0":["0"]}. |
| kafka.check-topic-replication-factor | Raises a critical event if the replication factor of at least one topic is above or below the provided replication factor parameter. | commonchecks check-kafka-rf [flags] Where the flags are: Examples: | TestTopic has replication factor 1, which is less than expected: 2. accMetrics has replication factor 1, which is less than expected: 2. |
| kafka.check-topic-leader | Raises a critical event if any topic has partitions with unknown leaders or with an unpreferred replica as leader. | commonchecks check-kafka-leader [flags] Where the flags are: Examples: | <topic> contains, partitions with unpreferred replica as leader.(partitions with unpreferred replicas are [0]). <topic> contains, partitions with unpreferred replica as leader.(partitions with unpreferred replicas are [0]). |
| kafka.check-topic-partitions | Raises a critical event if the number of partitions for a topic is less than the min_partitions parameter. | commonchecks check-kafka-partitions [flags] Where the flags are: Usage example 1: | <topic> has 1 partitions, expected at least 3. <topic> has 1 partitions, expected at least 3. <topic> has 1 partitions, expected at least 3. |
| | | Usage example 2: commonchecks check-kafka-partitions -H localhost -p 2181 -P 3 -i "accMetrics,*Topic" -e "testTopic" | <topic> has 1 partitions, expected at least 3. <topic> has 1 partitions, expected at least 3. |
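
The replication-factor and partition-count messages shown in the Output column above follow a regular pattern, so they can be split into per-topic findings for further processing. The Python sketch below is one way to do that. It relies only on the sample messages in this table; the `Finding` type, the pattern names, and `parse_findings` are hypothetical, and the wording for a replication factor above the expected value is not shown here, so the pattern accepts any single word before "than expected".

```python
import re
from typing import NamedTuple


class Finding(NamedTuple):
    topic: str
    actual: int
    expected: int


# Matches e.g. "TestTopic has replication factor 1, which is less than expected: 2."
# Only "less" appears in the documented sample; other words are accepted as a guess.
RF_PATTERN = re.compile(
    r"(?P<topic>\S+) has replication factor (?P<actual>\d+), "
    r"which is \w+ than expected: (?P<expected>\d+)\."
)

# Matches e.g. "<topic> has 1 partitions, expected at least 3."
PARTITIONS_PATTERN = re.compile(
    r"(?P<topic>\S+) has (?P<actual>\d+) partitions, "
    r"expected at least (?P<expected>\d+)\."
)


def parse_findings(output, pattern):
    """Return one Finding per message found in the check output."""
    return [
        Finding(m["topic"], int(m["actual"]), int(m["expected"]))
        for m in pattern.finditer(output)
    ]


if __name__ == "__main__":
    sample = (
        "TestTopic has replication factor 1, which is less than expected: 2. "
        "accMetrics has replication factor 1, which is less than expected: 2."
    )
    for finding in parse_findings(sample, RF_PATTERN):
        print(finding)
```

Because the patterns are derived from sample output only, they should be checked against real check output before being relied on.
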
| Check | Description | Usage | Output |
|---|---|---|---|
| kafka.check-broker-status | Raises a critical event if the Kafka Broker on the host is down. | commonchecks check-kafka-broker-status [flags] Where the flags are: -p, --port = Kafka Broker port (default "9092"). Usage example: | Kafka Broker Status OK: Kafka Broker ubuntu20:9092 is Up! |
| Check | Description | Usage | Output |
|---|---|---|---|
| kafka.metrics.broker | Collects Kafka Broker Metrics from the host. | commonchecks metric-kafka-broker [flags] Where the flags are: Usage example: | hostname.Kafka.Broker.ReplicaManager.IsrExpandsPerSec.OneMinuteRate 0.000 hostname.Kafka.Broker.DelayedOperationPurgatory.PurgatorySize.Fetch.Value 627.000 hostname.Kafka.Broker.ControllerStats.UncleanLeaderElectionsPerSec.OneMinuteRate 0.000 hostname.Kafka.Broker.RequestMetrics.RequestsPerSec.Produce.OneMinuteRate 0.000 |
| Check | Description | Usage | Output |
|---|---|---|---|
| kafka.metrics.zookeeper | Collects Zookeeper Metrics from the host. | commonchecks metric-kafka-zookeeper [flags] Where the flag is: Usage example: | hostname.Kafka.Zookeeper.outstanding_requests 2.000 1648183249 hostname.Kafka.Zookeeper.avg_latency 1.05 1648183249 hostname.Kafka.Zookeeper.num_alive_connections 1.000 1648183249 hostname.Kafka.Zookeeper.open_file_descriptor_count 124.000 1648183249 |
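
The metric output in the tables above is a whitespace-separated stream of a metric name, a value, and (in the Zookeeper sample) a trailing number that looks like a Unix timestamp. As a rough sketch, the Python below turns such output into structured records. The `Metric` type and `parse_metrics` function are hypothetical, and the layout is inferred from the sample output only, so the timestamp is treated as optional.

```python
from typing import NamedTuple, Optional


class Metric(NamedTuple):
    name: str
    value: float
    timestamp: Optional[int]  # None when the output carries no timestamp


def is_number(token):
    """Return True if the token parses as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_metrics(output):
    """Parse "name value [timestamp]" groups from metric check output."""
    metrics = []
    tokens = output.split()
    i = 0
    while i < len(tokens):
        name = tokens[i]
        i += 1
        numbers = []
        # Collect up to two numeric tokens following the name.
        while i < len(tokens) and len(numbers) < 2 and is_number(tokens[i]):
            numbers.append(tokens[i])
            i += 1
        if not numbers:
            continue  # a name without a value is skipped
        timestamp = int(float(numbers[1])) if len(numbers) == 2 else None
        metrics.append(Metric(name, float(numbers[0]), timestamp))
    return metrics


if __name__ == "__main__":
    sample = (
        "hostname.Kafka.Zookeeper.outstanding_requests 2.000 1648183249 "
        "hostname.Kafka.Zookeeper.avg_latency 1.05 1648183249"
    )
    for metric in parse_metrics(sample):
        print(metric.name, metric.value, metric.timestamp)
```

The same function accepts the broker sample, where each name is followed by a value only, and returns `None` for the timestamp in that case.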