AWS MSK¶
Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that enables building and running applications that process streaming data using Apache Kafka.
Use the "Guance cloud synchronization" series of script packages from the Script Market to synchronize cloud monitoring and cloud asset data to Guance.
Configuration¶
Install Func¶
We recommend activating the hosted Func (Guance integration - extension - hosted Func): all prerequisites are installed automatically, and you can proceed to the script installation.
If you deploy Func yourself, refer to Self-deployed Func.
Install Script¶
Note: Prepare an Amazon AK that meets the requirements in advance (for simplicity, you can directly grant the global read-only permission ReadOnlyAccess).
To synchronize MSK monitoring data, install the corresponding collection script: "Guance Integration (AWS-Managed Streaming for Kafka Collection)" (ID: guance_aws_kafka).
After clicking 【Install】, enter the corresponding parameters: Amazon AK, Amazon account name.
Click 【Deploy Startup Script】. The system automatically creates a Startup script set and configures the corresponding startup script.
In addition, you can see the corresponding automatic trigger configuration in "Management / Automatic Trigger Configuration". Click 【Execute】 to immediately execute once without waiting for the scheduled time. After a short while, you can view the execution task records and corresponding logs.
By default, a preset collection configuration is used. For details, see the Metrics section and Custom Cloud Object Metrics Configuration.
Verification¶
- In "Management / Automatic Trigger Configuration", confirm that the task has an automatic trigger configuration, and check the task records and logs for any abnormalities.
- In Guance, under "Infrastructure / Custom", check if there is any asset information.
- In Guance, under "Metrics", check if there is any corresponding monitoring data.
Metrics¶
After configuring Amazon CloudWatch, the default metric sets are as follows. More metrics can be collected through configuration; see Amazon CloudWatch Metrics Details.
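To spot-check that a metric is present in CloudWatch before looking for it in Guance, a query along the following lines can help. This is only a sketch and not how the Guance collection script is implemented: the helper name `build_msk_query` and the cluster name `demo-cluster` are hypothetical, and the code assumes the `AWS/Kafka` namespace with the `Cluster Name` and `Broker ID` dimension names.

```python
MSK_NAMESPACE = "AWS/Kafka"  # assumed CloudWatch namespace for Amazon MSK metrics


def build_msk_query(metric_name, cluster_name, broker_id=None, period=60, stat="Average"):
    """Build one MetricDataQueries entry for an MSK metric (hypothetical helper)."""
    # Cluster-level metrics only need the cluster dimension.
    dimensions = [{"Name": "Cluster Name", "Value": cluster_name}]
    # Broker-level metrics additionally need the broker dimension.
    if broker_id is not None:
        dimensions.append({"Name": "Broker ID", "Value": str(broker_id)})
    return {
        "Id": metric_name.lower(),  # query IDs must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": MSK_NAMESPACE,
                "MetricName": metric_name,
                "Dimensions": dimensions,
            },
            "Period": period,  # seconds per datapoint
            "Stat": stat,
        },
        "ReturnData": True,
    }


query = build_msk_query("CpuUser", "demo-cluster", broker_id=1)
```

The resulting dictionary has the shape expected by the `MetricDataQueries` parameter of the CloudWatch `GetMetricData` API, so it could be passed to an SDK client together with a start and end time.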
DEFAULT Level Monitoring¶
The metrics described in the table below are available at the DEFAULT monitoring level. These metrics are free.
Metrics available at the DEFAULT monitoring level
Name | When visible | Dimensions | Description |
---|---|---|---|
ActiveControllerCount | After the cluster enters the ACTIVE state. | Cluster name | Only one controller per cluster should be active at any given time. |
BurstBalance | After the cluster enters the ACTIVE state. | Cluster name, broker ID | The remaining balance of input/output burst credits for EBS volumes in the cluster. Use it to investigate latency or reduced throughput. BurstBalance is not reported for EBS volumes when the baseline performance exceeds the maximum burst performance. For more information, see I/O Credits and Burst Performance. |
BytesInPerSec | After topic creation. | Cluster name, broker ID, topic | Number of bytes received from clients per second. This metric applies to each broker and also to each topic. |
BytesOutPerSec | After topic creation. | Cluster name, broker ID, topic | Number of bytes sent to clients per second. This metric applies to each broker and also to each topic. |
ClientConnectionCount | After the cluster enters the ACTIVE state. | Cluster name, broker ID, client authentication | Number of authenticated active client connections. |
ConnectionCount | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Number of authenticated active connections, unauthenticated connections, and broker-to-broker connections. |
CPUCreditBalance | After the cluster enters the ACTIVE state. | Cluster name, broker ID | This metric helps you monitor the CPU credit balance for brokers. If CPU usage stays above the baseline level of 20% utilization, you may deplete the CPU credit balance, which can negatively impact cluster performance. You can take steps to reduce the CPU load, for example by reducing the number of client requests or by updating the broker type to an M5 broker type. |
CpuIdle | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Percentage of CPU idle time. |
CpuIoWait | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Percentage of CPU idle time during pending disk operations. |
CpuSystem | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Percentage of CPU in kernel space. |
CpuUser | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Percentage of CPU in user space. |
GlobalPartitionCount | After the cluster enters the ACTIVE state. | Cluster name | Number of partitions across all topics in the cluster (excluding replicas). Since GlobalPartitionCount does not include replicas, the sum of PartitionCount values may exceed GlobalPartitionCount when the replication factor for a topic is greater than 1. |
GlobalTopicCount | After the cluster enters the ACTIVE state. | Cluster name | Total number of topics across all brokers in the cluster. |
EstimatedMaxTimeLag | After consumer groups consume topics. | Consumer group, topic | Estimated time (in seconds) to drain MaxOffsetLag. |
KafkaAppLogsDiskUsed | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Percentage of disk space used for application logs. |
KafkaDataLogsDiskUsed (Cluster Name, Broker ID dimensions) | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Percentage of disk space used for data logs. |
KafkaDataLogsDiskUsed (Cluster Name dimension) | After the cluster enters the ACTIVE state. | Cluster name | Percentage of disk space used for data logs. |
LeaderCount | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Number of partition leaders per broker, excluding replicas. |
MaxOffsetLag | After consumer groups consume topics. | Consumer group, topic | Maximum offset lag across all partitions in a topic. |
MemoryBuffered | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Size of buffered memory for the broker in bytes. |
MemoryCached | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Size of cached memory for the broker in bytes. |
MemoryFree | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Size of available memory for the broker in bytes. |
HeapMemoryAfterGC | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Percentage of heap memory used after garbage collection relative to total heap memory. |
MemoryUsed | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Size of memory used by the broker in bytes. |
MessagesInPerSec | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Number of incoming messages per second for the broker. |
NetworkRxDropped | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Number of dropped receive packets. |
NetworkRxErrors | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Number of network receive errors for the broker. |
NetworkRxPackets | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Number of packets received by the broker. |
NetworkTxDropped | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Number of dropped transmit packets. |
NetworkTxErrors | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Number of network transmit errors for the broker. |
NetworkTxPackets | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Number of packets transmitted by the broker. |
OfflinePartitionsCount | After the cluster enters the ACTIVE state. | Cluster name | Total number of partitions in the cluster that are offline. |
PartitionCount | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Total number of topic partitions per broker, including replicas. |
ProduceTotalTimeMsMean | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Average produce time in milliseconds. |
RequestBytesMean | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Average number of request bytes for the broker. |
RequestTime | After request throttling is applied. | Cluster name, broker ID | Average time spent by the broker's network and I/O threads handling requests in milliseconds. |
RootDiskUsed | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Percentage of root disk used by the broker. |
SumOffsetLag | After consumer groups consume topics. | Consumer group, topic | Aggregated offset lag across all partitions in a topic. |
SwapFree | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Size of swap memory available to the broker in bytes. |
SwapUsed | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Size of swap memory used by the broker in bytes. |
TrafficShaping | After the cluster enters the ACTIVE state. | Cluster name, broker ID | High-level metric indicating the number of packets shaped (dropped or queued) due to exceeding network allocations. PER_BROKER metrics provide finer detail. |
UnderMinIsrPartitionCount | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Number of partitions for the broker that are below the minimum in-sync replica (min ISR) count. |
UnderReplicatedPartitions | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Number of under-replicated partitions for the broker. |
ZooKeeperRequestLatencyMsMean | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Average latency in milliseconds for Apache ZooKeeper requests from the broker. |
ZooKeeperSessionState | After the cluster enters the ACTIVE state. | Cluster name, broker ID | Connection state of the broker's ZooKeeper session, which may be one of the following: NOT_CONNECTED: '0.0', ASSOCIATING: '0.1', CONNECTING: '0.5', CONNECTEDREADONLY: '0.8', CONNECTED: '1.0', CLOSED: '5.0', AUTH_FAILED: '10.0'. |
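Because ZooKeeperSessionState is reported as a number, it can be convenient to translate datapoints back into state names when reading dashboards or logs. The following is a minimal sketch of such a lookup; the function name `decode_zk_session_state` is hypothetical, and the state names follow the ZooKeeper client state names in upper case.

```python
# Numeric values of ZooKeeperSessionState and the session state each one represents.
ZK_SESSION_STATES = {
    0.0: "NOT_CONNECTED",
    0.1: "ASSOCIATING",
    0.5: "CONNECTING",
    0.8: "CONNECTEDREADONLY",
    1.0: "CONNECTED",
    5.0: "CLOSED",
    10.0: "AUTH_FAILED",
}


def decode_zk_session_state(value):
    """Return the state name for a ZooKeeperSessionState datapoint (hypothetical helper)."""
    for code, name in ZK_SESSION_STATES.items():
        # Compare with a small tolerance because datapoints arrive as floats.
        if abs(value - code) < 1e-6:
            return name
    # Averaged datapoints can fall between two states and have no exact name.
    return "UNKNOWN"


state = decode_zk_session_state(1.0)  # "CONNECTED"
```

A lookup like this is only meaningful for raw datapoints or for the Minimum and Maximum statistics; an averaged value may not match any single state.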
PER_BROKER Level Monitoring¶
When you set the monitoring level to PER_BROKER, in addition to all DEFAULT level metrics, you also get the metrics described in the table below. The metrics in this table are billed, while DEFAULT level metrics remain free. The metrics in this table have the following dimensions: Cluster name, broker ID.
Additional metrics provided starting at the PER_BROKER monitoring level
Name | When visible | Description |
---|---|---|
BwInAllowanceExceeded | After the cluster enters the ACTIVE state. | Number of packets shaped because the inbound aggregate bandwidth exceeded the broker's maximum bandwidth. |
BwOutAllowanceExceeded | After the cluster enters the ACTIVE state. | Number of packets shaped because the outbound aggregate bandwidth exceeded the broker's maximum bandwidth. |
ConnTrackAllowanceExceeded | After the cluster enters the ACTIVE state. | Number of packets shaped because connection tracking exceeded the broker's maximum value. Connection tracking is related to security groups, which track each established connection to ensure that return packets are delivered as expected. |
ConnectionCloseRate | After the cluster enters the ACTIVE state. | Number of connections closed per second for each listener. This number is aggregated per listener and then filtered for client listeners. |
ConnectionCreationRate | After the cluster enters the ACTIVE state. | Number of new connections established per second for each listener. This number is aggregated per listener and then filtered for client listeners. |
CpuCreditUsage | After the cluster enters the ACTIVE state. | This metric helps you monitor CPU credit usage on the instance. If CPU usage stays above the baseline level of 20% utilization, you may deplete the CPU credit balance, which can negatively impact cluster performance. You can monitor this metric and set alerts to take corrective actions. |
FetchConsumerLocalTimeMsMean | After a producer or consumer exists. | Average time spent processing consumer requests at the leader in milliseconds. |
FetchConsumerRequestQueueTimeMsMean | After a producer or consumer exists. | Average time consumer requests spend waiting in the request queue in milliseconds. |
FetchConsumerResponseQueueTimeMsMean | After a producer or consumer exists. | Average time consumer requests spend waiting in the response queue in milliseconds. |
FetchConsumerResponseSendTimeMsMean | After a producer or consumer exists. | Average time spent sending consumer responses in milliseconds. |
FetchConsumerTotalTimeMsMean | After a producer or consumer exists. | Average total time consumers spend fetching data from the broker in milliseconds. |
FetchFollowerLocalTimeMsMean | After a producer or consumer exists. | Average time spent processing follower requests at the leader in milliseconds. |
FetchFollowerRequestQueueTimeMsMean | After a producer or consumer exists. | Average time follower requests spend waiting in the request queue in milliseconds. |
FetchFollowerResponseQueueTimeMsMean | After a producer or consumer exists. | Average time follower requests spend waiting in the response queue in milliseconds. |
FetchFollowerResponseSendTimeMsMean | After a producer or consumer exists. | Average time spent sending follower responses in milliseconds. |
FetchFollowerTotalTimeMsMean | After a producer or consumer exists. | Average total time followers spend fetching data from the broker in milliseconds. |
FetchMessageConversionsPerSec | After topic creation. | Number of fetch message conversions per second for the broker. |
FetchThrottleByteRate | After bandwidth throttling is applied. | Number of throttled bytes per second. |
FetchThrottleQueueSize | After bandwidth throttling is applied. | Number of messages in the throttling queue. |
FetchThrottleTime | After bandwidth throttling is applied. | Average fetch throttling time in milliseconds. |
NetworkProcessorAvgIdlePercent | After the cluster enters the ACTIVE state. | Average percentage of time the network processors are idle. |
PpsAllowanceExceeded | After the cluster enters the ACTIVE state. | Number of packets shaped because the bidirectional PPS exceeded the broker's maximum value. |
ProduceLocalTimeMsMean | After the cluster enters the ACTIVE state. | Average time requests spend being processed at the leader in milliseconds. |
ProduceMessageConversionsPerSec | After topic creation. | Number of produce message conversions per second for the broker. |
ProduceMessageConversionsTimeMsMean | After the cluster enters the ACTIVE state. | Average time spent converting message formats in milliseconds. |
ProduceRequestQueueTimeMsMean | After the cluster enters the ACTIVE state. | Average time request messages spend in the queue in milliseconds. |
ProduceResponseQueueTimeMsMean | After the cluster enters the ACTIVE state. | Average time response messages spend in the queue in milliseconds. |
ProduceResponseSendTimeMsMean | After the cluster enters the ACTIVE state. | Average time spent sending response messages in milliseconds. |
ProduceThrottleByteRate | After bandwidth throttling is applied. | Number of throttled bytes per second. |
ProduceThrottleQueueSize | After bandwidth throttling is applied. | Number of messages in the throttling queue. |
ProduceThrottleTime | After bandwidth throttling is applied. | Average produce throttling time in milliseconds. |
ProduceTotalTimeMsMean | After the cluster enters the ACTIVE state. | Average produce time in milliseconds. |
RemoteBytesInPerSec | After producers/consumers exist. | Total number of bytes transferred from tiered storage in response to consumer fetches. This metric includes all topic partitions that contribute to downstream data transfer traffic. Category: Traffic and Error Rate. This is a KIP-405 metric. |
RemoteBytesOutPerSec | After producers/consumers exist. | Total number of bytes transferred to tiered storage, including data from log segments, indexes, and other auxiliary files. This metric includes all topic partitions that contribute to upstream data transfer traffic. Category: Traffic and Error Rate. This is a KIP-405 metric. |
RemoteLogManagerTasksAvgIdlePercent | After the cluster enters the ACTIVE state. | Average percentage of time the remote log manager is idle. The remote log manager transfers data from the broker to tiered storage. Category: Internal Activity. This is a KIP-405 metric. |
RemoteLogReaderAvgIdlePercent | After the cluster enters the ACTIVE state. | Average percentage of time the remote log reader is idle. The remote log reader transfers data from remote storage to the broker in response to consumer fetches. Category: Internal Activity. This is a KIP-405 metric. |
RemoteLogReaderTaskQueueSize | After the cluster enters the ACTIVE state. | Number of tasks responsible for reading from tiered storage that are waiting to be scheduled. Category: Internal Activity. This is a KIP-405 metric. |
RemoteReadErrorPerSec | After the cluster enters the ACTIVE state. | Total rate of errors in response to read requests that the specified broker sent to tiered storage to retrieve data in response to consumer fetches. This metric includes all topic partitions that contribute to downstream data transfer traffic. Category: Traffic and Error Rate. This is a KIP-405 metric. |
RemoteReadRequestsPerSec | After the cluster enters the ACTIVE state. | Total number of read requests that the specified broker sent to tiered storage to retrieve data in response to consumer fetches. This metric includes all topic partitions that contribute to downstream data transfer traffic. Category: Traffic and Error Rate. This is a KIP-405 metric. |
RemoteWriteErrorPerSec | After the cluster enters the ACTIVE state. | Total rate of errors in response to write requests that the specified broker sent to tiered storage to transfer data upstream. This metric includes all topic partitions that contribute to upstream data transfer traffic. Category: Traffic and Error Rate. This is a KIP-405 metric. |
ReplicationBytesInPerSec | After topic creation. | Number of bytes received per second from other brokers. |
ReplicationBytesOutPerSec | After topic creation. | Number of bytes sent per second to other brokers. |
RequestExemptFromThrottleTime | After request throttling is applied. | Average time spent by the broker's network and I/O threads handling requests exempt from throttling in milliseconds. |
RequestHandlerAvgIdlePercent | After the cluster enters the ACTIVE state. | Average percentage of time request handler threads are idle. |
RequestThrottleQueueSize | After request throttling is applied. | Number of messages in the throttling queue. |
RequestThrottleTime | After request throttling is applied. | Average request throttling time in milliseconds. |
TcpConnections | After the cluster enters the ACTIVE state. | Number of incoming and outgoing TCP segments with the SYN flag set. |
TotalTierBytesLag | After topic creation. | Total number of bytes of data eligible for tiering on the broker but not yet transferred to tiered storage. This metric shows the efficiency of upstream data transfer. As the lag increases, the amount of data not yet present in tiered storage also increases. Category: Archive Lag. This is not a KIP-405 metric. |
TrafficBytes | After the cluster enters the ACTIVE state. | Network traffic between clients (producers and consumers) and brokers in total bytes. Traffic between brokers is not reported. |
VolumeQueueLength | After the cluster enters the ACTIVE state. | Number of read and write operation requests waiting to complete within the specified time period. |
VolumeReadBytes | After the cluster enters the ACTIVE state. | Number of bytes read within the specified time period. |
VolumeReadOps | After the cluster enters the ACTIVE state. | Number of read operations within the specified time period. |
VolumeTotalReadTime | After the cluster enters the ACTIVE state. | Total number of seconds spent completing all read operations within the specified time period. |
VolumeTotalWriteTime | After the cluster enters the ACTIVE state. | Total number of seconds spent completing all write operations within the specified time period. |
VolumeWriteBytes | After the cluster enters the ACTIVE state. | Number of bytes written within the specified time period. |
VolumeWriteOps | After the cluster enters the ACTIVE state. | Number of write operations within the specified time period. |
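The volume metrics above are totals per time period, so a per-operation latency has to be derived. One common derivation, shown here as a sketch rather than anything this integration computes for you, divides the total time by the number of operations in the same period. The function name `average_io_latency_ms` is hypothetical, and the code assumes both inputs were collected with the Sum statistic over the same period.

```python
def average_io_latency_ms(total_time_seconds, ops):
    """Average latency per I/O operation in milliseconds.

    total_time_seconds: VolumeTotalReadTime or VolumeTotalWriteTime for a period.
    ops: VolumeReadOps or VolumeWriteOps for the same period.
    Returns None when no operations were recorded, to avoid dividing by zero.
    """
    if ops <= 0:
        return None
    # Convert seconds to milliseconds, then spread the total over all operations.
    return total_time_seconds * 1000.0 / ops


# Example: 1.5 seconds spent on 3000 read operations within one period.
read_latency = average_io_latency_ms(1.5, 3000)  # 0.5 ms per read
```

A rising value alongside a growing VolumeQueueLength is usually a hint that the volume, rather than Kafka itself, is the bottleneck.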
PER_TOPIC_PER_BROKER Level Monitoring¶
When you set the monitoring level to PER_TOPIC_PER_BROKER, in addition to all metrics from the PER_BROKER and DEFAULT levels, you also get the metrics described in the table below. Only DEFAULT level metrics are free. The metrics in this table have the following dimensions: Cluster name, broker ID, topic.
Important: For Amazon MSK clusters using Apache Kafka 2.4.1 or later, the metrics in the table below appear only after their values first become non-zero. For example, to view BytesInPerSec, one or more producers must first send data to the cluster.
Additional metrics provided starting at the PER_TOPIC_PER_BROKER monitoring level
Name | When visible | Description |
---|---|---|
FetchMessageConversionsPerSec | After topic creation. | Number of fetch message conversions per second. |
MessagesInPerSec | After topic creation. | Number of received messages per second. |
ProduceMessageConversionsPerSec | After topic creation. | Number of produce message conversions per second. |
RemoteBytesInPerSec | After you create a topic and that topic is producing/consuming. | Number of bytes transferred from tiered storage in response to consumer fetches for the specified topic and broker. This metric includes all partitions in the topic that contribute to the specified broker's upstream and downstream data transfer traffic. Category: Traffic and Error Rate. This is a KIP-405 metric. |
RemoteBytesOutPerSec | After you create a topic and that topic is producing/consuming. | Number of bytes transferred to tiered storage for the specified topic and broker. This metric includes all partitions in the topic that contribute to the specified broker's upstream data transfer traffic. Category: Traffic and Error Rate. This is a KIP-405 metric. |
RemoteReadErrorPerSec | After you create a topic and that topic is producing/consuming. | Rate of errors in response to read requests that the specified broker sent to tiered storage to retrieve data in response to consumer fetches for the specified topic. This metric includes all partitions in the topic that contribute to the specified broker's upstream and downstream data transfer traffic. Category: Traffic and Error Rate. This is a KIP-405 metric. |
RemoteReadRequestsPerSec | After you create a topic and that topic is producing/consuming. | Number of read requests that the specified broker sent to tiered storage to retrieve data in response to consumer fetches for the specified topic. This metric includes all partitions in the topic that contribute to the specified broker's upstream and downstream data transfer traffic. Category: Traffic and Error Rate. This is a KIP-405 metric. |
RemoteWriteErrorPerSec | After you create a topic and that topic is producing/consuming. | Rate of errors in response to write requests that the specified broker sent to tiered storage to transfer data upstream. This metric includes all partitions in the topic that contribute to the specified broker's upstream data transfer traffic. Category: Traffic and Error Rate. This is a KIP-405 metric. |
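The monitoring levels on this page are cumulative: each level adds its own metrics on top of those from the levels before it, and only DEFAULT is free. As a small sketch of that rule (the function names are hypothetical and the list order is taken from the order of the sections on this page), the relationship can be expressed as:

```python
# Monitoring levels in increasing order of detail, as presented on this page.
MONITORING_LEVELS = [
    "DEFAULT",
    "PER_BROKER",
    "PER_TOPIC_PER_BROKER",
    "PER_TOPIC_PER_PARTITION",
]


def levels_included(level):
    """Return every level whose metrics are emitted when `level` is selected."""
    index = MONITORING_LEVELS.index(level)  # raises ValueError for unknown levels
    return MONITORING_LEVELS[: index + 1]


def billed_levels(level):
    """Return the included levels whose metrics are billed (all except DEFAULT)."""
    return [name for name in levels_included(level) if name != "DEFAULT"]


included = levels_included("PER_TOPIC_PER_BROKER")
# ["DEFAULT", "PER_BROKER", "PER_TOPIC_PER_BROKER"]
```

Selecting a more detailed level therefore never removes metrics; it only adds further billed ones.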
PER_TOPIC_PER_PARTITION Level Monitoring¶
When you set the monitoring level to PER_TOPIC_PER_PARTITION, in addition to all metrics from the PER_TOPIC_PER_BROKER, PER_BROKER, and DEFAULT levels, you also get the metrics described in the table below. Only DEFAULT level metrics are free. The metrics in this table have the following dimensions: Consumer group, topic, partition.
Additional metrics provided starting at the PER_TOPIC_PER_PARTITION monitoring level
Name | When visible | Description |
---|---|---|
EstimatedTimeLag | After consumer groups consume topics. | Estimated time (in seconds) to drain the partition offset lag. |
OffsetLag | After consumer groups consume topics. | Partition-level consumer lag in terms of offsets. |
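The partition-level OffsetLag relates to the topic-level MaxOffsetLag and SumOffsetLag from the DEFAULT table: the former is the maximum across a topic's partitions and the latter is their aggregate. A minimal sketch of that relationship, using a hypothetical function name and made-up lag values for one consumer group and topic, might look like this:

```python
def summarize_offset_lag(partition_lags):
    """Reduce per-partition OffsetLag values to (max lag, summed lag) for one topic.

    partition_lags maps a partition number to its OffsetLag for a single
    consumer group and topic. An empty mapping yields (0, 0).
    """
    if not partition_lags:
        return (0, 0)
    values = list(partition_lags.values())
    return (max(values), sum(values))


# Made-up example: three partitions of one topic.
max_lag, sum_lag = summarize_offset_lag({0: 120, 1: 0, 2: 45})
# max_lag == 120, sum_lag == 165
```

Comparing the two results shows whether lag is concentrated in a single partition or spread across the topic.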
Objects¶
AWS MSK object data is temporarily unavailable.