MSK Monitoring Metrics Summary

A survey of the metrics that are meaningful for monitoring broker resources, compiled for provisioning.

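Requirements

System resource usage and remaining headroom on each broker
Message delay on the producer side
Message delay on the consumer side
Liveness checks for producers and consumers
Number of messages exchanged between brokers, producers, and consumers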

Reference: AWS MSK monitoring

Metrics

Selected from the DEFAULT monitoring level

Name	When visible	Dimensions	Description
ZooKeeperRequestLatencyMsMean	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	Mean latency in milliseconds for ZooKeeper requests from broker.
ZooKeeperSessionState	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	Connection status of broker's ZooKeeper session which may be one of the following: NOT_CONNECTED: '0.0', ASSOCIATING: '0.1', CONNECTING: '0.5', CONNECTEDREADONLY: '0.8', CONNECTED: '1.0', CLOSED: '5.0', AUTH_FAILED: '10.0'.
CpuIdle	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	The percentage of CPU idle time.
KafkaAppLogsDiskUsed	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	The percentage of disk space used for application logs.
KafkaDataLogsDiskUsed	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	The percentage of disk space used for data logs.
MemoryUsed	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	The size in bytes of memory that is in use for the broker.
OfflinePartitionsCount	After the cluster gets to the ACTIVE state.	Cluster Name	Total number of partitions that are offline in the cluster.
RootDiskUsed	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	The percentage of the root disk used by the broker.
NetworkRxPackets	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	The number of packets received by the broker.
NetworkTxPackets	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	The number of packets transmitted by the broker.
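
ZooKeeperSessionState is reported as a number, so a dashboard or alert script has to map it back to a state name. A minimal sketch of that mapping follows, using the codes listed in the table. The AUTH_FAILED code is assumed to be 10.0, the function names are hypothetical, and treating only CONNECTED as healthy is an assumption of this sketch rather than an AWS recommendation.

```python
# Numeric codes as listed in the ZooKeeperSessionState description.
# AUTH_FAILED is assumed to be 10.0 in this sketch.
ZK_SESSION_STATES = {
    0.0: "NOT_CONNECTED",
    0.1: "ASSOCIATING",
    0.5: "CONNECTING",
    0.8: "CONNECTEDREADONLY",
    1.0: "CONNECTED",
    5.0: "CLOSED",
    10.0: "AUTH_FAILED",
}


def decode_zk_session_state(value: float) -> str:
    """Map a ZooKeeperSessionState datapoint to its state name."""
    # Round to one decimal so small float noise still matches a known code.
    return ZK_SESSION_STATES.get(round(value, 1), "UNKNOWN")


def zk_session_healthy(value: float) -> bool:
    """Treat only CONNECTED as healthy (an assumption of this sketch)."""
    return decode_zk_session_state(value) == "CONNECTED"
```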

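Notes

ZooKeeper metrics should be applied first.
Memory: stage-1 warning at 50% usage, Alert at 70%.
CPU: Warning when CpuIdle drops to 70% or below, Alert when it drops to 40% or below.
DiskUsed metrics: alarm at 70%, then expand storage.
RequestLatency (ZooKeeperRequestLatencyMsMean): Alert at 0.3 s (300 ms); treat 10 s (10,000 ms) or more as a service outage.
OfflinePartitionsCount makes it possible to detect partition-level anomalies before the service goes down.
NetworkRx/Tx: alarm at 70% of the traffic measured in the spec test.
A partition count metric is also available, but it is excluded because partitions do not scale out automatically.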
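
The thresholds above can be written down as small classification functions, which is a convenient way to review them before turning them into alarms. This is only a sketch: the function names and level labels are hypothetical, and because MemoryUsed is reported in bytes, the broker's total memory (`total_bytes`) is assumed to be known from the instance type.

```python
OK, WARNING, ALERT, DOWN = "OK", "WARNING", "ALERT", "DOWN"


def memory_level(used_bytes: float, total_bytes: float) -> str:
    """MemoryUsed is in bytes; total_bytes is an assumed, externally known value."""
    pct = 100.0 * used_bytes / total_bytes
    if pct >= 70:
        return ALERT
    if pct >= 50:
        return WARNING  # stage-1 warning
    return OK


def cpu_level(cpu_idle_pct: float) -> str:
    """CpuIdle falls as load rises, so lower values are worse."""
    if cpu_idle_pct <= 40:
        return ALERT
    if cpu_idle_pct <= 70:
        return WARNING
    return OK


def disk_level(used_pct: float) -> str:
    """Applies to KafkaAppLogsDiskUsed, KafkaDataLogsDiskUsed, RootDiskUsed."""
    return ALERT if used_pct >= 70 else OK


def zk_latency_level(mean_ms: float) -> str:
    """ZooKeeperRequestLatencyMsMean: 300 ms alerts, 10,000 ms counts as down."""
    if mean_ms >= 10_000:
        return DOWN
    if mean_ms >= 300:
        return ALERT
    return OK


def offline_partitions_level(count: int) -> str:
    """Any offline partition is treated as an alert in this sketch."""
    return ALERT if count > 0 else OK


def network_level(packets: float, spec_test_packets: float) -> str:
    """Alarm at 70% of the packet rate measured in the spec test."""
    pct = 100.0 * packets / spec_test_packets
    return ALERT if pct >= 70 else OK
```

For example, `cpu_level(55)` yields a warning, while `zk_latency_level(12_000)` is classified as a service outage.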

Selected from PER_BROKER-level metrics

Name	When visible	Dimensions	Description
FetchConsumerTotalTimeMsMean	After there's a producer/consumer.	Cluster Name, Broker ID	The mean total time in milliseconds that consumers spend on fetching data from the broker.
FetchFollowerTotalTimeMsMean	After there's a producer/consumer.	Cluster Name, Broker ID	The mean total time in milliseconds that followers spend on fetching data from the broker.
MessagesInPerSec	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	The number of incoming messages per second for the broker.
ProduceTotalTimeMsMean	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	The mean produce time in milliseconds.
requestTime	After request throttling is applied.	Cluster Name, Broker ID	The average time spent in broker network and I/O threads to process requests.
UnderMinIsrPartitionCount	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	The number of under minIsr partitions for the broker.
UnderReplicatedPartitions	After the cluster gets to the ACTIVE state.	Cluster Name, Broker ID	The number of under-replicated partitions for the broker.

Selected from topic-level (PER_TOPIC_PER_BROKER) metrics

Name	When visible	Dimensions	Description
MessagesInPerSec	After you create a topic.	Cluster Name, Broker ID, Topic	The number of messages received per second.
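
The Dimensions column in the tables above maps onto CloudWatch dimensions when an alarm is defined. The sketch below builds the keyword arguments for CloudWatch's PutMetricAlarm call as a plain dictionary. The helper name, the alarm naming scheme, and the default statistic and period are assumptions for illustration; MSK metrics are published under the `AWS/Kafka` namespace.

```python
from typing import Any, Dict, Optional


def build_alarm_params(
    metric_name: str,
    threshold: float,
    comparison: str,
    cluster_name: str,
    broker_id: Optional[int] = None,
    topic: Optional[str] = None,
    statistic: str = "Average",   # assumed default
    period_seconds: int = 300,    # assumed default
    evaluation_periods: int = 1,  # assumed default
) -> Dict[str, Any]:
    """Build PutMetricAlarm keyword arguments for one MSK metric."""
    # Cluster-wide metrics (e.g. OfflinePartitionsCount) only need Cluster Name.
    dimensions = [{"Name": "Cluster Name", "Value": cluster_name}]
    name_parts = ["msk", cluster_name, metric_name]
    if broker_id is not None:
        dimensions.append({"Name": "Broker ID", "Value": str(broker_id)})
        name_parts.append(f"broker{broker_id}")
    if topic is not None:
        dimensions.append({"Name": "Topic", "Value": topic})
        name_parts.append(topic)
    return {
        "AlarmName": "-".join(name_parts),  # hypothetical naming scheme
        "Namespace": "AWS/Kafka",
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "Statistic": statistic,
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": comparison,
    }


# Example: alert when CpuIdle on broker 1 drops to 40% or below.
cpu_alarm = build_alarm_params(
    "CpuIdle", 40, "LessThanOrEqualToThreshold", "demo-cluster", broker_id=1
)
```

With boto3 installed and credentials configured, the dictionary could be passed on as `boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm)`; notification actions are left out of this sketch.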

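Notes

requestTime: set the threshold from the processing time observed when the broker is close to idle (for example, 0.3 s).
Then set the consumer fetch, produce, and follower fetch time thresholds lower than that (for example, 0.2 s); this makes it possible to detect when one specific phase is consuming an excessive amount of time.
MessagesInPerSec: set the threshold slightly below the capacity measured in the spec test, to prepare for load. (If message-processing performance degrades, this metric may stop being meaningful.)
UnderMinIsrPartitionCount and UnderReplicatedPartitions: these may be unnecessary, since the DEFAULT-level OfflinePartitionsCount can already catch partition errors as a first line of detection, but they can be used to judge separately the states that are critical to the service (priority 2).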
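
The per-phase latency idea and the MessagesInPerSec margin can be sketched as follows. The limits are the example values from the notes (0.3 s and 0.2 s, expressed in milliseconds), the 10% safety margin is an arbitrary assumption, and the function names are hypothetical.

```python
from typing import Dict, List

REQUEST_TIME_LIMIT_MS = 300.0  # example value from the notes (0.3 s)
PHASE_LIMIT_MS = 200.0         # example value from the notes (0.2 s)

# Per-phase timing metrics from the PER_BROKER table.
PHASE_METRICS = (
    "FetchConsumerTotalTimeMsMean",
    "FetchFollowerTotalTimeMsMean",
    "ProduceTotalTimeMsMean",
)


def slow_phases(datapoints: Dict[str, float]) -> List[str]:
    """Return the phase metrics whose latest value exceeds the phase limit."""
    return [
        name
        for name in PHASE_METRICS
        if datapoints.get(name, 0.0) > PHASE_LIMIT_MS
    ]


def messages_in_threshold(spec_test_capacity: float, margin: float = 0.1) -> float:
    """Set the MessagesInPerSec alarm slightly below the measured capacity.

    The 10% default margin is an assumption, not a measured value.
    """
    return spec_test_capacity * (1.0 - margin)


latest = {
    "FetchConsumerTotalTimeMsMean": 250.0,
    "FetchFollowerTotalTimeMsMean": 90.0,
    "ProduceTotalTimeMsMean": 120.0,
}
print(slow_phases(latest))  # only the consumer fetch phase is over 200 ms
```

Keeping the phase limit below the overall request limit means a single slow phase is flagged before the broker-wide request time alarm fires.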