# MSK Monitoring Metrics Summary

- A survey of the metrics that are meaningful for monitoring broker resources, for provisioning

## Requirements

- System resource usage and remaining headroom on the brokers
- Message delay on the producer side
- Message delay on the consumer side
- Liveness checks for producers and consumers
- Number of messages exchanged between the brokers and the producers and consumers

[Reference: AWS MSK monitoring](https://docs.aws.amazon.com/ko_kr/msk/latest/developerguide/monitoring.html#metrics-details)

## Metric data

### Extracted from Default

| Name | When visible | Dimensions | Description |
| -------- | -------- | -------- | -------- |
| ZooKeeperRequestLatencyMsMean | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | Mean latency in milliseconds for ZooKeeper requests from broker. |
| ZooKeeperSessionState | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | Connection status of broker's ZooKeeper session, which may be one of the following: NOT_CONNECTED: '0.0', ASSOCIATING: '0.1', CONNECTING: '0.5', CONNECTEDREADONLY: '0.8', CONNECTED: '1.0', CLOSED: '5.0', AUTH_FAILED: '10.0'. |
| CpuIdle | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of CPU idle time. |
| KafkaAppLogsDiskUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of disk space used for application logs. |
| KafkaDataLogsDiskUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of disk space used for data logs. |
| MemoryUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of memory that is in use for the broker. |
| OfflinePartitionsCount | After the cluster gets to the ACTIVE state. | Cluster Name | Total number of partitions that are offline in the cluster. |
| RootDiskUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of the root disk used by the broker. |
| NetworkRxPackets | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of packets received by the broker. |
| NetworkTxPackets | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of packets transmitted by the broker. |

### Notes

- Apply the ZooKeeper-related metrics with top priority.
- Memory: stage-1 alert at 50%, Alert at 70%.
- CPU: Warning when the idle share drops to 70% or below, Alert when it drops to 40% or below.
- DiskUsed family: alarm at 70%, then expand capacity.
- RequestLatency: Alert at 0.3 s; treat 10 s or more as a service outage.
- OfflinePartitionsCount makes it possible to detect partition-level anomalies before the service goes down.
- NetworkRx/Tx: alarm at 70% of the traffic measured in the spec test.
- A partition-count metric also exists, but it is excluded because partitions do not scale out automatically.

### Extracted from PerBroker-level metrics

| Name | When visible | Dimensions | Description |
| -------- | -------- | -------- | -------- |
| FetchConsumerTotalTimeMsMean | After there's a producer/consumer. | Cluster Name, Broker ID | The mean total time in milliseconds that consumers spend on fetching data from the broker. |
| FetchFollowerTotalTimeMsMean | After there's a producer/consumer. | Cluster Name, Broker ID | The mean total time in milliseconds that followers spend on fetching data from the broker. |
| MessagesInPerSec | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of incoming messages per second for the broker. |
| ProduceTotalTimeMsMean | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The mean produce time in milliseconds. |
| requestTime | After request throttling is applied. | Cluster Name, Broker ID | The average time spent in broker network and I/O threads to process requests. |
| UnderMinIsrPartitionCount | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of under minIsr partitions for the broker. |
| UnderReplicatedPartitions | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of under-replicated partitions for the broker. |

### Extracted from Topic-level metrics

| Name | When visible | Dimensions | Description |
| -------- | -------- | -------- | -------- |
| MessagesInPerSec | After you create a topic. | Cluster Name, Broker ID, Topic | The number of messages received per second. |

### Notes

- requestTime: set the threshold based on the processing time observed when the broker is close to idle (for example, 0.3 s).
- Then set the consume, produce, and follower times lower than that (for example, 0.2 s); this makes it possible to detect a specific phase consuming an excessive amount of time.
- MessagesInPerSec: set the threshold slightly below the available count measured in the spec test, as a guard against load (if message-processing performance degrades, this metric may become meaningless).
- UnderMinIsrPartitionCount and UnderReplicatedPartitions may be unnecessary, because the Default-level OfflinePartitionsCount can already catch partition failures first; however, they can be used to judge states that are critical to the service separately (second priority).
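
The thresholds proposed in the notes above could be written down as a small rule table, which makes it easier to review them before turning them into real alarms. The following is only a minimal sketch under stated assumptions: the `Rule` class, the `RULES` table, and the `classify` function are hypothetical names, `MemoryUsedPercent` is a hypothetical derived value (the `MemoryUsed` metric itself is reported in bytes), `requestTime` is assumed to be in milliseconds, and the example figures of 0.3 s and 0.2 s are taken from the notes, not from measurements.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    # (threshold, severity) pairs ordered from mildest to most severe.
    levels: List[Tuple[float, str]]
    # True when a falling value is the problem (for example idle CPU).
    lower_is_worse: bool = False


# Hypothetical rule table built from the example thresholds in the notes.
RULES: Dict[str, Rule] = {
    # Idle CPU share in percent: Warning at <= 70, Alert at <= 40.
    "CpuIdle": Rule([(70.0, "WARNING"), (40.0, "ALERT")], lower_is_worse=True),
    # Assumption: memory use converted to percent of total by the caller.
    "MemoryUsedPercent": Rule([(50.0, "WARNING"), (70.0, "ALERT")]),
    # DiskUsed family in percent: alarm at 70, then expand capacity.
    "KafkaAppLogsDiskUsed": Rule([(70.0, "ALERT")]),
    "KafkaDataLogsDiskUsed": Rule([(70.0, "ALERT")]),
    "RootDiskUsed": Rule([(70.0, "ALERT")]),
    # Milliseconds: Alert at 0.3 s, treated as an outage at 10 s or more.
    "ZooKeeperRequestLatencyMsMean": Rule([(300.0, "ALERT"), (10000.0, "DOWN")]),
    # Assumption: requestTime is in milliseconds; 0.3 s is the example value.
    "requestTime": Rule([(300.0, "ALERT")]),
    # Per-phase times set lower than requestTime (example value 0.2 s).
    "FetchConsumerTotalTimeMsMean": Rule([(200.0, "ALERT")]),
    "FetchFollowerTotalTimeMsMean": Rule([(200.0, "ALERT")]),
    "ProduceTotalTimeMsMean": Rule([(200.0, "ALERT")]),
}


def classify(metric: str, value: float) -> str:
    """Return the most severe level whose threshold the value has crossed."""
    rule = RULES[metric]
    severity = "OK"
    for threshold, name in rule.levels:
        if rule.lower_is_worse:
            breached = value <= threshold
        else:
            breached = value >= threshold
        if breached:
            severity = name
    return severity


if __name__ == "__main__":
    # Made-up sample readings, used only to exercise the rule table.
    samples = {
        "CpuIdle": 55.0,
        "KafkaDataLogsDiskUsed": 72.5,
        "ZooKeeperRequestLatencyMsMean": 120.0,
        "ProduceTotalTimeMsMean": 250.0,
    }
    for metric, value in samples.items():
        print(f"{metric}: {value} -> {classify(metric, value)}")
```

Keeping the thresholds in one table keeps the spec-test-dependent values (network traffic and MessagesInPerSec) out of the code until they have actually been measured; they are deliberately left out of this sketch for that reason.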