# MSK Monitoring Metrics Summary
- A summary of the metrics that are meaningful for broker resource monitoring, surveyed for provisioning purposes
## Requirements
- System resource usage and spare capacity of the broker
- Message delay on the producer side
- Message delay on the consumer side
- Liveness check for producers and consumers
- Number of messages exchanged between the broker and producers/consumers

[Reference: AWS MSK monitoring](https://docs.aws.amazon.com/ko_kr/msk/latest/developerguide/monitoring.html#metrics-details)
## Metric Data
### Extracted from the Default metrics
| Name | When visible | Dimensions | Description |
|-------- | -------- | -------- | -------- |
| ZooKeeperRequestLatencyMsMean |After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | Mean latency in milliseconds for ZooKeeper requests from broker.|
| ZooKeeperSessionState | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | Connection status of the broker's ZooKeeper session, which may be one of the following: NOT_CONNECTED: '0.0', ASSOCIATING: '0.1', CONNECTING: '0.5', CONNECTEDREADONLY: '0.8', CONNECTED: '1.0', CLOSED: '5.0', AUTH_FAILED: '10.0'. |
| CpuIdle | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of CPU idle time. |
| KafkaAppLogsDiskUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of disk space used for application logs.|
| KafkaDataLogsDiskUsed |After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of disk space used for data logs.|
| MemoryUsed |After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of memory that is in use for the broker.|
| OfflinePartitionsCount | After the cluster gets to the ACTIVE state. | Cluster Name | Total number of partitions that are offline in the cluster. |
|RootDiskUsed| After the cluster gets to the ACTIVE state.| Cluster Name, Broker ID| The percentage of the root disk used by the broker.|
|NetworkRxPackets| After the cluster gets to the ACTIVE state.| Cluster Name, Broker ID |The number of packets received by the broker.|
|NetworkTxPackets| After the cluster gets to the ACTIVE state.| Cluster Name, Broker ID |The number of packets transmitted by the broker.|
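
Because ZooKeeperSessionState is reported as a number, an alarm or dashboard usually needs to turn it back into a state name. The following is a minimal sketch of that lookup, assuming the numeric codes from the table row above (with AUTH_FAILED taken as 10.0). The function names `decode_zk_session_state` and `zk_session_healthy` are hypothetical helpers, not part of any AWS SDK.

```python
# Numeric codes for ZooKeeperSessionState, as listed in the table above.
# AUTH_FAILED is assumed to be 10.0.
ZK_SESSION_STATES = {
    0.0: "NOT_CONNECTED",
    0.1: "ASSOCIATING",
    0.5: "CONNECTING",
    0.8: "CONNECTEDREADONLY",
    1.0: "CONNECTED",
    5.0: "CLOSED",
    10.0: "AUTH_FAILED",
}


def decode_zk_session_state(value: float, tolerance: float = 1e-6) -> str:
    """Map a metric datapoint to a state name, tolerating float noise."""
    for code, name in ZK_SESSION_STATES.items():
        if abs(value - code) <= tolerance:
            return name
    # Averaged datapoints can fall between codes, so do not guess.
    return "UNKNOWN"


def zk_session_healthy(value: float) -> bool:
    """Treat only a fully connected session as healthy (an assumption)."""
    return decode_zk_session_state(value) == "CONNECTED"
```

Using a statistic such as Maximum or Minimum rather than Average is probably safer for this metric, since an averaged value can land between two codes and decode as `UNKNOWN`.
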
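### Opinion
- ZooKeeper-related metrics should be applied with top priority
- Memory: first-stage Alert at 50%, second-stage Alert at 70%
- CPU: Warning when the idle share (CpuIdle) drops to 70% or below, Alert when it drops to 40% or below
- DiskUsed metrics: alarm at 70%, then expand storage
- RequestLatency: Alert at 0.3 s; 10 s or more is treated as service down
- OfflinePartitionsCount makes it possible to detect partition-level anomalies before the service goes down
- NetworkRx/Tx: alarm at 70% of the traffic measured in the spec test
- A partition count metric also exists, but it is excluded because partitions do not scale out automatically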
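
The thresholds in the list above can be sketched as a small rule table. This is only an illustration under several assumptions: `MemoryUsedPercent` is a hypothetical derived value (MemoryUsed itself is reported in bytes), "RequestLatency" is assumed to mean ZooKeeperRequestLatencyMsMean in milliseconds, and the `Rule` and `classify` names are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    # Levels ordered from mildest to most severe: (limit, label).
    levels: List[Tuple[float, str]]
    # True when a falling value is the bad direction (e.g. CpuIdle).
    lower_is_worse: bool = False


RULES = {
    # Hypothetical derived percentage; MemoryUsed is reported in bytes.
    "MemoryUsedPercent": Rule([(50.0, "ALERT_STAGE_1"), (70.0, "ALERT_STAGE_2")]),
    "CpuIdle": Rule([(70.0, "WARNING"), (40.0, "ALERT")], lower_is_worse=True),
    "KafkaDataLogsDiskUsed": Rule([(70.0, "ALERT")]),
    "KafkaAppLogsDiskUsed": Rule([(70.0, "ALERT")]),
    "RootDiskUsed": Rule([(70.0, "ALERT")]),
    # Assumes "RequestLatency" refers to this metric, in milliseconds.
    "ZooKeeperRequestLatencyMsMean": Rule([(300.0, "ALERT"), (10_000.0, "SERVICE_DOWN")]),
}


def classify(metric: str, value: float) -> str:
    """Return the most severe level whose limit the value has crossed."""
    rule = RULES[metric]
    status = "OK"
    for limit, label in rule.levels:
        breached = value <= limit if rule.lower_is_worse else value >= limit
        if breached:
            status = label  # later (more severe) levels overwrite earlier ones
    return status
```

For example, `classify("CpuIdle", 55.0)` yields `"WARNING"` and `classify("ZooKeeperRequestLatencyMsMean", 12000.0)` yields `"SERVICE_DOWN"`. The NetworkRx/Tx rule is left out because its limit depends on the spec-test result, which is not given here.
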
### Extracted from the PerBroker metrics
| Name | When visible | Dimensions | Description |
| -------- | -------- | -------- | -------- |
|FetchConsumerTotalTimeMsMean |After there's a producer/consumer. |Cluster Name, Broker ID |The mean total time in milliseconds that consumers spend on fetching data from the broker.|
|FetchFollowerTotalTimeMsMean |After there's a producer/consumer. |Cluster Name, Broker ID |The mean total time in milliseconds that followers spend on fetching data from the broker.|
|MessagesInPerSec |After the cluster gets to the ACTIVE state. |Cluster Name, Broker ID |The number of incoming messages per second for the broker.|
|ProduceTotalTimeMsMean |After the cluster gets to the ACTIVE state. |Cluster Name, Broker ID |The mean produce time in milliseconds.|
|requestTime|After request throttling is applied.|Cluster Name, Broker ID |The average time spent in broker network and I/O threads to process requests.|
|UnderMinIsrPartitionCount |After the cluster gets to the ACTIVE state.| Cluster Name, Broker ID |The number of under minIsr partitions for the broker.|
|UnderReplicatedPartitions |After the cluster gets to the ACTIVE state.| Cluster Name, Broker ID |The number of under-replicated partitions for the broker.|
### Extracted from the Topic metrics
| Name | When visible | Dimensions | Description |
| -------- | -------- | -------- | -------- |
|MessagesInPerSec |After you create a topic.| Cluster Name, Broker ID, Topic |The number of messages received per second.|
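### Opinion
- requestTime: set the threshold to match the processing time observed when the broker is close to idle (for example, 0.3 s)
- The consume, produce, and follower fetch times should then be set lower than that (for example, 0.2 s), which makes it possible to detect when a specific phase is consuming excessive time
- MessagesInPerSec: set the threshold slightly below the available count measured in the spec test, as a guard against load (if message processing performance degrades, this metric may become meaningless)
- UnderMinIsrPartitionCount and UnderReplicatedPartitions may be unnecessary, since the Default-level OfflinePartitionsCount can already catch partition errors as a first line of detection, but they can be used to separately flag states that are critical to the service (second priority)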
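
One way to turn these opinions into alarm definitions is sketched below. It only builds dictionaries whose keys follow the parameters of CloudWatch `put_metric_alarm`; nothing is sent to AWS. The 0.3 s and 0.2 s values are the examples from the list above, converted to milliseconds on the assumption that the metrics are reported in milliseconds. The `headroom` factor of 0.9, the 60-second period, the three evaluation periods, and the function names are all assumptions made for illustration.

```python
from typing import Dict, List

NAMESPACE = "AWS/Kafka"  # CloudWatch namespace used by MSK


def broker_alarm(cluster: str, broker_id: int, metric: str, threshold: float) -> Dict:
    """Build keyword arguments for one per-broker alarm (not sent anywhere)."""
    return {
        "AlarmName": f"{cluster}-broker{broker_id}-{metric}",
        "Namespace": NAMESPACE,
        "MetricName": metric,
        "Dimensions": [
            {"Name": "Cluster Name", "Value": cluster},
            {"Name": "Broker ID", "Value": str(broker_id)},
        ],
        "Statistic": "Average",
        "Period": 60,            # assumed evaluation window in seconds
        "EvaluationPeriods": 3,  # assumed; tune to reduce noise
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


def build_alarms(cluster: str, broker_id: int,
                 measured_max_msgs_per_sec: float,
                 headroom: float = 0.9) -> List[Dict]:
    """Apply the opinions above: request time first, then tighter phase limits."""
    request_ms = 300.0  # example from the text (0.3 s)
    phase_ms = 200.0    # example from the text (0.2 s), lower than request_ms

    # Metric name spelled as in the table above; verify the exact CloudWatch name.
    alarms = [broker_alarm(cluster, broker_id, "requestTime", request_ms)]
    for metric in ("FetchConsumerTotalTimeMsMean",
                   "FetchFollowerTotalTimeMsMean",
                   "ProduceTotalTimeMsMean"):
        alarms.append(broker_alarm(cluster, broker_id, metric, phase_ms))

    # Slightly below the capacity measured in the spec test.
    alarms.append(broker_alarm(cluster, broker_id, "MessagesInPerSec",
                               measured_max_msgs_per_sec * headroom))

    # Second priority: any affected partition is treated as critical.
    for metric in ("UnderMinIsrPartitionCount", "UnderReplicatedPartitions"):
        alarms.append(broker_alarm(cluster, broker_id, metric, 1.0))
    return alarms
```

With boto3 installed and credentials configured, each dictionary could be passed on as `cloudwatch.put_metric_alarm(**alarm)`. Keeping the construction separate from the API call makes the thresholds easy to review and test before anything is created.
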