
MSK Monitoring Metrics Summary

  • A survey and summary of the metrics that are meaningful for broker resource monitoring, intended for provisioning

Requirements

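  • System resource usage and spare capacity of the brokers
  • Message delay on the producer side
  • Message delay on the consumer side
  • Liveness check for producers and consumers
  • Number of messages exchanged between the broker and producers/consumers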

Reference: AWS MSK monitoring

Metrics

Extracted from the Default metrics

| Name | When visible | Dimensions | Description |
| --- | --- | --- | --- |
| ZooKeeperRequestLatencyMsMean | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | Mean latency in milliseconds for ZooKeeper requests from the broker. |
| ZooKeeperSessionState | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | Connection status of the broker's ZooKeeper session, which may be one of the following: NOT_CONNECTED: '0.0', ASSOCIATING: '0.1', CONNECTING: '0.5', CONNECTEDREADONLY: '0.8', CONNECTED: '1.0', CLOSED: '5.0', AUTH_FAILED: '10. |
| CpuIdle | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of CPU idle time. |
| KafkaAppLogsDiskUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of disk space used for application logs. |
| KafkaDataLogsDiskUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of disk space used for data logs. |
| MemoryUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of memory that is in use for the broker. |
| OfflinePartitionsCount | After the cluster gets to the ACTIVE state. | Cluster Name | Total number of partitions that are offline in the cluster. |
| RootDiskUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of the root disk used by the broker. |
| NetworkRxPackets | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of packets received by the broker. |
| NetworkTxPackets | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of packets transmitted by the broker. |

Notes

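  • ZooKeeper-related metrics are applied with top priority
  • Memory: first-stage warning at 50%, Alert at 70% (MemoryUsed is reported in bytes, so the percentage is relative to the broker's total memory)
  • CPU: Warning when the idle share drops to 70% or below, Alert when it drops to 40% or below
  • DiskUsed metrics: alarm at 70%, then expand the storage
  • RequestLatency: Alert at 0.3 s (300 ms); 10 s (10,000 ms) or more is treated as service down
  • OfflinePartitionsCount makes it possible to detect partition-level abnormalities before the service goes down
  • NetworkRx/Tx: alarm at 70% of the traffic measured in the spec test
  • A partition count metric also exists, but it is excluded because partitions are not something that scales out automatically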
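
The thresholds in the notes above can be written down as a small rule table. The following is a minimal sketch in Python, not an MSK or CloudWatch API: `RULES`, `classify`, and `exceeds_spec_share` are hypothetical names, `MemoryUsedPercent` is a hypothetical derived value (MemoryUsed in bytes divided by total broker memory), the "RequestLatency" item is assumed to mean `ZooKeeperRequestLatencyMsMean`, the disk alarm is assumed to map to the Alert level, and treating any offline partition as an Alert is also an assumption.

```python
from typing import Dict, List, Tuple

# (direction, levels): levels are ordered most severe first.
# "above" means a higher value is worse; "below" means a lower value is worse.
Rule = Tuple[str, List[Tuple[float, str]]]

RULES: Dict[str, Rule] = {
    # Hypothetical derived value: MemoryUsed (bytes) / total broker memory * 100.
    "MemoryUsedPercent": ("above", [(70.0, "ALERT"), (50.0, "WARNING")]),
    # CpuIdle shrinks as load grows, so lower is worse.
    "CpuIdle": ("below", [(40.0, "ALERT"), (70.0, "WARNING")]),
    # Assumption: the 70% disk alarm is treated as the ALERT level.
    "KafkaDataLogsDiskUsed": ("above", [(70.0, "ALERT")]),
    "KafkaAppLogsDiskUsed": ("above", [(70.0, "ALERT")]),
    "RootDiskUsed": ("above", [(70.0, "ALERT")]),
    # Assumption: "RequestLatency" refers to this metric; 0.3 s and 10 s in ms.
    "ZooKeeperRequestLatencyMsMean": ("above", [(10_000.0, "DOWN"), (300.0, "ALERT")]),
    # Assumption: any offline partition is worth an alert.
    "OfflinePartitionsCount": ("above", [(1.0, "ALERT")]),
}


def classify(metric: str, value: float) -> str:
    """Return the most severe level whose threshold the value has crossed."""
    direction, levels = RULES[metric]
    for threshold, label in levels:
        crossed = value <= threshold if direction == "below" else value >= threshold
        if crossed:
            return label
    return "OK"


def exceeds_spec_share(value: float, spec_value: float, share: float = 0.7) -> bool:
    """True when a NetworkRx/Tx reading reaches the given share of spec-test traffic."""
    return value >= spec_value * share


if __name__ == "__main__":
    print(classify("CpuIdle", 65.0))
    print(classify("ZooKeeperRequestLatencyMsMean", 450.0))
```

Keeping the direction in the rule makes the CpuIdle case explicit: it is the only metric here where a falling value signals trouble.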

Extracted from the PerBroker metrics

| Name | When visible | Dimensions | Description |
| --- | --- | --- | --- |
| FetchConsumerTotalTimeMsMean | After there's a producer/consumer. | Cluster Name, Broker ID | The mean total time in milliseconds that consumers spend on fetching data from the broker. |
| FetchFollowerTotalTimeMsMean | After there's a producer/consumer. | Cluster Name, Broker ID | The mean total time in milliseconds that followers spend on fetching data from the broker. |
| MessagesInPerSec | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of incoming messages per second for the broker. |
| ProduceTotalTimeMsMean | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The mean produce time in milliseconds. |
| requestTime | After request throttling is applied. | Cluster Name, Broker ID | The average time spent in broker network and I/O threads to process requests. |
| UnderMinIsrPartitionCount | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of under minIsr partitions for the broker. |
| UnderReplicatedPartitions | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of under-replicated partitions for the broker. |

Extracted from the Topic metrics

| Name | When visible | Dimensions | Description |
| --- | --- | --- | --- |
| MessagesInPerSec | After you create a topic. | Cluster Name, Broker ID, Topic | The number of messages received per second. |

Notes

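  • requestTime: set the threshold to match the processing time observed when the broker is close to idle (for example, 0.3 s)
  • Then set the consume, produce, and follower fetch times lower than that (for example, 0.2 s), which makes it possible to see when a particular phase is consuming an excessive amount of time
  • MessagesInPerSec: set the threshold slightly below the available count measured in the spec test, to prepare for load (if message processing performance degrades, this metric can become meaningless)
  • UnderMinIsrPartitionCount and UnderReplicatedPartitions may be unnecessary, because OfflinePartitionsCount from the Default metrics can already catch partition errors first; however, they can be used to judge separately whether the service is in a critical state (second priority)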
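
The per-phase comparison described in the notes above can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the function names are hypothetical, the 0.3 s and 0.2 s limits are only the example values from the notes, requestTime is assumed to be in milliseconds like the other timing metrics, and the 0.9 margin stands in for "slightly below" the spec-test result.

```python
from typing import Dict, List

REQUEST_TIME_LIMIT_MS = 300.0  # example value from the notes (0.3 s)
PHASE_LIMIT_MS = 200.0         # example value from the notes (0.2 s)

# The consume, produce, and follower fetch phases from the PerBroker metrics.
PHASE_METRICS = (
    "FetchConsumerTotalTimeMsMean",
    "ProduceTotalTimeMsMean",
    "FetchFollowerTotalTimeMsMean",
)


def slow_phases(sample: Dict[str, float]) -> List[str]:
    """Return the phase metrics in the sample that exceed the per-phase limit."""
    return [m for m in PHASE_METRICS if sample.get(m, 0.0) > PHASE_LIMIT_MS]


def request_time_breached(sample: Dict[str, float]) -> bool:
    """True when the overall request time exceeds its limit (assumed to be in ms)."""
    return sample.get("requestTime", 0.0) > REQUEST_TIME_LIMIT_MS


def messages_in_threshold(spec_test_rate: float, margin: float = 0.9) -> float:
    """Alarm threshold set slightly below the rate measured in the spec test.

    The 0.9 default margin is an assumption, not a value from the notes.
    """
    return spec_test_rate * margin


if __name__ == "__main__":
    sample = {
        "requestTime": 350.0,
        "FetchConsumerTotalTimeMsMean": 120.0,
        "ProduceTotalTimeMsMean": 260.0,
        "FetchFollowerTotalTimeMsMean": 90.0,
    }
    if request_time_breached(sample):
        print("slow phases:", slow_phases(sample))
```

Because the per-phase limit is tighter than the overall limit, a breach of requestTime can be traced to the phase that is responsible for it.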