# On-Call Runbook
When you are on call, you are responsible for answering and resolving any alerts triggered in our services. These alerts are sent to one of two Slack channels:
* `#stitch-smartservices-nonprod-alerts`
* `#stitch-smartservices`
The first channel is for nonprod alerts. These are not urgent: look at them during working hours. There is no need to address them off-hours.
The second channel is for urgent alerts. These fire when something goes wrong in our prod environment, which means you will get a call and may be woken up.
### Anatomy of an alert
`<Alert name> - <environment> - <centralized or regional>`

Example: `SmartGate5XXResponse - aws-at - centralized`
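
If you want to pull the parts out of an alert title in a script, a minimal sketch could look like the following. It assumes the three parts are separated by ` - ` (with spaces), as in the example, because the environment itself contains a hyphen. The names `AlertName` and `parse_alert_name` are hypothetical and not part of any existing tooling.

```python
from typing import NamedTuple


class AlertName(NamedTuple):
    name: str         # e.g. "SmartGate5XXResponse"
    environment: str  # e.g. "aws-at"
    scope: str        # "centralized" or "regional"


def parse_alert_name(raw: str) -> AlertName:
    """Split an alert title into name, environment, and scope.

    Assumption: parts are separated by " - " (spaces around the hyphen),
    since the environment (for example "aws-at") contains a hyphen itself.
    """
    parts = [part.strip() for part in raw.strip().split(" - ")]
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 parts separated by ' - ', got {len(parts)}: {raw!r}"
        )
    return AlertName(*parts)


if __name__ == "__main__":
    print(parse_alert_name("SmartGate5XXResponse - aws-at - centralized"))
```

The `environment` field is the value to carry into the dashboard and log lookups described below.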
## On-Call resolution process
When you get an alert, first click on the Opsgenie link to view the relevant information. The most useful fields are:
* environment
* pod name
With the environment information, go to our Grafana dashboards, which have relevant metrics for all our services in each environment. The master document listing the metrics dashboards is [here](https://github.com/Talend/sre-documentation/tree/a9f33936dcbe0a652926559f7f69c373cff1968c/Infrastructure/Observability/Monitoring).
For your convenience, here are the dashboards in every environment:
* prod:
  * smart-gate: https://grafana.admin.central.cloud.talend.com/d/8rx4fXWVk/smart-gate?orgId=1
  * msk: https://grafana.admin.central.cloud.talend.com/d/CwiSfXWVk/smart-msk?orgId=1
* centralized-at:
  * smart-gate: https://grafana.admin.eap-central.cloud.talend.com/d/hQ1Z-uZVz/smart-gate?orgId=1
  * msk: https://grafana.admin.eap-central.cloud.talend.com/d/jJoW-XZ4z/smart-msk?orgId=1
* centralized-dev:
  * smart-gate: https://grafana.admin.dev-central.cloud.talend.com/d/naKsKpn4k/smart-gate?orgId=1
  * msk: https://grafana.admin.dev-central.cloud.talend.com/d/tQSsFpnVz/smart-msk?orgId=1
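
The same list can be kept as a small lookup table so that a script (or a sleepy human) can go from the alert's environment straight to the right dashboards. This is only a sketch: the URLs are copied from the list above, but the mapping from alert environment names (`aws-production`, `aws-at`, `aws-dev`) to the dashboard groups is an assumption based on the alert names used in this runbook, and `dashboards_for` is a hypothetical helper.

```python
# Dashboard URLs copied from the list above, keyed by environment group.
DASHBOARDS = {
    "prod": {
        "smart-gate": "https://grafana.admin.central.cloud.talend.com/d/8rx4fXWVk/smart-gate?orgId=1",
        "msk": "https://grafana.admin.central.cloud.talend.com/d/CwiSfXWVk/smart-msk?orgId=1",
    },
    "centralized-at": {
        "smart-gate": "https://grafana.admin.eap-central.cloud.talend.com/d/hQ1Z-uZVz/smart-gate?orgId=1",
        "msk": "https://grafana.admin.eap-central.cloud.talend.com/d/jJoW-XZ4z/smart-msk?orgId=1",
    },
    "centralized-dev": {
        "smart-gate": "https://grafana.admin.dev-central.cloud.talend.com/d/naKsKpn4k/smart-gate?orgId=1",
        "msk": "https://grafana.admin.dev-central.cloud.talend.com/d/tQSsFpnVz/smart-msk?orgId=1",
    },
}

# ASSUMPTION: how the environment in an alert title maps to a dashboard group.
ALERT_ENV_TO_GROUP = {
    "aws-production": "prod",
    "aws-at": "centralized-at",
    "aws-dev": "centralized-dev",
}


def dashboards_for(alert_environment: str) -> dict[str, str]:
    """Return the {dashboard name: URL} mapping for an alert environment."""
    group = ALERT_ENV_TO_GROUP.get(alert_environment)
    if group is None:
        raise KeyError(f"no dashboards recorded for environment {alert_environment!r}")
    return DASHBOARDS[group]


if __name__ == "__main__":
    for name, url in dashboards_for("aws-at").items():
        print(f"{name}: {url}")
```
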
On these dashboards, note how widespread the error is. For a 5XX alert, have there been many over the past hour? If memory is spiking, has it been climbing for a long time, and when did it start?
If it is just one error, you can probably acknowledge the alert and go back to sleep. If there are many, keep investigating.
The next resource you should use is the error logs dashboard in Kibana. We recently stopped using Loki in favor of Kibana. For each of our services, there should be a dashboard for error logs. See the links below:
* centralized-dev
* [smart-gate](https://aws-integration-central.kb.us-east-1.aws.found.io:9243/app/discover#/view/5a89ed70-5baa-11ed-874d-dff8fcfd874b?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-24d,to:now))&_a=(columns:!(message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:log-app,key:message,negate:!f,type:exists,value:exists),query:(exists:(field:message)))),grid:(),hideChart:!f,index:log-app,interval:auto,query:(language:kuery,query:'(kubernetes.container.name%20:%20smart-gate)%20and%20(message%20:%20*error*%20or%20message%20:%20*Error*%20or%20message%20:%20*exception*%20or%20message%20:%20*ERROR*%20or%20message%20:%20*Exception*)'),sort:!(!('@timestamp',desc))))
* [smart-flow](https://aws-integration-central.kb.us-east-1.aws.found.io:9243/app/discover#/view/55b66f90-5bae-11ed-874d-dff8fcfd874b?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-24d,to:now))&_a=(columns:!(message),filters:!(),grid:(),hideChart:!f,index:log-app,interval:auto,query:(language:kuery,query:'(kubernetes.container.name%20:%20smart-flow)%20and%20(message%20:%20*error*%20or%20message%20:%20*Error*%20or%20message%20:%20*exception*%20or%20message%20:%20*ERROR*%20or%20message%20:%20*Exception*)'),sort:!(!('@timestamp',desc))))
* [smart-consumer](https://aws-integration-central.kb.us-east-1.aws.found.io:9243/app/discover#/view/2a8d5a50-5bad-11ed-874d-dff8fcfd874b?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-24d,to:now))&_a=(columns:!(message),filters:!(),grid:(),hideChart:!f,index:log-app,interval:auto,query:(language:kuery,query:'(kubernetes.container.name%20:%20smart-consumer)%20and%20(message%20:%20*error*%20or%20message%20:%20*Error*%20or%20message%20:%20*exception*%20or%20message%20:%20*ERROR*%20or%20message%20:%20*Exception*)'),sort:!(!('@timestamp',desc))))
* [smart-inference-depot](https://aws-integration-central.kb.us-east-1.aws.found.io:9243/app/discover#/view/93660fd0-5bae-11ed-874d-dff8fcfd874b?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-24d,to:now))&_a=(columns:!(message),filters:!(),grid:(),hideChart:!f,index:log-app,interval:auto,query:(language:kuery,query:'(kubernetes.container.name%20:%20smart-inference-depot)%20and%20(message%20:%20*error*%20or%20message%20:%20*Error*%20or%20message%20:%20*exception*%20or%20message%20:%20*ERROR*%20or%20message%20:%20*Exception*)'),sort:!(!('@timestamp',desc))))
* centralized-at
* [smart-gate](https://aws-at-central-talend.kb.eu-central-1.aws.cloud.es.io:9243/app/discover#/view/07240940-5baf-11ed-9ca3-41a3ea830240?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-20h,to:now))&_a=(columns:!(),filters:!(),grid:(),hideChart:!f,index:log-app,interval:auto,query:(language:kuery,query:'(kubernetes.container.name%20:%20smart-gate)%20and%20(message%20:%20*error*%20or%20message%20:%20*Error*%20or%20message%20:%20*exception*%20or%20message%20:%20*ERROR*%20or%20message%20:%20*Exception*)'),sort:!(!('@timestamp',desc))))
* [smart-flow](https://aws-at-central-talend.kb.eu-central-1.aws.cloud.es.io:9243/app/discover#/view/6adbd670-5baf-11ed-a7a4-43fe6ae5b0fe?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-22d,to:now))&_a=(columns:!(message),filters:!(),grid:(),hideChart:!f,index:log-app,interval:auto,query:(language:kuery,query:'(kubernetes.container.name%20:%20smart-flow)%20and%20(message%20:%20*error*%20or%20message%20:%20*Error*%20or%20message%20:%20*exception*%20or%20message%20:%20*ERROR*%20or%20message%20:%20*Exception*)'),sort:!(!('@timestamp',desc))))
* [smart-consumer](https://aws-at-central-talend.kb.eu-central-1.aws.cloud.es.io:9243/app/discover#/view/71c808e0-5bb0-11ed-9ca3-41a3ea830240?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-22d,to:now))&_a=(columns:!(message),filters:!(),grid:(),hideChart:!f,index:log-app,interval:auto,query:(language:kuery,query:'(kubernetes.container.name%20:%20smart-consumer)%20and%20(message%20:%20*error*%20or%20message%20:%20*Error*%20or%20message%20:%20*exception*%20or%20message%20:%20*ERROR*%20or%20message%20:%20*Exception*)'),sort:!(!('@timestamp',desc))))
* [smart-inference-depot](https://aws-at-central-talend.kb.eu-central-1.aws.cloud.es.io:9243/app/discover#/view/a6e1c8e0-5bb0-11ed-a7a4-43fe6ae5b0fe?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-22d,to:now))&_a=(columns:!(message),filters:!(),grid:(),hideChart:!f,index:log-app,interval:auto,query:(language:kuery,query:'(kubernetes.container.name%20:%20smart-inference-depot)%20and%20(message%20:%20*error*%20or%20message%20:%20*Error*%20or%20message%20:%20*exception*%20or%20message%20:%20*ERROR*%20or%20message%20:%20*Exception*)'),sort:!(!('@timestamp',desc))))
* centralized-prod
* [smart-gate](https://aws-eu-production-central-talend.kb.eu-central-1.aws.cloud.es.io:9243/app/discover#/view/2f013260-5bb1-11ed-9c3a-b33a2a969850?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-21d,to:now))&_a=(columns:!(message),filters:!(),grid:(),hideChart:!f,index:log-app,interval:auto,query:(language:kuery,query:'(kubernetes.container.name%20:%20smart-gate)%20and%20(message%20:%20*error*%20or%20message%20:%20*Error*%20or%20message%20:%20*exception*%20or%20message%20:%20*ERROR*%20or%20message%20:%20*Exception*)'),sort:!(!('@timestamp',desc))))
* [smart-flow](https://aws-eu-production-central-talend.kb.eu-central-1.aws.cloud.es.io:9243/app/discover#/view/519d5d80-5bb1-11ed-b4c8-1d9403f8df0a?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-21d,to:now))&_a=(columns:!(message),filters:!(),grid:(),hideChart:!f,index:log-app,interval:auto,query:(language:kuery,query:'(kubernetes.container.name%20:%20smart-flow)%20and%20(message%20:%20*error*%20or%20message%20:%20*Error*%20or%20message%20:%20*exception*%20or%20message%20:%20*ERROR*%20or%20message%20:%20*Exception*)'),sort:!(!('@timestamp',desc))))
* [smart-consumer](https://aws-eu-production-central-talend.kb.eu-central-1.aws.cloud.es.io:9243/app/discover#/view/71644890-5bb1-11ed-9c3a-b33a2a969850?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-21d,to:now))&_a=(columns:!(message),filters:!(),grid:(),hideChart:!f,index:log-app,interval:auto,query:(language:kuery,query:'(kubernetes.container.name%20:%20smart-consumer)%20and%20(message%20:%20*error*%20or%20message%20:%20*Error*%20or%20message%20:%20*exception*%20or%20message%20:%20*ERROR*%20or%20message%20:%20*Exception*)'),sort:!(!('@timestamp',desc))))
* [smart-inference-depot](https://aws-eu-production-central-talend.kb.eu-central-1.aws.cloud.es.io:9243/app/discover#/view/97ed1730-5bb1-11ed-b4c8-1d9403f8df0a?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-21d,to:now))&_a=(columns:!(message),filters:!(),grid:(),hideChart:!f,index:log-app,interval:auto,query:(language:kuery,query:'(kubernetes.container.name%20:%20smart-inference-depot)%20and%20(message%20:%20*error*%20or%20message%20:%20*Error*%20or%20message%20:%20*exception*%20or%20message%20:%20*ERROR*%20or%20message%20:%20*Exception*)'),sort:!(!('@timestamp',desc))))
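
All of the saved searches above use the same KQL query and differ only in the container name. If a saved search is missing or you need the query for another container, this sketch rebuilds the query string to paste into the Kibana Discover search bar. The patterns are taken from the links above; `error_log_query` is a hypothetical helper name.

```python
# Message patterns used by the saved searches linked above.
ERROR_PATTERNS = ("*error*", "*Error*", "*exception*", "*ERROR*", "*Exception*")


def error_log_query(container: str) -> str:
    """Build the KQL error-log query for one container (e.g. "smart-gate")."""
    message_clause = " or ".join(f"message : {pattern}" for pattern in ERROR_PATTERNS)
    return f"(kubernetes.container.name : {container}) and ({message_clause})"


if __name__ == "__main__":
    print(error_log_query("smart-gate"))
```

To narrow the results to the pod named in the alert, you can add a further clause in Discover. The exact field name for the pod depends on the index mapping, so check the available fields before relying on one.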
### Frequently seen alerts
#### 1. SmartGate5XXResponse - aws-at - centralized, vaultToken Error.
This sometimes happens when the vault token has expired. You can usually dismiss these alerts after confirming from the logs that the rest of the pod is healthy. The token needs a refresh; there is nothing else we can do.
#### 2. MSKCPUUtilization - aws-at - centralized, Service smart-gate MSK is approaching maximum CPU Utilization (<40% remaining) over last 1h.
This means the MSK cluster's CPU usage is high. The alert fires when less than 40% of CPU remains (that is, utilization above 60%), in line with the MSK best practices [doc](https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html), which recommends keeping broker CPU utilization under 60%.
Check the Grafana dashboards, which should tell you which topics are using the most bytes. If the topics are prefixed with `connect-component` and show a sudden spike, it is usually because the observability team is running performance tests on the cluster. This happens most frequently in centralized-at. Post in the `#observability-kafka-integration` Slack channel and ask them to decrease the load.
If they are not `connect-component` topics, post a question in the `#observability-kafka-integration` Slack channel to ask who owns those topics.
If CPU utilization rises above 80%, we should be concerned. Lower than that is typically okay.
If it is not a sudden spike, we may need to increase the size of the cluster. We have done this before, and it went fairly well: we did not lose many messages in the process. However, it is best to do this with a teammate in case something goes badly wrong, so wait until others are available.
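
The decision steps above can be sketched as a small function. This is an illustration of the triage order only, not existing tooling: `MskSnapshot` and `next_step` are hypothetical names, the inputs are values you read off the Grafana dashboard by eye, and the topic name in the usage example is made up.

```python
from dataclasses import dataclass

CONCERN_THRESHOLD_PCT = 80.0            # above this, we should be concerned
PERF_TEST_PREFIX = "connect-component"  # prefix of the observability team's test topics
CHANNEL = "#observability-kafka-integration"


@dataclass
class MskSnapshot:
    cpu_pct: float         # current cluster CPU utilization, in percent
    sudden_spike: bool     # True if the graph shows a sudden jump (judged by eye)
    top_topics: list[str]  # topics using the most bytes, from the dashboard


def next_step(snapshot: MskSnapshot) -> tuple[bool, str]:
    """Return (is_concerning, suggested action) for an MSKCPUUtilization alert."""
    concerning = snapshot.cpu_pct > CONCERN_THRESHOLD_PCT
    if not snapshot.sudden_spike:
        action = "Gradual growth: plan a cluster resize with a teammate when others are available."
    elif any(topic.startswith(PERF_TEST_PREFIX) for topic in snapshot.top_topics):
        action = f"Likely a performance test: ask in {CHANNEL} to decrease the load."
    else:
        action = f"Unknown source: ask in {CHANNEL} who owns the busiest topics."
    return concerning, action


if __name__ == "__main__":
    # "connect-component-example" is a made-up topic name for illustration.
    print(next_step(MskSnapshot(65.0, True, ["connect-component-example"])))
```
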
#### 3. SmartGateTopicsPostError - aws-dev - centralized
These errors happen when someone hits smart-gate with a topic that does not exist. They may have made a typo, or they forgot to add the new topic. Either way, it's good to check the [dashboard](https://grafana.admin.eap-central.cloud.talend.com/d/hQ1Z-uZVz/smart-gate?orgId=1) to see which topic it is and perhaps look through the logs.
If it only happened once or twice, you can probably dismiss it; someone likely made a mistake. If there are many of them, check with the owner of the topic to see what is wrong.
#### 4. SmartGateHighLatencyBuckets - aws-production - centralized
This happens when smart-gate is experiencing high latency. This may or may not put stress on the MSK cluster. Since the MSK cluster is behind an ELB, it may be able to handle a high load better than smart-gate itself.
Make sure to look at the smart-gate dashboard to see which topics are hitting smart-gate at high volume.
;; need to flesh this out a bit more...
# Kafka Ecosystem overview
[Here](https://lucid.app/lucidchart/1aadf955-5b8f-4ad1-aa32-15214dd045f0/edit?viewport_loc=-677%2C-205%2C4273%2C2350%2CYS8Uvcj3vTlE&invitationId=inv_73149a80-a2a4-4f24-9db0-f847a0429350) is a high-level overview of our Kafka ecosystem, including smart-gate and smart-consumer.
* `smart-gate` is essentially a proxy for a Kafka producer.
* `smart-consumer` is essentially a proxy for a Kafka consumer.

All the Kafka topic/subject scripts are in `smart-consumer`.