
# Cluster Walkthrough `2020-08-21`

## Prod:

- We need to check whether the webhook receiver is working. See [DPLS-13639](https://devstack.vwgroup.com/jira/browse/DPLS-13639) for more details.
- `10-application-nodes` have been [scaled down](https://stormreply.slack.com/archives/CJCLL34VB/p1597831594021400?thread_ts=1597823410.018200&cid=CJCLL34VB) to 3 (was 6) in this week's deployment. There is [a ticket DPLS-13942](https://devstack.vwgroup.com/jira/browse/DPLS-13942) about removing (or rather, not being able to remove) this ASG.
- We need to adjust the Grafana dashboards so they can be used in Kubernetes 1.16 clusters. See [DPLS-13469](https://devstack.vwgroup.com/jira/browse/DPLS-13469) for more details.
- We have kept getting the following alert on prod for [more than a month](https://hackmd.io/0OrYw9FhS9qjuRlabEcRUw):
  * `dp-sharedservices-prod: Kubernetes API server client 'kube-controller-manager/11.220.138.116:10252' is experiencing 1.557% errors.'`

## Live:

- weave-npc is writing DEBUG level logs: [this comment](https://github.com/weaveworks/weave/issues/2628#issuecomment-510202188) may be the solution. Try it.
- We have a lot of errors on the prometheus-k8s pods, such as:
```
level=error ts=2020-08-21T08:47:27.823Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" at the cluster scope"
level=error ts=2020-08-21T08:47:27.824Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:265: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
level=error ts=2020-08-21T08:47:28.223Z caller=manager.go:123 component="scrape manager" msg="error creating new scrape pool" err="error creating HTTP client: unable to load specified CA cert /etc/prometheus/secrets/etcd-certs/etcd-ca.crt: open /etc/prometheus/secrets/etcd-certs/etcd-ca.crt: no such file or directory" scrape_pool=monitoring/prometheus-operator-etcd-master-main/0
level=error ts=2020-08-21T08:47:28.223Z caller=manager.go:123 component="scrape manager" msg="error creating new scrape pool" err="error creating HTTP client: unable to load specified CA cert /etc/prometheus/secrets/etcd-certs/etcd-ca.crt: open /etc/prometheus/secrets/etcd-certs/etcd-ca.crt: no such file or directory" scrape_pool=monitoring/prometheus-operator-etcd-master-event/0
level=error ts=2020-08-21T08:47:28.823Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:264: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" at the cluster scope"
level=error ts=2020-08-21T08:47:28.824Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" at the cluster scope"
```
- Still no limits on the `mbbb-86060-aio-nar-val-aggregator-report` CronJob. There is no running job (all the jobs have finished), but it would be helpful to remind the app developers to set the limits (AFAIR, they already implemented it, but the solution should be deployed on Live).

## Approval:

- No remarks from the infrastructure PoV. All looks good.

## Preprod:

- We need to adjust the Grafana dashboards so they can be used in Kubernetes 1.16 clusters. See [DPLS-13469](https://devstack.vwgroup.com/jira/browse/DPLS-13469) for more details.
- No remarks from the infrastructure PoV. All looks good.

## Tui:

- No application node, waiting for the 1.16 upgrade on Wednesday.

## Tui-dev:

- During the deployment this week, we observed that the nodes are not evenly distributed across the AZs. What would be the reason for this?
  ![](https://i.imgur.com/gTpNRxE.png)
  - *Answer: On all dev environments applications are deployed with `replicas=1`, which is not reflected in their affinity configs.*
- `kube-state-metrics` limits should be increased: CPU: 100 -> 400, MEM: 350 -> 1Gi or 2Gi? ([2Gi is used on tui-demo](https://apmaas-grafana-dp-p.mls-apm.prd.eu.gs.aws.cloud.vwgroup.com/d/a164aaf0349f99e89cea5cb47e9be617/kubernetes-compute-resources-workload?orgId=1&from=now-7d&to=now&refresh=30m&var-datasource=shs-prometheus-preprod&var-cluster=dp-corebe-tui-demo&var-namespace=monitoring&var-workload=kube-state-metrics&var-type=deployment).) Check the values again.

## Tui-demo:

## Plint-shs:

## Plint-001:

## Plint-002:

# General Remarks and action points (create a ticket etc.)
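
For the prometheus-k8s errors noted under Live, it may help to boil the log noise down to the distinct problems before opening a ticket. The following is a minimal sketch, not an existing tool: the names `summarize`, `FORBIDDEN`, `MISSING_FILE`, and `SAMPLE` are hypothetical, the regular expressions assume the message wording seen in the log excerpt in these notes, and the sample lines are abridged from that excerpt.

```python
import re
from collections import Counter

# Assumption: RBAC denials look like the ones in the notes, with the quotes
# backslash-escaped inside the logfmt msg field (hence the optional backslash).
FORBIDDEN = re.compile(
    r'User \\?"(?P<user>[^"\\]+)\\?" cannot (?P<verb>\w+) '
    r'resource \\?"(?P<resource>[^"\\]+)\\?"'
)
# Assumption: missing files are reported as "open <path>: no such file or directory".
MISSING_FILE = re.compile(r'open (?P<path>\S+): no such file or directory')


def summarize(lines):
    """Count RBAC denials per (user, verb, resource) and missing files per path."""
    denied = Counter()
    missing = Counter()
    for line in lines:
        m = FORBIDDEN.search(line)
        if m:
            denied[(m["user"], m["verb"], m["resource"])] += 1
        m = MISSING_FILE.search(line)
        if m:
            missing[m["path"]] += 1
    return denied, missing


# Abridged sample lines taken from the log excerpt above (prefix fields dropped).
SAMPLE = [
    r'msg="Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" at the cluster scope"',
    r'msg="Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" at the cluster scope"',
    r'err="error creating HTTP client: unable to load specified CA cert /etc/prometheus/secrets/etcd-certs/etcd-ca.crt: open /etc/prometheus/secrets/etcd-certs/etcd-ca.crt: no such file or directory"',
]

denied, missing = summarize(SAMPLE)

if __name__ == "__main__":
    for (user, verb, resource), count in sorted(denied.items()):
        print(f"{count}x {user} cannot {verb} {resource}")
    for path, count in sorted(missing.items()):
        print(f"{count}x missing file {path}")
```

Run against the full pod logs, a summary like this would probably show two separate action points: the `prometheus-k8s` service account appears to lack cluster-scope `list` permission on `endpoints`, `pods`, and `services`, and the `etcd-certs` secret does not seem to be mounted where the etcd scrape configs expect it.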