# 20200706 - Bootstrap not finishing

###### tags: `CI`

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1874/pull-ci-openshift-machine-config-operator-master-e2e-openstack/1280022119370985472

```
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
E0706 06:34:01.231102 87 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.hny68n02-bbcf1.shiftstack.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=9&timeoutSeconds=521&watch=true: dial tcp 128.31.24.248:6443: connect: connection refused
...
```

## Questions

### Which node holds the API VIP?

-> master-2

### Why are we not seeing the same failure with periodic jobs?

It could be that there is some delay between code being merged and it entering the OCP builds the periodics test. The same thing happened with https://bugzilla.redhat.com/show_bug.cgi?id=1853298, where the failure appeared first in pre-submit jobs before propagating to the periodics.

## Bootstrap

kube-scheduler can't schedule the pods:

```
0/3 nodes are available: 3 node(s) had taint {node.kubernetes.io/unreachable: }
```

```
I0706 06:48:39.196114 1 node_lifecycle_controller.go:889] Node hny68n02-bbcf1-6lldc-master-1 is unresponsive as of 2020-07-06 06:48:39.196101072 +0000 UTC m=+33.038063570. Adding it to the Taint queue.
I0706 06:48:39.196266 1 node_lifecycle_controller.go:889] Node hny68n02-bbcf1-6lldc-master-0 is unresponsive as of 2020-07-06 06:48:39.196260785 +0000 UTC m=+33.038223283. Adding it to the Taint queue.
I0706 06:48:39.196375 1 node_lifecycle_controller.go:889] Node hny68n02-bbcf1-6lldc-master-2 is unresponsive as of 2020-07-06 06:48:39.196313206 +0000 UTC m=+33.038275701. Adding it to the Taint queue.
```

kube-controller-manager restarts at `06:48:06.250058`.

`[::1]:6443` stopped responding since `06:48:39.296357`:

```
E0706 06:48:39.296357 1 reflector.go:382] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: Get https://localhost:6443/apis/controlplane.operator.openshift.io/v1alpha1/podnetworkconnectivitychecks?allowWatchBookmarks=true&resourceVersion=1888&timeout=5m14s&timeoutSeconds=314&watch=true: dial tcp [::1]:6443: connect: connection refused
```

kube-apiserver restarts at `06:48:41.232190`.

## 10.0.0.27 master-0

keepalived:

```
Mon Jul 6 06:32:27 2020: (hny68n02-bbcf1_API) Entering BACKUP STATE (init)
```

## 10.0.0.16 master-1

keepalived: BACKUP state

haproxy-monitor:

```
time="2020-07-06T06:33:44Z" level=info msg="Failed to get master Nodes list" err="Get https://api-int.hny68n02-bbcf1.shiftstack.devcluster.openshift.com:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D: dial tcp 10.0.0.5:6443: connect: connection refused"
time="2020-07-06T06:33:44Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
```

## 10.0.0.21 master-2

keepalived:

```
Mon Jul 6 06:33:33 2020: (hny68n02-bbcf1_API) Entering MASTER STATE
```

Why is it not starting kube-apiserver? kubelet.log shows that haproxy fails.
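Next time this reproduces, something like the sketch below (run on each master over SSH) could confirm the picture: which node actually holds the VIP, whether anything answers on 6443 locally, and whether the kube-apiserver / haproxy / keepalived static pod containers are running. This is only a rough sketch; the `10.0.0.5` address is taken from the haproxy-monitor errors above and the actual API VIP may differ.

```
# Does this node currently hold the API VIP?
# (10.0.0.5 is the address the haproxy-monitor errors dial; adjust if the VIP differs)
ip -4 addr show | grep -F '10.0.0.5'

# Is anything answering on the API port locally?
curl -k --connect-timeout 5 https://localhost:6443/healthz || echo "local apiserver not responding"

# Are the relevant static pod containers running on this node?
sudo crictl ps -a | grep -E 'kube-apiserver|haproxy|keepalived'
```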
In kubelet.log on master-2, the following haproxy-monitor failure appears a total of 53 times:

```
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]: time="2020-07-06T06:33:32Z" level=info msg="Runtimecfg rendering template" path=/etc/haproxy/haproxy.cfg
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]: time="2020-07-06T06:33:32Z" level=error msg="Failed to write reload to HAProxy master socket" socket=/var/run/haproxy/haproxy-master.sock
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]: Error: write unix @->/var/run/haproxy/haproxy-master.sock: write: broken pipe
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]: Usage:
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]:   monitor path_to_kubeconfig path_to_haproxy_cfg_template path_to_config [flags]
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]: Flags:
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]:   --api-port uint16       Port where the OpenShift API listens at (default 6443)
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]:   --api-vip ip            Virtual IP Address to reach the OpenShift API
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]:   --check-interval duration   Time between monitor checks (default 6s)
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]:   -h, --help              help for monitor
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]:   --lb-port uint16        Port where the API HAProxy LB will listen at (default 9443)
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]:   --stat-port uint16      Port where the HAProxy stats API will listen at (default 50000)
Jul 06 06:56:44 hny68n02-bbcf1-6lldc-master-2 hyperkube[1814]: time="2020-07-06T06:33:32Z" level=fatal msg="Failed due to write unix @->/var/run/haproxy/haproxy-master.sock: write: broken pipe"
```

This looks like this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1849583

Is it a cause or a consequence? Or just a red herring? The entry repeats with the same message timestamp (06:33:32) under a later journal timestamp (06:56:44), so it is most likely a red herring.
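To tell whether these broken-pipe writes are a live problem or just stale noise from an old monitor instance, one option is to poke the HAProxy master socket on master-2 directly. A rough sketch only, assuming the socket path from the log is visible on the host and that socat is available there (e.g. via toolbox); `show proc` is the standard HAProxy master-CLI command for listing the master and its workers:

```
# Is the master socket present, and does a live haproxy process still own it?
ls -l /var/run/haproxy/
sudo ss -xlp | grep -F haproxy-master.sock

# Ask the master process for its worker list (requires socat, e.g. from toolbox)
echo "show proc" | sudo socat stdio /var/run/haproxy/haproxy-master.sock
```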
The failing monitor write, for reference:

```
time="2020-07-06T06:33:32Z" level=error msg="Failed to write reload to HAProxy master socket" socket=/var/run/haproxy/haproxy-master.sock
```

haproxy containers on master-2:

- 2f30fbb01f329de212f7a84b3f3aed5555cae2c64b54f478606815520e9c36f5
  - `"createdAt": "2020-07-06T06:32:46.993887768Z", "startedAt": "2020-07-06T06:32:47.027387286Z", "finishedAt": "2020-07-06T06:33:21.036342717Z", "exitCode": 143`
- af6a0f0afc383e577e94afd9cbd53c1a7139903c2764e02b4a175583ab454008
  - `"state": "CONTAINER_RUNNING", "createdAt": "2020-07-06T06:33:29.386959883Z", "startedAt": "2020-07-06T06:33:29.455525286Z", "finishedAt": "1970-01-01T00:00:00Z", "exitCode": 0`

haproxy-monitor containers on master-2:

- 86974c17c294070d3906e0bd5e23cf3e138ace4f280664c755547da4f042a964
  - `"state": "CONTAINER_EXITED", "createdAt": "2020-07-06T06:33:20.300075124Z", "startedAt": "2020-07-06T06:33:20.364094299Z", "finishedAt": "2020-07-06T06:33:32.483546758Z", "exitCode": 1`
- 1055e7bf4475be97fb706144522b6224d47ba11fb060853475a0a7fde5a9a5b1
  - `"state": "CONTAINER_RUNNING", "createdAt": "2020-07-06T06:33:46.939762783Z", "startedAt": "2020-07-06T06:33:46.9747791Z", "finishedAt": "1970-01-01T00:00:00Z", "exitCode": 0`

haproxy-monitor loops over:

```
time="2020-07-06T06:58:49Z" level=info msg="Failed to get master Nodes list" err="Get https://api-int.hny68n02-bbcf1.shiftstack.devcluster.openshift.com:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D: dial tcp 10.0.0.5:6443: connect: connection refused"
time="2020-07-06T06:58:49Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
time="2020-07-06T06:58:49Z" level=info msg="GetLBConfig failed, sleep half of interval and retry" kubeconfigPath=/var/lib/kubelet/kubeconfig
```

Is etcd up?
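A quick way to answer that on each master would be something like the sketch below. This is an assumption-laden check: 2379 is the default etcd client port, and without the proper client certificates the curl probe is expected to fail at the TLS handshake rather than return a clean health response, but "connection refused" would still tell us etcd is not listening on that node.

```
# Is the etcd static pod container running at all?
sudo crictl ps -a | grep -F etcd

# Is anything listening on the etcd client port?
sudo ss -tlnp | grep -F 2379

# Reachability probe only: a TLS/cert error means something is listening,
# "connection refused" means etcd is down on this node
curl -kv --connect-timeout 5 https://localhost:2379/health || true
```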