Office Hours for April 2020

West Coast Edition

Panelists

  • Jorge Castro, VMware (Host)
  • Samudrala Vamshi, American Airlines
  • Monica Rodriguez, VMware
  • Tunde Olu-Isa, VMware
  • Erik Osterman, Cloud Posse
  • Jeremy Rickard, VMware
  • Dave Strebel, Microsoft
  • Bhargav Bhikkaji (BB), Independent Consultant
  1. Question:

  2. Question:

    • Text: Do you have any suggestions on how to set up Prometheus for cluster monitoring so that it stays highly available? I.e., would you put it in the same cluster, a separate cluster, or an external VM?

    Do you have any resources on how to set up a "master" Prometheus? I am not sure how to send data from one Prometheus to another.
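
One option worth looking at for the "master" Prometheus part is federation: a central Prometheus scrapes the `/federate` endpoint of each per-cluster Prometheus and selects series with repeated `match[]` parameters. The sketch below only builds such a URL so its shape is visible. The host name, the selectors, and the `federate_url` helper are assumptions made for illustration, not something discussed in the session.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def federate_url(base_url, selectors):
    """Return a /federate URL selecting the series that match the given selectors."""
    # Each selector is sent as its own repeated match[] query parameter.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical per-cluster server and selectors.
url = federate_url(
    "http://prometheus.cluster-a.example:9090",
    ['{job="kubernetes-nodes"}', "up"],
)
print(url)
# Decoding the query string recovers the original selectors.
print(parse_qs(urlparse(url).query)["match[]"])
```

In practice the central server would list this path and these parameters in a scrape job rather than call the URL by hand.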

  3. Question:

  4. Question:

  5. Question:

  6. Question:

  7. Question:

    • Text: I am currently writing a bash script to set up our dev/staging GKE clusters so we can shut them down at night and restart them in the morning at will, but I'm wondering if there's an existing solution that would let me bundle up all the resources needed (the cluster, the DNS, the various public Helm charts such as mongo and redis with custom config, and our own applications, which also happen to be custom Helm charts) rather than a custom bash script. I know AWS has something called CloudFormation that might have allowed me to do this, but we are using GKE. If several tools exist, which one do you recommend?

    • Asker: Christian Roy

    • Answer: Terraform, Flux, Argo CD
      https://github.com/cloudposse/terraform-aws-eks-cluster
      https://eksctl.io/

  8. Question:

    • Text: Earlier today, I was troubleshooting an issue and noticed that in a cluster there are deployments using PVCs with the storage class local-storage and others with local-storage-local; however, there is no SC object that represents the storage class when I do "get sc". How did that happen? I also checked the node for related annotations/labels; there was one label, localstorage=true, but nothing corresponding in the deployment.
    • Asker: Walid
  9. Question:

  10. Question:

  11. Question:

    • Text: What’s the purpose of setting a CPU limit if the container might be allowed to exceed it and won’t be killed? Or is it just to override the LimitRange admission controller? How does k8s know whether it will allow this or not? Does it depend on the allocatable resources of the node?
    • Asker: Javier
    • Answer: It's a cgroup behavior - https://github.com/kubernetes-sigs/descheduler
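
The cgroup behavior in the answer can be made concrete with a little arithmetic. On Linux nodes using CFS bandwidth control, a CPU limit becomes a quota of CPU time per scheduling period, and a container that uses up its quota is throttled until the next period rather than killed. The sketch below assumes the common default period of 100 ms; the function name is hypothetical and it mirrors only the arithmetic, not kubelet's actual code.

```python
CFS_PERIOD_US = 100_000  # assumed default CFS period: 100 ms, in microseconds


def cpu_limit_to_cfs_quota(milli_cpu, period_us=CFS_PERIOD_US):
    """Convert a CPU limit in millicores to CPU time (microseconds) allowed per period.

    Returns None when no limit is set, meaning no quota is enforced.
    """
    if milli_cpu <= 0:
        return None
    # 1000 millicores correspond to one full period of CPU time.
    return milli_cpu * period_us // 1000


for limit in (250, 500, 2000):
    quota = cpu_limit_to_cfs_quota(limit)
    print(f"{limit}m -> {quota} us of CPU time per {CFS_PERIOD_US} us period")
```

A limit of 500m therefore allows 50 ms of CPU time in every 100 ms window, regardless of how much allocatable CPU the node has left.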
  12. Question:

    • Text: What’s the best persistent storage for on-premises workloads? I have tried rook-ceph and Longhorn. What are you guys using?
    • Asker: Jawiz
    • Answer: Rook, MinIO, Portworx, NFS, and investigating vSphere (vSAN) CSI
    • As a database geek, I use local volumes when I'm performance-constrained (and rely on DB replication); when I'm not, I use Ceph/Rook.

URLs:

EU Edition

Panelists

  • Jorge Castro, VMware (Host)
  • Jeffrey Sica, Red Hat
  • Pierre Humberdroz, Spectrm
  • Puja Abbassi, Giant Swarm
  • Chris Carty, Google
  • Dave Strebel, Microsoft
  • Mario Loria, StockX
  • Povilas Versockas, UW

  • https://github.com/tekliner/rabbitmq-operator
  • https://github.com/indeedeng/rabbitmq-operator
  • https://kubernetes.io/docs/tasks/run-application/run-replicated-stateful-application/
  • https://github.com/influxdata/telegraf/issues/6959#issuecomment-608318315
  • https://relnotes.k8s.io/?markdown=balancer&releaseVersions=1.18.0
  • https://stackoverflow.com/questions/61113732/how-to-reuse-k8s-validation-in-my-own-crd-controller
  • https://engineering.dollarshaveclub.com/kubernetes-fixing-delayed-service-endpoint-updates-fd4d0a31852c
  • https://github.com/kubernetes/kubernetes/blob/master/go.mod#L124
  • https://kubernetes.io/docs/tasks/tls/certificate-rotation/
  • https://github.com/kubernetes/community/tree/master/sig-cluster-lifecycle
  • https://github.com/kubernetes/kubernetes/issues/88553
  • https://github.com/lensapp/lens

Questions

  1. Question:

    • Recently I've been playing with a 3-node RabbitMQ cluster. The cluster is deployed as a StatefulSet with replica=3. For some reason, one of the pods refused to join the cluster; this caused the readiness probe, the liveness probe and the entrypoint to fail, so that pod refused to start. According to the RabbitMQ docs, the way to solve this is to issue a "reset" command, but how could I do that if the pod refused to run? I also thought of changing the entrypoint or the readiness probes, but this would have affected the other pods since they are replicas (removing the readiness probe could prevent them from starting in the right order, for example).
    • I ended up resetting the RabbitMQ cluster state in a somewhat less orthodox way, but this is not something I could have done in a production environment. I wondered if I could have recovered it in some other (possibly less application-specific) way: for example, by running a command in a pod that has problems starting, by overriding the readiness/liveness/entrypoint for a specific replica, or by something else entirely?
    • Asker: Simone Baracchi
  2. Question:

  3. Question:

    • I have an external Netty server (not in the cluster). How can I share the same WebSocket connection from my pods (clients) to this external server?
    • I need multiple pods so that messages sent by the service can be distributed, but I don't want the service to have a separate connection for each pod.
    • Asker: prashant rathore
  4. Question:

  5. Question:

    • Hello all. Since Kubernetes v1.18 was released I have encountered a lot of problems with MetalLB external IPs. Services of type LoadBalancer don't get an address; they remain in the "pending" state. I've checked the MetalLB logs and didn't find anything that appeared meaningful for the problem. I'm working in a virtual environment that is deployed automatically, so I'm 100% sure that nothing changed in the configs; the only thing that changed is the Kubernetes version. I've tried to deploy the cluster from scratch many times, but the result was the same. I tried a manual Kubernetes installation too, with the same result. I've used this configuration for 3 months without any problem, so I know that before Kubernetes 1.18 everything worked as expected. Do you have any idea? Have you experienced a similar issue?
    • Asker: Emanuel
  6. Question:

  7. Question:

    • Hi all, I am getting the following errors from the API server logs. Does anyone have any idea what these errors mean and how to fix them?
    • {code="500",component="apiserver",contentType="application/json",endpoint="https",instance="172.31.111.155:443",job="apiserver",namespace="default",resource="events",scope="namespace",service="kubernetes",verb="POST",version="v1"}
    • Error from apiserver logs: I0415 12:47:08.913951 1 log.go:172] http: TLS handshake error from 172.31.111.240:41740: remote error: tls: bad certificate
    • Asker: Ravi
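
A possible first step with the "bad certificate" line above is to pull the client address out of each handshake error and check which node or component lives at that address. The sketch below is only an illustration: the regular expression, the function name, and the IPv4-only matching are assumptions, and it uses the log line quoted in the question as sample input.

```python
import re

# Matches the tail of an API server handshake error; IPv4 clients only.
HANDSHAKE_RE = re.compile(
    r"TLS handshake error from (?P<ip>[\d.]+):(?P<port>\d+): (?P<reason>.+)$"
)


def parse_handshake_error(line):
    """Return (client_ip, client_port, reason), or None if the line does not match."""
    match = HANDSHAKE_RE.search(line)
    if match is None:
        return None
    return match.group("ip"), int(match.group("port")), match.group("reason")


sample = (
    "I0415 12:47:08.913951 1 log.go:172] http: TLS handshake error from "
    "172.31.111.240:41740: remote error: tls: bad certificate"
)
print(parse_handshake_error(sample))
```

Counting how often each client IP appears would show whether a single client or many are presenting a certificate the API server rejects.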
  8. Question:

    • Why can't the kubeadm certificates be valid for ten or one hundred years? It's too painful. When we deliver Kubernetes to a customer's production environment, we need to manually compile the kubeadm source code or manually create a large number of certificates. Can't you give us an optional parameter or variable to configure this flexibly?
    • Asker: zhang
  9. Question:

  10. Question:

    • Just wondering why Kubernetes is using both make and Bazel. I made a simple PR for error handling and several CI stages failed because of Bazel.
    • Asker: Rodrigo Villablanca