
Office Hours June 2020

West Coast US Edition - Cancelled

Raffle:

Panelists

  • Dave Strebel
  • Vamshi Samudrala
  • Jorge Castro

EU Edition

Panelists

URLs


Person: marcoceppi

Question:

Okay. I have a question about storage that I’m hoping the panelists have experienced or have sage advice on. We currently run Kubernetes on bare metal in edge locations. Due to physical constraints, these bare-metal hosts (6-10 nodes) are unconventional, small, and unreliable. Each unit in a cluster can also experience independent power issues. Because of that, it can be up to a week before a powered-off node is turned back on. So far, because of how we run Kube/etcd, we haven’t really lost a single site and things fail over appropriately, but we’re losing workloads due to storage lockups.

We run Rook/Ceph and have a single StatefulSet and a single Deployment that use storage (there are more apps, but they’re all ephemeral/stateless). We’re hitting an issue where, if a node becomes NotReady and gets the NoExecute node taint (when the node controller can’t communicate with the kubelet, 90% of the time because the node lost power), storage will never “unlock” from that node. When the controller eventually reschedules the StatefulSet pod, it’s stuck because Ceph RBD is RWO and the storage, as far as Kubernetes is concerned, is still attached to the offline node.

To remedy this, we wrote a script that removes a node from the cluster once it has been unavailable in k8s for 60s. That reschedules the workload more quickly but still doesn’t release the storage in a timely fashion, so we’re stuck in the same situation, only this time the remedy is to force delete the StatefulSet pod a few times, and 75% of the time it’ll get storage again.

Everything we’ve done so far is a hack. We’re grappling with hardware issues (we’ll replace the hardware, but not for another six months) and, more importantly, with the way storage, CSI, and Kubernetes all intersect. To us it seems the only options are RWX storage, which outside of NFS doesn’t seem to have many performant options, or continuing to develop this node-removal script so that it also repeatedly force deletes objects until a desired state is met.
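For readers following along, the selection step of the node-removal script described above might look roughly like the sketch below. This is not the asker's actual script: the `NodeStatus` type, the `nodes_to_remove` function, and the node names are all hypothetical, and only the 60-second threshold comes from the question. It deliberately covers just the "which nodes have been NotReady long enough" decision and does not touch the cluster, detach volumes, or force delete pods.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

# Assumption: the 60s threshold is the one mentioned in the question.
NOT_READY_GRACE = timedelta(seconds=60)


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one node's Ready condition."""
    name: str
    ready: bool
    # Time at which the Ready condition last changed state.
    last_transition: datetime


def nodes_to_remove(nodes: List[NodeStatus],
                    now: datetime,
                    grace: timedelta = NOT_READY_GRACE) -> List[str]:
    """Return names of nodes that have been NotReady for at least `grace`."""
    return [
        n.name
        for n in nodes
        if not n.ready and (now - n.last_transition) >= grace
    ]


if __name__ == "__main__":
    now = datetime(2020, 6, 17, 12, 0, 0, tzinfo=timezone.utc)
    cluster = [
        NodeStatus("edge-1", True, now - timedelta(hours=1)),
        NodeStatus("edge-2", False, now - timedelta(seconds=30)),
        NodeStatus("edge-3", False, now - timedelta(days=2)),
    ]
    # Only edge-3 has been NotReady for longer than the grace period.
    print(nodes_to_remove(cluster, now))
```

Keeping the decision separate from the action makes it easier to test the threshold logic without a live cluster; the storage release problem raised in the question is untouched by this step.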


Answer:


Person: Ari

Question: One thing that everybody uses Kubernetes for is to make sure that pods have enough CPU and memory on nodes through resource requests and limits, and the Kubernetes scheduler does a good job of making sure that nodes aren't overscheduled in terms of CPU and memory.

My question is - what about disk I/O and networking? The underlying node storage and networking will have a real limit to IOPS etc. Are there plans to get Kubernetes to allow pods to define storage and networking requirements, and not over-schedule storage and networking requirements on nodes which cannot handle their requests?

Answer:


Person: Dimitrije M

Question: Good morning! I have a question about cluster utilization. I am setting resource requests and limits for our services and I was curious how to approach these. Should I set requests based on peak load, or should I allow limits to handle those? My current request setup leaves my cluster at <20% utilization, and I was wondering how I should approach increasing my utilization.
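To make the utilization figure in this question concrete, it can help to separate how much of the cluster is reserved by requests from how much is actually used. The sketch below is only an illustration with made-up numbers; the `Workload` type, the `summarize` function, and every value in it are hypothetical, and it does not settle the peak-versus-limits question.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Workload:
    """Hypothetical per-service CPU figures, in millicores."""
    name: str
    cpu_request_m: int  # CPU requested across all replicas
    cpu_usage_m: int    # observed average CPU usage across all replicas


def summarize(workloads: List[Workload], allocatable_m: int) -> Dict[str, float]:
    """Compare requested and used CPU against allocatable capacity."""
    requested = sum(w.cpu_request_m for w in workloads)
    used = sum(w.cpu_usage_m for w in workloads)
    if allocatable_m <= 0 or requested <= 0:
        raise ValueError("allocatable and total requests must be positive")
    return {
        # Share of the cluster the scheduler considers reserved.
        "requested_pct": 100 * requested / allocatable_m,
        # Share of the cluster that is actually busy.
        "used_pct": 100 * used / allocatable_m,
        # How much of what was requested is actually consumed.
        "usage_vs_requests_pct": 100 * used / requested,
    }


if __name__ == "__main__":
    services = [
        Workload("api", cpu_request_m=4000, cpu_usage_m=800),
        Workload("worker", cpu_request_m=6000, cpu_usage_m=1200),
    ]
    print(summarize(services, allocatable_m=16000))
```

In this made-up case the scheduler sees the cluster as 62.5% reserved while only 12.5% is in use, which is the kind of gap that requests sized for peak load tend to produce.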

Answer:


Person: Joel Davis

Question: Is there anything in RBAC that allows you to select against certain namespaces? For example, if I want to give someone the ability to create new namespaces (in which they have access to * verbs on * resources) but not to interact with a given set of existing namespaces, is that possible to do through RBAC?

Answer: Potentially a custom webhook or operator.


Person: cten

Question: How are you monitoring and managing the cost to run your clusters? (currently running in AWS)

Answer: https://kubecost.com/ is probably the best solution out there right now


Person: Joel Speed

Question: Something that's come up in my work recently: when hosting metrics endpoints in an application, do people (or should people) require authn/authz to access them? I.e., should something like kube-rbac-proxy be put in front, or should the functionality be implemented in the application?

Answer:


Person: Baluwii

Question: Is there a recommended storage backup solution?

Answer:


Person: Andrei

Question: Hi, two questions.

Volume question: on the "Volumes" doc page we can read "Kubernetes supports several types of Volumes".

Generic: what does it mean that k8s "supports" them?

Particular: it seems the glusterfs client comes to the k8s node with hyperkube, and it also seems hyperkube is going to be deprecated. How will k8s continue to support these kinds of volumes?

It would be great to hear how to bring the latest glusterfs client onto a node and unblock developers, so they can use the latest glusterfs (deployed somehow, somewhere) from our k8s cluster as a client.

Resource question: would you recommend limiting CPU/memory for kube-apiserver and etcd running as static pods?

Answer:


Person: knabben

Question: Hi, is there an official/recommended Grafana template for monitoring the entire Kubernetes ecosystem (metrics on apiserver/kubelet/scheduler)? What other tools are people using to get a big picture of it?

Answer:


Person: Mihir Shah

Question: How to set Custom metrics for HPA?

Answer:


Person: athavan kanapuli

Question: Hi, I have a question. Is there a way to specify certain pods in a deployment to get killed when the horizontal autoscaler scales down the pods of a deployment?

Answer:


Person 11: Jojo Pad

Question: Is there a tool for testing k8s service latency?

Answer:


Person 12: Karoline Pauls

Question: Given a currently updating deployment, rolling in a new replicaset and rolling out the previous one, which replicaset are pods taken away from if a user or a scaler decreases the deployment's replica count?

Answer:


Person 14: Vishnu Prasad

Question: Is there a project or tool that would help us configure how and when to autoscale nodes up and down? For example, in EKS node groups, especially when to scale the nodes down. This is mainly because certain loads can’t always use metrics like CPU or memory for scaling up and down.

Answer:

Person 15: Long

Question: Is stable/metrics-server supposed to work out of the box, or is it absolutely required to add the --kubelet-insecure-tls flag?

Answer:

Appendix

Intro Script

Welcome everyone to today’s Kubernetes Office Hours, where we answer your user questions live on the air with our esteemed panel of experts. You can find us in [#office-hours] on Slack; check the channel topic for the URL with more information.

  • Before we begin let’s start by introducing ourselves: (Give each panelist about a minute)
  • Before we start here are the ground rules:
    • This is a Kubernetes event so the Code of Conduct is in effect, please be excellent to each other.
    • This is a judgement-free zone. Everyone had to start somewhere, so please help out your buddy by keeping the channel a supportive environment.
    • While we will do our best to answer your questions, the panel doesn’t have access to your cluster, so live debugging is off topic, but we will do our best to get you moving toward the next step.
    • Panelists, you’re encouraged to expand on answers with your experiences and pro-tips.
    • Audience, you can help by pasting in URLs to official docs, blogs, or anything that might be relevant to the topic at hand.
    • Post your questions on [discuss.kubernetes.io].
    • You can also help us out by tweeting, spreading the word, and paying it forward.
    • This panel is made up entirely of volunteers. If you want to rotate in, please let us know; we love to have new people rotate in and help out.

Contest

The HackMD notes document will have a list of who has asked questions; roll a die to see who won the shirts. On occasion, if someone from the audience has been helpful, feel free to give them a shirt as well; we want to reward people for helping others. Note: Multi-sided dice not included.

Outro

(Note, the companies will change over time depending on the hosts)

  • Thanks to the following companies for supporting the community with developer volunteers: Giant Swarm, StockX, Pivotal, Pusher.com, Weaveworks, VMware, University of Michigan, Red Hat, and Utility Warehouse. Special thanks to CNCF for sponsoring the t-shirt giveaway.

And lastly, feel free to hang out in [#office-hours] afterwards, if the other channels are too busy for you and you’re looking for a friendly home, you’re more than welcome to pull up a chair and hang out.