# [WIP] KubeCon EU 2020 Conference Notes
## 1,000-Foot View and Random Learnings
* Conference held virtually this year due to the global pandemic (COVID-19)
* Hours are generally 5 AM – 12 PM MT
* Scheduled keynotes in the middle of the day instead of the beginning, although many are just sponsor advertisements (though the CNCF at least calls out which ones are “sponsored” in Sched)
* Had to use Chromium (Firefox wasn’t working, and I had to disable pop-up blockers in the browser settings)
* CNCF used a virtual webcast and collab platform called onlinexperiences.com
* The breakout talks/sessions are all pre-recorded, but when the session (recording) is complete, the webcast platform switches to the speaker(s) live for Q&A
* Experienced some audio dropouts and awkward latency/lag on Day-0, but seemed to be better the rest of the conference
* Breakout sessions, virtual booth crawl, hallway track, etc. are accessed from the tabs at the top of the UI menu
* KubeCon, at the very least, is good for rekindling interest in the ecosystem, related tooling, and the CNCF landscape in general
* Lots of ESL speakers with different accents using different microphones at different input levels *(as expected during a first-time virtual Europe KubeCon)*
* A significant number of the presentations had titles/abstracts that promised a more advanced dive into the material but often covered K8s basics (have a plan-B breakout session/talk lined up in case one doesn't meet expectations)
* It’s super easy to bounce out of a room if the presenter is suboptimal, the pace is slow, or the material isn’t what was advertised, etc. The downside is that it’s easy to split your attention between different presos, which dilutes the effectiveness or value of the conference.
* Q&A didn’t work for me. I’d submit a question, but it wouldn’t display and the presenter didn’t address it (though it was easy enough to follow up in the appropriate CNCF Slack channel).
## Day 0
### “From Minikube to Production, Never Miss a Step in Getting Your K8s Ready” (OVHcloud – they run k8s on top of OpenStack, F5 LB)
* many things running as root, connecting to storage and networks
* security team adds ingress rules
* dns *something* (need to rewatch)
* network example: kube-proxy – 3 modes: userspace, iptables, IPVS. As clusters grow, iptables-based service routing gets slower and slower; IPVS, on the other hand, is based on hash tables, which gives roughly constant-time lookups. Network plugins (Flannel, Calico, Weave…). (See the kube-proxy config sketch at the end of this section.)
* The Storage Dilemma
* PV network storage – NFS? Ceph FS? Cinder?
* Volumes are handled through CSI – the interface between k8s and the storage tech. CSI storage errors and state shifts or maintenance can cause problems (e.g. “failed to attach”, “volume not found”, etc.)
* ETCD
* how to backup restore? How to sync?
* Security
* Open ports, the k8s API (e.g. the Tesla hack), exploits, RBAC, etc. By default it’s not secure – this is a feature, not a bug: not everybody has the same security needs, so it’s up to the k8s admin to secure it. Disable --anonymous-auth and set --authorization-mode (the default is AlwaysAllow). Close all ports and only open them as necessary. Follow the least-privilege model. Define and implement network policies. Don’t use the RBAC admin persona. Use RBAC and network policies to isolate your workloads. Keep k8s up to date – lots of security fixes/patches; k8s is a big target.
* Extensibility
* K8s is modular. Modules are available to extend k8s. Check for existing functionality in the ecosystem before creating your own custom tool.
* Package manager (e.g. Helm)
* Istio –
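
As a concrete follow-up to the kube-proxy modes above, here is a minimal sketch (not from the talk) of selecting IPVS mode via a KubeProxyConfiguration file; the scheduler value is just an example:

```yaml
# Sketch: pass this file to kube-proxy via --config (on kubeadm-style clusters it lives
# in the kube-proxy ConfigMap). Assumes the IPVS kernel modules are loaded on the nodes.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"        # empty/unset falls back to iptables
ipvs:
  scheduler: "rr"   # round-robin; example value
```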
### "Help! My Cluster Is On The Internet: Container Security Fundamentals" (Samuel Davidson)
* workload security, cluster security, user security
* Workload Security
* Assume you will be owned
* Use a distroless base image
* Debian 10 has a bunch of features; a distroless image would be just the bare-bones chassis of Debian 10 (aka distroless Debian 10)
* Containers are easy to rebuild and deploy
* workload security – use CI/CD to make process easier (speaker didn’t provide much detail here)
* Trust your containers with signatures! (I think he means image tags)
* Use your CI/CD platform to run dependency validations, vulnerability scanning, and tests
* Pod-level Security
* Don’t use hostPath
* Don’t use hostNetwork (i.e. don’t set hostNetwork: true in a podSpec)
* Be conscious of your pod’s Service Account
* Every pod is bound to a ServiceAccount
* Bind a diff SA that’s unique to the pod’s use case
* put the pod in a diff ns
* set automountServiceAccountToken to false (see the ServiceAccount/pod sketch at the end of this section)
* Cluster Security
* Keep cluster up to date
* isolate your cluster from the internet (it should be in a private network, not scannable from the internet, no public IPs)
* Log devs and bots into the network
* Use an LB (which has beefy protections) to forward valid traffic
* If the cluster needs internet access, then allow egress (allow/deny lists aka white/blacklists, etc)
* Secrets – for secrets use Secret objects
* Audit
* Node
* isolate from the internet
* User Security
* Use RBAC and groups.
* Groups would act as a layer between roles and the users (aka subjects), defined through group memberships.
* Use a policy agent (admission controller) (e.g. OPA Gatekeeper)
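
A minimal sketch pulling together the pod-level recommendations above (dedicated ServiceAccount, no auto-mounted token, no hostPath/hostNetwork); all names and the image are hypothetical:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: billing-worker          # hypothetical SA, unique to this pod's use case
  namespace: billing
---
apiVersion: v1
kind: Pod
metadata:
  name: billing-worker
  namespace: billing
spec:
  serviceAccountName: billing-worker
  automountServiceAccountToken: false    # no API credentials unless the app really needs them
  containers:
    - name: app
      image: registry.example.com/billing-worker:1.0.0   # placeholder image
      # note: no hostPath volumes, and hostNetwork is left at its default (false)
```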
Tried attending a random service mesh talk toward the end, but the webcast video froze 4–8 min into the preso (and I had an internal meeting conflict).
I did hear Daniel Berg, Distinguished Engineer at Google, say this regarding service mesh:
> If you don’t know why you need it, don’t use it
## Day 1
### Kubernetes Patterns (Dr. Roland Huß - Principal Software Engineer at Red Hat)
* Configuration
* How to manage large and complex data?
* Use an init container
* Behavioral patterns
* service discovery
* for services outside the cluster, create a manual Endpoints resource with the same name as the service, or use a service of type ExternalName (see the sketch at the end of this section)
* Advanced Patterns
* Controller
* getting from current state to desired state (reconciliation)
* api server > controller > api server > node components (kubelet, kube-proxy, etc.)
* ConfigMap Watch Controller (bash script that does a curl while watching)
* Operator
* Looking at Jimmy Zelinskie’s definition: “An operator is a k8s controller that understands 2 domains: k8s and ‘something else’. By combining knowledge of both areas, it can automate tasks…”
* Classify your CRDs between Installation CRDs (Prometheus, Kafka, etc.) and Application CRDs (your specific app’s CRDs – e.g. a custom ServiceMonitor)
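
For the “services outside the cluster” pattern mentioned above, a minimal ExternalName sketch (names are examples, not from the talk); the alternative is a selector-less Service plus a manually maintained Endpoints object with the external IPs:

```yaml
# In-cluster clients connect to "legacy-db" and DNS returns a CNAME to the external host.
apiVersion: v1
kind: Service
metadata:
  name: legacy-db
spec:
  type: ExternalName
  externalName: db.legacy.example.com
```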
### Handling Container Vulnerabilities with Open Policy Agent (Aqua Security)
*(CLI demo showing adding an admission controller webhook, then bringing down an image, using Harbor, which already has image scanning built in; see the webhook sketch at the end of this section)*
* First need to define a custom policy for vuln handling
* OPA integration
* trivy
* trivy enforcer
* They use the NVD and security info from the vendors (RH, etc.)
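
A generic sketch of what registering a validating admission webhook for pod creation can look like (the service name, namespace, path, and CA bundle are placeholders; this is not Aqua’s or Trivy’s actual config):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: image-vuln-check
webhooks:
  - name: image-vuln-check.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail             # reject pods if the scanning webhook is unreachable
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        namespace: scanners         # placeholder
        name: vuln-webhook          # placeholder
        path: /validate
      caBundle: "<base64-encoded CA certificate>"   # placeholder
```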
### Day 1 Keynote
* End to End: The Foundation of Doers - Priyanka Sharma
* Falco with Shopify and Sysdig (seemed sales-y)
* Cisco sponsored talk from VP (definitely sales-y)
* Constance from Splunk on new CNCF projects and development
* Liz Rice – Aqua Security on CNCF TOC
* Operating Enterprise Grade Kubernetes Clusters at Salesforce on Bare Metal (Salesforce)
* Deep Dive into infra
* On prem (github > puppetmaster > k8s, docker, etcd, systemd, puppet, flannel, HAProxy)
* (diving into puppet modules)
* Operationalizing k8s at Salesforce
* etcd and k8s are HA
* mtls for everything
* using “watchdog” (i.e. agents) for monitoring which go to pagerduty
* api-server, controller-manager, and etcd are monitored
* Open source “Sloop” for historic visibility
* War Stories
* war story #1
* Perils of mounting hostPath
* Pods were stuck in ContainerCreating for a long time
* Root cause: some pods were mounting the root filesystem (paths weren’t found and conflicting with emptyDir, etc.)
* war story #2
* Intermittent connectivity failure for microservices communicating through a k8s svc.
* Deeper analysis pointed to failures only when client and server pods are on the same host
* root cause: another team set bridge-nf-call-iptables=0 in an attempt to optimize.
* This messed up iptables
* war story #3
* etcd state was wiped out in the R&D clusters
* fix: change the etcd cluster-state flag from “new” to “existing” after the initial bootstrapping period (see the flag sketch at the end of this section)
* war story #4
* R&D cluster control plane servers going down with OOMs
* cause: a new mutating webhook was leaking pods
* Found out with alerting
* mutating webhook admission controllers need validation and canarying
* adding limits is better than making the node unusable
* war story #5
* inconsistent API server flags
* symptom: the ReplicaSet controller was failing to create pods
* cause: Puppet canarying of service account flags across masters
* fix: sync the rollout of API server and controller manager flags everywhere
* learning: staged rollout does not always work when rolling out new feature flags.
* Summary:
* Rolling your own k8s requires a lot of expertise and investment
* Invested in:
* fully automated, HA, and secure Docker, etcd, and k8s infra
* integrations with networking, security, monitoring (logs, metrics, alerts)
* robust on call rotation for infra and runbooks
* Watchdog monitoring and visibility pipeline
* Cost-to-serve visibility at the container, namespace, and team level
* Ongoing and Future projects
* PaaS layer
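
For war story #3, the knob in question is etcd’s cluster-state flag; here is a sketch of it in a hypothetical static-pod manifest excerpt (image/version and the other flags are examples, not Salesforce’s config):

```yaml
spec:
  containers:
    - name: etcd
      image: quay.io/coreos/etcd:v3.4.9        # example image/version
      command:
        - etcd
        - --name=etcd-0
        - --data-dir=/var/lib/etcd
        # "new" is only for the very first bootstrap of the cluster;
        # after that, members should start with "existing"
        - --initial-cluster-state=existing
```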
### "Hey, Did You Hear About This New CVE?"" - A Vulnerability Response Playbook Alexandr Tcherniakhovski - Google GKE Security Engineer )
* Checklist
* RBAC
* Network
* Pod-level
* Image
* Best approach is to be proactive and harden.
* Use the CIS Security benchmarks
* Least privileged RBAC Profile
* Kube-apiserver audit log enabled
* secret access: log both successful and denied requests (see the audit policy sketch after this checklist)
* alerts trigger on access-deny
* NetworkPolicy (at least in audit mode)
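
A minimal sketch of an apiserver audit policy that records every Secret access (so both successful and denied requests land in the audit log, and alerts can key off the denials); pass it via --audit-policy-file:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata            # who, when, verb, resource – but not the secret payload
    resources:
      - group: ""              # core API group
        resources: ["secrets"]
  - level: None                # ignore everything else in this minimal example
```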
### Using Kubernetes to Make Cellular Data Plans Cheaper for 50 Million Users
* Grafana, prom, elastic for log export, authz framework (?)
* Integrated with Mirantis Cloud Platform
* 5G to WiFi
* Edge
* container pods and vm pods coexist and have seamless networking
* Using Virtlet
* Virtlet is a CRI plugin that can run VMs as pods and run with other CNIs simultaneously
* Magma (fb project) – cloud svc
*(moving way too fast through advanced material – demo was a pre-recorded video, presented within a …)*
## Day 2
### Hubble - eBPF Based Observability for Kubernetes
* eBPF-based monitoring
* Loaded by userspace agent
* Filter via eBPF maps
* Cilium and Hubble both use eBPF
* Cilium is a CNI for pod-to-pod connectivity, but also
* service-based LB
* kube-proxy replacement
* network policies (see the CiliumNetworkPolicy sketch at the end of this section)
* port blocking
* node-to-node encryption
* Network observability dependency maps
* History
* Metrics and CLI – released in 2019
* Cilium 1.7 was designed with Hubble in mind
* Hubble API is integrated into the Cilium agent – more scalable
* Hubble can display Svc Dependency Maps, flow display and filtering, and network policy viewer (all from UI). Also integrates to prom stack.
* The Hubble API provides access to recent flows, streams current flows, cluster-wide visibility (via “Hubble Relay”), accessed by CLI and UI
* Flow visibility (e.g. hubble observe --follow -l class=xwing). “follow” mode shows you the flows.
* Pods are labeled via annotations
* Metrics for http
* Dashboards for Grafana available
* Easy to write your own metrics on top of what’s already available
* (Demo of UI, CLI, and Grafana dash – ideal for cluster security operators as well as cluster application operators).
* Showing latency metrics via the CLI – hubble observe --since=3m…
* Q: what does “to-stack” mean (as opposed to “to-endpoint”) when looking at the “Type” column of a hubble observe in the CLI? Is that the direction?
* Cilium is required when using Hubble, though Cilium itself can be chained with other CNIs.
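
A sketch of the kind of L7-aware policy Cilium supports (labels, port, and path are example values, not from the demo): only HTTP GETs on /api/* from frontend pods may reach backend pods on port 8080.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-allow-frontend-http
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: GET
                path: "/api/.*"
```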
### Managing Multi-Cluster/Multi-Tenant Kubernetes with GitOps - Chris Carty - Independent (CI/CD talk)
* GitOps
* Citing a Kelsey Hightower quote: “GitOps is the best thing since config as code”
* GitOps Intro:
* GitOps is a workflow that emphasizes Git as the source of truth
* There’s an operator of some kind (e.g. Flux) that sits in your cluster and enforces the declared state. It uses manifests (and presumably a loop per cluster)
* The flow would be submitting a PR to your git repo (e.g. github)
* A few tools used in the CI pipeline are OPA, Conftest, and Kubeval
*(presenter is flying through the slides, hard to grok; demo – but the resolution is too low and out of focus)*
### Day 2 Keynotes
*No breaking news. Many were sponsored segments.*
### Seccomp Security Profiles and You: A Practical Guide – Security Talk with Duffie Cooley
* Root vs. non-Root
* Kernel 2.2 gave us capabilities (seccomp itself came later)
* Capabilities are per-process – they grant a chunk of permissions to a container
* CAP_NET_ADMIN allows network administration functionality, BUT within a container context you can manipulate the underlying node
* CAP_SYS_ADMIN – effectively root
* Example – granting NET_ADMIN and SYS_TIME
* amicontained by Jess Frazelle
* What can a k8s pod do?
* What is the point?
* Attacks happen
* Images from a trusted source? (e.g. supply chain attacks)
* application bugs can allow exploitation (e.g. reverse tunnel, can an attacker get a bash shell?, etc.)
* syscalls – containers are just process isolation, still sharing the same kernel
* What syscalls are being used
* Need to create a seccomp profile that will protect, but how? (See the pod securityContext sketch at the end of this section.)
* Use the strace command
* DockerSlim
* Profile (output) your containers to get more info for designing your custom profile
* Demo
* Looking at profiles via audit.json and violation.json, etc.
* Operator to assist
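
A sketch of attaching a custom seccomp profile to a pod via the securityContext field (this field went GA in k8s 1.19; older clusters used the seccomp annotation instead). The profile path is hypothetical and is resolved relative to the kubelet’s seccomp directory:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-demo
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/fine-grained.json   # e.g. built from strace/amicontained output
  containers:
    - name: app
      image: nginx:1.19        # example image
```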
### Panel: Kubernetes and Cloud Native Security: A State of the Union
### In-place Upgrade No Way! Blue/Green Your Way to a New Kubernetes Version
*Turned out to be a high-level GitOps talk, bailed midway through*
### Making Compliance Cloud Native
Google Talk on Security
* Data Protection
* Have strong key mgmt procedures when using encryption
* Need to protect data in flight, data at rest, and secrets
* Secrets are not encrypted in etcd by default (see the EncryptionConfiguration sketch at the end of this section)
* Tips
* Have a TTL on your data
* maintain an asset inventory
* Familiarize yourself with the shared responsibility model
* key mgmt (set rotation policies)
* make sure the images are trusted
* manage the lifecycle of your containers
* Stages of Software Supply Chain
* Infra and policy as code
* OPA plus Terraform
* Grafeas and Kritis
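
A sketch of turning on encryption-at-rest for Secrets in etcd (the key material is a placeholder); the file is passed to the kube-apiserver via --encryption-provider-config:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: ["secrets"]
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: "<base64-encoded 32-byte key>"   # placeholder
      - identity: {}   # keeps previously stored plaintext secrets readable during migration
```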
## Day 3
### Keynote
#### Building a Service Mesh From Scratch - The Pinterest Story - Derek Argueta, Senior Software Engineer, Pinterest (Now Tesla)
*How we built a service mesh at Pinterest*
* The Great mTLS Migration
* used Java since that was most prevalent at Pinterest
* looked at Envoy for other apps
* one is an LB, two is a mesh
* The Control Plane
* leverage Jinja templates (see the template sketch at the end of this section)
* leverage Envoy’s schema validation tool
* The Fundamentals
* Ability to deploy Envoy
* Ability to config Envoy
* Templates to simplify config
* A build system that allowed them to write C++
* A static analysis system
* Internal Web Envoy
* TLS termination
* Clickjacking and XSS prevention headers
* CSRF
* CORS
* Just another *node* in the mesh
* Runs in front of all the internal web services
* Phabricator
* Jenkins
* Teletraan (internal deployment tool)
* New Mesh Use-Cases
* Web infra Team - advanced infra specific routing
* SRE team - Generic SLI (service level indicator) Monitoring for Error Budget tracking
* Privacy and Legal Team - HTTP Cookie Monitoring
* How did we get here?
* Solving business problems first
* Incremental progress
* Delivery at each step
* Unification of Traffic and Service Framework
* Envoy extensions
* Buy-in from other teams and orgs
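
A sketch of what “control-plane config via Jinja templates” could look like: a templated Envoy cluster with one endpoint per upstream host (variable names and values are hypothetical, not Pinterest’s actual templates):

```yaml
# cluster.yaml.j2 – hypothetical Jinja template rendered by the control plane
clusters:
  - name: {{ service_name }}
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: {{ service_name }}
      endpoints:
        - lb_endpoints:
          {% for host in upstream_hosts %}
            - endpoint:
                address:
                  socket_address:
                    address: {{ host }}
                    port_value: {{ upstream_port }}
          {% endfor %}
```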
### SUSE sponsored keynote talk from the President of Engineering
* Cloud Native is Edge native too
* KubeEdge and k3s are ideal for the edge
* (presenter is using a forest analogy to represent edge computing)
### Observing Kubernetes Without Losing Your Mind
Vicki Cheung, KubeCon + CloudNativeCon Europe 2020 Co-Chair & Engineering Manager, Lyft
* Complexity from an Operator's view and how to simplify
* Need to deploy a lot of packages
* Onboard your apps
* Things go wrong
* Postmortems conducted
* Add monitoring!
* how to deploy monitoring? (each app, all the apps? etc.)
* alert fatigue? ability to triage goes down
* Infra teams are concerned
* Test the user experience
* Note the signals you’re receiving
### Managing Cluster Sprawl - Keith from Rancher
*Sponsored Keynote: Managing Cluster Sprawl – Are You Ready?
Keith Basil, VP, Edge Solutions, Rancher*
* 451 Research believes that in 3 years 73% of enterprises will standardize on k8s
* "Anything below k8s is overhead for us" - *some oil and gas engineer*
### My Stint as a Chameleon
*Constance Caramanolis, KubeCon + CloudNativeCon Europe 2020 Co-Chair & Principal Software Engineer, Splunk*
### Advanced Persistence Threats: The Future of Kubernetes Attacks
Internal meeting conflict - TODO: watch recording replay
TL;DR - RBAC, network policy, admission control, hygiene; think about how an attacker would misuse things, and find the boundaries or weak spots in the system