# PART 1: Foundational Patterns
## Chapter 2: Predictable Demands
- This pattern describes how we should declare application requirements
### Problem
- From a resource-consumption point of view, the important aspects are the domain, the business logic, and the actual implementation details
- Other dependencies: platform-managed capabilities like data storage or application configuration
### Solution
### Runtime Dependencies
- Example: storage volumes, config maps, secrets
### Resource Profiles
- Compressible resource (can be throttled): such as CPU
- Incompressible resource (can't be throttled): such as memory
- Minimum amount: `requests`; maximum amount: `limits`
- Keys for `requests` and `limits`
- `memory`: incompressible
- `cpu`: compressible
    - `ephemeral-storage`: filesystem space on nodes for storing logs, container layers, and emptyDir volumes; incompressible
    - `hugepages-<size>`: contiguous preallocated pages of memory; `requests` and `limits` must be equal
- 3 types of Quality of Service (**QoS**)
        - **Best-Effort**: Pods without `requests` and `limits` set; lowest priority, killed first when the node runs out of incompressible resources
        - **Burstable**: `requests` and `limits` set but unequal; killed after Best-Effort Pods
        - **Guaranteed**: `requests` equal to `limits`; killed last, avoids OOM-triggered evictions
    - ***Recommendations*** (see the sketch after this list)
        - For memory, always set `requests` equal to `limits`
        - For CPU, set `requests` but no `limits`
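- A minimal sketch of that recommendation (image name is a placeholder); note that without a CPU limit the Pod lands in the Burstable QoS class:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: example.com/app:1.0   # placeholder image
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        memory: 256Mi            # memory: requests == limits
        # no CPU limit: CPU can burst and is throttled instead of the Pod being OOM-killed
```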
### Pod Priority
- `Pod priority` indicates the importance of a Pod relative to other Pods and affects the order in which Pods are scheduled
- Specified by creating a `PriorityClass` object and assigning it to the Pod's `priorityClassName` field
- **IMPORTANT**: Lower-priority Pods may be preempted from a node when it does not have enough capacity for higher-priority Pods
- The Kubelet considers QoS and then the PriorityClass of Pods before eviction, but the scheduler's preemption logic ignores QoS entirely (see the sketch below)
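- A sketch of a `PriorityClass` and a Pod referencing it (names and image are placeholders):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority            # hypothetical name
value: 1000                      # higher value = scheduled (and preempting) first
globalDefault: false
description: Pods that must be scheduled ahead of regular workloads
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: example.com/app:1.0   # placeholder image
```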

### Project Resources
- `ResourceQuota`: constrains the aggregate resource consumption in a namespace

- `LimitRange`: specifies min, max, and default values for different resource types, and also allows controlling the ratio between `requests` and `limits` (the `overcommit` level); see the sketch below
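- A sketch of both objects for a hypothetical `dev` namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: dev                 # hypothetical namespace
spec:
  hard:
    pods: "10"
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.memory: 16Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: dev
spec:
  limits:
  - type: Container
    defaultRequest:              # applied when a container declares no requests
      cpu: 100m
      memory: 128Mi
    default:                     # applied when a container declares no limits
      cpu: 500m
      memory: 256Mi
    max:
      memory: 1Gi
    maxLimitRequestRatio:        # caps the overcommit level (limits / requests)
      memory: "4"
```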

### Capacity Planning
- On nonproduction clusters: mostly `Best-Effort` and `Burstable` QoS
- On production clusters: only `Guaranteed` and some `Burstable`

## Chapter 3: Declarative Deployment
### Problem
- Have something that manages Pods: restarts them, upgrades their version, or rolls them back
### Solution
- A container that covers the following two areas well can be shut down and replaced cleanly:
- Listens and honors lifecycle events (such as SIGTERM)
- Provides health-check endpoints
- Use `kubectl rollout` for performing update, restart and rollback events
### Rolling Deployment
- Zero downtime
- Important parameters of the `RollingUpdate` strategy: `maxSurge`, `maxUnavailable`, and `minReadySeconds` (see the sketch below)
- Declarative: shows how the deployed state should look rather than the steps to get there
- **Downside**: two versions run simultaneously during the update, which may cause conflicts
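- A sketch of the three parameters on a Deployment (app name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app                       # hypothetical app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  minReadySeconds: 10                 # a new Pod must stay ready this long before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                     # at most one Pod above the desired count
      maxUnavailable: 1               # at most one Pod may be unavailable during the update
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: app
        image: example.com/app:1.0    # placeholder image
```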
### Fixed Deployment
- Side effect of `Rolling`: two versions running at once, which may be harmful to service consumers, especially if the changes are backward incompatible
- Use the `Recreate` strategy instead -> effectively sets `maxUnavailable` to the number of replicas
- **Downside**: downtime happens
### Blue-Green Release
- Deployment can be used together with other K8S primitives to implement this advanced strategy
- Can be implemented by creating two Deployments and then switching the Service endpoint
- **Downside**: double the capacity required, database state may drift,...
### Canary Release
- Can be implemented by creating a second Deployment with a small number of replicas

### Discussion
- The Kubernetes Deployment doesn't support advanced deployment strategies like Canary or Blue-Green
- We can manage them with scripts, but this is imperative and doesn't match the declarative nature of K8S
- **Operators** fill this gap; many platforms implement these deployments: Flagger (Flux CD), Argo Rollouts (Argo CD), Knative
## Chapter 4: Health Probe
### Problem
- K8S checks for process status -> insufficient to determine the health of an application
### Solution
- Restart if health checks failed
### Process Health Checks
- The simplest health check: the Kubelet constantly checks that the container processes are running
### Liveness Probes
- Performed by the Kubelet; there are four kinds of probe actions
- *HTTP probe*: HTTP GET to the container IP address and expects a status code between 200 and 399
- *TCP Socket probe*: Successful TCP connection
- *Exec probe*: execute command in the container and expects a successful exit code (0)
- *gRPC probe*: gRPC support for health checks
- Other parameters:
- initialDelaySeconds
- periodSeconds
- timeoutSeconds
- failureThreshold
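- A minimal liveness probe sketch using these parameters (image and health path are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-demo
spec:
  containers:
  - name: app
    image: example.com/app:1.0    # placeholder image
    livenessProbe:
      httpGet:                    # HTTP probe: 200-399 counts as healthy
        path: /healthz            # hypothetical health endpoint
        port: 8080
      initialDelaySeconds: 30     # wait before the first probe
      periodSeconds: 10           # probe interval
      timeoutSeconds: 1           # per-probe timeout
      failureThreshold: 3         # consecutive failures before the container is restarted
```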
### Readiness Probes
- Options and probe actions are the same as liveness probe
### Startup Probes
- When applications take a long time to start (Jakarta EE, for example), use `startup probes`
- Liveness and readiness probes are called only after the startup probe succeeds
- A successful startup is often indicated by a marker file (see the sketch below)
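- A sketch of a startup probe checking a hypothetical marker file; the liveness probe takes over only after it succeeds:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-starter
spec:
  containers:
  - name: app
    image: example.com/jee-app:1.0                        # placeholder for a slow-starting app
    startupProbe:
      exec:
        command: ["sh", "-c", "test -f /tmp/app-started"] # hypothetical marker file
      periodSeconds: 10
      failureThreshold: 30                                # allows up to ~300s for startup
    livenessProbe:
      httpGet:
        path: /healthz                                    # hypothetical endpoint
        port: 8080
```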

### Discussion
- Applications should implement APIs for health checks, e.g: /healthz
## Chapter 5: Managed Lifecycle
### Problem
- The platform issues commands to the applications and expects them to react.
### Solution
### SIGTERM Signal
- K8S sends a SIGTERM signal to containers of Pods that have failed liveness probes, need to be shut down, etc.
- The application should shut down as soon as possible; some need more time to clean things up
### SIGKILL Signal
- If a container does not shut down after SIGTERM, it is killed forcibly with SIGKILL, 30 seconds after SIGTERM by default
- The grace period can be changed with `terminationGracePeriodSeconds`
### PostStart Hook
- Runs asynchronously with the container's main process
- Can prevent a container from starting up when the Pod does not fulfill certain preconditions
- Supports two action types: `exec` and `httpGet`
- Be aware of the asynchronous start and possible duplicate execution; no retry attempts are performed
### PreStop Hook
- Should be used to initiate a graceful shutdown of the container when reacting to SIGTERM is not possible (see the sketch below)
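- A sketch combining both hooks and a longer grace period (paths and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  terminationGracePeriodSeconds: 60   # time between SIGTERM and SIGKILL (default is 30s)
  containers:
  - name: app
    image: example.com/app:1.0        # placeholder image
    lifecycle:
      postStart:
        exec:                         # runs in parallel with the container's main process
          command: ["sh", "-c", "echo started > /tmp/poststart"]
      preStop:
        httpGet:                      # called before SIGTERM is sent to the container
          path: /shutdown             # hypothetical graceful-shutdown endpoint
          port: 8080
```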
### Other Lifecycle Controls

- Wrapping entrypoint

## Chapter 6: Automated Placement
- Assigning new Pods to nodes that match resource requests and honor scheduling policies
### Problem
- Many factors for placing pods to nodes
### Solution
- Assigning pods to nodes is done by the scheduler
- For the scheduler to do its job correctly and allow declarative placement, it needs nodes with available capacity, containers with declared resource profiles, and guiding policies in place
### Available Node Resources

### Scheduler Configurations
- Newer versions of K8S (since 1.23) moved to scheduling profiles for configuring scheduling priorities

- By default, the scheduler uses the `default-scheduler` profile with default plugins
- It's possible to define multiple schedulers, each with multiple scheduler profiles
- Selected on a Pod with `.spec.schedulerName`
### Scheduling Process

- Can use `nodeSelector` to assign Pods to nodes with specific labels (see the sketch below)
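- A minimal `nodeSelector` sketch (the node must carry exactly the hypothetical label below):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-workload
spec:
  nodeSelector:
    disktype: ssd                # hypothetical node label
  containers:
  - name: app
    image: example.com/app:1.0   # placeholder image
```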
### Node affinity


### Pod Affinity and Anti-Affinity
- Pod affinity is a more powerful way of scheduling and should be used when `nodeSelector` is not enough
- Pod affinity places a Pod close to already running Pods; Pod anti-affinity places it away from them
- Use Pod affinity and anti-affinity to express high availability or to pack and colocate Pods together (see the sketch below)
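- A sketch of hard Pod anti-affinity spreading two replicas over distinct nodes (labels and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ha-app                        # hypothetical app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ha-app
  template:
    metadata:
      labels:
        app: ha-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname   # never co-locate two replicas on one node
            labelSelector:
              matchLabels:
                app: ha-app
      containers:
      - name: app
        image: example.com/app:1.0                # placeholder image
```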


### Topology Spread Constraints
- `Pod affinity` and `anti-affinity` act on a single topology; `topology spread constraints` help distribute Pods evenly across the cluster -> better utilization or higher availability
- Example: with hard Pod anti-affinity and 2 replicas on 2 nodes, a rolling update can't place the new Pod -> would have to fall back to the `Recreate` strategy (see the sketch below)
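- A sketch of a soft topology spread constraint distributing Pods across zones (label and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-demo
  labels:
    app: spread-demo
spec:
  topologySpreadConstraints:
  - maxSkew: 1                                 # max allowed difference in Pod count between zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway          # soft; use DoNotSchedule for a hard constraint
    labelSelector:
      matchLabels:
        app: spread-demo
  containers:
  - name: app
    image: example.com/app:1.0                 # placeholder image
```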

### Deschedulers
- Evicts Pods so they can be recreated and the nodes' resources are better utilized
- Except these kinds of Pods:
    - Node- or cluster-critical Pods
    - Pods not managed by a ReplicaSet, Deployment, or Job (they would not be recreated)
    - Pods managed by a DaemonSet
    - Pods that have local storage
    - Pods with a PodDisruptionBudget, where eviction would violate its rules
    - Pods that have a non-nil DeletionTimestamp field set
    - The descheduler Pod itself (achieved by marking itself as a critical Pod)
- Evictions respect QoS: Best-Effort -> Burstable -> Guaranteed
# Part 2: Behavioral Patterns
- Focus on communications and interactions between the Pods and the managing platform
## Chapter 7: Batch Job
### Problem
- Sometimes, there is a need to perform a predefined finite unit of work reliably and then shut down the container -> `Job`
### Solution

- Based on `completions` and `parallelism` parameters, there are many types of Jobs:
    - *Single Pod Jobs*: both fields unset -> default to 1; one Pod is started
    - *Fixed completion count Jobs*: set `completions`; the other field can be set or left unset
    - *Work queue Jobs*: `completions` unset, `parallelism` greater than one
    - *Indexed Jobs*: like work queue Jobs but with indexes; set `.spec.completionMode: Indexed` (see the sketch below)
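- A sketch of an indexed, fixed-completion-count Job (name and image are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: file-processor            # hypothetical Job
spec:
  completions: 5                  # five Pods must finish successfully
  parallelism: 2                  # at most two Pods run at the same time
  completionMode: Indexed         # each Pod gets JOB_COMPLETION_INDEX 0..4
  template:
    spec:
      restartPolicy: OnFailure    # Jobs require OnFailure or Never
      containers:
      - name: worker
        image: example.com/batch-worker:1.0   # placeholder image
```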
## Chapter 8: Periodic Job
### Problem
- Need a periodic task done
### Solution

- Other attributes
    - `.spec.startingDeadlineSeconds`: how long after the scheduled time the Job may still be started if it missed its schedule
    - `.spec.concurrencyPolicy`
    - `.spec.suspend`: suspends all subsequent executions
    - `.spec.successfulJobsHistoryLimit` and `.spec.failedJobsHistoryLimit`: how many finished Jobs should be kept for auditing (see the sketch below)
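- A CronJob sketch using these attributes (name, schedule, and image are placeholders):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report                # hypothetical job
spec:
  schedule: "0 3 * * *"               # every day at 03:00
  startingDeadlineSeconds: 300        # skip the run if it cannot start within 5 minutes
  concurrencyPolicy: Forbid           # don't start a new run while the previous one is active
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: report
            image: example.com/report-generator:1.0   # placeholder image
```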
### Discussion
## Chapter 9: Daemon Service (skip)
## Chapter 10: Singleton Service
- Only one instance of an application is active at a time and is highly available
### Problem
- Some services must run as only one instance at a time, e.g., a database
### Solution
- `active-active` topology: multiple instances of a service are active
- `active-passive` topology: one instance is active and all other instances are passive
- *out-of-application* locking
- *in-application* locking
### Out-of-Application Locking
- The application is not aware of the singleton instance

- Singleton: replicas = 1; highly available: StatefulSet or ReplicaSet
- ReplicaSet: guarantees at least one, but not at most one, replica; for various reasons there may temporarily be more Pods running
- We should use a StatefulSet for singletons (although it's not exactly `active-passive`)
- Better network performance: headless Service, no proxy
- Strict singleton: at most 1 instance
### In-Application Locking

- The application knows that it's singleton, and it stops the instantiation of multiple instances of the same process, regardless of the number of Pod instances
- An implementation with Dapr, ZooKeeper, etcd, or any other distributed lock implementation would be similar to the one described: only one instance of the application becomes the leader and activates itself, and other instances are passive and wait for the lock. This ensures that even if multiple Pod replicas are started and all are healthy, up, and running, only one service is active and performs the business functionality as a singleton, and other instances wait to acquire the lock in case the leader fails or shuts down.
### Pod Disruption Budget
- `PodDisruptionBudget`: limits the number of instances that are simultaneously down for maintenance
- `PodDisruptionBudget` ensures a certain number or percentage of Pods will not be voluntarily evicted from a node (node draining, scale-down, maintenance, upgrades,...)

- Only one of `minAvailable` or `maxUnavailable` may be specified in a PodDisruptionBudget
- A PodDisruptionBudget is useful in the context of singletons too. For example, setting `maxUnavailable` to 0 or setting `minAvailable` to 100% will prevent any voluntary eviction. Setting voluntary evictions to zero turns the workload into an unevictable Pod and prevents draining the node forever (see the sketch below)
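- A minimal PDB sketch (label is hypothetical); only one of the two fields may be used:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2            # or maxUnavailable; setting maxUnavailable: 0 blocks voluntary evictions
  selector:
    matchLabels:
      app: ha-app            # hypothetical Pod label
```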
### Discussion
- For a strong singleton: In-Application locking > Out-of-Application locking
## Chapter 11: Stateless Service
- Stateless application, best suited for dynamic cloud environments, can be rapidly scaled and made highly available
### Problem
- Stateless services: can be created, scaled and destroyed without side effects
- Has no stored knowledge of or reference to past requests
- State is stored in a DB, message queue, mounted filesystem,...
### Solution
- Deployment is the recommended user-facing abstraction for creating and updating stateless applications
### Instances
- ReplicaSet and Deployment can automate the lifecycle management of stateless applications
- ReplicaSet: maintains replicas, scale in/scale out
- Deployment: upgrades, rollbacks,...
### Networking
- Service: load balancing traffic to stateless pods
### Storage
- Share the same PVC
- Some accessModes characteristics:
- ReadWriteOnce: single node, one or multiple pods
- ReadOnlyMany: multiple nodes, read only
- ReadWriteMany: mounted by many nodes, read and writes
    - ReadWriteOncePod: only a single Pod has access -> turns the service into a singleton and prevents scaling out
### Discussion
## Chapter 12: Stateful Service
### Problem
- Deployment or ReplicaSet with 1 replica does not guarantee At-Most-Once semantics
### Storage
- Attaching to the same PV -> conflicts; can be resolved with an in-app solution like separating data into subfolders -> error-prone -> data loss during scaling
- A separate ReplicaSet for every instance -> scaling up requires creating a new ReplicaSet
### Networking
- Requires stable network identity
- Workaround: a ReplicaSet with replicas = 1 -> the hostname changes on every restart, and the service is not aware of where it is accessed from
### Identity
- Requires a unique identity -> a ReplicaSet assigns random names to its Pods
### Ordinality
- The instances of clustered stateful applications have a fixed position in the collection of instances -> used in scaling, distribution of access, locks, singleton, leaders.
### Other Requirements
### Solution
### Storage
- A StatefulSet uses `volumeClaimTemplates` to create a dedicated PVC for every Pod on the fly during Pod creation
- Scaling up -> creates new Pods and PVCs
- Scaling down -> doesn't delete PVCs -> PVCs and PVs cannot be recycled or deleted automatically, and K8S cannot free the resources -> prevents data loss
### Networking
- Each Pod has a stable identity generated from the StatefulSet's name and an ordinal index (starting from 0); see the sketch below
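- A sketch of a StatefulSet with its headless Service and per-Pod PVCs (names and image are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db                         # headless Service: gives db-0, db-1, ... stable DNS names
spec:
  clusterIP: None
  selector:
    app: db
  ports:
  - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db                  # must reference the headless Service
  replicas: 2                      # Pods are named db-0 and db-1
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: example.com/db:1.0  # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/data
  volumeClaimTemplates:            # every Pod gets its own PVC (data-db-0, data-db-1, ...)
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```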


### Identity
- A predictable Pod name and identity is generated based on StatefulSet's name
### Ordinality
- During scaling up and down, the ordinal sequence goes up and down accordingly
### Other Features
- Partitioned updates: using `.spec.updateStrategy.rollingUpdate.partition`, all Pods with an index greater than or equal to the partition are updated -> enables canary releases
- Parallel deployments: set `.spec.podManagementPolicy` to Parallel; instead of sequential launching and destroying, it happens in parallel
- At-Most-One guarantee: a Pod is not started again unless the old instance is confirmed to be shut down completely
### Discussion
- StatefulSets are good, but stateful applications are far more unique and complex -> Controllers and Operators
## Chapter 13: Service Discovery
- This pattern provides a stable endpoint through which consumers of a service can access the instances providing the service
### Problem
- Service consumers need a mechanism for discovering Pods that are dynamically placed by the scheduler and sometimes elastically scaled up and down
### Solution
- Before K8S: client-side discovery; the service consumer looks up service instances in a registry and then chooses one to call

- In K8S: all that happens behind the scenes

- There are many mechanisms to implement this pattern, depending on whether the service consumer and the service provider are within or outside the cluster
### Internal Service Discovery
- Create a Service as a stable entry point
- This Service keeps an unchanged ClusterIP as long as it exists
- How can other applications within the cluster figure out what this dynamically allocated ClusterIP is?
- Discovery through environment variables:
        - Mechanism: the Service's host and port are injected into the environment of all Pods
        - Drawback: only Pods **created after** the Service get these env vars (a Pod needs to be restarted to pick up the addresses of newly created Services)
- Discovery through DNS lookup:
        - Mechanism: reach the Service by its FQDN, e.g., `test-svc.default.svc.cluster.local`
- Characteristics of the Service with type ClusterIP
        - Multiple port exposure
        - Session affinity: `sessionAffinity: ClientIP` creates a sticky session for the same client IP
        - Readiness probes: a Pod's endpoint is removed if it is not ready
        - Virtual IP: the Service IP is virtual; kube-proxy catches packets for this IP and replaces it with a selected Pod IP
        - ClusterIP: a specific cluster IP (within the range) can be requested for the Service, handy for legacy services that use predefined IP addresses
### Manual Service Discovery
- Specify external services


- ExternalName Services, using DNS only: `database-service.<namespace>.svc.cluster.local` now points to `my.database.example.com`. This is a way to create an alias to an external endpoint (a DNS CNAME); see the sketch below
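- A sketch of both manual-discovery variants (IPs and hostnames are placeholders):

```yaml
# Service without a selector plus manually managed Endpoints (external IP-based service)
apiVersion: v1
kind: Service
metadata:
  name: external-db
spec:
  ports:
  - port: 5432
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-db                 # must match the Service name
subsets:
- addresses:
  - ip: 192.0.2.10                  # placeholder IP of the external service
  ports:
  - port: 5432
---
# ExternalName Service: a pure DNS CNAME alias, no proxying involved
apiVersion: v1
kind: Service
metadata:
  name: database-service
spec:
  type: ExternalName
  externalName: my.database.example.com
```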

### Service Discovery from Outside the Cluster
- Type: NodePort


- Characteristics:
    - Port number: specific or randomly chosen by Kubernetes
    - Firewall rules: configure the firewall to allow the node ports
    - Node selection: every node accepts traffic on the port
    - Pod selection: with `externalTrafficPolicy: Local`, traffic is routed only to Pods on the receiving node (reduces latency), but you have to make sure that node has a Pod running on it
    - Source address: client IPs are NAT'd; the source IP becomes an internal node IP when packets are forwarded to **Pods on another node**. Use `externalTrafficPolicy: Local` to prevent this
- Type: LoadBalancer, using a cloud provider's load balancer
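- A NodePort sketch with `externalTrafficPolicy: Local` (port number and label are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: app-nodeport
spec:
  type: NodePort
  selector:
    app: web-app                   # hypothetical Pod label
  ports:
  - port: 8080                     # ClusterIP port
    targetPort: 8080               # container port
    nodePort: 30036                # optional; omit to let Kubernetes pick one (30000-32767 by default)
  externalTrafficPolicy: Local     # route only to Pods on the receiving node, preserving the client IP
```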


### Application Layer Service Discovery
- An Ingress sits in front of Services and acts as a smart router
- Reuses a single external load balancer and IP to serve multiple Services (see the sketch below)
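- A minimal Ingress sketch (host, class, and backend Service are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  ingressClassName: nginx          # hypothetical ingress controller class
  rules:
  - host: app.example.com          # hypothetical host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-service      # hypothetical backend Service
            port:
              number: 8080
```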

### Discussion

## Chapter 14: Self Awareness
- Some applications need to be self-aware and require information about themselves
### Problem
- These applications need to know the Pod name, Pod IP, resource limits and requests,...
- For those needs, K8S provides the `Downward API`
### Solution
- Similar to metadata for AWS EC2
- The Downward API allows passing this information to the container through environment variables and files




- Unless the Pod is restarted, the environment variables for labels and annotations will not reflect changes; `downwardAPI` volumes, however, do reflect them (see the sketch below)
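- A Downward API sketch combining env vars and a volume (image is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: self-aware
  labels:
    app: self-aware
spec:
  containers:
  - name: app
    image: example.com/app:1.0     # placeholder image
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
    - name: MEMORY_LIMIT
      valueFrom:
        resourceFieldRef:
          containerName: app
          resource: limits.memory
    volumeMounts:
    - name: pod-info
      mountPath: /etc/pod-info     # labels/annotations exposed as files, updated on change
  volumes:
  - name: pod-info
    downwardAPI:
      items:
      - path: labels
        fieldRef:
          fieldPath: metadata.labels
      - path: annotations
        fieldRef:
          fieldPath: metadata.annotations
```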

### Discussion
- The Downward API exposes only a limited set of keys; if applications need more data, they have to query the API server
# Part 3: Structural Patterns
## Chapter 15: Init Container
- Initialization logic
### Problem
- Similar to constructor of classes in programming languages
- Tasks like: permissions setup on filesystem, database schema setup, application seed data
### Solution
- All init containers are executed in sequence, one by one; each must terminate successfully before the application containers are started

- The effective Pod-level request and limit values become the higher of the two:
    - The highest init container request/limit value
    - The sum of all application container requests/limits


- Use sleep to debug for init containers
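- A sketch of an init container gating the application start on a hypothetical dependency:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  initContainers:                  # run sequentially; each must succeed before the app container starts
  - name: wait-for-db
    image: busybox
    command: ["sh", "-c", "until nslookup my-db; do echo waiting; sleep 2; done"]  # my-db is hypothetical
  containers:
  - name: app
    image: example.com/app:1.0     # placeholder image
```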
### More Initialization Techniques
- Admission controllers: a set of plugins that intercept every request to the K8S API server before the object is persisted and can mutate or validate it
- Admission webhooks: a `mutating webhook` can change resources to enforce custom defaults; a `validating webhook` can reject resources to enforce custom admission policies
- Init at creation time
- In the end, the most significant difference is that init containers can be used by developers deploying on Kubernetes, whereas admission webhooks help administrators and various frameworks control and alter the container initialization process.
## Chapter 16: Sidecar
- Allows single-purpose containers to cooperate closely together
### Problem
- Collaboration between containers
### Solution


- 2 approaches for using sidecars:
    - Transparent sidecars: invisible to the application; example: Istio's `Envoy` proxy
    - Explicit sidecars: the main application interacts with them over well-defined APIs
## Chapter 17: Adapter
- The Adapter pattern takes a heterogeneous containerized system and makes it conform to a consistent, unified interface with a standardized and normalized format that can be consumed by the outside world
- Inherits from Sidecar pattern but has the single purpose of providing adapted access to the application
### Problem
- The differences between multiple containerized components cause difficulties when all components have to be treated in a unified way by other systems
### Solution
- Example: multiple services need to be monitored and alerted on via metrics, but this is hard because they all differ. With the Adapter pattern, an adapter container exports metrics from the various application containers in one standard format and protocol

- A different adapter container for each Pod exports standardized metrics to the monitoring system


- Another use case: logging
### Discussion
- Adapter is a specialization of the Sidecar pattern
- It acts as a reverse proxy to a heterogeneous system by hiding its complexity behind a unified interface
## Chapter 18: Ambassador
- The Ambassador pattern is a specialized sidecar, reverse to the Adapter pattern, it hides external complexities and provides a unified interface for accessing services outside the Pod
### Problem
- Consuming an external service may require a special service-discovery library, or we may want to swap between different kinds of services by using different service-discovery libraries and methods
### Solution
- In a production environment, the application may need to access a cache that is split into different shards



### Discussion
- It is sometimes called the Proxy pattern
# Part IV: Configuration Patterns
## Chapter 19: EnvVar Configuration
### Problem
- Separate configuration from application code so that it can be changed easily
### Solution


### Discussion
- Environment variables are simple to use, but are applicable mainly for simple use cases and have limitations for complex configuration requirements. The next patterns show how to overcome those limitations.
## Chapter 20: Configuration Resource
### Problem
- It is better to keep all the configuration data in a single place and not scattered around in various resource definition files
### Solution
- Configmap and Secret objects
- For file-based config, if the application supports hot reload, changes are reflected; env vars, however, stay fixed for the life of the process


- Immutable ConfigMaps and Secrets: the only way to change them is to delete and recreate them; this also improves performance because the Kubelet no longer needs to watch them for changes, reducing load on the API server (see the sketch below)
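- A sketch of an immutable ConfigMap consumed both as env vars and as a mounted volume (names are placeholders):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config                 # hypothetical name
data:
  LOG_LEVEL: info
  app.properties: |
    greeting=Hello
immutable: true                    # must be deleted and recreated to change
---
apiVersion: v1
kind: Pod
metadata:
  name: config-demo
spec:
  containers:
  - name: app
    image: example.com/app:1.0     # placeholder image
    envFrom:
    - configMapRef:
        name: app-config           # env vars are fixed at container start
    volumeMounts:
    - name: config
      mountPath: /etc/config       # files would be refreshed on change for a mutable ConfigMap
  volumes:
  - name: config
    configMap:
      name: app-config
```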

### How Secure Are Secrets?
- A Secret is distributed only to nodes running Pods that need access to the Secret
- On the nodes, Secrets are stored in memory in a tmpfs and never written to physical storage, and they are removed when the Pod is removed
- In etcd, the backend storage for the Kubernetes API, Secrets can be stored in encrypted form
- **Note**: A user or a controller with Pod-creation access in a namespace can impersonate any service account and access all Secrets and ConfigMaps in that namespace
### Discussion
- The most significant advantage of using ConfigMaps and Secrets is that they decouple the definition of configuration data from its usage
- Limitations:
    - A Secret has a 1 MB size limit
    - The number of ConfigMaps can be restricted by a quota
## Chapter 21: Immutable Configuration
- Offers two ways to make configuration data immutable so that the application's config is in a well-known and recorded state
### Problem
- When configuration via the EnvVar Configuration pattern exceeds a certain threshold, it becomes hard to maintain
- This can be addressed with immutable Configuration Resources, but those have size limitations
### Solution
- We can put all environment specific configuration data into a single, passive data image that we can distribute as a regular container image
- The application is linked to these data images at runtime; the images hold all configuration information and can be versioned like container images
### Docker Volumes

### Kubernetes Init Containers
- We can use an init container to populate an empty shared volume during startup
- The emptyDir volume is shared between the two containers; the init container copies the config files from its image into the shared volume, which the main container can then access



- To deploy this to production, only the data image has to be changed
### OpenShift Templates

### Discussion
- Data containers have some unique advantages:
- Environment-specific configuration is sealed within a container
- Configuration created this way can be distributed over a container registry. The configuration can be examined even without accessing the cluster.
- The configuration is immutable
- Configuration data images are useful when the configuration data is too complex to put into environment variables or ConfigMaps, since it can hold arbitrarily large configuration data.
- Drawbacks:
- It has higher complexity
- It does not address any of the security concerns around sensitive configuration data.
- The technique described here is still limited for use cases where the overhead of copying over data from init containers to a local volume is acceptable
- Extra init container processing is required in the Kubernetes case, and hence we need to manage different Deployment objects for different environments.
## Chapter 22: Configuration Template
### Problem
- Problems arise when the configuration data is large: it may be parsed incorrectly because of embedded quotes, or exceed the 1 MB limit of ConfigMaps and Secrets (a limit imposed by the underlying backend store, etcd)
- Different environments -> slightly different configuration; creating many config objects leads to duplication and redundancy
### Solution
- Store only the differing configuration values, like DB connection parameters
- Tools like Tiller (Ruby) or Gomplate (Go) for processing templates

- The startup process:
    - The init container starts and runs the template processor; the processor takes the templates from its image and the template parameters from a mounted volume, and stores the result in an emptyDir volume
    - The application container starts up and loads the config files from the emptyDir volume
- Example
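- A hypothetical sketch using gomplate as the template processor; the image, ConfigMap names, and paths are assumptions, and the exact flags should be checked against the gomplate docs:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: templated-config
spec:
  initContainers:
  - name: render-config
    image: example.com/config-templates:1.0   # hypothetical image: gomplate plus the template files
    command: ["gomplate",
              "--input-dir", "/templates",    # templates baked into the init container image
              "--output-dir", "/config",
              "-d", "params=file:///params/params.yaml"]
    volumeMounts:
    - name: params
      mountPath: /params
    - name: config
      mountPath: /config
  containers:
  - name: app
    image: example.com/app:1.0                # placeholder image
    volumeMounts:
    - name: config
      mountPath: /etc/config                  # rendered, environment-specific configuration
  volumes:
  - name: params
    configMap:
      name: dev-params                        # per-environment parameters; swap for prod
  - name: config
    emptyDir: {}
```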




- When changing the environment, only a new ConfigMap holding the differing parameters needs to be created.
- To debug: check the directory `/var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/` on the node, as it contains empty dir volume
### Discussion
# Security Patterns

## Chapter 23 Process Containment
- Apply the principle of least privilege to constrain a process to the minimum privileges it needs to run
### Problem
- Primary attack vectors for K8S -> through application code
- There are tools to scan for code security, dependencies
- The containers are also scanned, this is usually done by checking the base image and all its packages
- Regardless of how many checks, new code and new dependencies can introduce new vulnerabilities
-> Applying the least-privilege principle turns K8S configuration into another line of defense
### Solution
- Security context configurations exist at the Pod and container level. Each level can have its own config; container-level config takes precedence over the Pod-level config
### Running Containers with a Non-Root User
- The user and group of a container are used to control access to files, directories, and volume mounts

- Check the UID and GID in the container image to ensure the right permissions
- Alternatively, set the `.spec.securityContext.runAsNonRoot` flag to true
- Prevent privilege escalation (`sudo`) by setting `.spec.containers[].securityContext.allowPrivilegeEscalation` to false (see the sketch below)
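- A sketch of the non-root settings at both levels (UID/GID values and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-demo
spec:
  securityContext:                      # Pod-level defaults
    runAsNonRoot: true                  # refuse to start containers that would run as UID 0
    runAsUser: 1000                     # hypothetical non-root UID that exists in the image
    runAsGroup: 3000
  containers:
  - name: app
    image: example.com/app:1.0          # placeholder image
    securityContext:                    # container level overrides the Pod level
      allowPrivilegeEscalation: false   # no setuid/sudo-style escalation
```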
### Restricting Container Capabilities
- A container is a process running on a node, so it can have the same privileges a process can have

### Avoiding a Mutable Container Filesystem
- Limit the attack surface by using a read-only container filesystem
- Set `.spec.containers[].securityContext.readOnlyRootFilesystem` to `true`
- `seccompProfile`: a Linux kernel feature that limits the process running in a container to a subset of the available system calls
- `seLinuxOptions`: assign custom SELinux labels
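- A sketch combining a read-only root filesystem, dropped capabilities, and a seccomp profile (capability choice and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: contained-demo
spec:
  containers:
  - name: app
    image: example.com/app:1.0            # placeholder image
    securityContext:
      readOnlyRootFilesystem: true        # the container filesystem cannot be modified
      capabilities:
        drop: ["ALL"]                     # start from zero capabilities
        add: ["NET_BIND_SERVICE"]         # add back only what the app actually needs
      seccompProfile:
        type: RuntimeDefault              # restrict the available system calls
    volumeMounts:
    - name: tmp
      mountPath: /tmp                     # writable scratch space provided via a volume instead
  volumes:
  - name: tmp
    emptyDir: {}
```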
### Enforcing Security Policies
- How can we ensure a collection of Pods follows certain security standards?
- Using Pod Security Standards (PSS) and Pod Security Admission (PSA) controller
- PSS defines and PSA enforces these policies, which are grouped into three security profiles (Privileged, Baseline, Restricted)

- PSA replaced PodSecurityPolicy (PSP), which was removed in K8S v1.25 (see the sketch below)
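- PSA is configured per namespace via labels; a sketch for a hypothetical namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                                       # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject Pods that violate the Restricted profile
    pod-security.kubernetes.io/warn: baseline        # additionally warn on Baseline violations
```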

### Discussion
- The tendency of shifting left the security considerations and testing practices, including deploying into Kubernetes with the production security standards, is getting more popular. Such practices help identify and tackle security issues earlier in the development cycle and prevent last-minute surprises.
## Chapter 24 Network Segmentation
### Problem
- Network communication is flat by default; we need to isolate it, for example by namespace
### Solution
- Define NetworkPolicy objects, which work at L3/L4 of the OSI model and create ingress and egress firewall rules for workload Pods
- Use a service mesh for L7, specifically HTTP-based communication
### Network Policies
- User-defined rules are picked up by the CNI; most CNI plugins support NetworkPolicy, except Flannel

- NetworkPolicy objects are namespace-scoped and match only Pods from within the NetworkPolicy's namespace. Cluster-wide policies can't be defined without a third-party plugin like Calico

- A deny-all rule should be applied first (see the sketch below)
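- A sketch of such a default deny-all policy for a hypothetical namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a                     # hypothetical namespace
spec:
  podSelector: {}                       # selects every Pod in the namespace
  policyTypes: ["Ingress", "Egress"]    # with no rules listed, all ingress and egress traffic is denied
```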

- **INGRESS**

- **EGRESS**



- Block IP

### Authentication Policies
- Service mesh: addresses operational requirements like security, observability, or reliability; works by injecting sidecars into Pods that act as ambassadors or adapters


### Discussion
## Chapter 25: Secure Configuration
- The best ways to keep your credentials as secure as possible when running on Kubernetes
### Problem
- Storing secrets outside the cluster, for example in Git for GitOps, is really dangerous if they are not encrypted
### Solution
- The most straightforward solution is to store the credentials encrypted outside the cluster and decrypt them in the application, but that takes effort; there are better ways to implement security
- The support for secure configuration on Kubernetes falls into two categories:
    - Out-of-cluster encryption: the transformation into Kubernetes Secrets happens just before entering the cluster, or inside the cluster by a permanently running operator process
- Centralized secret management: AWS Secrets Manager or Azure KeyVault, HashiCorp Vault,...
### Out-of-Cluster Encryption
- Pick up secret and confidential data from outside the cluster and transform them into K8S Secret
- Sealed Secrets
- External Secrets
- sops
#### Sealed Secrets
- The idea: store the encrypted data in a SealedSecret custom resource
- An operator monitors SealedSecrets and creates a K8S Secret for each of them with the decrypted content
- The decryption happens within the cluster; the encryption happens outside via a CLI tool called `kubeseal` (it takes a Secret and translates it into a SealedSecret -> can be stored in Git)

- Keys are created and stored in the cluster, can be rotated if needed
- 3 scopes of SealedSecret
    - Strict: the Secret can be unsealed only with the same namespace and the same name as the original Secret (default)
    - Namespace-wide: a different name but the same namespace
    - Cluster-wide: a different name and a different namespace

- The private key must be backed up; content can't be decrypted if the operator is uninstalled and the key is lost
- Drawback: a server-side operator must run continuously in the cluster
#### External Secrets
- The main difference from Sealed Secrets: you do not manage the encrypted data yourself but rely on an external SMS (secret management system)
- SMS: encryption, decryption, secure persistence

- Components:
- SecretStore: holds the type and the configuration of the external SMS
- ExternalSecret: references a SecretStore, the operator will create a corresponding K8S Secret filled with the data fetched from the external SMS


- Dominant way to sync and map externally defined secret to a K8S Secret
- Drawback: server side operator continuously running in the cluster
#### Sops
- Works entirely outside the cluster (a client-side solution): Secrets OPerationS
- Allows encrypting and decrypting any YAML or JSON file so it can be stored safely in a source code repository





### Centralized Secret Management
- K8S Secrets created the previous ways can still be read by cluster admins or applications with cluster-wide access
#### Secrets Store CSI Driver
- CSI (Container Storage Interface): K8S API for exposing storage systems to containerized applications
- Secrets Store CSI Driver: allows access to SMSs and mounts secrets as regular K8S volumes -> nothing is stored in the `etcd` database
- Setup:
    - Install the Secrets Store CSI Driver and configure it for a specific SMS (cluster-admin permission required)
    - Configure access rules and policies -> a K8S ServiceAccount is mapped to a secret-manager-specific role that allows access to the secret



#### Pod injection

### Discussion
- Which to choose? Depends on the purpose:
    - A simple way to encrypt Secrets stored in publicly readable places like Git -> sops
    - Secret synchronization -> External Secrets Operator
    - No confidential information stored permanently in the cluster except the access tokens for the SMS -> Secrets Store CSI Driver
    - Shielding applications from direct access to an SMS -> Vault Sidecar Agent Injector
## Chapter 26: Access Control
### Problem
- Authentication and authorization
### Solution

### Authentication
- Bearer tokens with OIDC, client certificates (X.509), an authenticating proxy, static token files, webhook token authentication
- The order in which these methods are tried is not fixed
### Admission Controllers
- Intercept requests to the API server and take additional actions
### Subject
- All about the `who`, the identity associated with the request

- Users and SAs can be grouped into user groups and SA groups
#### Users


#### Service accounts
- They use OpenID Connect and JWT to authenticate
- Service account name: `system:serviceaccount:<namespace>:<name>`

- Every namespace has a default SA named `default` for Pods that don't have an associated SA
- SA tokens are mounted into Pods

- Before K8S 1.24, SA tokens were stored as Secrets; since then, a token is associated with each running Pod -> better rotation and security

#### Groups

### RBAC



#### Role
- `apiGroups`: "" -> core API group, "*" -> all groups
- `resources`: list of K8S resources
- `verbs`: allowed actions (see the sketch below)
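- A minimal Role sketch (name and namespace are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader                 # hypothetical Role
  namespace: default
rules:
- apiGroups: [""]                  # "" = core API group
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
```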


#### RoleBinding
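- A RoleBinding connects a subject to a Role; a sketch binding a hypothetical ServiceAccount to the Role sketched above:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: ServiceAccount
  name: app-sa                     # hypothetical ServiceAccount
  namespace: default
roleRef:
  kind: Role
  name: pod-reader                 # the Role sketched above
  apiGroup: rbac.authorization.k8s.io
```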


#### ClusterRole
- Similar to a Role but applied cluster-wide



#### ClusterRoleBinding


# PART VI: Advanced Patterns
## Chapter 27: Controller
- Actively monitors and maintains a set of K8S resources in a desired state
- Kubernetes itself consists of a fleet of controllers
### Problem
- Changes in resource status -> controllers create events and broadcast them to listeners
- K8S represents a distributed state manager
### Solution
- ReplicaSets, DaemonSets, StatefulSets, Deployments, and Services are backed by controllers
- Custom controllers can monitor and react to state-changing events


- Reconciliation components:
    - Controllers: a reconciliation process that monitors and acts on standard K8S resources; they enhance platform behavior and add new platform features
    - Operators: interact with CRDs, encapsulate complex application domain logic, and manage the application lifecycle
- Controllers use the Singleton pattern -> only one controller instance watches at a time (a Deployment with one replica)
- Places to store controller data: labels, annotations, ConfigMaps
- The example: a controller watches ConfigMaps for changes; if a ConfigMap has the deletePodSelector annotation, the controller uses that selector to delete the matching Pods (useful for applications that need a restart to reload environment variables)




## Chapter 28: Operator
- Is a controller that uses a CRD
### Problem
- Controllers are limited to watching and managing K8S primitive resources only
- CRD + controller of them = Operator
*An operator is a Kubernetes controller that understands two domains: Kubernetes and something else. By combining knowledge of both areas, it can automate tasks that usually require a human operator that understands both domains.*
### Solution
### Custom Resources Definitions
- CRDs are managed like any other resources: through the K8S API and stored in etcd

- An OpenAPI v3 schema allows K8S to validate the custom resources (should be used in production)
    - `subresources`
        - `scale`: lets the CRD declare how it manages replicas
        - `status`: allows updating the status of the resource via the API (see the sketch below)
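- A minimal CRD sketch for the ConfigWatcher example used later; the group and field names are hypothetical:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: configwatchers.example.com     # must be <plural>.<group>
spec:
  group: example.com                   # hypothetical API group
  scope: Namespaced
  names:
    kind: ConfigWatcher
    plural: configwatchers
    singular: configwatcher
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:                 # validation schema, recommended for production
        type: object
        properties:
          spec:
            type: object
            properties:
              configMap:
                type: string
              podSelector:
                type: object
                additionalProperties:
                  type: string
    subresources:
      status: {}                       # lets the operator update .status separately
```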


### Controller and Operator Classification
- Based on the operator's action, classifications:
- Installation CRDs: for installing and operating applications on K8S. Typical: Prometheus CRDs -> installing and managing Prometheus
- Application CRDs: allows applications deep integration with K8S. Example: ServiceMonitor CRD: used by Prometheus operator to register specific K8S Services to be scraped by a Prometheus server

- Custom API server

### Operator Development and Deployment
- Tools, frameworks:
- Kubebuilder
- Operator Framework
- Metacontroller
### Examples
- ConfigWatcher: a custom resource that references a ConfigMap and specifies which Pods to restart if the ConfigMap changes



## Chapter 29: Elastic Scale
- Perform scaling based on load automatically
### Problem
- Vertical and horizontal scaling for load adapting
### Solution
- Horizontal: increase replicas
- Vertical: increase the resource for containers
### Manual Horizontal Scaling
- Imperative: `kubectl scale`
- Declarative: Apply the new manifest with increased number of replicas
### Horizontal Pod Autoscaling
#### Kubernetes HorizontalPodAutoscaler
- Utilization-based scaling is calculated against the containers' `.spec.resources.requests`


- Type of metrics:
    - Standard metrics: such as CPU or memory; type `Resource`
    - Custom metrics: type `Object` or `Pods`; needs an advanced monitoring setup, provided by different metrics adapters
    - External metrics: for resources that are not part of the cluster. Only one external metrics endpoint can be hooked into the K8S API server; to use many different external systems, use KEDA
- Scaling behavior

- The HPA lacks scale-to-zero for stopping all Pods (see the sketch below)
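- An HPA sketch using the standard CPU metric (the target Deployment is hypothetical):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app                  # hypothetical Deployment
  minReplicas: 1                   # at least 1; no scale-to-zero
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50     # percentage of the declared CPU requests
```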
#### Knative
- Support autoscaling based on HTTP traffic
- Features
- Knative Service: simplified deployment model, supports scale-to-zero, traffic-splitting
- Knative Eventing: EventMesh
- Knative Functions: AWS-Lambda like
- Knative Pod Autoscaler (KPA)



#### KEDA
- Pull-based approach that scales on external metrics from different systems

- Scaling algorithm:
- 0-1: KEDA operator manages
- 1-n: HPA manages, based on external metrics from KEDA

- Custom scalers: creating external service that communicates with KEDA over gRPC-based API
#### Summary

### Vertical Pod Autoscaling (experimental)
- Horizontal is less disruptive
- For stateful services, vertical scaling may be preferred
- VPA: automating the process of adjusting and allocating resources based on real-world usage


- How VPA works with the different kinds of `updateMode` (see the sketch below)
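- A VPA sketch, assuming the VPA CRDs are installed in the cluster (the target Deployment is hypothetical):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app               # hypothetical Deployment
  updatePolicy:
    updateMode: "Off"           # Off = recommendations only; Initial = apply at Pod creation;
                                # Recreate/Auto = evict and recreate Pods with the new resources
```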

- HPA and VPA are not aware of each other and may conflict

### Cluster Autoscaling

### Scaling Levels

#### Application Tuning
- Tune the application to best use the allocated resources, heap, nonheap, thread stack sizes,...
- Container-native applications use start scripts that can calculate good default values for thread counts, and memory sizes for the application based on the allocated container resources rather than the shared full-node capacity. Using such scripts is an excellent first step
#### Vertical Pod Autoscaling
- Setting the right resource requests and limits in the containers
#### Horizontal Pod Autoscaling
- Assuming that you have performed the preceding two methods once for identifying good values for the application setup itself and determined the resource consumption of the container, from there on, you can enable HPA and have the application adapt to shifting resource needs.
#### Cluster autoscaling
- CA can extend the cluster to ensure demanded capacity or shrink it to spare some resources
## Chapter 30: Image Builder
- Technique to build container images within the cluster
### Problem
- Building and running applications in one place can reduce maintenance costs
- The cluster can redeploy every application built on the base image
### Solution

- Categories of tools:
    - Container image builders: create container images within the cluster
    - Build orchestration: triggers container image builders and updates deployments
### Container Image Builder
- Nonprivileged build mode with rootless builds
#### Dockerfile-Based builders
- These builders are based on the Dockerfile format and run in unprivileged mode, some using a background daemon or REST API
- Buildah and Podman: create OCI-compliant images without a Docker daemon
- Kaniko: backbone of Google Cloud Build
- BuildKit: from Docker
#### Multilanguage builders
- Buildpacks: detect the existing project and build it according to its language and technology
- CNB: Cloud Native Buildpacks (CNCF)


#### Specialized builders
- Highly optimized build flow: increases flexibility and decreases build times
- Narrow scopes

### Build Orchestrators
- CI/CD platforms like: Tekton, ArgoCD, Flux,...
- Includes many phases: building, testing, releasing, deploying, scanning,...
- Specialized orchestrators

### Build Pod
- A simple example, instead of CICD solution



### OpenShift Build