# Resource allocation chat 2023-11-16

Attending:

- Erik
- Georgiana

## Agenda

- Overview status
  - Data:
    - Data is available about instance types' allocatable cpu/memory and daemonset-requests
    - Data about instance type max pods capacity is missing but can be added easily
  - Script functions:
    - A primitive, conservative generation strategy with memory requests equal to limits (a rough sketch of this follows the Change drivers section below)
- Overview limitations
- Plan what and how to address the situation

## UX

- constraints: users
- constraints: engineers
- constraints: cost optimizations

---

- general enough to cover various situations
- the right tool for the scenario

## Scope

The `ra` CLI should be general enough to cover various situations:

`ra --instance-type --scale-memory-rl=1 --scale-cpu-rl=8`

## How to conclude what input params we need

- think about scenarios

### General use

- set of instance types allowed
  - [n2-highmem-4, n2-highmem-16]
  - [n2-highmem-16]
- a limit on the max requests

Information:

- Usage:
  - max users
  - average and median users
  - min users

### New setup

What is a good default set of options?

- cost
- startup time
- efficiency

Generate twice, cut and paste.

## Multi-machine type choices

### combined

Challenges:

- a 4G request on a 4 CPU node is different from the same request on a larger node

`--min-users-per-node=1`

- 1 - 4
- 2 - 4
- 4 - 4
- 8 - 4
- 16 - 4
- 32 - 4
- 64 - 16
- 128 - 16

Situation: event, max users 200:

`--min-users-per-node=32 --instance-types=[4cpu, 16cpu, 64cpu]`

- 1 - 16
- 2 - 16
- 4 - 16
- 8 - 64
- 16 - 64

### separated

`--scale-memory-rl=1`

- 1 - 4
- 2 - 4
- 4 - 4
- 8 - 4
- 16 - 4
- 32 - 4

- 1 - 8
- 2 - 8
- 4 - 8
- 8 - 8
- 16 - 8
- 32 - 8
- 64 - 8

- 1 - 16
- 2 - 16
- 4 - 16
- 8 - 16
- 16 - 16
- 32 - 16
- 64 - 16
- 128 - 16

`--increment-factor=2`

- 1 - 64 (max number of pods risk)
- 2 - 64 (max number of pods risk)
- 4 - 64 (max number of pods risk)
- 8 - 64
- 16 - 64
- 32 - 64
- 64 - 64
- 128 - 64
- 256 - 64
- 512 - 64

## Complexities

- small differences in "similar" requests: ~27.7Gi on a 4 CPU node vs ~29.1Gi on a 16 CPU node
- imperfect fit when scheduling pods onto shared nodes
- startup times
  - few users per node cause longer startup times
- cost & efficiency
  - minimize the overhead of unscheduled capacity (few users per node && oversubscription factor of 1)
  - maximize efficiency of the available capacity (many users per node && oversubscription factor of requests/limits greater than 1)
- memory constraints can be observed in two ways
  - limit reached, process killed inside the container
    - typically in this situation, the user's kernel dies
  - requests surpassed (but possibly still below the limit), server (pod on the node) killed !!!
    - the kubelet will choose to kill the pod that exceeds its requests the most, even though that pod might not have been the one that pushed the node over capacity (memory pressure)
    - this is more problematic for users and the hub
- users don't understand their own usage
- tailor-made `allowed_teams`, filtering profile list entries (but not options)
- optimizing cloud cost

## Change drivers

We assume we keep using a single instance type no matter what.

- cpu / memory constrained
  - first grow the resource allocation options if possible
  - increase the size of the single instance type
- startup times slow
- cost high

What could make us change the setup to expose multiple instance types to the large group of users? We can still configure multiple instance types for a few users.
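To make the conservative strategy from the agenda concrete, here is a minimal sketch, assuming made-up allocatable/daemonset numbers and a hypothetical `generate_choices` helper (this is not the actual ra script): it splits a node's usable capacity evenly between users and sets the memory request equal to the limit.

```python
# A minimal sketch of the conservative generation strategy noted in the agenda:
# memory request == limit, node capacity split evenly per user.
# All numbers and names are illustrative -- this is not the real ra script.

GIB = 1024**3


def generate_choices(allocatable_mem, allocatable_cpu,
                     daemonset_mem, daemonset_cpu, users_per_node_options):
    """Return one resource-allocation choice per users-per-node value."""
    usable_mem = allocatable_mem - daemonset_mem   # what user pods can share
    usable_cpu = allocatable_cpu - daemonset_cpu
    choices = []
    for users in users_per_node_options:
        mem = usable_mem // users
        cpu = usable_cpu / users
        choices.append({
            "users_per_node": users,
            "mem_request": mem,   # conservative: request == limit
            "mem_limit": mem,
            "cpu_request": cpu,   # cpu could be scaled separately (--scale-cpu-rl idea)
            "cpu_limit": cpu,
        })
    return choices


if __name__ == "__main__":
    # Placeholder numbers, loosely shaped like a 4 CPU highmem node
    for choice in generate_choices(
        allocatable_mem=29 * GIB,
        allocatable_cpu=3.9,
        daemonset_mem=1 * GIB,
        daemonset_cpu=0.3,
        users_per_node_options=[1, 2, 4, 8, 16, 32],
    ):
        print(choice)
```

The users-per-node values mirror the power-of-two choice lists above; a non-conservative oversubscription factor (requests different from limits) would hook in where the request and limit are computed.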
## Going beyond large-group defaults

Either:

- run the script multiple times to generate separate profiles (see the sketch at the end of these notes)
- run the script once to handle it all

## Erik is currently leaning towards

- the ra script only generating options for individual instance types
- picking a single instance type for common use
- creating a new script instead of re-working the old one
- motivating first with a detailed proposal, then implementing
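To illustrate the direction Erik is leaning towards (one instance type per run, separate profiles), a hypothetical sketch along the same lines as the one above; the instance data, overhead numbers, and output structure are placeholders, not what the real script would emit.

```python
# Hypothetical sketch of the "one instance type per run, separate profiles"
# direction: each run looks at a single instance type and emits its own set
# of conservative (request == limit) choices.
# Instance data, daemonset overhead and output structure are placeholders.

GIB = 1024**3

INSTANCE_TYPES = {
    # name: (allocatable memory, allocatable cpu) -- placeholder numbers
    "n2-highmem-4": (29 * GIB, 3.9),
    "n2-highmem-16": (122 * GIB, 15.8),
}
DAEMONSET_MEM, DAEMONSET_CPU = 1 * GIB, 0.3  # made-up daemonset overhead
USERS_PER_NODE = [1, 2, 4, 8, 16, 32]


def profile_for(instance_type):
    """One profile per run: choices for a single instance type only."""
    mem, cpu = INSTANCE_TYPES[instance_type]
    usable_mem, usable_cpu = mem - DAEMONSET_MEM, cpu - DAEMONSET_CPU
    return {
        "instance_type": instance_type,
        "choices": [
            {
                "users_per_node": n,
                "mem_request": usable_mem // n,
                "mem_limit": usable_mem // n,
                "cpu_request": usable_cpu / n,
                "cpu_limit": usable_cpu / n,
            }
            for n in USERS_PER_NODE
        ],
    }


if __name__ == "__main__":
    # "Run the script multiple times": one independent run per instance type
    for name in INSTANCE_TYPES:
        print(profile_for(name))
```

Running this once per instance type matches the "run the script multiple times to generate separate profiles" option; as with "generate twice, cut and paste", the output would still be copied into the hub's profile list by hand.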