# Resource allocation chat 2023-11-16
Attending:
- Erik
- Georgiana
## Agenda
- Overview status
- Data:
- Data is available about instance types' allocatable CPU/memory and daemonset requests
- Data about instance type max pods capacity is missing but can be added easily
- Script functions:
- A primitive, conservative generation strategy with memory requests equal to limits
- Overview limitations
- Plan what and how to address the situation
## UX
- constraints from users
- constraints from engineers
- constraints from cost optimization
---
- general enough to cover various situations
- the right tool for each scenario
## Scope
- The ra CLI should be general enough to cover various situations
`ra --instance-type --scale-memory-rl=1 --scale-cpu-rl=8`
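For illustration, a minimal sketch of what such a conservative generation strategy could compute per instance type, assuming `--scale-memory-rl` / `--scale-cpu-rl` are limit-to-request factors (all names, figures, and the cut-off are assumptions, not the actual ra behaviour):

```python
# Hypothetical sketch: generate allocation options for one instance type,
# with memory request == memory limit (the conservative strategy) and CPU
# limits allowed to exceed CPU requests. All names and numbers are
# illustrative assumptions, not real ra output.

GIB = 1024**3

def generate_options(allocatable_mem, allocatable_cpu,
                     daemonset_mem, daemonset_cpu,
                     scale_memory_rl=1.0, scale_cpu_rl=8.0):
    """Yield (users_per_node, per-user resources) for one node type.

    scale_memory_rl / scale_cpu_rl mirror the --scale-memory-rl and
    --scale-cpu-rl flags, read here as limit-to-request factors.
    """
    usable_mem = allocatable_mem - daemonset_mem
    usable_cpu = allocatable_cpu - daemonset_cpu

    users = 1
    while True:
        mem_limit = usable_mem / users
        mem_request = mem_limit / scale_memory_rl  # equal to the limit when the factor is 1
        cpu_limit = usable_cpu / users
        cpu_request = cpu_limit / scale_cpu_rl     # CPU can be oversubscribed
        if mem_request < 0.5 * GIB:                # arbitrary cut-off for tiny options
            break
        yield users, {"mem_request": mem_request, "mem_limit": mem_limit,
                      "cpu_request": cpu_request, "cpu_limit": cpu_limit}
        users *= 2

# Example: a node with ~29 Gi allocatable memory and 4 CPU, minus made-up
# daemonset overhead.
for users, res in generate_options(29 * GIB, 4.0, 0.5 * GIB, 0.2):
    print(users, round(res["mem_request"] / GIB, 2), round(res["cpu_request"], 3))
```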
## How to determine what input params we need - think about scenarios
### General use
- set of instance types allowed
- [n2-highmem-4, n2-highmem-16]
- [n2-highmem-16]
- set of limits on the max requests
Information:
- Usage:
- max users
- average and median users
- min users
### New setup
What is a good default set of options?
- cost
- startup time
- efficiency
Generate twice, then cut and paste the results together
## Multi-machine type choices
### combined
challenges:
- a 4G request on a 4 CPU node ends up slightly different from the same option on a larger node
`--min-users-per-node=1`

| memory request (G) | node (CPU) |
| --- | --- |
| 1 | 4 |
| 2 | 4 |
| 4 | 4 |
| 8 | 4 |
| 16 | 4 |
| 32 | 4 |
| 64 | 16 |
| 128 | 16 |
Situation: an event with at most 200 users:

`--min-users-per-node=32 --instance-types=[4cpu, 16cpu, 64cpu]`

| memory request (G) | node (CPU) |
| --- | --- |
| 1 | 16 |
| 2 | 16 |
| 4 | 16 |
| 8 | 64 |
| 16 | 64 |
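Assuming the left column above is the per-user memory request and the right column the chosen node size, a minimal sketch of the combined selection could be: for each memory-request option, pick the smallest allowed instance type that still fits at least `--min-users-per-node` users. The node memory figures and the integer-division capacity model below are illustrative assumptions, not the real script.

```python
# Hypothetical sketch of the "combined" selection. Node memory figures are
# rough n2-highmem sizes, not real allocatable numbers; with real allocatable
# memory (minus daemonset overhead) the thresholds shift, which is why the
# event scenario above lands on larger nodes earlier.

NODE_MEM_G = {4: 32, 16: 128, 64: 512}  # node CPU count -> memory (G), approximate

def combined_options(mem_requests_g, instance_cpus, min_users_per_node):
    """Map each per-user memory request to the smallest fitting node size."""
    choices = {}
    for mem_g in mem_requests_g:
        for cpu in sorted(instance_cpus):
            users_per_node = NODE_MEM_G[cpu] // mem_g
            if users_per_node >= min_users_per_node:
                choices[mem_g] = cpu
                break
    return choices

print(combined_options([1, 2, 4, 8, 16, 32, 64, 128], [4, 16],
                       min_users_per_node=1))
# {1: 4, 2: 4, 4: 4, 8: 4, 16: 4, 32: 4, 64: 16, 128: 16}
```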
### separated
`--scale-memory-rl=1`

- 4 CPU node: 1, 2, 4, 8, 16, 32 G options
- 8 CPU node: 1, 2, 4, 8, 16, 32, 64 G options
- 16 CPU node: 1, 2, 4, 8, 16, 32, 64, 128 G options
`--increment-factor=2`

- 64 CPU node: 1, 2, 4 G (each a max number of pods risk), 8, 16, 32, 64, 128, 256, 512 G options
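A minimal sketch of how the separated list for one instance type could be generated with `--increment-factor=2`, flagging options that risk the node's max pods limit. The 110 figure is the common GKE default, used here only because the per-instance-type max pods data mentioned in the agenda isn't wired in yet; the node memory figure is illustrative too.

```python
# Hypothetical sketch of the "separated" generation for a single instance
# type: double the memory request from 1G up to the node's memory, and flag
# options whose implied users-per-node count could exceed the node's max
# pods capacity.

def separated_options(node_mem_g, increment_factor=2, max_pods=110):
    options = []
    mem_g = 1
    while mem_g <= node_mem_g:
        users_per_node = node_mem_g // mem_g
        risk = " (max number of pods risk)" if users_per_node > max_pods else ""
        options.append(f"{mem_g}G{risk}")
        mem_g *= increment_factor
    return options

# A 64 CPU highmem-style node with ~512 G memory (illustrative figure):
for option in separated_options(512):
    print(option)
# 1G, 2G and 4G print with the max pods risk note; 8G through 512G without it
```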
## Complexities
- small differences in "similar" requests: ~27.7Gi on a 4 CPU node vs ~29.1Gi on a 16 CPU node
- imperfect fit when scheduling onto shared nodes
- startup times
- few users per node cause longer startup times
- cost & efficiency
- minimize overhead of unscheduled capacity (few users per node and an oversubscription factor of 1); see the small example after this list
- maximize efficiency of available capacity (many users per node and limits oversubscribed above requests, i.e. a limit/request factor greater than 1)
- memory constraints can be observed in two ways
- limit reached, process killed inside container
- typically in this situation, the kernel's OOM killer terminates a process inside the container
- requests surpassed (but possibly still below the limit): the server (the pod on the node) gets killed !!!
- under memory pressure, the kubelet will choose to evict the pod that exceeds its memory request the most, even though that pod may not be the one that pushed the node over capacity
- this is more problematic for users and the hub
- users don't understand their own usage
- tailor-made `allowed_teams`, filtering profile list entries (but not options)
- optimizing cloud cost
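To make the unscheduled-capacity overhead concrete, a tiny illustrative calculation (all figures are made up):

```python
# Illustrative only: leftover (unscheduled) memory for a few-users-per-node
# option versus a many-users-per-node option, on a node with ~29 Gi
# allocatable memory (made-up figure).

allocatable_gi = 29.0

for request_gi in (8.0, 1.0):  # few vs many users per node
    users = int(allocatable_gi // request_gi)
    unscheduled = allocatable_gi - users * request_gi
    print(f"{request_gi}Gi requests: {users} users/node, "
          f"{unscheduled:.1f}Gi ({unscheduled / allocatable_gi:.0%}) unscheduled")
# 8.0Gi requests: 3 users/node, 5.0Gi (17%) unscheduled
# 1.0Gi requests: 29 users/node, 0.0Gi (0%) unscheduled
```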
## Change drivers
We assume we keep using a single instance type no matter what.
- CPU / memory constrained
- first grow resource allocation options if possible
- increase the size of the single instance type
- startup times slow
- cost high
What could make us change their setup to expose multiple instance types to the large group of users?
We can still configure multiple instance types for a few users.
## Going beyond large-group defaults
Either:
- run script multiple times to generate separate profiles
- run script once to handle all
## Erik is currently leaning towards
- the ra script should only generate options for individual instance types
- a single instance type should be picked for common use
- create a new script instead of reworking the old one
- motivate first with a detailed proposal, then implement