# The curious case of 'slot reconciliation'
###### tags: `Proposals`
## Background
Akri uses the Kubernetes device plugin framework to manage Akri resources. A device plugin registers itself with kubelet, which in turn calls `list_and_watch` and `allocate` to monitor resources and to allocate a resource to a specific pod. Unfortunately, there are some challenges with the current kubelet design:
1. Kubelet doesn't share any information about the pod when it requests an allocation for a resource: `rpc Allocate(AllocateRequest) returns (AllocateResponse) {}`
2. The device plugin framework doesn't support a "Deallocate" that can be called by kubelet to clean up when a pod goes offline or fails.
Given the above, Akri has to implement a "slot reconciliation" algorithm to clean up and reconcile any slots that are no longer in use, and thereby optimize resource usage.
## Periodic slot reconciliation
The currently implemented algorithm is a best-effort approach that keeps the resource allocation state up to date by periodically checking the state of the node against the state of the Instances in the API server and correcting it when needed.
The flow works as follows:
[](https://viewer.diagrams.net/?highlight=0000ff&layers=1&nav=1&title=akri_slot_reconciliation_algorithm.png#R7VrbdtsoFP0aPzZLd9mPiXPtdDrppJ1MHomEJSYSaBC%2B9esLCCxhqY6d%2BJJkOSsrMXBAiHP23nBwzx3msysKivRPEsOs51jxrOee9xzH9hynJ36teF7VhIGqSCiKlVFdcYd%2BQlVpqdoximFpGDJCMoYKszIiGMOIGXWAUjI1zUYkM59agAS2Ku4ikLVr71HM0qq274R1%2FTVESaqfbAeDqiUH2li9SZmCmEwbVe5Fzx1SQlj1KZ8NYSYWT6%2FL%2Fc38PvvyFFx9%2Flb%2BD36c%2FfH96z%2BfqsEuN%2BmyeAUKMXvx0D9v%2Fkruis9DGIG7%2FBbMLk%2BDH59sNXbJ5nrBYMzXTxUJZSlJCAbZRV17RskYx1AMa%2FFSbfOFkIJX2rzyP8jYXAUDGDPCq1KWZ6oVzhD7V3Q%2F8VXpodFyPlMjy8JcFzCj80YnUXxottXdZEn3W3Pt9DqQMY3gCjtXhTCgCWQrFlbZicVsxKHyzBUkOeST5AYUZoChiRmsQMV8srCr%2Fco%2FKNduEEFqNhOQjdWTLiZQPl%2FA1CohR19ctkKhdrTwzTRFDN4VQK7PlLOF6dQRwUx53OZvepZkoCyVE0pGydMCf8J6AabNfcRnzuBs5arq1kBBV3FXXxWnNRHYGt1pgwQ8a0d%2B8Fp%2BeICrlt0yl7gDVZ3oa7pCogrHp4JKefkxI9FTVXWJxOQVZDRz95%2F3VgYeYXYGoqdETnRIMkJ5EyZYkMPz9FEj2cBxDev9IDloI7l7PHtNKKsg0wG1NrLVSLcE8feqTchoVPKJGZSubXR8%2ByeuP6h%2FjGD3fMd8ZvWmaoh6VB4ZYN4wK4RB2TEx%2FVDbBJUTWqsnuWQfuv4SjqoZ1KharOlaQOvWNaeFtK%2FkKHWvBUjnWltvSuqCluPPieBYC2AxCTISr5RCsVBESJ6Vggms9p8MIAxpKQ2AhCIVLZgwOXsgtsCXbbZOSf44Lp8XSINEBUVfghxlwnnXMJtAhiLQIaMgQwnmhYg7FdJuduaPRDjhpaAufZcByhVnh%2FIaLjGB26GvVoe%2B9nelr4O19HVT2L83BX4b%2Bqpp%2F1n%2BCDfTV3t%2F%2BspF1AjwYHlj%2BBtJ3ZqKtfftRxXbQhi%2Bbp%2B3Hx3T8274%2FlTKUUvHEC4ZwJEUuTIjTIpdWXLZ4C4X2RZpiUopZXGtaCMRFVLRLE4f%2FEkoghX%2Fl%2BAodOY50jZ4wHe6DpJBW%2BgW6rf98LBb4XHB0dd22ns%2FwofmqSa0%2FdbSD3Z0hO%2Fk5MFB%2BFdzqd3bSNJfwtvb5l%2FnYPzbLere0q41XA6V6pVaot7eHQTLoemttTvY9MDtLbHPtg%2FQK93WlJ94IScZKpkWIS05soHCnEwqlRLGQHYBohXlkMtUXuhumFPHRyMrzwrM%2FaLfoRN2B1sF%2Fo7YyjvuFpuu3DW3OPpaSefg7Jdyi%2BObA3mDtbhlWyeP9uZzCGSviKKIZRruUQr5WXYpeUKwqBlT6ZLFnnOKOAJ4JBYwQiCTDIFJxBdbmH8wHli%2Bd3C7eKDr4iHY1a4lOPLA73Yt3SfvdTctB0t%2Brpp2A7ZXIsVhHBFHlOT83%2BntjVgISHlYf0AAmlumBX3u4%2Bav0zP%2BEYAbAbDjnn1Vuv%2BN4K%2Bdr%2Fs2ru7ZTbhZI5lxUTcQUjCNDM2Hg6Oph4vyweDYbznqMInVLaMmXFe2nDcFGztsuWMb9zavYEy7wZc1ez7HmPu6b1nX992L7b0p37ddfw8EEYpcppUD%2BiST2BVfyuyCPDw8Ql4Dq%2BT1tbQXSYaEApnCLiBFRHTjb1CUcoASYdnEUohlt0DkItwz%2FFgWC18ck93isDLoG2S932R3d8j2393uacNE7baPL966QnCwS6%2BV826wwd%2Fiu4kRyhR4%2Bd8xHitM68sumWscFzFgtVXzyKNyD6q7tIv1Uz7etx79gbnb8t3wpH1pYnv9NoRfkIfkxfor0FX6qf4iuXvxCw%3D%3D)
However, this algorithm is not perfect and has some gaps:
1. To avoid the gap between a pod still being scheduled and its slot being deleted, a grace period is added (currently set to 5 minutes). This can leave a resource/slot unavailable for at least 5 minutes before it is finally reconciled.
2. Exiting when some containers on the node are still not ready can be problematic, as it can lead to an infinite wait. The agent is deployed as a DaemonSet, and if the cluster has rapid container turnover, the reconcile algorithm might never run.
3. Assuming there is only one slot left, when a pod is deleted kubelet will attempt to allocate the slot that belonged to the deleted pod, but Akri will fail the first allocate request. This results in the pod being stuck in an "AdmissionFailure" state.
4. Responsiveness can be poor in this scenario. The minimum time to reconcile a resource is 5 minutes (the grace period), and the maximum can be infinite depending on the container statuses.
5. The algorithm uses a periodic timer as its trigger rather than leveraging information available through other k8s entities. This is CPU intensive and consumes node resources.
## A ~~failed~~ less than ideal attempt
In the quest for a better solution, an initial step was to trigger the reconcile logic from Kubernetes events by subscribing to the Informer's pod events. That way we can re-run the reconcile logic more reliably based on real state changes instead of arbitrary time-based triggers, and reduce CPU usage. Even better, since we would be notified of pod deletions, we can quickly reconcile slots for those pods and no longer need the grace period to remove a slot, as the [API server already maintains a grace period before reporting a pod deletion](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle).
The algorithm would work as follows:

While that algorithm has better responsiveness, it had the same starvation problem as the original one (as I discovered afterwards). It also introduced a new gap: if the agent is restarted and misses a pod 'delete' event, that slot is not recoverable until the next pod event, when the whole state re-evaluation happens. Well, that wasn't an ideal solution.
## Now what?
While the solution above gets closer to a reliable trigger for re-evaluating and reconciling slots, it had a couple of issues: it doesn't act on the pod that triggered the event but instead re-evaluates the state of all the pods, which causes the unreliability, and it had no way to recreate its state in the event the agent is restarted.
The ideal solution should handle:
* A pod going offline ~~or container crashing~~.
* Agent restarting.
* Allocate expiring: a pod/container fails after the call to `allocate`. (Note: is that even possible?)
## Before jumping to solutions..
There are some facts we need to understand to be able to build a reliable & resilient design.
* How does kubelet handle allocate/deallocate? To come up with a reliable way to deallocate resources in Akri, we need to match what kubelet does. Based on experimentation and going through the kubelet source code, [kubelet handles allocation and deallocation at pod level](https://sourcegraph.com/github.com/kubernetes/kubernetes/-/blob/pkg/kubelet/cm/devicemanager/manager.go#L623:10). So there is a guarantee that if a container fails (regardless of the restart policy specified in the pod spec), kubelet will free up the allocated resource only when the pod is terminated, not at container level. In other words, if a pod has 2 containers using Akri's resources and one of the containers keeps restarting or crashes, kubelet will never call `Allocate` again until the pod is deleted. (Note: this is an example to explain the problem, but likely an edge case.)
What is also reassuring is that a pod won't even be scheduled unless all of its containers have guaranteed slots available:
>users can request devices in a Container specification as they request other types of resources, with the following limitations:
> Extended resources are only supported as integer resources and **cannot be overcommitted**.
* While there are some discussions about supporting [deallocate in kubelet](https://github.com/kubernetes/kubernetes/pull/91190#issuecomment-675764738), we will still require slot reconciliation to support older Kubernetes versions.
## A better solution
While kubelet doesn't provide the pod, container, or slot mapping during an `allocate` call, we can attempt to infer it from the different triggers that happen on the cluster and maintain our own map.
When the agent starts, it begins listening to pod events. The agent can ignore all events from pods unrelated to Akri. Once a device plugin is registered, the pod events can be used to update the `slot_pod_map`.
The `slot_pod_map` is a mutex-protected map holding entries of the form `<slot, pod, timestamp>`.
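A minimal sketch of what this shared structure could look like in the agent; the `SlotState` and `SlotPodMap` names are illustrative and not taken from the existing codebase:
```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::Instant;

/// Illustrative per-slot record: which pod (if any is known yet) holds the
/// slot and when the entry was last touched.
#[derive(Debug, Clone)]
struct SlotState {
    /// Pod name (or ID) once it is discovered from pod events; `None`
    /// right after `allocate` until the owning pod is identified.
    pod: Option<String>,
    /// Timestamp of the last update, useful for aging out stale entries.
    updated_at: Instant,
}

/// The mutex-protected slot -> pod map shared between the device plugin's
/// `allocate` handler and the pod-event watcher.
type SlotPodMap = Arc<Mutex<HashMap<String, SlotState>>>;
```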
When `allocate` is called, the device plugin first locks the mutex and checks whether the slot already exists. If not, it adds the slot to the `slot_pod_map` with an empty pod assigned and updates the Instances in the API server to reflect the current slot/node allocation. If the slot already exists, the allocation fails (error! this shouldn't happen).
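Building on the `SlotPodMap` sketch above, the allocate-side check could look roughly like this (the error type and the API-server update are placeholders, not the actual Akri implementation):
```rust
use std::time::Instant;

/// Called from the device plugin's `allocate` handler. Locks the map,
/// registers the slot with no pod assigned yet, and rejects the request
/// if the slot is already tracked (which shouldn't normally happen).
fn reserve_slot(map: &SlotPodMap, slot_id: &str) -> Result<(), String> {
    let mut slots = map.lock().unwrap();
    if slots.contains_key(slot_id) {
        // The slot is already tracked: fail the allocation.
        return Err(format!("slot {} is already allocated", slot_id));
    }
    slots.insert(
        slot_id.to_string(),
        SlotState { pod: None, updated_at: Instant::now() },
    );
    // Here the agent would also update the Akri Instance in the API server
    // to reflect the current slot/node allocation (omitted in this sketch).
    Ok(())
}
```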
While `allocate` is being called, pod events have already started flowing for that pod, mainly `WatchEvent::Added` and `WatchEvent::Modified` events. The Added event can be safely ignored, since allocation is already covered by kubelet's call to `allocate`. On the other hand, `WatchEvent::Modified` can be triggered by the different container lifecycle events (container init, started, or terminated). Hence, by inspecting the pod's container statuses ([`pod.status.container_statuses`](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-states)), we can know when a pod's container is up and running by looking for the `started` flag. Once started, we can inspect the container for Akri's annotations using crictl. Once Akri's slot annotation is found, we can update the `slot_pod_map` for that slot with that pod name.
If we get a `WatchEvent::Deleted`, we can remove the slot entry from the `slot_pod_map` and update the API server.
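Under those assumptions, the event handling could look roughly like the following kube-rs sketch; `find_slot_for_pod` is a hypothetical stub standing in for the crictl-based annotation lookup, and the exact field shapes depend on the kube/k8s-openapi versions in use:
```rust
use k8s_openapi::api::core::v1::Pod;
use kube::api::WatchEvent;

/// Hypothetical stub: resolve the slot for a pod via its crictl annotations.
fn find_slot_for_pod(_pod_name: &str) -> Option<String> {
    // The crictl-based annotation lookup would go here.
    None
}

/// Illustrative pod-event handler for the proposed flow.
fn handle_pod_event(map: &SlotPodMap, event: WatchEvent<Pod>) {
    match event {
        // Added can be ignored: allocation is covered by kubelet's `allocate`.
        WatchEvent::Added(_) => {}
        WatchEvent::Modified(pod) => {
            // Look for a container that reports `started == true`.
            let started = pod
                .status
                .as_ref()
                .and_then(|s| s.container_statuses.as_ref())
                .map(|cs| cs.iter().any(|c| c.started == Some(true)))
                .unwrap_or(false);
            if started {
                let pod_name = pod.metadata.name.clone().unwrap_or_default();
                if let Some(slot_id) = find_slot_for_pod(&pod_name) {
                    let mut slots = map.lock().unwrap();
                    if let Some(entry) = slots.get_mut(&slot_id) {
                        entry.pod = Some(pod_name);
                        entry.updated_at = std::time::Instant::now();
                    }
                }
            }
        }
        WatchEvent::Deleted(pod) => {
            let pod_name = pod.metadata.name.clone().unwrap_or_default();
            let mut slots = map.lock().unwrap();
            // Drop any slot owned by the deleted pod; the Akri Instance in
            // the API server would be updated here as well.
            slots.retain(|_, entry| entry.pod.as_deref() != Some(pod_name.as_str()));
        }
        _ => {}
    }
}
```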
*Note:*
It is likely that the `WatchEvent::Deleted` event will be followed by, or overlap with, a kubelet `allocate` call. There is no guarantee that allocate and delete won't happen at the same time or in any particular order. For this we have a few possibilities:
* allocate is called after delete is complete: the `slot_pod_map` will not have an entry for this slot, and all is good.
* allocate is called while delete is being processed: this is handled by the mutex, so allocate will wait for delete to release the lock before proceeding.
* allocate is called before delete (less likely to happen): in this case the allocation will fail.
The suggested algorithm would work as follows:
[](https://www.sequencediagram.org/index.html#initialData=C4S2BsFMAIGVwPbGgJ0gYwQO3ScIBDUbaAWmgCMCBnEdaAM0QHcAoVgawFcLIphSAPgAmkAG51IAfQAO4LgHMQWAFwBBcInRFI5aomABGVqInppcxcqFpMOPJBTrhw6PqSHowBG4OyEwlIAtgQyADpYABTuRgA0MgFSXFgcWAjMWACUJuKSsvJKWEKhIFLUjmKOKgCqMsI6Xj40tApYvh6N0OhcKGhYwFJpoqzKDAgoQY42GNi4UE4ACokAsgEgDCCQrpEJwobZtrMOKNN2c1UAkgzQu552wATKjtAAFjTtRtAEWGkPxFixCLAF6QNpcOoNXbUMoGaEhGSMcbQaIGQzxAL7CJ8crQECtcYwEZYMYTKaCQ72eYqaBLQIAET4kGAMB2GIOM0pZIp5ycVxuGMRyVcyg+-kC8Ni0FE-EgrG5x2KMlK5RQlSc4PqzM68qgH2M7CAA)
### High-level algorithm

**Implementation related notes**:
- Pod terminated is different from pod deleted; both need to be handled.
- It is suggested to use the pod ID instead of the pod name to avoid name collisions.
- Investigate using pod annotations instead of maintaining an in-memory map (see the sketch below). This avoids the complexity of rebuilding the state in case the agent is restarted, but on the other hand introduces the risk of failing annotation writes (annotations are persisted on the API server and may require multiple retries to succeed for larger clusters).
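If the annotation route is explored, a rough kube-rs sketch of such a write could look like this; the `akri.sh/slot` annotation key and the function name are illustrative only, not part of the current codebase:
```rust
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, Patch, PatchParams};
use serde_json::json;

/// Hypothetical alternative: record the slot on the pod itself with a merge
/// patch instead of keeping an in-memory map.
async fn annotate_pod_with_slot(
    client: kube::Client,
    namespace: &str,
    pod_name: &str,
    slot_id: &str,
) -> Result<(), kube::Error> {
    let pods: Api<Pod> = Api::namespaced(client, namespace);
    let patch = json!({
        "metadata": { "annotations": { "akri.sh/slot": slot_id } }
    });
    // A merge patch leaves the rest of the pod untouched; retries may still
    // be needed if the API-server write fails.
    pods.patch(pod_name, &PatchParams::default(), &Patch::Merge(&patch))
        .await?;
    Ok(())
}
```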
**Cost**
| Task | Cost (days) |
| -------- | -------- |
| New algorithm implementation - informer hook up & reconcile | 3 |
| Handling agent restarts | 2 |
| Updating existing tests and adding new ones | 5 |
| E2E testing | 2 |
| Total | 12 |