# Pantavisor Platform Liveness

This document describes how to use the existing status infrastructure to check and store platform liveness. The idea is based on [Kubernetes liveness](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).

## Status

This feature will add two new statuses to the existing [container status list](https://docs.pantahub.com/containers/#status):

* UNRESPONSIVE: the container has reached its status goal but is not sending the expected alive signal within the configured timeout.
* ALIVE: the container has reached its status goal and is sending the expected alive signal within the configured timeout.

These statuses will not be available for a container unless an alive timeout is set in [run.json](https://docs.pantahub.com/pantavisor-state-format-v2/#containerrunjson). The alive timeout is independent of the status goal:

```
{
    "#spec": "service-manifest-run@1",
    "name": "awconnect",
    "alive_timeout": 30,
    ...
}
```

Default values for the alive timeout could also be configured from groups and configuration.

As is already the case, pv-ctrl can be used to consult the status of a container, group or revision:

```
# getting the status of a container named "pvr-sdk"
pvcontrol ls | jq '.[] | select(.name == "pvr-sdk") | .status'
# getting the status of the "platform" group
pvcontrol groups ls | jq '.[] | select(.name == "platform") | .status'
# getting the global revision status
pvcontrol devmeta ls | jq '."pantavisor.status"'
```

To give these commands more informative output, new keys such as ready (or status goal reached) and alive could be added to pvcontrol ls. These could be extended to groups and the global revision too.

## Sending Signals from a Container

As with the READY status, a signal will have to be sent from the container namespace to progress to and, in this case, maintain the ALIVE state.
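The relationship between the alive timeout and the two new statuses can be sketched in Python. This is only a toy model and not Pantavisor code: the function name `liveness_status`, its parameters, and the choice to start counting the timeout from the moment the status goal is reached (when no signal has arrived yet) are all assumptions made for illustration.

```python
from typing import Optional


def liveness_status(
    now: float,
    goal_reached_at: Optional[float],
    last_alive_at: Optional[float],
    alive_timeout: Optional[float],
) -> Optional[str]:
    """Return "ALIVE", "UNRESPONSIVE", or None when neither status applies.

    All times are in seconds on the same clock. Hypothetical helper,
    not part of Pantavisor.
    """
    # Without an alive timeout the new statuses are not available.
    if alive_timeout is None:
        return None
    # Both statuses require the status goal to have been reached.
    if goal_reached_at is None:
        return None
    # A recent enough alive signal keeps the container ALIVE.
    if last_alive_at is not None and now - last_alive_at <= alive_timeout:
        return "ALIVE"
    # Assumption: the timeout counts from the last signal, or from the
    # moment the goal was reached if that happened later or no signal came.
    reference = goal_reached_at
    if last_alive_at is not None:
        reference = max(goal_reached_at, last_alive_at)
    if now - reference > alive_timeout:
        return "UNRESPONSIVE"
    # Goal reached and still inside the first timeout window.
    return None
```

With `alive_timeout` set to 30, a container whose last signal is 10 seconds old would be reported as ALIVE, while one whose last signal is 50 seconds old would be reported as UNRESPONSIVE.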

### pv-ctrl

To send an ALIVE [signal](https://docs.pantahub.com/containers/#signals) via pv-ctrl, we would do the same as we do now for the READY signal:

```
pvcontrol signal alive
```

This will be possible from every container, regardless of its [role](https://docs.pantahub.com/containers/#signals).

### Probes

Probes will be added to the container spec so that users can configure an automated signal, sent whenever a given command exits with status 0. The command will be executed by Pantavisor inside the container namespace.

A probe for an alive signal would look like:

```
{
    "#spec": "service-manifest-run@1",
    "name": "awconnect",
    "alive_timeout": 30,
    "probes": [
        {
            "signal": "alive",
            "command": "ping -c 4 8.8.8.8",
            "initial_delay": 10,
            "period": 15
        }
    ],
    ...
}
```

Probes can be used to send a ready signal too:

```
{
    "#spec": "service-manifest-run@1",
    "name": "pvr-sdk",
    "status_goal": "READY",
    "alive_timeout": 30,
    "probes": [
        {
            "signal": "ready",
            "command": "pvcontrol ls",
            "initial_delay": 5,
            "period": 10
        }
    ],
    ...
}
```

In the case of the READY status goal, the probe will stop sending the signal once the goal is reached.

## Policies

Policies will be added so that Pantavisor can react to a container's status independently of its status goal. The long-term idea is to model the full behavior of Pantavisor at any runtime state with these policies instead of having it hard-coded. For example, a policy could define whether we roll back when a container is not responding.

For now, the following actions will be possible:

* nothing (default): do nothing, but inform about changes in the container/group/global statuses.
* container: perform a container restart.
* system: perform a full system reboot.

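How an action might be chosen can be sketched in Python. The policy shape used here (a `conditions` object with `update_state` and `status`, plus an `action` with a `type`) follows the policy examples in this section. The function name `select_action`, the first-match-wins rule, and treating a missing condition as "matches anything" are assumptions made for illustration, not confirmed Pantavisor behavior.

```python
from typing import Any, Dict, List

# Action used when no policy matches, mirroring the "nothing" default.
DEFAULT_ACTION: Dict[str, Any] = {"type": "nothing"}


def select_action(
    policies: List[Dict[str, Any]],
    update_state: str,
    status: str,
) -> Dict[str, Any]:
    """Return the action of the first policy whose conditions all match.

    Hypothetical helper: first match wins, and a condition that is
    absent from a policy is treated as matching any value.
    """
    for policy in policies:
        conditions = policy.get("conditions", {})
        allowed_states = conditions.get("update_state")
        if allowed_states is not None and update_state not in allowed_states:
            continue
        wanted_status = conditions.get("status")
        if wanted_status is not None and status != wanted_status:
            continue
        return policy["action"]
    return dict(DEFAULT_ACTION)


# Example policy list with a single entry: reboot the system when the
# container is UNRESPONSIVE and the update state is DONE or UPDATED.
example_policies = [
    {
        "conditions": {
            "update_state": ["DONE", "UPDATED"],
            "status": "UNRESPONSIVE",
        },
        "action": {"type": "system"},
    }
]

if __name__ == "__main__":
    print(select_action(example_policies, "DONE", "UNRESPONSIVE"))
    print(select_action(example_policies, "DONE", "ALIVE"))
```

Falling back to `nothing` keeps the sketch consistent with the default action listed above: a container with no matching policy only has its status changes reported.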
Actions will be performed based on conditions such as the update state and the triggering container status:

```
{
    "#spec": "service-manifest-run@1",
    "name": "pvr-sdk",
    "status_goal": "READY",
    "policies": [
        {
            "conditions": {
                "update_state": ["DONE", "UPDATED"],
                "status": "UNRESPONSIVE"
            },
            "action": {
                "type": "system"
            }
        }
    ],
    ...
}
```

This means that, if the pvr-sdk container becomes UNRESPONSIVE because it fails to send an alive signal within its alive timeout (10 seconds in this scenario) while the update state is DONE or UPDATED, Pantavisor will perform a full system reboot.

Policies can be added at the group level in device.json too:

```
{
    "name": "app",
    "status_goal": "STARTED",
    "timeout": 30,
    "policies": [
        {
            "conditions": {
                "update_state": ["DONE", "UPDATED"],
                "status": "UNRESPONSIVE"
            },
            "action": {
                "type": "container",
                "retries": 5,
                "sleep": 10
            }
        }
    ],
    ...
}
```

In this case, each container in the app group will be restarted once it has gone 10 seconds without responding, unless the policy is overridden in run.json. The restart will be attempted up to 5 times, with a sleep of 10 seconds between tries, before falling back to a full system reboot.

## Implementation Plan

1. Improve container/group/global feedback.
2. New alive timeout and statuses.
3. Add pvcontrol signal alive.
4. First version of policies with state (UPDATED, DONE), status (UNRESPONSIVE) and action (nothing, system).
5. Ready and alive probes.
6. Other policy actions (container).