A document to keep the discussions about our dream Pantavisor architecture.

## Pantavisor Features

### Core

#### System management

* **If it runs Linux, it runs Pantavisor:** Pantavisor can be run in two different init modes: embedded (on device, with a flashable image) and app engine (on a Linux host, with an install package).
* **Bootloader:** In embedded mode, Pantavisor supports both U-Boot and GRUB bootloaders.
* **Watchdog:** It can be set and pinged from Pantavisor, with four different modes: disabled, shutdown (only pings during shutdown), startup (only pings during startup and shutdown) or always.
* **sysctl:** The /proc/sys hierarchy can be configured from Pantavisor.

#### Configuration

* **Fully customizable:** Pantavisor can be configured at compile time (pantavisor.config), at boot time (cmdline or policies), at update time (trail config) or at run time (user metadata).

#### Reproducible Device Revisions

* **Reproducible Revisions:** Pantavisor is capable of running reproducible revisions. A revision contains a BSP (kernel, modules, firmware and Pantavisor binary) and a number of containers. A revision is represented by a state JSON and a set of objects pointed to from that JSON.
* **Integrity:** Revisions can be stored on disk. Their integrity can be protected by crypto signatures and object checksum validation. Signature protection of revisions is named secureboot, and it has four levels of severity: disabled, audit, lenient or strict.
* **Transactional Updates:** When a new revision is issued, Pantavisor can transition to the new updated version, evaluating whether it is necessary to reboot the device or not.
* **Updates Feedback:** The state of any revision update is tracked by Pantavisor: TESTING, UPDATED, DONE, WONTGO, ERROR, etc.
* **Non Reboot Transitions:** A revision that is successfully transitioned to without a reboot is set as UPDATED.
* **Reboot Transitions:** A revision that is successfully transitioned to with a reboot is set as DONE.
  The latest DONE revision is set as the rollback point to return to in case of errors in further revision updates.
* **Early Error Detection:** A revision that cannot be started is set as WONTGO.
* **Revision Stability Test:** A revision that starts but fails during a configurable testing time is set as ERROR.

#### Storage Management

* **Persistence:** Pantavisor stores a number of elements on disk to make them persistent across reboots: boot up information, metadata, configurations, additional disk volumes, logs, objects and state JSONs.
* **Disks:** Additional disks can be set up for use both by Pantavisor and the containers, with different types of encryption: non-encrypted, encrypted without hardware acceleration, encrypted using CAAM or encrypted using DCP. Pantavisor can use these disks to store its metadata.
* **Metadata:** There are two types of metadata depending on who defines it: user and device metadata.
* **Device Metadata:** Pantavisor itself keeps a number of device metadata keys up to date with relevant information about Pantavisor: network interfaces, running revision, hardware and system info, etc.
* **User Metadata:** Pantavisor can be configured using some predefined user metadata keys: SSH public keys or various Pantavisor specific configuration.
* **Garbage Collector:** A garbage collector ensures Pantavisor does not take up more disk space than desired, taking care of old revision specific logs, state JSONs, objects and disks. The garbage collector will always respect the latest DONE revision, as well as the currently running one. The garbage collector can be triggered in different ways: when a disk usage threshold is surpassed or on command.

#### Container Runtime

* **Containers:** Each container is composed of a rootfs that can be run isolated in its own namespace, plus an LXC configuration file.
* **Container Additional Storage:** Containers can use additional storage volumes with permanent, revision or boot options.
* **Container Overlay:** Additional files can be attached to a revision to modify the container rootfs.
* **Container Drivers:** Containers can request Pantavisor to load drivers (sets of BSP modules) as required, optional or manual.
* **Container Logs:** Containers can easily pump logs into the [log server](#log-server).
* **Container Groups:** Containers can be grouped to define their startup order and several default configuration values.
* **Container Roles:** Containers can have a management role to enable Pantavisor control from their namespace.
* **Container Reset:** Containers can be individually reset in case of updates that affect them. Otherwise, a full system reboot can be issued.
* **Container Status:** The status of each container is tracked by Pantavisor: INSTALLED, MOUNTED, BLOCKED, STARTED, READY, ALIVE, etc. A status goal can be set to define the point a revision has to reach to be considered stable.
* **Status Goal:** If a status goal is not reached within a configurable time, the revision that contains the container fails with ERROR.
* **Readiness and Liveness:** A container can send signals that affect its status to signify readiness or liveness. Probes can be configured to do this automatically.
* **Policies:** Policies can be configured to set a system action after a condition set by a container.

#### External Control

* **HTTP Server:** Pantavisor offers a UNIX socket using an HTTP protocol for interaction from containers.
* **Get Information from Containers:** Run time information can be consulted: container status, stored metadata, stored trails and objects, and build info. Only available to management containers.
* **Install Revisions from Containers:** Revisions can be installed from management containers. This includes both the state JSON and objects.
* **Manage Pantavisor from Containers:** Commands to alter Pantavisor behavior can be issued from management containers: run a revision, poweroff/reboot, run the garbage collector.
* **Manage Metadata from Containers:** User and device metadata can be edited from management containers.
* **Notify Pantavisor from Containers:** Ready and alive signals can be sent from every container.

#### Debug

* **SSH:** An SSH server can be enabled on the device for debugging purposes.
* **Shell:** A TTY shell can be opened on the device side for debugging purposes.

### Log Server

* **All Device Logs in One Place:** The Log Server centralizes all container logs, separated by revision, in one place. It offers a couple of UNIX sockets to do so: one to send log traces, the other to subscribe file descriptors.
* **Configurable Log Output:** The Log Server can store the logs in different, not mutually exclusive formats: file tree, single file, stdout or nullsink.
* **Configurable Log Server:** The Log Server can be configured with different settings besides the outputs: maximum log size, log level, dmesg capture or container capture.

### PH Client

* **Associate Your PH Account With a Device:** The client offers the possibility to claim a new device from a PantacorHub account.
* **Device Feedback:** From that moment on, the device continuously probes PH for new revisions.
* **Install New Revisions from the Cloud:** When a new revision is found in PH, the client downloads the state JSON and objects onto the device. Then, a request to Pantavisor is issued to transition to the new revision.
* **Update Feedback:** The update process state (UPDATED, DONE, ERROR, WONTGO, etc.) is sent back to PH. The download of the revision is retried a configurable number of times before being set as ERROR.
* **Metadata Management from the Cloud:** The client also sends and receives device and user metadata to and from PH.
* **Log Push to Cloud:** The client sends the stored logs too.

## Current Architecture

Pantavisor forks several processes.
If we discard the short term ones for things like tsh or starting containers, these are the rest:

* Main process (mandatory)
* LXC container (mandatory)
* logserver
* ph_logger main service
* ph_logger range service

### Main Process

#### State Machine

The main process state machine has the following states:

* INIT: initialization of Pantavisor.
* RUN: current revision initialization and end update transitions.
* WAIT: the main loop per se.
* COMMAND: to process the pv-ctrl commands (run, gc, etc.).
* UPDATE: update installation and start update transitions.
* UPDATE_APPLY: update installation without transition.
* ROLLBACK: perform a revision rollback.
* REBOOT: perform a reboot.
* POWEROFF: perform a poweroff.
* ERROR: goes to reboot.
* EXIT: exits the state machine, which should not happen.
* FACTORY_UPLOAD: legacy metadata upload.

##### WAIT (Main Loop)

The actions executed in the main loop are:

1. Start/check containers to make sure the current revision is running.
2. Claim process in remote mode.
3. Wait for updates in remote mode.
4. Finish updates if TESTING.
5. Metadata management.
6. Check garbage collector.
7. Check debug tools.
8. Process pv-ctrl requests.

###### Main Loop Problems

There are a couple of important architectural problems:

* The main loop period is very unreliable, as a period might be blocked for a long time. This also means no container evaluation, no watchdog, no pv-ctrl availability, etc.
* The main loop only runs in the WAIT state, which means no pv-ctrl availability, no container status evaluation, no watchdog, etc. when the state machine is out of WAIT.

According to some quick experiments on an M2 board, these are some significant times between WAIT executions:

* 23 seconds: from initialization until the first WAIT with all containers.
* 12 seconds: HTTP PUT of an object of size 17330176.
* 9 seconds: HTTP GET of an object.
* 25 seconds: ecu_io and simple_ux starting at the same time.
## Future Architecture

### Solve Main Loop Problems

To solve the main loop problems, we propose a solution around these two main points of action:

* Simplify the state machine to expand the availability of the main loop operations.
* Add optional worker processes to execute time consuming tasks.

#### Simplify the State Machine

The state machine is too complex and something simpler could be implemented, relying more on the main loop for most of the operations. The new Pantavisor state machine would look like this:

* INIT: initialize Pantavisor.
* WAIT: main loop. This would include all current main loop actions plus initializing revisions and handling commands and updates.
* EXIT: manage poweroff and reboot.

With this state machine simplification in mind, this would be the resulting pv_wait that is called periodically during the WAIT state:

```
pv_wait() {
    // mount volumes
    // check current revision containers
    pv_revision_run();
    // try to finish updates after update transition/reboot
    pv_updater_post();
    // save metadata on disk
    pv_metadata_refresh();
    // check and run garbage collector
    pv_storage_check_gc();
    // check debug tools are running and reset them if needed
    pv_debug_check();
    // wait 1 sec or process requests
    pv_ctrl_wait();
    // init, register, claim and sync ph device
    // upload devmeta, download usrmeta
    // check for updates
    // download updates
    ph_client_run();
    // try to start updates after run command
    pv_updater_pre();
}
```

#### Add Worker Processes to Execute Time Consuming Tasks

This requires the implementation of the [Pantavisor Scheduler](https://hackmd.io/L9SXEdxeTI2_KKLhgPusnQ).
```
pv_wait() {
    // mount volumes
    // check current revision containers
    // send startup tasks to scheduler
    pv_revision_run();
    // try to finish updates after update transition/reboot
    pv_updater_post();
    // save metadata on disk
    pv_metadata_refresh();
    // check and run garbage collector
    // send gc run tasks to scheduler
    pv_storage_check_gc();
    // check debug tools are running and reset them if needed
    pv_debug_check();
    // wait 1 sec or process requests
    // send request tasks to scheduler
    pv_ctrl_wait();
    // process scheduler
    // check timed out tasks
    pv_scheduler_run();
    // init, register, claim and sync ph device
    // upload devmeta, download usrmeta
    // check for updates
    // download updates
    // send request tasks to scheduler
    ph_client_run();
    // process tasks if sent to main loop worker
    pv_worker_run();
    // try to start updates after run command
    pv_updater_pre();
}
```

#### Further Work

With these in place, we could gradually convert the main loop into an event loop: tasks are sent to the scheduler, they trigger events that get back to the main loop, and the main loop is in charge of updating its data structures accordingly.

As mentioned earlier, the ultimate objective is to turn the main loop completely into an event loop. This would require converting most of our current main loop work into tasks that can be processed by the scheduler. There are libraries that could be used once this work is done, such as [ev](https://github.com/codepr/ev).

### Implementation Plan

#### Add Worker Processes to Execute Time Consuming Tasks

See the [scheduler plan](https://hackmd.io/L9SXEdxeTI2_KKLhgPusnQ?view#Implementation-Plan).

#### Simplify the State Machine

1. Refactor the main loop, in different deliveries:
    * PH Client: move all requests into a single module with its own independent state machine, with its main function in ph_client_process.
    * Updater: group the updater work in pv_wait into pv_updater_post and pv_updater_pre.
2. Refactor the state machine:
    * Remove FACTORY_UPLOAD.
    * Unify ROLLBACK, REBOOT, POWEROFF, ERROR and EXIT.
    * Refactor RUN into WAIT.
    * Refactor COMMAND into WAIT.
    * Refactor UPDATE and UPDATE_APPLY into WAIT.