# Test-harness GPU support work plan

---

## Objective

Extend the test-harness to utilize external Nvidia GPUs (retail cards used by miners) and run livepeer in GPU mode.

---

## Possible Solutions

### Option 1: Nvidia Docker

The idea here is to switch the Docker runtime to [Nvidia-enabled Docker](https://github.com/NVIDIA/nvidia-docker), which allows containers to access the host machine's underlying GPUs.

#### pros

1. the test-harness already works well with Docker, so the refactor shouldn't be as extensive as the Ansible option.
2. all the benefits of running in a container, such as an isolated `env` and not having to worry about library support on the host machines.
3. easier to integrate into existing metrics pipelines like Prometheus and Loki.

#### cons

1. miners (the target users) don't really use Docker and prefer to run on bare metal.
2. I'm not sure the multi-GPU support hacks would work with Nvidia Docker (haven't tried that).

#### Main work tracks

1. refactor the test-harness to separate the machine provisioning, network creation and deployment phases
2. add VPN automation support to the test-harness
   a. create test-harness specific VPN credentials
   b. add the server to the Livepeer VPN
   c. IP / port mapping
3. create an Nvidia Docker installation script
4. generate the GPU `env` vars
5. ability to run some parts of the deployment on GCP and only run the transcoder on the GPU rig

### Option 2: Ansible automation

Instead of running the livepeer node(s) in containers, Ansible can be used to provision the host machine to run livepeer directly.

#### pros

1. no extra layer of abstraction (no Docker)
2. ~~more realistic GPU memory performance when testing concurrency, since there would be no memory management like in Docker.~~ **note:** Nvidia Docker does not manage GPU memory; thanks to Michael for the correction.

#### cons

1. this is a considerable refactor given how deeply integrated Docker is with the test-harness.
2. although Ansible can maintain state, it won't guarantee that there aren't old artifacts or unused `env` vars left on the machine; this isn't a problem with Docker.

#### work breakdown

1. refactor the test-harness to separate the machine provisioning, network creation and deployment phases
2. convert the deployment steps into an Ansible playbook
3. convert the provisioning logic from `docker-machine` to Ansible
4. refactor internal URLs to use IPs instead of Docker Swarm node names
5. add VPN automation support to the test-harness
   a. create test-harness specific VPN credentials
   b. add the server to the Livepeer VPN
   c. IP / port mapping
6. generate the GPU `env` vars
7. ability to run some parts of the deployment on GCP and only run the transcoder on the GPU rig

---

## Concerns / possible issues

### VPN shenanigans

The way we get access to these machines is usually through a VPN, which has two side effects:

1. only one user at a time on the machine.
2. sending a stream from an external source (a GCP streamer, for example) isn't possible, since the RTMP endpoint won't be exposed to the public internet.

The workaround for the first point is to use the Livepeer VPN to connect to the genesis VPN; this nesting of VPNs is going to affect network performance and adds a layer of complexity. Also, the Livepeer VPN can be used to wrap any external server from which we need to reach the genesis rigs, but this will require some extra automation to get that setup running seamlessly.
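For the second side effect, one possible (non-VPN) workaround is a reverse SSH tunnel from the rig to a publicly reachable host, so an external streamer can hit the rig's RTMP endpoint. This is only a rough sketch of the kind of plumbing involved, not a decision: the relay host, user and the use of SSH at all are assumptions, and it presumes the rig can open outbound SSH connections.

```bash
# Sketch only: forward the rig's local RTMP port to a public relay host.
# RELAY_HOST / RELAY_USER are hypothetical; the relay's sshd needs
# "GatewayPorts yes" (or clientspecified) for the public bind to work.
RELAY_HOST="relay.example.com"   # hypothetical public GCP instance
RELAY_USER="livepeer"            # hypothetical user on the relay
RTMP_PORT=1935                   # default livepeer RTMP port

# -N: no remote command; -R: reverse-forward relay:1935 -> rig:1935
ssh -N -R 0.0.0.0:${RTMP_PORT}:localhost:${RTMP_PORT} \
    "${RELAY_USER}@${RELAY_HOST}"
```

Whether this is preferable to wrapping the external streamer in the Livepeer VPN is an open question; it trades VPN nesting for one more long-lived process to automate and monitor.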
### Pre-provisioned machines

The rigs we have access to come provisioned and are usually already running a miner. This is good, since it means we don't need to go through GPU driver / library installation. But unlike a freshly created GCP instance, the setup might not be perfect, and it will require far more sanity checks to make sure we can automatically deploy livepeer and run it successfully in a deterministic way.
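A minimal sketch of what those sanity checks could look like is below. The exact checks are assumptions rather than a final list, the CUDA image tag is only an example, and the last check is only relevant if the Nvidia Docker option is chosen.

```bash
#!/usr/bin/env bash
# Sketch of pre-deployment sanity checks for a pre-provisioned GPU rig.
set -euo pipefail

# Example image tag (assumption); pick one matching the installed driver.
CUDA_IMAGE="nvidia/cuda:12.2.0-base-ubuntu22.04"

# 1. driver / GPU visibility on the host
command -v nvidia-smi >/dev/null || { echo "nvidia-smi not found"; exit 1; }
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader

# 2. make sure a miner (or anything else) isn't already holding the GPUs
if nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | grep -q .; then
    echo "warning: GPU is already in use by another process"
fi

# 3. Nvidia Docker option only: confirm containers can actually see the GPU
#    (assumes Docker 19.03+ with the nvidia container runtime installed)
if command -v docker >/dev/null; then
    docker run --rm --gpus all "${CUDA_IMAGE}" nvidia-smi || \
        echo "warning: docker cannot access the GPU (nvidia runtime missing?)"
fi
```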