For various reasons (demos, tests, …), we may want to run scenarios on hundreds of machines spawning nodes. This cannot be done on the CI nor on a personal computer. Instead, we may want to deploy machines running nodes in the cloud, connect them, and run the test on those machines.
This was already done in the past via Kubernetes. However, several issues were raised:
For more than two years now, the Tezt framework has been used successfully to write integration tests for Tezos. Tezt also contains many high-level functions that help to write integration tests, which are harder to write in bash or in any scripting language using the Tezos API. Core developers of Tezos being familiar with OCaml, and Tezt tests being written in OCaml, this makes the maintenance of those scenarios easier. Moreover, even if tests are not run every time, putting the code into the CI ensures that it still compiles. Finally, Tezt contains a Runner module that allows writing integration tests where nodes are not run on the local machine, but via an SSH connection.
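For illustration, here is a minimal sketch of what running a command on a remote machine through the Runner module could look like; it assumes that Runner.create takes the address and SSH user of the remote machine and that Process.run accepts an optional runner, which should be checked against the current Tezt API:

```ocaml
(* Sketch only: the exact Runner and Process signatures should be checked
   against the current Tezt API; the address and user are placeholders. *)
let check_remote_machine () =
  (* Describe how to reach the remote machine over SSH. *)
  let runner = Runner.create ~ssh_user:"tezos" ~address:"10.0.0.42" () in
  (* Run the command on that machine instead of locally. *)
  Process.run ~runner "uname" ["-a"]
```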
The issue is that, for the moment, we cannot easily interface Tezt with a cloud solution. Consequently, the question is: how should we proceed?
In the following, we assume that AWS is used as our cloud solution.
Tezt-AWS
The following library provides OCaml bindings for AWS, which could be integrated into a module of Tezt.
This could be used to provide an OCaml API to write scenarios that would run on AWS.
Pros:
Cons:
Tezt + AWS scripts
Similar to solution 1, but the handling of AWS would be done via an entrypoint run through a bash script, for example.
Pros:
Cons:
What abstractions does a test developer need to write scenarios? In particular, how do we fetch logs?
A solution could be the following:
(* This call waits for [n] machines to be running on AWS with a compiled
   version of Tezos, likely built by [make build-unreleased] or any other way
   to build those binaries. The call returns once the setup is working. The
   [spawn_aws_machine] call handles the lifespan of the machines
   automatically. The call must fail if the AWS_TOKEN environment variable is
   not provided. For debugging, we may want a localhost parameter so that,
   instead of spawning AWS machines, it spawns local nodes. *)
let n = 5 in
let* configuration = spawn_aws_machine ~n ~tezos_git_branch:"master" in
Lwt_list.iter_s
  (fun i ->
    (* Call the Runner module to run a Tezt executable with the corresponding
       [test]. Notice that the script could change depending on the id of the
       machine. The call is also supposed to forward all the logs emitted by
       the script run on each machine with an [id]. The logs of the
       daemons/clients run by this call are not expected to be forwarded: it
       is the responsibility of the developer to write appropriate
       reporting. *)
    Tezt.run ~test ~id:i ~self:configuration.(i) ~network:configuration)
  (0 -- (n - 1))
We identified multiple use cases for spawning hundreds of nodes (Octez nodes, rollup nodes, etc.). At least one use case requires running complex client commands. We need to wait for some commands to finish before running some other commands on different nodes, so there is a need for synchronization between the various machines.
Tezt seems like a good candidate to solve both problems, in particular thanks to its Runner module. Moreover, by using Tezt to write those scenarios, one can easily test those scenarios locally (with fewer nodes), either by using ?runner:None to directly call the executables, or by using a ~runner that connects to localhost.
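For instance, assuming that the daemons of tezt-tezos accept an optional ?runner argument (to be double-checked), the same helper could target either local or remote nodes, the addresses below being placeholders:

```ocaml
(* Sketch only: assumes Node.init accepts ?runner; the address scheme is a
   placeholder. *)
let init_node ~local i =
  let runner =
    if local then None
    else Some (Runner.create ~ssh_user:"tezos" ~address:(sf "10.0.0.%d" i) ())
  in
  Node.init ?runner []
```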
One issue is that the Runner module of Tezt hasn't been used for a long time. One should check that it still works.
Another issue is that it is unknown whether the Runner module can manage hundreds of SSH connections at the same time in a timely fashion. This should also be tested. If this turns out to be too much, one may want to organize the scenario as a hierarchy, where one main Tezt executable remotely spawns multiple other Tezt executables, but one would then lose the capability to finely synchronize commands.
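As a rough sketch of that hierarchical variant, assuming a Tezt executable (hypothetically named main.exe here) has already been copied onto each machine:

```ocaml
(* Sketch only: [remote_runners] lists the machines and "./main.exe" is a
   hypothetical Tezt executable already deployed on each of them; whatever
   test-selection arguments it needs would be added to the argument list. *)
let run_sub_scenarios remote_runners =
  Lwt_list.iter_p
    (fun runner -> Process.run ~runner "./main.exe" [])
    remote_runners
```

Each remote executable would then drive its own nodes independently, which is why fine-grained synchronization across machines is lost.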
Related to the previous issue, the Runner module currently causes all logs to be sent to the main process. Having the logs of hundreds of executables be sent to the main process would probably be too much. One probably needs to patch Tezt to be able to disable process logs from being sent back to the main process. We would keep a few Log.info (or actually just print_endline) that would be sent back so that we can still see an aggregate of (a summary of) the logs in the main process.
Last but not least, Tezt of course does not solve the issue of spawning hundreds of machines in the first place. This is the topic of the next paragraph.
The input that Tezt needs to be able to dispatch commands to remote machines using SSH is the list of remote machines, with their IP addresses (or hostnames) and the SSH configuration. The SSH configuration can boil down to a username if we assume that the machine that will run the main Tezt executable will be configured to have the private key of a key pair whose public key is installed in the .ssh/authorized_keys file of all other machines.
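For example, this input could be a small generated list of "address user" lines turned into runners when the scenario starts; the format and the Runner.create signature below are assumptions:

```ocaml
(* Sketch only: each line is expected to look like "10.0.0.42 tezos"; the
   format is hypothetical and would be produced by whatever spawns the
   machines. *)
let runners_of_lines lines =
  List.filter_map
    (fun line ->
      match String.split_on_char ' ' (String.trim line) with
      | [address; ssh_user] -> Some (Runner.create ~ssh_user ~address ())
      | _ -> None)
    lines
```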
Because we want the machines to be short-lived (a few hours at most), this input cannot be a fixed list of constants. It has to be generated automatically and given to Tezt, or Tezt can be responsible for spawning the machines at the start of its scenario.
If Tezt is responsible for spawning the machines, it means that it needs to call AWS's API, directly or indirectly.
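If the direct route is chosen, the simplest option is probably to shell out to the AWS CLI from the scenario, along the following lines; the AMI, the instance type, and the availability of Process.run_and_read_stdout are assumptions, and credentials would have to be configured on the machine running the scenario:

```ocaml
(* Sketch only: the AMI id and instance type are placeholders, and the AWS CLI
   must be installed and configured with valid credentials. *)
let spawn_machines ~count =
  let* _json =
    Process.run_and_read_stdout
      "aws"
      [ "ec2"; "run-instances";
        "--image-id"; "ami-0123456789abcdef0";
        "--instance-type"; "t3.medium";
        "--count"; string_of_int count ]
  in
  (* The instance IP addresses would then be extracted from the JSON output
     and turned into runners, as sketched above. *)
  unit
```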
The trade-off is thus between safety and the work required from our DevOps team. A microservice would be much safer, as the responsibility to prevent abuse would fall on DevOps instead of developers, but creating this microservice would be more work.
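With the microservice route, the scenario would instead make a single HTTP call to an internal endpoint that returns the list of machines; the URL, endpoint, and plain-text response format below are entirely hypothetical:

```ocaml
(* Sketch only: the service URL and its "one machine per line" response format
   are hypothetical. *)
let request_machines ~count =
  let* body =
    Process.run_and_read_stdout
      "curl"
      ["-sf"; sf "https://machine-spawner.internal/spawn?count=%d" count]
  in
  Lwt.return (String.split_on_char '\n' body)
```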