Write integration tests that run several nodes

For various reasons, such as demos and tests, we may want to run scenarios on hundreds of machines spawning nodes. This is something that can be done neither on the CI nor on a personal computer. Instead, we may want to deploy machines running nodes in the cloud, connect them, and run the test on those machines.

This was already done in the past via Kubernetes. However, several issues were raised:

  1. Reproducibility was hard to achieve.
  2. Developers are not familiar with the technology, which made scenarios very hard to write and deploy. Even with DevOps help, it was hard to write a scenario and get observables useful to the developer.
  3. The setup could not be maintained easily.

For more than two years now, the Tezt framework has been used successfully to write integration tests for Tezos. Tezt also contains many high-level functions that help to write integration tests, which would be harder to write in bash or any scripting language using the Tezos API. Since core developers of Tezos are familiar with OCaml, and Tezt tests are written in OCaml, maintaining those scenarios is easier. Moreover, even if the tests are not run every time, putting the code into the CI ensures that it still compiles. Finally, Tezt contains a Runner module that allows writing integration tests where nodes are not run on the local machine, but through an SSH connection.

The issue is that, for the moment, we are not able to easily interface Tezt with a cloud solution. Consequently, the question is: how should we proceed?

In the following, we assume using AWS for our cloud solution.

Solution 1 : Tezt-AWS

The following library provides OCaml bindings for AWS, which could be integrated into a module of Tezt.

This could be used to provide an OCaml API to write scenarios that would run on AWS.

Pros:

  • Everything is in OCaml, so it is easier to debug
  • Separation of concerns: we still need to determine what a good API would be, but in terms of abstraction, the inner workings of the API should not be the concern of the developers writing tests

Cons:

  • Who should maintain the API? Who has the knowledge?
  • How can we prevent badly written tests from doing anything harmful?

Solution 2 : Tezt + AWS scripts

A solution similar to Solution 1, except that the handling of AWS would be done via an entrypoint run through a bash script, for example.

Pros:

  • Same as above, except that the AWS part is not in OCaml
  • The maintenance could be done by DevOps?
  • Easier to guard against badly written tests

Cons:

  • Those scripts could be harder to maintain?

API

What abstractions does a test developer need to write scenarios? In particular, how do we fetch the logs?

A solution could be the following:

```ocaml
(* This call waits for [n] machines to be running on AWS with a compiled
   version of Tezos, likely built by [make build-unreleased] or any other
   way to build those binaries. The call returns once the setup is working.
   [spawn_aws_machine] automatically handles the lifespan of the machines.
   The call must fail if the AWS_TOKEN environment variable is not provided.
   For debugging, we may want a [localhost] parameter so that local nodes
   are spawned instead of AWS machines. *)
let* configuration = spawn_aws_machine ~n:5 ~tezos_git_branch:"master" in
Lwt_list.iter_s
  (fun i ->
    (* Call the Runner module, which runs a Tezt executable with the
       corresponding [test]. Notice that the script could change depending
       on the id of the machine. The call is also supposed to forward all
       the logs emitted by the script run on each machine with an [id].
       The logs of the daemons/clients run by this call are not expected
       to be forwarded: it is the responsibility of the developer to write
       appropriate reporting. *)
    Tezt.run ~test ~id:i ~self:configuration.(i) ~network:configuration)
  (0 -- n)
```

Romain's Version of This Document (same idea, different words)

We identified multiple use cases for spawning hundreds of nodes (Octez nodes, rollup nodes, etc.). At least one use case requires running complex client commands. We need to wait for some commands to finish before running some other commands on different nodes, so there is a need for synchronization between the various machines.

Tezt

Tezt seems like a good candidate to solve both problems:

  • complex client commands: it provides many helpers to do that;
  • synchronization: a main Tezt process can be used to run commands through SSH using the Runner module.

Moreover, by using Tezt to write those scenarios, one can easily test those scenarios locally (with fewer nodes), either by using ?runner:None to directly call the executables, or by using a ~runner that connects to localhost.
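The two local-testing modes above could be sketched as follows. This is a hedged sketch: it assumes Tezt's Runner module exposes something like Runner.create ~address, and it uses a hypothetical Node.init helper and a hypothetical scenario function; exact names and signatures may differ.

```ocaml
(* Direct execution: no runner, so Tezt calls the executables itself. *)
let run_directly () =
  let* node = Node.init () in
  scenario node

(* Same scenario, but exercising the SSH code path against localhost,
   which is closer to what the real cloud deployment would do. *)
let run_via_localhost () =
  let runner = Runner.create ~address:"127.0.0.1" () in
  let* node = Node.init ~runner () in
  scenario node
```

The point of the second variant is that the scenario code is identical: only the optional runner changes, so a scenario debugged locally should transfer to remote machines without modification.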

Issues

One issue is that the Runner module of Tezt hasn't been used for a long time. One should check that it still works.

One other issue is that it is unknown whether the Runner module can manage hundreds of SSH connections at the same time in a timely fashion. This should also be tested. If it's too much, one may want to organize the scenario as a hierarchy, where one main Tezt executable remotely spawns multiple other Tezt executables. But one would lose the capability to finely synchronize commands.
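The hierarchical organization mentioned above could look like the following sketch. It assumes Tezt's Runner.create and Process.spawn ?runner; the executable path and the --group argument are hypothetical.

```ocaml
(* One main Tezt executable remotely starts intermediate Tezt
   executables; each intermediate instance then drives its own subset of
   nodes over local connections, reducing the number of SSH connections
   held by the main process. *)
let spawn_intermediate ~address ~group_id =
  let runner = Runner.create ~address () in
  (* [./main.exe] is a Tezt executable assumed to be already deployed on
     the remote machine; [--group] is a hypothetical argument selecting
     which nodes this intermediate instance manages. *)
  Process.spawn ~runner "./main.exe" ["--group"; string_of_int group_id]
```

As noted above, the price of this design is that fine-grained synchronization between individual commands is only possible within one intermediate executable, not across the whole fleet.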

Related to the previous issue, the Runner module currently causes all logs to be sent to the main process. Having the logs of hundreds of executables be sent to the main process would probably be too much. One probably needs to patch Tezt to be able to disable process logs from being sent back to the main process. We would keep a few Log.info (or just print_endline actually) that would be sent back so that we can still see an aggregate of (a summary of) the logs in the main process.

Last but not least, Tezt of course does not solve the issue of spawning hundreds of machines in the first place. This is the topic of the next paragraph.

Spawning the VMs

The input that Tezt needs to be able to dispatch commands to remote machines using SSH is the list of remote machines, with their IP addresses (or hostname) and the SSH configuration. The SSH configuration can boil down to a username if we assume that the machine that will run the main Tezt executable will be configured to have the private key of a key pair for which the public key is installed in the .ssh/authorized_keys file of all other machines.
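Under the assumption above (SSH configuration reduced to a username), the input Tezt needs could be as simple as a list of records, turned into runners at startup. The record type, the field names, and the example addresses are hypothetical; the sketch assumes Runner.create accepts ~ssh_user and ~address.

```ocaml
(* Hypothetical shape of the input: one entry per remote machine. *)
type remote_machine = {address : string; ssh_user : string}

let machines =
  [
    {address = "10.0.0.1"; ssh_user = "tezt"};
    {address = "10.0.0.2"; ssh_user = "tezt"};
  ]

(* Turning the input into Tezt runners. *)
let runners =
  List.map
    (fun m -> Runner.create ~ssh_user:m.ssh_user ~address:m.address ())
    machines
```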

Because we want the machines to be short-lived (a few hours at most), this input cannot be a fixed list of constants. It has to be generated automatically and given to Tezt. Alternatively, Tezt can be responsible for spawning the machines at the start of its scenario.

If Tezt is responsible for spawning the machines, it needs to call AWS's API, directly or indirectly.

  • Directly: Tezt could use the AWS API, or could use AWS's command-line tool to call AWS's API. It means that the machine that runs Tezt needs to be configured to have AWS access, which in turn makes it possible for this machine to spawn hundreds of machines without ensuring that those machines are not too expensive and that they will be killed after the couple of hours during which we need them. So it sounds like a risky idea.
  • Indirectly: one could provide a microservice, managed by our DevOps, which would provide an endpoint that Tezt could call to request some machines. The microservice would guarantee some properties such as: not being able to spawn too many machines, making sure that those machines are not too expensive, making sure that those machines are destroyed after a couple of hours.
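The contract of such a microservice could be captured by types like the following. Everything here is hypothetical (field names, the lifetime mechanism): it only illustrates that the request states what is needed and for how long, while quotas and destruction deadlines are enforced server-side by DevOps.

```ocaml
(* Hypothetical wire format for the machine-provisioning microservice. *)
type machine = {address : string; ssh_user : string}

type provision_request = {
  count : int;            (* number of machines requested *)
  git_branch : string;    (* branch of Tezos to build on the machines *)
  lifetime_hours : int;   (* capped server-side by the microservice *)
}

type provision_response = {
  machines : machine list;  (* what Tezt needs to create its runners *)
  expires_at : float;       (* when the machines will be destroyed *)
}
```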

The trade-off is thus between safety and the work required by our DevOps: a microservice would be much safer, as the responsibility to prevent abuse would fall on DevOps instead of developers, but creating this microservice would be more work.