# E2E Refactoring

###### tags: `testing`, `refactoring`

**Author(s):** Victor Farazdagi (Prysmatic Labs)
*Last Updated: Mar 19, 2021*

[TOC]

## Overview

Currently, our E2E tests are not properly isolated: if we have multiple tests, previously initialized processes can interfere with subsequent tests. This issue can be masked by running tests on separate shards (Bazel provides both network and file system isolation between shards); however, when running in streamed (single shard) mode, as advertised in [endtoend/README.md](https://github.com/prysmaticlabs/prysm/blob/develop/endtoend/README.md), proper isolation is required.

This document outlines a refactoring plan that will make sure the E2E code allows better control over the life-cycles of components (ETH1, Beacon, Validator, Boot nodes), with proper cleanup between runs.

## Flaws in the current implementation

The following issues, which exist in the current implementation, should be fixed if the proposed refactoring is implemented:

### i. Handling process termination

The E2E tests are typically composed of specifying configuration values, starting up various interconnected nodes (using `exec.Command()`), and capturing certain logs from those nodes. Currently, processes started by `exec.Command()` are not stopped at the end of the test, so this needs to be handled.

:::info
**Proposed Solution**

In order to run nodes concurrently, goroutines are used. To make sure that we can stop multiple running nodes at the end of the test function (while also making sure that the function waits for the nodes to really terminate), we can rely on `exec.CommandContext()` -- this function allows us to pass a context to the executed command and, upon cancellation of that context, send a kill signal to the running process. Each node is blocked at `cmd.Wait()` while running, and once the node is terminated or completes its operation, `cmd.Wait()` unblocks. Here is sample code that explains this:

```golang
package main

import (
	"context"
	"log"
	"os/exec"
	"sync"
	"time"
)

func Run(ctx context.Context) error {
	cmd := exec.CommandContext(ctx, "netstat")
	if err := cmd.Start(); err != nil {
		return err
	}
	// The started command is blocked here, up until it
	// is either terminated or completes.
	return cmd.Wait()
}

func main() {
	var wg sync.WaitGroup
	ctx, cancel := context.WithCancel(context.Background())

	wg.Add(1)
	go func() {
		// Signal that the goroutine has completed.
		defer wg.Done()
		// Error should be handled, if not nil.
		if err := Run(ctx); err != nil {
			log.Printf("node error: %v", err)
		}
	}()

	// At this point (assuming that nodes are ready,
	// which will be tackled separately), we query nodes,
	// run evaluators etc. (here we just wait for 3 secs).
	<-time.After(3 * time.Second)

	log.Println("done, halting all spawned nodes")
	cancel()

	// Wait for the child goroutine to finish, which
	// will only occur when the child process has stopped
	// and the call to cmd.Wait has returned.
	// This prevents main() from exiting prematurely.
	wg.Wait()
}
```
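One caveat worth noting here (a minimal sketch, not part of the original example): when the parent context is cancelled, `exec.CommandContext()` kills the process, so `cmd.Wait()` returns a non-nil error (e.g. `signal: killed`) even though this is exactly the shutdown we requested. `Run()` can distinguish an intentional shutdown from a genuine node failure by checking `ctx.Err()`:

```golang
// Variation of Run() above (same imports assumed): treat termination
// caused by our own cancellation as a clean exit, and only report
// errors when the process died on its own.
func Run(ctx context.Context) error {
	cmd := exec.CommandContext(ctx, "netstat")
	if err := cmd.Start(); err != nil {
		return err
	}
	err := cmd.Wait()
	if ctx.Err() != nil {
		// The context was cancelled by the test itself, so the kill
		// signal was expected and this is not a node failure.
		return nil
	}
	return err
}
```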
Even better, we do not need to handle synchronization and error handling manually, and can rely on [https://pkg.go.dev/golang.org/x/sync/errgroup](https://pkg.go.dev/golang.org/x/sync/errgroup). Basically, it provides very similar means to run sub-processes of a single process, capture errors, if any, and send a cancellation signal to all the sub-processes:

```golang
ctx, cancel := context.WithCancel(context.Background())
g, ctx := errgroup.WithContext(ctx)

// Start a sub-process, using the very same `Run()`
// method defined above:
g.Go(func() error {
	return Run(ctx)
})

// In a separate goroutine we have a ticker running our
// evaluators by connecting to various nodes or by
// waiting for some sentinel values in their logs etc.
g.Go(func() error {
	// Once we have done all the testing, we can cancel
	// the parent context.
	cancel()
	return nil
})

// Wait for all tasks to complete.
// Any task returning an error will result in the
// context being closed, and the error returned by g.Wait().
if err := g.Wait(); err != nil {
	t.Fatal(err)
}
```

Essentially, it is the very same thing we accomplished manually using `sync.WaitGroup`, but more elegantly (and without overly abusing context cancellation, see [Context isn't for cancellation](https://dave.cheney.net/2017/08/20/context-isnt-for-cancellation)).
:::

### ii. Better signaling of node readiness

Each process takes some time to start (and for different types of nodes, different log messages are used as sentinels that inform us a node is fully functional and ready to accept connections). So, ideally, we need some way of knowing that a node is ready.

:::info
**Proposed Solution**

First of all, we will need to define a `struct` for each type of node: currently, we simply call `StartEth1Node()`, `StartBeaconNode()` etc. But since we want to keep some internal state (a `started bool` flag), we are better off abstracting nodes into separate data structures.

Since `errgroup`'s `Go()` method accepts a function that returns an error, node start routines will have the `func (n *SomeNode) Start(ctx, config) error` signature. When `exec.CommandContext()` is called inside that function, we generally wait for some sentinel value to be produced in the logs: if that value is not found within the allocated amount of time, an error is returned and the whole test is halted (again, a property of `errgroup`). If, however, the sentinel value is found, the internal flag is set to `true`, and by using a blocking `func (n *SomeNode) Started(ctx) bool` method, other goroutines will be able to proceed, knowing that the node is fully started.
:::
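As a rough sketch of how such a node type could look (the type name, binary name, and sentinel log line below are purely illustrative, the timeout/error path is omitted, and the obvious `bufio`/`context`/`os/exec`/`strings` imports are assumed):

```golang
// BeaconNode wraps a beacon chain process (illustrative only).
type BeaconNode struct {
	started chan struct{} // closed once the sentinel log line is observed
}

// Start launches the process and blocks until it exits or ctx is cancelled.
func (n *BeaconNode) Start(ctx context.Context) error {
	cmd := exec.CommandContext(ctx, "beacon-chain" /* flags omitted */)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}

	// Watch the process logs for the sentinel value signalling readiness.
	go func() {
		scanner := bufio.NewScanner(stdout)
		for scanner.Scan() {
			if strings.Contains(scanner.Text(), "Chain started") { // hypothetical sentinel
				close(n.started)
				return
			}
		}
	}()

	return cmd.Wait()
}

// Started blocks until the node is ready to accept connections,
// or until ctx expires.
func (n *BeaconNode) Started(ctx context.Context) bool {
	select {
	case <-n.started:
		return true
	case <-ctx.Done():
		return false
	}
}
```

A closed channel is used here instead of a bare `started bool` so that `Started()` can block without polling; constructing the node with `started: make(chan struct{})` is assumed.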
### iii. Handling errors

Finally, each node can fail, either during startup or while running, so we need to capture errors from several (concurrently running) routines.

:::info
**Proposed Solution**

Currently, we simply call `t.Fatal()` everywhere, leaking our `t` all over the place. With `errgroup`, methods that fail will simply return an error, the context will be cancelled (and all other nodes terminated cleanly), and errors will be available in the main goroutine (which waits on `g.Wait()`).
:::

### iv. General code cleanup

The current implementation of `end_to_end.go:RunEndToEndTest()` is a little bit bloated and not easy to understand. It can be sliced into several different methods, to become more approachable.

:::info
**Proposed Solution**

There are a number of things proposed:

- Define a `ComponentInterface` interface (which will expose the `Start()` and `Started()` methods), so that all the services have known and consistent behaviour (see the sketch after this list).
- Define `Eth1Node`, `BeaconNode`, `ValidatorNode`, `SlasherNode`, `WhateverNode` structs implementing that interface: all of them will have predictable start/ready methods, all of them will block on `cmd.Wait()`, and all of them will terminate on parent context cancellation.
- Update the `RunEndToEndTest()` procedure to use the defined services and, probably, slice it into manageable pieces; otherwise it is hard to define granular tests (that use only a few components). We can keep this update to `RunEndToEndTest()` minimal, to make sure we are not breaking anything.
:::
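For illustration only, the pieces above could fit together roughly as follows (the `runComponents` helper, the `evaluators` callback, and the exact interface shape are assumptions to be settled during implementation; imports are elided):

```golang
// ComponentInterface is the common behaviour every E2E component exposes.
type ComponentInterface interface {
	// Start launches the component and blocks until it exits.
	Start(ctx context.Context) error
	// Started blocks until the component is ready (or ctx expires).
	Started(ctx context.Context) bool
}

// runComponents is a hypothetical helper RunEndToEndTest() could be built around.
func runComponents(ctx context.Context, evaluators func(ctx context.Context) error, components ...ComponentInterface) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	g, ctx := errgroup.WithContext(ctx)

	// Start every component; each Start() blocks until its process exits.
	for _, c := range components {
		c := c // capture range variable for the closure
		g.Go(func() error {
			return c.Start(ctx)
		})
	}

	// Run evaluators once all components report readiness,
	// then shut everything down.
	g.Go(func() error {
		for _, c := range components {
			if !c.Started(ctx) {
				return errors.New("component failed to start in time")
			}
		}
		err := evaluators(ctx)
		cancel() // terminate all components once testing is done
		return err
	})

	return g.Wait()
}
```

For this to return cleanly, each component's `Start()` would treat cancellation-triggered termination as a clean exit, as sketched in section i above; `RunEndToEndTest()` could then boil down to building the component list from the test config and calling `runComponents()` with the evaluators.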