
# [How to improve integration tests (BFT-363)](https://linear.app/matterlabs/issue/BFT-363/research-how-to-improve-integration-tests)

The current integration tests are in-process p2p tests. This approach is lightweight and designed to simulate a network while asserting basic test cases. However, it has limitations:

* it cannot simulate real-world network conditions, including adversarial ones
* it cannot scale horizontally
* it does not allow network participants (including malicious ones) to be added or removed manually over time

## Enter System tests

System tests raise the level of abstraction under test. In this case, the unit under test is the entire node software, treated as a black box and run in an isolated process, container, or perhaps machine.

This is not the natural mode for running [era-consensus](https://github.com/matter-labs/era-consensus), which is intended to be used as a library by [zksync-era](https://github.com/matter-labs/zksync-era/) rather than run stand-alone. However, it is not uncommon to add test-only layers and execution modes in order to test consensus components, and since the networking stack is to be managed by [era-consensus](https://github.com/matter-labs/era-consensus) alone, it is actually the best way to test it.

## Test environment

A scalable yet flexible approach is to combine local environments (dev machines, CI runners) with K8s clusters (which can also be run locally, via [minikube](https://minikube.sigs.k8s.io/docs/start/)). Orchestrating nodes without Docker containers or without K8s clusters would miss out on many out-of-the-box features (including [Chaos Mesh](https://chaos-mesh.org/)).

Seamless cloud deployment is critical because it cannot be assumed that every dev machine is able to run these tests, especially if they include time-sensitive assertions or non-trivial scaling.

## Test framework

### Build

Build the Docker image from source.

### Deploy

Use the Docker image to deploy a network on top of a K8s cluster. Each pod within the cluster will be accessible via the K8s API, including the node's logs. This module can also be used to run networks continuously for longevity testing, without automated tests attached.

### Monitor

Use Loki and Grafana for log aggregation and dashboards. This should be used mainly for manual inspection and debug analysis.

### Test

The automated tests should be implemented as clients of an ephemeral, short-lived network deployment. Commands to nodes (if needed) and the queries used to derive test assertions should go through a test-only RPC API layer added to the node (or a different IPC solution).

Test assertions should not rely on logs. It is an easy solution, but it does not age well. Test assertions should also minimize reliance on timeouts, which lead to flakiness.

### Byzantine

The easiest and most effective way to simulate adversarial conditions is at the network level, via [Chaos Mesh](https://chaos-mesh.org/). However, some cases may require a non-canonical, malicious implementation of a network participant. While it is easy to fork the codebase, the main challenge is keeping the fork updated and compatible over time. A few options:

* Manually fork, spawn and join the network (on longevity tests only).
* Integrate the behavior into the canonical implementation (as test-only), activated via config or an RPC command.
* Fork/branch the canonical implementation and keep it updated over time. The test framework will include it as a byzantine node in the network deployment.
* For small changes, maintain a git patch and apply it to construct the malicious node. The test framework will include it as a byzantine node in the network deployment.
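
For the last two options, the framework needs to decide which nodes in a deployment run the modified build. The sketch below shows one possible shape for that step, assuming the canonical and byzantine builds are published as two separate Docker images. `NodeSpec`, `plan_network`, the `node-N` naming scheme and the image names in the usage example are all hypothetical and not part of era-consensus or any existing framework.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class NodeSpec:
    """One node in the planned deployment (hypothetical structure)."""

    name: str
    image: str
    byzantine: bool


def plan_network(
    size: int,
    byzantine_indices: FrozenSet[int],
    canonical_image: str,
    byzantine_image: str,
) -> List[NodeSpec]:
    """Assign an image to every node, using the modified build for byzantine ones."""
    if size <= 0:
        raise ValueError("network size must be positive")

    # Reject indices that do not correspond to any node in the network.
    out_of_range = sorted(i for i in byzantine_indices if not 0 <= i < size)
    if out_of_range:
        raise ValueError(f"byzantine indices out of range: {out_of_range}")

    return [
        NodeSpec(
            name=f"node-{i}",  # naming scheme is an assumption
            image=byzantine_image if i in byzantine_indices else canonical_image,
            byzantine=i in byzantine_indices,
        )
        for i in range(size)
    ]
```

A test could then request, for example, `plan_network(4, frozenset({3}), "consensus-node:canonical", "consensus-node:byzantine")` and hand the resulting specs to the Deploy module. Keeping the plan as plain data means the same test can run against a forked image or a patched one without changing the test itself.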