# Infra testing tool

[ToC]

## Summary

The goal is to have a simple config file in which we can define which peers we want to instantiate, their specificities and the tests they will run, plus a simple way to launch a test run. We also need a way to centralize all the logs from these nodes so that we can process them; the processing will be done by the Berty team.

There is no need to develop a graphical interface to select the different parameters: the tool will only be used by devs, and nothing is better for us than text files that we can easily copy, modify, version, etc.

Basically, we need to be able to:

- define the type of nodes to instantiate: peer / Berty user, relay, bootstrap, replication and rdvp
- define their connections: which peers they are connected to, the "chaos" parameters of the connection instability (random disconnection, unstable bandwidth, etc.), the type of transport used (QUIC, TCP, UDP, WebSocket, etc.)
- define their uptime: cycling on something like 1 min up, 2 min down, etc., or up during the whole test, or "chaos" random uptime
- define the groups (in the Berty protocol sense of the term) of which these peers are members
- pass custom flags to a node (e.g. by default your tool runs the command `berty daemon -flag1 -flag2` and, in addition, we can manually pass `-flag3` to a specific node through the config file)
- define their test case: send one text message every second, two media messages of X MB every two minutes, etc.

For any interaction between your test tool and the nodes (joining a Berty group, choosing a transport type, sending messages, etc.), you will work in full collaboration with the team: if you have any need, either the solution already exists and we explain to you how to do it, or we implement a new API call, a flag parameter or anything else.

## Logs and metrics

For the logs, you simply have to store everything the binary outputs, classified by node ID, test session ID, etc. (putting everything in a folder on some storage would be enough at first), something like this:

```
Logs
├── session_1337
│   ├── node_1748.log
│   ├── node_7583.log
│   └── node_7832.log
├── session_4242
│   ├── node_3284.log
│   ├── node_4859.log
│   └── node_8493.log
(...)
```

You should also add your "chaos logs" (bandwidth limitation to X Kbps, node shutdown for X seconds, etc.) either:

- by enriching the logs of each node with this information
- or in separate files

Metrics are neither essential nor a priority, but it would be a nice bonus to be able to track, for each node, its usage in terms of CPU, RAM, network, etc.

## Config file specs

I will describe our needs below, but don't hesitate to give us your opinion if you think we can improve this proposal with something simpler, more flexible and/or more efficient.

The config file should look something like this:

```yaml
Good_Relay:
  type: relay
  amount: 1
  connections:
    - to: internet
      transport: tcp
      bandwidth: 100Mbps
      reliability: 0,0

Bad_Relays:
  type: relay
  amount: 3
  connections:
    - to: internet
      transport: tcp
      bandwidth: 1Mbps-10Mbps
      reliability: 5,50

Bootstrap:
  type: bootstrap
  amount: 1
  connections:
    - to: internet
      transport: tcp
      bandwidth: 100Mbps
      reliability: 0,0

Rendezvous:
  type: rdvp
  amount: 1
  connections:
    - to: internet
      transport: quic
      bandwidth: 100Mbps
      reliability: 0,0

Replication_Servers:
  type: replication
  amount: 3
  connections:
    - to: internet
      transport: tcp
      bandwidth: 100Mbps
      reliability: 0,0
  groups:
    - name: group_1
    - name: group_2
  routers:
    - type: rdvp
      address: Rendezvous
    - type: bootstrap
      address: Bootstrap

LAN_Peers:
  type: peer
  amount: 20
  flag: -p2p.mdns
  connections:
    - to: lan_1
      transport: tcp
      bandwidth: 100Mbps
      reliability: 0,0
  groups:
    - name: group_1
      tests:
        - type: text
          size: 10KB
          every: 15
        - type: media
          size: 2MB
          every: 120
    - name: group_2
      tests:
        - type: text
          size: 10KB-60KB
          every: 1-30

Bridge_LAN_Internet:
  type: peer
  amount: 1
  connections:
    - to: lan_1
      transport: tcp
      bandwidth: 100Mbps
      reliability: 0,0
    - to: internet
      transport: quic
      bandwidth: 10Mbps
      reliability: 0,0
  groups:
    - name: group_1
    - name: group_2
  routers:
    - type: rdvp
      address: Rendezvous
    - type: rdvp
      address: '/ip4/51.159.21.214/udp/4040/quic/p2p/QmdT7AmhhnbuwvCpa5PH1ySK9HJVB82jr3fo1bxMxBPW6p'

Good_Cellular_Peers:
  type: peer
  amount: 20
  connections:
    - to: internet_no_inbound
      transport: tcp
      bandwidth: 1Mbps-10Mbps
      reliability: 20,200
  groups:
    - name: group_1
      tests:
        - type: text
          size: 10KB
          every: 30
  routers:
    - type: relay
      address: Good_Relay
    - type: rdvp
      address: Rendezvous

Bad_Cellular_Peers:
  type: peer
  amount: 20
  connections:
    - to: internet_no_inbound
      transport: tcp
      bandwidth: 5Kbps-200Kbps
      reliability: 20,3
  groups:
    - name: group_1
      tests:
        - type: text
          size: 10KB
          every: 30
  routers:
    - type: relay
      address: Bad_Relays
    - type: relay
      address: '/ip4/51.159.21.214/udp/4040/quic/p2p/QmdT7AmhhnbuwvCpa5PH1ySK9HJVB82jr3fo1bxMxBPW6p'
    - type: rdvp
      address: Rendezvous
    - type: bootstrap
      address: Bootstrap
```

Each top-level block is a set of nodes with an arbitrary name: `Replication_Servers`, `Bad_Cellular_Peers`, etc.

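As an illustration of how a runner might treat these named sets, here is a minimal Python sketch that expands each set into `amount` individual nodes and resolves a router `address` (see `routers` below) that points either at a set name or at a multiaddr. The function names, the `<set name>_<index>` naming scheme and the error messages are assumptions made for this example, not part of the proposal; the config is given as an already-parsed dict to keep the sketch free of third-party YAML libraries.

```python
from typing import Dict, List


def expand_sets(config: Dict[str, dict]) -> Dict[str, List[str]]:
    """Expand every named set into `amount` individual node names."""
    nodes: Dict[str, List[str]] = {}
    for set_name, spec in config.items():
        amount = spec.get("amount")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(amount, bool) or not isinstance(amount, int) or amount < 1:
            raise ValueError(f"{set_name} set needs a positive integer 'amount'")
        # Hypothetical naming scheme: <set name>_<index>.
        nodes[set_name] = [f"{set_name}_{i}" for i in range(amount)]
    return nodes


def resolve_router(address: str, nodes: Dict[str, List[str]]) -> List[str]:
    """Return the node names behind a set name, or the address itself."""
    if address in nodes:
        return nodes[address]
    if address.startswith("/"):
        # Looks like a multiaddr; real validation is out of scope here.
        return [address]
    raise ValueError(f"unknown router address: {address}")


# Already-parsed excerpt of the example config above.
config = {
    "Bad_Relays": {"type": "relay", "amount": 3},
    "Rendezvous": {"type": "rdvp", "amount": 1},
}
nodes = expand_sets(config)
print(nodes["Bad_Relays"])
print(resolve_router("Rendezvous", nodes))
```

Resolving set names at load time means a reference to a set that does not exist is caught before any node is started.
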
### type

* Mandatory
* Can be one of:
    * `peer` a standard Berty user
    * `replication` a server that joins a group and provides high availability to other members; it can't decrypt messages
    * `relay` a server providing NAT traversal to other peers using a TURN-like protocol
    * `rdvp` a DNS-like server used to register and retrieve peers related to a namespace (a kind of key-value store)
    * `bootstrap` an entry point to the Berty network, used to exchange peers during init

### amount

* Mandatory
* Must be a positive Int defining the number of nodes to instantiate

### flag

* Optional
* An arbitrary string that will be passed as a command line flag when launching the Berty program

### connections

* Mandatory
* A list of connections in terms of network reachability. PeerA connected to PeerB == PeerA can ping PeerB. No need to open any kind of socket, libp2p will handle this part.

#### to (connections)

* Mandatory
* Can be one of:
    * `internet` connected to and reachable from the internet
    * `internet_no_inbound` connected to the internet but not reachable from it / no inbound connection allowed
    * an arbitrary name, e.g. `lan_1`. All sets of peers connected to the same network name will be connected to each other

#### transport (connections)

* Mandatory
* Defines the protocol on which libp2p will listen for a given connection. Can be one of:
    * `tcp`
    * `udp`
    * `quic`
    * `websocket`
    * `p2p-circuit`

#### bandwidth (connections)

* Mandatory
* Can be:
    * a fixed value: `10Kbps`, `1Mbps`, etc.
    * a range within which the chaos script will make the bandwidth oscillate: `10Kbps-10Mbps`

#### reliability (connections)

* Mandatory
* A pair of ints defining a period of time in seconds and a probability of becoming unreachable, e.g. `120,2` == every 120 seconds, 1 chance in 2 of becoming unreachable.
    * if set to `0,0`, the connection will never drop
    * if a node is already unreachable, the same rule applies, e.g. `30,4` == every 30 seconds, 1 chance in 4 of remaining unreachable, so 3 chances in 4 of becoming reachable

### groups

* Optional (validation should fail if set for a relay, bootstrap or rdvp)
* A list of Berty groups of which the node is a member.

#### name (groups)

* Mandatory
* An arbitrary name for the group to join as a member.

#### tests (groups)

* Optional
* A test suite (send messages of a given size, every X seconds, etc.) to run on this group.

##### type (groups->tests)

* Mandatory
* Can be one of:
    * `text` random text message
    * `media` random media attachment

##### size (groups->tests)

* Mandatory
* The size of the message to send, can be:
    * a fixed value: `10KB`, `1MB`, etc.
    * a range within which the chaos script will make the size oscillate: `10KB-200KB`

##### every (groups->tests)

* Mandatory
* An Int defining the interval in seconds between two messages, e.g. `1`, `15`, etc.

### routers

* Optional (validation should fail if set for a relay, bootstrap or rdvp)
* A list of routers (relay, bootstrap and rdvp) to specify in the node config.

#### type (routers)

* Mandatory
* Can be one of:
    * `relay`, will be set as relay in the node config
    * `rdvp`, will be set as rdvp in the node config
    * `bootstrap`, will be set as bootstrap in the node config

#### address (routers)

* Mandatory
* Can be:
    * the name of a set of peers (e.g. in the config file example above: `Bootstrap`, `Rendezvous`)
    * a valid multiaddr (you can easily test this by using the [NewMultiaddr method](https://pkg.go.dev/github.com/multiformats/go-multiaddr#NewMultiaddr))

## Workflow

As above, this is just to give you an idea of what we have in mind; feel free to propose changes that would make it more relevant.

```bash
> ssh infra_test@svc.berty.io
Welcome to Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-48-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

Last login: Thu Apr 15 07:35:35 2021 from 80.127.83.12

> ls logs
session_1337 session_4242

> head -10 tests/test1.yaml
Bridge_LAN_Internet:
  type: peer
  connections:
    - to: lan_1
      transport: tcp
      bandwidth: 100Mbps
      reliability: 0,0
    - to: internet
      transport: quic
      bandwidth: 10Mbps

> ./infra_testing_run tests/test1.yaml
error: you must specify a duration
usage: ./infra_testing_run <config_file> <duration>

> ./infra_testing_run tests/test1.yaml 3600
error: Bridge_LAN_Internet set is missing 'amount' attribute

> sed -i '3i\  amount: 10' tests/test1.yaml

> head -10 tests/test1.yaml
Bridge_LAN_Internet:
  type: peer
  amount: 10 # Added using sed command above
  connections:
    - to: lan_1
      transport: tcp
      bandwidth: 100Mbps
      reliability: 0,0
    - to: internet
      transport: quic

> ./infra_testing_run tests/test1.yaml 3600
session_1234 started:
- instantiating sets of nodes: done
- setting up the network: done
- joining Berty groups: done
- running test cases: done
- collecting logs: done
- shutting down everything: done

> ls logs
session_1234 session_1337 session_4242

> tree logs/session_1234
session_1234
├── config_used # Copy of tests/test1.yaml
├── node_1748.log
├── node_7583.log
└── node_7832.log
```

## Ideas v2

1. Add new features to the config file:
    - Replication servers are added by peers instead of joining groups by themselves
    - Peers can send / auto-accept contact requests
    - Tests can be launched on MultiMember groups and Contact groups
2. Add other limitations / chaos properties to connections, though that would require using an OS-level tool (not the AWS infra management tool), like:
    - ping / connection delay
    - % of packets dropped
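
For comparison with the extra chaos properties listed above, the existing `reliability` rule from the config specs can be read as a memoryless draw repeated every period. The Python sketch below is one possible interpretation, not a reference implementation: the function names are hypothetical, and it assumes that every connection starts reachable and that mixed values such as `0,5` are invalid, neither of which the specs state.

```python
import random
from typing import List, Tuple


def parse_reliability(value: str) -> Tuple[int, int]:
    """Parse 'period,one_in' (e.g. '120,2') into two ints."""
    period_s, one_in = (int(part) for part in value.split(","))
    both_zero = period_s == 0 and one_in == 0
    if not both_zero and (period_s < 1 or one_in < 1):
        # Assumption: mixed values such as '0,5' are treated as invalid.
        raise ValueError(f"invalid reliability: {value!r}")
    return period_s, one_in


def reachability_schedule(
    value: str, duration_s: int, rng: random.Random
) -> List[Tuple[int, bool]]:
    """Return (time in seconds, reachable) pairs for one connection."""
    period_s, one_in = parse_reliability(value)
    schedule = [(0, True)]  # Assumption: every connection starts reachable.
    if period_s == 0:
        return schedule  # '0,0': the connection never drops.
    for t in range(period_s, duration_s + 1, period_s):
        # 1 chance in `one_in` of being unreachable for the next period,
        # whatever the previous state was (the "same rule applies" case).
        reachable = rng.randrange(one_in) != 0
        schedule.append((t, reachable))
    return schedule


# Seeded generator so that a chaos run can be replayed.
for t, reachable in reachability_schedule("30,4", 120, random.Random(42)):
    print(t, "reachable" if reachable else "unreachable")
```

Passing a seeded `random.Random` is a design choice for reproducibility: the same seed yields the same drop schedule, and the schedule itself can double as the "chaos log" for that connection.
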