# Eth2 Clients Experiment Summary
## Executions and tests performed
### Versions
[Prysm 2.0.6](https://github.com/prysmaticlabs/prysm/releases/tag/v2.0.6)
[Lighthouse 2.1.4](https://github.com/sigp/lighthouse/releases/tag/v2.1.4)
[Teku 22.3.2](https://github.com/ConsenSys/teku/releases/tag/22.3.2)
[Nimbus 1.6.0](https://github.com/status-im/nimbus-eth2/commit/2b0957f32a115eb9dee7fca9d1aeb6703ceae4d0)
[Lodestar 0.34.0](https://github.com/ChainSafe/lodestar/releases/tag/v0.34.0)
[Grandine 0.2.0 (several beta versions)](https://github.com/sifraitech/grandine)
### Configurations
All machines were monitored using Prometheus Node Exporter and a custom Python script.
All of them were connected to an already synced geth instance running on a separate machine. This geth instance was the same for all clients and experiments.
The only exception is the Kiln experiment, as the Kiln guide describes the need to deploy a geth client on the same machine.
https://notes.ethereum.org/@launchpad/kiln
To deploy geth in the Kiln machines, we have used the following command:
```
./go-ethereum/build/bin/geth --datadir geth-datadir --http --http.api='engine,eth,web3,net,debug' --http.corsdomain '*' --networkid=1337802 --syncmode=full --authrpc.jwtsecret=/tmp/jwtsecret --bootnodes enode://c354db99124f0faf677ff0e75c3cbbd568b2febc186af664e0c51ac435609badedc67a18a63adb64dacc1780a28dcefebfc29b83fd1a3f4aa3c0eb161364cf94@164.92.130.5:30303 --override.terminaltotaldifficulty 20000000000000
```
During our experiment, the Kiln network suffered an incident in which many miners entered the network, which would have made the merge happen earlier than planned. To avoid this, Kiln nodes were asked to override the terminal total difficulty so that the merge could happen at the scheduled time.
Therefore, in some cases we had to add an additional flag, while in others updating the config file was enough.
#### Prysm
##### Default sync (used for standard machine, fat node and raspberry PI)
config.yaml:
```
monitoring-host: 0.0.0.0
http-web3provider: http://XX.XX.XXX.XXX:8545/
slots-per-archive-point: 2048
```
##### All-topics
We have added ```subscribe-all-subnets: true``` to the configuration file.
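For reference, a minimal sketch of the resulting config.yaml in all-topics mode (the same values as above with the extra option appended):
```
monitoring-host: 0.0.0.0
http-web3provider: http://XX.XX.XXX.XXX:8545/
slots-per-archive-point: 2048
# added to subscribe to all attestation subnets
subscribe-all-subnets: true
```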
##### Archival mode
We have changed the ```slots-per-archive-point``` parameter to 64.
##### Kiln
Following the Kiln guide, our configuration was the following:
```
bazel run //beacon-chain -- \
--genesis-state $PWD/../genesis.ssz \
--datadir $PWD/../datadir-prysm \
--http-web3provider=/home/crawler/kiln/merge-testnets/kiln/geth-datadir/geth.ipc \
--execution-provider=/home/crawler/kiln/merge-testnets/kiln/geth-datadir/geth.ipc \
--chain-config-file=$PWD/../config.yaml \
--bootstrap-node=enr:-Iq4QMCTfIMXnow27baRUb35Q8iiFHSIDBJh6hQM5Axohhf4b6Kr_cOCu0htQ5WvVqKvFgY28893DHAg8gnBAXsAVqmGAX53x8JggmlkgnY0gmlwhLKAlv6Jc2VjcDI1NmsxoQK6S-Cii_KmfFdUJL2TANL3ksaKUnNXvTCv1tLwXs0QgIN1ZHCCIyk \
--jwt-secret=/tmp/jwtsecret \
--monitoring-host 0.0.0.0
```
#### Lighthouse
##### Default sync (used for standard machine, fat node and raspberry PI)
```
lighthouse bn --http --metrics --metrics-address 0.0.0.0 --eth1-endpoints http://XX.XX.XXX.XXX:8545/ --slots-per-restore-point 2048 --datadir /mnt/diskChain/.lighthouse/mainnet
```
##### All-topics
We have added the parameter ```--subscribe-all-subnets``` to the execution command.
##### Archival mode
We have changed ```--slots-per-restore-point``` to 64.
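For illustration, the full archival-mode command is then the default command above with the restore-point interval lowered:
```
# default command with --slots-per-restore-point reduced from 2048 to 64
lighthouse bn --http --metrics --metrics-address 0.0.0.0 --eth1-endpoints http://XX.XX.XXX.XXX:8545/ --slots-per-restore-point 64 --datadir /mnt/diskChain/.lighthouse/mainnet
```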
##### Kiln
Following the kiln guide, the execution command is as follows:
```
lighthouse \
--spec mainnet \
--network kiln \
--debug-level info \
beacon_node \
--datadir ./testnet-lh1 \
--eth1 \
--http \
--http-allow-sync-stalled \
--metrics --metrics-address 0.0.0.0 \
--merge \
--execution-endpoints http://127.0.0.1:8551 \
--enr-udp-port=9000 \
--enr-tcp-port=9000 \
--discovery-port=9000 \
--jwt-secrets=/tmp/jwtsecret
```
#### Teku
After speaking to the developer team, we were advised to configure the JVM memory allocation.
This was done using the following command:
```
export JAVA_OPTS="-Xmx5g -Xms5g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$CLIENT_BASE_DIR/heap_data"
```
##### Default sync (used for standard machine, fat node and raspberry PI)
config.yaml:
```
network: "mainnet"
eth1-endpoint: ["http://51.79.142.201:8545/"]
metrics-enabled: true
rest-api-docs-enabled: true
metrics-port: 8007
p2p-port: 9001
data-storage-archive-frequency: 2048
metrics-interface: "0.0.0.0"
metrics-host-allowlist: ["*"]
rest-api-enabled: true
rest-api-host-allowlist: ["*"]
rest-api-interface: "0.0.0.0"
rest-api-port: 5051
```
##### All-topics
We have added the option ```p2p-subscribe-all-subnets-enabled```.
##### Archival mode
We have removed the ```data-storage-archive-frequency``` parameter from the configuration file and added the option ```data-storage-mode: "archive"```.
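For reference, a sketch of the archival-mode changes relative to the config.yaml shown above:
```
# removed:
#   data-storage-archive-frequency: 2048
# added:
data-storage-mode: "archive"
```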
##### Kiln
Following the kiln guide, the execution command is:
```
./teku/build/install/teku/bin/teku \
--data-path datadir-teku \
--network config.yaml \
--p2p-discovery-bootnodes enr:-Iq4QMCTfIMXnow27baRUb35Q8iiFHSIDBJh6hQM5Axohhf4b6Kr_cOCu0htQ5WvVqKvFgY28893DHAg8gnBAXsAVqmGAX53x8JggmlkgnY0gmlwhLKAlv6Jc2VjcDI1NmsxoQK6S-Cii_KmfFdUJL2TANL3ksaKUnNXvTCv1tLwXs0QgIN1ZHCCIyk \
--ee-endpoint http://localhost:8551 \
--Xee-version kilnv2 \
--rest-api-enabled true --metrics-enabled=true --metrics-host-allowlist=* --metrics-interface=0.0.0.0 \
--validators-proposer-default-fee-recipient=0x2Ad2f1999A99F6Af12D4634e2C88a0891c3013e8 \
--ee-jwt-secret-file /tmp/jwtsecret \
--log-destination console
```
#### Nimbus
##### Default sync (used for standard machine, fat node and raspberry PI)
The execution command is:
```
run-mainnet-beacon-node.sh --web3-url="http://XX.XX.XXX.XXX:8545/" --metrics-address=0.0.0.0 --metrics --tcp-port=9002 --udp-port=9003 --num-threads=4 --data-dir=/home/crawler/.nimbus-db/
```
##### All-topics
We have added the parameter ```--subscribe-all-subnets``` to the execution command.
##### Archival mode
There is no parameter to adjust the interval of slots between stored states.
##### Kiln
Following the Kiln guide, the execution command is as follows:
```
nimbus-eth2/build/nimbus_beacon_node \
--network=./ \
--web3-url=ws://127.0.0.1:8551 \
--rest --validator-monitor-auto \
--metrics --metrics-address=0.0.0.0 --data-dir=./nimbus-db \
--log-level=INFO \
--jwt-secret=/tmp/jwtsecret
```
#### Lodestar
##### Default sync (used for standard machine, fat node and raspberry PI)
```
sudo docker run -p 9596:9596 -p 8006:8006 -p 9005:9005 -v /mnt/diskBlock/lodestar:/root/.local/share/lodestar/ chainsafe/lodestar:v0.34.0 beacon --network mainnet --metrics.enabled --metrics.serverPort=8006 --network.localMultiaddrs="/ip4/0.0.0.0/tcp/9005" --network.connectToDiscv5Bootnodes true --logLevel="info" --eth1.providerUrls="http://XX.XX.XXX.XXX:8545/" --api.rest.host 0.0.0.0
```
##### All-topics
We have added the parameter ```--network.subscribeAllSubnets true``` to the execution command.
##### Archival mode
There is no documented archival mode for Lodestar.
##### Kiln
Following the Kiln guide, the execution command is as follows:
```
./lodestar beacon --rootDir=../lodestar-beacondata --paramsFile=../config.yaml --genesisStateFile=../genesis.ssz --eth1.enabled=true --execution.urls=http://127.0.0.1:8551 --network.connectToDiscv5Bootnodes --network.discv5.enabled=true --jwt-secret=/tmp/jwtsecret --network.discv5.bootEnrs=enr:-Iq4QMCTfIMXnow27baRUb35Q8iiFHSIDBJh6hQM5Axohhf4b6Kr_cOCu0htQ5WvVqKvFgY28893DHAg8gnBAXsAVqmGAX53x8JggmlkgnY0gmlwhLKAlv6Jc2VjcDI1NmsxoQK6S-Cii_KmfFdUJL2TANL3ksaKUnNXvTCv1tLwXs0QgIN1ZHCCIyk --metrics.enabled --metrics.serverPort=8006
```
#### Grandine
During our Grandine experiments we were provided with several executables, each implementing different functionality.
Although all of them belong to the same version 0.2.0, several beta builds were used.
##### Default sync (used for standard machine, fat node and raspberry PI)
The execution command is as follows:
```
grandine-0.2.0 --metrics --archival-epoch-interval 64 --eth1-rpc-urls http://XX.XX.XXX.XXX:8545/ --http-address 0.0.0.0 --network mainnet
```
##### All-topics
We have added the ```--subscribe-all-subnets``` parameter to the execution command.
##### Archival mode
We have changed the ```--archival-epoch-interval``` parameter from 64 to 2.
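With this change, the archival-mode command is the default one above with the smaller interval:
```
# default command with the archival interval lowered from 64 to 2 epochs
grandine-0.2.0 --metrics --archival-epoch-interval 2 --eth1-rpc-urls http://XX.XX.XXX.XXX:8545/ --http-address 0.0.0.0 --network mainnet
```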
##### Kiln
The execution command:
```
sudo docker run --name grandine_container -v /home/crawler/.grandine:/root/.grandine -v /tmp/jwtsecret:/tmp/jwtsecret --network=host sifrai/grandine:latest grandine --eth1-rpc-urls http://localhost:8551/ --network kiln --jwt-secret=/tmp/jwtsecret --keystore-dir /root/.grandine/keys --keystore-password-file /root/.grandine/secrets
```
### Tests performed
We have executed each client in default sync mode (standard machine, fat machine and raspberry PI), in all-topics mode and on the Kiln network.
During all these experiments the goal was to measure the performance and hardware resource consumption in each mode and on each machine.
We have also executed some clients in archival mode: in this case we did not measure the hardware consumption; the goal was instead to perform an API benchmark test, checking the resilience and speed of the client when receiving different numbers of queries to the Beacon API.
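As an illustration of the kind of queries issued during the API benchmark (the exact endpoints, ports and slots depend on each client's configuration, so this is only a sketch), a state query against the standard Beacon API looks like this:
```
# hypothetical benchmark queries: fetch an archived state and its validator set at a given slot
# (port 5051 as in the Teku configuration above; other clients expose the API on different ports)
curl -s -H "Accept: application/json" http://localhost:5051/eth/v2/debug/beacon/states/1000000
curl -s http://localhost:5051/eth/v1/beacon/states/1000000/validators
```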
### Issues in the tests performed
#### Default sync
The first test performed was syncing all clients except Grandine (we did not have the executable yet). During this process we investigated the best way to configure each client, consulting the developer teams where needed.
Along the way, we encountered several issues:
##### Prysm
We did not encounter any issues when executing the client.
##### Lighthouse
The Lighthouse database using the above configuration takes around 100 - 110 GB. However, the disk we were using had a capacity of 90 GB, so the client filled it up. As soon as we noticed this, we created a new disk and moved the database to it, so no resyncing was needed.
We also encountered a memory problem, where the client ran out of memory and the OS would kill the process.
##### Teku
As mentioned, the only issue we encountered the first time we executed the client was that the memory consumption would rise until the OS killed the process. After configuring the JVM, the client worked fine.
##### Nimbus
While compiling Nimbus we noticed the process was taking a very long time. After speaking to the developer team, we were advised to add the **-j4** flag to the **make** command, which enables parallel compilation. This reduced the compilation time to around 9 minutes. After this was sorted out, the client ran smoothly.
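For reference, the compile command with the suggested flag (the nimbus_beacon_node target name matches the binary path used in the Kiln command above and may differ between versions):
```
# build the beacon node using 4 parallel jobs
make -j4 nimbus_beacon_node
```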
##### Lodestar
During the installation of Lodestar, we followed the official guide: https://chainsafe.github.io/lodestar/installation/
However, we were unable to install the client successfully. The issue was that the Node.js version had to be greater than or equal to 16.0.0, while the guide specified greater than or equal to 12.0.0. The version we were using was 14.8.3, and after upgrading to 16.0.0 the installation worked.
This has already been updated in the current documentation.
After executing the default mode, we realized the client did not find any peers. We asked the developer team and they suggested using the Docker image, which seemed more stable and easier to use. Switching to Docker worked fine and the client started syncing.
However, the database took up more than 80 GB at the time of the experiment and the disk ran out of space.
We created a new disk, moved the database and continued syncing. However, the database seemed corrupted and we were getting the following error:
```
Error: string encoded ENR must start with 'enr:'
at Function.decodeTxt (/usr/app/node_modules/@chainsafe/discv5/lib/enr/enr.js:92:19)
at readEnr (/usr/app/node_modules/@chainsafe/lodestar-cli/src/config/enr.ts:28:14)
at persistOptionsAndConfig (/usr/app/node_modules/@chainsafe/lodestar-cli/src/cmds/init/handler.ts:86:24)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at Object.beaconHandler [as handler] (/usr/app/node_modules/@chainsafe/lodestar-cli/src/cmds/beacon/handler.ts:26:3)
```
We asked the Lodestar team for support: the ENR was corrupted, so we renamed the enr file to enr_corrupted and a new ENR was generated, after which the client worked.
We also found a bug in the Beacon API related to the number of peers: when querying the Lodestar API, the number of peers returned was always 0. This bug was reported and an issue was opened on GitHub.
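The bug could be observed by querying the peer information of the standard Beacon API, for example (port 9596 as mapped in the Docker command above; the exact peer endpoint used in our checks is illustrative):
```
# standard Beacon API peer-count endpoint; in our runs the Lodestar API always reported 0 peers
curl -s http://localhost:9596/eth/v1/node/peer_count
```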
#### Grandine
During the syncing process we realized the client would sometimes use a lot of memory and eventually get killed by the OS. After speaking to the developer team, we were provided with a new executable which did not crash anymore.
Apart from this, Grandine does expose Prometheus metrics and an API, but they are unstable and we were not able to use them in every test, as querying the endpoint sometimes killed the client. The API exposes data about the current slot, but we were not able to retrieve the number of peers from it.
### All-topics
After syncing the clients, we stopped them and added the necessary parameters to activate the all-topics mode.
During this process we did not encounter any major issues, apart from verifying that each client was in fact running in all-topics mode.
For some clients this is shown in the terminal output; for others, we checked the Prometheus metrics to verify it, as sketched below.
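As a rough sketch of this metrics check (the port and the metric names differ between clients, so this is only illustrative):
```
# illustrative check: look for subnet/topic-related gossip metrics on the client's Prometheus endpoint
# (replace the port with the metrics port configured for the client being checked)
curl -s http://localhost:8080/metrics | grep -i subnet
```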
### Archival mode
#### Prysm
During the execution of Prysm in archival mode we faced a long and slow synchronization. This process took more than 3 weeks and was also very irregular, as the metrics exposed to Prometheus were sometimes unavailable.
Once the client was synced, we were also unable to perform the API benchmark properly, as the client would stop responding after several queries.
#### Lighthouse
We did not encounter any issues syncing the client in archive mode using the above configuration, other than the disk space required to store the database, which turned out to be more than 1 TB.
#### Teku
We experienced a similar behaviour in Teku, where the sync process took longer than 4 weeks. In this case, the metrics worked well.
After the client was synced we could perform the API benchmark test and obtain a response for each query.
#### Nimbus
As per the developer team's suggestion, we used the same default mode we used to sync the client, so we did not encounter any further issues executing the client in this mode.
#### Lodestar
There is no archival mode available, as per the developer team's answer, and therefore we have not executed the API benchmark test.
#### Grandine
Grandine did not implement the standard Beacon API and, therefore, we were not able to perform the API benchmark test.
### Raspberry PI
#### Lodestar
We were not able to execute any Docker image, as they are amd64-based, so we had to recompile the project for the raspberry PI. Again, we found some issues while installing Node.js, upgrading it to the correct version and compiling Lodestar, which sometimes compiled but would not execute because of a missing dependency (probably one that was not up to date).
### Kiln
#### Prysm
While installing and running Prysm on the Kiln network, we were not able to connect Prysm to the Geth instance installed on the same machine by following the guide. After speaking to the Prysm team, we were advised not to use the JWT to connect to Geth but the IPC file instead, which was not specified in the guide. After this fix, the client worked well.
#### Lodestar
We faced the following error:
```
Error: Invalid length index
at CompositeListType.tree_setProperty (/home/crawler/kiln/merge-testnets/kiln/lodestar/node_modules/@chainsafe/ssz/src/types/composite/list.ts:494:13)
at CompositeListTreeValue.setProperty (/home/crawler/kiln/merge-testnets/kiln/lodestar/node_modules/@chainsafe/ssz/src/backings/tree/treeValue.ts:294:22)
at Object.set (/home/crawler/kiln/merge-testnets/kiln/lodestar/node_modules/@chainsafe/ssz/src/backings/tree/treeValue.ts:92:19)
at DepositDataRootRepository.batchPut (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/db/repositories/depositDataRoot.ts:34:27)
at DepositDataRootRepository.batchPutValues (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/db/repositories/depositDataRoot.ts:43:5)
at Eth1DepositsCache.add (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/eth1/eth1DepositsCache.ts:104:5)
at Eth1DepositDataTracker.updateDepositCache (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/eth1/eth1DepositDataTracker.ts:178:5)
at Eth1DepositDataTracker.update (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/eth1/eth1DepositDataTracker.ts:159:33)
at Eth1DepositDataTracker.runAutoUpdate (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/eth1/eth1DepositDataTracker.ts:133:29)
Mar-31 15:20:58.651[ETH1] error: Error updating eth1 chain cache Invalid length index
```
The client worked fine, produced blocks, attested and performed as expected, but the above error kept appearing. After speaking to the Lodestar team, we found that a parameter was missing: ```--network kiln```.
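For reference, a sketch of the corrected command, i.e. the Kiln command above with the missing flag added:
```
./lodestar beacon --network kiln --rootDir=../lodestar-beacondata --paramsFile=../config.yaml --genesisStateFile=../genesis.ssz --eth1.enabled=true --execution.urls=http://127.0.0.1:8551 --network.connectToDiscv5Bootnodes --network.discv5.enabled=true --jwt-secret=/tmp/jwtsecret --network.discv5.bootEnrs=enr:-Iq4QMCTfIMXnow27baRUb35Q8iiFHSIDBJh6hQM5Axohhf4b6Kr_cOCu0htQ5WvVqKvFgY28893DHAg8gnBAXsAVqmGAX53x8JggmlkgnY0gmlwhLKAlv6Jc2VjcDI1NmsxoQK6S-Cii_KmfFdUJL2TANL3ksaKUnNXvTCv1tLwXs0QgIN1ZHCCIyk --metrics.enabled --metrics.serverPort=8006
```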
### Raspberry PI checkpoint sync
As the Raspberry PI synchronization was slow, we tried using the checkpoint sync functionality to monitor each of the clients once synced on the raspberry PI.
#### Prysm
We tried using the remote checkpoint sync, which consists of connecting to a remote synced node, obtaining the last finalized state and syncing from there. However, we constantly ran into issues doing this. After speaking with the Prysm team, it turned out there was a bug in how the remote client version was parsed and, therefore, the checkpoint sync failed.
In the end, we had to manually download the last finalized checkpoint from an already synced Prysm node and load it locally on the raspberry PI, as sketched below.
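A sketch of how such a finalized state can be fetched from a synced node over the standard Beacon API (the Prysm flags used to load the downloaded file are version-dependent and are therefore not shown here):
```
# download the latest finalized state as SSZ from the already synced node
# (<synced-node-api> stands for the host:port of that node's Beacon API)
curl -s -H "Accept: application/octet-stream" \
  http://<synced-node-api>/eth/v2/debug/beacon/states/finalized -o finalized_state.ssz
```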
#### Lighthouse
In this case we just needed to add the parameter ```--checkpoint-sync-url http://XX.XX.XXX.XXX:5052``` and the client would continue syncing from the last finalized checkpoint of our already synced node.
#### Teku
In this case we just needed to add the parameter ```initial-state: http://XX.XX.XXX.XXX:5052/eth/v2/debug/beacon/states/finalized``` and the client would continue syncing from the last finalized checkpoint of our already synced node.
#### Nimbus
We were not able to execute the checkpoint sync using Nimbus 1.6.0, as it is not supported.
In this case we needed to update to version 1.7.0, as per the recommendation of the Nimbus team.
We then just had to run:
```
trustedNodeSync --trusted-node-url=http://X.X.X.X:5051
```
#### Lodestar
In this case we were able to execute the client using the checkpoint sync by adding the parameters ```--weakSubjectivityServerUrl http://139.99.75.0:5051/ --weakSubjectivitySyncLatest```
#### Grandine
Grandine does not support checkpoint sync, so we tried copying an already synced database onto the machine.
However, when executing the client it would start syncing from scratch, so we were not able to run Grandine as a synced node on a raspberry PI.
## Data Points
Node Exporter data: 508.5M data points
Python data: 225.2M data points
Eth-Pools tool Prometheus: 2.3M data points
Archival API benchmark: 10.1M data points
Total cells in the CSVs used: 146,422,234 ~= 146.4M data points used for plotting
## Execution time
1243 days ~= 29832 CPU hours