
Eth2 Clients Experiment Summary

Executions and tests performed

Versions

Prysm 2.0.6
Lighthouse 2.1.4
Teku 22.3.2
Nimbus 1.6.0
Lodestar 0.34.0
Grandine 0.2.0 (several beta versions)

Configurations

All machines were monitored using Prometheus Node Exporter and a custom Python script.
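
As a rough sketch (not the exact script we used), the custom Python sampler could look like the one below, assuming it relies on psutil and appends one CSV row per interval; the output path, sampling period and monitored disk path are placeholders:

import csv
import os
import time

import psutil

INTERVAL_S = 60                      # sampling period (assumption)
OUT_FILE = "host_metrics.csv"        # output path (assumption)
FIELDS = ["timestamp", "cpu_percent", "mem_used_bytes",
          "disk_used_bytes", "net_rx_bytes", "net_tx_bytes"]

def sample():
    # Take one snapshot of host resource usage.
    mem = psutil.virtual_memory()
    disk = psutil.disk_usage("/")    # monitored mount point (assumption)
    net = psutil.net_io_counters()
    return {
        "timestamp": int(time.time()),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "mem_used_bytes": mem.used,
        "disk_used_bytes": disk.used,
        "net_rx_bytes": net.bytes_recv,
        "net_tx_bytes": net.bytes_sent,
    }

if __name__ == "__main__":
    new_file = not os.path.exists(OUT_FILE)
    with open(OUT_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        while True:
            writer.writerow(sample())
            f.flush()
            time.sleep(INTERVAL_S)
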
All of the machines were connected to an already synced geth running on a separate machine. This geth instance was the same for all clients and experiments.
The only exception is the Kiln experiment, as the Kiln guide describes the need to deploy a geth client on the same machine.
https://notes.ethereum.org/@launchpad/kiln
To deploy geth in the Kiln machines, we have used the following command:

./go-ethereum/build/bin/geth --datadir geth-datadir --http --http.api='engine,eth,web3,net,debug' --http.corsdomain '*' --networkid=1337802 --syncmode=full --authrpc.jwtsecret=/tmp/jwtsecret --bootnodes enode://c354db99124f0faf677ff0e75c3cbbd568b2febc186af664e0c51ac435609badedc67a18a63adb64dacc1780a28dcefebfc29b83fd1a3f4aa3c0eb161364cf94@164.92.130.5:30303 --override.terminaltotaldifficulty 20000000000000

During our experiment, the Kiln network suffered an incident in which many miners entered the network, which would have made the merge happen earlier than planned. To avoid this, Kiln node operators were asked to override the terminal total difficulty so that the merge could happen at the scheduled time.
Therefore, in some cases we had to add an additional flag, while in others just updating the config file was enough.

Prysm

Default sync (used for standard machine, fat node and Raspberry Pi)

config.yaml:

monitoring-host: 0.0.0.0
http-web3provider: http://XX.XX.XXX.XXX:8545/
slots-per-archive-point: 2048
All-topics

We have added subscribe-all-subnets: true to the configuration file.

Archival mode

We have changed the slots-per-archive-point parameter to 64.

Kiln

Following the Kiln guide, our configuration was the following:

bazel run //beacon-chain -- \
--genesis-state $PWD/../genesis.ssz \
--datadir $PWD/../datadir-prysm  \
--http-web3provider=/home/crawler/kiln/merge-testnets/kiln/geth-datadir/geth.ipc  \
--execution-provider=/home/crawler/kiln/merge-testnets/kiln/geth-datadir/geth.ipc  \
--chain-config-file=$PWD/../config.yaml \
--bootstrap-node=enr:-Iq4QMCTfIMXnow27baRUb35Q8iiFHSIDBJh6hQM5Axohhf4b6Kr_cOCu0htQ5WvVqKvFgY28893DHAg8gnBAXsAVqmGAX53x8JggmlkgnY0gmlwhLKAlv6Jc2VjcDI1NmsxoQK6S-Cii_KmfFdUJL2TANL3ksaKUnNXvTCv1tLwXs0QgIN1ZHCCIyk \
--jwt-secret=/tmp/jwtsecret \
--monitoring-host 0.0.0.0

Lighthouse

Default sync (used for standard machine, fat node and Raspberry Pi)
lighthouse bn --http --metrics --metrics-address 0.0.0.0 --eth1-endpoints http://XX.XX.XXX.XXX:8545/ --slots-per-restore-point 2048 --datadir /mnt/diskChain/.lighthouse/mainnet
All-topics

We have added the parameter --subscribe-all-subnets to the execution command.

Archival mode

We have changed --slots-per-restore-point to 64.

Kiln

Following the Kiln guide, the execution command is as follows:

lighthouse \
	--spec mainnet \
	--network kiln \
	--debug-level info \
	beacon_node \
	--datadir ./testnet-lh1 \
	--eth1 \
	--http \
	--http-allow-sync-stalled \
	--metrics --metrics-address 0.0.0.0 \
	--merge \
	--execution-endpoints http://127.0.0.1:8551 \
	--enr-udp-port=9000 \
	--enr-tcp-port=9000 \
	--discovery-port=9000 \
	--jwt-secrets=/tmp/jwtsecret

Teku

After speaking to the developer team, we were advised to configure the JVM memory allocation.
This was done using the following command:

export JAVA_OPTS="-Xmx5g -Xms5g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$CLIENT_BASE_DIR/heap_data"
Default sync (used for standard machine, fat node and Raspberry Pi)

config.yaml:

network: "mainnet"
eth1-endpoint: ["http://51.79.142.201:8545/"]
metrics-enabled: true
rest-api-docs-enabled: true
metrics-port: 8007
p2p-port: 9001
data-storage-archive-frequency: 2048
metrics-interface: "0.0.0.0"
metrics-host-allowlist: ["*"]
rest-api-enabled: true
rest-api-host-allowlist: ["*"]
rest-api-interface: "0.0.0.0"
rest-api-port: 5051
All-topics

We have added the option p2p-subscribe-all-subnets-enabled.

Archival mode

We have removed the data-storage-archive-frequency parameter from the configuration file.
We have added the option data-storage-mode: "archive" in the configuration file.

Kiln

Following the Kiln guide, the execution command is:

./teku/build/install/teku/bin/teku \
	--data-path datadir-teku \
	--network config.yaml \
	--p2p-discovery-bootnodes enr:-Iq4QMCTfIMXnow27baRUb35Q8iiFHSIDBJh6hQM5Axohhf4b6Kr_cOCu0htQ5WvVqKvFgY28893DHAg8gnBAXsAVqmGAX53x8JggmlkgnY0gmlwhLKAlv6Jc2VjcDI1NmsxoQK6S-Cii_KmfFdUJL2TANL3ksaKUnNXvTCv1tLwXs0QgIN1ZHCCIyk \
	--ee-endpoint http://localhost:8551 \
	--Xee-version kilnv2 \
	--rest-api-enabled true --metrics-enabled=true --metrics-host-allowlist=* --metrics-interface=0.0.0.0 \
	--validators-proposer-default-fee-recipient=0x2Ad2f1999A99F6Af12D4634e2C88a0891c3013e8 \
	--ee-jwt-secret-file /tmp/jwtsecret \
	--log-destination console

Nimbus

Default sync (used for standard machine, fat node and Raspberry Pi)

The execution command is:

run-mainnet-beacon-node.sh --web3-url="http://XX.XX.XXX.XXX:8545/" --metrics-address=0.0.0.0 --metrics --tcp-port=9002 --udp-port=9003 --num-threads=4 --data-dir=/home/crawler/.nimbus-db/
All-topics

We have added the parameter --subscribe-all-subnets to the execution command.

Archival mode

There is no parameter to adjust the number of slots between stored states.

Kiln

Following the Kiln guide, the execution command is as follows:

nimbus-eth2/build/nimbus_beacon_node \
    --network=./ \
    --web3-url=ws://127.0.0.1:8551 \
    --rest --validator-monitor-auto \
    --metrics --metrics-address=0.0.0.0 --data-dir=./nimbus-db \
    --log-level=INFO \
    --jwt-secret=/tmp/jwtsecret

Lodestar

Default sync (used for standard machine, fat node and Raspberry Pi)
sudo docker run -p 9596:9596 -p 8006:8006 -p 9005:9005 -v /mnt/diskBlock/lodestar:/root/.local/share/lodestar/ chainsafe/lodestar:v0.34.0 beacon --network mainnet --metrics.enabled --metrics.serverPort=8006 --network.localMultiaddrs="/ip4/0.0.0.0/tcp/9005" --network.connectToDiscv5Bootnodes true --logLevel="info" --eth1.providerUrls="http://XX.XX.XXX.XXX:8545/" --api.rest.host 0.0.0.0
All-topics

We have added the parameter --network.subscribeAllSubnets true to the execution command.

Archival mode

There is no documented archival mode for Lodestar.

Kiln

Following the Kiln guide, the execution command is as follows:

./lodestar beacon --rootDir=../lodestar-beacondata --paramsFile=../config.yaml --genesisStateFile=../genesis.ssz  --eth1.enabled=true --execution.urls=http://127.0.0.1:8551 --network.connectToDiscv5Bootnodes --network.discv5.enabled=true --jwt-secret=/tmp/jwtsecret --network.discv5.bootEnrs=enr:-Iq4QMCTfIMXnow27baRUb35Q8iiFHSIDBJh6hQM5Axohhf4b6Kr_cOCu0htQ5WvVqKvFgY28893DHAg8gnBAXsAVqmGAX53x8JggmlkgnY0gmlwhLKAlv6Jc2VjcDI1NmsxoQK6S-Cii_KmfFdUJL2TANL3ksaKUnNXvTCv1tLwXs0QgIN1ZHCCIyk --metrics.enabled --metrics.serverPort=8006

Grandine

During our Grandine experiments we were provided with several executables, each implementing different functionality.
Although all of them belong to version 0.2.0, several executables (different beta versions) were used.

Default sync (used for standard machine, fat node and Raspberry Pi)

The execution command is as follows:

grandine-0.2.0 --metrics --archival-epoch-interval 64 --eth1-rpc-urls http://XX.XX.XXX.XXX:8545/ --http-address 0.0.0.0 --network mainnet
All-topics

We have added the --subscribe-all-subnets parameter to the execution command.

Archival mode

We have changed the --archival-epoch-interval parameter from 64 to 2.

Kiln

The execution command:

sudo docker run --name grandine_container -v /home/crawler/.grandine:/root/.grandine -v /tmp/jwtsecret:/tmp/jwtsecret --network=host sifrai/grandine:latest grandine --eth1-rpc-urls http://localhost:8551/ --network kiln --jwt-secret=/tmp/jwtsecret --keystore-dir /root/.grandine/keys --keystore-password-file /root/.grandine/secrets

Tests performed

We have executed each client in default sync mode (standard machine, fat node and Raspberry Pi), in all-topics mode, and on the Kiln network.
During all these experiments the goal was to measure the performance and hardware resource consumption in each mode and on each machine.

We have executed some clients in archival mode: in this case we did not measure hardware consumption; instead, the goal was to perform an API benchmark test to check the resilience and speed of each client when receiving different numbers of queries to the Beacon API.
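
As an illustration, a minimal sketch of such a benchmark is shown below, assuming concurrent GET requests against the standard /eth/v1/beacon/states/{slot}/root endpoint; the host, slot range and concurrency level are placeholders rather than the exact parameters we used:

import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:5052"   # Beacon API of the archival node (placeholder)
SLOTS = range(0, 10_000, 100)        # slots to query (placeholder)
WORKERS = 16                         # concurrency level (placeholder)

def query_state_root(slot):
    # Time a single query for the state root at a given slot.
    t0 = time.monotonic()
    r = requests.get(f"{BASE_URL}/eth/v1/beacon/states/{slot}/root", timeout=30)
    return r.status_code, time.monotonic() - t0

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = list(pool.map(query_state_root, SLOTS))

latencies = [lat for status, lat in results if status == 200]
print(f"{len(latencies)}/{len(results)} successful, "
      f"avg latency {sum(latencies) / max(len(latencies), 1):.3f}s")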

Issues in the tests performed

Default sync

The first test performed was syncing all clients except Grandine (we did not have the executable yet). During this process we investigated the best way to configure each of them, including asking the developer teams directly.

During this process, we encountered several issues:

Prysm

We did not encounter any issues when executing the client.

Lighthouse

The Lighthouse database using the above configuration takes around 100-110 GB. However, the disk we were using had a capacity of 90 GB, so the client filled it. As soon as we noticed this, we created a new disk and moved the database to it, so no resyncing was needed.
We also encountered a memory problem, where the client ran out of memory and the OS killed the process.

Teku

As mentioned, the only issue we encountered the first time we executed the client was that the memory consumption would rise until the OS killed the process. After configuring the JVM, the client worked fine.

Nimbus

While compiling Nimbus we noticed the process was taking a very long time. After speaking to the developer team, we were advised to add the -j4 flag to the make command, which enables parallel compilation. This reduced the compilation time to around 9 minutes. After this was sorted out, the client ran smoothly.

Lodestar

During the installation of Lodestar, we followed the official guide: https://chainsafe.github.io/lodestar/installation/
However, we were initially unable to install the client. The issue was that the Node.js version had to be greater than or equal to 16.0.0, whereas the guide specified greater than or equal to 12.0.0. We were using version 14.8.3, and after upgrading to 16.0.0 the installation worked.
This has since been fixed in the current documentation.

After executing the default mode, we realized the client did not find any peers, so we asked the developer team and they suggested using the Docker image, which seemed more stable and easier to use. Switching to Docker worked fine and the client started syncing.

However, the database size was more than 80 GB at the time of the experiment and the disk ran out of space.
We created a new disk, moved the database and continued syncing. However, the database seemed corrupted and we were getting the following error:

Error: string encoded ENR must start with 'enr:'
    at Function.decodeTxt (/usr/app/node_modules/@chainsafe/discv5/lib/enr/enr.js:92:19)
    at readEnr (/usr/app/node_modules/@chainsafe/lodestar-cli/src/config/enr.ts:28:14)
    at persistOptionsAndConfig (/usr/app/node_modules/@chainsafe/lodestar-cli/src/cmds/init/handler.ts:86:24)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Object.beaconHandler [as handler] (/usr/app/node_modules/@chainsafe/lodestar-cli/src/cmds/beacon/handler.ts:26:3)

We asked the Lodestar team for support: the ENR was corrupted, so we renamed the enr file to enr_corrupted; the client then generated a new ENR and worked again.

We also found a bug in the Beacon API related to the number of peers: when querying the Lodestar API, the number of peers returned was always 0. This bug was reported and an issue was opened on GitHub.
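
For context, the peer count can be read from the standard Beacon API with a query along these lines (a sketch: the /eth/v1/node/peer_count endpoint comes from the beacon-APIs specification, the host and port are placeholders, and the exact query we ran may have differed):

import requests

# 9596 is the REST port exposed in the docker command above (placeholder host).
resp = requests.get("http://localhost:9596/eth/v1/node/peer_count", timeout=10)
resp.raise_for_status()
print(resp.json()["data"])   # on the affected version this reported 0 connected peers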

Grandine

During the syncing process we realized the client would sometimes use a lot of memory and eventually get killed by the OS. After speaking to the developer team, we were provided with a new executable which no longer crashed.
Apart from this, Grandine does expose Prometheus metrics and an API, but the API is unstable and we were not able to use it in every test, as querying the endpoint sometimes crashed the client. The API exposes data about the current slot, but we were not able to retrieve the number of peers from it.

All-topics

After syncing the clients, we stopped them and added the necessary parameters to activate all-topics mode.
During this process we did not encounter any major issues, apart from having to verify that each client was in fact running in all-topics mode.
For some clients this is shown in the terminal output; for others, we checked the Prometheus metrics to verify it.
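
One way to run this kind of check (a sketch, not the exact procedure we followed) is to scrape the client's Prometheus endpoint and filter for subnet-related metric names; the port and the keyword list are assumptions and differ per client:

import requests

METRICS_URL = "http://localhost:8008/metrics"   # metrics port differs per client (placeholder)
KEYWORDS = ("subnet", "gossipsub", "topic")     # heuristic filter (assumption)

body = requests.get(METRICS_URL, timeout=10).text
for line in body.splitlines():
    # Skip HELP/TYPE comment lines and print metrics mentioning subnets or topics.
    if not line.startswith("#") and any(k in line for k in KEYWORDS):
        print(line)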

Archival mode

Prysm

During the execution of Prysm in archival mode we faced a long and slow synchronization. This process took more than 3 weeks and was also very irregular, as the metrics exposed to Prometheus were sometimes unavailable.
Once the client was synced, we were also unable to perform the API benchmark properly, as the client would stop responding after several queries.

Lighthouse

We did not encounter any issues syncing the client in archive mode using the above configuration, other than the disk space required to store the database, which turned out to be more than 1 TB.

Teku

We experienced a similar behaviour in Teku, where the sync process took longer than 4 weeks. In this case, the metrics worked well.
After the client was synced we could perform the API benchmark test and obtain a response for each query.

Nimbus

As per the developer team's suggestion, we used the same default mode we had used to sync the client, so we did not encounter any further issues executing the client in this mode.

Lodestar

There is no archival mode available, as per the developer team's answer, and therefore we have not executed the API benchmark test.

Grandine

Grandine did not implement the standard Beacon API and, therefore, we were not able to perform the API benchmark test.

Raspberry Pi

Lodestar

We were not able to execute any Docker image, as they are amd64-based, so we had to recompile the project for the Raspberry Pi. Again, we encountered issues installing Node.js and upgrading to the correct version, as well as compiling Lodestar, which sometimes compiled but would not run because of a missing (probably outdated) dependency.

Kiln

Prysm

While installing and running Prysm on the Kiln network, we were not able to connect Prysm to the Geth installed on the same machine by following the guide. After speaking to the Prysm team, we were advised not to use the JWT to connect to Geth, but the IPC file instead, which was not specified in the guide. After this fix, the client worked well.

Lodestar

We faced the following error:

    at CompositeListType.tree_setProperty (/home/crawler/kiln/merge-testnets/kiln/lodestar/node_modules/@chainsafe/ssz/src/types/composite/list.ts:494:13)
    at CompositeListTreeValue.setProperty (/home/crawler/kiln/merge-testnets/kiln/lodestar/node_modules/@chainsafe/ssz/src/backings/tree/treeValue.ts:294:22)
    at Object.set (/home/crawler/kiln/merge-testnets/kiln/lodestar/node_modules/@chainsafe/ssz/src/backings/tree/treeValue.ts:92:19)
    at DepositDataRootRepository.batchPut (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/db/repositories/depositDataRoot.ts:34:27)
    at DepositDataRootRepository.batchPutValues (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/db/repositories/depositDataRoot.ts:43:5)
    at Eth1DepositsCache.add (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/eth1/eth1DepositsCache.ts:104:5)
    at Eth1DepositDataTracker.updateDepositCache (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/eth1/eth1DepositDataTracker.ts:178:5)
    at Eth1DepositDataTracker.update (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/eth1/eth1DepositDataTracker.ts:159:33)
    at Eth1DepositDataTracker.runAutoUpdate (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/eth1/eth1DepositDataTracker.ts:133:29)
Mar-31 15:20:58.651[ETH1]            error: Error updating eth1 chain cache  Invalid length index

The client worked fine, produced blocks, attested and performed as expected, but the above error kept appearing. After speaking to the Lodestar team, it turned out a parameter was missing: --network kiln

Raspberry Pi checkpoint sync

As the Raspberry Pi synchronization was slow, we tried using the checkpoint sync functionality to monitor each of the clients once synced on the Raspberry Pi.

Prysm

We tried using the remote checkpoint sync, which consists of connecting to a remote synced node, obtaining the last finalized state and syncing from there. However, we constantly ran into issues doing this. After speaking with the Prysm team, it turned out there was a bug in how the remote client version was parsed and, therefore, the checkpoint sync failed.
In the end, we had to manually download the last finalized checkpoint from an already synced Prysm node and load it locally on the Raspberry Pi.
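
A download along these lines can fetch that state over the standard Beacon API (a sketch: the endpoint is the same one referenced in the Teku checkpoint sync section below, while the source host, output path and SSZ Accept header are assumptions, and the Prysm flags used to load the file are omitted here):

import requests

SOURCE = "http://XX.XX.XXX.XXX:5052"   # already synced beacon node (placeholder)

resp = requests.get(
    f"{SOURCE}/eth/v2/debug/beacon/states/finalized",
    headers={"Accept": "application/octet-stream"},   # request the state as SSZ bytes
    timeout=300,
)
resp.raise_for_status()
with open("finalized_state.ssz", "wb") as f:
    f.write(resp.content)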

Lighthouse

In this case we just needed to add the parameter --checkpoint-sync-url http://XX.XX.XXX.XXX:5052 and the client continued syncing from the last finalized checkpoint of our already synced node.

Teku

In this case we just needed to add the parameter initial-state: http://XX.XX.XXX.XXX:5052/eth/v2/debug/beacon/states/finalized and the client continued syncing from the last finalized checkpoint of our already synced node.

Nimbus

We were not able to execute the checkpoint sync using Nimbus 1.6.0, as it is not supported.
In this case we needed to update to version 1.7.0, as per the recommendation of the Nimbus team.
We just had to add:

trustedNodeSync --trusted-node-url=http://X.X.X.X:5051

Lodestar

In this case we were able to execute the client using the checkpoint sync by adding the parameters --weakSubjectivityServerUrl http://139.99.75.0:5051/ --weakSubjectivitySyncLatest

Grandine

Grandine does not support checkpoint sync, so we tried copying an already synced database onto the machine.
However, when executing the client it would start syncing from scratch, so we were not able to run Grandine as a synced node on the Raspberry Pi.

Data Points

Node Exporter (NE) data: 508.5M data points
Python script data: 225.2M data points
Eth-Pools tool Prometheus: 2.3M data points
Archival API benchmark: 10.1M data points
Total cells in the CSVs used: 146,422,234 ≈ 146.4M data points used for plotting

Execution time

1243 days ≈ 29,832 CPU hours (1243 × 24)
