Prysm 2.0.6
Lighthouse 2.1.4
Teku 22.3.2
Nimbus 1.6.0
Lodestar 0.34.0
Grandine 0.2.0 (several beta versions)
All machines were monitored using Prometheus Node Exporter and a custom Python script.
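As an illustration, a minimal Prometheus scrape configuration for collecting the Node Exporter metrics could look as follows (the job name, scrape interval, and target address are placeholders, not our exact setup):
scrape_configs:
  - job_name: 'node-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['XX.XX.XXX.XXX:9100'] # Node Exporter default port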
All of them were connected to an already synced Geth instance running on a separate machine. The same Geth instance was used for all clients and experiments.
The only exception is the Kiln experiment, as the Kiln guide describes the need to deploy a Geth client on the same machine:
https://notes.ethereum.org/@launchpad/kiln
To deploy geth in the Kiln machines, we have used the following command:
./go-ethereum/build/bin/geth --datadir geth-datadir --http --http.api='engine,eth,web3,net,debug' --http.corsdomain '*' --networkid=1337802 --syncmode=full --authrpc.jwtsecret=/tmp/jwtsecret --bootnodes enode://c354db99124f0faf677ff0e75c3cbbd568b2febc186af664e0c51ac435609badedc67a18a63adb64dacc1780a28dcefebfc29b83fd1a3f4aa3c0eb161364cf94@164.92.130.5:30303 --override.terminaltotaldifficulty 20000000000000
During our experiment, the Kiln network suffered an incident in which many miners entered the network, which would have caused the merge to happen earlier than scheduled. To avoid this, Kiln node operators were asked to override the terminal total difficulty so that the merge could happen at the scheduled time.
Therefore, in some cases we had to add an additional flag, while in others updating the configuration file was enough.
config.yaml:
monitoring-host: 0.0.0.0
http-web3provider: http://XX.XX.XXX.XXX:8545/
slots-per-archive-point: 2048
We have added subscribe-all-subnets: true to the configuration file.
We have changed the slots-per-archive-point parameter to 64.
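For illustration, a sketch of the per-experiment overrides in config.yaml (our reconstruction; each line was applied for its respective experiment):
# all-topics mode
subscribe-all-subnets: true
# archival mode
slots-per-archive-point: 64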
Following the Kiln guide, our configuration was the following:
bazel run //beacon-chain -- \
--genesis-state $PWD/../genesis.ssz \
--datadir $PWD/../datadir-prysm \
--http-web3provider=/home/crawler/kiln/merge-testnets/kiln/geth-datadir/geth.ipc \
--execution-provider=/home/crawler/kiln/merge-testnets/kiln/geth-datadir/geth.ipc \
--chain-config-file=$PWD/../config.yaml \
--bootstrap-node=enr:-Iq4QMCTfIMXnow27baRUb35Q8iiFHSIDBJh6hQM5Axohhf4b6Kr_cOCu0htQ5WvVqKvFgY28893DHAg8gnBAXsAVqmGAX53x8JggmlkgnY0gmlwhLKAlv6Jc2VjcDI1NmsxoQK6S-Cii_KmfFdUJL2TANL3ksaKUnNXvTCv1tLwXs0QgIN1ZHCCIyk \
--jwt-secret=/tmp/jwtsecret \
--monitoring-host 0.0.0.0
lighthouse bn --http --metrics --metrics-address 0.0.0.0 --eth1-endpoints http://XX.XX.XXX.XXX:8545/ --slots-per-restore-point 2048 --datadir /mnt/diskChain/.lighthouse/mainnet
We have added the parameter --subscribe-all-subnets to the execution command.
We have modified the --slots-per-restore-point parameter to 64.
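For illustration, the all-topics variant of the command would simply append the flag (our reconstruction; for the archival experiment, --slots-per-restore-point was set to 64 instead of 2048):
lighthouse bn --http --metrics --metrics-address 0.0.0.0 --eth1-endpoints http://XX.XX.XXX.XXX:8545/ --slots-per-restore-point 2048 --subscribe-all-subnets --datadir /mnt/diskChain/.lighthouse/mainnet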
Following the Kiln guide, the execution command is as follows:
lighthouse \
--spec mainnet \
--network kiln \
--debug-level info \
beacon_node \
--datadir ./testnet-lh1 \
--eth1 \
--http \
--http-allow-sync-stalled \
--metrics --metrics-address 0.0.0.0 \
--merge \
--execution-endpoints http://127.0.0.1:8551 \
--enr-udp-port=9000 \
--enr-tcp-port=9000 \
--discovery-port=9000 \
--jwt-secrets=/tmp/jwtsecret
After speaking to the developer team, we were advised to configure the JVM memory allocation.
This was done by exporting the following environment variable, which Teku's start script picks up:
export JAVA_OPTS="-Xmx5g -Xms5g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$CLIENT_BASE_DIR/heap_data"
config.yaml:
network: "mainnet"
eth1-endpoint: ["http://51.79.142.201:8545/"]
metrics-enabled: true
rest-api-docs-enabled: true
metrics-port: 8007
p2p-port: 9001
data-storage-archive-frequency: 2048
metrics-interface: "0.0.0.0"
metrics-host-allowlist: ["*"]
rest-api-enabled: true
rest-api-host-allowlist: ["*"]
rest-api-interface: "0.0.0.0"
rest-api-port: 5051
We have added the option p2p-subscribe-all-subnets-enabled to the configuration file.
We have removed the data-storage-archive-frequency parameter from the configuration file.
We have added the option data-storage-mode: "archive" to the configuration file.
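A sketch of the corresponding changes to config.yaml (our reconstruction; all other options remain as above):
# all-topics mode
p2p-subscribe-all-subnets-enabled: true
# archival mode (data-storage-archive-frequency removed)
data-storage-mode: "archive"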
Following the Kiln guide, the execution command is:
./teku/build/install/teku/bin/teku \
--data-path datadir-teku \
--network config.yaml \
--p2p-discovery-bootnodes enr:-Iq4QMCTfIMXnow27baRUb35Q8iiFHSIDBJh6hQM5Axohhf4b6Kr_cOCu0htQ5WvVqKvFgY28893DHAg8gnBAXsAVqmGAX53x8JggmlkgnY0gmlwhLKAlv6Jc2VjcDI1NmsxoQK6S-Cii_KmfFdUJL2TANL3ksaKUnNXvTCv1tLwXs0QgIN1ZHCCIyk \
--ee-endpoint http://localhost:8551 \
--Xee-version kilnv2 \
--rest-api-enabled true --metrics-enabled=true --metrics-host-allowlist=* --metrics-interface=0.0.0.0 \
--validators-proposer-default-fee-recipient=0x2Ad2f1999A99F6Af12D4634e2C88a0891c3013e8 \
--ee-jwt-secret-file /tmp/jwtsecret \
--log-destination console
The execution command is:
run-mainnet-beacon-node.sh --web3-url="http://XX.XX.XXX.XXX:8545/" --metrics-address=0.0.0.0 --metrics --tcp-port=9002 --udp-port=9003 --num-threads=4 --data-dir=/home/crawler/.nimbus-db/
We have added the parameter --subscribe-all-subnets to the execution command.
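For illustration, the all-topics variant of the command would then be (our reconstruction):
run-mainnet-beacon-node.sh --web3-url="http://XX.XX.XXX.XXX:8545/" --metrics-address=0.0.0.0 --metrics --tcp-port=9002 --udp-port=9003 --num-threads=4 --subscribe-all-subnets --data-dir=/home/crawler/.nimbus-db/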
There is no parameter to adjust the number of slots between stored states.
Following the Kiln guide, the execution command is as follows:
nimbus-eth2/build/nimbus_beacon_node \
--network=./ \
--web3-url=ws://127.0.0.1:8551 \
--rest --validator-monitor-auto \
--metrics --metrics-address=0.0.0.0 --data-dir=./nimbus-db \
--log-level=INFO \
--jwt-secret=/tmp/jwtsecret
sudo docker run -p 9596:9596 -p 8006:8006 -p 9005:9005 -v /mnt/diskBlock/lodestar:/root/.local/share/lodestar/ chainsafe/lodestar:v0.34.0 beacon --network mainnet --metrics.enabled --metrics.serverPort=8006 --network.localMultiaddrs="/ip4/0.0.0.0/tcp/9005" --network.connectToDiscv5Bootnodes true --logLevel="info" --eth1.providerUrls="http://XX.XX.XXX.XXX:8545/" --api.rest.host 0.0.0.0
We have added the parameter --network.subscribeAllSubnets true to the execution command.
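For illustration, the all-topics variant appends the flag to the Docker command above (our reconstruction):
sudo docker run -p 9596:9596 -p 8006:8006 -p 9005:9005 -v /mnt/diskBlock/lodestar:/root/.local/share/lodestar/ chainsafe/lodestar:v0.34.0 beacon --network mainnet --metrics.enabled --metrics.serverPort=8006 --network.localMultiaddrs="/ip4/0.0.0.0/tcp/9005" --network.connectToDiscv5Bootnodes true --network.subscribeAllSubnets true --logLevel="info" --eth1.providerUrls="http://XX.XX.XXX.XXX:8545/" --api.rest.host 0.0.0.0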
There is no documented archival mode for Lodestar.
Following the Kiln guide, the execution command is as follows:
./lodestar beacon --rootDir=../lodestar-beacondata --paramsFile=../config.yaml --genesisStateFile=../genesis.ssz --eth1.enabled=true --execution.urls=http://127.0.0.1:8551 --network.connectToDiscv5Bootnodes --network.discv5.enabled=true --jwt-secret=/tmp/jwtsecret --network.discv5.bootEnrs=enr:-Iq4QMCTfIMXnow27baRUb35Q8iiFHSIDBJh6hQM5Axohhf4b6Kr_cOCu0htQ5WvVqKvFgY28893DHAg8gnBAXsAVqmGAX53x8JggmlkgnY0gmlwhLKAlv6Jc2VjcDI1NmsxoQK6S-Cii_KmfFdUJL2TANL3ksaKUnNXvTCv1tLwXs0QgIN1ZHCCIyk --metrics.enabled --metrics.serverPort=8006
During our Grandine experiments we were provided with several executables, each implementing different functionality. Although all of them belong to the same version 0.2.0, several different beta builds of that version were used.
The execution command is as follows:
grandine-0.2.0 --metrics --archival-epoch-interval 64 --eth1-rpc-urls http://XX.XX.XXX.XXX:8545/ --http-address 0.0.0.0 --network mainnet
We have added the --subscribe-all-subnets parameter to the execution command.
We have changed the --archival-epoch-interval parameter from 64 to 2.
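A sketch of the resulting command with both modifications applied (our reconstruction; each modification was used for its respective experiment):
grandine-0.2.0 --metrics --subscribe-all-subnets --archival-epoch-interval 2 --eth1-rpc-urls http://XX.XX.XXX.XXX:8545/ --http-address 0.0.0.0 --network mainnet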
For the Kiln network, the execution command is:
sudo docker run --name grandine_container -v /home/crawler/.grandine:/root/.grandine -v /tmp/jwtsecret:/tmp/jwtsecret --network=host sifrai/grandine:latest grandine --eth1-rpc-urls http://localhost:8551/ --network kiln --jwt-secret=/tmp/jwtsecret --keystore-dir /root/.grandine/keys --keystore-password-file /root/.grandine/secrets
We have executed each client in sync mode (on the standard machine, the fat machine, and the Raspberry Pi), in all-topics mode, and on the Kiln network.
During all these experiments the goal was to measure the performance and hardware resource consumption in each mode and on each machine.
We have also executed some clients in archival mode: in this case we did not measure hardware consumption; instead, the goal was to perform an API benchmark test, checking the resilience and speed of each client when receiving different numbers of queries to the Beacon API.
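As a sketch of the kind of load generated by the API benchmark (the endpoint, port, and slot range here are illustrative, not the exact benchmark we ran):
# query the root of a range of historical states, recording status code and response time
for slot in $(seq 100000 100100); do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
    "http://XX.XX.XXX.XXX:5052/eth/v1/beacon/states/${slot}/root"
done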
The first test performed was syncing all clients except Grandine (we did not yet have the executable). During this process we investigated the best way to configure each of them, consulting the developer teams where needed.
During this process, we encountered several issues:
We did not encounter any issues when executing the client.
The Lighthouse database using the above configuration takes around 100-110 GB. However, the disk we were using had a capacity of 90 GB, so the client filled the disk. As soon as we noticed this, we created a new disk and moved the database to it, so no resyncing was needed.
We also encountered a memory problem: the client would run out of memory and the OS would kill the process.
As mentioned, the only issue we encountered the first time we executed the client was that the memory consumption would rise until the OS killed the process. After configuring the JVM, the client worked fine.
While compiling Nimbus we noticed the process was taking very long. After speaking to the developer team, we were advised to add the -j4 flag to the make command, which enables parallel compilation jobs. This reduced the compilation time to around 9 minutes. After this was sorted out, the client ran smoothly.
During the installation of Lodestar, we followed the official guide: https://chainsafe.github.io/lodestar/installation/
However, we were unable to install the client successfully. The issue was that the Node.js version had to be greater than or equal to 16.0.0, whereas the guide specified greater than or equal to 12.0.0. The version we were using was 14.8.3, and after upgrading to 16.0.0 the installation worked.
This has already been updated in the current documentation.
After executing the default mode, we realized the client did not find any peers, so we asked the developer team and they suggested using the Docker image, which proved more stable and easier to use. Switching to Docker worked, and the client started syncing.
However, the database had grown to more than 80 GB at the time of the experiment and the disk ran out of space.
We created a new disk, moved the database, and continued syncing. However, the database seemed corrupted, and we were getting the following error:
Error: string encoded ENR must start with 'enr:'
at Function.decodeTxt (/usr/app/node_modules/@chainsafe/discv5/lib/enr/enr.js:92:19)
at readEnr (/usr/app/node_modules/@chainsafe/lodestar-cli/src/config/enr.ts:28:14)
at persistOptionsAndConfig (/usr/app/node_modules/@chainsafe/lodestar-cli/src/cmds/init/handler.ts:86:24)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at Object.beaconHandler [as handler] (/usr/app/node_modules/@chainsafe/lodestar-cli/src/cmds/beacon/handler.ts:26:3)
We asked the Lodestar team for support: the ENR was corrupted. We renamed the enr file to enr_corrupted, a new ENR was generated, and the client then worked.
We also found a bug in the Beacon API related to the number of peers: when querying the Lodestar API, the number of peers returned would always be 0. This bug was reported and an issue was opened on GitHub.
During the syncing process we realized the client would sometimes use a lot of memory and eventually get killed by the OS. After speaking to the developer team, we were provided with a new executable which no longer crashed.
Apart from this, Grandine does expose Prometheus metrics and an API, but the API is unstable and we were not able to use it in every test, as querying the endpoint would sometimes kill the client. The API exposes data about the slot, but we were not able to retrieve the number of peers from it.
After syncing the clients, we have stopped them and added the necessary parameters to activate the all-topics mode.
During this process we have not encountered any major issues, apart from verifying that each client was in fact in all-topics mode.
Some clients show this in the terminal output; for others, we checked the Prometheus metrics to verify it, as sketched below.
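For example, on clients that expose per-topic gossipsub metrics, counting the attestation subnet topics in the metrics output is a quick check (metric names, labels, and the metrics port differ per client; this is only a sketch):
# with all 64 attestation subnets subscribed, many beacon_attestation_* topics should appear
curl -s http://localhost:8006/metrics | grep -c 'beacon_attestation'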
During the execution of Prysm in archival mode we faced a long and slow synchronization. This process took more than 3 weeks and was also very irregular, as the metrics exposed to Prometheus were sometimes unavailable.
Once the client was synced, we were also unable to perform the API benchmark properly, as the client would stop responding after several queries.
We have not encountered any issues syncing the client in archival mode using the above configuration, other than the disk space required to store the database, which turned out to be more than 1 TB.
We experienced similar behaviour with Teku, where the sync process took longer than 4 weeks. In this case, the metrics worked well.
After the client was synced we could perform the API benchmark test and obtain a response for each query.
As per the developer team's suggestion, we have used the same default mode we used to sync the client, and we have not encountered any other issues executing the client in this mode.
According to the developer team, there is no archival mode available; therefore, we have not executed the API benchmark test.
Grandine did not implement the standard Beacon API and, therefore, we were not able to perform the API benchmark test.
We were not able to execute any Docker image, as they are amd64-based, so we had to recompile the projects for the Raspberry Pi. We again ran into issues installing Node.js and upgrading to the correct version, as well as compiling Lodestar, which sometimes compiled but would not execute because of a missing dependency (probably an outdated one).
While installing and running Prysm on the Kiln network, we were not able to connect Prysm to the Geth installed on the same machine by following the guide. After speaking to the Prysm team, we were advised not to use the JWT connection to Geth but the IPC file instead, which was not specified in the guide. After this fix, the client worked well.
We faced the following error:
Mar-31 15:20:58.651[ETH1] error: Error updating eth1 chain cache Invalid length index
at CompositeListType.tree_setProperty (/home/crawler/kiln/merge-testnets/kiln/lodestar/node_modules/@chainsafe/ssz/src/types/composite/list.ts:494:13)
at CompositeListTreeValue.setProperty (/home/crawler/kiln/merge-testnets/kiln/lodestar/node_modules/@chainsafe/ssz/src/backings/tree/treeValue.ts:294:22)
at Object.set (/home/crawler/kiln/merge-testnets/kiln/lodestar/node_modules/@chainsafe/ssz/src/backings/tree/treeValue.ts:92:19)
at DepositDataRootRepository.batchPut (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/db/repositories/depositDataRoot.ts:34:27)
at DepositDataRootRepository.batchPutValues (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/db/repositories/depositDataRoot.ts:43:5)
at Eth1DepositsCache.add (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/eth1/eth1DepositsCache.ts:104:5)
at Eth1DepositDataTracker.updateDepositCache (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/eth1/eth1DepositDataTracker.ts:178:5)
at Eth1DepositDataTracker.update (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/eth1/eth1DepositDataTracker.ts:159:33)
at Eth1DepositDataTracker.runAutoUpdate (/home/crawler/kiln/merge-testnets/kiln/lodestar/packages/lodestar/src/eth1/eth1DepositDataTracker.ts:133:29)
The client worked fine, produced blocks, attested, and performed accordingly, but the above error kept appearing. After speaking to the Lodestar team, it turned out a parameter was missing: --network kiln
As the Raspberry Pi synchronization was slow, we tried using the checkpoint sync functionality in order to monitor each of the clients once synced on the Raspberry Pi.
We tried using the remote checkpoint sync, which consists of connecting to a remote synced node, obtaining the last finalized state, and syncing from there. However, we constantly ran into issues doing this. After speaking with the Prysm team, we learned there was a bug in how the remote client version was parsed, which caused the checkpoint sync to fail.
In the end, we had to manually download the last finalized checkpoint from an already synced Prysm node and load it locally on the Raspberry Pi.
In this case we just needed to add the parameter --checkpoint-sync-url http://XX.XX.XXX.XXX:5052 and the client then continued syncing from the last finalized checkpoint of our already synced node.
In this case we just needed to add the parameter initial-state: http://XX.XX.XXX.XXX:5052/eth/v2/debug/beacon/states/finalized and the client then continued syncing from the last finalized checkpoint of our already synced node.
We were not able to execute the checkpoint sync using Nimbus 1.6.0, as it is not supported in that version. We had to update to version 1.7.0, as recommended by the Nimbus team.
We just had to add:
trustedNodeSync --trusted-node-url=http://X.X.X.X:5051
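A sketch of the full invocation (the binary path, network, and data directory are assumptions based on the commands above, not the exact invocation we used):
nimbus-eth2/build/nimbus_beacon_node trustedNodeSync --network=mainnet --data-dir=/home/crawler/.nimbus-db/ --trusted-node-url=http://X.X.X.X:5051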
In this case we were able to execute the client using the checkpoint sync by adding the parameters --weakSubjectivityServerUrl http://139.99.75.0:5051/ --weakSubjectivitySyncLatest
Grandine does not support checkpoint sync, so we tried downloading an already synced database into the machine.
However, when executing the client it would start syncing from scratch, so we were not able to run Grandine as a synced node on a Raspberry Pi.
Node Exporter data: 508.5M data points
Python data: 225.2M data points
Eth-Pools tool Prometheus: 2.3M data points
Archival API benchmark: 10.1M data points
Total cells in used CSVs: 146,422,234 ~= 146.4M data points used for plotting
1243 days ~= 29832 CPU hours (1243 days × 24 hours/day)