prd-elc-besu-prysm-mainnet-nightly-forest-snap
Performance Analysis
Date of analysis: 23/04/2024
The node had some blocks-behind spikes but ultimately couldn't recover, starting around 08/04/2024 01:00:00 - let's call this the "incident".
Based on the drop in peers, this time appears to coincide with a (manual?) restart of one or both of the clients, but we are missing the older logs so can't be sure.
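If systemd manages the two clients, the restart time may still be recoverable from the journal; a sketch, assuming unit names besu and prysm-beacon and that the journal still covers that window:
$ journalctl -u besu -u prysm-beacon --since "2024-04-07" --until "2024-04-09" | grep -iE 'started|stopped'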
There have been a number of OOM killer events, but none at the time of the incident.
Only the following are recorded in dmesg…
$ sudo dmesg -T | grep "Killed process"
[Fri Apr 12 01:49:06 2024] Out of memory: Killed process 28925 (prysm-beacon) total-vm:33279532kB, anon-rss:12255500kB, file-rss:5888kB, shmem-rss:0kB, UID:1025 pgtables:34808kB oom_score_adj:0
[Thu Apr 18 05:44:28 2024] Out of memory: Killed process 193594 (prysm-beacon) total-vm:32956988kB, anon-rss:10152940kB, file-rss:2560kB, shmem-rss:0kB, UID:1025 pgtables:35140kB oom_score_adj:0
[Mon Apr 22 13:30:37 2024] Out of memory: Killed process 310385 (prysm-beacon) total-vm:34276952kB, anon-rss:10531612kB, file-rss:8960kB, shmem-rss:0kB, UID:1025 pgtables:30568kB oom_score_adj:0
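A quick tally of which processes the OOM killer has hit, sketched against the dmesg format above:
$ sudo dmesg -T | grep -oE 'Killed process [0-9]+ \([^)]+\)' | awk '{print $NF}' | sort | uniq -c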
Nothing in the red, but RAM Used is quite high
$ uptime
06:47:00 up 13 days, 5:07, 2 users, load average: 0.71, 0.92, 0.92
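For reference, the load averages are all below 1 while the host has 4 CPUs (see the pidstat header further down), so the run queue isn't the headline problem:
$ nproc
4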
https://www.jfokus.se/jfokus17/preso/Using-jPDM–a-performance-diagnostic-model.pdf
No…
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 0 269336 44968 3047964 0 0 1409 401 4 7 11 1 79 9 0
1 0 0 269336 45072 3047964 0 0 0 824 8549 10326 44 2 52 2 0
5 0 0 269336 45120 3047964 0 0 0 640 2429 2835 62 1 36 1 0
...
...
0 0 0 203248 45756 3119512 0 0 272 880 7286 10620 4 1 92 3 0
1 0 0 203248 45844 3119504 0 0 0 996 5658 8311 2 2 92 4 0
0 0 0 201740 45880 3119496 0 0 0 408 1791 2825 2 1 96 2 0
...
...
3 0 0 230860 36164 2443916 0 0 0 820 6453 11340 65 1 32 2 0
3 0 0 230860 36212 2443916 0 0 0 416 1965 1736 76 0 24 1 0
1 0 0 220528 36256 2444360 0 0 0 5784 3519 5217 60 1 38 1 0
Two modes in the vmstat output: busy stretches with high user CPU and a non-empty run queue, and quieter mostly-idle stretches.
No obvious saturation there, although…
$ pidstat 1
Linux 6.5.0-1018-azure (prd-elc-besu-prysm-mainnet-nightly-forest-snap) 04/23/24 _x86_64_ (4 CPU)
04:51:56 UID PID %usr %system %guest %wait %CPU CPU Command
04:51:57 102 578 0.99 0.00 0.00 0.00 0.99 3 dbus-daemon
04:51:57 998 588 2.97 2.97 0.00 0.00 5.94 1 node_exporter
04:51:57 0 909 0.99 0.00 0.00 0.00 0.99 0 dockerd
04:51:57 104 940 0.99 0.00 0.00 0.00 0.99 2 mdsd
04:51:57 1025 478418 148.51 3.96 0.00 0.99 152.48 0 prysm-beacon
04:51:57 1024 502127 0.99 0.00 0.00 0.00 0.99 0 java
prysm-beacon has periodic high usage. pidstat reports %CPU per core, so on this 4-CPU host 148% out of a possible 400% is roughly 37% of total capacity.
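Note that sysstat's pidstat can do that normalisation itself; the -I flag divides %CPU by the number of processors (worth confirming the installed version supports it):
$ pidstat -I 1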
The OOM kills show that memory has clearly been saturated at points.
$ free -m
total used free shared buff/cache available
Mem: 15979 12745 182 4 3051 2883
Swap: 0 0 0
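To watch whether the available column keeps shrinking while the clients sync, free can repeat on an interval; a minimal sketch, sampling once a minute:
$ free -m -s 60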
htop (filtered by prysm)
htop (filtered by besu)
…many more threads
$ mpstat -P ALL 1
...
06:35:38 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
06:35:39 all 98.25 0.00 1.50 0.00 0.00 0.25 0.00 0.00 0.00 0.00
06:35:39 0 99.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:35:39 1 95.05 0.00 3.96 0.00 0.00 0.99 0.00 0.00 0.00 0.00
06:35:39 2 99.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:35:39 3 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:35:39 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
06:35:40 all 16.96 0.00 1.01 22.28 0.00 0.00 0.00 0.00 0.00 59.75
06:35:40 0 15.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 0.00 73.00
06:35:40 1 14.29 0.00 2.04 20.41 0.00 0.00 0.00 0.00 0.00 63.27
06:35:40 2 14.14 0.00 2.02 54.55 0.00 0.00 0.00 0.00 0.00 29.29
06:35:40 3 24.49 0.00 0.00 2.04 0.00 0.00 0.00 0.00 0.00 73.47
06:35:40 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
06:35:41 all 3.53 0.00 0.76 24.18 0.00 0.25 0.00 0.00 0.00 71.28
06:35:41 0 2.97 0.00 0.99 1.98 0.00 0.99 0.00 0.00 0.00 93.07
06:35:41 1 5.10 0.00 1.02 73.47 0.00 0.00 0.00 0.00 0.00 20.41
06:35:41 2 4.04 0.00 1.01 14.14 0.00 0.00 0.00 0.00 0.00 80.81
06:35:41 3 2.02 0.00 0.00 8.08 0.00 0.00 0.00 0.00 0.00 89.90
Again two modes: a user-CPU-dominant burst, then a transition with noticeable %iowait, then mostly idle.
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 788956 57016 3166064 0 0 1409 401 2 7 11 1 79 9 0
0 0 0 806280 56804 3165040 0 0 0 568 1855 2904 1 0 97 3 0
2 0 0 805776 56836 3165040 0 0 0 392 2433 3274 1 1 97 2 0
0 0 0 805040 56880 3165040 0 0 0 680 10403 11584 3 2 93 3 0
1 0 0 803320 56892 3165424 0 0 96 5716 4870 8121 27 1 69 3 0
5 0 0 803320 56944 3165424 0 0 0 576 6559 11370 40 1 58 2 0
Mostly r <= 4, but a couple of times r = 5 on this 4-CPU machine, i.e. at least one runnable task waiting for a CPU.
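A quick way to flag samples where runnable tasks exceed the 4 CPUs, sketched against the vmstat column layout above:
$ vmstat 1 | awk 'NR > 2 && $1+0 > 4 {print "runnable tasks:", $1}'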
Using pidstat alongside tail -f /var/log/prysm/beacon.log /var/log/besu/besu.log, with some besu DEBUG logging enabled to show the engine API calls…
We can see that the periodic CPU spikes from prysm-beacon coincide with this initial-sync log:
{"batchSize":64,"blocksPerSecond":"3.2","estimatedTimeRemaining":27122000000000,"latestProcessedSlot/currentSlot":"8832096/8918887","level":"info","msg":"Processing blocks","peers":80,"prefix":"initial-sync","startingFrom":"0x9013a27c...","time":"2024-04-23T05:37:47Z"}
Then besu receives a batch of newPayloads, all but one of which it already has, but no FCU (forkchoiceUpdated) follows…
{"@timestamp":"2024-04-23T05:37:56,713","level":"DEBUG","thread":"vert.x-worker-thread-0","class":"AbstractEngineNewPayload","message":"New payload: number: 19630000, hash: 0x496ee8b8e1cddd7d43d9ab63ad68e4b78ad11e6a673aa0529fa938972f124344, parentHash: 0x555bb91b816369f0dda0f3df8919165d9e6948836967cf358a9097fc434bcb87, latestValidHash: 0x496ee8b8e1cddd7d43d9ab63ad68e4b78ad11e6a673aa0529fa938972f124344, status: VALID","throwable":""}
==> /var/log/prysm/beacon.log <==
{"error":"could not process block in batch: timeout from http.Client: received an undefined execution engine error","level":"warning","msg":"Skip processing batched blocks","prefix":"initial-sync","time":"2024-04-23T05:38:04Z"}
{"error":"could not process block in batch: timeout from http.Client: received an undefined execution engine error","level":"warning","msg":"Skip processing batched blocks","prefix":"initial-sync","time":"2024-04-23T05:38:04Z"}
{"error":"beacon node doesn't have a parent in db with root: 0xafadb55cbf7f88cda7a3eeff22921a4c4383c2b527ee0baabc97c2ebbd62ad96 (in processBatchedBlocks, slot=8832160)","level":"warning","msg":"Skip processing batched blocks","prefix":"initial-sync","time":"2024-04-23T05:38:04Z"}
{"error":"beacon node doesn't have a parent in db with root: 0xafadb55cbf7f88cda7a3eeff22921a4c4383c2b527ee0baabc97c2ebbd62ad96 (in processBatchedBlocks, slot=8832160)","level":"warning","msg":"Skip processing batched blocks","prefix":"initial-sync","time":"2024-04-23T05:38:04Z"}
==> /var/log/besu/besu.log <==
{"@timestamp":"2024-04-23T05:38:06,874","level":"INFO","thread":"vert.x-worker-thread-0","class":"AbstractEngineNewPayload","message":"Imported #19,630,001 / 134 tx / 16 ws / 1 blobs / base fee 12.78 gwei / 11,558,776 (38.5%) gas / (0x126cc21c45fa3ea60aef24d4593816729c728d75865c9848b58413acac1cc5fe) in 10.136s. Peers: 25","throwable":""}
{"@timestamp":"2024-04-23T05:38:06,874","level":"DEBUG","thread":"vert.x-worker-thread-0","class":"AbstractEngineNewPayload","message":"New payload: number: 19630001, hash: 0x126cc21c45fa3ea60aef24d4593816729c728d75865c9848b58413acac1cc5fe, parentHash: 0x496ee8b8e1cddd7d43d9ab63ad68e4b78ad11e6a673aa0529fa938972f124344, latestValidHash: 0x126cc21c45fa3ea60aef24d4593816729c728d75865c9848b58413acac1cc5fe, status: VALID","throwable":""}
This log is also troubling…
{"error":"could not process block in batch: timeout from http.Client: received an undefined execution engine error","level":"warning","msg":"Skip processing batched blocks","prefix":"initial-sync","time":"2024-04-23T07:32:47Z"}
Prysm is not in sync and doesn't seem to be successfully bringing besu into sync, due to the lack of FCUs.
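A cheap cross-check on the FCU drought is to compare how often besu logs newPayload versus forkchoiceUpdated handling; a sketch, assuming the forkchoiceUpdated handler logs under a class name containing "ForkchoiceUpdated" the way the newPayload one above contains "NewPayload":
$ grep -c 'NewPayload' /var/log/besu/besu.log
$ grep -c 'ForkchoiceUpdated' /var/log/besu/besu.log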
This is clearly a memory-constrained system, but it is not clear whether the memory issues have caused what now appears to be a misbehaving application.
Not all the logs and metrics were available from the time of the "incident".
Recommend another round of analysis as soon as the issue begins occurring again.
UPDATE: Following the prysm resync, the root cause could actually be disk saturation https://hackmd.io/kJ4nxfr4Qj6wV3BBA0hW6A#2-DONE-Resync-Prysm
This Azure node doesn't have disk caching enabled, unlike the other canaries.
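The data-disk caching setting can be confirmed from the Azure CLI; a sketch with the resource group as a placeholder and assuming the VM name matches the hostname:
$ az vm show -g <resource-group> -n prd-elc-besu-prysm-mainnet-nightly-forest-snap --query "storageProfile.dataDisks[].{lun:lun, caching:caching}" -o table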
# stopping
{"level":"info","msg":"Stopping beacon node","prefix":"node",
"time":"2024-04-23T07:16:36Z"}
{"currentSlot":8919381,"level":"info","msg":"Synced to finalized epoch - now syncing blocks up to current head","prefix":"initial-sync","syncedSlot":8832223,
"time":"2024-04-23T07:16:36Z"}
# starting
{"level":"info","logFileName":"/var/log/prysm/beacon.log","msg":"Logs will be made persistent",
"time":"2024-04-23T07:18:07Z"}
It was 8919381 - 8832223 = 87158 slots behind. We may as well resync prysm.
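For scale, at mainnet's 12 s slot time that gap is about 12 days of chain history:
$ echo $(( (8919381 - 8832223) * 12 / 86400 ))   # whole days behind at 12 s per slot
12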
After the update, memory drops, but "RAM Cache + Buffer" quickly returns to the 16GB limit and RAM Used is also climbing.
This has triggered a backward sync in besu:
{"@timestamp":"2024-04-23T07:42:58,857","level":"INFO","thread":"vert.x-worker-thread-0","class":"BackwardSyncContext","message":"Starting a new backward sync session","throwable":""}
BWS is painfully slow: ~11s per block, barely faster than the chain's 12s slot time!
{"@timestamp":"2024-04-23T07:45:37,063","level":"INFO","thread":"ForkJoinPool.commonPool-worker-3","class":"BackwardSyncContext","message":"Backward sync phase 2 of 2, 0.01% completed, imported 6 blocks of at least 86605 (current head 19630090, target head 19716689). Peers: 25","throwable":""}
{"@timestamp":"2024-04-23T07:45:49,016","level":"INFO","thread":"ForkJoinPool.commonPool-worker-3","class":"BackwardSyncContext","message":"Backward sync phase 2 of 2, 0.01% completed, imported 7 blocks of at least 86607 (current head 19630091, target head 19716691). Peers: 25","throwable":""}
{"@timestamp":"2024-04-23T07:46:04,662","level":"INFO","thread":"ForkJoinPool.commonPool-worker-1","class":"BackwardSyncContext","message":"Backward sync phase 2 of 2, 0.01% completed, imported 9 blocks of at least 86608 (current head 19630093, target head 19716692). Peers: 25","throwable":""}
Could be due to I/O saturation…
$ iostat -xz 1
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 44.00 604.00 93.00 67.88 1.68 13.73 8.00 36156.00 2.00 20.00 2.50 4519.50 0.00 0.00 0.09 9.20
sdb 480.00 21708.00 0.00 0.00 4.00 45.23 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.92 100.00
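sdb is pinned at 100% %util with ~21 MB/s of reads while sda is nearly idle. To attribute that read load to a process, per-task I/O can be sampled as well; a sketch:
$ pidstat -d 1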