# VC stuck (codefi) https://grafana.observability.consensys.net/explore?orgId=3&left=%7B%22datasource%22:%22aXuRxy3nz%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22aXuRxy3nz%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22%7Bcluster%3D%5C%22default-prod%5C%22,%20pod%3D%5C%22teku-validator-0-0%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%221677806780628%22,%22to%22:%221677808893628%22%7D%7D Logs go from earliest to oldest. Times are in UTC. ``` 2023-03-03 01:38:15.016 WARN - Skipping aggregation for slot 5915264 due to unexpected delay in slot processing 2023-03-03 01:33:15.000 DEBUG - Performing attestation duties for slot 5915264, last production slot 5915263 2023-03-03 01:33:11.002 DEBUG - Performing block duties for slot 5915264, last production slot 5915263 2023-03-03 01:33:07.118 INFO - Validator *** Published aggregate Count: 1, Slot: 5915263, Root: 47cd2c7036e0a88d2c888e1d24d50ab7082123bfdf526a2fcd22fd05d3b0a67a ``` VC seems stuck while `performing attestations`. (5 minutes) Then it fired quickly all missed events to keep up ``` ... 2023-03-03 01:38:15.016 WARN - Skipping aggregation for slot 5915280 due to unexpected delay in slot processing 2023-03-03 01:38:15.016 WARN - Skipping aggregation for slot 5915279 due to unexpected delay in slot processing 2023-03-03 01:38:15.016 WARN - Skipping aggregation for slot 5915278 due to unexpected delay in slot processing ... ``` It gets stuck again afterwards on `performing attestations` (4 minutes) ``` 2023-03-03 01:46:15.264 WARN - Skipping aggregation for slot 5915309 due to unexpected delay in slot processing 2023-03-03 01:42:15.175 DEBUG - Performing attestation duties for slot 5915309, last production slot 5915289 2023-03-03 01:42:15.175 DEBUG - Performing block duties for slot 5915309, last production slot 5915289 ``` There are quite a few `Failed to load validator indices` timeout exceptions. It seems like timeout is 30 seconds for the call and the primary beacon node was hanging. The primary node was still considered ready to accept requests, so it never automatically failed over immediately. Probably should set timeouts for `tryRequestUntilSuccess()` calls to ensure primary does not take too long to receive a response and don't leave time for failover to be called. ``` 2023-03-03 01:57:18.838 DEBUG - http://teku-beacon-0.teku-beacon:5051/ is in sync and ready to accept requests ``` ``` Caused by: java.util.concurrent.TimeoutException at java.lang.Thread.run(Unknown Source) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?] at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?] at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture$Timeout.run(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?] java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException 2023-03-03 01:56:48.456 WARN - Failed to load validator indices. Retrying. ``` ## Logging Improvements ``` 2023-03-03 02:01:30.532 DEBUG - Remote request (publish_attestation) which is sent to all configured Beacon Node endpoints failed on the primary Beacon Node http://teku-beacon-0.teku-beacon:5051/. Will try to use a response from a failover. 2023-03-03 02:01:30.469 DEBUG - http://lighthouse-beacon-1.lighthouse-beacon:5051/ is in sync and ready to accept requests 2023-03-03 02:01:30.460 DEBUG - http://teku-beacon-3.teku-beacon:5051/ is in sync and ready to accept requests 2023-03-03 02:01:30.460 DEBUG - http://lighthouse-beacon-0.lighthouse-beacon:5051/ is in sync and ready to accept requests 2023-03-03 02:01:30.459 DEBUG - http://teku-beacon-2.teku-beacon:5051/ is in sync and ready to accept requests 2023-03-03 02:01:30.458 WARN - Validator *** The primary beacon node is not ready to accept requests (offline or not in sync). Future requests will use one of the configured failover beacon nodes until the primary one is ready again. ``` First line is confusing. It should directly go to failovers if configured