###### tags: `pubsub` `experiment`
<link rel="stylesheet" type="text/css" href="https://stackpath.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css">
<style>
.markdown-body { max-width: calc(100% - 450px); margin-right: 0; }
.markdown-body .highlight pre, .markdown-body pre {border-width : 2px;}
</style>
# Pubsub Experiment analysis - 10/2022
## XP before the fix on message size
A bug made the message payload null.
| job id | slots | endorsers | block time (s) | time to endorse the whole slot (s) | comment | msgs sent/rcv per level (slot producer) | msgs sent/rcv per level (endorser) | dashboard |
|--------------------------------------|-------|-----------|------------|--------------------------------|-----------|----------------------------------|-----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 4aa6ee06-1e53-44f9-b654-77ff0acef998 | 1 | 30 | 30 | 1.6 | | 43691/22213 | 1795/2491 | |
| bee87394-133e-484f-8d32-13094cbfef76 | 2 | 30 | 30 | 2.5-3.5 | | 54613/27479 | 4272/6104 | |
| 621845fc-7414-43e0-bc6d-00f6f3d32a39 | 1 | 60 | 30 | 1.9 | | | | |
| 53c9f98f-ced5-414f-b13a-07cb117c9079 | 2 | 60 | 30 | 3.3 | | | | |
| 59afa4a8-3f18-43f4-8083-83d81ce88706 | 1 | 200 | 30 | 8.6 | | | | |
| c2310130-b92f-48d5-8147-d2ec46b426a7 | 1 | 300 | 30 | 12 | | 54624/39956 | 277/335 | |
| 55fcf29b-bf87-46b0-b906-51ba5d4cae51 | 1 | 400 | 180 | 43 | | | | |
| 88b6f12f-73a4-4916-b03b-9520e0cb478f | 1 | 400 | 30 | 60 | | | | |
| b918bbfe-bd25-42e1-877f-9acca96977e5 | 1 | 400 | 30 | 60-120 | | | | |
| 2ea970e1-c72b-4d5b-ad4d-f935de46fc68 | 256 | 400 | 300 | 700 | | | | |
| 9d66ae0d-bd5c-4723-94f3-5d938b4c | 256 | 400 | 900 | > 700 | | 30478/ 15428 | 24194/34212 | http://34.243.93.97:3000/d/JsqjheWVz/p2p-boards?orgId=1&var-JobId=9d66ae0d-bd5c-4723-94f3-5d938b4c7ac6&from=1666100271025&to=1666101140741&editPanel=68&tab=query |
| 1a63e093-d1ab-4af4-978a-408369edf263 | 1 | 30 | 30 | | added log | | | |
| f9dab7c2-7f9e-42af-8291-d95d2303 | 1 | 300 | 30 | 14 | added log | | | |
| 9fbc8c6e-b7a2-4b6c-a5a4-44c607bd4c96 | 1 | 300 | 30 | | | | | |
### Analysis
Looking at:
- (A) 4aa6ee06-1e53-44f9-b654-77ff0acef998: 30 endorsers, 1 slot
- (B) bee87394-133e-484f-8d32-13094cbfef76: 30 endorsers, 2 slots
- (C) c2310130-b92f-48d5-8147-d2ec46b426a7: 300 endorsers, 1 slot

Number of messages sent/received for one level, on the slot producer | on an endorser:
- A: 43691/22213 | 1795/2491 - the slot producer has 30 connections to serve, 1.6 s
- B: 54613/27479 | 4272/6104 - the slot producer has 30 connections to serve, 3 s
- C: 54624/39956 | 277/335 - the slot producer has 300 connections to serve, 12 s

#### Problem 1
Theory:
- We observe a performance loss that is linear in the number of connections, for a constant amount of transmitted data.
- Could it be the IO scheduler?
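
As a rough check of the linear-slowdown hypothesis, the sketch below divides the time to endorse by the number of endorsers for the 1-slot, 30 s block time runs of the first table that report a single value. It assumes the slot producer holds one connection per endorser, as in the A/B/C summary; the function and variable names are illustrative only.

```python
# (endorsers, time to endorse in seconds) for the 1-slot, 30 s block time runs.
# Assumption: one slot-producer connection per endorser.
runs = [(30, 1.6), (60, 1.9), (200, 8.6), (300, 12.0), (400, 60.0)]


def seconds_per_connection(runs):
    """Time to endorse divided by the number of endorser connections."""
    return {endorsers: t / endorsers for endorsers, t in runs}


per_conn = seconds_per_connection(runs)
for endorsers, cost in per_conn.items():
    print(f"{endorsers:4d} endorsers: {cost * 1000:.1f} ms per connection")
```

A roughly flat per-connection cost would be consistent with a linear slowdown; a sharp rise at a given size would rather suggest a different regime, which the data alone cannot identify.
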
#### Problem 2: why this huge number of messages?
- Test implementation? Implementation of send on topics?

Manual tests seem to show a normal number of messages. Could the metric be wrong? The code looks correct.
## XP after the fix on message size
8kb payload
| job id | slots | endorsers | block time (s) | time to endorse (s) | comment | msgs sent/rcv per level (slot producer) | msgs sent/rcv per level (endorser) |
|--------------------------------------|-------|-----------|------------|-----------------|----------------------------------|----------------------------------|-----------------------------|
| 46b6c489-1f66-4e4e-bf94-d2fb15e15d70 | 1 | 30 | 30 | 4.1 | | 54613/ 28521 | 2267/ 3185 |
| 7e5c483c-108f-483b-8464-47f133859053 | 2 | 30 | 30 | 4.3 | | 54613/ 27836 | 4487/ 6312 |
| 7254bfa3-5a01-4239-8e37-eb804357 | 1 | 300 | 30 | 21 | (2023 shards, 4 aws node failed) | | |
| f6fe7860-4a6a-48c1-990e-aeedc4c2eb3b | 2 | 300 | 75 | 14-16 | WTF | 32768/ 26097 | 397/ 467 |
| cb69e438-8b7c-44b9-885f-170aa33ef9ef | 1 | 300 | 75 | 15 | | 32768/ 23172 | 157/ 179 |
| cc3624c4-7ceb-4f29-9c51-f4e762e2f0ac | 4 | 300 | 75 | 18 | | | |
| e43473e2-0c7d-416f-b844-b05fb056dfe5 | 1 | 30 | 45 | 4 | | 32553/ 17441 | 1458/ 1995 |
| 1eb0c547-5960-4133-9832-e91e274f33cb | 256 | 30 | 300 | >300 | only 1100 shards seen | | |
### Analysis
Interestingly, the graph of exchanged messages for the XPs with a 75 s block time shows that many messages are still exchanged well after the 15 seconds needed to endorse all the shards.
# Experiments to pinpoint the scaling issue
## XP with forwarders on all topics, before patching topics of messages
grafana data lost
| test id | job id | slots | producers | endorsers | forwarders | block time (s) | timeout | avg | #msgs producer (sent/rcv) | #msgs endorser (sent/rcv) | XP date (day/hour) |
|--------|--------------------------------------|-------|-------|------|------|-------|---------|------------|----------------|----------|------------------|
| 5 | ea9aa964-4de7-433b-bd35-14065443ea91 | 1 | 1 | 400 | 0 | 60 | 900 | 229.684069 | | | |
| 0 | 7d59fae3-895e-4100-a225-804283bc0ef6 | 10 | 10 | 50 | 0 | 30 | 600 | 6.274089 | | | |
| 1 | 4a627fad-2330-4727-af8f-6c6329c9686f | 50 | 50 | 50 | 0 | 30 | 600 | 21.639989 | | | |
| 2 | 4711bd66-53c6-4194-977f-116f00e09bf3 | 100 | 100 | 50 | 0 | 30 | 600 | 62.685817 | | | |
| 3 | b2f2e0dc-fe41-4225-9b44-8195547ebd31 | 200 | 200 | 50 | 0 | 30 | 600 | -nan | | | |
| | | | | | | | | | | | |
## XP with forwarders subscribed to no topic: they only generate additional connections
| test id | job id | slots | producers | endorsers | forwarders | block time (s) | timeout | avg | #msgs producer (sent/rcv) | #msgs endorser (sent/rcv) | XP date (day/hour) | comment |
|--------|--------------------------------------|-------|-------|------|------|-------|---------|------------|----------------|----------|------------------|---------|
| 0 | 675a1c45-1a93-437a-ac1d-584238851030 | 1 | 1 | 50 | 150 | 30 | 600 | -nan | 316222/15252 | 10/14 | 2022-10-26 13:52 |
| 1 | 75508d96-3e5e-4d20-ac45-705961464f2c | 1 | 1 | 50 | 350 | 30 | 600 | -nan | 1422819/507 | 4/4 | 2022-10-26 13:02 |
| 0 | 64272f21-eca0-4047-96c6-7b1d3a0ce409 | 1 | 1 | 25 | 0 | 30 | 600 | 3.364217 | 8192/3596 | 164/328 | 2022-10-26 13:15 |
| 1 | a539204b-ff08-4149-a74c-6f4fe2f58c43 | 1 | 1 | 50 | 0 | 30 | 600 | 3.474912 | 43691/21845 | 437/875 | 2022-10-26 13:26 |
| 2 | 0ac4c682-71e1-4ed2-9883-bbf58254d121 | 1 | 1 | 100 | 0 | 30 | 600 | 6.620178 | 8192/4096 | 42/84 | 2022-10-26 13:35 |
| 3 | e8a55b3f-76f2-46d0-854a-2a18fb7eae1b | 1 | 1 | 200 | 0 | 30 | 600 | 10.015017 | 8192/4096 | 22/44 | 2022-10-26 13:43 |
| 4 | 9f99c600-13b7-487a-96e8-76e610492308 | 1 | 1 | 300 | 0 | 60 | 900 | 18.431300 | 8192/4096 | 14/28 | 2022-10-26 13:52 |
| | e9fe9733-d69d-4c69-83a4-3b23b9f3 | 1 | 1 | 50 | 350 | 30 | 900 | 47.5 | 8175/4079 | 82/152 | 2022-10-28 10:33 |
| | 5947f884-57ca-49ad-9098-016d3322313d | 1 | 1 | 50 | 350 | 60 | 900 | 35.8 | | | 2022-10-28 10:33 | VCPU 0.25 |
| | d8ade056-40e0-4d9c-adfe-f56cb4ff671b | 1 | 1 | 50 | 0 | 30 | 900 | 3.9 | | | 2022-10-28 10:33 |
| | b3db99a0-b2a0-4b08-b262-da8f9c5ed423 | 1 | 1 | 50 | 350 | 60 | 900 | 4.9 | | | 2022-10-28 12:25 | VCPU 0.5 |

Notes:
- For the exact same number of messages exchanged, adding passive connections (forwarders subscribed to no topic) produces a huge slowdown of data distribution.
- Increasing CPU performance almost cancels the slowdown.

Theory:
Somewhere in lib_p2p we make too many calls that read the connection table, probably to get the connections of a topic or something similar.

Course of action:
- Add CPU/memory/IO monitoring (with netdata).
- Inspect the lib_p2p code for all queries of the connection table.
- Use perf on our machine, on a slot producer connected to a lot of dummy forwarders, to pinpoint the hotspot.
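
To make the theory concrete, here is a minimal Python sketch of the suspected pattern. It is not lib_p2p code: `ScanningTable`, `IndexedTable`, `build` and the topic name `"shards"` are all hypothetical. It only illustrates how a lookup that scans every connection pays for passive forwarders, while a per-topic index does not, for the same messages delivered.

```python
from collections import defaultdict


class ScanningTable:
    """Hypothetical connection table: each lookup scans all connections."""

    def __init__(self):
        self.topics_by_conn = {}  # connection id -> set of topics
        self.visited = 0          # number of table entries inspected so far

    def add(self, conn, topics):
        self.topics_by_conn[conn] = set(topics)

    def subscribers(self, topic):
        result = []
        for conn, topics in self.topics_by_conn.items():
            self.visited += 1
            if topic in topics:
                result.append(conn)
        return result


class IndexedTable(ScanningTable):
    """Same interface, with a topic -> connections index maintained on add."""

    def __init__(self):
        super().__init__()
        self.conns_by_topic = defaultdict(list)

    def add(self, conn, topics):
        super().add(conn, topics)
        for topic in topics:
            self.conns_by_topic[topic].append(conn)

    def subscribers(self, topic):
        conns = self.conns_by_topic.get(topic, [])
        self.visited += len(conns)
        return list(conns)


def build(table_cls, endorsers=50, forwarders=350):
    """Populate a table like the 50 endorsers / 350 forwarders runs."""
    table = table_cls()
    for i in range(endorsers):
        table.add(f"endorser-{i}", ["shards"])
    for i in range(forwarders):
        table.add(f"forwarder-{i}", [])  # passive: subscribed to no topic
    return table


scan, indexed = build(ScanningTable), build(IndexedTable)
for _ in range(1000):  # arbitrary number of messages to send
    assert scan.subscribers("shards") == indexed.subscribers("shards")
print("entries visited:", scan.visited, "(scan) vs", indexed.visited, "(index)")
```

If perf shows a hotspot of this shape, the work per message grows with the total number of connections rather than with the number of subscribers, which would match the slowdown observed with dummy forwarders.
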
## 16/11/2022
### 0.25 cpu / 512 mem
| jobid | slots | producers | endorsers | forwarders | blocktime | timeout | avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8a5f2bba-6b5a-4e65-94ee-3fed293c798e | 1 | 1 | 50 | 150 | 30 | 600 | -nan |
| 27590fb8-0091-47f3-8779-beffb38ce5df | 1 | 1 | 50 | 350 | 30 | 600 | -nan |
| ee345dc5-d34c-4386-b0e5-4836d7d1291f | 1 | 1 | 25 | 0 | 30 | 600 | 4.463002 |
| 67f55f1f-afd4-4c38-9c97-8781ef2f45fa | 1 | 1 | 50 | 0 | 30 | 600 | 4.901103 |
| e4032b7a-21b1-4095-8522-5ba34edac91d | 1 | 1 | 100 | 0 | 30 | 600 | 5.303841 |
| 458d89ca-d893-4154-9522-c252c4d8daad | 1 | 1 | 200 | 0 | 30 | 600 | 15.358196 |
| 9f8be8be-90a8-4b26-9289-47dc1dc17f14 | 1 | 1 | 300 | 0 | 60 | 900 | 25.941539 |
| b6651143-2a8d-4d45-845f-c107f4166ef7 | 1 | 1 | 400 | 0 | 60 | 900 | 222.708074 |
| 33d05e91-76c0-4742-bb11-adc566e71ab0 | 10 | 10 | 50 | 0 | 30 | 600 | 6.360722 |
| a7d25dd5-87de-4663-bdb7-f7563a0ed2c4 | 50 | 50 | 50 | 0 | 30 | 600 | 21.282526 |
| 99fad6e6-31fd-4d2d-9681-82ce365412a8 | 100 | 100 | 50 | 0 | 30 | 600 | 64.550792|
| cdb07d4a-cd17-4397-a427-315f3f99372b | 200 | 200 | 50 | 0 | 30 | 600 | -nan |
### 0.50 cpu / 1024 mem
| jobid | slots | producers | endorsers | forwarders | blocktime | timeout | avg | comment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 85bc758a-539c-40c4-a3e3-7c20e38c5d7b | 1 | 1 | 50 | 150 | 30 | 600 | -nan |
| d47320b3-4c61-4d90-9091-a6a1bbc3f130 | 1 | 1 | 50 | 350 | 30 | 600 | -nan |
| 12f4c468-3159-409f-89e4-c6ea18a9ca83 | 1 | 1 | 25 | 0 | 30 | 600 | 1.095815 |
| 3a71932f-2022-4f1b-b10d-84dbe3ef1fa7 | 1 | 1 | 50 | 0 | 30 | 600 | 1.311979 |
| 0446537b-7f12-437a-9648-d962190073a6 | 1 | 1 | 100 | 0 | 30 | 600 | 1.609642 |
| 51b39e28-ed22-4005-ace2-41c680e525de | 1 | 1 | 200 | 0 | 30 | 600 | 3.822891 |
| 8fc4ca46-aed5-4d84-abb1-4d33817bed8d | 1 | 1 | 300 | 0 | 60 | 900 | 9.746629 |
| 4b80af17-f5b3-4251-b69c-ae997a732db1 | 1 | 1 | 400 | 0 | 60 | 900 | 8.354329 |
| 8f078b1f-34e4-45f0-9b62-dbe9c5a5c5c4 | 10 | 10 | 50 | 0 | 30 | 600 | 2.141770 |
| 156ced03-2671-4ef3-b0b3-89615cf6edfb | 50 | 50 | 50 | 0 | 30 | 600 | 7.384255 |
| d3a24fc1-7792-4619-8a05-0f1a256d46ca | 100 | 100 | 50 | 0 | 30 | 600 | 17.072348 |
| 6609194a-8ed3-4c37-94ad-4ac7db27f9f1 | 200 | 200 | 50 | 0 | 30 | 600 | 91.854090 |
| 8770c152-bd9c-459f-ac09-914a87cac091 | 10 | 10 | 400 | 0 | 60 | 900 | 19.042846 |
| 89dffa06-b129-45c5-9ffe-a27db86ff0cd | 25 | 25 | 400 | 0 | 120 | 1200 | 21 | losing connections at level 4 |
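
The effect of the larger instance can be read off by dividing, configuration by configuration, the `avg` of the 0.25 cpu / 512 mem table by the `avg` of the 0.50 cpu / 1024 mem table. The sketch below does that for the runs present in both tables with a numeric value. Keys are `(slots, endorsers)`; the names are illustrative, and since CPU and memory change together the ratio cannot be attributed to CPU alone.

```python
# avg per (slots, endorsers), copied from the two tables above.
# Runs reported as -nan in either table are left out.
avg_quarter_cpu = {
    (1, 25): 4.463002, (1, 50): 4.901103, (1, 100): 5.303841,
    (1, 200): 15.358196, (1, 300): 25.941539, (1, 400): 222.708074,
    (10, 50): 6.360722, (50, 50): 21.282526, (100, 50): 64.550792,
}
avg_half_cpu = {
    (1, 25): 1.095815, (1, 50): 1.311979, (1, 100): 1.609642,
    (1, 200): 3.822891, (1, 300): 9.746629, (1, 400): 8.354329,
    (10, 50): 2.141770, (50, 50): 7.384255, (100, 50): 17.072348,
}


def speedups(slow, fast):
    """Ratio slow/fast for every configuration present in both tables."""
    return {cfg: slow[cfg] / fast[cfg] for cfg in slow if cfg in fast}


ratios = speedups(avg_quarter_cpu, avg_half_cpu)
for (slots, endorsers), ratio in ratios.items():
    print(f"{slots:3d} slots, {endorsers:3d} endorsers: x{ratio:.1f}")
```
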
### 0.50 cpu / 2048 mem
| jobid | slots | producers | endorsers | forwarders | blocktime | timeout | avg | comment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0932c44a-ccb5-4279-a7e4-3252be13fbde | 25 | 25 | 400 | 0 | 30 | 1200 | 16 | losing connections at level 8 |
### 0.50 cpu / 2048 mem / Part 1.5
| jobid | slots | producers | endorsers | forwarders | blocktime | timeout | avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 24d20625-6acc-4274-a9fa-d2ced36dcfe7 | 1 | 1 | 50 | 150 | 30 | 600 | 36.535|
| faf1a962-b3ec-4bd9-a4dd-e0d0d60f4c7b | 1 | 1 | 10 | 350 | 60 | 600 | 9 |
| 3060a786-2363-49d6-9b85-1c957bbb96b9 | 1 | 1 | 25 | 0 | 30 | 600 | 1.166 |
| 6d44bbb2-a9bc-4773-8d0b-50947f73d85e | 1 | 1 | 50 | 0 | 30 | 600 | 1.4 |
| | 1 | 1 | 100 | 0 | 30 | 600 | |
| 943d74e3-90c3-48f0-929e-e968a009484a | 1 | 1 | 200 | 0 | 30 | 600 | |
| | 1 | 1 | 300 | 0 | 60 | 900 | |
| | 1 | 1 | 400 | 0 | 60 | 900 | |
| | 10 | 10 | 50 | 0 | 30 | 600 | |
| | 50 | 50 | 50 | 0 | 30 | 600 | |
| | 100 | 100 | 50 | 0 | 30 | 600 | |
| | 200 | 200 | 50 | 0 | 30 | 600 | |
| | 10 | 10 | 400 | 0 | 60 | 900 | |
| | 25 | 25 | 400 | 0 | 120 | 1200 | |
### XP with new recv
| jobid | slots | producers | endorsers | forwarders | blocktime | timeout | avg | RAM | comment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 28f0c479-18de-4852-9068-3dcdd8780e2d | 1 | 1 | 200 | 0 | 60 | 600 | 3.3 | 1024 | |
| 93a5c9c6-8eff-489f-b91e-154da58100a3 | 25 | 25 | 400 | 0 | 60 | 600 | 20. | 1024 | 2003 shards max |
| 53e723a6-a1a2-4bf4-9b92-690f6eaa7afe | 25 | 25 | 400 | 0 | 60 | 600 | 49 | 2048 | 2038 shards max due to batch loss |
### XP with bootstrap exponential backoff
| jobid | slots | producers | endorsers | forwarders | blocktime | timeout | time | RAM | CPU | comment | monitoring |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3ff75241-aea8-4990-9ef4-91dda2e275b7 | 25 | 25 | 400 | 0 | 60 | 600 | 2 levels at 4 s, A at 1 min, then between 20 s and 15 s | 4096 | 1 VCPU | 2038 then 2043 shards max due to batch loss | http://34.243.93.97:3000/d/ZDpHtoN4z/p2p-boards-second-take?from=1670017727904&to=1670018634954&var-p2pfilter=job%7C%3D%7C3ff75241-aea8-4990-9ef4-91dda2e275b7 |
# mainnet distrib
| jobid | slots | producers | endorsers | forwarders | blocktime | timeout | avg | RAM | VCPU | comment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ac6c32bc-3648-4da2-be02-5b39588e3b8a | 25 | 25 | 225 | 0 | 60 | 600 | 4.5 | 4096 | 1 | 2045 shards max due to batch loss |
| 77a1f8ed-a031-47c8-ab10-14f659873eae | 25 | 100 | 225 | 0 | 60 | 600 | 16 | 4096 | 1 | 2048 shards max due to batch loss |
| 4283fd76-c731-4c16-9535-04744c55a097 | 25 | 200 | 225 | 0 | 60 | 600 | 30 | 4096 | 1 | 2022 shards max due to batch loss |
# Slot consumer
| jobid | slots | prod/cons | endorsers | forwarders | blocktime | timeout | avg | RAM | VCPU | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| b0c39edc-eb0f-430a-8364-e0eb8e8b0c70 | 25 | 25 | 225 | 0 | 60 | 600 | 4.5 | 4096 | 1 | |
| 675bb37e-a9c7-4913-bb91-6561ad976f58 | 25 | 200 | 225 | 0 | 60 | 600 | 42 | 4096 | 1 | |
# Slot consumer, new_recv, mainnet distrib
| jobid | slots | prod/cons | endorsers | forwarders | blocktime | timeout | avg | RAM | VCPU | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| adc932e5-c1fa-476e-bc81-57d72857b27e | 25 | 25 | 225 | 0 | 30 | 600 |18 | 4096 | 1 | |
| c7f2493e-dd17-4b1b-9733-1cff010459ce | 25 | 256 | 225 | 0 | 60 | 600 | | 4096 | 1 | |