# Kusama Mem Leak Task Force
Issues:
- https://github.com/paritytech/substrate/issues/5106
- https://github.com/paritytech/substrate/issues/4249
- https://github.com/paritytech/substrate/issues/4679
Discussion:
- [Substrate Matrix/Riot Channel](https://matrix.to/#/!aenJixaHcSKbJOWxYk:matrix.parity.io?via=matrix.parity.io)
## Problem
We can see Kusama nodes running out of memory quickly: sometimes, over the run of just a couple of hours, they ramp up gigabytes of memory, most of it likely unused.
## Overview
_Subsystems can and should appear multiple times, confirmed by multiple methods._
| Subsystem | Status | Method | Details |
| -------- | -------- | ----|-------- |
| Libp2p | ✔️ _cleared_ | added verbose logging to the internals | [Cleared by @tomaka](#Libp2p) |
| Libp2p & sc_network except for mostly sync.rs & protocol.rs | ✔️ _looks normal_ | extracted into a different binary, and while connected to Kusama showed a constant ~40MB memory usage with ~80 connections | |
| futures 0.1 <-> 0.3 compat layer | ✔️ _looks correct_ | Code review by @tomaka | Suspicious because shows up in heaptrace |
| _all_ | ❌ _unlear_ | tracking with heaptrack over 24h | turned out not that useful |
| sc-peerset | _unclear_ | Code review | Known small leak being fixed; could be investigated more afterwards, but unlikely to be the cause according to @tomaka |
| Rocksdb | ✔️ _looks normal_ | | Niko: added a memtest in [parity-common#349](https://github.com/paritytech/parity-common/pull/349/files), ran it for a few hours on GCP; memory never grew beyond the constrained ~100MB |
| Polkadot `TableRouter` & `ValidationNetwork` | ❌ _open_ | | Might not be garbage-collected correctly. This code is being removed in a live PR |
| Consensus Gossip | ⚠️ _confirmed minor leak_ | Code review | [Uncovered a minor HashMap leak of `peer` info and `hash(Message)`](https://github.com/paritytech/substrate/pull/5104), Basti / Ben |
## Suspects
We have a group of suspects that might be causing memory to be kept longer than it should. Among them are uses of `HashMap`s, `HashSet`s, `mpsc::unbounded` channels and generic `Buffer`s. We need to make sure none of them leaks.
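For example (a hypothetical sketch, not code from the Substrate tree), an unbounded channel keeps every queued message alive until the receiver drains it, so a receiver that falls behind or stops being polled turns into a leak:
```rust
use futures::channel::mpsc;

// Hypothetical illustration of the suspect pattern: nothing bounds the
// queue, so every message sent stays alive until the receiver drains it.
fn spawn_notifications() -> mpsc::UnboundedReceiver<Vec<u8>> {
    let (tx, rx) = mpsc::unbounded();
    std::thread::spawn(move || loop {
        // If the receiver is polled more slowly than we send (or stops
        // being polled entirely), this backlog grows without bound.
        if tx.unbounded_send(vec![0u8; 1024]).is_err() {
            break;
        }
    });
    rx
}
```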
## Methods
### To reproduce the issue
The issue can be seen on [`flaming-fir` ](https://telemetry.polkadot.io/#list/Flaming%20Fir): just run the latest Substrate with the following command and watch your memory usage continuously increase (for example with `htop`)
```bash
$ cargo run --release -- --chain flaming-fir --pruning 10 --wasm-execution Compiled --db-cache 0 --execution-import-block Native --offchain-worker Never
```
### Using specific tooling
#### To use massif
....
#### Heaptrack
_doesn't seem to be that useful_
...
To get heaptrack to display function names correctly, you seem to need to change the binary being profiled to use the system allocator:
```rust
// main.rs
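// Override the default allocator with the system allocator so heaptrack
// can attribute allocations to function names.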
use std::alloc::System;
#[global_allocator]
static GLOBAL: System = System;
```
Then you can run it with `heaptrack target/release/polkadot`.
\- _Ashley_
### Code Review
Reviewing a specific subsystem for suspicious usage patterns, like unbounded queues, potentially growing HashMaps/-Sets or any other buffer.
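A hypothetical illustration of the kind of pattern such a review flags (made-up types, not actual Substrate code): per-peer state that is created and grows with every message, but is never removed on disconnect.
```rust
use std::collections::HashMap;

// Hypothetical types, only to illustrate the pattern a review looks for.
type PeerId = [u8; 32];
type MessageHash = [u8; 32];

#[derive(Default)]
struct GossipState {
    known_messages: HashMap<PeerId, Vec<MessageHash>>,
}

impl GossipState {
    fn on_message(&mut self, peer: PeerId, hash: MessageHash) {
        // Per-peer state is created and grows here with every message ...
        self.known_messages.entry(peer).or_default().push(hash);
    }

    fn on_peer_disconnected(&mut self, _peer: PeerId) {
        // ... but is never removed here: exactly the kind of leak a
        // review should flag. The fix is to remove the peer's entry.
    }
}
```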
### Buffer Tracking
If you find an unbounded queue, bound it and/or add logging to track whether more goes in than comes out.
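One way to add such logging (a sketch with hypothetical counter names, not existing instrumentation): keep atomic in/out counters next to the suspect queue and log the backlog periodically.
```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical counters placed next to a suspect queue: bump IN at every
// enqueue site and OUT at every dequeue site.
static IN: AtomicUsize = AtomicUsize::new(0);
static OUT: AtomicUsize = AtomicUsize::new(0);

fn on_enqueue() {
    IN.fetch_add(1, Ordering::Relaxed);
}

fn on_dequeue() {
    OUT.fetch_add(1, Ordering::Relaxed);
}

// Log this periodically: a backlog that only ever grows means the queue
// is being filled faster than it is drained.
fn log_backlog() {
    let backlog = IN.load(Ordering::Relaxed).saturating_sub(OUT.load(Ordering::Relaxed));
    eprintln!("queue backlog: {}", backlog);
}
```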
#### using `malloc_size_of`
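A rough sketch of the idea, assuming the `parity-util-mem` crate with its `MallocSizeOf` derive and `MallocSizeOfExt` helper (the exact API may differ): derive `MallocSizeOf` on the suspect structure and log its heap footprint periodically.
```rust
// Sketch only: assumes `parity-util-mem` with its derive macro and the
// `MallocSizeOfExt::malloc_size_of()` helper; the exact API may differ.
use parity_util_mem::{MallocSizeOf, MallocSizeOfExt};
use std::collections::HashMap;

#[derive(Default, MallocSizeOf)]
struct PeerTable {
    // The suspect collection whose heap footprint we want to watch.
    entries: HashMap<Vec<u8>, Vec<u8>>,
}

fn report(table: &PeerTable) {
    // A value that only ever grows over time corroborates a leak here.
    eprintln!("peer table heap size: {} bytes", table.malloc_size_of());
}
```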
## Details
### Heaptrack runs
- https://gist.github.com/arkpar/dd0305b4d0706bca8073b43290f1c6e2
### Database
#### rocksdb
I observed that running with `--db-cache 0` removes a lot of RocksDB noise from memory profiling (with a memory profiler), but does not seem to improve things. (emeric)
### Executor
#### Wasm Executor
### Libp2p
_Method_: @tomaka replaced the Libp2p dependencies via a cargo patch with a libp2p version that reports much more verbosely. **BRANCH**? There's no branch, I just overrode it with a local version with printlns everywhere, then discarded everything when nothing was out of the ordinary.
_Result_: No suspicious activity – [see](https://matrix.to/#/!aenJixaHcSKbJOWxYk:matrix.parity.io/$1583158698100909wgzsO:matrix.parity.io?via=matrix.parity.io)
### Networking
_Method_: @tomaka copy-pasted `sc-network`, removed `sync.rs`, `light_client_handler.rs`, `light_dispatch.rs`, `blocks_request.rs`, `on_demand_layer.rs` (because they are not easily extractable), and hacked the content of `protocol.rs` to just send back a handshake. The resulting binary connected to Kusama.
_Result_: Around a constant 40MB memory usage over an hour with ~80 connections.
#### Peerset Manager
Known leak because we never clean up the entries of the peerset. An entry is added every time we discover a node through the discovery process. Each entry, however, should be only around ~35 bytes. Being fixed in https://github.com/paritytech/substrate/pull/5108
#### Network Gossip
A [code review uncovered a minor leak](https://github.com/paritytech/substrate/pull/5104): the internal state of `peerId`s and the `messages` sent to them wasn't cleared on a peer disconnect. This is probably not significant enough to cause the known gigabyte-scale leaks, in particular because it only happens in an unstable network with a lot of constantly changing peerIds.
A [branch with gossip disabled](https://github.com/paritytech/substrate/compare/ben-memleak-disable-gossip?expand=1) shows an [even faster increase in memory usage](https://matrix.to/#/!aenJixaHcSKbJOWxYk:matrix.parity.io/$1583184496102364NChBP:matrix.parity.io?via=matrix.parity.io).