Shadow support for Polkadot-SDK

# Shadow support for Polkadot-SDK research

## Summary

The “Shadow support” work investigated the ability to execute **polkadot-sdk** networks inside the **Shadow** discrete-event network simulator. With the Polkadot-SDK modifications described below, we enabled deterministic and reproducible simulation of a Polkadot network on a single Linux machine, while still running (largely) real node binaries rather than a simplified model.

The necessary Polkadot-SDK modifications are delivered via the PR:

- **PR:** Shadow support
- **URL:** https://github.com/paritytech/polkadot-sdk/pull/10055
- **Branch:** `xDimon:shadow_support` (referenced in instructions)

Shadow itself is an event-driven network simulator that “directly executes real application code”, allowing large private-network experiments with realistic network conditions (latency, bandwidth, packet loss) without requiring real distributed infrastructure.

- Shadow project: https://github.com/shadow/shadow

## What was implemented / delivered

### 1) Ability to run Polkadot networks under Shadow

The core outcome is that a Polkadot network (validators, collators, subxt-client) can be launched as a Shadow simulation, where:

- nodes run as Shadow-managed processes,
- system calls are intercepted by Shadow,
- network conditions are controlled by the simulation configuration,
- the experiment becomes deterministic when a fixed seed is used.

### 2) Tooling to generate a Shadow-compatible network configuration

A helper script (`network_builder.sh`) is provided to:

- generate chainspecs/genesis configuration for a requested topology,
- build any required binaries with “Shadow compatibility” enabled (via the `x-shadow` feature),
- emit a ready-to-run Shadow configuration (`shadow.yaml`) plus the directory layout Shadow expects.

This reduces the effort from “manual per-node setup” to a single command that outputs a runnable simulation bundle.
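One way to check the determinism property from item 1 is to run the same configuration twice with the same seed and compare what the simulated processes wrote. The following is a minimal sketch, not part of the PR: the function names `tree_digest` and `same_output` are hypothetical, and it assumes that each run's output directory has been saved aside and that the per-process output files contain no host-specific content.

```shell
# Sketch only: compare two saved output trees file by file.
# Assumes GNU coreutils (sha256sum) on Linux.

# Print "<sha256>  <relative path>" for every regular file under $1,
# sorted by path so the listing does not depend on directory order.
tree_digest() {
  ( cd "$1" && find . -type f -exec sha256sum {} + | LC_ALL=C sort -k 2 )
}

# Exit 0 only if both trees contain the same files with the same contents.
same_output() {
  a=$(tree_digest "$1") || return 2
  b=$(tree_digest "$2") || return 2
  [ "$a" = "$b" ]
}
```

For example, after two runs with the same seed whose `shadow.data` directories were moved to the hypothetical paths `run-a` and `run-b`, `same_output run-a/hosts run-b/hosts && echo identical` would report whether the per-host outputs match. Comparing only the `hosts` subdirectory is a design choice here, on the assumption that other files in the data directory may record host-dependent statistics.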
## Polkadot-SDK modifications for Shadow support

To run Polkadot nodes under Shadow, several modifications were necessary to ensure compatibility with Shadow’s execution model and to allow deterministic behavior. The changes are implemented behind a single compile-time feature flag, `x-shadow`, so that non-simulation builds remain unaffected. These changes are not meant for production; they are acceptance-test/simulation-only modifications.

The key modifications include:

* Disable jemalloc under `x-shadow` builds (jemalloc’s internal locking patterns can deadlock under Shadow).
* Replace Unix-domain-socket (UDS) based IPC between the relay node and PVF workers with TCP sockets for `x-shadow` builds (Shadow does not support UDS).

## How to use (test procedure)

Below is the documented procedure to launch a Polkadot network under Shadow.

### Prerequisites

- Linux machine (Shadow targets Linux)
- Shadow installed per its README:
  - https://github.com/shadow/shadow

### Minimum example: dev network

1. Install Shadow:
   - Follow Shadow’s README: https://github.com/shadow/shadow
2. Check out the Polkadot-SDK branch that includes Shadow support:
   - https://github.com/xDimon/polkadot-sdk/tree/shadow_support
3. Build Polkadot-SDK with Shadow support:
   ```bash
   cargo build --release --features x-shadow
   ```
4. Create `shadow.yaml` for a minimal dev network:
   ```yaml
   general:
     stop_time: 100s
     model_unblocked_syscall_latency: true
   network:
     graph:
       type: 1_gbit_switch
   hosts:
     server:
       network_node_id: 0
       processes:
         - path: /path/to/polkadot-sdk/target/release/polkadot
           args: --dev
           start_time: 0s
           expected_final_state: running
   ```
5. Run the Shadow simulation:
   ```bash
   shadow --seed 42 --progress true --parallelism $(nproc) /path/to/shadow.yaml > shadow.log
   ```
6. Inspect logs:
   - Node logs appear under the `shadow.data` directory created by Shadow during the run.
   - Note that the hashes of the produced blocks are determined by the seed: runs with the same seed produce the same hashes.
7. Cleanup between runs:
   - Remove `shadow.data` before subsequent runs to avoid mixing outputs:
   ```bash
   rm -rf shadow.data
   ```

### Network example: 8 RC validators, 1 parachain, 4 collators

Follow steps 1 to 3 above, then replace steps 4 and 5 with the following, which generate a more complex network using the `network_builder.sh` script. Steps 6 and 7 (log inspection and cleanup) apply unchanged.

1. Fetch the `network_builder.sh` script:
   ```bash
   cd scripts
   wget https://raw.githubusercontent.com/xDimon/polkadot-sdk/refs/heads/shadow_support_squash/scripts/network_builder.sh
   cd ..
   ```
2. Generate the simulation bundle (chainspec + binaries + `shadow.yaml`):
   - Example: **8 RC validators**, **1 parachain**, **4 collators**
   - Recommended: remove `target/` before the first run to avoid mixing incompatible build artifacts.
   ```bash
   HOST_BW="200 Mbit" SHADOW_LATENCY="75 ms" ./scripts/network_builder.sh 8 1 4 /tmp/shadow-net
   ```
3. Run the Shadow simulation:
   ```bash
   shadow --seed 42 --progress true --parallelism $(nproc) /tmp/shadow-net/shadow.yaml > shadow.log
   ```

## Benefits of Shadow support

### 1) Deterministic, reproducible networking experiments

With Shadow, the network is simulated deterministically under a fixed seed. This is particularly valuable for Polkadot-SDK because:

- networking and timing issues are often nondeterministic in real deployments,
- reproducing “rare” behaviors becomes feasible by pinning the seed and configuration,
- flaky tests can be stabilized into reliable scenarios,
- debugging becomes faster and more reliable.

### 2) Realistic network conditions without physical infrastructure

Shadow allows injecting controlled network characteristics (latency, bandwidth, etc.) to study:

- block production and finality behavior under constrained links,
- gossip and propagation dynamics,
- performance of collators and validators under bandwidth pressure,
- censoring of certain nodes (e.g. when malicious collators are censoring an honest collator).

This can replace (or greatly reduce) the need for expensive multi-machine testbeds or cloud deployments for many classes of experiments.

## Case study: collator isolation

We used the new Shadow support to simulate a specific attack vector targeting the Polkadot availability layer.

### The Attack Scenario

**Setup**: A parachain with 4 collators (**A**, **B**, **C**, and **D**) assigned to consecutive block production slots.

- **Collator B** is the honest victim.
- **Collators A, C and D** are colluding attackers.

**Mechanism**:

1. **Collator A** produces a block and distributes it to the backing validators but selectively **withholds** the block data from **Collator B**.
2. **Collator B**, unable to import A's block (the parent it should build on, whose data is missing), cannot immediately build its own block on top of it. It must attempt to recover the data from the availability layer.
3. **Collator C** (the next attacker) produces its block. Because B was delayed waiting for data, C can submit its block to the network before B recovers the data and produces its own.
4. **Result**: B's slot is effectively skipped or "censored," and the chain progresses from A directly to C.

### Simulation Implementation

Using the `network_builder.sh` script, we simulated this network topology inside Shadow:

- **Topology**: 4 Collators, 8 Relay Chain Validators.
- **Isolation**: We used the `ISOLATE_COLLATOR_IDX` parameter to simulate the network censorship.
  - This sets `packet_loss: 1.0` (100% loss) for P2P connections between the victim Collator B and its peers (A, C and D).
  - Connections to Validators were left intact (to allow backing), so that only the "selective withholding" behavior between collators is simulated.
- CLI command to generate the setup:
  ```bash
  HOST_BW="200 Mbit" SHADOW_LATENCY="75 ms" ISOLATE_COLLATOR_IDX=2 ./scripts/network_builder.sh 8 1 4 /tmp/shadow-net
  ```

### Results

The Shadow simulation demonstrated the efficacy of the attack in a deterministic environment:

- **Observation**: The block production rate of the isolated honest collator B (collator-2000-2) dropped significantly compared to the non-isolated baseline.
- **Root Cause**: The latency introduced by the need to fetch missing data via alternative paths (availability recovery) caused B to miss its strict slot deadlines.

![image](https://hackmd.io/_uploads/HJO0giLQ-x.png)

```
============================================================
BLOCK PRODUCTION AND FINALIZATION BY COLLATOR
============================================================
collator-2000-1:
  Total blocks produced: 98
  Blocks finalized: 41 (41.8%)
  Blocks NOT finalized: 57

collator-2000-2:
  Total blocks produced: 14
  Blocks finalized: 11 (78.6%)
  Blocks NOT finalized: 3

collator-2000-3:
  Total blocks produced: 104
  Blocks finalized: 41 (39.4%)
  Blocks NOT finalized: 63

collator-2000-4:
  Total blocks produced: 100
  Blocks finalized: 54 (54.0%)
  Blocks NOT finalized: 46
```

### Mitigation Verification

Using the same simulation setup, we tested a mitigation strategy:

- **Fix**: Increasing the **slot duration** configuration in the chainspec for the parachain from the default (6s) to a longer duration (18s).
- **Outcome**: With longer slot times, the honest collator (B) had sufficient time to recover the withheld data from the availability layer and produce a valid block before the slot deadline expired.
- **Conclusion**: Shadow allowed us to verify that extending slot duration effectively neutralizes this specific censorship vector without requiring a complex physical testbed.
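The per-collator totals in these reports are easier to compare between the attack run above and the mitigation run below when expressed as shares of all parachain blocks produced. The following is a minimal sketch, assuming the report has been saved as text with one field per line in the layout shown in this case study; the function name `block_shares` is hypothetical and not part of `network_builder.sh`.

```shell
# Sketch only: read a "BLOCK PRODUCTION AND FINALIZATION BY COLLATOR" report
# on stdin and print "<collator> <blocks produced> <share of all blocks>".
block_shares() {
  awk '
    # A line such as "collator-2000-1:" starts a new collator entry.
    /^[ \t]*collator-[^ \t]*:[ \t]*$/ {
      name = $1
      sub(/:$/, "", name)
      order[++n] = name
      next
    }
    # The last field of this line is the number of blocks produced.
    /Total blocks produced:/ {
      produced[name] = $NF
      total += $NF
    }
    END {
      if (total == 0) exit
      for (i = 1; i <= n; i++)
        printf "%s %d %.1f%%\n", order[i], produced[order[i]], 100 * produced[order[i]] / total
    }
  '
}
```

Applied to the attack-run report, collator-2000-2 accounts for 14 of 316 blocks, about 4.4%. In the mitigation-run report that follows it accounts for 28 of 129 blocks, about 21.7%.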
![image](https://hackmd.io/_uploads/r1eJZsUX-e.png)

```
============================================================
BLOCK PRODUCTION AND FINALIZATION BY COLLATOR
============================================================
collator-2000-1:
  Total blocks produced: 41
  Blocks finalized: 21 (51.2%)
  Blocks NOT finalized: 20

collator-2000-2:
  Total blocks produced: 28
  Blocks finalized: 22 (78.6%)
  Blocks NOT finalized: 6

collator-2000-3:
  Total blocks produced: 27
  Blocks finalized: 21 (77.8%)
  Blocks NOT finalized: 6

collator-2000-4:
  Total blocks produced: 33
  Blocks finalized: 21 (63.6%)
  Blocks NOT finalized: 12
```

## Future work

### CI integration with x-shadow builds

Set up CI pipelines that build Polkadot-SDK with the `x-shadow` feature to ensure continued compatibility and catch regressions early.

### Zombienet-sdk integration

Integrate Shadow support into the zombienet-sdk framework so that users can easily launch Shadow-based simulations as part of existing zombienet workflows, enabling more deterministic network testing scenarios and avoiding flaky tests.

## Conclusion

Shadow support provides a practical, repeatable, and scalable way to run Polkadot-SDK networks under a simulated but realistic network environment. It improves the developer workflow for network-centric testing and debugging, makes regressions easier to reproduce, and enables larger experiments on local hardware than traditional integration setups.

**References**

- PR: https://github.com/paritytech/polkadot-sdk/pull/10055
- Tracking issue: https://github.com/paritytech/polkadot-sdk/issues/9748
- Shadow: https://github.com/shadow/shadow