# Evaluation of Parallel Transaction Cancellation in Besu
**PR under test:** [https://github.com/hyperledger/besu/pull/9360](https://github.com/hyperledger/besu/pull/9360)
## TL;DR
Across ~18k blocks, most per-block deltas fall within ±10% and are dominated by noise. Focusing on slower blocks and on blocks involving **known heavy contracts**, the cancellation fix shows **neutral to mildly positive** impact, including a **−38%** outlier on a >2s block and a **−30%** win on a slow block from another contract. No systematic regressions were observed in the slow-block cuts.
---
## What’s being tested (context)
Besu executes blocks in **parallel and serial** concurrently; whichever finishes first “wins.” When serial catches up first, the parallel path should be cancelled. This PR fixes tracking of parallel `CompletableFuture`s so the losing parallel work can be **cancelled promptly**.
**Hypothesis:** Correct cancellation reduces wasted work on heavy, state-touching blocks without hurting others. The effect may be **data dependent**: not every transaction from the same contract is heavy.
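The race-and-cancel pattern described above can be sketched as a toy Python model (names like `execute_block`, `race`, and the cooperative `Event` flag are illustrative assumptions, not Besu's actual Java implementation):

```python
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def execute_block(name, steps, cancel):
    """Simulated block execution: `steps` units of work, with a
    cooperative cancellation check between units."""
    for _ in range(steps):
        if cancel.is_set():
            return name, "cancelled"
        time.sleep(0.005)  # one unit of work
    return name, "finished"

def race(parallel_steps, serial_steps):
    """Run both paths concurrently; whichever finishes first wins,
    and the loser is signalled to stop (the behavior PR #9360 fixes)."""
    cancels = {"parallel": threading.Event(), "serial": threading.Event()}
    with ThreadPoolExecutor(max_workers=2) as pool:
        futs = [pool.submit(execute_block, n, s, cancels[n])
                for n, s in (("parallel", parallel_steps), ("serial", serial_steps))]
        done, pending = wait(futs, return_when=FIRST_COMPLETED)
        winner, _ = done.pop().result()
        loser = "serial" if winner == "parallel" else "parallel"
        cancels[loser].set()  # stop the losing path promptly
        loser_status = next(iter(pending)).result()[1] if pending else "finished"
    return winner, loser_status
```

With `race(parallel_steps=400, serial_steps=2)` the serial path finishes first and the parallel worker exits at its next cancellation check instead of grinding through all 400 units of wasted work.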
---
## Methodology (brief)
* Two runs on identical chain data:
* **Control:** current behavior
* **Test:** PR #9360 (parallel-tx cancellation fix)
* Parsed per-block execution time from logs (`Xms exec` / `Ys exec`), normalized to **milliseconds**, and matched by **block number**.
* Reported per-block deltas:
**Δ(ms)** and **Δ(%) = (test − control)/control × 100**; flagged when |Δ%| ≥ 10%.
* Filters:
* `--min-exec-millis` to focus on slow blocks.
* `--blocks-csv` to restrict to **specific contracts** known to often produce slow blocks:
* `MCT-XEN_Batch_Minter` (0x2f84…0479)
* `MCT_MXENFT_Token` (0x0000…f20b)
* `CoinTool_XEN_Batch_Minter` (0x0de8…f628)
* `XEN_Minter` (0xc3c7…226f)
> Sign convention: **Δ% > 0** = test slower; **Δ% < 0** = test faster.
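A minimal sketch of the comparison logic (the real `compare_besu_exec.py` may differ; the `deltas` helper and its dict-based inputs are assumptions), matching by block number and applying the sign convention above:

```python
def deltas(control, test, threshold=10.0):
    """control/test map block number -> execution time in milliseconds.
    Returns (block, control_ms, test_ms, delta_ms, delta_pct, flagged)
    rows for blocks present in both runs; delta_pct > 0 means the
    test run was slower."""
    rows = []
    for block in sorted(control.keys() & test.keys()):  # match by block number
        c, t = control[block], test[block]
        d_ms = t - c
        d_pct = d_ms / c * 100.0
        rows.append((block, c, t, d_ms, d_pct, abs(d_pct) >= threshold))
    return rows
```

Feeding in the ≥2s example below reproduces its row: block 23647133 with control 2386.0 ms and test 1478.0 ms gives Δ = −908.0 ms, Δ% ≈ −38.1, flagged.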
---
## Results
### A) All blocks (no CSV filter)
| Filter | Compared | Flagged (≥10%) | Mean Δ% | Median Δ% |
| ------------------------ | -------: | -------------: | ---------: | ---------: |
| none | 17,937 | 8,414 | **+2.17%** | **+1.90%** |
| `--min-exec-millis 1000` | 165 | 11 | **−0.64%** | **−0.32%** |
| `--min-exec-millis 2000` | 9 | 1 | **−3.58%** | **+0.79%** |
**Notable ≥2s example**
```
Block Control(ms) Test(ms) Δ(ms) Δ(%) Flag
-----------------------------------------------------------
23647133 2386.0 1478.0 -908.0 -38.1 ***
```
### B) Contract-focused cuts (each CSV run separately)
**MCT-XEN Batch Minter ([0x2f84…0479](https://etherscan.io/address/0x2f848984984d6c3c036174ce627703edaf780479))**
| Filter | Compared | Flagged | Mean Δ% | Median Δ% |
| ------------------------ | -------: | ------: | ---------: | ---------: |
| none | 32 | 7 | **−2.84%** | **−1.86%** |
| `--min-exec-millis 1000` | 10 | 0 | **−0.88%** | **−0.39%** |
| `--min-exec-millis 2000` | 3 | 0 | **+0.85%** | **+1.90%** |
**MCT_MXENFT Token ([0x0000…f20b](https://etherscan.io/address/0x0000000000771a79d0fc7f3b7fe270eb4498f20b))**
| Filter | Compared | Flagged | Mean Δ% | Median Δ% |
| ------------------------ | -------: | ------: | ---------: | ---------: |
| none | 382 | 163 | **−0.39%** | **+0.57%** |
| `--min-exec-millis 1000` | 3 | 0 | **+0.33%** | **−0.80%** |
**CoinTool XEN Batch Minter ([0x0de8…f628](https://etherscan.io/address/0x0de8bf93da2f7eecb3d9169422413a9bef4ef628))**
| Filter | Compared | Flagged | Mean Δ% | Median Δ% |
| ------------------------ | -------: | ------: | ---------: | ---------: |
| none | 2,783 | 1,285 | **+1.36%** | **+2.13%** |
| `--min-exec-millis 2000` | 4 | 1 | **−8.68%** | **+0.03%** |
**XEN Minter ([0xc3c7…c226f](https://etherscan.io/address/0xc3c7b049678d84081dfd0ba21a6c7fdcc31c226f))**
| Filter | Compared | Flagged | Mean Δ% | Median Δ% |
| ------------------------ | -------: | ------: | ----------: | ----------: |
| none | 116 | 63 | **−3.28%** | **−2.14%** |
| `--min-exec-millis 1000` | 1 | 1 | **−30.22%** | **−30.22%** |
> Each contract cut reinforces the trend: **neutral to mildly positive overall**, with **material wins** on some slow blocks (e.g., ~−30%).
---
## Interpretation
* **Noise dominates short blocks** (small positive median when unfiltered).
* **Slow-block focus shows benefit**: filtering to blocks ≥1 s or ≥2 s shifts the mean Δ% negative (test faster).
* **Per-contract**: effects are **data dependent**—not all txs from a “heavy” contract are heavy every time. Still, we see solid wins (–38%, –30%) on some slow blocks, consistent with the idea that **stopping the losing parallel work** helps the serial winner finish earlier.
---
## Could cancellation harm performance?
Potential downsides of cancelling Java `CompletableFuture`-based work (in general, and how they might show up in Besu):
* **Cancellation ≠ immediate stop:** `CompletableFuture.cancel()` marks the stage cancelled, but **doesn’t forcibly stop** the underlying work (the `mayInterruptIfRunning` argument is ignored; interrupts are not used) unless the code cooperatively checks for cancellation. If low-level calls (e.g., DB/IO) aren’t interruptible, some **work still runs**, muting the benefit.
* **Extra signalling & exception paths:** Cancellation completes futures **exceptionally** (`CancellationException`) and propagates through dependent stages. If this happens frequently, you pay for **additional atomic state changes** and **exceptional-control-flow overhead** (especially if stack traces are logged).
* **ForkJoin/Executor churn:** Launching many tasks that are then cancelled can create **queue churn** and **lost locality** (warm caches discarded), with small overheads that add up if the cancel rate is high.
* **Locks & resources:** If a task is cancelled while holding **locks** or native resources (e.g., RocksDB iterators), incorrect cleanup can cause **longer lock holds**, **contention**, or **leaks**. Cooperative checks should happen at **safe points** (between critical sections) and always release resources.
* **GC pressure:** Aborted partial results and short-lived objects can increase **allocation churn**, nudging GC overhead upward in some phases.
* **Interruption handling pitfalls:** If interruption is used, **interrupt status leaks** or swallowed interrupts can lead to subtle bugs. Besu code must handle `Thread.interrupted()` carefully and keep invariants intact on early exit.
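The first bullet is easy to demonstrate. This Python `concurrent.futures` sketch is only an analogue of the Java behavior (Java's `CompletableFuture.cancel()` similarly marks the stage cancelled without stopping in-flight work):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

started = threading.Event()
finished = threading.Event()

def busy_task():
    started.set()
    time.sleep(0.2)  # stand-in for non-interruptible work (e.g. blocking I/O)
    finished.set()
    return "completed anyway"

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(busy_task)
    started.wait()                # the task is now running
    was_cancelled = fut.cancel()  # too late: a running task is not cancellable
    result = fut.result()         # the work runs to completion regardless
```

Here `was_cancelled` is `False` and the task completes anyway: without a cooperative check inside `busy_task`, cancellation saves nothing.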
**Why we likely don’t see regressions here:**
The observed distributions are near neutral, with improvements concentrated in the slow tail, suggesting cancellation is generally **helpful or harmless** in this workload. Any overhead from signalling/cascade appears **smaller than** the saved work on slow blocks.
**What to validate next (low effort, high signal):**
* Emit counters/timers for: *parallel started / cancelled / actually aborted early*, and *who won (serial vs parallel)* per block.
* Sample **CPU utilization** and **GC logs** to ensure cancellation reduces CPU time when serial wins.
* Add a **percentile view (P50/P90/P99)** for blocks ≥1s and ≥2s, plus histograms of Δ%.
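For the percentile view, a minimal sketch (nearest-rank percentiles over per-block Δ% values; `delta_pct_summary` is a hypothetical helper, not part of the existing script):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the
    data at or below it (p in (0, 100])."""
    s = sorted(values)
    if not s:
        raise ValueError("no data")
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def delta_pct_summary(delta_pcts):
    """P50/P90/P99 of per-block delta-percent values, e.g. after
    restricting to blocks with control exec time >= 1 s or >= 2 s."""
    return {f"P{p}": percentile(delta_pcts, p) for p in (50, 90, 99)}
```

Run once per cut (≥1 s, ≥2 s) to see whether the improvement is concentrated in the slow tail rather than spread thinly across all blocks.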
---
## Repro (commands)
```bash
# All blocks
./compare_besu_exec.py control-besu.log test-cancellation-besu.log
# Slow blocks
./compare_besu_exec.py control-besu.log test-cancellation-besu.log --min-exec-millis 1000
./compare_besu_exec.py control-besu.log test-cancellation-besu.log --min-exec-millis 2000
# Contract-specific (each CSV run separately)
./compare_besu_exec.py control-besu.log test-cancellation-besu.log \
--blocks-csv 0x2f848984984d6c3c036174ce627703edaf780479_MCT-XEN_Batch_Minter_export-transaction-list-1761523431099.csv
./compare_besu_exec.py control-besu.log test-cancellation-besu.log \
--blocks-csv 0x0000000000771a79d0fc7f3b7fe270eb4498f20b_MCT_MXENFT_Token_export-transaction-list.csv
./compare_besu_exec.py control-besu.log test-cancellation-besu.log \
--blocks-csv export-0x0de8bf93da2f7eecb3d9169422413a9bef4ef628_CoinTool_XEN_Batch_Minter.csv
./compare_besu_exec.py control-besu.log test-cancellation-besu.log \
--blocks-csv export-0xc3c7b049678d84081dfd0ba21a6c7fdcc31c226f_XEN_Minter.csv
# Optional filters
--min-exec-millis <ms> # e.g., 1000 or 2000
--threshold 10 # flag when |Δ%| ≥ 10
--csv results.csv # write detailed rows
```
---