# Zcash Manhattan Project: Throughput Improvements
<i>Note: The design is based on the [Zebra (Rust) full node](https://github.com/ZcashFoundation/zebra). This document is a WIP.</i>
### Scope
<u>Phase 1: Core Throughput Enhancements</u>
- Reduce block time to 15 seconds
- Implement legacy transaction limits to reserve capacity for Tachyon-optimized transactions (WIP)
<u>Phase 2: Fee Market Optimization</u>
- Dynamic fee market implementation (future scope)
- Enhanced transaction prioritization mechanisms (future scope)
## Phase 1 Technical Design
### 1. Block Time Reduction (75s to 15s)
Objective: Reduce block production interval from the current post-Blossom 75 seconds to 15 seconds, achieving ~5x throughput increase in block production rate.
#### Implications
- Difficulty: with the same algorithm, the difficulty adjustment would react ~5x faster in wall-clock time, which helps absorb hashrate shocks, but it also amplifies noise, increases the risk of difficulty oscillations, and introduces timestamp-game risk. Carefully considering the averaging window and damping config becomes critical.
- Propagation/orphans: Orphan rate is a function of propagation delay / block time. At 75s, we sit comfortably outside the danger zone, but at 15s, propagation delay becomes a much larger proportion of the block time, so propagation must be optimised to keep the same orphan/stale rate as current mainnet.
- State growth & witnesses: Pre-Tachyon, faster blocks don't just mean more headers; they mean 5x more nullifier growth and witness updates, which is a direct tax on the wallet side.
#### 1.1. Key Parameters Modified
Block Timing Constants
- POST_BLOSSOM_POW_TARGET_SPACING: Change from 75 to 15 seconds. This is the target time between blocks that the difficulty adjustment algorithm tries to maintain: if a block arrives faster than this target -> increase difficulty; if it arrives slower -> decrease difficulty (https://github.com/ZcashFoundation/zebra/blob/main/zebra-chain/src/parameters/network_upgrade.rs)
Economic Model Adjustment
- BLOSSOM_POW_TARGET_SPACING_RATIO: Needs adjusting to maintain the halving schedule. This is the ratio of pre-Blossom to post-Blossom block times, used to adjust economic parameters (block rewards, halving schedules) when the block time changes. It directly dictates the emission rate.
- Current value: 2 (150s / 75s)
- Proposed value: 10 (150s / 15s), if we follow the same pattern as Blossom and want to preserve halving cadence in wall-clock time. This would need to be paired with appropriate block-subsidy/emission adjustments, similar in spirit to ZIP 208
- INVENTORY_ROTATION_INTERVAL: Currently 53s; must be < the 75s block time. With 15s blocks, this likely needs reducing to 11s or 13s (primes, like 53, which avoid synchronisation problems). This dictates how often the protocol rotates its inventory of which peers have which blocks/transactions, similar to a cache-expiration policy for peer state. (The proposed constant deltas are sketched below.)
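To make the proposed deltas concrete, here is a minimal sketch using the constant names above. The exact types and locations are assumptions pending review of the Zebra source, and the new values are this proposal's, not merged code:
```rust
use std::time::Duration;

// Proposed values; the 75s / 2 / 53s figures are the current ones cited above.
pub const POST_BLOSSOM_POW_TARGET_SPACING: u32 = 15; // was 75 (seconds)
pub const BLOSSOM_POW_TARGET_SPACING_RATIO: u64 = 10; // was 2 (150s / 15s)
pub const INVENTORY_ROTATION_INTERVAL: Duration = Duration::from_secs(13); // was 53s; a prime < 15s
```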
Network Synchronization Parameters:
- SYNC_RESTART_DELAY: Reduce from 67s to 11s-13s, and reduce BLOCK_DOWNLOAD_TIMEOUT proportionally. This dictates the cool-down period between sync attempts when Zebra finishes downloading a batch of blocks or needs to restart the sync process.
- PEER_GOSSIP_DELAY: Needs to be reduced from 7s to something more appropriate (1-2s?). This constant dictates the throttle delay between gossip messages to peers: when Zebra learns about a new block/transaction, it waits this long before gossiping it to peers.
- MIN_PEER_RECONNECTION_DELAY: May need to be adjusted
- REQUEST_TIMEOUT: Needs to be adjusted
- Timestamp validation rules: Need to be reviewed. Current rule allows blocks up to 2 hours ahead. At 15s blocks, that's 480 blocks of slack (vs current 96 blocks)
#### 1.2. Difficulty Adjustment Impact
A consequence of a smaller block time is that the difficulty adjustment algorithm will respond 5x faster to hashrate changes. Zcash currently uses a moving-average difficulty adjustment with median-time-past that updates every block. Below is the formula:
1. Calculate the mean target of the last 17 blocks: mean_target = sum(last_17_targets) / 17
2. Calculate the actual timespan using median time past: actual_timespan = median(newest_11_times) - median(oldest_11_times)
3. Apply damping to reduce oscillations: damped_timespan = target_timespan + (actual_timespan - target_timespan) / 4
4. Bound the adjustment: bounded_timespan = clamp(damped_timespan, target_timespan * 0.84, target_timespan * 1.32) [per the max +16% / -32% difficulty adjustment constants]
5. Calculate the new target: new_target = mean_target * bounded_timespan / target_timespan
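For reference, a minimal sketch of the above in Rust, in floating point for illustration only (Zebra's consensus code operates on expanded 256-bit targets with integer arithmetic):
```rust
// Illustrative only: real consensus code uses integer math on 256-bit targets.
fn median11(times: &[i64; 11]) -> i64 {
    let mut v = *times;
    v.sort_unstable();
    v[5]
}

fn next_target(
    last_17_targets: &[f64; 17],
    newest_11_times: &[i64; 11],
    oldest_11_times: &[i64; 11],
    target_spacing: f64, // 75.0 today, 15.0 proposed
) -> f64 {
    let mean_target: f64 = last_17_targets.iter().sum::<f64>() / 17.0;
    let target_timespan = 17.0 * target_spacing;
    let actual_timespan = (median11(newest_11_times) - median11(oldest_11_times)) as f64;
    // Damping: move only 1/4 of the way toward the observed timespan.
    let damped = target_timespan + (actual_timespan - target_timespan) / 4.0;
    // Bound per POW_MAX_ADJUST_UP/DOWN: timespan clamped to [0.84, 1.32] of target.
    let bounded = damped.clamp(target_timespan * 0.84, target_timespan * 1.32);
    // A larger target means lower difficulty.
    mean_target * bounded / target_timespan
}
```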
In the current state, reducing the block time without changing difficulty adjustment parameters poses a high oscillation risk (adjustment window ~21 min -> ~4.25 min). Due to the faster feedback loop, there is a higher chance of over-correction, and vice versa: hashrate spike -> difficulty increases too much -> blocks slow down -> difficulty drops too much -> blocks speed up -> repeat.
Existing difficulty parameters can remain unchanged, but damping needs to be considered along with the floors and ceilings:
```rust
pub const POW_AVERAGING_WINDOW: usize = 17; // Difficulty averaging window
pub const POW_MEDIAN_BLOCK_SPAN: usize = 11; // Timestamp median window
pub const POW_ADJUSTMENT_BLOCK_SPAN: usize = 28; // 17 + 11 (total context needed)
pub const POW_DAMPING_FACTOR: i32 = 4; // Reduces volatility
pub const POW_MAX_ADJUST_UP_PERCENT: i32 = 16; // Max +16% per block
pub const POW_MAX_ADJUST_DOWN_PERCENT: i32 = 32; // Max -32% per block
```
However, the shorter wall-clock adjustment window implied by the new block time has various pros and cons:
| **Benefits** | **Drawbacks** |
|--------------|---------------|
| Faster response to hashrate changes | More sensitive to short-term fluctuations |
| Better stability under volatile hashrate: can adjust in ~4.25 min instead of ~21 min | Amplifies oscillations; e.g. oscillations with a ~5 min period will be fully captured |
| Faster recovery from attacks (e.g. selfish mining attacks) | More vulnerable to timestamp manipulation |
We should give serious consideration to hashrate-oscillation amplification. An alternative is to scale POW_AVERAGING_WINDOW to maintain the same wall-clock adjustment period, but this needs more analysis. One lever is to keep the 17-block window and optionally expand it if we observe oscillations.
We can also simulate system behaviour on a virtual net to see how these parameter tweaks might respond.
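Before building out a virtual net, a toy closed-loop simulation can give a first read on oscillation behaviour. The sketch below assumes a single step hashrate shock and exponentially distributed solve times, and substitutes the sum of recent solve times for the real algorithm's median-time-past timespan; the `rand` crate is an assumed dependency:
```rust
use rand::Rng; // assumed dependency: rand = "0.8"
use std::collections::VecDeque;

fn main() {
    let mut rng = rand::thread_rng();
    let target_spacing = 15.0_f64; // seconds
    let (window, damping) = (17usize, 4.0_f64);
    let mut difficulty = 1.0_f64; // arbitrary units
    let mut hashrate = 1.0 / target_spacing; // expected solve time = difficulty / hashrate = 15s
    let mut solve_times: VecDeque<f64> = vec![target_spacing; window].into();
    for block in 0..2_000 {
        if block == 1_000 {
            hashrate *= 2.0; // step hashrate shock
        }
        // Exponentially distributed solve time with mean difficulty / hashrate.
        let u: f64 = rng.gen_range(f64::MIN_POSITIVE..1.0);
        let solve = -u.ln() * difficulty / hashrate;
        solve_times.pop_front();
        solve_times.push_back(solve);
        // Simplification: sum of recent solve times stands in for the
        // median-time-past `actual_timespan` of the real algorithm.
        let actual: f64 = solve_times.iter().sum();
        let target = window as f64 * target_spacing;
        let damped = target + (actual - target) / damping;
        let bounded = damped.clamp(target * 0.84, target * 1.32);
        difficulty *= target / bounded; // fast blocks -> higher difficulty
        if block % 200 == 0 {
            println!("block {block}: solve {solve:5.1}s difficulty {difficulty:.3}");
        }
    }
}
```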
#### 1.3 Stale Rate + Network Propagation Impact
At 75s blocks, assume a propagation delay of ~1-2s (propagation delay here meaning the time it takes for 95% of nodes to have seen a new block through gossip; we take 2s).
At the assumed propagation time (benchmarks yet to be computed), we observe a stale rate of ~0.4% (computed by adding custom metrics to Zebra), so propagation is roughly 2.6% of the block time. Since the stale rate is a function of block time and propagation delay, at a 15s block time, without reducing propagation delay proportionally, the stale rate will likely spike.
A simple illustration:
```
Let x = typical end-to-end block propagation time (seconds)
Let T = block interval (seconds)
If x << T, the stale/orphan rate is low; as x approaches T, it rises.
To first order, stale_prob ∝ x / T
```
So going from a 75s to a 15s block time, with block propagation time held constant, the stale probability increases by ~5x: from ~0.4% to ~2%. A 2% stale rate is very high, so block propagation time must be optimised; Tachyon aims to help with this. The question is, to what degree?
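To put a first number on "to what degree", here is a quick scaling of today's measured ~0.4% stale rate (at ~2s propagation, 75s blocks) through the stale_prob ∝ x/T relation above. This is an extrapolation, not a measurement:
```rust
fn main() {
    // Calibrate the proportionality constant from today's observation.
    let (base_stale_pct, base_x, base_t) = (0.4_f64, 2.0_f64, 75.0_f64);
    let k = base_stale_pct / (base_x / base_t);
    for x in [0.25_f64, 0.5, 1.0, 2.0] {
        // Projected stale rate at a 15s block time.
        println!("propagation {x:.2}s -> ~{:.2}% stale", k * x / 15.0);
    }
}
```
By this crude model, holding today's ~0.4% stale rate at 15s blocks would require cutting end-to-end propagation to roughly 0.4s.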
On a positive note, Tachyon aims to reduce propagation delay by shifting the transmission of ciphertexts entirely out-of-band, which makes blocks smaller. To understand Tachyon's impact on block propagation, let's look at the current block construction.
The block consists of:
1. Header (~1,487 bytes fixed)
2. Transactions (variable, up to 2 MB total, controlled by MAX_BLOCK_BYTES = 2,000,000 (2 MB) and MAX_BLOCK_SIGOPS = 20,000 (signature operations))
The header consists of:
```rust
version: u32 // 4 bytes
previous_block_hash: Hash // 32 bytes
merkle_root: merkle::Root // 32 bytes Commits the block header to ALL transactions in the block
commitment_bytes: [u8; 32] // 32 bytes (varies by network upgrade)
time: DateTime<Utc> // 4 bytes
difficulty_threshold: CompactDifficulty // 4 bytes
nonce: [u8; 32] // 32 bytes
solution: Solution // ~1,347 bytes (Equihash)
```
For transactions, the block carries a transaction count followed by each transaction's bytes. Transaction creation happens off-chain in wallets; the final serialised transaction is what goes on-chain and is what's referred to here. The serialised V5 transaction looks like so:
```rust
V5 {
network_upgrade: NetworkUpgrade, // Which network upgrade (NU5, NU6, etc)
lock_time: LockTime, // Earliest time/height for inclusion
expiry_height: block::Height, // Latest height for inclusion
inputs: Vec<transparent::Input>, // Transparent inputs
outputs: Vec<transparent::Output>, // Transparent outputs
sapling_shielded_data: Option<sapling::ShieldedData<sapling::SharedAnchor>>,
orchard_shielded_data: Option<orchard::ShieldedData>, // Orchard bundle
}
```
For orchard_shielded_data:
```rust
pub struct ShieldedData {
/// The orchard flags for this transaction.
/// Denoted as `flagsOrchard` in the spec.
pub flags: Flags,
/// The net value of Orchard spends minus outputs.
/// Denoted as `valueBalanceOrchard` in the spec.
pub value_balance: Amount,
/// The shared anchor for all `Spend`s in this transaction.
/// Denoted as `anchorOrchard` in the spec.
pub shared_anchor: tree::Root,
/// The aggregated zk-SNARK proof for all the actions in this transaction.
/// Denoted as `proofsOrchard` in the spec.
pub proof: Halo2Proof,
/// The Orchard Actions, in the order they appear in the transaction.
/// Denoted as `vActionsOrchard` and `vSpendAuthSigsOrchard` in the spec.
pub actions: AtLeastOne<AuthorizedAction>,
/// A signature on the transaction `sighash`.
/// Denoted as `bindingSigOrchard` in the spec.
pub binding_sig: Signature<Binding>,
}
pub struct AuthorizedAction {
/// The action description of this Action.
pub action: Action,
/// The spend signature.
pub spend_auth_sig: Signature<SpendAuth>,
}
pub struct Action {
/// A value commitment to net value of the input note minus the output note
pub cv: commitment::ValueCommitment,
/// The nullifier of the input note being spent.
pub nullifier: note::Nullifier,
/// The randomized validating key for spendAuthSig,
pub rk: reddsa::VerificationKeyBytes<SpendAuth>,
/// The x-coordinate of the note commitment for the output note.
#[serde(with = "serde_helpers::Base")]
pub cm_x: pallas::Base,
/// An encoding of an ephemeral Pallas public key corresponding to the
/// encrypted private key in `out_ciphertext`.
pub ephemeral_key: keys::EphemeralPublicKey,
/// A ciphertext component for the encrypted output note.
pub enc_ciphertext: note::EncryptedNote,
/// A ciphertext component that allows the holder of a full viewing key to
/// recover the recipient diversified transmission key and the ephemeral
/// private key (and therefore the entire note plaintext).
pub out_ciphertext: note::WrappedNoteKey,
}
```
Each action's enc_ciphertext and out_ciphertext account for 580 bytes and 80 bytes respectively.
Currently, up to ~2,262 actions fit in a block in theory (2,000,000 bytes / ~884 bytes per authorized action, i.e. an 820-byte action plus a 64-byte spend-auth signature); the practical maximum (with multiple txs) is ~2,000 actions.
Post-Tachyon, removing both ciphertexts saves 660 of each action's ~820 bytes, at minimum an ~80.5% reduction per action.
Overall saving post-Tachyon (these figures assume an illustrative block of ~100 actions, ~132 KB):
1. Removing ciphertexts: 66,000 bytes saved (~50% of block)
2. Aggregating proofs: 33,000 bytes saved (~25% of block)
3. Total savings: 99,000 bytes (~75% of block)
This dramatically offsets the stale-rate increase because we now have smaller blocks for the same number of txs. That said, to realise these benefits, the tx limit logic would need to be revised (a back-of-envelope check of the per-action numbers follows below).
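The 580/80-byte ciphertext sizes are quoted above, and the ~820-byte action size follows from the Action field sizes (5 x 32-byte fields + 580 + 80):
```rust
// Back-of-envelope check of the per-action savings quoted above.
const ACTION_BYTES: u64 = 820;         // 5 * 32 + 580 + 80
const ENC_CIPHERTEXT_BYTES: u64 = 580; // enc_ciphertext
const OUT_CIPHERTEXT_BYTES: u64 = 80;  // out_ciphertext

fn main() {
    let saved = ENC_CIPHERTEXT_BYTES + OUT_CIPHERTEXT_BYTES; // 660 bytes
    let pct = 100.0 * saved as f64 / ACTION_BYTES as f64;    // ~80.5%
    println!("{saved} bytes saved per action (~{pct:.1}%)");
}
```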
#### 1.4 Timestamp-Based Attack Vectors
Difficulty adjustment needs trustworthy timestamps, but miners can skew block timestamps within the allowed bounds. With shorter block times there are more blocks per hour, so if someone plays aggressive timestamp games, that error enters the difficulty loop in minutes instead of tens of minutes. We need to validate whether the existing median-time logic still gives us enough inertia.
TBD: Do we want to keep the 17-block window and accept a ~4 minute real-time response, or consider a longer averaging window to recover some of the 21-minute inertia?
TBD: Have we profiled how sensitive the current DA is to realistic timestamp distributions at 15s?
TBD: Is there appetite to add safety rails against oscillation (e.g., soft bounds on ActualTimespan) before we ship the new block time?
A couple of good papers on this:
- https://arxiv.org/pdf/2308.15312
- https://arxiv.org/html/2505.05328v1
#### 1.5 Key Risks & Mitigation
- Propagation delay & stale/orphan rate increase. The more often two miners finish work within the same propagation window, the more often they create competing blocks; this effectively wastes hashrate and causes more frequent short-lived forks.
  - Mitigation: see the network-latency item below (relay engineering, push-based propagation) plus the post-Tachyon block-size reduction discussed in §1.3.
- Lower security per confirmation in wall-clock terms. Users/exchanges typically wait N confirmations: 6 confirmations @ 75s represent 7.5 min of work, whereas 6 confirmations @ 15s represent only 1.5 min.
  - Mitigation: recommend a higher confirmation count for the same security (e.g. ~30 confirmations at 15s match the wall-clock work of 6 at 75s).
- The gossip layer must handle 5x the message volume efficiently.
  - Mitigation: gossip-layer stress testing in a controlled environment, with benchmarks.
- Shielded tx proofs take time to generate; will users be able to generate proofs quickly enough to make it into the next block?
  - Mitigation: wallet UX becomes more urgent; requires benchmarking of current proof times to gauge impact.
- 5x more block headers = more write load on RocksDB.
  - Mitigation: analyse RocksDB write performance at 5x the load.
- Pre-Tachyon, nullifiers accumulate 5x faster, the commitment tree grows 5x faster, wallets must update witnesses 5x faster, and the scanning burden increases proportionally. This state-growth impact only matters pre-Tachyon: Tachyon's accumulator + pruning eliminates historical tree bloat entirely.
  - Mitigation: roll this out post-Tachyon. The feasibility of 15s block times dramatically improves after Tachyon Phase 2 (block-level aggregated proof verification); pre-aggregation, user-provided Orchard proofs may exceed the 15s block interval under load.
- Network-level latency issues. The key risk is that propagation delay becomes a considerable percentage of the new block time.
  - Mitigation: relay engineering, push-based propagation.
#### Network Considerations
#### Testing Requirements
Critical test paths for validation:
- Network timing synchronization tests
- Difficulty adjustment algorithm validation at new spacing
- Halving interval calculations
- Block propagation performance benchmarks
- Orphan rate measurements under various network conditions
### 2. Legacy Transaction Limits
#### Objective
Reserve block space capacity for post-Tachyon shielded transactions by constraining legacy transparent and pre-Tachyon transaction types.
#### Transaction Classification
Classification logic must be tied to explicit version bits or a dedicated transaction flag, not heuristic detection of ciphertexts or pool types, because Tachyon modifies Orchard circuits and nullifier formats. Perhaps a Tachyon version marker might be a simple, elegant solution here.
Transactions are classified as legacy if they meet any of the following criteria:
- Fully transparent: Only transparent inputs/outputs, no shielded components
- Old shielded pools: Using Sprout instead of Sapling/Orchard
- In-band encryption: Contains ciphertexts in transaction data (pre-Tachyon model)
Tachyon-Optimized Definition: Transactions using Orchard shielded pool with future support for:
- Proof aggregation markers
- Out-of-band payment indicators
##### Problem statement
Current Zcash protocol has three transaction categories:
- Fully transparent: Bitcoin-style UTXOs with no privacy
- Shielded (Sprout/Sapling/Orchard): Privacy-preserving but with in-band encrypted notes
- Mixed: Combinations of transparent and shielded
Tachyon-optimized transactions will be significantly more efficient due to smaller size (no in-band ciphertexts: out_ciphertext and enc_ciphertext are removed from Orchard actions), lower validation costs (proofs aggregated across multiple transactions), and better scaling due to near-zero state contention through PCD.
After block aggregation is live, the fee market will naturally penalize legacy txs. Hard-coded limits may need to be phased out once the Tachyon tx cost model stabilizes.
Without capacity reservations, legacy transactions could crowd out Tachyon transactions during the migration period, preventing the network from realizing Tachyon's scaling benefits.
Given Tachyon's staged rollout, pre-aggregation Zcash will likely need an uncle policy to prevent excessive miner variance at 15s. This aligns with Ethereum's pre-merge uncle design. We need to work through the math for uncle-existence probability to show it is very small at ~1s network gossip + execution time.
#### Implementation Architecture
Let's assume we want to reserve **75%** of the block capacity for Tachyon transactions while maintaining **25%** for legacy types. We can define a binary classification system like so:
| **LEGACY (25%)** | **TACHYON (75%)** |
|------------------|-------------------|
| Fully transparent | Orchard V5/V6 |
| Uses Sprout | Out-of-band |
| Has ciphertexts | No ciphertexts |
| Pre-V5 format | Future PCD |
Based on this, blocks can track two independent capacity pools:
- Legacy pool: 500 KB (25%), limited sigops (5,000)
- Tachyon pool: 1,500 KB (75%), high sigops (15,000)
Legacy should be able to overflow into the Tachyon pool when it is under-utilised, but legacy's reserved capacity is hard-capped so it cannot crowd out Tachyon demand. The rationale is illustrated by an early rollout scenario (see the sketch after this list):
- Low Tachyon adoption (Month 1)
  - Legacy transaction demand: 800 KB
  - Tachyon transaction demand: 200 KB
  - Result: legacy uses its 500 KB pool plus 300 KB of the 1,300 KB of unused Tachyon space, so the full 800 KB of legacy demand is included.
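A minimal sketch of this overflow rule, with illustrative constant names and the 25/75 split from above:
```rust
const LEGACY_POOL_BYTES: u64 = 500_000;    // 25% of MAX_BLOCK_BYTES
const TACHYON_POOL_BYTES: u64 = 1_500_000; // 75% of MAX_BLOCK_BYTES

/// Bytes available to legacy transactions given how much of the
/// Tachyon pool is actually used: the reserved pool plus any leftover.
fn legacy_budget(tachyon_bytes_used: u64) -> u64 {
    LEGACY_POOL_BYTES + TACHYON_POOL_BYTES.saturating_sub(tachyon_bytes_used)
}

fn main() {
    // Month-1 scenario: Tachyon uses only 200 KB, so legacy's budget is
    // 500 KB + 1,300 KB = 1,800 KB and the full 800 KB of demand fits.
    assert_eq!(legacy_budget(200_000), 1_800_000);
}
```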
A transaction can be classified as *legacy* if it meets any of these conditions:
| Criterion | Rationale | Code Check |
|------------------------|---------------------------------------------------|--------------------------------------------|
| Fully Transparent | No privacy benefit, uses UTXO model inefficiently | No sapling/orchard/sprout data |
| Uses Sprout | Deprecated shielded pool, inefficient | Has joinsplit data |
| Has In-Band Encryption | Contains ciphertexts (pre-Tachyon model) | Sapling outputs OR Orchard without marker |
| Pre-V5 Format | Old transaction versions (V1–V4) | Transaction version < 5 |
If none of these criteria are met, we can assume the transaction is *Tachyon-optimised*. Moreover, to distinguish Orchard transactions with vs. without ciphertexts, we need a marker: perhaps an out_of_band_payments flag field. A hedged sketch follows.
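This sketch of the classification uses illustrative stand-in types: `Tx` and the `out_of_band` marker are not Zebra's actual API, and the marker field is the proposal above:
```rust
struct OrchardBundle {
    out_of_band: bool, // proposed Tachyon marker, e.g. `out_of_band_payments`
}

struct Tx {
    version: u32,
    has_sprout_joinsplits: bool,
    sapling_output_count: usize,
    orchard: Option<OrchardBundle>,
}

#[derive(Debug, PartialEq)]
enum Class {
    Legacy,
    Tachyon,
}

fn classify(tx: &Tx) -> Class {
    // Pre-V5 format, Sprout, or in-band Sapling ciphertexts => legacy.
    if tx.version < 5 || tx.has_sprout_joinsplits || tx.sapling_output_count > 0 {
        return Class::Legacy;
    }
    match &tx.orchard {
        // Orchard with the Tachyon marker set => Tachyon-optimised.
        Some(bundle) if bundle.out_of_band => Class::Tachyon,
        // Orchard with in-band ciphertexts, or fully transparent => legacy.
        _ => Class::Legacy,
    }
}
```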
##### Block production algorithm
1. Receive mempool transactions
2. Classify each as legacy/tachyon
3. Sort by fee priority (ZIP-317: the existing ZIP for fee-weighted random selection remains)
4. Fill Tachyon capacity first
5. Fill legacy capacity (limited)
6. Allow legacy overflow if Tachyon capacity unused
7. Return selected transactions
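A minimal sketch of steps 1-7, with illustrative types; ZIP-317 weighting is assumed to have pre-sorted the mempool by fee priority, and sigop accounting is omitted for brevity:
```rust
#[derive(Clone, Copy, PartialEq)]
enum Pool { Legacy, Tachyon }

struct MemTx { pool: Pool, bytes: u64 }

const LEGACY_CAP: u64 = 500_000;    // 25% of 2 MB
const TACHYON_CAP: u64 = 1_500_000; // 75% of 2 MB

fn build_template(sorted_mempool: &[MemTx]) -> Vec<&MemTx> {
    let mut chosen = Vec::new();
    let (mut tachyon_used, mut legacy_used) = (0u64, 0u64);
    // Phase 1: fill the Tachyon pool first.
    for tx in sorted_mempool.iter().filter(|t| t.pool == Pool::Tachyon) {
        if tachyon_used + tx.bytes <= TACHYON_CAP {
            tachyon_used += tx.bytes;
            chosen.push(tx);
        }
    }
    // Phase 2: legacy gets its reserved pool plus any unused Tachyon space.
    let legacy_cap = LEGACY_CAP + (TACHYON_CAP - tachyon_used);
    for tx in sorted_mempool.iter().filter(|t| t.pool == Pool::Legacy) {
        if legacy_used + tx.bytes <= legacy_cap {
            legacy_used += tx.bytes;
            chosen.push(tx);
        }
    }
    chosen
}
```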
```python
# Consensus check applied to each block at height >= tachyon_activation_height.
def check_legacy_limits(block):
    legacy_bytes = 0
    legacy_sigops = 0
    for tx in block.transactions[1:]:  # skip the coinbase transaction
        if tx.is_legacy():
            legacy_bytes += tx.serialized_size()
            legacy_sigops += tx.sigop_count()
            if legacy_bytes > MAX_LEGACY_BYTES:
                return INVALID("TachyonLegacyBytesExceeded")
            if legacy_sigops > MAX_LEGACY_SIGOPS:
                return INVALID("TachyonLegacySigopsExceeded")
    return VALID
```
#### Risks & Mitigation
- Risk: Users don't migrate to Tachyon wallets, legacy capacity fills up, transactions stuck.
- Mitigation: Generous 25% legacy capacity. Overflow mechanism allows legacy to use unused Tachyon space and configurable limits allow emergency adjustments
- Risk: Transactions misclassified, causing incorrect capacity enforcement.
- Mitigation: Comprehensive unit tests for classification logic, along with integration tests with real transaction vectors and testnet regression.
- Risk: Tachyon takes longer, limits deployed prematurely
- Mitigation: Limits are disabled by default until activation height, and activation height set conservatively (only when Tachyon is fully ready)
#### Observability & Metrics
- Legacy vs. Tachyon transaction counts in mempool
- Block composition (legacy bytes vs. Tachyon bytes)
- Fee rates for each category
- Overflow frequency (how often legacy uses Tachyon space)
- Wallet migration rate (what % of wallets support Tachyon)
### 3. Uncle Block Inclusion (not in scope anymore)
[WIP]
Early notes on how geth handles it:
- Uncle PoW is not added to the winning block
- TotalDifficulty(block) = TotalDifficulty(parent) + Difficulty(block) so uncles are not included in this calculation. Uncle PoW is verified independently but not added to main chain's total difficulty
- MaxUncles = 2, and the uncle must be within 7 generations
- Each uncle must satisfy some conditions: no duplicate, uncle can't be a main chain block, uncle's parent must be an ancestor and uncle must have a valid PoW seal
- Re rewards, uncle miner reward = (uncle_number + 8 - block_number) * block_reward / 8
- Block miner bonus = block_reward / 32 per uncle included
- Uncle rewards are only economic incentives, Uncle PoW doesn't contribute to chain security/cumulative difficulty
- However, uncles indirectly affect difficulty via the Byzantium algorithm. The reason for this is blocks with uncles indicate higher network hash rate or poor propagation, so difficulty increases faster. Having said that, this is just difficulty adjustment, not uncle difficulty accumulation.
Tl;dr: this mechanism captures some of GHOST's benefits (reduced orphan-rate penalty) via uncle rewards, but maintains a simple longest-chain rule based on the cumulative difficulty of the main chain only. This is likely to prevent nothing-at-stake-style attacks: if uncle PoW counted, miners could build on multiple forks and include uncles to boost total difficulty artificially.
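For concreteness, the quoted geth reward formulas as a tiny worked example (pre-merge Ethereum; the block reward value is illustrative):
```rust
// uncle miner reward = (uncle_number + 8 - block_number) * block_reward / 8
fn uncle_reward(uncle_number: u64, block_number: u64, block_reward: u64) -> u64 {
    (uncle_number + 8 - block_number) * block_reward / 8
}

fn main() {
    let block_reward = 2_000_000_000_000_000_000u64; // 2 ETH in wei (illustrative)
    // An uncle included 2 generations later earns (8 - 2) / 8 = 75% of the reward.
    assert_eq!(uncle_reward(100, 102, block_reward), block_reward * 6 / 8);
    // The including miner's bonus is block_reward / 32 per uncle.
    let _miner_bonus = block_reward / 32;
}
```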
### Baseline Metrics Collection
- Mempool Monitoring
- Transaction count: zcash_mempool_size_transactions
- Total bytes: zcash_mempool_size_bytes
- Queued transactions: mempool_currently_queued_transactions
- Collection: Direct from Prometheus endpoint
- Dashboard: grafana/mempool.json
- Block Statistics
- Cumulative transactions: state.finalized.cumulative.transactions
- Derivation: Use Prometheus rate() function for per-block calculation
- Block Size Distribution
- Current state: Blocks serialized but size not exposed
- Implementation: Add histogram metric during block commit (see the sketch after this list)
- Block Propagation Latency
- Measurement points:
- Block hash first seen (gossip receipt)
- Block download completion
- Block verification completion
- Transaction Propagation Latency
- Measurement: Time from gossip receipt to verification
- Orphan Rate Tracking
- Events: Chain fork abandonment
- Metrics: Counter for orphaned blocks, reorg events
- Reorg Depth Distribution
- Measurement: Blocks between fork point and reorg tip
- Metric: Histogram of reorg depths
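For the block-size histogram flagged under Block Size Distribution above, a hedged sketch using the `metrics` crate facade that Zebra already depends on; the metric name, call site, and the `histogram!(...).record(...)` form (recent `metrics` versions) are assumptions:
```rust
use metrics::histogram;

/// Record the serialized block size at commit time.
fn record_block_size(serialized_len: usize) {
    // Emitted once per committed block; scraped via the Prometheus endpoint
    // and plotted as a size-distribution histogram in Grafana.
    histogram!("zcash.block.size.bytes").record(serialized_len as f64);
}
```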
Fee markets (WIP):
- ZIP-317 interacts heavily with post-Tachyon cost structures. Fee weighting must be re-tuned after proof aggregation and accumulator deployment
Initial metrics (WIP):
