Concept of Distributed Systems

# Concept of Distributed Systems A distributed system is a collection of independent computers (nodes) that appear to the users as a single system. These nodes coordinate and communicate via a network to achieve a common goal. :::success Examples: Blockchain ::: ### Key Characteristics: - Concurrency: Multiple nodes/processes running at once. - Fault Tolerance: Able to continue working despite some node failures. - Scalability: Can grow in size/performance by adding more nodes. - No global clock: Each machine runs independently. ## Safety and Liveness in Distributed Systems These are two critical properties for ensuring correctness of distributed protocols: ### Safety: Nothing bad ever happens. :::success Example: No two nodes agree on different values (consistency). Violated if consensus is broken. Or 2 nodes trying to access a critical resource at the same time ::: ### Liveness: Something good eventually happens. :::success Example: The system eventually reaches consensus or responds. Violated if the system is stuck or halts. ::: Trade-off: In some failure scenarios (e.g., network partitions), you can’t guarantee both safety and liveness at the same time. ## The Byzantine Fault This refers to a failure where a component (node) may act maliciously, including lying or colluding. :::success Examples of Byzantine Faults: - In a blockchain, a Byzantine miner might propose invalid transactions or create a malicious fork to double-spend. - In a distributed database, a Byzantine node might send conflicting updates to different replicas, causing inconsistency. ::: ### How Ethereum Handles Byzantine Faults - Pre-Merge: Proof of Work (PoW): PoW assumes that less than 50% of the computational power is controlled by Byzantine nodes. If malicious miners try to create a false chain (e.g., to double-spend), they need more computational power than the honest majority, which is computationally expensive. During a network partition, Byzantine nodes might exacerbate forks by proposing conflicting chains. Ethereum resolves this by choosing the longest chain, ensuring safety but potentially sacrificing liveness (transactions on minority chains are discarded). - Post-Merge: Proof of Stake (PoS): During a partition, if a supermajority of validators is in one partition, finality can proceed, maintaining safety. If validators are split, finality may pause (sacrificing liveness) until the partition heals, but safety is preserved as no conflicting blocks are finalized. Byzantine validators could try to exploit partitions, but the 2/3 supermajority requirement ensures consistency. ## Block Reorgs and Reorg Attacks Block Reorganization (Reorg): Happens when a blockchain client replaces the current longest chain with a longer valid one (typically due to network latency or a more powerful miner). Reorg Attack: An attacker (with enough hashing power or stake) intentionally mines an alternative chain that eventually becomes longer, causing a rollback of recent blocks. This can enable: - Double-spending - Transaction censorship :::success Mitigation: Wait for multiple confirmations (e.g., 6 blocks on Bitcoin) before considering a transaction final. For Ethereum POS, finality provides a stronger guarantee than confirmations, making reorgs of finalized blocks economically infeasible. Also slashing and inactivity leaks deter Byzantine behavior and ensure liveness. ::: ## Uncle Blocks / Ommer Blocks These are valid blocks that were mined almost at the same time as the main block but didn't become part of the main chain (due to latency or network partition). Ethereum includes references to uncles for security and decentralization. Miners who produce uncle blocks still get a partial reward. ## Concurrency and Parallel Programming ### Concurrency: Multiple tasks make progress independently, but not necessarily at the same time. Think of multitasking — interleaved execution. :::success Managed via threads, goroutines, channels, etc. ::: ### Parallelism: Multiple tasks run at the same time on different cores/processors. :::success Focus is on performance — actual simultaneous execution ::: Example: - Ethereum is concurrent at the network level: Multiple nodes in the Ethereum network are concurrently: - Validating blocks - Gossiping transactions - Syncing data - Listening to peer updates For Parralelism, - Multiple miners/validators work in parallel to solve PoW (legacy) or validate blocks (PoS). - Clients can download block data, sync headers, and verify signatures in parallel across CPU cores. ## The CAP Theorem This is a fundamental principle in distributed systems, stating that a distributed system can only guarantee two out of three properties at any given time: Consistency, Availability, and Partition Tolerance. - Consistency: All nodes in the system see the same data at the same time. Reads always return the most recent write. - Availability: Every request (read or write) receives a response, even if some nodes fail, ensuring the system remains operational. - Partition Tolerance: The system continues to function despite network partitions (communication failures between nodes). :::success The theorem asserts that in the presence of a network partition ( P ), a distributed system must choose between Consistency or safety ( C ) or Availability or liveness (A), as achieving all three simultaneously is impossible. :::