Lecture 18: Distributed Systems

# Lecture 18: Distributed Systems Distributed systems is how most modern compution is done. we are constantly talking to other computers. even the OS on a single computer needs to worry about distributed issues (it plays a role). (file download) We use distributed systems becuase they have: - Better scalability and performance - improved reliability and availability - ease of use, with reduced operation expenses - enabling new collaboration and business models A few problems include: - different machines do not share memory - they do not know the state of another - The only way to interact remotely is to use a network - usually asynch, slow and error prone - not controlled by any single machine - failures of one machine are not visivle to other machines Ideally, a distributed system would be just like a single machine system but it has more resources, is more reliable and faster. **Transparent** distributed systems look as much like a sinlge machine system as possible. ### Deutsh's "Seven (Eight) Fallacies of Network Computing" 1. The network is reliable 2. There is no latency (instant response time) 3. The available bandwith is infinite 4. the network is secure 5. the topology of the network does not change 6. there is one admin for the whole network 7. the cost of transporting additional data is zero 8. all locations on the network are equivalent bottom Line: true transparency is not achievable ## Distributed System Paradigms **parallel processing** which rely on tightly coupled special hardware - not widely use and not gonna discuss **single system images** which make all the nodes look like one big computer. (somewhere between hard and impossible) **Loosely coupled systems** which work with difficulties as best as you can - this is the typical modern approach **cloud computing** which is a recent variant ## Loosely Coupled Systems it is a parallel group of independnent computers which are connected by a high speed LAN, serving similar but independent request and minimal coordination and cooperation required the motivation for this method is it has scalability, good prive performance, availability (if protocol permits stateless servers), ease of management and reconfigurable capacity most common examples are web servers and app servers ### Horizontal Scalability Each node is largely independent so you can add capacity just by adding a node "on the side". Scalability can be limited by network, instead of hardware or algorithms. Reliabilty is also high as failure of just one of N nodes just reduces capacity. ![Screenshot 2024-06-04 at 3.16.39 PM](https://hackmd.io/_uploads/SyeHLMzpN0.png) ### Elements of Loosely Coupled Architechture **farm of independent servers** for running the same software and serving different requests. They also may shared a common back end database. **front-end switch** is a switch that can distribute incoming requests amoung available servers which can do both load balancing and fail-over. **Service protocol** are stateless servers and idempotent (same results for n times) operations. successive requests may be sent to different servers. ### Horizontally Scaled Performance individual servers are very inexpensive (e.g. blade servers may only be $100 to $200 each) scalability is execellent where 100 servers = 100x performance. Service availability is excellent where frong-end automatically bypasses failed servers and stateless servers and client retries fail-over easily. The challenge is managing thousands of servers with automated installation, global configuration services, self monitoring, self-healing systems, scaling limited by mangaement, not HW or algorithms. ## Cloud Computing This is the most recent twist on distributed computing where in principle anything can run on the cloud. Much of the work is run using special tools. These tools support particular kinds of parallel/distributed processing using methods such as map-reduce or horizontal scaling This means the use does not need to be a distributed systems expert. ### Map-Reduce Perhaps the most common cloud computing software tool. It uses a method of dividing large problems into compartmentalized pieces. Each of which can be performed on a separate node with an eventual combined set of results. lets say there is a single function you want to perform on a lot of data such as searching it for a partiuclar string. It will divide the data into disjoint pieces, perform the function on each piece on a separate node (map) and combine the results to obtain output (reduce). EX: 64 megabytes of text data gets divides into 4 chunks (16 M bytes), each chunk goes to one processor and performs the map function of "count words" on each We might have two more nodes assigned to doing the reduce operation and they will each recieve a share of data from a map node. The reduce node performs a reduce operation to "combine" the shares and outputting its own result. We can have one reduce node the combines everything ### Synchronization in MapReduce Each map node produces an output file for each reduce node and it is produced atomically. The reduce node cannot work on this data untile the whole file is written which forces a synchronization point between the map and reduce phases. ### Cloud Computing and Horizontal Scaling this is an excellent match as you can rent some cloud nodes to be your web servers. If the load gets heavy you can ask for more nodes. As the load lightens you can release uneeded nodes. No need to buynew machines and there is no need to administer your own machines. ## Distributed Synchronization This can be quite difficult as it has to deal with: - **spatial separation** where different processes run on different systems. There is also no shared memory for atomic instruction locks. may even have different OS systems - **temporal separation** where we cannot "totally order" spatially separated events. before/simultaneous/after lose their meaning - independnet modes of failure where if one partner can die, while others continue as a solution to the lock issue we could consider **leases** as a more robust lock these are obtained from resource manager which gives client exclusive right to update the file. the lease "cookie" must be passed to server on update and can be released at the end of the critical section this will only be valid for a limited period of time after which the lease cookie expires after which new leases can be granted. This can handle a wide range of failures such as process, client node, server node, and network revoking an expired lease is fairly easy and it includes a "good until" time to check. any operation with a stale cookie fails. This makes it safe to issue a new lease where the old lease-hold can no longer access object. object must be restored to last "good" state where we roll back to a state prior to the aborted lease (all or none transactions) ## Security for Distributed Systems problems with security: - your OS cannot guarantee privacy and integrity - authentication is harder - the wire connecting the user to the system is insecure - even with honest partners, hard to coordinate distributed security - the internet is an open network for all Goals of Network Security: - secure conversations - only you and reciever knows what is said and no one can tamper with your messages - positive identification of both parties - availability: nodes and network must be reachable when they need to be Elements of network security: - cryptography for transporting data and authentication - cryptographic hases to detect message alterations - digital signatures and public key certificates - filtering technologies (firewalls, ..., to keep bad stuff from reaching our machines) check sums are often used to detect data corruption where it adds up all bytes in a block and recipient adds up all the recieved bytes and compare (pretty weak) cryptographic hases are very strong check-sums as it is very hard to find two messages with the same hash. They are one way so they cannot infer original input from output and is well distributed as any change to inout changes output. using Cryptographic Hashes: - start with a message you want to protect - compute hash - transmit the hash securely - recipient does the same computation on recieved text - if both hash results agree the message is intact the secure hash transport has to be encrypted and, unless secrecy required, cheaper than encrypting entire message or via a secure channel (secure channel unlikely) ## Putting it Together: Secure Socket Layer (SSL) A general solution for securing network communication which is built on top of existing socket IPC. It establishes a secure link between two parties which guarentees privacy and integrity. there is also a certificate-based authenticaiton of server, optional certificate authentication of client, PK used to distribute a symmetric session key, rest of data transport switches to symmetric crypto. ### Digital Signatures encrypting a message with private key to sign it so only you could have encrypted it. encrypting everything with your private key is a bad idea as asymetric encryption is extremely slow. If you only care about integrity you do not need to encrypt it all such as using crypto hash and encrypting that hash with your private key instead of the whole message. ![Screenshot 2024-06-06 at 2.33.55 PM](https://hackmd.io/_uploads/ByRBosJrA.png) ### Signed Load Modules how do we know we can trust a program? we can by designating a certification authority that will verify the reliability of the software by code review and testing. they will sign a certified module with their private key. we can verify signature with their public key as it proves the module was certified by them and has not been tampered with. ### Important Public Key Issue If I have a public key I can authenticate messages, but how can I be sure who owns the private key? I can get Microsoft's public key when I first buy their OS so I can verify their load modules and updates. More generally we can get a certificate of authenticity to guarantee who the real owner of a public key is. **PK certificate** is a data structure the contains an identity and a matching public key, a digital signature of those items, and signature usually signed by someone I trust. if I know public key of the authority who signed it I can validate that signature is correct and that the certificate has not been tampered with. I can also ture they authenticated the certifcate owner (same way we trust drivers licenses). it is a little bit of a chicken and egg problem where I can learn the public key of a new partner using his certificate but to use his certificate I need the public key of whoever signed it. So how do I get that public key commonly solved by **out of band communication** by having the key in a trusted program, like a web browser or hand delievered.