Educative : Grokking the Modern System Design Interview

# Educative : Grokking the Modern System Design Interview

> https://www.educative.io/courses/grokking-the-system-design-interview

# System Design Interviews

## 1. Getting Ready for the System Design Interview

- It is important to note that System Design questions not only test the technical knowledge of the candidate but also their ability to approach a problem, think critically, and make trade-offs.
- Preparing for a System Design interview is not only about understanding the technical details but also about understanding the problem, breaking it down, and finding the best solution.
- ![image](https://hackmd.io/_uploads/B1TvsLW11l.png)
- Communication with the interviewer is critical. It’s not a good idea to silently work on the design. Instead, we should engage with the interviewer to ensure that they understand our thought process.
- Things will change, and things will break over time because of the following:
    - There’s no single correct approach or solution to a design problem.
    - A lot is predicated on the assumptions we make.
- Distributed systems give us guideposts for mature software principles. These include the following:
    - Robustness (the ability to maintain operations during a crisis)
    - Scalability
    - Availability
    - Performance
    - Extensibility
    - Resiliency (the ability to return to normal operations over an acceptable period of time post-disruption)

## 2. Key Concepts to Prepare for the System Design Interview

To crack the System Design interview, we’ll need to prepare in four areas:

- Fundamental concepts in System Design interview
- Fundamentals of distributed system
- The architecture of large-scale web applications
- Design of large-scale distributed systems

![image](https://hackmd.io/_uploads/By4dSvbJyx.png)

### Fundamental concepts in System Design interview

---

**PACELC theorem**

The CAP theorem doesn’t answer the question: “What choices does a distributed system have when there are no network partitions?”
The PACELC theorem answers this question. The PACELC theorem states the following about a system that replicates data:

- `if statement`: A distributed system can trade off between availability and consistency if there’s a partition.
- `else statement`: When the system runs normally without partitions, the system can trade off between latency and consistency.

![image](https://hackmd.io/_uploads/HJXQ8vbkJx.png)

- The first three letters of the theorem, PAC, are the same as the CAP theorem. The ELC is the extension here.
- The theorem assumes we maintain high availability by replication. When there’s a failure, the CAP theorem prevails. If there isn’t a failure, we still have to consider the trade-off between consistency and latency of a replicated system.

> Examples of a PC/EC system include BigTable and HBase.
> - They’ll always choose consistency, giving up availability and lower latency.

> Examples of a PA/EL system include Dynamo and Cassandra.
> - They choose availability over consistency when a partition occurs. Otherwise, they choose lower latency.

> An example of a PA/EC system is MongoDB.
> - In the case of a partition, it chooses availability but otherwise guarantees consistency.

**Server-sent events (SSEs)**

![image](https://hackmd.io/_uploads/B1mDPPWk1g.png)

### Fundamentals of distributed system

---

We can understand the limitations of specific architectures and the trade-offs needed to achieve particular goals (e.g., consistency vs. write throughput). At the most basic level, we must start with the strengths, weaknesses, and purposes of distributed systems. We need to be able to discuss topics like:

**Data durability and consistency**

We must understand the differences and impacts of storage solution failure and corruption rates in read-write processes.

**Replication**

Replication is the key to unlocking data durability and consistency. It deals with backing up data but also with repeating processes at scale.
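Replication is also where the PACELC trade-off described earlier becomes concrete, since the theorem is stated for systems that replicate data. As a minimal sketch, the if/else structure and the example classifications quoted above can be written as a small lookup. The names `Choice`, `PACELC_PROFILES`, `preferred_property`, and `pacelc_label` are hypothetical and exist only for illustration; the classifications themselves are the ones listed in these notes.

```python
from enum import Enum


class Choice(Enum):
    AVAILABILITY = "A"
    CONSISTENCY = "C"
    LATENCY = "L"


# (choice when partitioned, choice otherwise), copied from the examples
# in these notes. This table is illustrative, not an authoritative catalog.
PACELC_PROFILES = {
    "BigTable": (Choice.CONSISTENCY, Choice.CONSISTENCY),   # PC/EC
    "HBase": (Choice.CONSISTENCY, Choice.CONSISTENCY),      # PC/EC
    "Dynamo": (Choice.AVAILABILITY, Choice.LATENCY),        # PA/EL
    "Cassandra": (Choice.AVAILABILITY, Choice.LATENCY),     # PA/EL
    "MongoDB": (Choice.AVAILABILITY, Choice.CONSISTENCY),   # PA/EC
}


def preferred_property(system: str, partitioned: bool) -> Choice:
    """Return the property the system favors in the given network state."""
    on_partition, otherwise = PACELC_PROFILES[system]
    # "if" branch: partition present, choose between A and C.
    # "else" branch: no partition, choose between L and C.
    return on_partition if partitioned else otherwise


def pacelc_label(system: str) -> str:
    """Build the short label, e.g. 'PA/EL', from the two choices."""
    on_partition, otherwise = PACELC_PROFILES[system]
    return f"P{on_partition.value}/E{otherwise.value}"


if __name__ == "__main__":
    for name in PACELC_PROFILES:
        print(name, pacelc_label(name))
```

Keeping the two choices as a pair mirrors the theorem's two branches: the first element only matters during a partition, the second only when the system runs normally.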
**Partitioning**

Also called sharding, partitioning divides data across different nodes within our system. As replication distributes data across nodes, partitioning distributes processes across nodes, reducing the reliance on pure replication.

**Consensus**

Given the travel time of data packets, can an update be recorded and properly synchronized across remote nodes, and can the nodes agree on it? This is the problem of consensus: all the nodes need to agree, which prevents faulty processes from running and ensures consistency and replication of data and processes across the system.

**Distributed transactions**

Once we’ve achieved consensus, transactions from applications need to be committed across databases, with fault checks performed by each involved resource. Two-way and three-way communication to read, write, and commit are shared across participant nodes.

### The architecture of large-scale web applications

---

**N-tier applications**

Processing happens at various levels in a distributed system. Some processes are on the client, some on the server, and others on another server, all within one application. These processing layers are called tiers, and understanding how those tiers interact with each other and the specific processes they are responsible for is part of System Design for the web.

**Stream processing**

Stream processing applies uniform processes to the data stream. If an application has continuous, consistent data passing through it, then stream processing allows efficient use of local resources within the application.

### Design of large-scale distributed systems

---

This can seem like a lot, but it honestly takes only a few weeks of prep, or less if we have a solid foundation to build on. Once we know the basics of distributed systems and web architecture, it is time to apply this learning and design real-world systems.
Finding and optimizing potential solutions to these problems will give us the tools to approach the System Design interview with confidence. Once we are ready to practice our skills, we can take on some sample problems from real-world interviews, along with tips and approaches for building ten different web services.

---

Common System Design interview questions include creating a URL shortener with web crawlers, understanding the CAP theorem, discussing SQL and NoSQL databases, identifying use cases for various data models, addressing latency issues, constructing algorithms and data structures, and so on.

Consumers and businesses alike are online, and even legacy programs are migrating to the cloud. Distributed systems are the present and future of the software engineering discipline. As System Design interview questions make up a bigger part of the developer interview, having a working knowledge of distributed systems will pay dividends in our career.

## 3. Resources to Prepare for a System Design Interview

### Ask why a system works

By asking themselves the right questions, candidates can think through dense and ambiguous situations.

- Learn how popular applications work at a high level, for example, Instagram, Twitter, and so on.
- Start to understand and ask why some component was used instead of another, for example, Firebase versus SQL.
- Build serious side projects. Start with a simple product and then improve and refine it.
- Build a system from scratch, and get familiar with all the processes and details of its construction.

## 4. The Do’s and Don’ts of the System Design Interview

### Strategize, then divide and conquer

![image](https://hackmd.io/_uploads/rJv4gKygyg.png)

### Ask refining questions

We can put on our product manager hat and prioritize the main features by asking the interviewer refining questions. The idea is to go on a journey with the interviewer about why our design is good.
These interviews are designed to gauge if we’re able to logically derive a system out of vague requirements. We should ensure that we’re solving the right problem. Often, it helps to divide the requirements into two groups:

- Requirements that the clients need directly.
    - Example: the ability to send messages in near real time to friends.
- Requirements that are needed indirectly.
    - Example: messaging service performance shouldn’t degrade with increasing user load.

### Handle data

Some important questions to ask ourselves when searching for the right systems and components include the following:

- What’s the size of the data right now?
- At what rate is the data expected to grow over time?
- How will the data be consumed by other subsystems or end users?
- Is the data read-heavy or write-heavy?
- Do we need strict consistency of data, or will eventual consistency work?
- What’s the durability target of the data?
- What privacy and regulatory requirements apply to storing or transmitting user data?

### Discuss the components

At some level, our job might be perceived as figuring out which components we’ll use, where they’ll be placed, and how they’ll interact with each other. Front-end components, load balancers, caches, databases, firewalls, and CDNs are just some examples of system components.

### Discuss trade-offs

These are some of the reasons why such diversity exists in design solutions:

- Different components have different pros and cons. We’ll need to carefully weigh what works for us.
- Different choices have different costs in terms of money and technical complexity. We need to efficiently utilize our resources.
- Every design has its weaknesses. As designers, we should be aware of all of them, and we should have a follow-up plan to tackle them.

We should point out weaknesses in our design to our interviewer and explain why we haven’t tackled them yet.
An example could be that our current design can’t handle ten times more load, but we don’t expect our system to reach that level anytime soon. We have a monitoring system to keep a very close eye on load growth over time so that a new design can be implemented in time. **This is an example where we intentionally accepted a weakness to reduce system cost.**

### What not to do in an interview

- Don’t write code in a system design interview.
- Don’t start building without a plan.
- Don’t work in silence.
- Don’t describe numbers without reason. We have to frame them.
- If we don’t know something, we don’t paper over it, and we don’t pretend to know it.

---

# Introduction

## 1. Introduction to Modern System Design

### Understanding System Design

**System design** is the process of defining components and their integration, APIs, and data models to build large-scale systems that meet a specified set of functional and non-functional requirements.

![image](https://hackmd.io/_uploads/B1PvUtJxyl.png)

**System design aims to build systems that are reliable, effective, and maintainable, among other characteristics.**

- **Reliable systems** handle faults, failures, and errors.
- **Effective systems** meet all user needs and business requirements.
- **Maintainable systems** are flexible and easy to scale up or down. The ability to add new features also comes under the umbrella of maintainability.

# Abstractions

~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~

# Non-functional System Characteristics

~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~

# Back-of-the-envelope Calculations

~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~

# Building Blocks

## 1. Introduction to Building Blocks for Modern System Design

The purpose of separating the building blocks is to thoroughly discuss their design just once. This means that later we can use them anywhere without having to go over them in detail again.
We can think about building blocks as bricks to construct more effective, capable systems. Many of the building blocks we discuss are also available for actual use in the public clouds, such as Amazon Web Services (AWS), Azure, and Google Cloud Platform (GCP).

**Building blocks example:**

1. **[Domain Name System](https://www.educative.io/courses/grokking-the-system-design-interview/introduction-to-domain-name-system-dns):** This building block focuses on how to design hierarchical and distributed naming systems for computers connected to the Internet via different Internet protocols.
2. **[Load Balancers](https://www.educative.io/courses/grokking-the-system-design-interview/introduction-to-load-balancers):** Here, we’ll understand the design of a load balancer, which is used to fairly distribute incoming clients’ requests among a pool of available servers. It also reduces load and can bypass failed servers.
3. **[Databases](https://www.educative.io/courses/grokking-the-system-design-interview/introduction-to-databases):** This building block enables us to store, retrieve, modify, and delete data in connection with different data-processing procedures. Here, we’ll discuss database types, replication, partitioning, and analysis of distributed databases.
4. **[Key-Value Store](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-the-key-value-store):** It is a non-relational database that stores data in the form of a key-value pair. Here, we’ll explain the design of a key-value store along with important concepts such as achieving scalability, durability, and configurability.
5. **[Content Delivery Network](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-the-content-delivery-network-cdn):** In this chapter, we’ll design a content delivery network (CDN) that’s used to keep viral content such as videos, images, audio, and webpages. It efficiently delivers content to end users while reducing latency and burden on the data centers.
6. **[Sequencer](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-sequencer):** In this building block, we’ll focus on the design of a unique ID generator with a major focus on maintaining causality. It also explains three different methods for generating unique IDs.
7. **[Service Monitoring](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-distributed-monitoring):** Monitoring systems are critical in distributed systems because they help analyze the system and alert the stakeholders if a problem occurs. Monitoring often serves as an early warning system so that system administrators can act before an impending problem becomes a huge issue. Here, we’ll build two monitoring systems, one for server-side errors and the other for client-side errors.
8. **[Distributed Caching](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-the-distributed-cache):** In this building block, we’ll design a distributed caching system where multiple cache servers coordinate to store frequently accessed data.
9. **[Distributed Messaging Queue](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-the-distributed-messaging-queue):** In this building block, we’ll focus on the design of a queue consisting of multiple servers, which is used between interacting entities called producers and consumers. It helps decouple producers and consumers, results in independent scalability, and enhances reliability.
10. **[Publish-Subscribe System](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-the-pub-sub-abstraction):** In this building block, we’ll focus on the design of an asynchronous service-to-service communication method called a pub-sub system. It is popular in serverless and microservices architectures and in data processing systems.
11. **[Rate Limiter](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-the-rate-limiter):** Here, we’ll design a system that throttles incoming requests for a service based on the predefined limit. It is generally used as a defensive layer for services to avoid their excessive usage, whether intended or unintended.
12. **[Blob Store](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-a-blob-store):** This building block focuses on a storage solution for unstructured data, for example, multimedia files and binary executables.
13. **[Distributed Search](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-the-distributed-search):** A search system takes a query from a user and returns relevant content in a few seconds or less. This building block focuses on the three integral components: crawl, index, and search.
14. **[Distributed Logging](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-distributed-logging):** Logging is an I/O-intensive operation that is time-consuming and slow. Here, we’ll design a system that allows services in a distributed system to log their events efficiently. The system will be made scalable and reliable.
15. **[Distributed Task Scheduling](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-the-distributed-task-scheduler):** We’ll design a distributed task scheduler system that mediates between tasks and resources. It intelligently allocates resources to tasks to meet task-level and system-level goals. It’s often used to offload background processing to be completed asynchronously.
16. **[Sharded Counters](https://www.educative.io/courses/grokking-the-system-design-interview/system-design-the-sharded-counters):** This building block demonstrates an efficient distributed counting system to deal with millions of concurrent read/write requests, such as likes on a celebrity’s tweet.

---

# Domain Name System

## 1. Introduction to Domain Name System (DNS)

![image](https://hackmd.io/_uploads/ry2Wu5Jxyx.png)

### Important details

- **Name servers**: It’s important to understand that the DNS isn’t a single server. It’s a complete infrastructure with numerous servers. DNS servers that respond to users’ queries are called name servers.
- **Resource records**: The DNS database stores domain name to IP address mappings in the form of resource records (RR). The RR is the smallest unit of information that users request from the name servers. There are different types of RRs. The table below describes common RRs. The three important pieces of information are `type`, `name`, and `value`. The name and value change depending upon the type of the RR.

### Common Types of Resource Records

# Load Balancers

# Databases

# Key-value Store

# Content Delivery Network (CDN)

# Sequencer

# Distributed Monitoring

# Monitor Server-side Errors

# Monitor Client-side Errors

# Distributed Cache

# Distributed Messaging Queue

# Pub-sub

# Rate Limiter

# Blob Store

# Distributed Search

# Distributed Logging

## 1. System Design: Distributed Logging

### Need for logging

**Issues with using print statements as an alternative to logging**

![image](https://hackmd.io/_uploads/Sy8LB91xJg.png)

Logging allows us to **understand our code**, **locate unforeseen errors**, **fix the identified errors**, and **visualize the application’s performance**. This way, we are aware of how production works, and we know how processes are running in the system.

Log analysis helps us with the following scenarios:

- To troubleshoot applications, nodes, or network issues.
- To adhere to internal security policies, external regulations, and compliance.
- To recognize and respond to data breaches and other security problems.
- To comprehend users’ actions for input to a recommender system.

## 2. Introduction to Distributed Logging

## 3. Design of a Distributed Logging Service

# Distributed Task Scheduler

# Sharded Counters
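The building-blocks overview describes sharded counters as a way to handle millions of concurrent read/write requests, such as likes on a celebrity’s tweet. As a rough illustration of the idea only, the single-process sketch below spreads increments over several independently locked shards and sums them on read. It assumes threads stand in for concurrent clients and in-memory integers stand in for shards that would live on separate nodes in a real deployment. The class name `ShardedCounter`, its methods, and the default shard count are hypothetical and not taken from the course.

```python
import random
import threading


class ShardedCounter:
    """Toy sharded counter: writes hit one random shard, reads sum all shards."""

    def __init__(self, num_shards: int = 8) -> None:
        if num_shards < 1:
            raise ValueError("num_shards must be at least 1")
        self._counts = [0] * num_shards
        # One lock per shard, so writers only contend within a shard.
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def increment(self, amount: int = 1) -> None:
        # Picking a random shard spreads write contention across shards.
        shard = random.randrange(len(self._counts))
        with self._locks[shard]:
            self._counts[shard] += amount

    def value(self) -> int:
        # Reads are more expensive than writes: every shard is visited.
        total = 0
        for shard, lock in enumerate(self._locks):
            with lock:
                total += self._counts[shard]
        return total


if __name__ == "__main__":
    likes = ShardedCounter(num_shards=4)

    def like_many(times: int) -> None:
        for _ in range(times):
            likes.increment()

    workers = [threading.Thread(target=like_many, args=(1000,)) for _ in range(4)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print(likes.value())
```

The design choice this sketch highlights is the trade between write and read cost: more shards reduce contention on increments but make each read touch more shards.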