# Identify the Bottleneck
---
## You Will See The Sky
The following mermaid diagram shows an N-tier distributed web application.
```mermaid
graph LR;
DNS[Domain Name Server]-->WSGI
WSGI-->SC
SPA[Single Page App]
SPA-- issues 100 requests / min-->LB1
SPA-->LB2
SPA-.-SC
subgraph Virtual Private Cloud
WSGI[Web Server Gateway Interface]
LB1[Load Balancer 1]
LB2[Load Balancer 2]
LB1-- handles 500 requests / min -->API1
LB2-- handles 2500 stateful connections -->WS1
WS1-- issues 200 requests / min -->LB1
SC{Serverless Cloud}
WS1[Websocket Server]
API1[API Server]
S3[S3 Bucket]
SC-->S3
SC-->Database
API1-->S3
API1-->Database
end
API1-->AI
AI[OpenAI]
```
---
When a user enters https://youwillseethesky.com/ into their URL bar, they'll be taken to the Web Server Gateway Interface. From there, a serverless cloud will respond to the request and serve the single page app or some assets from a database or S3 bucket.
After the user has the single page app on their client, they'll have a direct path to send requests to one of two load balancers, which will route their traffic to a node that can fulfill the request.
---
The single page app can make queries to an API server and to a websocket server. Once a client establishes a connection to the websocket server, that connection may remain alive for an indeterminate period of time, and the client may also make any number of requests to the API server. The websocket server will also make requests to the API server periodically, at a rate that depends on the level of user activity.
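To make that relationship concrete, here's a minimal asyncio sketch of a websocket server that turns user activity into requests to the API server. The names here (`call_api`, `ACTIVE_CLIENTS`, the `/moves` and `/presence` paths) are hypothetical stand-ins, not the real implementation.
```python
# Minimal sketch of the websocket-server -> API-server relationship described above.
import asyncio

ACTIVE_CLIENTS: set[str] = set()   # user ids with a live websocket connection

async def call_api(path: str, payload: dict) -> None:
    """Stand-in for an HTTP request to the API server (e.g. via httpx/aiohttp)."""
    await asyncio.sleep(0.01)  # simulate network latency

async def on_client_message(user_id: str, message: dict) -> None:
    """Each piece of user activity may turn into a request to the API server."""
    await call_api("/moves", {"user": user_id, "move": message})

async def periodic_sync() -> None:
    """Background traffic to the API server scales with how many clients are connected."""
    while True:
        for user_id in list(ACTIVE_CLIENTS):
            await call_api("/presence", {"user": user_id})
        await asyncio.sleep(60)
```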
---
## Classic Scaling Problems
The following are some classic problems in scaling a distributed system that I'll be running into soon, and they're authentic to the kinds of issues you'll hit in practice. I'll pose each one as a problem, then talk about what my plan was.
---
## Bottlenecks
Can you identify the most probable bottleneck in the system above before we move forward? Assume there is one server running for each component.
---
## Systems Engineering
Assume the Web Server Gateway Interface can handle infinite incoming connections but will take at least one second to respond every time.
- What can we put in place to improve this response time?
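One common answer is to put a cache in front of the slow layer so repeat requests never pay the one-second cost. Here's a minimal sketch using in-process WSGI middleware; in production this role is usually played by a CDN or a reverse-proxy cache (nginx, Varnish), and `slow_app` is just a stand-in for the real application.
```python
# Sketch of response caching in front of a slow WSGI app (assumption: responses
# are cacheable by path; a real deployment would use a CDN or reverse proxy).
import time

def slow_app(environ, start_response):
    time.sleep(1)  # the ">= 1 second" response described above
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html>You Will See The Sky</html>"]

def caching_middleware(app, ttl=60):
    cache: dict[str, tuple[float, list[bytes], str, list]] = {}

    def wrapper(environ, start_response):
        key = environ.get("PATH_INFO", "/")
        hit = cache.get(key)
        if hit and time.time() - hit[0] < ttl:
            _, body, status, headers = hit
            start_response(status, headers)
            return body                       # served from cache: no 1 s penalty
        captured = {}
        def capture(status, headers):         # intercept the inner app's response metadata
            captured["status"], captured["headers"] = status, headers
        body = list(app(environ, capture))
        start_response(captured["status"], captured["headers"])
        cache[key] = (time.time(), body, captured["status"], captured["headers"])
        return body

    return wrapper

application = caching_middleware(slow_app)
```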
---
## Network Engineering
- Assume we have one websocket server that can handle ~2500 stateful connections at a time.
- What problems might we face when scaling this?
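Part of the difficulty is that the connections are stateful: a client has to keep talking to the server that holds its connection, so you can't naively round-robin every request. One common approach is consistent hashing so each client maps to a stable server even as servers are added or removed. A minimal sketch with hypothetical server names (in practice the load balancer often handles this via sticky sessions):
```python
# Consistent hashing: pin each client to a stable websocket server.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, replicas=100):
        # virtual nodes smooth out the distribution across servers
        self._ring = sorted((_hash(f"{s}#{i}"), s)
                            for s in servers for i in range(replicas))
        self._keys = [h for h, _ in self._ring]

    def server_for(self, client_id: str) -> str:
        idx = bisect.bisect(self._keys, _hash(client_id)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["ws-1", "ws-2", "ws-3"])
print(ring.server_for("user-42"))   # the same user always maps to the same server
```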
---
## Network and Systems Engineering
- Assume we have one websocket server that can handle ~2500 stateful connections at a time.
- Imagine the user's current game is stored in the database, but right now the interface makes you explicitly load your game upon reconnecting. What needs to change?
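A sketch of one possible change: have the websocket server restore the saved game itself when a client reconnects, instead of waiting for an explicit load request from the interface. `db.fetch_game` and the message shapes here are hypothetical stand-ins.
```python
# Restore game state automatically on (re)connect.
import json

async def on_connect(websocket, user_id: str, db) -> None:
    """Called when a client (re)connects; pushes the saved game without
    requiring an explicit 'load game' action from the interface."""
    saved = await db.fetch_game(user_id)      # e.g. SELECT ... WHERE user_id = ?
    if saved is not None:
        await websocket.send(json.dumps({"type": "resume", "state": saved}))
    else:
        await websocket.send(json.dumps({"type": "new_game"}))
```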
---
## Network and Systems Engineering
- Assume we have one websocket server that can handle ~2500 stateful connections at a time.
- Disconnecting users forcefully is a very expensive action, because of the save-reload problem. What is the process for updating a server to the newest version?
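One common process is a rolling deploy with connection draining: take the node out of rotation, save and close every live connection, then restart it on the new version while the other nodes absorb traffic. A minimal sketch, with `save_game` and `remove_from_rotation` as hypothetical hooks:
```python
# Drain a websocket server gracefully before a version upgrade.
import asyncio
import signal

connections: dict[str, object] = {}   # user_id -> live websocket (stub type here)

async def save_game(user_id: str) -> None: ...          # persist to the database
async def remove_from_rotation() -> None: ...           # tell the load balancer we're unhealthy

async def drain_and_exit() -> None:
    await remove_from_rotation()                         # stop receiving new connections
    for user_id, ws in list(connections.items()):
        await save_game(user_id)                         # save before disconnecting
        await ws.close(code=1001, reason="server restarting")
    asyncio.get_running_loop().stop()

def install_signal_handler() -> None:
    loop = asyncio.get_running_loop()
    loop.add_signal_handler(signal.SIGTERM,
                            lambda: asyncio.create_task(drain_and_exit()))
```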
---
- Assume the Websocket Server issues ~200 requests per second to the API server when it is at full load, and all front-end clients together issue about 100 requests per second to the API server.
- Assume we have one API server that can process 50 requests per second.
- What problems might we face when scaling this server?
- How many API servers do we need to handle 10,000 users? (See the sketch after this list for the basic arithmetic.)
- 100k?
- 1m?
- 10m?
- What other systems need to be looked at once our scaling hits 1 million users?
    - What questions do you have? (What does this depend on?)
- 10 million?
- 1 billion users! [dr evil pinky.jpg]
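As a starting point, here's the arithmetic for the load described above (100 req/s from clients, 200 req/s from the websocket tier, 50 req/s per API server). Answering the 10,000 / 100k / 1m-user versions requires additional assumptions about per-user request rates; the linked answers further down work through those numbers.
```python
# Back-of-the-envelope capacity math for the load described above.
import math

def api_servers_needed(client_rps: float, ws_rps: float,
                       per_server_rps: float = 50.0) -> int:
    """ceil(total demand / per-server capacity); no headroom or redundancy."""
    return math.ceil((client_rps + ws_rps) / per_server_rps)

# 100 req/s from all clients + 200 req/s from the websocket server,
# against API servers that each process 50 req/s:
print(api_servers_needed(client_rps=100, ws_rps=200))   # -> 6
```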
---
- Here are some of the answers to the calculations: https://docs.google.com/spreadsheets/d/1xJS7B0RvEd1tCSWtL2Iy8BvrgA1jS9FD0R0aSF6IiaI/edit?usp=sharing
---
## Question 1:
The following mermaid diagram shows an N-tier distributed web application. We want you to find the bottleneck in the system given each component's speed in "Requests Per Minute". Assume we have naively launched only the minimum number of servers required to handle each stage of the request.
```mermaid
graph LR;
A[Web Client 100k] -- 10 requests/min --> B[Web Server 20];
C[Application Server 20];
D[Database Server 30];
E[Caching Layer];
B -- 100 requests/min --> C
B -- 50 requests/min --> E
C -- 500 requests/min --> D
```
---
## Question 2:
The following mermaid diagram shows an N-tier distributed web application. We want you to find the bottleneck in the system given each component's speed in "Requests Per Minute". Assume we have naively launched only the minimum number of servers required to handle each stage of the request.
```mermaid
graph LR;
A[Web Client x 2,000,000] -- 50 requests/min --> B[Web Server x 50];
C[Application Server x 1000];
E[Cache]
F[Mongo Database x 2000]
B -- 300 requests/min --> C
B -- 200 requests/min --> E
C -- 500 requests/min --> D[SQL Server x 500];
C -- 1500 requests/min --> F
```
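For both questions, a reasonable first pass is: total capacity of a tier = number of servers × requests/min one server can handle, and the tier with the smallest total capacity is the likely bottleneck (this simple version ignores the fan-out between tiers that the edge labels hint at). A small sketch with placeholder numbers, deliberately not the answers to Question 1 or 2:
```python
# Find the tier with the smallest total capacity (servers x per-server rate).
def find_bottleneck(tiers: dict[str, tuple[int, float]]) -> str:
    """tiers maps name -> (server_count, requests_per_min_per_server)."""
    return min(tiers, key=lambda name: tiers[name][0] * tiers[name][1])

example = {                            # placeholder numbers, not the exercise answers
    "web servers": (20, 100),          # 20 servers x 100 req/min = 2,000 req/min
    "application servers": (20, 500),  # 10,000 req/min
    "database servers": (30, 300),     # 9,000 req/min
}
print(find_bottleneck(example))        # -> "web servers"
```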
---
Review (stuff you might get asked in an interview):
- What techniques do you use to identify a bottleneck in a distributed system?
- What metrics do you use to measure a bottleneck in a distributed system?
- What are some common methods used to reduce bottlenecks in distributed systems?
- What strategies do you employ to troubleshoot and fix bottlenecks in distributed systems?
- How do you analyze monitored data to identify the root cause of a bottleneck in a distributed system?
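For the metrics question, two numbers that come up constantly are tail latency (p95/p99) and utilization (offered load divided by what the component can serve). A tiny sketch with made-up sample data:
```python
# Tail latency and utilization as bottleneck indicators (sample data is invented).
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of a batch of latency samples."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(p * len(ordered)) - 1)
    return ordered[idx]

latencies_ms = [12, 15, 14, 13, 220, 16, 14, 15, 13, 480]
print("p95:", percentile(latencies_ms, 0.95), "ms")

arrival_rate, service_rate = 45.0, 50.0   # req/s offered vs. req/s one server can handle
print("utilization:", arrival_rate / service_rate)   # close to 1.0 -> likely bottleneck
```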
{"metaMigratedAt":"2023-06-17T19:41:32.475Z","metaMigratedFrom":"Content","title":"Identify the Bottleneck","breaks":true,"description":"In the following mermaid diagram there is an N-tier distributed web application.","contributors":"[{\"id\":\"56934764-0576-499a-bdd9-c483f05281a7\",\"add\":6469,\"del\":985}]"}