Build a scalable system
==

Consider a scenario of a user placing a bet:

```mermaid
sequenceDiagram
    participant User
    participant UI
    participant API
    participant DB
    User ->> UI: click "place bet" button
    UI ->> API: place bet
    API ->> DB: get user balance
    API ->> DB: get price
    API ->> API: check balance and price
    API ->> DB: update balance
    API ->> DB: save the bet
    API ->> UI: ok
    UI ->> User: display confirmation
```

At first glance the logic looks simple, but let's take a closer look at the API server and what it needs to do:

1. The client sends a TCP handshake to the server.
2. The server (kernel) opens (and maintains) the TCP connection.
3. The client sends the payload to the server. The payload may be split into multiple packets and sent with delays: Nagle's algorithm can introduce up to 200 ms of latency, and the network MTU/MSS affects how the packets are segmented.
4. The server (kernel) buffers the packets received from the client.
5. The application reads the payload from the kernel buffer, parses the HTTP request headers, then hands over to application code for handling the request.
6. As the first step in application code, it talks to the database to get the user balance. This hits DNS first to resolve the database server's IP address, then opens a TCP connection for database communication.
7. The database opens a session, implicitly opens a transaction for the SQL statement, acquires a shared read lock, retrieves the data, releases the shared read lock, then returns the data to the application server.
8. Depending on how the code is written, the same process repeats when the application reads the price or writes to the database (update balance / save the bet): open a DB connection, create a session, start a transaction, lock the row (shared read lock or exclusive write lock), commit, then return the response to the application server.
9. The application sends the response to the client over the TCP connection. Again, Nagle's algorithm and MTU/MSS play a part and can affect latency.
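The happy-path application logic in the diagram can be sketched as a single handler. Below is a minimal sqlite3 sketch; the table layout, the `place_bet` name, and passing the price in as a parameter are illustrative assumptions, not part of any real API:

```python
import sqlite3

def place_bet(conn: sqlite3.Connection, user_id: int, amount: int, price: int) -> bool:
    """Mirror the sequence diagram: read balance, check, update balance, save the bet."""
    cur = conn.cursor()
    cur.execute("BEGIN")  # one transaction for the whole flow
    row = cur.execute("SELECT balance FROM users WHERE id = ?", (user_id,)).fetchone()
    if row is None or row[0] < amount:  # the "check balance and price" step
        conn.rollback()
        return False
    cur.execute("UPDATE users SET balance = balance - ? WHERE id = ?", (amount, user_id))
    cur.execute("INSERT INTO bets (user_id, amount, price) VALUES (?, ?, ?)",
                (user_id, amount, price))
    conn.commit()  # the "ok" back to the UI
    return True

# autocommit mode so BEGIN/COMMIT are managed explicitly
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE bets (user_id INTEGER, amount INTEGER, price INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 100)")
print(place_bet(conn, 1, 30, 5))   # True, balance is now 70
print(place_bet(conn, 1, 200, 5))  # False, insufficient balance
```

Note that this sketch captures only the application-level steps; everything the kernel and database do underneath (items 1-10 below) is invisible in the code, which is exactly the point of the section.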
10. Upon completion of the response, the application goes through a four-way close to release the connection resources.

The amount of work happening under the hood is already enormous, let alone all the errors that could occur along the flow:

1. The maximum number of ports available to a server address is 65535, and cloud providers may impose a much lower connection limit: a medium-size Azure instance can support at most 4096 concurrent connections.
2. At different stages of the TCP connection, a malicious client could attack the system to force exceptions (SYN flooding, RST attacks, session hijacking), or exceptions could come from a congested network (e.g. a large number of TIME_WAIT sockets).
3. The application server needs to handle the case where the connection is terminated before the full payload is received.
4. The application server needs to handle unexpected payloads, whether malformed at the TCP / HTTP level or invalid data at the application level.
5. When talking to the database, there are multiple error cases: failing to reach the database server (DNS error, too many database connections), database slowness or timeout (no available database worker thread, waiting for a lock, IO congestion, network congestion, etc.), or the database transaction completing but failing to return the result to the application (timeout, network error, or power loss).
6. Sending the response back to the client can also fail, e.g. the connection is lost or the client times out, causing the transaction to complete on the server side but appear to have failed on the client side.
7. Power loss can happen at any stage, with different implications (a stale DB transaction holding locks until aborted, a transaction completed on the server side that never reaches the client, etc.).

### Why the technical details matter

Any technical issue is seen as a customer issue and affects the customer experience, whether it is networking, database, or application logic.
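A double-charged bet caused by a client retry (error case 6 above) is exactly such a customer issue. A common mitigation is an idempotency key: the client attaches a unique request id, so a retry after a lost response is detected instead of re-executed. A minimal sqlite3 sketch follows; the table layout and the `place_bet_idempotent` helper name are illustrative assumptions:

```python
import sqlite3

def place_bet_idempotent(conn, request_id: str, user_id: int, amount: int) -> str:
    """Deduplicate retries: a unique request_id makes the operation safe to retry."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    seen = cur.execute("SELECT 1 FROM bets WHERE request_id = ?",
                       (request_id,)).fetchone()
    if seen:                # the first attempt already committed;
        conn.commit()       # just re-send the lost confirmation
        return "duplicate"
    cur.execute("UPDATE users SET balance = balance - ? WHERE id = ?", (amount, user_id))
    cur.execute("INSERT INTO bets (request_id, user_id, amount) VALUES (?, ?, ?)",
                (request_id, user_id, amount))
    conn.commit()
    return "placed"

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions explicitly
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE bets (request_id TEXT PRIMARY KEY, user_id INTEGER, amount INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 100)")
print(place_bet_idempotent(conn, "req-1", 1, 30))  # placed
print(place_bet_idempotent(conn, "req-1", 1, 30))  # duplicate -- retry after a lost response
print(conn.execute("SELECT balance FROM users WHERE id = 1").fetchone()[0])  # 70, not 40
```

The PRIMARY KEY on `request_id` also protects against the race where two retries arrive concurrently: the second insert fails at the database level rather than deducting twice.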
Developers are on the front line to troubleshoot customer issues, then escalate to the relevant party (infrastructure, network, etc.) when needed. Broader system knowledge is essential to do this effectively.

Another reason for developers to understand the technical details is scalability and stability. Business growth usually means a larger user base, which requires the system to scale. And with more customers, stability becomes more important. A small business may be able to live with 99% (2 nines) availability, about 7 hours of downtime per month, but any scale-up company will need 3 - 4 nines. Critical systems may require 5 nines, meaning just over 5 minutes of downtime per year.

To achieve the required level of scalability and stability, apart from knowing the application logic, it is crucial for developers to have a deep understanding of the operating system, networking, databases and infrastructure. A better way to put it is the Law of Leaky Abstractions:

*All non-trivial abstractions, to some degree, are leaky.*

As a result, to build reliable software, developers must learn the abstractions' underlying details anyway.

### Principles, Patterns, Paradigms, Practices

The following articles in this series will cover how to build a scalable system from four perspectives.

#### Principles

Principles are the foundational rules when building systems for scalability and stability.
- [Manage complexity - separation of concerns](https://tbd/)
- [Resiliency - design for failure](https://tbd/)
- [Evolutionary architecture - open-closed principle](https://tbd/)
- [Evaluate the design - high cohesion, low coupling](https://tbd/)

#### Patterns

Established architectural / engineering patterns that solve particular sets of problems:

- Concurrency - pessimistic locking vs optimistic locking
- CQRS, event-driven architecture, event sourcing
- Caching - write-around, write-through, write-back
- Scaling - scale out vs scale up
- Latency - debouncing and batching
- Transaction control - DB transactions, 2-phase commit, reconciliation, saga
- Self-healing - sentinel, probe, leader election

#### Paradigms

Paradigms are different ways of viewing / solving a common problem, and often require a mind shift or out-of-the-box thinking. They are not silver bullets, but they are the sparks that inspire thinking from different angles.

- Dependency injection
- Callbacks, Continuation Passing Style (CPS) and coroutines
- Iterators, streams and reactive programming
- Converting concurrency back to single threading
- Actor model
- Data locality optimization
- Domain Specific Language (DSL)
- Declarative programming and desired state configuration
- Separating the what and the how - functional programming
- Redux and MobX

#### Practices

Practices are a compilation of prescribed ways of working to ensure consistency across the organization, especially when building medium to large applications. They form a living document owned, maintained and updated by engineering teams, for high-level alignment across multiple engineering teams.

- Domain Driven Design
- Structured logging
- Readable code (what's the purpose and outcome)
- Unit tests (randomly changing / removing code should fail the tests)
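As a small taste of the patterns listed above, optimistic locking can be sketched with a version column: instead of holding a lock while reading, the writer asserts that the row is unchanged since it was read. The schema and the `update_balance` helper below are illustrative assumptions, sketched with sqlite3:

```python
import sqlite3

def update_balance(conn, user_id: int, delta: int, expected_version: int) -> bool:
    """Optimistic locking: the write succeeds only if nobody changed the row since we read it."""
    cur = conn.execute(
        "UPDATE users SET balance = balance + ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (delta, user_id, expected_version))
    return cur.rowcount == 1  # 0 rows updated means a concurrent writer won; re-read and retry

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 100, 0)")
print(update_balance(conn, 1, -30, 0))  # True  -- first writer wins
print(update_balance(conn, 1, -30, 0))  # False -- stale version, caller must re-read and retry
```

Compared with pessimistic locking (`SELECT ... FOR UPDATE` in databases that support it), this trades held locks for occasional retries, which tends to scale better when write conflicts are rare.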