# High Level Design
---
[TOC]
## Intro
[HLD Notes 1](https://github.com/KingsGambitLab/Lecture_Notes/tree/non-dsa/non_dsa/hld)
[HLD Notes 2](https://docs.google.com/spreadsheets/d/1Zed_C4BHrF2LFer1ZVpQivIU947KvWnRzR9EL1iNv_k/edit?usp=sharing)
---
# Load Balancing and Consistent Hashing
https://notability.com/n/1XJJIW_KWAUcIWKNuZ1dDH
## Load Balancing
- Load balancing distributes incoming requests across multiple servers (machines) so that no single machine is overloaded
- The load balancer (LB) sits between the clients and the pool of servers
- It improves availability and makes horizontal scaling possible
## Stateful vs Stateless Load Balancing
- Stateless: the LB keeps no memory of which server a user was sent to; every request is routed independently (e.g. round robin)
- Stateful: the LB remembers a user-to-server mapping so that the same user keeps hitting the same server (sticky sessions)
## Load Balancing Algorithms
## Stateless Algorithms
### Round Robin
- Type: stateless load balancing
- Request 1 goes to server 1, request 2 to server 2, and so on
- If there are 5 machines (servers), request 6 goes back to server 1
:::success
**Advantages -**
:::
- Easy and light to implement
- Each machine gets the same traffic (i.e. equal distribution of load)
:::warning
**When a server goes down -**
:::
- Remove that server from DNS
- Start sending its requests to the next server
:::danger
**Limitation -**
:::
- Assumes all machines have the same capacity
- Say one machine can handle 100 requests while another can handle 1000; RR still sends the same traffic to every machine irrespective of capacity
### Weighted Load Balancing
- Based on response time
The server with the faster (lower) response time gets more requests. Weights are quantified based on guesstimation.
- Based on number of requests per server
Each server is configured with the number of requests it can process at a time. Weights are quantified based on guesstimation. (A small sketch follows.)
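A minimal sketch (not from the notes) of plain round robin and a weighted variant; the server names and weights below are made-up placeholders.
```python
# Round robin and weighted round robin pickers (illustrative only).
from itertools import cycle

servers = ["s1", "s2", "s3"]
rr = cycle(servers)                      # round robin: s1, s2, s3, s1, ...

weights = {"s1": 1, "s2": 10, "s3": 5}   # s2 can handle ~10x the traffic of s1
weighted_pool = [s for s, w in weights.items() for _ in range(w)]
wrr = cycle(weighted_pool)               # s2 appears 10x as often as s1

def next_server(weighted: bool = True) -> str:
    """Return the next server that should receive a request."""
    return next(wrr if weighted else rr)
```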
## Stateful Load Balancing
### Simple Stateful Load Balancing
Maintain a key-value mapping of user_id to machine_num at the load balancer
:::warning
**If Machine Goes Down -**
:::
- Remove the entry from the key-value table
- For now, such a user gets routed to a random machine
- Replace the machine with a new one in the key-value table
:::danger
**Disadvantages -**
:::
- Memory heavy: for sites like Facebook the table needs an entry per user, i.e. millions of entries per server
### Range-Based Stateful Load Balancing
- Define a range per server based on, say, user_id
- Light to maintain on the LB: e.g. if the ID is between 1 and 1000, route to server 1; only simple conditions are stored on the LB
:::danger
**Disadvantages -**
:::
*If a machine goes down or a new machine is added*
- The whole range needs to be redistributed.
- Users who were not on the faulty machine also get remapped to new servers and are unnecessarily affected.
### Modulo Algorithm
- User_id % N, where N = Number of Machines
:::danger
**Disadvantages -**
:::
*If a machine goes down or a new machine is added*
- The whole key space needs to be redistributed.
- Users who were not on the faulty machine also get remapped to new servers and are unnecessarily affected.
<u>*The Consistent Hashing Algorithm solves these disadvantages.*</u>
### Consistent Hashing Algorithm
- Minimise the chaos (data movement/remapping) when a machine goes down or is added.
- Easy to maintain and implement.
- Divide the load as uniformly as possible
:::info
**Logic -**
:::
- Imagine a circle, called the consistent hashing ring (CHR), divided into a very large number of segments N (say 10^18), numbered in the clockwise direction
- Pass every user ID through a hash function (H-user) and take the value modulo N = assigned segment (AS)
- This limits the hash value to the range 0 to N-1
- N is taken so large, and the IDs are hashed first, <u>*to avoid collisions between IDs*</u>; for example, with a plain modulo and N = 10, IDs 1 and 11 would both get segment 1.
- There is a separate hash function for the server side (H-server)
- Map the assigned segments onto the CHR
- Iterate in the clockwise direction and assign the user to the first machine encountered during the iteration.
:::info
**Implimentation**
:::
- Mapping servers on the CHR
- A' = H-server(server_ID) % N
- B' = H-server(server_ID) % N
- ...
- Saving the server segment values in a sorted list -->
- server_seg_map = [A', B', ...]
- Mapping a user on the CHR -->
- U' = H-user(User_ID) % N
- Upper bound = smallest number in the array greater than or equal to the given number (wrapping to the first element if none exists)
- Use binary search for the upper bound
- Assigning a server to a user -->
- assigned_server = upper_bound(array = server_seg_map, given_num = U')
:::success
**If Machine Goes Down**
:::
- If a machine goes down, its users should not all be assigned to the single next server on the ring.
- In that case the next server would get double the traffic and could also fail.
- This is called a cascading failure.
- To solve this, <u>*assign multiple segments on the CHR to each server (virtual nodes) by using multiple H-server functions*</u> (see the sketch below).
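A minimal sketch of the CHR described above, assuming `hashlib.md5` as both H-user and H-server and `bisect` for the upper-bound lookup; the `VIRTUAL_NODES` count is an illustrative choice, not a prescribed value.
```python
import bisect
import hashlib

N = 10**18                       # number of segments on the ring
VIRTUAL_NODES = 100              # segments per server (reduces load spikes)

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % N

class ConsistentHashRing:
    def __init__(self, servers):
        self.ring = []                       # sorted list of (segment, server)
        for server in servers:
            for v in range(VIRTUAL_NODES):
                self.ring.append((_hash(f"{server}#{v}"), server))
        self.ring.sort()

    def get_server(self, user_id: str) -> str:
        u = _hash(user_id)
        # upper bound: first segment >= u, wrapping around the ring
        i = bisect.bisect_left(self.ring, (u,))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing(["A", "B", "C"])
print(ring.get_server("user-42"))
```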
---
# Caching
- Have separate machines for application code and the database, so that they are not tightly coupled
- During a code deployment the database is still available
- Server resources are not shared between code processing and data requests
- Data that every server needs, like user/login data, lives in one shared place instead of being saved on all servers
- Compute capacity and DB storage size can be scaled independently based on traffic and data volume
- Computation/processing of code (CPU) and storing the data (HDD/SSD) have different hardware requirements
:::info
**General Terms -**
:::
Cache HIT - the requested data is found in the cache and returned from there
Cache MISS - the data is not in the cache; fetch it from the DB, return it to the user and add it to the cache as well
## Sharding Part 1
- To scale the database and store high volumes of data we need multiple database machines.
- Splitting the data across multiple DB machines is called sharding.
- We need some kind of load balancer between the app server and the DB which identifies which DB machine contains the required data.
- Hence this DB LB is always stateful.
- <u>Sharding</u> is the process of distributing a large dataset across multiple DB machines.
:::info
:information_source: **Note** -
:::
We use the same consistent hashing ring (CHR) not only to fetch data from the right DB machine but also to decide where to store the data.
:::info
**If DB machine goes down**
:::
- In this case we keep a data replica (backup).
- This is called <u>*master-slave replication*</u> (details below).
:::info
**If DB-LB machine goes down**
:::
- Backup LBs are provided by services like AWS.
- They split the requests across the LBs as well.
- Details below.
:::warning
:warning: There is an extra network hop between the server and the DB, which increases latency.
:::
### Latency Comparison Numbers (~2012)
| Operation | Latency (ns) | Latency (us) | Latency (ms) | Notes |
|------------------------------------------|-----------------|--------------|--------------|--------------------------------------|
| L1 cache reference | 0.5 | – | – | – |
| Branch mispredict | 5 | – | – | – |
| L2 cache reference | 7 | – | – | 14× L1 cache |
| Mutex lock/unlock | 25 | – | – | – |
| Main memory reference | 100 | – | – | 20× L2 cache, 200× L1 cache |
| Compress 1K bytes with Zippy | 3,000 | 3 | – | – |
| Send 1K bytes over 1 Gbps network | 10,000 | 10 | – | – |
| Read 4K randomly from SSD* | 150,000 | 150 | – | ~1 GB/sec SSD |
| Read 1 MB sequentially from memory | 250,000 | 250 | – | – |
| Round trip within same datacenter | 500,000 | 500 | – | – |
| Read 1 MB sequentially from SSD* | 1,000,000 | 1,000 | 1 | ~1 GB/sec SSD, 4× memory |
| Disk seek | 10,000,000 | 10,000 | 10 | 20× datacenter roundtrip |
| Read 1 MB sequentially from disk | 20,000,000 | 20,000 | 20 | 80× memory, 20× SSD |
| Send packet CA→Netherlands→CA | 150,000,000 | 150,000 | 150 | – |
**Notes**
- 1 ns = 10⁻⁹ seconds
- 1 µs (us) = 10⁻⁶ seconds = 1,000 ns
- 1 ms = 10⁻³ seconds = 1,000 µs = 1,000,000 ns
:::success
**Solution -**
:::
- RAM is faster than HDD or SSD
- Prefetch data based on a guesstimate and store it in the server's RAM
- For the next request, since the LB is stateful, the request goes to the same server as before
- The data needed is then fetched from RAM
- This saves further time, as the network hop to the DB is also eliminated
- We need to figure out which users' data should be kept in RAM
**Caching is the process of storing frequently accessed data close to the user.**
- The above is one example of caching.
- There are other types of caches.
- Examples of some other caches:
- If some data is needed by all servers we store it in a global cache
- IP address/DNS caches are stored in the router/browser
[Continued](#CAP-Theorem-and-Master-Slave)
## Type of Caching
- Client side Cache | Browser Cache
- Local or Application Server Cache
- Content Delivery Networks (CDN)
- Global cache
### Client side Cache | Browser Cache
- The application uses the internet to fetch the data
- It then stores the data in the user's local application cache
- Examples -
- Facebook posts already loaded can be seen for some time even if the internet is disconnected
- Buffered video data
- Google typeahead suggestions
### Local Cache
- Cache maintained locally on each server
- Same as the client-side cache in all other regards
### Content Delivery Networks (CDN)
- For media content: videos, images, files, live TV or blobs
- This type of data is large
- On a poor connection, the text loads but the images take time
- A CDN is infrastructure built to deliver this data in the best possible way
- Akamai, Cloudflare and CloudFront are companies/services that provide CDNs, for clients like Hotstar
- The data needs to be static; for example, a video once uploaded does not get updated.
- Used in client-facing use cases
### Global Cache
- Redis and Memcached are examples of global cache software
- Loosely, you can consider this external RAM
- These are specialised servers used to store frequently accessed data
- Instead of duplicating it across multiple DB machines, we maintain general data that is common to all users/services in one place
- This cache is not limited to the RAM of a single app server
- The cache can be made more distributed and its capacity increased (i.e. horizontally scaled) without disturbing the app servers
## Cache Invalidation Strategies
- A cache can get stale
- We need to invalidate/expire the cache
<u>*TTL : Time to live*</u>
- A time limit is specified, after which the cached entry expires
- We need to go back to the DB for fresh data every TTL cycle
- The system is eventually consistent (a cache-aside sketch with TTL follows)
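A minimal cache-aside sketch with TTL (an assumption-level illustration, not a specific cache product): a hit is served from the dict, a miss or expired entry goes to the DB and refills the cache, so data is at most one TTL cycle stale.
```python
import time

TTL_SECONDS = 60
cache = {}  # key -> (value, expiry_timestamp)

def db_read(key):
    # stand-in for the real database call
    return f"value-of-{key}"

def get(key):
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                          # cache HIT
    value = db_read(key)                         # cache MISS / expired entry
    cache[key] = (value, time.time() + TTL_SECONDS)
    return value
```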
## Cache Writing Strategies
Decides the write behaviour of the cache
### Write Through Cache (Sync)
- Any update is first written to the cache
- Once that succeeds and the response is received by the server
- It is then updated in the DB
- Only data already present in the cache is updated in the cache; the rest is written directly to the DB
**Flow -**
```mermaid
flowchart TD
A[App Server]-->|write|B[Cache]
B[Cache]-->C{Write Success}
C{Write Success}-->|no|D[END]
C{Write Success}-->|yes|E[(Data Base)]
E[(Data Base)]-->|Acknowledgment|A[App Server]
```
::: warning
:warning: Limitations
:::
- Increased latency
- If the response is lost and never received, the data can end up inconsistent
- If the cache goes down before the DB is updated, the data is lost permanently
### Write Back Cache
- Any update is first written in Cache
- Confirmation to user is given
- It is updated in DB periodically
```mermaid
flowchart LR;
A[App Server]-->B[Cache];
B[Cache]-->|Ack|A[App Server];
B[Cache]-->|Periodic|C[(Data Base)];
```
:::success
**Advantages -**
:::
- Faster write operation
- Low latency
:::danger
**Disadvantages -**
:::
- There is a chance of permanent data loss if the cache goes down before the periodic DB update
### Write Around Cache
- Update the DB
- Periodically update the cache
- Similar to the TTL approach
- No need for TTL
```mermaid
flowchart LR;
A[App Server]-->C[(Data Base)];
C[(Data Base)]-->|Ack|A[App Server];
C[(Data Base)]-->|Periodic|B[Cache];
```
## Cache Eviction Strategies
- Strategies to evict unused entries from the cache
### 1. Least Recently Used (LRU)
- Data which has not been used recently is removed (see the sketch below)
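A minimal LRU sketch (illustrative, not a specific library): an `OrderedDict` keeps keys in recency order and the least recently used key is evicted once capacity is exceeded.
```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the least recently used key
```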
### 2. Least Frequently Used
- Data which have been used least number of time is removed
### 3. First in First Out
- Data added first is removed first
### 4. Last in First Out
### 5. Most Frequently Used (MFU)
---
# CAP Theorem and Master Slave
- Sharding
- Replication
- CAP Theorem
- PACELC Theorem
## Sharding Cont
[Previous](#Sharding-Part-1)
- The process of distributing the database across multiple servers (machines)
- Every shard contains mutually exclusive and collectively exhaustive data
- For the data in one shard, that shard is a single point of failure
- Hence we need a replica of the data in each shard, which is called replication
:::spoiler <u>Code Deployment Strategy (Not in Scope)</u>
This is configured using Elastic Beanstalk
Rolling deployment strategy -
- Remove a server from the load balancer
- Deploy the code on that server
- Test, and if OK, reconnect the server to the load balancer
- Repeat for all machines
- This is called a rolling deployment
:::
### Sharding Data distribution strategy
- Data is evenly distributed
- The most common queries should be intra-shard queries
- Some strategies discussed in Sharding part 1
- Example consistent hashing ring (CHR)
### Selecting Sharding Key
- In Sharding Part 1 we used the user ID to shard the data, but that is not suitable every time.
- <u>*Example: Amazon*</u> -> if we use the product ID for sharding, a query for a whole category will FAN OUT
- Sharding by seller ID or location will also lead to fan-out queries
- Category ID leads to the most optimized sharding
[Continued](#Sharding-Key-Selection)
## Replication
- We keep a replica of each shard to eliminate the single point of failure
- These replicas can also be used to distribute read requests
### Master Slave Relation
- Write requests are sent to a single machine (the master)
- This is called the master-slave relationship
- The master-to-slave data transfer process depends on the type of application
- You can have strong or eventual consistency
- Strongly consistent system
- The user always gets the latest, accurate data
- Eventually consistent system
- The user gets the latest accurate information after a sync cycle, not in real time
### Synchronous Master-Slave Shard Update
- Each slave server is synced one by one
- If the write is acknowledged before all slaves are synced, this is only an eventually consistent system
- To make it strongly consistent
- The write happens on the master.
- The sync happens with all slaves.
- Only once all slaves are synced does the user get a write-success response.
- If even one sync fails, a rollback happens.
- This increases latency.
### Asynchronous Master-Slave Shard Update
- The write happens on the master.
- A success response is returned immediately.
- The slaves are updated asynchronously every x time cycles.
- For example, if a coupon code is applied but the purchase is cancelled, it can take an hour for the coupon to become usable again.
- Faster write operations.
- If the master fails before the cycle that transfers the data to the slaves, those write operations are lost.
### Quorum Based Approach
- The write happens on the master.
- A success response is sent once the write has also happened on at least x slave machines (say at least 1 slave).
- Those 1 or x slaves follow the synchronous approach; the rest follow the asynchronous approach.
- Which slave machines to choose is decided by a leader-election algorithm (covered in Kafka-Zookeeper).
- Eventually consistent
- Fast writes
- Worst-case data-loss risk is managed (see the sketch below)
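A minimal quorum-write sketch, under the assumption that each replica exposes a `write()` call returning True on success (all names here are hypothetical): success is reported once the master plus at least `min_sync_acks` slaves have the data, and the remaining slaves are synced asynchronously.
```python
def quorum_write(key, value, master, slaves, min_sync_acks=1):
    if not master.write(key, value):
        return False                          # write failed on the master
    acks = 0
    async_pending = []
    for slave in slaves:
        if acks < min_sync_acks:
            if slave.write(key, value):       # synchronous replication
                acks += 1
        else:
            async_pending.append(slave)       # replicated later in background
    # async_pending slaves would be updated by a background sync job
    return acks >= min_sync_acks
```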
:::info
:information_source: Sharding and replication are handled out of the box by cloud service providers like AWS, but we still need to configure them; hence the understanding is needed.
:::
### CAP Theorem
**C** - Consistency
**A** - Availability
**P** - Partition tolerant (Partition => Network Partition)
#### <u>*Consistency -*</u>
- Data can be immediately or eventually consistent.
#### <u>*Availability -*</u>
- If the request and the server are healthy, but the server still denies the request for some internal reason, that situation counts as unavailability
- This system is considered to be a system with low availability.
#### <u>*Partition tolerant-*</u>
- If two machines cannot talk to each other, they are partitioned
- A system is said to be partition tolerant if it continues to function even when there is a partition in the system.
```graphviz
graph Network {
rankdir = "LR";
subgraph cluster_0 {
style=filled;
color=lightgrey;
edge [len = 0.1];
node [style=filled,color=white, height = 0.25];
A0 -- A1;
A0 -- A2;
A1 -- A2;
}
subgraph cluster_1 {
node [style=filled];
color=white;
rankdir = "LR";
edge [len = 0.1];
node [style=filled,color=lightgrey, height = 0.25];
B0 -- B1;
B0 -- B2;
B1 -- B2;
}
A2 -- B1 [label = "Partition"]
}
```
### Partition Tolerant System
- If the link between partitions fails, one partition becomes disconnected from the rest of the world
- Partition here refers to both physical and logical partitions
- Partition tolerance means we design the code with the assumption that even if two machines are partitioned (not connected), they will eventually talk to each other
- After assuming eventual communication (partition tolerance), we have two options
- Highly consistent but less available
- Highly available but less consistent

#### <u>**CAP Theorem -**</u>
- Out of C, A and P, only two can be achieved
- All three cannot be achieved.
- There cannot be a system which is immediately consistent, highly available and partition tolerant (i.e. where eventual communication with a partitioned server is OK)
- In distributed systems a network partition cannot be avoided, so partition tolerance is a given; we must therefore choose between consistency and availability.
### A+P System
- Available and Partition Tolerant
- Even if there is a partition in the system, the system will continue to serve the request
- Eventual Consistent
### C+P System
- Immediately consistent and partition tolerant
- If there is a partition in the system, the system becomes unavailable and denies requests.
- Financial institutions generally follow this model
### C+A System
- Situations where the system is not distributed and all the data is stored on one single server
- There will be no partition in this case
- For example, a stock exchange (BSE/NSE/HFT) runs a few extremely powerful, supercomputer-like machines
- But these machines are not geographically distributed.
- They have multiple physical connections between them, so they behave like one giant single machine.
- Replicas exist, but they are not live in production; they are there to deal with situations like fire or other hazards.
### PACELC Theorem
**P** - Partition
**A** - Availability
**C** - Consistency
**E** - Else
**L** - Latency
**C** - Consistency
**ELC** - Else (when there is no partition), there is a compromise/trade-off between consistency and latency
# SQL vs NoSQL
## ACID
**A** - Atomic
Every transaction should be atomic, i.e. it should execute completely or not execute at all. The data should not be in an intermediate or partial state at any point in time.
**C** - Consistency
DB consistency
[ACID Consistency <> CAP Consistency](#SQL-Brief)
**I** - Isolation
Transactions running simultaneously should not interfere with each other
**D** - Durability
Once committed, data should be persisted on disk (HDD/SSD)
## Cardinality
Relations between tables/entities
- One-to-one (primary key)
- One-to-many (Foreign key: foreign key is stored in the “many” side Table)
- many-to-one (Foreign key: foreign key is stored in the “many” side Table)
- many-to-many (Extra table for linking both many side tables)
## SQL Brief
- Structured Query language
- SQL is language to query a relational database.
- RDBMS = Relational database management systems
- These database follow relational algebra
- In these data base we keep the data in the form of tables
- These tables will be related to each other.
### Strength of RDBMS
- Mature systems with plenty of support, as they have existed for a very long time
- Data is kept in a normalised state, i.e. a state that prevents redundancy and anomalies in the database
- Normalisation eliminates redundancy using principles like primary keys, foreign keys and their relations
- Normalisation eliminates anomalies through rules like cascade logic around delete, insert etc.
- Together these guarantees are referred to as the [ACID principles](#ACID)
- Strong/Fixed schema design and [cardinality](#Cardinality)
:::warning
- It is possible to alter a table (add/remove a column), but it is very slow because it requires a table rewrite: each row is stored contiguously, so to add a column a new, larger contiguous area is allocated and the old data is migrated to the new table with the extra column.
:::
:::spoiler ACID VS CAP Consistency
| ACID | CAP |
|--------------------------------------|--------------------------------------|
| Db Consistency is intact | No Stale Read |
| Applicable for Single node machine | Applicable for Distributed node system |
| It talks about data integrity | it talks about data sync in a network |
:::
### Weakness of SQL
- All the strengths become weaknesses at very large scale
- Large scale means highly distributed systems on a cloud/network with sharded data
- In a sharded DB system, ACID does not work properly and becomes extremely slow
- Example: the Amazon product DB
- 10K+ categories
- Each category has a different set of attributes
- A fixed schema design is not possible
- Even if we create max(len(product.attributes())) columns, a lot of space is wasted.
## NoSQL
- Not only SQL
- Non relational databases
- It is a family of database
### BASE
**B**asically **A**vailable (Highly available)
**S**oft State
**E**ventual Consistency
- BASE means they are A+P systems
- Transactions can be in a partial state; these are called distributed transactions ([SAGA Pattern](#SAGA-Patterns))
### Horizontally Scalable
- They are designed to be horizontally scalable and to shard automatically out of the box, while MySQL requires manual sharding.
### Denormalisations and Replication
Denormalised data means related data is stored together, instead of being fetched from multiple places, combined and then shown at query time. But this leads to redundancy.
### Weakness of NoSQL
- No joins
- Aggregation and grouping are complex
- No ACID guarantees
- Transactions can be left in partial states
### NoSQL Internals
### How SQL and NoSQL stores data internally
**In SQL,**
- Based on the schema (columns), fixed memory is allocated for each row; in the example below each row is allocated approximately 100 bytes
| ID | Title | Description | Price |
| ------ | ------- | ----------- | --- |
| 8 Byte | 20 Byte | 50 Byte | 8 Byte |
- The data structure used is a B+ Tree
- Unused memory is filled with null data
- But once, for example, the description allocation is set to 50 bytes, anything longer than this is truncated
- The benefit of this method is that the starting offset of each column is fixed, which gives quick access; reading a column of a row happens in O(1) time
- The disadvantage is that schema changes need the creation of a new table and migration of the existing data from the old table to the new one
- $TC_{write} = O(logN)$
- $TC_{read} = O(logN)$ where N is no of entries
**In NoSQL,**
- There is no schema and the data is unstructured
- The space allocated to an attribute can be changed later
- This is achieved using the following process
- Assume a K:V type database.
- A write-ahead log (WAL) is created
- This file is append-only
- The file maintains a pointer which represents the end of file (EOF)
- For creation and update, the new value is appended
- For deletion, the key with a null value (a tombstone) is appended
- For a read, we return the last value of the needed key encountered in the log
- Iterating from the end is not possible because reads happen in increasing order of memory address
- The disk is divided into sectors and each sector is further divided into blocks
- The disk rotates in one direction and the head sits on a block, hence the read happens only in one direction
- Hence reads can only happen in increasing order of memory address
- In case of using only the WAL,
- $TC_{write} = O(1)$
- $TC_{read} = O(N)$ where N is the number of entries
- In combination with the WAL, we store a hashmap which holds the latest address of each key
- Since a hashmap does not keep duplicate keys, the address of the key is simply overwritten on update
```plaintext
+-------------+--------------++-------------+--------------+
| RAM || HDD |
| +-----v-----+ || +-----v-----+ |
| | Hash Map | || | WAL | |
| +-----------+ || +-----------+ |
| || |
+-------------+--------------++-------------+--------------+
```
:::warning
- $TC_{write} = O(1)$ for both hashmap and WAL
- $TC_{read} = O(1)$: getting the key from the hashmap and going to an address in the WAL to get the data is O(1)
:::
:::danger
- Since the hashmap is a data structure stored in RAM, if the machine restarts the hashmap needs to be rebuilt
- Since the hashmap holds all the keys in RAM, the RAM requirement is high and the system is overall costlier (a small sketch of this WAL + hashmap layout follows).
:::
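A minimal sketch (an illustration, not a real storage engine) of the append-only WAL plus the in-RAM hashmap that stores the latest file offset of each key.
```python
import os

class KVStore:
    def __init__(self, path="data.wal"):
        self.path = path
        self.index = {}                        # key -> offset of latest record
        self.f = open(path, "a+b")

    def set(self, key: str, value: str):
        offset = self.f.seek(0, os.SEEK_END)   # append only: O(1) write
        self.f.write(f"{key}={value}\n".encode())
        self.f.flush()
        self.index[key] = offset               # old offset is overwritten

    def delete(self, key: str):
        self.set(key, "null")                  # tombstone record

    def get(self, key: str):
        if key not in self.index:
            return None
        self.f.seek(self.index[key])           # O(1) lookup via the hashmap
        k, v = self.f.readline().decode().rstrip("\n").split("=", 1)
        return None if v == "null" else v
```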
### LSM Tree
- We ==replace the hashmap with a tree map==; the keys stay sorted because that is the property of a tree map.
- Also, instead of the address of the key in the WAL, we save the value of the key in the tree map itself, along with storing it in the WAL
- After, let's say, 1 hour we back up the tree map to HDD
- After, let's say, 24 hours we ==merge the tree map backups on HDD into one file== in sorted order. In this scenario we have at most 25 files. This is called ==**Compaction**==
- This can be done using a two-pointer approach, taking the latest value when both pointers are at the same key, and the tree map in RAM is reset (a merge sketch follows the diagram below)
- For writes nothing changes: $TC_{write} = O(logN)$
- For reads,
- If the tree map contains the key and value, we get the value directly from RAM, $TC_{read} = O(logN)$
- If the tree map does not contain the key, we start ==reading the backup files from the latest file==
- The backup data is ==sorted==, but since we store both key and value in the tree map backup, the size of each cell is not the same, hence we cannot use ==binary search== to get the value.
- Split the backup data into logical blocks of, say, 64 bytes
- In RAM, maintain a very small piece of metadata (a hashmap) storing the block number and the first key in that block
- If a key's data spills over into the next block, we do not treat the spilled key as that block's first key; the first key is the next key whose data starts in that block.
- In case of a system shutdown, this hashmap can be rebuilt
- Now that this index is ==sorted== and has ==same-size cells==, we can use ==binary search== to get the block.
- $TC_{read} = \text{(no. of files)} \times O(log(N))$ where N is the number of blocks
- Since the number of files is bounded, $TC_{read} = O(log(N))$
- This data structure is called an LSM tree
- Log file = WAL or log file or commit file or transaction file
- Sorted strings = tree map backups or sorted files
- Merge = 2-pointer merging
- Tree map = balanced binary search tree (tree map) in RAM
- ==LSM Trees = Log File (WAL) + Sorted Strings + Merge + Tree Map==
```plaintext
LSM Tree Structure
+-------------+------------------++-------------+--------------+
| RAM || HDD |
| +---v---++--------v------+ || +--v--+ +-----v-----+ |
| | Tree || HashMap | || | WAL | | Tree Map | |
| | MAP || Block:FirstKey| || | | | Back Up | |
| +-------++---------------+ || +-----+ +-----------+ |
| || |
+---------------+----------------++-------------+--------------+
```
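A minimal compaction sketch: two sorted backup files are merged with two pointers, keeping the newer value when both contain the same key (the "Merge" step named above). The data in the usage line is made up.
```python
def compact(older, newer):
    """older/newer: lists of (key, value) sorted by key. Returns the merged list."""
    merged, i, j = [], 0, 0
    while i < len(older) and j < len(newer):
        if older[i][0] < newer[j][0]:
            merged.append(older[i]); i += 1
        elif older[i][0] > newer[j][0]:
            merged.append(newer[j]); j += 1
        else:                                   # same key: the newer file wins
            merged.append(newer[j]); i += 1; j += 1
    merged.extend(older[i:])
    merged.extend(newer[j:])
    return merged

print(compact([("a", 1), ("c", 3)], [("a", 9), ("b", 2)]))
# [('a', 9), ('b', 2), ('c', 3)]
```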
### LSM Tree Mind Map

### LSM Tree Read write flow

## Type of NoSQL DB
- Key Value Pair
- Document Based
- Column Family/ Wide Column DB
- Graph DB
- File storage
- S3
### Key-Value Pair
- Simple NoSQL DB
- Giant distributed hashmap
- DynamoDB and Redis are examples; Redis is in-memory with optional persistence if it is to be used as a (NoSQL) DB
**Structure**
Key : Value, where the value can be an Array/JSON/Set/SortedSet/etc.
**Function**
- get(key)
- set(key,Value)
:::success
Pros
:::
Simple
Fast (In Memory)
:::danger
Cons
:::
No relations
No complex Query
No Indexes
**Sharding Key** - Key
**Primary Key** - Key
Masterclass example - gaming leaderboard
### Document Based NoSQL
Most popular examples: MongoDB, CouchDB, CockroachDB, Elasticsearch
- doc_id : document.json/jsonb
- doc_id is generally a UUIDv4 and is not provided by us
- This is different from key-value stores as it ==allows== level-1 (top-level) attributes to be used for ==indexing==
:::warning
Indexes are maintained locally; there is no global index
:::
- This means that docs in one shard are indexed independently of other shards.
- For example, say we sharded using doc_id.
- Docs in each shard are indexed, let's say on category.
- So a query to get docs by category will fan out.
- But within a shard, since the docs are indexed, the docs from that shard are fetched very quickly.
:::danger
Cons
:::
- No relations
- No global index
- No ACID; where available, it is very slow
**Sharding Key** - doc_id (the default; can be configured to a top-level attribute)
**Primary Key** - doc_id
### Column Family/ Wide Column DB

- A particular attribute (column) of all entries is stored together
- Still in a tabular format
- This is used for aggregate queries
- Very fast writes; generally used when we have a write-heavy system
Examples: Cassandra, HBase
**Time-series databases are a sub-category of column-family DBs**
---
# Sharding Part 3
## Sharding Key Selection
- The key used to decide how to distribute the data across multiple DB machines
- It is also used to route queries to the right DB machine while fetching the data
- The primary key is not necessarily the sharding key; e.g. category ID can be used to shard products in a case like Amazon
- It can be a composite key
## Properties of a Sharding Key
1. Equal load distribution
2. Frequently used read queries should, as far as possible, be intra-shard queries, i.e. no FAN OUTs
- The most frequently run queries should hit only 1 or 2 servers
3. It should have high cardinality
- Example: if age in the range (0, 125) is used as the sharding key
- It limits the number of shards to 126; a 127th server would not get traffic
- With millions of users, 126 servers are not sufficient
- Hence the sharding key must have high cardinality
4. The sharding key should be part of both read and write requests
### Sharding Key of Banking System
- User can have accounts across cities
- Most frequent Queries -
- Balance
- Transaction History
- List of accounts
- New Transaction
#### Analysis
- Balance belongs to a single user
- A user can have accounts in multiple cities, so sharding by location will fan out
- The transaction ID is not part of the create-transaction request because it is generated later.
- Transaction history also belongs to a user
- A user can have accounts in multiple cities, so sharding by location will fan out
- The list of accounts belongs to a user
- A user can have accounts in multiple cities, so sharding by location will fan out
- A user can have multiple accounts, so sharding by account ID would make list-of-accounts and balance queries FAN OUT
- A new transaction is made by a user under one of the user's accounts
- The transaction ID is not part of the request because it is generated later
**USER_ID is the best option for sharding the data**
### Sharding Key of UBER
- Most frequent query -
- Nearby Driver by vehicle type
#### Analysis
- Driver ID cannot be used, as we might need to go to all DB machines to find nearby drivers
- User ID has the same problem as driver ID
- **City ID** - all the drivers of a city are in one shard, so it is simple to retrieve nearby drivers
### Sharding Key of IRCTC
- Prevent double booking
- Cater peak load during TATKAL
- 20X traffic
#### Analysis
- Ticket ID - not part of the request
- Time of booking - not relevant, changes continuously
- User_id - FANS OUT during the TATKAL window while checking seat availability
- User 1 is booking a ticket for day X of train 1
- Users 2, 3, 5 have already booked for day X of train 1
- To check availability, every server would need to be queried, because all data, including booked seats, is split based on user ID
- **Train_trip ID** - all operations relate to one train trip, hence shard all data using it.
- We can archive data of older train trips which have already departed
### Sharding Key of Facebook Messenger
- Send and receive messages
#### Analysis
- Message_id - not part of the request (most probably), or it would FAN OUT
- User_ID - in a chat both users A and B are involved, so store the message on both the shard designated to A and the shard designated to B
- Both sent and received messages for one user are then available on one machine
- The downside is data duplication across servers
:::info
For group chat Use Group ID
:::
# System Design Flow
## MVP (Minimum Viable Products)
The minimal set of features to start with, called core features
## Scale Estimation (Guesstimations)
- Number of users
- Number of read and write request
- Read heavy or write heavy system or read-write heavy system
- Amount of data (Sharding)
## Design Tradeoffs/ Goals
- Consistency vs availability (CAP)
- Latency (PACELC)
- Stateful vs stateless (caching)
- Type of caching and config
- Type of database (SQL or NoSQL)
## Final Design deep dive
- API
- System diagram (Architectural diagram)
- Data flow
:::info
tools that can be used - draw.io or excalidraw
:::
# Zookeeper and Kafka
Kafka is a message queue
## Message Queue Engine
One-to-one communication:
```mermaid
flowchart LR
    A[Service A]-->|1 Request|B[Service B]
    B-->|2 Response|A
    A-->|3 Action after response|C@{shape: processes, label: "Processes"}
```
## Asynchronous communication
- The client makes a call, e.g. upload a video
- The upload service receives the video and sends an acknowledgment to the client that it will take xyz time
- It uploads the video to S3 and creates an event in the queue
- Services that add events to the queue are called ==**producers**==
- Services like add-captions, translate etc. take the video event from the queue and work on it
- Those that receive from the queue are called ==**consumers**==
- The benefit of a queue over direct calls is that the producer does not need to wait for every consumer to acknowledge one by one; the producer is free to take on new tasks
- This model is the publisher-subscriber ==**(pub-sub) model**==
- All the required acknowledgements and notifications are handled within this model
## Pros of async communication
- In this model services are not tightly coupled.
- This helps make services scalable.
- Adds fault tolerance between services; it can handle pressure in situations like a sudden surge in requests, as the queue acts as a buffer and stops a microservice from crashing
## Use Cases
- Communication b/w microservices
- Async communication
## Cons
- Synchronous communication is not possible
- Can add a little latency
- Requires additional infrastructure
## Topics and partitions
- Let's say an order-placed event should trigger both the notification and the invoice service
- An order-cancel event should trigger only notification and not invoice
- This is handled by topics
- A topic is a category or type of event
- For example, notification and invoicing subscribe to order-placed, but only notification subscribes to order-cancelled
- i.e. there is a topic called order-placed
- If the volume of one topic becomes huge, it is split (sharded) across multiple machines. These splits are called **partitions**
:::warning
- One message/event can only be read by one consumer within a ==consumer group==
- This is to ensure strict message ordering and avoid race conditions, as locks would be counterproductive
- In this setup, excess consumers sit idle
- So keep **no. of partitions >= no. of consumers**
- Kafka provides strict message ordering within one partition
:::
:::success
- One message can be read by different services in different consumer groups
:::
## Type of delivery Guarantees
- At least once
- At most once
- Exactly once
### Consumer offset
When a consumer consumes a message, it notifies Kafka that it is done with this message
This is called ==committing the offset==
**Read => Commit => Process**
- Committing is the consumer's responsibility
- The consumer would need to be crash resilient; if it crashes after committing but before processing, the message is lost
- This way you get an ==at-most-once delivery guarantee== (without turning off idempotency)
**Read => Process => Commit**
- If the consumer crashes after processing but before committing, the message will be processed twice
- This is the ==at-least-once delivery guarantee==
**Kafka Streams**
- The idea applies to event-driven pipelines
- Someone writes an event; it is consumed and processed; after processing, the processor adds another event to the queue
- Someone else consumes that event, and so on: it is a stream of events
- If we emit the result and commit the offset in the same transaction, we get ==exactly-once delivery.==
- We need to code the consumer in such a manner, or use Streams, to get exactly-once delivery (a consumer sketch follows)
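A hedged sketch using the kafka-python client (assumptions: a broker at localhost:9092, a topic named "orders", and a hypothetical `process()` function): committing after processing gives at-least-once; committing before processing would give at-most-once.
```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="invoice-service",
    enable_auto_commit=False,        # we control when the offset is committed
)

for message in consumer:
    process(message.value)           # hypothetical business logic
    consumer.commit()                # Read => Process => Commit (at-least-once)
    # committing *before* process(...) would instead be Read => Commit => Process
```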
## Brokers
- A server or machine
- We need brokers for sharding and replication
- Kafka replicates in a master-slave configuration
- There are multiple brokers
- There are multiple topics and each topic has multiple partitions
- Partitions of the same topic are stored on different brokers
- A replica of each topic-partition is also saved on a different broker
- No two brokers are identical, as each broker holds a different combination of topic-partition master/slave copies
- This mixing ensures that when a broker goes down due to overloading, there is no identical broker carrying the same load that would also be at high risk of going down
- To manage all this we have Zookeeper in the architecture
## Zookeeper
Zookeeper maintains all the configuration, like what to do if a broker goes down, and which broker is running the master or slave of which topic-partition.
### ZNodes (Files and directories)
**Persistent Files and directories (Znodes)** - Normal files and directories
**Ephemeral files and directories (Znodes)** -
- Zookeeper maintains a persistent connection (session) with the server that owns the znode
- If something changes in the file, Zookeeper immediately notifies the watching servers
- A server can own a znode
- As long as the owning server is alive, the file remains
- The moment the server disconnects, the file is deleted by Zookeeper
- These types of znodes are called ephemeral files or directories
- An ephemeral znode has exactly one owner (server)
- Only the owner has modify access to these files
### Usecase of File storage and watch (Zookeeper)
- We can also store small files in Zookeeper
- We can store data, for example an API key
- On application server start, the server can fetch the API key string and call a function that keeps a watch on the file (znode)
- We then have a watch set on this znode
- Whenever the watched file (znode) changes, the required function is called.
- This ensures all servers will have updated API keys
- Whenever a change happens centrally, all servers get the updated key.
```python
api_key = str()

def on_key_update(value):
    # callback invoked by the watch whenever /api_key changes
    global api_key
    api_key = value

def fetch_key():
    # pseudocode: register a watch on the /api_key znode
    zookeeperClient.setWatch("/api_key", on_key_update)

if onServerStart:
    fetch_key()
```
### Zookeeper leader election Algorithm
This can be done in the following manner -
- The master-slave config lives in a folder.
- That folder has files like master_ip and slave_ip; these files contain IP addresses
- Each file is owned by the server (master or slave) whose IP address is saved in it
- Slaves set a watch on the master_ip file; all slaves know who the master is and can connect to the master IP address saved in master_ip to ==sync==
**Election -**
- When the master crashes, its heartbeat fails.
- Since master_ip was owned by the master, it was an ephemeral znode and is deleted.
- Since all slaves were watching the file, Zookeeper sends them a notification that the file has been deleted
- All slaves stop syncing with the master and race to become the new master.
- Whichever slave writes the master_ip file first becomes the master.
- Since the file is owned by the new master server and only it can edit it, the other servers' write requests are rejected.
- This new master then syncs with the other slaves.
- Depending on the consistency requirement, the sync can happen in the background while the master starts accepting writes, or the master waits for the sync to finish.
- Since all other slaves already have a watch on the master_ip path, Zookeeper sends a notification that the file has been updated. **Note - the watch on the path never gets deleted, only the file does.**
- Now all slaves are connected to the new master (a kazoo-based sketch follows)
==Zookeeper has a limit of around 200,000 partitions==
Latest versions of Kafka are migrating to Raft (KRaft)
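A hedged sketch of the election flow above using the kazoo client (assumptions: a Zookeeper at localhost:2181, the znode path /master_ip, and no error handling): whoever creates the ephemeral znode first becomes the master, and everyone watches the path.
```python
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

MY_IP = b"10.0.0.7"                   # this server's address (made-up)
zk = KazooClient(hosts="localhost:2181")
zk.start()

def try_become_master():
    try:
        # ephemeral: deleted automatically if this server's session dies
        zk.create("/master_ip", MY_IP, ephemeral=True)
        print("I am the master")
    except NodeExistsError:
        print("Another server is already the master; staying a slave")

def on_master_change(event):
    # called when /master_ip is deleted (master crashed) or changed
    try_become_master()
    zk.exists("/master_ip", watch=on_master_change)   # re-register the watch

try_become_master()
zk.exists("/master_ip", watch=on_master_change)
```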
### Key-Sharding-Replication (Pushing an event)
- Each event is for a specific topic and carries a key.
- The partition is decided by applying the [Modulo Algorithm](#Modulo-Algorithm) (from the load distribution methods) to the key (**sharding**)
- The brokers that hold the master and slaves of that partition are identified via the manager (Zookeeper)
- For writing to the master and slaves, Kafka uses the [Quorum method](#Quorum-Based-Approach)
- **Event posting is idempotent by default; it can be disabled**
[More on Partitions](https://blogsarchive.apache.org/kafka/entry/apache-kafka-supports-more-partitions)
[Delivery Semantics](https://kafka.apache.org/documentation/#semantics:~:text=Since%200.11.0.0%2C%20the,of%20them%20are.)
[Documentation](https://kafka.apache.org/documentation/#persistence)
[Persistence](https://kafka.apache.org/documentation/#persistence)
[Broker Configs Documentation](https://kafka.apache.org/documentation/#brokerconfigs)
[Topic Config Documentation](https://kafka.apache.org/documentation/#topicconfigs)
[Replication Documentation](https://kafka.apache.org/documentation/#replication)
### Other queues
- Kafka
- AWS SQS - managed (vs. self-hosting Kafka), costly
- RabbitMQ
# Elastic Search
## Full Text Search
For example, if we store job postings in SQL and have to search for "Senior java developer", a standard query is roughly O(N^3); and if we consider the various combinations such as "Senior Developer in Java", "Senior Developer - Java", "Developer Java (Senior)", the system will crash
Here we will use a document-based NoSQL database, where each posting is a document
And to optimize string search in these documents we will use inverted indexing
## Inverted Indexing
It is like the glossary at the end of a book: each word is listed with all the pages on which it appears.
If, out of a 1000-page book, the word developer is present on 20 pages, we only need to search those 20 pages
A normal index just tells us that this document/data is present at this index.
That index does not help reduce the search load (a small build-and-search sketch appears after the NLP spoiler below)
Example of the difference between an index and an inverted (reversed) index:
```python
a = [1, 2, 3, 4, 5, 6, 'x']
# forward index: position -> value
a[2]                       # == 3
# inverted index: value -> position
inverted = {v: i for i, v in enumerate(a)}
inverted['x']              # == 6
```
:::spoiler NLP Basic for Reversed Index
Natural language processing uses a word corpus
Word corpus example:
USA -- {US, United states, united states of America, america}
Basic processing steps
D1 = "India Defeated Australia in champions Trophy by chasing 250 runs in just 35 overs"
`Remove stop words`
D1.1 = "India Defeated Australia champions Trophy chasing 250 runs just 35 overs"
`Stemming / root words (other processes like lemmatization are also used)`
D1.2 = "India *defeat* Australia champions Trophy *chase* 250 *run* just 35 *over*"
`Tokenisation`
Token = [
'India:d1,d6,d10', 'defeat:d1,d7,d10',
...
'champions:d1,d11',
'champions trophy:d1d2,d6'
]
Tokens can be unigrams, bigrams and even trigrams, for example,
unigram - champions, trophy, semi, final
bigram - champions trophy, semi final
quadgram - champions trophy semi final
The document list for each token is ordered by popularity/relevancy score
Documents are ranked by relevancy score after intersecting the lists for the combination of all the words in the search
The algorithm used is TF-IDF (term frequency - inverse document frequency)
To find out the importance of a word with respect to a document,
we check the term frequency within that document (high TF) relative to how many documents contain the word overall (rare across documents, i.e. high IDF)
:::
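A minimal inverted-index sketch (an illustration, not Lucene): stop words are removed, each token maps to the set of documents containing it, and a search intersects those sets. The documents and stop-word list below are made up.
```python
from collections import defaultdict

STOP_WORDS = {"in", "the", "a", "of", "by"}

docs = {
    "d1": "India defeated Australia in the Champions Trophy",
    "d2": "Senior Java developer required in Bangalore",
}

def tokenize(text):
    return [w.lower() for w in text.split() if w.lower() not in STOP_WORDS]

inverted_index = defaultdict(set)            # token -> {doc ids}
for doc_id, text in docs.items():
    for token in tokenize(text):
        inverted_index[token].add(doc_id)

def search(query):
    token_sets = [inverted_index[t] for t in tokenize(query)]
    return set.intersection(*token_sets) if token_sets else set()

print(search("senior java developer"))      # {'d2'}
```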
All the inverted-indexing algorithms for full text search are implemented by an open source Java library, **APACHE LUCENE**. It needs the document list and word corpus to do all the necessary operations and give you the inverted index.
But Apache Lucene works on a single machine.
Elasticsearch is a database which supports full text search efficiently using inverted indexing on a distributed network; you can look at it as sharded Lucene.
*OpenSearch is managed Elasticsearch.*
**Sharding Key**
- For documents the sharding key is the document ID; for the inverted index, if we instead sharded by token (from the inverted-index entries):
- Tokenisation of a doc would fan out, hence inserting a doc would also become a fan-out query
- Search would also fan out
- So for both the documents and the inverted index we store data using the document sharding key
- This means all the inverted-index entries of a particular document are present on the same machine as the document itself.
- On that single machine we run Apache Lucene; each server runs Apache Lucene and holds its docs plus the inverted index of all the docs on that machine
- With this, insertion is a single-shard database query
- In this case we will have duplicate tokens indexed on separate servers
- For search we will have a fan-out query because a token's index entries can be on multiple machines
- Search will always be a fan-out query; since it is anyway not possible to have all the docs for a token on a single machine, this cannot be fully optimized
- This is mitigated using caches and by returning only the top results rather than all of them; the consistency of the result ordering also does not need to be high.
# Design S3 Database (File storage system)
Binary large object (blob) storage -
S3 (by AWS)
Blob Storage (by Azure)
HDFS - Hadoop Distributed File System
- It is a large file dump
- Generally the file is not modified; for example, in a video file the frames at a particular timestamp are not modified. The file remains one whole unit (version management is a different game altogether)
- Most interactions are with the CDN and not S3; even if the data is not on the CDN and the query reaches S3, the first few parts are fetched from S3 while, in the background, the rest of the file (e.g. the video) is pushed to the CDN for optimization.
## Expectation of File storage system
- Large file storage
- Durability
- Faster uploads and downloads
- Situations like the network going down during the upload or download of a large file should be handled
## Managing Expectation of File storage system
### Option 1 - Storing file as 1 unit
:::danger
Con
:::
- The file size is limited by the machine size
- Parallel upload/download is not possible
- If the network breaks, the upload/download restarts from scratch
:::success
Pro
:::
- Low metadata cost, as there is no need to maintain fragment information (start, end, sequence etc.)
- No need to collate/rearrange the fragments at the time of download.
### Option 2 - Storing file as fragmented units
:::danger
Con
:::
The pros of Option 1 become the cons here
:::success
Pro
:::
The cons of Option 1 become the pros here
:::warning
When fragmenting a large file
:::
- Too many fragments - managing becomes difficult
- Too few fragments - effectiveness of the pros of this method diminishes
- Fragments should be as distributed as possible
## HDFS - Hadoop Dist File storage
- Data Node - where the file fragments (blocks) are stored.
- Name Node - stores the metadata
- file_fragment.master/slave : data_node mapping
- This is not a big file, so there is no need to shard it (replication will still be there)
- It is suggested not to shard it; instead, create a new cluster of data-node and name-node servers
**Rack-Aware Strategy (Algo)**
- For the distribution of fragments (master/slave replicas)
- If the master copy is present on one storage rack, the slave copy will not be placed on the same rack
### Read Write flow

# Quad Tree and Notification - Uber
Consider following table -
```
create table restaurants (
    id bigint primary key,
    name varchar(50),
    address varchar(200),
    latitude real,
    longitude real
);
```
- We have indexes on lat and long
- Given a user's (lat, long), find the K nearest restaurants.
$distance^2 = (lat-lat_{user})^2+(long-long_{user})^2$
Method - 1 - Ordered list
```
select *, distance
from restaurants
order by distance asc
limit k
```
Time complexity = $O(NlogK)$
Method - 2 - All inside radius
```
select *, distance
from restaurants
where distance <=R
limit k
```
Time complexity = $O(N)$
Method - 3 - Inside a square 2L x 2L
```
select *
from restaurants
where lat between (latU - L) and (latU + L)
  and long between (longU - L) and (longU + L)
limit k
```
Time complexity = $O(N)$ (worst case)
- There are 2 indexed columns in the WHERE clause
- SQL can only use one index at a time and cannot use the 2 indexes together
- SQL will binary-search all restaurants that satisfy the lat condition
- Then it will binary-search all restaurants that satisfy the long condition
- SQL will then return the intersection of the two searches
Method - 4 - Grid method
- Divide the world into a grid of cells
- Whenever we insert a restaurant we also save its cell ID
- For a given user location, we can find the user's cell ID
- That cell ID can then be used to get the restaurants with the same cell ID
- cell_id is indexed, so the search is O(logN)
```
select *
from restaurants
where cell_id = get_cell_id(latU, longU)
```
Time complexity = $O(logN)$
:::warning
Con
:::
- If the user falls in an empty cell, no restaurant is returned, and as soon as we expand the area the number of cells to check increases
- If the user is at the edge of a cell, a restaurant in an adjacent cell might be closer than the restaurants in the user's own cell
## Quad Trees
- Gives a dynamically sized grid
- 2D index
- The complete area is the root node
- The root node has all the restaurants
- We apply a recursive function: if any node has more than N restaurants, we split the node into 4 nodes representing the 4 geographic quadrants of the area and move the restaurants to the leaf nodes.
- The recursive function does this again for the 4 created nodes until each node has fewer than N (say 10) restaurants.
- Search, Add, Del -> Time complexity = $O(log_4N)$, worst case $O(N)$
```python
SPLIT_THRESHOLD = 10  # max restaurants per leaf node

class Restaurant:
    id: int
    lat: float
    long: float
    # rest of the info lives in the DB table, not in the quad tree

class QuadTree:
    id: int
    # bounding box of this node
    top: float
    left: float
    bottom: float
    right: float
    # populated only for leaf nodes
    restaurants: list[Restaurant]
    # populated only for intermediate nodes
    children: list["QuadTree"]
    # root node diagonal corners are [{(0,0),(100,100)}]
    # first 4 children have co-ordinates
    # [{(0,0),(50,50)},{(0,50),(50,100)},{(50,0),(100,50)},{(50,50),(100,100)}]

    def find(self, lat: float, long: float) -> list[Restaurant]:
        '''Given the user's location, find the restaurants that fall in the same cell as this location'''
        if not (self.left <= lat < self.right and self.top <= long < self.bottom):
            return []  # user is out of the bounds of this node
        if not self.children:  # we are a leaf node
            return self.restaurants
        # recurse over children
        for child in self.children:
            restaurants = child.find(lat, long)
            if restaurants:
                return restaurants
        return []

    def insert_res(self, restaurant: Restaurant):
        '''Insert a restaurant'''
        if not (self.left <= restaurant.lat <= self.right and self.top <= restaurant.long <= self.bottom):
            return  # location out of the current node's bounds
        if not self.children:  # leaf node
            self.restaurants.append(restaurant)
            if len(self.restaurants) > SPLIT_THRESHOLD:
                # create 4 new children with the appropriate co-ordinates
                # and move the restaurants down to the children
                pass
            return
        # recurse over children
        for child in self.children:
            child.insert_res(restaurant)

    def del_res(self, restaurant: Restaurant):
        '''Delete a restaurant'''
        if not (self.left <= restaurant.lat <= self.right and self.top <= restaurant.long <= self.bottom):
            return  # location out of the current node's bounds
        if not self.children:  # leaf node
            if restaurant in self.restaurants:
                self.restaurants.remove(restaurant)
            # merge check: in the parent, if the total number of restaurants
            # across all 4 leaves drops below the threshold, move them up to
            # the parent and delete the leaf nodes (clear the parent's children)
            return
        # recurse over children
        for child in self.children:
            child.del_res(restaurant)
```
## Quad Tree Space
- $Restaurant_{size} = id + lat + long = 8 + 8 + 8 = 24 \text{ bytes}$
- $QuadTreeNode_{size} = id + children + coordinates = 8 + 8 \times 4 + 8 \times 4 = 72 \text{ bytes}$
- $Total Space = (24* N_{res}) + (72 * N_{nodes})$
- $N_{res} = 100 * 10^6$
- Assume each leaf node has 1 restaurant on average
- Leaf = $N_{nodes} = 100 * 10^6$
- Parent = $N_{nodes} = \displaystyle\frac{100 * 10^6}{4}$
- Grand Parent = $N_{nodes} = \displaystyle\frac{100 * 10^6}{4^2}$
- Great Grand Parent = $N_{nodes} = \displaystyle\frac{100 * 10^6}{4^3}$
- Root = $N_{nodes} = 1$
- This is a geometric progression
- $N_{nodes} = N_{res}\left(1 + \frac{1}{4} + \frac{1}{4^2} + \cdots\right) = \displaystyle\sum_{i=0}^{\infty} \frac{N_{res}}{4^i}$
- Sum of a GP when the common ratio < 1:
- $N_{nodes} = \displaystyle\frac{N_{res}}{1-\frac14} = \frac43 \times 100 \times 10^6 \approx 133 \times 10^6$
$Total Space = (24 \times 100 \times 10^6) + (72 \times 133 \times 10^6) \approx 12 \text{ GB}$
- Can be stored in RAM so we do not need sharding
- If the quad tree is lost it can be rebuilt
- $N = 100M$
- $N\log_4N \approx 100M \times 13.3 \approx 1.3 \times 10^9$ operations
- Can be done in a few seconds.
## Uber using Quad Tree
**MVP**
- Nearest driver allocation
- Dynamic pricing
**Tracking Driver**
- Drivers ping the server with their location
- We store the location in the DB
- It is also fed to the quad tree for nearest-driver lookups
**Scale Estimation**
$Request_{PS} \approx 500 \displaystyle \frac {trips}{sec}$
$LocationUpdate_{PS} = 8 \text{ million drivers} \times \frac{12 \text{ h online}}{24 \text{ h}} \times \frac{1}{10} \text{ update/sec} \approx 400{,}000 \text{ updates per sec}$
$NearestDriverQuery_{PS} = 30 \text{ million trips per day} \times 3 \text{ queries per trip} \approx 100 \text{ million queries per day}$
$QPS \approx \frac{100 \times 10^6}{86400} \approx 1000 \text{ queries per sec}$
$Write = LocationUpdate_{PS} \approx 400,000 \text{ update per sec}$
$Read = QPS = 1000 \text{ queries per sec}$
==$Write = 400 \text { X Read}$==
==**<u>Write-heavy system</u>**==
**Choice of Database and Sharding key**
- Driver location
- Wide column database for driver locations
- Keeps track of history; efficient time-based queries
- We need very fast writes as it is a write-heavy system
- Time-wise pagination
- We need aggregation queries
- **Sharding key**
- city_id => QuadTree
- user_id => wide col database
- User and Driver profile
- Can be SQL Database
- No full-text search or attribute search needed
- We will need to do atomic updates
- Not updated too often
- Relational, e.g. driver to car etc.
- The data is not large
- But for compliance documents:
- Unstructured, different for different cities
- Joins not needed
- Transactions not needed
- Analytics not needed
- **Sharding Key** - user_id
- Bookings data
- SQL Database
- Strong schema
- Relational with driver-rider-rating-payment
- Need atomic transaction
- **Sharding Key** - user_id
- Each booking should be stored in driver and rider shard
## Uber Flow design
**Basic Request Flow**
- Async request flow
- Booking entry will be created and will be stored in the DB, with driver id null
- Response will be sent with booking id
- It will search for driver in the background
- An event goes to the message queue under the topic assign-driver
- It is then consumed by multiple services
- Driver allotment, being one of the consumers, consumes this message
- Driver allotment sends a request for the nearest drivers to the nearest-driver server
- The nearest-driver server has the quad tree
- The nearest-driver server sends back the list of nearest drivers
- The driver allotment service sends notifications to drivers via the notification service
- When a driver accepts, the driver allocation service adds the driver to the booking entry in the DB
- The rider is notified using the notification service
- Notifications to drivers are sent one by one, or else the driver experience will be bad.
- Each notification has a timeout
**Nearby Driver Service**
- Optimize write queries
- Batch-wise writes (write-back cache)
- Eventually consistent
- Sampling
- No location update when idle; on the client side the same location should not be re-sent to the server (can have a tolerance of x metres and a fallback of x seconds)
- Locations of unavailable drivers (on a break, offline, or having already accepted a ride) should not be sent (only for the quad tree location write, not the DB or other services); a further optimisation applies when they are close to the destination (x-metre cap)
- Skip the quad tree update if the node boundary has not been crossed:
- When a driver becomes available, the driver sends a location to the quad tree
- The quad tree returns the node and the node boundaries
- This can be cached locally
- Each new location update is compared with the node boundary stored in the cache; only if it is crossed is the update sent to the quad tree
- Cache required = driver_id + 4 node boundaries = 40 bytes x 8 million = 320 MB, easy to cache
:::info
Grid Update
:::
- A cron job can be used for grid updates as per a scheduled timetable
- Instead of a hard condition for node splits we can have a soft split
- Split when a node exceeds 15, merge only when it drops below 5; this creates an allowed range
- Splitting/merging affects multiple drivers and takes time, as the tree needs to be restructured
## Notification Service
### **Long Polling**
- You keep pinging the server and pulling notifications as long as you are online
- This is not scalable
**Duplex Server Connection**
- Some application service is needed on the client side that receives notifications; it keeps a duplex connection to the server (WebSocket/webhook).
- The app might be running in the background and not in the foreground.
- The connection is persistent and does not terminate as long as you are online.
- Both mobile platforms allow apps to run in the background; for web browsers you can use service workers
- Once the connection is established the server can send notifications to the application.
### **Push Notification**
- Even if the ==application== is ==not running== in either the foreground or the background, a notification can still be sent.
- This is called a ==push notification==
- This is done via the operating system; your mobile device stays connected to a Google or Apple server
- We can send notifications via the operating system's server; APIs are available.
- The service is FCM (Firebase Cloud Messaging) for Android and APNs (Apple Push Notification service) for iOS.
- These connections are active sessions - they need to be sticky (stateful)
- The load balancer uses consistent hashing to assign the same servers
- The notification service might expose 2 APIs
- establishConnection(user_id) => connection
- notify(user_id, message) => acknowledgement
# Unique ID Generation + Rate Limiter
## Desired properties
- Universally Unique
- Collision across different entities does not matter
- Example same user_id = [1,2,3,4,5], order_id = [1,2,3,4,5]
- No collision of the same entity across shards is what "unique" means here, i.e., unique inside a table; this is not universally unique, which is still OK, but universal uniqueness is desirable.
- Strictly Increasing
- Lexicographically sortable
- 1, 3, 7, 1239, 2000 (increasing but not continuous, i.e., skipping is allowed)
- If IDs are sequential, inserting does not require rebalancing the index tree; it is just an append to the rightmost branch of the B+ tree
- Rebalancing the tree means reading and writing nodes that are stored randomly on disk, which results in random reads/writes, which are ~10000 times slower than sequential reads
- Random Sparse ID
- Dense IDs - 1,2,3,4,5 (every value valid), 1,3,5,7,9 (every second value valid)
- Sparse IDs - 1, 1001, 2001, 3001
- sparse but not random
- easy to guess
- **Sparse and random IDs** - 1, 1032, 2119, 5360 ...
- sparse and random
- still incremental
- difficult to guess, hence more secure (hard to mount an enumeration attack)
- Decentralized
- anyone even client can generate ids independently and still be unique without talking to each other
:::info
Never expose Primary key
- Dense and not random, so open to enumeration attacks
- In case of a database migration, URLs might break
:::
## Challenges
**Auto Increment - SQL**
- No Random sparsity
- not decentralised
- single server, single point of failure(SPoF)
**Replication**
- stale read when generating next id will result in non increasing id
- even if we use quorum method, we need immediate consistency which will increase latency
**Sharding**
shard 1 - 1,4,7,10,13
shard 2 - 2,5,8,11,14
shard 3 - 3,6,9,12,15
- In this case if one shard is down/slow the ids will be left behind the other 2 shards and hence once online the newer ids directed towards shard 3 will not have increasing id but ids lower then what previously generated by shard 1 and 2.
- Client still depend on SQL backend server
- Replication with algorithm still needed with high consistency
:::success
## Solution
:::
### **UUIDv4**
- Just a 128-bit random string
- The space is so large (122 random bits, ~$2^{122}$ possibilities) that the collision probability is negligible
but it is not strictly increasing
- A simple timestamp cannot be used because NTP (Network Time Protocol) might move time backwards (in the range of ~25 ms) while syncing with other computers
- this is because the rate at which quartz crystals measure time varies from machine to machine within manufacturing tolerances
- this drift is mitigated by NTP sync
- but the sync may also move time backwards if the machine's clock is running faster than the other machines, or than whatever reference NTP has set
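A quick illustration with Python's standard `uuid` module: the IDs are random, so sorting them has no relation to creation order.
```python=
import uuid

ids = [uuid.uuid4() for _ in range(3)]
for u in ids:
    print(u)   # e.g. 9f1c5c9e-...: 128 bits rendered as a 36-character hex string

# Random UUIDs carry no ordering information:
print(sorted(ids) == ids)   # usually False - sorted order != creation order
```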
### Twitter Snow flake
| Sign Bit | Time stamp | Datacentre ID | Machine ID | Sequence No. |
|----------|------------|----------------|------------|---------------|
| 1 Bit | 41 Bit | 5 Bit | 5 Bit | 12 Bits |
| 0 | 0-41 | 42-46 | 47-51 | 52-63 |
- A single bit is for the sign, always 0
- The timestamp starts from a custom epoch (4 Nov 2010), so it fits in 41 bits instead of 64
- Datacenter - 32 datacenters max possible
- Machines - 32 machines possible in each datacenter
- Sequence - IDs generated in the same millisecond just get an incremental sequence starting from 0 (12 bits, so 4096 max sequences per ms)
```python
import time
import threading

DATACENTER_ID = 12              # 5 bits -> max value 31
MACHINE_ID = 10                 # 5 bits -> max value 31
TIME_START_MS = 1288834974657   # custom epoch: 4 Nov 2010, in milliseconds

prev_timestamp_ms = -1
seq_no = 0
lock = threading.Lock()

def generate_id():
    global prev_timestamp_ms, seq_no
    with lock:  # thread safety
        timestamp_ms = int(time.time() * 1000) - TIME_START_MS
        if prev_timestamp_ms == timestamp_ms:
            # the millisecond has not changed: bump the sequence number
            seq_no = (seq_no + 1) & 0xFFF     # 12 bits -> wraps at 4096
        else:
            seq_no = 0
            prev_timestamp_ms = timestamp_ms
        id_flake = (seq_no                    # bits 0-11  (12 bits)
                    | MACHINE_ID << 12        # bits 12-16 (5 bits)
                    | DATACENTER_ID << 17     # bits 17-21 (5 bits)
                    | timestamp_ms << 22)     # bits 22-62 (41 bits)
        # bit 63 (the sign bit) stays 0
        return id_flake
```
### ULID/UUIDv6
- ULID: 48-bit millisecond timestamp + 80 random bits, lexicographically sortable thanks to its Crockford base32 encoding
- UUIDv6 (and the newer UUIDv7) put the timestamp in the most significant bits, so IDs sort by creation time while still being random and decentralized
# Micro Service
Scale discussed earlier -
- Data size
- QPS (read/write/avg/peak)
Other scale -
- Number of features
- Code size and complexity
- Team size
These scale factors influence the architecture and design choices.
## Monolithic Architecture
Popular definition -
- One giant database, usually a single DB, generally SQL
- All apps/services in a single project with one DB
- All modules in a single executable, but we can run multiple instances of that executable side by side in multiple processes (each process runs all the modules)
- We can have multiple servers with the same code base
- We can shard the DB
- Other databases are also OK, but at this scale SQL is preferable for various reasons.
:::success
Pro
:::
- Straightforward to develop and deploy
- Homogeneous web servers
- Predictable Performance
- In microservices each service is a different server, which adds network overhead; in a monolith all services communicate via function calls, so there is no network overhead.
- It is also simple to debug: a simple print or a debugger can be used
- Easier cross-cutting concerns like security, exception handling, monitoring, environment variables, connection pooling
- Logging is simpler: logs go into a single file
- Easy transactions, as locks are easy when all modules are part of the same process
- Expertise buildup
:::danger
Con
:::
- Waste of resources in scaling: for example, if the search service needs scaling, the support service will also be scaled
- One bug can cause all services to go down
- Technology lock-in: if one service needs ML it would be better to use Python, but if we then add another service that does video processing we cannot use C++ or Rust for it
- New developers need to understand a massive codebase
- All dependencies live in one environment: suppose libraries a and b are needed, which in turn need library x at different versions; this causes dependency incompatibility and breaks the codebase. The slang for it is "dependency hell"
- Slower experimentation
- Slower upgrades
- Mistakes compound; the slang for it is a "big ball of mud"
:::info
Kubernetes - config as code for the network architecture; with microservices we will have different Docker and server configs
:::
## Micro service
- A collection of small services, each running in its own process and communicating via lightweight mechanisms (message queues, REST, RPC)
- Each microservice implements a single business capability and can be developed and deployed independently. Each service manages its own data, has its own team, and can choose the tech stack best suited to the service
:::success
Pros
:::
* Service autonomy: build and scale each service independently; a memory-leak bug stays contained in one service.
* Loose coupling.
* Bounded contexts.
* Isolation.
* Single responsibility principle.
* Lightweight communication.
* Database per service can be implemented.
:::danger
Cons
:::
- Added complexity - instead of one app to manage we now have dozens of services
- Additional infrastructure
- Network overhead
- Need for failure handling - inter-service communication mechanisms must handle failures gracefully
- Difficult testing and debugging
- Data consistency - data is spread across many DBs
- The upfront cost is high and pays off only with sufficient scale
## Common Microservice Component
- Virtual private cloud VPC
- microservices
- DB
- message queue
- Untrusted network
- DDoS protection and Firewall Authentication
- Load balancer
- API gateway
For example, the SSL connection terminates at the edge of the untrusted network.
SSL is used when implementing HTTPS, i.e., communication over a secure protocol with encryption. If we kept decrypting and re-encrypting internally in the VPC it would be a huge overhead, so the encryption ends in the untrusted network layer; this is SSL termination.
### Reverse proxy
- It is a network architecture
- With a proxy, instead of sending the request directly to the server, the client sends the request to the proxy, and the proxy raises the request to the server on its behalf
- A reverse proxy takes a request and sends it to the required server; the client does not know which server the request was sent to
- API gateways and load balancers are reverse proxies
### API Gateway
- In the case of microservices there must be some logic that decides which request goes to which microservice
- API gateways work at the application layer, as they need to understand the request in order to route it
### Load balancers
- Evenly distribute load across servers
- Don't need to understand the request
- They can work at transport layer or sometimes network layer
- Rate limiter
:::info
- NginX and Kong are examples of reverse-proxy providers (API gateway, load balancer, rate limiter, etc.)
- A cache can also act as a reverse proxy if configured that way, i.e., the request goes to the cache first and the cache forwards it to the service server (write-back / write-through cache)
:::
**Authentication and Authorization**
- Authentication happens only once, in the untrusted network
- Once confirmed, the request can pass into the VPC
- Authorization checks happen at every service level
### Modular Monolith
The architecture is a monolith, but inside it the service codebases are isolated, cannot call across those boundaries directly, and communicate only via internal API interfaces.
### From monolith to Micro
**Deciding which service to break off**
- Start by converting the monolith into a modular monolith
- Start with the services that have the fewest dependencies
- Or services with uneven load
- Or modules that benefit from a separate tech stack or database
- Split vertically (by business capability, e.g., Search/Product/Support below), not horizontally (by shared layers such as logging or database access), one service at a time.
Example
| Micro service | Search | Product | Support |
|---------------|-----------------------|------------------------|------------------------|
| SubService | n/w handling | n/w handling | n/w handling |
| SubService | logging | logging | logging |
| SubService | database access | database access | database access |
| SubService | caching | caching | caching |
| SubService | performance monitoring| performance monitoring | performance monitoring |
**How to breakoff**
- Build the new service
- Route a limited amount of traffic to it
- Increase the traffic and monitor
- Iterate this process for the other services
### Hybrid Architecture
Some micro around monolith
### Crosscutting concernes
Some services, like database access, can be built per service as required, but cross-cutting concerns like network handling, logging, auth, and utils need to be handled and standardized.
In these cases we can turn the concern into a separate microservice or a shared library:
- network handling is a shared library (HTTP protocols)
- logging can be a library
- authentication cannot be a library and is pulled out into a separate microservice
- transactions can be a library
- utils can be a shared library
### Pitfall of micro service
- Underestimating the overhead - automation, monitoring, logging, tracking
- Breaking apart tightly coupled processes
- Ignoring the fallacies of distributed computing
- latency will be there, networks fail
- Need for observability
- In case of failure we need a way to trace the flow
- It is possible the issue is in one place and the error surfaces somewhere else
- Need good logging, monitoring, and tracing
## Communication between micro services
- Latency - Network calls increase latency
- Reliability - Network can fail so we must implement retries and circuit breaker
- Versioning - backward compatibility needs to be taken care of
- Consistency
## Client Micro service communication
### API Gateway
- The API gateway (in the untrusted network) can handle cross-cutting concerns like auth, rate limiting, SSL termination, request aggregation, and more. The client just calls one endpoint and the gateway fans the call out to the microservices.
Examples: NginX, Kong, Traefik, HAProxy
### GraphQL
- Solves (n+1) problem
- Allows the front end to evolve independently of the back end
- The front end gets exactly the data it needs without depending on the back end to update the API response and manually add the needed fields during development
### Backend for front end (BFF)
For example, clients like smart TVs, desktops, and mobiles.
For such cases we can have a different gateway and backend combination for each type of client.
## Microservice to microservice communication
### Synchronous microservice call
- Any processing the called service does will block the client response
- Tight coupling between the two services
### RESTful API
:::success
Pro
:::
Easy to build and debug
:::danger
Cons
:::
JSON is text based and heavy, adding latency; serializers also add load.
- Example: an 8-byte timestamp, when stored as a string like {"timestamp":"2025-03-28 08:54:21 +5:30 Asia/Kolkata"}, becomes roughly 100 bytes due to the character length (2 bytes per char in Unicode)
HTTP's 3-way handshake (TCP) is heavy.
Within a trusted, closed network (VPC) this becomes inefficient.
### Remote Procedure Calls (RPC/gRPC)
Uses HTTP/2
Sends binary-encoded data along with the data type
Example: a timestamp is sent as 8 bytes of data plus its data type, encoded with **protobuf**
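A rough size comparison illustrating the point above, using Python's `struct` for an 8-byte binary timestamp versus the same instant as a JSON string; the exact byte counts depend on the encoding and are for illustration only.
```python=
import json
import struct
import time

ts_ms = int(time.time() * 1000)

# Binary encoding: a fixed 64-bit integer, as a protobuf-style field would carry it
binary = struct.pack(">q", ts_ms)    # exactly 8 bytes

# Text/JSON encoding of a comparable timestamp string
as_json = json.dumps({"timestamp": "2025-03-28 08:54:21 +5:30 Asia/Kolkata"})

print(len(binary))   # 8
print(len(as_json))  # several times larger: a few dozen bytes of text
```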
### Inter Process communication (IPC) - Very Rare
Used when microservices are running on the same server (a special case).
## Asynchronous micro service call
- The request is delivered, a response is acknowledged, and the processing happens in the background
- Other requests are not blocked waiting for the response
- Loose coupling between services
## Queues for Async calls (Micro service)
Message queues - task queues (RabbitMQ); once a task is done it is deleted
Event-driven / pub-sub (Kafka) - a message is subscribed to by multiple subscribers (different consumer groups)
Data-driven / event streaming (Kafka Streams, Apache Flink) - for continuous data flows where events are processed in near real time; common in analytics or monitoring pipelines
:::danger
Cons
:::
- Eventual consistency
- Distributed tracing is required to see which request goes where and how the request flow happens
- Message duplication and delivery guarantees need to be considered
- The message queue adds to the infrastructure
:::warning
## Pitfalls of async micro service calls
:::
- Overly chatty APIs - batch the frequent calls
- Automate service discovery (Kubernetes, Eureka)
- Keep APIs coarse-grained
- APIs should not expose internal implementation details
## Observability of flow of micro services
- Detect something is wrong (Known and unknown)
- Diagnose the issue
- Verify that fix worked
- Metrics
    - deviation from expected behavior, e.g., CPU usage goes above 80%
    - any data point we can collect is a metric
    - every microservice will expose a /metrics endpoint whose response contains all the metrics needed
    - the metrics should be tagged properly: when, service, instance, server, region, etc.
    - Tools can be used to do this
    - Basic suggested metrics are
        - Latency (handled by middleware)
        - Traffic
        - Error types and frequency
        - Saturation, i.e., how full your service is (memory, I/O constraints, etc.)
        - User behaviour (ratings / NPS)
        - User experience
        - Revenue
    - See the SRE book by Google
- Logs
    - once metrics signal something is wrong, we go through the logs to identify the problem
    - this is done on every individual server
    - JSON is suggested, as it is human readable and machine parsable, and field order doesn't matter in the string
    - suggested logging fields (see the example log line after this list)
        - correlation ID (for distributed tracing)
        - timestamp with timezone
        - log level
        - service
        - message
        - service entity ID
        - hash(user ID)
    - Aggregate logs from all servers into one location
        - it should enable search
    - Popular tools (log shipper + DB + visualization)
        - Logstash + Elasticsearch + Kibana
- Traces
- Alerts
    - If a metric deviates beyond the acceptable threshold it should trigger an alert
    - Good alert
        - it should be actionable
        - it should be precise, i.e., point to exactly where the problem occurred
        - it should be relevant, i.e., trigger only when needed
    - Bad alert
        - Noise - too frequent and irrelevant
        - Ambiguous - lacking clear context or recommendation
        - Low impact - triggers on minor issues
    - Best practices
        - Set appropriate thresholds
        - Define escalation policies
        - Runbook - there should be proper documentation of the actions to take
        - Review regularly - re-check thresholds
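A sample structured log line with the suggested fields from the Logs subsection above; the field names and values are illustrative, not a fixed schema.
```python=
import hashlib
import json
import time

log_line = json.dumps({
    "correlation_id": "3f9c2a7e-5d11-4b8e-9c02-7d2f1a6b9e44",   # for distributed tracing
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),          # with timezone
    "level": "ERROR",                                           # log level
    "service": "order-service",
    "message": "payment call timed out",
    "order_id": "ORD-10293",                                    # service entity id
    "user": hashlib.sha256(b"user-123").hexdigest()[:12],       # hash(user id)
})
print(log_line)   # one machine-parsable line per event, shipped to the log aggregator
```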
## Distributed Tracing
Trace
- a request talks to multiple services in an arbitrary sequence
- each service logs when the request reaches it
- tracking the complete flow of a request across these multiple logs is distributed tracing
- the way we do this: when the client makes a request to the API gateway, we create an ID called the correlation/trace/request ID
- this identifier is passed along everywhere and is carried in the request headers (e.g., HTTP `X-B3-TraceId`, W3C `traceparent`)
Spans
- within a trace, a specific group of processing is called a span
- it has its own span ID and references its parent span
All of this is itself an ops overhead.
Small companies use third-party managed services like Datadog or New Relic instead of open-source tools like Prometheus or Jaeger.
## Micro service Resilience
- Service will fail or network will glitch, system should handle this gracefully
### Common Failure
- Service Unavailability: Instant crash (Out of memory, bugs, cosmic events), deployment issue
- Network Issues: Timeouts, packet loss, high latency, DNS problems etc
- Service Overload/Slowness : A dependency becomes slow due to load, inefficient queries, resource contention etc
:::info
Returning error is not equivalent to failure. It is called handling situation gracefully.
:::
### Cascading failure
- Due to interdependency between multiple services, one failure causes multiple services to fail or time out
- This can happen with async calls as well, because the queue may get overwhelmed
- The solution is the circuit breaker pattern

### Thundering Herd
- Massive retries, plus queued requests hitting a service that has just come back online after being down, can cause the service to go down again due to the sudden overload

## Basic Defence
### Timeouts
- Every network call MUST have a timeout.
- This is done based on expected latency (P95/P99)
- How NOT to Measure Latency by Gill Tene (Youtube)
### Smart Retries
- Handle transient failures: use retries for small failures
- Avoid naive retries: immediate, repeated retries are called naive retries
- Use exponential backoff: increase the delay between retries
- Add jitter: randomize the backoff delays slightly to prevent synchronized retries
- Limit retries: usually 2-3 max for internal APIs; we can be slightly more aggressive (max ~20) when doing something like crawling the web (external)
- Mandatory idempotency: ensure retried operations don't have unintended side effects; for example, if we develop an API and the client retries, it should not end up creating duplicate messages (events)
```python=
import random
import time
from functools import wraps

MAX_RETRIES = 3

def retry_decorator(func):
    """Retry a call that returns a falsy value on failure,
    using exponential backoff with jitter."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        retry_delay_ms = 100
        retry_count = 0
        while retry_count <= MAX_RETRIES:
            result = func(*args, **kwargs)
            if result:
                return result
            retry_count += 1
            retry_delay_ms *= 2                                   # exponential backoff
            jittered_ms = retry_delay_ms * (1 + random.random())  # add jitter
            time.sleep(jittered_ms / 1000)
        return None   # all retries exhausted
    return wrapper
```
## Circuit Breaker Pattern
- Solves for cascading and thundering herd
- The main cause is that the calling service could not tell whether the failed target service's problem was transient or long-lived
- The circuit breaker maintains the status of each dependency (closed / open / half-open) somewhere central like a Redis cache
### Circuit closed: Healthy, default status
### Circuit Open: Non responsive status
- No request is sent to the dependency; the client is sent a response while the situation is handled gracefully.
- The status changes from closed to open (the circuit trips) when the failure threshold is reached.
### Circuit Half Open:
- The moment the status changes to open, we start a timer called the sleep window.
- After the sleep window expires, we change the status to half-open.
- In the half-open state, a small percentage of requests (1-2%) are sent to the service.
- If these requests succeed, we increase the number of requests until the state is changed back to closed.
- If the service is still non-responsive, we change the state back to open and reset the sleep window.

```python=
# Service A
def handle_request(request):
    data = circuit_breaker.make_call("B", request.data)
    return build_response(data)

# Circuit breaker (library / sidecar)
# Maintains a state per dependency: 'closed' | 'open' | 'half-open'
def make_call(service, data):
    state = service_states.get(service)

    if state == 'closed':
        response = make_api_call(service, data)
        if not response.ok:                                    # non-2xx
            service_states.incr_failure_count(service)
            if service_states.failure_rate(service) > FAILURE_THRESHOLD:
                service_states.set(service, 'open')            # trip the circuit
        return response

    if state == 'open':
        if time_since_state_open(service) > SLEEP_TIMEOUT:
            service_states.set(service, 'half-open')           # start probing
        else:
            return error_response(503)                         # fail fast, degrade gracefully

    if state == 'half-open':
        if random.random() < 0.01:                             # let ~1% of traffic through
            response = make_api_call(service, data)
            service_states.record_probe(service, response.ok)  # monitor the probe
            if service_states.probe_success_count(service) > TEST_THRESHOLD:
                service_states.set(service, 'closed')          # healthy again
            elif service_states.probe_failure_count(service) > TEST_THRESHOLD:
                service_states.set(service, 'open')            # back to open, reset sleep window
            return response
        return error_response(503)
```
### Infrastructure Requirement (Circuit Breaker)
- The state needs to be stored in a central location like a Redis server
- Or each service can maintain the state of its own dependencies, and the circuit breaker can be implemented as a library / sidecar
### Implementation Considerations (Circuit Breaker)
- Scope: applied per dependency, sometimes per operation within a dependency
- Threshold: needs careful tuning based on baseline metrics
- Monitoring: it is critical to monitor the state of your circuit breakers; put alerts on state transitions and on services stuck in the open state
### Popular Implementations
- Resilience4j
- Polly
- Istio/Linkerd
## BulkHead (Resource Isolation)
## Managing data across services
Data per service
- this lets each service have autonomy over its own data
- it can change how it stores its data
- and make upgrades and modifications to its database independently
### Cross cutting situations
- where other services have to read from or write to another service's database
- this is handled by the Command/Query Responsibility Segregation pattern (CQRS)
## Command-Query Responsibility Segregation Pattern (CQRS)
- ==**Write**== : When one service modifies its database and the change needs to be reflected in another service's database, it should do this via a message queue, i.e., as a task (command)
- the other service consumes that task (message) and modifies its own database
- ==**Read**== : When data is modified in one service (service R) and needs to be read by another service (service M), service R should publish a task/message to the queue on each modification; service M consumes the message and copies the now-modified data into its own database
- ==**Commands**== - writes, i.e., they represent the intent to change data
- need complex business rules, validation, ACID consistency
- ==**Queries**== - reads, i.e., they represent a request for data
- need high speed and flexibility, with views across the different databases optimized for the UI
:::warning
This forces for eventual consistency.
As the read and write databases are separate, there is an update lag.
:::
:::success
The benefit is that it is highly optimized when the system is very read heavy.
:::
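A minimal in-process sketch of the CQRS write/read flow described above; the queue stands in for a real broker (e.g., a Kafka topic) and the dictionaries stand in for each service's own database.
```python=
import queue

# Stand-in for the message queue between services
change_events = queue.Queue()

# --- Service R: owns the write model ---
service_r_db = {}

def service_r_update(order_id, status):
    service_r_db[order_id] = status              # local transaction (the command)
    change_events.put({"order_id": order_id,     # publish the change as a message
                       "status": status})

# --- Service M: keeps its own read-optimized copy ---
service_m_read_db = {}

def service_m_consume():
    # Consume pending messages and copy the data into service M's database
    # (this is why the model is only eventually consistent)
    while not change_events.empty():
        event = change_events.get()
        service_m_read_db[event["order_id"]] = event["status"]

# Usage
service_r_update("ORD-1", "PLACED")
service_m_consume()
print(service_m_read_db)   # {'ORD-1': 'PLACED'}
```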
## Consistency in micro service (locks)
### 2 Phase commit for immediate consistency
1-phase commit
- We commit across 2 services at the same time
- one succeeds and the other fails
- we can roll back the one that succeeded
- but if a read happens during this window, the 2 services return inconsistent data
2-phase commit
- a write request comes in: `a, b = 4, 11`
- the server (coordinator) asks one database to prepare to commit `a = 4` and the second database to prepare to commit `b = 11`
- the databases prepare by storing the values on disk (persistent storage) separately; they have not updated a or b yet, just stored the values elsewhere. They also lock a and b, which prevents any other read or write
- each database responds with an acknowledgment that preparation succeeded
- once the coordinator gets the response that the data is prepared, it means the database can commit whenever told to
- if the coordinator gets preparation acknowledgments from both databases, it requests the commit
- if one database does not acknowledge, the coordinator can either retry, or simply not tell the other database to commit and release the lock
- there is no read inconsistency because the data has not been written yet, and other writes cannot happen because of the lock
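A toy sketch of the coordinator flow for the `a, b = 4, 11` example above, with in-memory stand-ins for the databases; a real implementation also persists the decision and handles the failure cases listed in the next subsection.
```python=
class ToyParticipant:
    """In-memory stand-in for one database taking part in 2-phase commit."""
    def __init__(self):
        self.data, self.prepared, self.locked = {}, {}, set()

    def prepare(self, key, value):
        self.locked.add(key)        # lock the key: no other read or write
        self.prepared[key] = value  # persist the pending value separately
        return True                 # acknowledge "ready to commit"

    def commit(self, key):
        self.data[key] = self.prepared.pop(key)
        self.locked.discard(key)

    def rollback(self, key):
        self.prepared.pop(key, None)
        self.locked.discard(key)

def two_phase_commit(writes):       # writes: {participant: (key, value)}
    # Phase 1: ask every participant to prepare
    if all(p.prepare(k, v) for p, (k, v) in writes.items()):
        # Phase 2: all acknowledged, so tell everyone to commit
        for p, (k, _) in writes.items():
            p.commit(k)
        return True
    # Someone failed to prepare: roll everyone back and release the locks
    for p, (k, _) in writes.items():
        p.rollback(k)
    return False

db1, db2 = ToyParticipant(), ToyParticipant()
print(two_phase_commit({db1: ("a", 4), db2: ("b", 11)}))  # True
print(db1.data, db2.data)                                 # {'a': 4} {'b': 11}
```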
### complex failure
- preparation succeeds but the acknowledgment fails
- the coordinator server fails
- the rollback can fail
While all of this complexity is being handled, the locks on the data are still held,
==hence the latency is very high==
## Saga Patterns for eventual consistency
If a transaction is long-lived and hits multiple services, instead of treating it as one single flow we break the transaction down into smaller segments, ==called sagas==.
A ==saga== should be small enough that it can be executed in one microservice using a local transaction.
Example
- an order is placed on Swiggy in the order service
- the order service completes a local transaction and saves something in its DB
- the order service communicates the order to the restaurant service and the payment service
- once the order saga succeeds, the payment service makes its local transaction; if that succeeds it moves on to the inventory service, and success is returned to the previous service
- in case of failure it follows the command pattern (used for undo/compensation) - see the sketch after this list
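A tiny orchestrated-saga sketch of the flow above, pairing each local transaction with a compensating (undo) action; the service names and steps are illustrative.
```python=
def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure, execute the
    compensations of the already-completed steps in reverse order (undo)."""
    done = []
    for action, compensation in steps:
        if action():
            done.append(compensation)
        else:
            for undo in reversed(done):
                undo()
            return False
    return True

# Illustrative local transactions for the food-ordering example
def place_order():       print("order service: order saved");       return True
def cancel_order():      print("order service: order cancelled")
def charge_payment():    print("payment service: payment charged"); return True
def refund_payment():    print("payment service: payment refunded")
def reserve_inventory(): print("restaurant service: failed!");      return False
def release_inventory(): print("restaurant service: released")

ok = run_saga([
    (place_order, cancel_order),
    (charge_payment, refund_payment),
    (reserve_inventory, release_inventory),   # fails -> refund + cancel run
])
print("saga succeeded:", ok)
```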
### Implementing Saga pattern
https://hackmd.io/@Asrik/CSE-SAGA
# Rate limiter
- to control the rate at which requests hit the servers
- this is done to
- protect the resources (CPU, memory, DB)
- prevent abuse and protect against attacks like DDoS and brute-force login
- ensure fair use and avoid resource monopoly when multiple services share them
- manage cost, as third-party API calls and bandwidth incur cost
- enforce SLAs and business tiers like free and pro versions
## Types of rate limiters
## Debounce - Client-side rate limiter
- Situations like clicking an incorrect link and immediately re-clicking the correct one, or resizing the window very rapidly - cases where the user triggers actions very rapidly. Making an API call for every keystroke is very inefficient.
- We set a timer with a very short delay before sending the API call. If another event occurs before the timer finishes, the timer resets. The API call is only sent once the timer completes.
```python=
import threading
from functools import wraps
def debounce(wait_ms):
"""Decorator to debounce a function call by wait_ms milliseconds."""
def decorator(fn):
timeout = None
@wraps(fn)
def debounced(*args, **kwargs):
nonlocal timeout
# Cancel previous timer
if timeout is not None:
timeout.cancel()
# Start new timer
timeout = threading.Timer(wait_ms / 1000.0, lambda: fn(*args, **kwargs))
timeout.start()
return debounced
return decorator
# Example usage
@debounce(300) # Wait 300 ms after last call
def fetch_suggestion(text):
print(f"Fetching suggestions for: {text}")
# Simulating rapid calls (like typing)
import time
fetch_suggestion("a")
time.sleep(0.1) # 100 ms
fetch_suggestion("ab")
time.sleep(0.1)
fetch_suggestion("abc")
time.sleep(0.4) # Wait enough time for debounce to trigger
```
## Throttle - Client side rate limiter
- Allow execution of the first trigger and ignore subsequent triggers until the cooldown timer completes.
```python=
import time
from functools import wraps
def throttle(wait_ms):
"""Decorator to throttle a function call to once every wait_ms milliseconds."""
def decorator(fn):
last_time = [0]
@wraps(fn)
def throttled(*args, **kwargs):
now = time.time() * 1000
if now - last_time[0] >= wait_ms:
last_time[0] = now
return fn(*args, **kwargs)
return throttled
return decorator
# Example usage
@throttle(300)
def fetch_suggestion_throttled(text):
print(f"Fetching (throttled) for: {text}")
fetch_suggestion_throttled("a")
time.sleep(0.1)
fetch_suggestion_throttled("ab") # Will be ignored
time.sleep(0.4)
fetch_suggestion_throttled("abc") # Will execute
```
## Fixed Window Counter - Back end rate limiter
A Fixed Window Counter is one of the simplest rate-limiting algorithms — it counts how many requests happen within a fixed time period ("window") and blocks anything over the limit.
How it Works
- You pick: A window size (e.g., 1 second, 1 minute, 1 hour)
- A max request limit (e.g., 100 requests per minute)
- You keep a counter of how many requests have happened in the current window.
- When the counter exceeds the limit before the window resets, you block the request.
- At the start of the next window, you reset the counter to zero.
Example
Limit = 5 requests per minute
Window = 1 minute
Start: 10:00:00
| Time | Request Count | Allowed? |
|----------|---------------|----------------------|
| 10:00:02 | 1 | ✅ Yes |
| 10:00:05 | 2 | ✅ Yes |
| 10:00:10 | 3 | ✅ Yes |
| 10:00:15 | 4 | ✅ Yes |
| 10:00:20 | 5 | ✅ Yes |
| 10:00:25 | 6 | ❌ No - limit reached |
:::success
Pros
:::
- Very easy to implement.
- Low memory usage.
:::danger
Cons
:::
- Burstiness at window edges:
- If someone makes 5 requests at the end of one minute, then immediately 5 more at the start of the next minute, they effectively make 10 requests in ~1 second.
- This is why sliding window log or token bucket algorithms are often preferred.
```python=
import time
class FixedWindowRateLimiter:
def __init__(self, max_requests, window_seconds):
self.max_requests = max_requests
self.window_seconds = window_seconds
self.current_window_start = time.time()
self.request_count = 0
def allow_request(self):
now = time.time()
# If current window expired, reset
if now - self.current_window_start >= self.window_seconds:
self.current_window_start = now
self.request_count = 0
if self.request_count < self.max_requests:
self.request_count += 1
return True # Allowed
else:
return False # Blocked
# Example usage
limiter = FixedWindowRateLimiter(max_requests=5, window_seconds=60)
for i in range(7):
allowed = limiter.allow_request()
print(f"Request {i+1}: {'Allowed' if allowed else 'Blocked'}")
```
## sliding window log Backend Rate Limiter
A **sliding window log** rate limiter lets up to **N requests within the last W seconds**—measured over a *continuously sliding* time window (not reset on minute boundaries).
It keeps a **log of timestamps** for each client/key and, on each request, drops timestamps older than `now - W`, then checks if the log size is `< N`.
### Why use it
* Prevents the “edge burst” problem of the fixed window counter (no double-dipping at window boundaries).
* Behavior is intuitive: “no more than N requests in the last W seconds,” at any instant.
### Trade-offs
* **Memory:** stores 1 timestamp per accepted request per key; can grow with traffic.
* **Time:** each check prunes old entries. With a deque, amortized O(1) per request (worst case O(k) to prune), where k = number of stale entries at the head.
* If you need strict O(1) memory/time, consider **token bucket / leaky bucket** instead.
### Python implementation (deque-based)
```python=
import time
from collections import deque
from typing import Deque, Dict
class SlidingWindowLogLimiter:
def __init__(self, max_requests: int, window_seconds: float):
self.max_requests = max_requests
self.window = window_seconds
# Map from key (user/IP/api-key/route) to a deque of request timestamps
self.logs: Dict[str, Deque[float]] = {}
def _prune(self, dq: Deque[float], now: float) -> None:
cutoff = now - self.window
# Pop from the left while entries are older than the window
while dq and dq[0] <= cutoff:
dq.popleft()
def allow(self, key: str) -> bool:
now = time.time()
dq = self.logs.setdefault(key, deque())
self._prune(dq, now)
if len(dq) < self.max_requests:
dq.append(now) # accept and record this request timestamp
return True
return False # reject: already hit N in the last W seconds
# ---- Example usage ----
if __name__ == "__main__":
limiter = SlidingWindowLogLimiter(max_requests=5, window_seconds=10)
for i in range(7):
allowed = limiter.allow("user-123")
print(f"Request {i+1}: {'Allowed' if allowed else 'Blocked'}")
time.sleep(1.5)
```
### Notes & options
* **Keys**: Use whatever identifies a requester (IP, user ID, API token, route+user).
* **Clock**: For tests, inject a clock function instead of `time.time()`.
* **Thread safety / multiprocess**: Wrap with a lock in threaded apps, or keep the log in a shared store (e.g., Redis list) if you have multiple workers.
* **Redis version**: Use a sorted set per key (score = timestamp). `ZREMRANGEBYSCORE key 0 now-W` to prune, `ZCARD` to count, `ZADD` to insert. This scales across processes.
## Sliding window counter
An *approximate* sliding-window rate limiter that smooths the "window edge" problem of fixed windows while using **O(1) memory per key**.
* Pick a window size **W** (e.g., 60s) and a limit **N** (e.g., 100 requests).
* Keep **two counters per key**:
* the **current window** count
* the **previous window** count
* At time `now`, compute how far we are into the current window (`offset ∈ [0, W)`).
Effective count = `current_count + previous_count * (1 - offset/W)`
* Allow the request if that effective count is `< N`.
This blends the two adjacent fixed windows to approximate a true sliding window, without storing per-request timestamps.
:::success
Pros
:::
* Much less memory than the sliding window log (no per-request timestamps).
* Smoother than the fixed window counter (fewer boundary bursts).
* Simple; O(1) time/memory per key.
* Caveat: it is an approximation and still allows small bursts; if you need strict fairness, use the sliding window log.
### Python implementation
```python=
import time
import threading
from dataclasses import dataclass
from typing import Dict
@dataclass
class _Counters:
# Start time (epoch seconds, aligned) of the "current" window
current_window_start: int
current_count: int
prev_window_start: int
prev_count: int
class SlidingWindowCounter:
def __init__(self, max_requests: int, window_seconds: int):
if window_seconds <= 0:
raise ValueError("window_seconds must be > 0")
if max_requests <= 0:
raise ValueError("max_requests must be > 0")
self.max_requests = max_requests
self.window = window_seconds
self._store: Dict[str, _Counters] = {}
self._lock = threading.Lock()
def _aligned_start(self, t: float) -> int:
# Floor to window boundary (seconds)
return int(t // self.window) * self.window
def allow(self, key: str) -> bool:
"""
Returns True if the request is allowed under the sliding window counter,
otherwise False. Updates counters on success.
"""
now = time.time()
cur_start = self._aligned_start(now)
offset = now - cur_start # seconds into the current window ∈ [0, window)
with self._lock:
c = self._store.get(key)
if c is None:
# First observation for this key: initialize counters
c = _Counters(
current_window_start=cur_start,
current_count=0,
prev_window_start=cur_start - self.window,
prev_count=0
)
self._store[key] = c
# If we moved into a new window, rotate counters
if c.current_window_start != cur_start:
if c.current_window_start == cur_start - self.window:
# Simple rotate: current -> prev, new empty current
c.prev_window_start = c.current_window_start
c.prev_count = c.current_count
else:
# We skipped one or more windows; drop history
c.prev_window_start = cur_start - self.window
c.prev_count = 0
c.current_window_start = cur_start
c.current_count = 0
# Blend counts from previous window based on overlap
weight_prev = 1.0 - (offset / self.window)
effective = c.current_count + c.prev_count * weight_prev
if effective < self.max_requests:
c.current_count += 1
return True
else:
return False
# Example usage
if __name__ == "__main__":
limiter = SlidingWindowCounter(max_requests=5, window_seconds=10)
key = "user-123"
results = []
for i in range(7):
allowed = limiter.allow(key)
results.append((i+1, "Allowed" if allowed else "Blocked"))
time.sleep(1.5)
for n, status in results:
print(f"Request {n}: {status}")
```
### Notes (sliding window counter)
* **Thread safety**: guarded with a single lock; for high concurrency per key, you can shard locks by key hash.
* **Multi-process / distributed**: store the two counters in Redis per key (e.g., as a small hash) and rotate them using Lua or a transaction keyed by `cur_start`.
* **Choice of window**: seconds are typical; use smaller windows (e.g., 1s) if you need finer smoothing.
## Token Bucket - Rate limiter
- Each user has an assigned bucket of tokens; one token is consumed per request, and tokens are refilled at a fixed rate (see the sketch below).
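A minimal in-memory token bucket sketch, assuming a refill rate expressed in tokens per second; per-user buckets would normally live in a shared store like Redis.
```python=
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity          # max tokens a user can accumulate (burst size)
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.time()

    def allow(self) -> bool:
        now = time.time()
        # Refill proportionally to the elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1              # consume one token for this request
            return True
        return False

# Example: burst of 5, refilled at 1 token/second, one bucket per user
buckets = {}
def allow_request(user_id: str) -> bool:
    bucket = buckets.setdefault(user_id, TokenBucket(capacity=5, refill_rate=1.0))
    return bucket.allow()

for i in range(7):
    print(f"Request {i+1}:", "Allowed" if allow_request("user-123") else "Blocked")
```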
## Leaky Bucket
- used to protect downstream services
- as requests come in, a buffer (the bucket) fills up; the downstream service consumes from it at a fixed rate, and requests that arrive when the buffer is full are rejected (see the sketch below)
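A small sketch of a leaky bucket as a bounded queue drained at a fixed rate; the capacity, drain rate, and worker setup are illustrative assumptions.
```python=
import queue
import threading
import time

BUCKET_CAPACITY = 10        # max requests buffered in the bucket
DRAIN_RATE_PER_SEC = 2      # downstream service consumes 2 requests per second

bucket = queue.Queue(maxsize=BUCKET_CAPACITY)

def submit(request) -> bool:
    """Producer side: accept the request only if the bucket has room."""
    try:
        bucket.put_nowait(request)
        return True
    except queue.Full:
        return False        # bucket overflow -> reject the request

def drain():
    """Consumer side: leak requests to the downstream service at a fixed rate."""
    while True:
        request = bucket.get()
        print("forwarding to downstream:", request)
        time.sleep(1 / DRAIN_RATE_PER_SEC)

threading.Thread(target=drain, daemon=True).start()
for i in range(15):
    print(f"request {i}:", "accepted" if submit(i) else "rejected")
time.sleep(3)   # give the drainer time to forward a few buffered requests
```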
## Hardware VS software rate limiter
- Software limiter - implemented in application code
- or in proxies like NginX, Envoy
- flexible and low cost, but can add latency and is limited by the server's resources
- Hardware limiter - dedicated network appliances
- high performance and low latency, but expensive and less flexible
# Case studies and more reading
https://hackmd.io/@Asrik/CSE_HLDcases