
## The DATABASE

```mermaid
flowchart LR
    A([Projects]) --> |1: New string to translate| B
    B{{Localization.Service}} --> |2: Send new string to translate to Crowdin| C{Crowdin}
    C -.-> |3: Webhooks returning translated data| B
    B --> |4: Store the returned data in our database| D[(Internal Database)]
```

To store all translated strings on our side, we need a database that can handle a **large** number of requests.

**BUT...** Why do we want it on our side?

- We can't send many requests to Crowdin to get the latest translations: their API has many rate limits
- We can't use an in-memory database: after each reboot we would need to recover all data from Crowdin, which brings us back to the first point
- In terms of speed and performance, it's better to keep the source of truth close to the Localization Service
- We need an extra piece of data in our translations: the state. Depending on whether a string is already approved or only suggested by Crowdin's AI, we need to store this on our side to provide the user with an "AI translation state percentage"

When choosing the database, we need to consider several criteria: high availability, performance, Turbulent's knowledge, ... But we need to keep in mind that the Localization Service will have a cache between the services and the database, reducing the load on it.

### 1. 🟢 Use two databases, document & key value store

On RSI we can isolate two kinds of data to translate:

- tokens (from templates, for example, which work like a key / value)
- pages (from the comm-link, which can be stored as a document, itself a set of tokens)

To optimize storage and request speed, we can split this data across two databases, each with its own way of retrieving it.
For a single token, the key / value store is the best option.
However, for a page, the whole translation can be stored in a document and the key can be part of the URL, for example:

```
comm-link/spectrum-dispatch/19462-Whitleys-Guide-San-toky-i
```

```mermaid
flowchart LR
    A{{Localization.Service}} --> B[(Key / Value DB)]
    A{{Localization.Service}} --> C[(Document DB)]
    B --> A
    C --> A
```

- **Advantage:**
  - Retrieving all translated strings for a page can be faster than looking up each key of that page in the key / value store
  - A page / set of tokens can be used for more than a page, for example a full TySku object
- **Disadvantage:**
  - Two different logics to implement, one for tokens and one for pages / sets of tokens
  - More code to maintain because we are using two different databases
- ###### *Explanations about the two storage methods*
  - ###### 💾 **Database #1 KEY / VALUE**
    ```
    KEY                         | VALUE
    account_settings_rsvp_title | Titre RSVP
    RSI_faq_title               | Foire aux questions
    ```
  - ###### 💾 **Database #2 Document NO-SQL**
    ```
    KEY           | VALUE
    ty_sku_1001   | {"title": "Grand vaisseau 10 places", "description": "Achetez-le"}
    ty_merch_9987 | {"title": "Hoodie XXL", "description": "Devenez un pilote"}
    ```

### 2. 🟢 Use a single database as key / value store

Instead of using two databases and differentiating data types, we can store everything in the same database engine. Whether the engine is a document or a key / value store, the logic stays the same for both kinds of data: a single token, or page elements split into tokens.

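
Splitting page elements into tokens can be sketched as follows. This is only a minimal illustration, not the Localization Service's actual code: the function names, the in-memory `dict` standing in for the key / value engine, and the `-` separator between object key and field name are all assumptions made for the example.

```python
from typing import Dict

# Assumption: object key and field name are joined with "-" (e.g. "ty_sku_1001-title").
# Object keys that themselves contain "-" or share a prefix would need a stricter scheme.
SEPARATOR = "-"


def split_into_tokens(object_key: str, fields: Dict[str, str]) -> Dict[str, str]:
    """Flatten one translated object into individual key / value tokens."""
    return {f"{object_key}{SEPARATOR}{name}": text for name, text in fields.items()}


def rebuild_object(object_key: str, store: Dict[str, str]) -> Dict[str, str]:
    """Collect every token belonging to one object back into a single mapping."""
    prefix = f"{object_key}{SEPARATOR}"
    return {
        key[len(prefix):]: text
        for key, text in store.items()
        if key.startswith(prefix)
    }


# A plain dict stands in for the key / value database in this sketch.
store: Dict[str, str] = {}
store.update(split_into_tokens(
    "ty_sku_1001",
    {"title": "Grand vaisseau 10 places", "description": "Achetez-le"},
))
store.update(split_into_tokens(
    "ty_merch_9987",
    {"title": "Hoodie XXL", "description": "Devenez un pilote"},
))

print(store["ty_sku_1001-title"])
print(rebuild_object("ty_merch_9987", store))
```

Rebuilding an object here scans every key, which is fine for a sketch; a real engine would rather fetch the known field keys in one multi-key read.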
```mermaid
flowchart LR
    A{{Localization.Service}} --> B[(Key / Value DB)]
    B --> A
```

- **Advantage:**
  - Architecture is lighter
  - Compared to Solution 1, less code to maintain
  - If the load requires it, replication can be easy with key sharding: place **`ty_sku_*`** in Replica-1 and **`ty_merch_*`** in Replica-2
- **Disadvantage:**
  - Retrieving a whole page means fetching each of its keys, which can be slower than reading a single document (the advantage of Solution 1)
  - Need to find a way to create and retrieve keys depending on the object, page, etc.

###### *Explanations about the single storage method*

- ###### 💾 **Database KEY / VALUE**
  ```
  KEY                         | VALUE
  account_settings_rsvp_title | Titre RSVP
  RSI_faq_title               | Foire aux questions
  ty_sku_1001-title           | Grand vaisseau 10 places
  ty_sku_1001-description     | Achetez-le
  ty_merch_9987-title         | Hoodie XXL
  ty_merch_9987-description   | Devenez un pilote
  ```

### ⚖️ Choice about database (Key / Value store)

### 🟢 Redis:

**Functionality:**
- In-memory key-value store, providing high-speed data access because data is kept in memory.

**Capabilities:**
- Used for caching, session management, message queues, and supports advanced data structures.

**Scalability:**
- Redis can be scaled horizontally using clustering.

**Sharding:**
- Supports sharding for horizontal scaling.

**Advantages:**
- Extremely fast read and write operations.
- Support for advanced data structures.
- Highly suitable for caching and real-time applications.
- ✅ A lot of knowledge inside Turbulent about this database engine

**Disadvantages:**
- Data size is limited by available RAM.
- Not ideal for large-scale data storage.
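
The prefix-based key sharding mentioned for the single key / value store (`ty_sku_*` on Replica-1, `ty_merch_*` on Replica-2) could be routed on the client side roughly like this. It is a sketch under assumptions: the mapping table, the `shard_for` helper, and the fallback shard for unprefixed keys are hypothetical and not part of any existing service.

```python
from typing import Dict

# Assumption: prefixes and shard names mirror the example given for Solution 2.
PREFIX_TO_SHARD: Dict[str, str] = {
    "ty_sku_": "Replica-1",
    "ty_merch_": "Replica-2",
}

# Hypothetical fallback for keys that match no known prefix (e.g. plain template tokens).
DEFAULT_SHARD = "Replica-1"


def shard_for(key: str) -> str:
    """Return the name of the shard that should hold the given key."""
    for prefix, shard in PREFIX_TO_SHARD.items():
        if key.startswith(prefix):
            return shard
    return DEFAULT_SHARD


for key in ("ty_sku_1001-title", "ty_merch_9987-title", "RSI_faq_title"):
    print(key, "->", shard_for(key))
```

Note that Redis Cluster distributes keys across hash slots by itself, so a manual mapping like this one only matters if sharding is handled by the client rather than by the cluster.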
### Comparison Table

| Criteria                   | Redis | ScyllaDB | ArangoDB |
|----------------------------|-------|----------|----------|
| Scalability (easy)         | x     | ~        |          |
| Better for Key-value store | x     | x        | x        |
| High performance           | x     | x        |          |
| Price (cheap)              | x     | x        | x        |
| Sharding                   | x     | x        | x        |
| Replication                | x     | x        | x        |
| Data persistence           |       |          | x        |
| Multiple GET               | x     | x        |          |
| Single Threaded            |       |          | x        |
| TURBULENT KNOWLEDGE        | x     |          | x        |

### 🔴 ArangoDB:

**Functionality:**
- Multi-model NoSQL database supporting document, graph, and key-value models.

**Capabilities:**
- Offers the flexibility to work with various data models.

**Scalability:**
- ArangoDB supports horizontal scaling through data partitioning (sharding).

**Sharding:**
- Built-in support for sharding.

**Advantages:**
- Support for multiple data models.
- Horizontal scalability for distributed applications.
- Joins and transactions across data models.

**Disadvantages:**
- Smaller user base compared to some other databases.
- Limited third-party tools and libraries.
- Less Turbulent knowledge compared to Redis

### 🟡 ScyllaDB:

**Functionality:**
- NoSQL database compatible with Apache Cassandra, based on a wide-column model.

**Capabilities:**
- Offers high availability, low latency, and supports Cassandra Query Language (CQL).

**Scalability:**
- ScyllaDB uses data partitioning (sharding) and replication for high availability and linear scalability.

**Sharding:**
- Built-in support for sharding.

**Advantages:**
- High availability and low latency.
- Compatibility with Cassandra, easy migration.
- Suitable for real-time applications.

**Disadvantages:**
- Requires some configuration for optimum performance.
- Smaller community compared to Cassandra.
- Never used inside Turbulent

### ⚖️ Choice about database (NO-SQL & Documents)

### 🟢 MongoDB:

**Functionality:**
- Document-oriented NoSQL database.

**Capabilities:**
- Suitable for applications with flexible and scalable data schemas.


**Scalability:**
- MongoDB can be easily scaled horizontally using sharding.

**Sharding:**
- Built-in support for sharding.

**Advantages:**
- Flexible schema, easy horizontal scaling.
- Good performance for read-heavy workloads.
- Rich querying capabilities.

**Disadvantages:**
- Reads from secondary replicas are eventually consistent, which may not be suitable for all applications.
- Indexing and data modeling complexity.

### 🟡 Couchbase:

**Functionality:**
- Distributed NoSQL in-memory database supporting key-value and document-oriented models.

**Capabilities:**
- Offers high-performance caching, multi-master replication, and indexing for efficient data retrieval.

**Scalability:**
- Couchbase achieves horizontal scalability using data partitioning (sharding) and replication.

**Sharding:**
- Built-in support for sharding.

**Advantages:**
- High-performance data retrieval.
- Horizontal scalability and multi-master replication.
- Flexible data modeling.

**Disadvantages:**
- Configuration complexity, requiring careful planning.
- Smaller community compared to some other databases.

### 🔴 ArangoDB:

**Functionality:**
- Multi-model NoSQL database supporting document, graph, and key-value models.

**Capabilities:**
- Offers the flexibility to work with various data models.

**Scalability:**
- ArangoDB supports horizontal scaling through data partitioning (sharding).

**Sharding:**
- Built-in support for sharding.

**Advantages:**
- Support for multiple data models.
- Horizontal scalability for distributed applications.
- Joins and transactions across data models.

**Disadvantages:**
- Smaller user base compared to some other databases.
- Limited third-party tools and libraries.
- Less Turbulent knowledge compared to Redis

### Comparison Table

| Criteria                      | Couchbase | MongoDB | ArangoDB |
|-------------------------------|-----------|---------|----------|
| Scalability (easy)            | x         | x       | x        |
| High performance              |           |         |          |
| Price (cheap)                 | x         | x       | x        |
| Replication                   | x         | x       | x        |
| Can do BOTH (K/V & Documents) | x         |         | x        |
| TURBULENT KNOWLEDGE           |           | x       | x        |
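
Whichever engine is chosen, each stored translation needs the extra state described in the introduction so that an "AI translation state percentage" can be reported. The sketch below shows one possible shape for that record; the `StoredTranslation` type, the state values `"approved"` and `"ai_suggested"`, and the sample states are assumptions for illustration, not Crowdin's or the service's actual schema.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class StoredTranslation:
    key: str
    text: str
    state: str  # hypothetical values: "approved" or "ai_suggested"


def ai_translation_percentage(entries: Iterable[StoredTranslation]) -> float:
    """Share of entries (0 to 100) whose translation is only an AI suggestion."""
    items = list(entries)
    if not items:
        return 0.0
    ai_count = sum(1 for item in items if item.state == "ai_suggested")
    return 100.0 * ai_count / len(items)


# Sample data reusing the keys from this document; the states are made up.
entries = [
    StoredTranslation("ty_sku_1001-title", "Grand vaisseau 10 places", "approved"),
    StoredTranslation("ty_sku_1001-description", "Achetez-le", "ai_suggested"),
    StoredTranslation("ty_merch_9987-title", "Hoodie XXL", "approved"),
    StoredTranslation("ty_merch_9987-description", "Devenez un pilote", "approved"),
]

print(ai_translation_percentage(entries))
```

In a key / value engine the state could live next to the text in the stored value, and in a document engine it could be a field on each token; either way the percentage is computed the same way.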