# Motivation
As per Duygu's targets, here are the key technical points:
1. Big messages must be supported (>10M messages in decent time, i.e. less than 30 min). 20M messages / 30 min is the standard benchmark here and below, our new shiny target.
That's ~11k messages / sec, meaning ~8 CPU cores sending simultaneously at 100% load for 30 min with the current implementation.
On top of the sending itself, that also means ~11k MongoDB updates / sec for 30 min with the current implementation (a back-of-envelope check follows this list).
2. Transactional push support.
Shouldn't be a big issue, the current implementation works well enough. The only thing to keep in mind is that per-push overhead must be minimal: a push != a message, and pushes cannot be stored in a MongoDB collection forever, while messages can.
3. Stability improvement.
We cannot undermine the Countly server's ability to process requests by taking 8 CPU cores away for 30 min, nor can we load MongoDB that much.
4. Control over significant loads.
We need a way to run CPU/RAM-hungry ops away from the Countly server. This is relevant for cohorts as well, and maybe some other parts. That is, to run jobs in a separate box.
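A quick back-of-envelope check of the numbers in point 1 (the per-core rate is simply the target rate divided by the ~8 cores mentioned above):

```ts
// Back-of-envelope check of the 20M / 30 min target.
const targetMessages = 20_000_000;                        // the new shiny target
const windowSeconds = 30 * 60;                            // 30 minutes
const messagesPerSecond = targetMessages / windowSeconds; // ≈ 11,111 msg/s

// Every sent message is also a MongoDB update with the current implementation,
// so this is ~11k updates/s as well; spread over ~8 cores at 100% load that is
// roughly 1,400 messages per second per core.
console.log(Math.round(messagesPerSecond), Math.round(messagesPerSecond / 8));
```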
---------------------
# Database
With all of the above, we cannot run push on Countly's MongoDB: 11k updates/sec for 30 min is simply not feasible. Let's explore our options here:
#### 1. Separate MongoDB database.
We store all push-related data in a separate MongoDB database.
*Pros:*
- same drivers / connection logic;
- can be run on Countly's standard MongoDB for small customers;
- sending queue / statuses can be written / read by Countly code.
*Cons:*
- not scalable in the PaaS model: we won't be able to send 2-5 big messages simultaneously - we'll hit the same MongoDB bottleneck again, and I'm afraid sharding won't help (correct me if I'm wrong);
#### 2. Kafka / RabbitMQ / whatever.
We store the push queue in an MQ. The queue is filled by Countly, which connects to it. Statuses are stored in the MQ and then slowly inserted back into Countly's MongoDB.
*Pros:*
- scalable, distributed, not mongo;
- some level of knowledge in the team (Kafka);
- can be used along with the gateway.
*Cons:*
- quite complex to set up, maintain and use;
- quite an overhead for a simple task;
- cannot be used for small customers;
- would still require transfer of statuses back to MongoDB.
#### 3. Redis / RocksDB / any other distributed KV-store
*Pros:*
- not mongo, can be distributed, scalable even for PaaS case;
- simple installation, minimal learning curve;
- likely to be faster than any of the above;
- can be used by other parts of Countly: pub/sub can be quite handy (this is HUGE in my opinion), distributed caching, ingestion, jobs.
*Cons:*
- still a new thing to set up & maintain;
- drilling across the KV-store & MongoDB would be somewhat questionable, but we already do uids array filtering now, with MongoDB.
**Conclusion:** I propose adding a distributed KV-store to Countly. Quite surprisingly, I don't see much of a choice here.
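To make option 3 a bit more concrete, here is a minimal sketch of what the push queue could look like on Redis. It assumes the `ioredis` client; the key names and status layout are made up for illustration, not a committed design:

```ts
// Minimal sketch: push queue + statuses on Redis via ioredis (hypothetical key names).
import Redis from "ioredis";

const redis = new Redis(); // defaults to localhost:6379

// Producer side: Countly enqueues one record per device token for a message,
// and uses pub/sub to tell the senders there is new work.
async function enqueue(messageId: string, token: string): Promise<void> {
  await redis.lpush("push:queue", JSON.stringify({ messageId, token }));
  await redis.publish("push:new-message", messageId);
}

// Consumer side: a sender pops items, sends them, and records per-token status;
// a job later flushes statuses back into Countly's MongoDB in batches.
async function consumeOne(): Promise<void> {
  const popped = await redis.brpop("push:queue", 5); // block for up to 5 seconds
  if (!popped) return;                               // queue is empty
  const { messageId, token } = JSON.parse(popped[1]);
  // ... actual APN / FCM send goes here ...
  await redis.hset(`push:status:${messageId}`, token, "sent");
}
```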
# Pushin'
Let's list our **motivation** from above again:
- ~8 CPU cores sending simultaneously at 100% load for 30 min with the current implementation;
- that's a minimum of ~16GB of RAM for 30 min with the current implementation;
- we need a way to run CPU/RAM-hungry ops away from the Countly server; this is relevant for cohorts as well, and maybe some other parts.
Another point I'd add is:
- as the push plugin is in CE, it needs to be usable (without too much operational overhead) in low-usage scenarios.
### Requirements:
1. Must be a "pluggable" plugin, that is to have an ability to run on our standard 2-core Countly server with MongoDB in and in huge distributed setups with multiple servers.
2. We need to improve performance as even our new shiny target would require provisioning of a 8-core 16GB RAM server just for push. Taking aside complex solutions like to provision a server for a single message as not feasible.
3. We not only need to support CE, but also heavily closed almost offline setups (usually banks) where we have almost no control over server provisioning.
4. Improve on error reporting and flexibility, read - current implementation changes a lot.
### Conclusion:
For most customers we still have to run on existing Countly servers, when they have spare capacity. Therefore, push must have flexible throttling capability for both CPU & RAM.
At the same time we need to improve performance on the CPU side, as at top loads the current implementation would require too much overprovisioning. As painful as it is, I don't see any way to improve on CPU using Node.js or MongoDB.
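As an illustration of what "flexible throttling" could mean in practice, here is a sketch of a sender loop that backs off based on CPU load and its own memory footprint; the thresholds and the `sendBatch` callback are placeholders, not measured values:

```ts
// Sketch of CPU/RAM-aware throttling for a sender loop (thresholds are examples only).
import os from "os";

const MAX_LOAD_PER_CORE = 0.8;              // back off above 80% average load
const MAX_RSS_BYTES = 512 * 1024 * 1024;    // back off above 512MB of resident memory

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

function overloaded(): boolean {
  const loadPerCore = os.loadavg()[0] / os.cpus().length; // 1-minute load average
  return loadPerCore > MAX_LOAD_PER_CORE
      || process.memoryUsage().rss > MAX_RSS_BYTES;
}

// sendBatch() stands for "send the next N pushes from the queue".
async function senderLoop(sendBatch: () => Promise<void>): Promise<void> {
  while (true) {
    if (overloaded()) {
      await sleep(1000);                    // yield to Countly / MongoDB
      continue;
    }
    await sendBatch();
  }
}
```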
I was about to list possible solutions, but the only one that looks sort of OK to me is this:
1. We now have a horizontally scalable KV-store, installed on the MongoDB servers for now, or even separately in some cases.
2. We refactor jobs so they:
- are stored in our new shiny KV-store;
- possibly using some popular jobs library (they're all KV-store-based, we can finally use them);
- can no longer spawn Node.js subprocesses;
- are run on each server inside the master process, but in a worker thread (all jobs in one thread, or each / some jobs in separate threads - TBD), so jobs won't block the master process' event loop (see the sketch after this list).
3. We refactor the `push` plugin so it:
- stores part of the push data in the KV-store (tokens, statuses, queue);
- keeps storing other data in MongoDB (token booleans, messages, metrics, events, etc.);
- uses open source push libraries (we won't do a Node-native HTTP/2 implementation);
- abstracts from push providers (Moneytree);
- has more consistent error reporting;
- has a Vue UI;
- is ready for the next point.
4. We make a new `push_ee` plugin, which:
- is based on / overrides the `push` plugin;
- has a native app companion (which won't require compilation on Countly install) that is run on each Countly server by jobs whenever a new push task is inserted into the shiny KV-store;
- the native app won't use more than 100MB of RAM and will throttle itself down to XX pushes/sec whenever machine CPU usage exceeds XX%;
- yet the native app can also run completely separately from the Countly servers (a config option would prevent running push jobs on Countly servers): Docker would have a separate container for it, and some big customers might have a dedicated server for it.
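A minimal sketch of the worker-thread part of point 2 above, using Node's built-in `worker_threads`; the job-loading logic is omitted and the file layout is illustrative only:

```ts
// Sketch: run jobs off the master event loop using worker_threads.
import { Worker, isMainThread, parentPort } from "worker_threads";

if (isMainThread) {
  // Master process: spawn one worker for all jobs (or one per job group - TBD).
  const worker = new Worker(__filename);
  worker.on("message", (msg) => console.log("job finished:", msg));
  worker.on("error", (err) => console.error("job worker failed:", err));
} else {
  // Worker thread: fetch due jobs from the KV-store and run them here,
  // so long-running work never blocks the master process' event loop.
  (async () => {
    // ... load and execute due jobs ...
    parentPort?.postMessage({ ok: true });
  })();
}
```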
A few other notable points:
- using open source push libraries in CE would degrade performance for APN, but we don't care about that;
- yet using the native app in EE would dramatically increase it, hopefully by at least 2x on the CPU side while decreasing RAM usage at least 10x thanks to KV-store streaming;
- the KV-store must be horizontally scalable;
- with this setup it's worth thinking about having a single KV-store in our GCE for all our GCE customers, as well as a dedicated bunch of pushin' servers or an autoscaling group (sort of PaaS);
- possibly we can have jobs & the `push` plugin store all data in MongoDB, removing the need for a KV-store in small (push-wise) setups, but I really think Countly as a whole can benefit from it, and we still need it for the 20M case (see the interface sketch after this list);
- we need to check feasibility first.
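To illustrate the "MongoDB-only for small setups" point, the queue could sit behind a storage-agnostic interface along these lines; the names and method shapes are hypothetical:

```ts
// Sketch: a storage-agnostic push queue so small setups can stay on MongoDB
// while big ones switch to the KV-store (names are illustrative).
interface PushQueueItem {
  messageId: string;
  token: string;
}

interface PushQueue {
  enqueue(items: PushQueueItem[]): Promise<void>;
  pop(batchSize: number): Promise<PushQueueItem[]>;
  setStatus(messageId: string, token: string, status: "sent" | "failed"): Promise<void>;
}

// Small (push-wise) setups: class MongoPushQueue implements PushQueue { ... }
// 20M-in-30-min setups:     class RedisPushQueue implements PushQueue { ... }
```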
And a few disadvantages, to discuss whether it's worth it:
- the native app itself;
- the operational burden of the KV-store shouldn't be too big, but still;
- maintaining effectively two push implementations for the "last mile" part.