
# Scaling ToDo

## Still needed right away for AWS **(all sub-items: 1 to 2 hours)**

- Postgres
    - [x] Check AWS settings (encryption on, cap the maximum number of instances, etc.) + some code cleanup
    - [x] Run the migration on Staging; must work without extra manual steps
    - [x] Regular production update (for app-booster) **18-2**
    - [x] Provision new postgres hosts in Grafana config **@jeroen this week**
    - [x] Remove postgres backup script (from ansible/web-api) **@jeroen this week**
    - [x] Deploy postgres on production / US **Early next week**
- Web-API code for Fargate ECS **@jeroen before 22-2**
    - [x] Deploy firmware bucket on root account
    - [x] Finish back-end code / review / deploy / test on sandbox **@anirudhbisht**
    - [x] (Re-)move redis-migration code (to app-backend)
    - [x] Move queue initialisation code to app-backend -> disabled on Fargate (but still runs on EC2) **@jeroen before 22-2**
    - [x] Replace internal HTTP endpoints with Redis pub/sub (since internal HTTP is no longer available on Fargate)
    - [x] Replace offline detection with queue jobs that are overwritten on every transmission
- ECS Fargate
    - [ ] Add CI step that redeploys web-api on Fargate ECS **@anirudh?**
    - [x] Deploy Fargate ECS on develop **Goal: 23-2**
    - [x] Improve stress test (= unreasonably heavy) **@jeroen Before 24-2**
    - [ ] Deploy & execute stress test on staging **Before 26-2**
        - Goal is at least 10k systems on Staging
        - Depending on outcome, rescale redis/postgres/fargate
        - Influx is allowed to be the bottleneck (time-series DB will be sorted out afterwards)
    - [ ] Deploy on Production / US

## Deployment

- [ ] Always update Staging whenever web-api code changes / attach dedicated bedsenses to Fargate
- [ ] After deploying -> move a small number of bedsenses over / check

## Also important

- Notifications at all times **3 days**
    - Sync the info needed for notifications between Postgres/Redis
    - Use only Redis for the entire notification system
- More reliable & faster dashboard
    - Replace the custom WebSocket with socket.io (reconnecting / fallback to the normal API) **1 hour**
    - Make everything evented (keep only timesync) **2 days**
    - Ping from the app / respond with server_time (instead of periodically sending server_time as the last liveData) **half a day + someone from app-booster**
- More scalable Redis code
    - Replace all `.all()` calls with a specific set of device_ids (e.g. from current_beds) **half a day**
- Data reduction from the ESP **1 day / + Anirudh or Danny**
    - Check all General Messages for
        - Whether they are needed at all
        - Sending interval
    - Possibly combine GeneralMessages based on the desired interval
- Data validation on the ESPs
    - Edge cases where a bedsense has no time sync yet after booting and sends data that is more than 1 minute out of sync. This data is currently ignored (not ignoring it could cause missed events afterwards, because the most recent change may then lie in the future, so that interim changes are ignored)
- Mission-critical time series
    - Move StateChange to a managed time-series DB (VictoriaMetrics or AWS-native) **1 week**
    - Check whether there are other mission-critical tables for the dashboard / move them **1 hour**
    - (Nasty, but nice to have) backfill stateChange retroactively based on restlessness/stateDataSmooth/... -> historical graphs are then complete as well **1 week**
    - Analyse impact of time-series changes:
        - [ ] StateChange impact (how many devices lose what amount of history if only stateChange is kept / not retroactively filled)
- Cost-optimize! (current rate of credit usage is way higher than needed; nothing will be left before expiry, followed by ~8k monthly costs afterwards)

## Cost optimisation

- [ ] Reserved instances! (for the Redis and Postgres clusters; EC2 only AFTER the Influx migration)
- [ ] Postgres Aurora -> t4g instances ($77/month)
- [ ] S3 storage class (Glacier for backups etc.)
- [ ] CloudWatch -> retention policy?

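
The credit concern noted above (nothing left before expiry, then ~8k monthly costs) comes down to simple runway arithmetic. Below is a minimal sketch of that calculation. Only the ~8k monthly figure comes from this list; the credit balance and the number of months until expiry are hypothetical placeholders, and `credit_runway` and its field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CreditRunway:
    months_covered: float     # months the credits last at the current spend
    months_uncovered: float   # months before expiry that credits do not cover
    out_of_pocket: float      # cash needed for the uncovered months
    sustainable_spend: float  # monthly spend at which credits last until expiry


def credit_runway(balance: float, monthly_spend: float,
                  months_to_expiry: float) -> CreditRunway:
    """Estimate how far a credit balance stretches before the credits expire."""
    if monthly_spend <= 0 or months_to_expiry <= 0:
        raise ValueError("monthly_spend and months_to_expiry must be positive")
    # Credits cannot be used past their expiry date, so cap at months_to_expiry.
    months_covered = min(balance / monthly_spend, months_to_expiry)
    months_uncovered = months_to_expiry - months_covered
    return CreditRunway(
        months_covered=months_covered,
        months_uncovered=months_uncovered,
        out_of_pocket=months_uncovered * monthly_spend,
        sustainable_spend=balance / months_to_expiry,
    )


if __name__ == "__main__":
    # Hypothetical inputs: 40k credits left, expiring in 12 months.
    # The 8k/month spend is the figure mentioned in the list above.
    result = credit_runway(balance=40_000, monthly_spend=8_000, months_to_expiry=12)
    print(result)
```

With these placeholder inputs the credits cover 5 of the 12 months, and spend would need to drop to roughly 3.3k per month for the credits to last until expiry. The real balance and expiry date should be taken from the AWS billing console.
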
## Nice-to-have

- Clean up Postgres
    - Move user_logs to Influx **2 days**
- Clean up Influx
    - Assess all tables for usefulness / remove where unnecessary (e.g. much of it is no longer sent at all by models 10/11) **2 days**
    - Dump pre-CloudWatch logs to S3 Glacier
    - Ideally
        - Managed time-series DB containing stateChange (only?)
        - New Influx / selectively migrate measurements as wanted/needed by Algo
- Cost optimisation
    - Change storage from S3 to S3 Glacier **1 day**
    - Scale down the production EC2 (should be fine once the items above are done -> Influx RAM is the bottleneck)
    - Downsize the develop and staging Postgres and Redis clusters, and scale them up when needed using automated scripts
    - Reserved instances (at least for EC2, possibly for the Postgres/Redis clusters) **half a day**
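
The move from S3 to S3 Glacier does not have to be a one-off copy: an S3 lifecycle rule can transition objects automatically. Below is a minimal sketch of such a rule, in the dictionary shape accepted by boto3's `put_bucket_lifecycle_configuration`. The `backups/` prefix, the 30-day delay, the rule ID and the bucket name are assumptions for illustration, not values from this list.

```python
import json


def glacier_lifecycle(prefix: str, days: int, rule_id: str = "to-glacier") -> dict:
    """Build a lifecycle configuration that moves objects under `prefix`
    to the GLACIER storage class `days` days after creation."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},  # only objects under this prefix
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


if __name__ == "__main__":
    config = glacier_lifecycle("backups/", 30)
    print(json.dumps(config, indent=2))
    # Applying it needs boto3 and AWS credentials (bucket name is hypothetical):
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="example-backup-bucket", LifecycleConfiguration=config)
```

Keeping the rule as plain data makes it easy to review and to apply identically to the develop, staging and production buckets.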