# project water tower
_how best to use the animaniacs_
## background
[hachyderm infra](https://hackmd.io/vcn5avQaQr-yCthuUYZBLg)
we have alice as our powerhouse, and wakko (soon to be joined by yakko and dot) as our first compute node. there are a few ways we can use these new compute additions to scale out the stack, and this doc covers both what those options are and how to decide between them.
### scaling challenges
before getting into how we might scale, it's worth digging into what the scaling challenges actually are. [scaling up your server](https://docs.joinmastodon.org/admin/scaling/) is a good starting point for this discussion, with suggestions both for scaling ingress (the `WEB_CONCURRENCY` and `MAX_THREADS` environment variables) and for scaling workers (running separate `sidekiq` processes).
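for concreteness, a minimal sketch of those knobs (the values here are illustrative placeholders, not a recommendation for our setup):

```
# .env.production: ingress scaling knobs from the mastodon scaling doc
WEB_CONCURRENCY=4   # number of puma worker processes
MAX_THREADS=10      # threads per puma worker process
# workers scale by running additional sidekiq processes, each pinned to queues, e.g.:
# bundle exec sidekiq -c 25 -q default -q push -q mailer -q pull
```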
going further, [scaling mastodon](https://blog.joinmastodon.org/2017/04/scaling-mastodon/) suggests that running `sidekiq` processes on different machines allows for greater scalability.
[recent scaling challenges](https://hackmd.io/W1O8EGXWRSi0TOqaofG5gw) have mainly shown up as `sidekiq` queue latency, so that is the obvious first place to look. that picture may be biased, however: our [current observability](https://grafana.hachyderm.io/d/dEovIID4z/mastodon?orgId=1&refresh=30s) stack focuses mainly on `sidekiq`, since that is what the statsd exporter in Mastodon readily exposes, and we have not yet implemented nginx observability.
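for reference, that existing `sidekiq` metrics path is roughly the following (the address is an assumption about our setup, not confirmed):

```
# .env.production on alice: point mastodon's built-in statsd support at a
# prometheus statsd_exporter, which grafana then reads from
STATSD_ADDR=127.0.0.1:9125
```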
anecdotal evidence suggests that even though the `sidekiq` queues were largely under control during "the @ian incident", the website was still sluggish. this could be back pressure on ingress from the `sidekiq` delays and retries, or ingress itself not scaling appropriately for the number of incoming views/requests.
## options
### animaniacs run `sidekiq`
given the above (possibly biased) read of our recent scaling issues, the obvious move is to shift the `sidekiq` workers over to the animaniacs, where they can be scaled up to significant numbers of threads and run totally independently.
we can distribute the queues like:
- wakko: default, mailer
- yakko: push, default, pull, mailer
- dot: pull, default, push, mailer
- alice: scheduler, default
this should give us a reduction in load on alice with a good boost across the other queues (see the sketch of the per-node invocations below).
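spelled out as `sidekiq` invocations, the distribution above might look like this (queue order sets strict priority; the `-c` thread counts are placeholders rather than a proposal):

```
# wakko
bundle exec sidekiq -c 25 -q default -q mailer
# yakko
bundle exec sidekiq -c 25 -q push -q default -q pull -q mailer
# dot
bundle exec sidekiq -c 25 -q pull -q default -q push -q mailer
# alice
bundle exec sidekiq -c 5 -q scheduler -q default
```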
#### benefits
- load reduction from alice not having to run >200 `sidekiq` threads across 3 processes
- this CPU can then go to serving the web and DB processes
#### challenges
it's not clear how to run the `sidekiq` worker processes on different machines from the web, though it is apparently possible.
"Sidekiq cannot reliably use read-replicas because even the tiniest replication lag leads to failing jobs due to queued up records not being found." [[citation](https://docs.joinmastodon.org/admin/scaling/)]. i.e., we couldn't put read replicas on the compute nodes, which means we'll be incurring some (small) network overhead between the `sidekiq` workers and alice hosting the r/w postgresql DB. this also requires us to carefully coordinate the worker process sizes as the total number of `sidekiq` worker threads (configured per compute node) can't be greater than `DB_POOL` (configured on alice).
### animaniacs run `mastodon-web`
given how well-suited alice is to managing the DB and `sidekiq` workers, and how we need the `sidekiq` workers to work against the r/w primary rather than read replicas, it may make sense to have the animaniacs run `mastodon-web` instances instead.
the SSDs they have are large enough to host a (small) static content cache, and `mastodon-web` itself doesn't require that much in the way of CPU/RAM.
#### benefits
transparent horizontal scaling for user requests takes unnecessary CPU load off alice's nginx process and spreads it (behind a loadbalancer) trivially. that extra CPU on alice can then go towards the much more important `sidekiq` worker pools and the database, which must be colocated.
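the loadbalancing piece could be as simple as an nginx upstream on alice (the hostnames and ports below are assumptions, not our actual config):

```
# hypothetical upstream fanning web requests out to mastodon-web on the animaniacs
upstream mastodon-web {
    least_conn;
    server wakko.internal:3000;
    server yakko.internal:3000;
    server dot.internal:3000;
}
```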
#### challenges
the same challenge applies here in reverse: it's not clear how to run `mastodon-web` on different machines from the `sidekiq` workers and DB, though it is apparently possible.
## how to decide
having nginx observability (beyond the current blackbox probing) may provide some insight into the relative workloads, but the shared challenge of running `sidekiq` separately from `mastodon-web` remains either way.
[2022-11-09] with the [nginx dashboard](https://grafana.hachyderm.io/d/GxDpHJvVz/nginx?orgId=1&from=now-3h&to=now&refresh=30s) up, we ran tests increasing the worker threads for `mastodon-web` and saw no change in CPU use or http request states. this, coupled with @Tani's observation that the slow queries had `no-cache` directives, suggests that we are not HTTP bound.
as such, using the animaniacs for `sidekiq` seems like the best plan.
## task list
- [ ] implement nginx observability
- [X] run nginx exporter
- [X] expose `/metrics` on nginx (see the sketch after this list)
- [ ] ~~import [nginx dashboard](https://grafana.com/grafana/dashboards/14900-nginx/)~~
- [X] build custom nginx dashboard
- [ ] decide which option is best from above
- [ ] figure out how to implement it
- [ ] add observability on compute nodes
- [ ] implement it
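for the record, the nginx observability pieces above boil down to something like this (the paths, ports, and exporter invocation are assumptions about our setup, not a confirmed config):

```
# expose nginx's stub_status on localhost for an exporter to scrape
server {
    listen 127.0.0.1:8080;
    location /stub_status {
        stub_status;
        allow 127.0.0.1;
        deny all;
    }
}
# the prometheus nginx exporter then turns that into /metrics, e.g.:
# nginx-prometheus-exporter -nginx.scrape-uri=http://127.0.0.1:8080/stub_status
```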