hey, sorry if I wasn't communicating clearly on Twitch. I have experience with the Mastodon codebase (former contributor), and I wrote one of the first node.js websocket servers; trying to help via chat while you stream isn't easy, so I'm more than happy to help debug via voice chat or on Discord.
I don't think the 5xx's or alerts are caused by rack-attack (which rate limits connections from clients that are spamming requests, likely due to unstable mobile networks).
This is the rack-attack configuration, so you can see the different rules it's applying: https://github.com/mastodon/mastodon/blob/main/config/initializers/rack_attack.rb#L57
Essentially, if node streaming were talking directly to the ruby server (rack), it would (hopefully) be over localhost and therefore not rate limited by those rules, since localhost connections are excluded from the rate limiter. (To clarify: node.js is _not_ talking to ruby/rack; it talks only to postgresql and redis.)
This is the logging setup for rack-attack, which is where the rate limit messages are logged:
The errors that look like a potential service outage also shouldn't be from node.js streaming, because those log lines aren't really "errors": that's just a developer writing noisy logging. They are, at best, "info" log lines that have been incorrectly labelled as "error".
On websocket servers, a broadcast (in node.js, essentially a for-each over an array of connections captured at a point in time) will end up iterating over some connections that are no longer open: they've been closed, or their connection was interrupted (bad networks). Because node.js is single threaded, those connections still exist in the array being iterated, and because the iteration blocks the event loop, the cleanup of the connections array can't take place until it finishes.
So the code has to be defensive and verify "can I actually write to this socket?" before each write, and the mastodon developers are logging those failures, which are part of normal operation. (tl;dr: node.js isn't great if you're iterating a LOT of items in an array and doing work per item, as it introduces lag that can affect other parts of the server, such as cleaning up closed connections. It's a bit like iterating a list of connections in Go without holding a lock/mutex on the list (my Go knowledge isn't great, bear with me, this is an analogy) and then wondering why you have bad data in the list.)
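To make the defensive-write idea concrete, here's a minimal sketch. `OPEN` mirrors the standard WebSocket `readyState` value (1), and the sockets are plain stand-in objects; the function name and shape are illustrative, not Mastodon's actual code.

```javascript
// readyState values from the WebSocket spec; only OPEN sockets are writable.
const OPEN = 1;

function broadcast(connections, message) {
  let written = 0;
  for (const socket of connections) {
    // A connection may have dropped since it was added to the array,
    // so check before every write rather than assuming it's still open.
    if (socket.readyState === OPEN) {
      socket.send(message);
      written++;
    }
    // Hitting a closed socket here is normal operation, not an error
    // worth logging loudly.
  }
  return written;
}
```

Skipping closed sockets silently (or logging at "info") keeps normal churn from unstable clients out of the error logs.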
To improve node.js's performance here, it's often desirable to chunk the iteration into a series of smaller batches, so that you can yield back to the event loop more frequently. A good metric to track would be the percentage of connections a single write goes to. If the mastodon server has one highly followed user, a post by them, especially in a "busy" timezone for the instance, will result in unbalanced write behaviour: that one message means iterating over far more connections than usual (one per follower connected to streaming), so you can easily end up doing 40,000 network writes, temporarily blocking node.js from processing disconnections correctly.
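The chunking approach can be sketched like this: write to a fixed-size batch of sockets, then schedule the next batch with `setImmediate` so close/cleanup handlers get a chance to run in between. The batch size and names are illustrative assumptions, not Mastodon's code.

```javascript
const OPEN = 1;        // WebSocket readyState for a writable socket
const BATCH_SIZE = 500; // illustrative; tune for your workload

function broadcastChunked(connections, message, done) {
  let index = 0;
  function writeBatch() {
    const end = Math.min(index + BATCH_SIZE, connections.length);
    for (; index < end; index++) {
      const socket = connections[index];
      if (socket.readyState === OPEN) socket.send(message);
    }
    if (index < connections.length) {
      // Yield to the event loop so disconnect handlers can run
      // before the next batch of writes.
      setImmediate(writeBatch);
    } else if (done) {
      done();
    }
  }
  writeBatch();
}
```

With 40,000 connected followers this turns one long blocking loop into ~80 short ones, and the cleanup work interleaves between them instead of piling up.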
I hope this helps! Sorry for all the DMs, but I couldn't explain clearly in Twitch chat what was happening.