# hachyderm postmortem: fritz overload 2023-01-03

_please do not change the format or delete sections_

_fill out anything in []_

| <!-- --> | <!-- --> |
|---------------|----------------------|
| Author | @dma |
| Collaborators | |
| Status | draft |

## executive summary

| <!-- --> | <!-- --> |
|------------|----------------|
| Impact | spikes in response times and "streaming down" alerts in discord |
| Root Cause | too much CPU being used on fritz |

## problem summary

| <!-- --> | <!-- --> |
|---------------------|-----------------|
| Duration of problem | ~40m |
| User impact | users experienced very long response times and 500s |
| Detection | alerts fired in discord |
| Resolution | changed mastodon-streaming service config on fritz |

## background

<!-- what does a reader need to know to understand the rest of the doc -->

fritz runs mastodon-web and mastodon-streaming, and all other web nodes proxy to fritz. mastodon-web was configured with 16 processes, each with 20 threads. mastodon-streaming was configured with 16 processes.

## root causes and trigger

organic growth in users and traffic, coupled with US users returning from vacation, pushed CPU usage on fritz consistently above 90%, so responses failed to be returned to the upstream web frontends.

<!-- the root cause is what is at the heart of what happened. root cause analysis is the most important part of the post-mortem. -->

<!-- the trigger is what caused the issue to occur. this may not be the same as the root cause -->

## Impact

<!-- what was the impact of the issue in terms of user experience or necessary changes to infrastructure -->

p90 response times grew from ~400ms to >2s. 502 responses increased to >1000 per minute.

## Lessons Learned

<!-- what have we learned that we can take away from this incident? -->

response times are very sensitive to puma threads (reducing from 20 to 16 threads per process doubled GET response times).
the site functions pretty well even with fewer streaming processes.

## Things that went well

<!-- celebrate the good things in life -->

we had the core CPU load on the public dashboard.

## Things that went poorly

<!-- what would we have preferred had not happened during the response? -->

in an attempt to get things under control, both mastodon-streaming and mastodon-web were changed. the puma change was then reverted, as we had over-corrected and response times were getting quite bad.

## Where we got lucky

<!-- we all get a little lucky sometimes -->

@dma was already keyed in to fritz thanks to an earlier issue where certs hadn't been renewed.

## Action items

<!-- what could or did we change to either prevent this issue from happening again, detect it sooner, or mitigate the issue? -->

<!-- "type" is one of "repair", "prevent", or "detect". -->

| Action item | Type | GitHub Issue |
|-------------|--------------|--------------|
| reduce the number of streaming processes on fritz from 16 to 12 | repair | n/a |
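
for reference, a minimal sketch of the concurrency arithmetic behind the numbers in this doc. it uses only the figures recorded here (16 puma processes, 20 vs 16 threads per process, 16 vs 12 streaming processes). the class and function names are hypothetical, and it assumes each puma thread serves one request at a time, so processes x threads is an upper bound on in-flight requests on fritz.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PumaConfig:
    """Hypothetical holder for the mastodon-web (puma) settings in this doc."""
    processes: int
    threads_per_process: int

    @property
    def request_slots(self) -> int:
        # Assumption: one in-flight request per thread, so this is an upper bound.
        return self.processes * self.threads_per_process


def pct_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) * 100 / old


# Figures taken from the background and lessons learned sections.
puma_before = PumaConfig(processes=16, threads_per_process=20)
puma_trial = PumaConfig(processes=16, threads_per_process=16)  # later reverted
streaming_before, streaming_after = 16, 12

print("puma slots before:", puma_before.request_slots)
print("puma slots during trial:", puma_trial.request_slots)
print("puma slot change (%):",
      pct_change(puma_before.request_slots, puma_trial.request_slots))
print("streaming process change (%):",
      pct_change(streaming_before, streaming_after))
```

this says nothing about CPU headroom on fritz, since the core count is not recorded here. it only shows that the puma trial removed a fifth of the request slots, while the streaming change removed a quarter of the streaming processes.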