# hachyderm postmortem: fritz overload 2023-01-03
_please do not change the format or delete sections_
_fill out anything in []_
| <!-- --> | <!-- --> |
|---------------|----------------------|
| Author | @dma |
| Collaborators | |
| Status | draft |
## executive summary
| <!-- --> | <!-- --> |
|------------|----------------|
| Impact | spikes in response times and "streaming down" alerts in discord |
| Root Cause | CPU on fritz was consistently above 90%, so responses failed to reach the upstream web frontends |
## problem summary
| <!-- --> | <!-- --> |
|---------------------|-----------------|
| Duration of problem | ~40m |
| User impact | users experienced very long response times and 502 errors |
| Detection | alerts fired in discord |
| Resolution | changed mastodon-streaming service config on fritz |
## background
<!-- what does a reader need to know to understand the rest of the doc -->
fritz runs both mastodon-web and mastodon-streaming, and all other web nodes
proxy to fritz.
mastodon-web was configured with 16 processes, each having 20 threads.
mastodon-streaming was configured with 16 processes.
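
to make the starting configuration concrete, here is a minimal python sketch that totals the processes and threads sharing fritz's CPU. the `ServiceConfig` class and `summarize` helper are hypothetical names used only for illustration, and counting each streaming process as a single thread is an assumption, not something measured during the incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical model of one service's concurrency settings."""
    name: str
    processes: int
    threads_per_process: int = 1  # assumption: one thread unless stated otherwise

    @property
    def total_threads(self) -> int:
        return self.processes * self.threads_per_process


def summarize(configs):
    """Total the processes and threads across all services on one host."""
    return {
        "processes": sum(c.processes for c in configs),
        "threads": sum(c.total_threads for c in configs),
    }


# values taken from the background section above
web = ServiceConfig("mastodon-web", processes=16, threads_per_process=20)
streaming = ServiceConfig("mastodon-streaming", processes=16)

before = summarize([web, streaming])
print(before)  # {'processes': 32, 'threads': 336}
```

under these assumptions, 32 processes were competing for CPU on a single host before any changes were made.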
## root causes and trigger
organic growth in users and traffic, coupled with US users returning from
vacation, pushed CPU on fritz consistently above 90%. as a result, responses
failed to be returned to the upstream web frontends.
<!-- the root cause is what is at the heart of what happened. root cause
analysis is the most important part of the post-mortem. -->
<!-- the trigger is what caused the issue to occur. this may not be the
same as the root cause -->
## Impact
<!-- what was the impact of the issue in terms of user experience or
necessary changes to infrastructure -->
p90 response times grew from ~400ms to >2s.
502 responses increased to >1000 per minute.
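
for readers unfamiliar with the metric, p90 is the response time that 90% of requests stay at or below. the sketch below computes it with the nearest-rank method. the function name and the sample values are made up for illustration and are not data from the incident, and the dashboard may use a different percentile method.

```python
def nearest_rank_percentile(samples_ms: list[int], pct: int) -> int:
    """Return the pct-th percentile of samples_ms using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be an integer in (0, 100]")
    ordered = sorted(samples_ms)
    # integer ceiling of pct/100 * n, giving a 1-indexed rank
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# illustrative (made-up) response times in milliseconds
healthy = [150, 180, 200, 220, 250, 280, 300, 350, 400, 900]
overloaded = [300, 450, 600, 800, 1200, 1500, 1800, 2000, 2400, 5000]

print(nearest_rank_percentile(healthy, 90))     # 400
print(nearest_rank_percentile(overloaded, 90))  # 2400
```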
## Lessons Learned
<!-- what have we learned that we can take away from this incident? -->
response times are very sensitive to the number of puma threads (reducing from
20 to 16 threads per process doubled GET response times).
the site functions pretty well even with fewer streaming processes.
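
the arithmetic behind these two observations can be sketched as follows. `percent_change` is a hypothetical helper, and the sketch assumes the puma process count stayed at 16 while the thread count was changed.

```python
def percent_change(before: int, after: int) -> float:
    """Relative change from before to after, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) * 100 / before


PUMA_PROCESSES = 16  # assumption: unchanged while threads were adjusted

puma_threads_before = PUMA_PROCESSES * 20  # 320 threads in total
puma_threads_after = PUMA_PROCESSES * 16   # 256 threads in total

puma_change = percent_change(puma_threads_before, puma_threads_after)
streaming_change = percent_change(16, 12)  # streaming processes, 16 -> 12

print(puma_change)       # -20.0
print(streaming_change)  # -25.0
```

under these assumptions, a 20% cut in puma threads doubled GET response times, while a 25% cut in streaming processes was tolerated well.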
## Things that went well
<!-- celebrate the good things in life -->
we had the core CPU load on the public dashboard.
## Things that went poorly
<!-- what would we have preferred had not happened during the response? -->
in an attempt to get things under control, the configs for both
mastodon-streaming and mastodon-web (puma) were changed. the puma change was
then reverted because we had over-corrected and response times were getting
quite bad.
## Where we got lucky
<!-- we all get a little lucky sometimes -->
@dma was already keyed in to fritz thanks to an earlier issue where
certs hadn't been renewed.
## Action items
<!-- what could or did we change to either prevent this issue from
happening again, detect it sooner, or mitigate the issue? -->
<!-- "type" is one of "repair", "prevent", or "detect". -->
| Action item | Type | GitHub Issue |
|-------------|--------------|--------------|
| reduce the number of streaming processes on fritz from 16 to 12 | repair | n/a |