dominic hamon - HackMD

hachyderm postmortem: fritz overload 2023-01-03
please do not change the format or delete sections. fill out anything in [].
Author: @dma
Collaborators:
dominic hamon changed 3 years ago
hachyderm SLOs
in an effort to be more targeted with our efforts to keep the site online, it seems reasonable to consider SLO definitions. this will allow us to both focus where focus is necessary, and to stop dashboard trawling and rely on alerts instead.

this document will define SLOs per grafana dashboard, for some level of structure.

nginx SLOs dashboard
[X] 90th %ile response times (GET 200) [global]: 10s for 1m
under normal circumstances we are around 1s - 3s. issues are reported during periods of much higher response times (10s+)
dominic hamon changed 3 years ago
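a minimal sketch of how the "10s for 1m" target above could be checked against a series of p90 readings. the `Sample` type, the function name and the 15s scrape spacing in the usage note are assumptions for illustration, not how grafana evaluates the rule internally.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    ts: float   # sample time in seconds (hypothetical field name)
    p90: float  # 90th %ile response time in seconds


def slo_breached(samples, threshold_s=10.0, for_s=60.0):
    """Return True if p90 stayed above threshold_s for at least for_s seconds."""
    breach_start = None
    for s in sorted(samples, key=lambda x: x.ts):
        if s.p90 > threshold_s:
            # remember when the current run of slow samples began
            if breach_start is None:
                breach_start = s.ts
            if s.ts - breach_start >= for_s:
                return True
        else:
            # any healthy sample resets the run
            breach_start = None
    return False
```

with samples every 15s, five consecutive readings above 10s span a full minute and count as a breach; a single healthy reading in between resets the clock.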
slow responses
alert
Value: [ var='C0' metric='max(nginxlog_http_response_time_seconds{quantile="0.9",status=~"2[0-9]*",app="hachyderm",method="GET"})' labels={} value=11.141 ]
Labels:
- alertname = global 90th %ile response time (GET 200)
- grafana_folder = nginx alerts
Annotations:
- description = global 90th %ile response times (for GET 200s) have been >10s for 3m
- summary = HTTP responses are too slow
Source: https://grafana.hachyderm.io/alerting/grafana/YfwP_DK4z/view?orgId=1
Silence: https://grafana.hachyderm.io/alerting/silence/new?alertmanager=grafana&matcher=alertname%3Dglobal+90th+%25ile+response+time+%28GET+200%29&matcher=grafana_folder%3Dnginx+alerts
dominic hamon changed 3 years ago
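to make the alert query easier to read, here is a rough sketch of what its label matchers select and what `max(...)` does with the result. prometheus regex matchers are fully anchored, so `status=~"2[0-9]*"` is modelled with `re.fullmatch`. the sample series are made up for illustration (only the 11.141 value comes from the alert above), and python's `re` is standing in for prometheus's RE2 engine, which agrees for this simple pattern.

```python
import re

# matchers taken from the alert query above
EXACT = {"quantile": "0.9", "app": "hachyderm", "method": "GET"}
REGEX = {"status": "2[0-9]*"}


def matches(labels):
    """Return True if a series' labels satisfy every matcher in the query."""
    for name, want in EXACT.items():
        # a missing label is treated as the empty string
        if labels.get(name, "") != want:
            return False
    for name, pattern in REGEX.items():
        # prometheus anchors regex matchers at both ends
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def max_p90(series):
    """series is a list of (labels, value) pairs; returns the max of the matches."""
    selected = [value for labels, value in series if matches(labels)]
    return max(selected) if selected else None
```

only GET series with a status starting with 2 take part in the max, so a slow 500 or a slow POST would not trip this particular alert.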
Hachyderm v2 Proposal
Introduction
We need to move hachyderm.io out of Nóva's basement as the internet connectivity and UPS are shaky and we seem to have "cabling problems" during Twitch Streams. It is hard to define the hardware requirements as we still see significant growth of hachyderm's user base. Still, we need to start somewhere. The starting point should be a setup that can be quickly set up but also quickly torn down, without long contractual bindings.

Status Quo
Right now the hachyderm infrastructure can be separated into two parts: The Core and Content Delivery.

The Core
The Core lives in Nóva's basement: alice, yakko and wakko.
dominic hamon changed 3 years ago
project water tower
how best to use the animaniacs

background
hachyderm infra
we have alice as our powerhouse, and wakko (soon to be joined by yakko and dot) as our compute node. there are a few ways we can use these new compute additions to scale out the stack, and this doc is concerned both with what those options are and with how to decide between them.

scaling challenges
before getting into how we might scale, it's worth digging into what the scaling challenges might be. "scaling up your server" is a good starting point for this discussion, with suggestions both for scaling ingress (WEB_CONCURRENCY and MAX_THREADS environment variables) and for scaling workers (run separate sidekiq processes).
dominic hamon changed 3 years ago
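as a back-of-the-envelope aid for the two knobs mentioned above, here is a small sketch that adds up threads for a given plan. it assumes WEB_CONCURRENCY is the number of web worker processes and MAX_THREADS the threads per process, and that each thread may hold one database connection. the `StackPlan` type, the queue names and all the numbers are hypothetical, not hachyderm's actual settings.

```python
from dataclasses import dataclass, field


@dataclass
class StackPlan:
    web_concurrency: int  # WEB_CONCURRENCY: web worker processes
    max_threads: int      # MAX_THREADS: threads per web worker process
    # separate sidekiq processes, keyed by an illustrative name -> thread count
    sidekiq_processes: dict = field(default_factory=dict)

    def web_threads(self):
        """Requests the web tier can work on at the same time."""
        return self.web_concurrency * self.max_threads

    def sidekiq_threads(self):
        """Jobs all sidekiq processes can work on at the same time."""
        return sum(self.sidekiq_processes.values())

    def db_connections_upper_bound(self):
        """Assumes one database connection per thread (a simplification)."""
        return self.web_threads() + self.sidekiq_threads()


# hypothetical plan: 4 web processes x 8 threads, two sidekiq processes
plan = StackPlan(4, 8, {"default": 25, "push_pull": 25})
print(plan.web_threads(), plan.sidekiq_threads(), plan.db_connections_upper_bound())
```

the point of summing things this way is that adding compute nodes raises the thread totals, and the database has to be able to accept the resulting connection count.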
public grafana
Hello World!
we won't share any of our existing dashboards publicly but will create a specific dashboard with charts we feel comfortable sharing.

how to share a grafana dashboard publicly
grafana docs
tl;dr: opt in to alpha feature, create dashboard, share.
in progress: here
dominic hamon changed 3 years ago