# 🌈 IPFS Bifrost - 2019 Q4 - Objectives and Key Results
_What is the main thing?_
- Provide actionable insights to IPFS developers
- Provide a great service to the community
- Be safe, observable, reliable, happy & open.
## IPFS infrastructure informs and improves go-ipfs development
- P0 traffic is mirrored to a staging gateway running latest go-ipfs
- P1 all devs have access to ipfs logs from production and staging infrastructure
- P1 nightly and on-demand pprof profiles from all nodes are published to a shared repo (see the sketch after this list)
- P2 bifrost team identifies 5 performance issues with go-ipfs
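As a rough sketch of the profile-publishing step (assuming the default go-ipfs API address `127.0.0.1:5001` and the standard `net/http/pprof` endpoints the daemon exposes there; the duration and output naming are placeholders), a nightly cron/CI job could be as small as:

```go
// Rough sketch: grab a 30s CPU profile from a go-ipfs node via the pprof
// endpoints on the API port, and write it to a timestamped file that a
// nightly job could push to the shared repo. Host, duration and filename
// are assumptions, not the agreed setup.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// 30-second CPU profile; heap, goroutine, etc. live under the same /debug/pprof/ prefix.
	url := "http://127.0.0.1:5001/debug/pprof/profile?seconds=30"
	client := &http.Client{Timeout: 45 * time.Second}

	resp, err := client.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "profile request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Timestamped filename so nightly runs don't collide in the repo.
	out, err := os.Create("cpu-" + time.Now().UTC().Format("2006-01-02T15-04") + ".pprof")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote", out.Name())
}
```

Committing the resulting files from cron/CI would cover the "published to a shared repo" part; the on-demand case is the same request run by hand.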
## IPFS infrastructure is stable and scales to meet demand
- P0 2 preload/delegate nodes are provisioned, monitored and documented
- Companion is in Brave now, so we need to be ready for increased usage from js-ipfs
- P1 4 new DNS Bootstrap nodes are provisioned, monitored, and documented.
- dnsaddr bootstrap nodes are in the config but not yet deployed (see the resolution sketch after this list).
- P1 Ansible Tower manages our deployments
- Running Ansible from CI gets us so far, but it will cause friction as the preload and bootstrap nodes are brought into bifrost-infra
- P2 8 Solarnet Bootstrap nodes are rebuilt and monitored, or decommissioned.
- they are currently held together by systemd restarts.
- P2 Cluster node Ansible is merged into bifrost-infra and monitoring is consistent with the gateways.
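For the dnsaddr point above: a `/dnsaddr` bootstrap entry resolves via TXT records at `_dnsaddr.<host>` whose values look like `dnsaddr=<multiaddr>`. A minimal check of what a name resolves to (the hostname below is the existing libp2p example, not necessarily the record we end up publishing):

```go
// Minimal sketch of dnsaddr resolution: look up the TXT records under
// _dnsaddr.<host> and print the multiaddrs they advertise. The hostname
// is an example only.
package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	records, err := net.LookupTXT("_dnsaddr.bootstrap.libp2p.io")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, txt := range records {
		if strings.HasPrefix(txt, "dnsaddr=") {
			fmt.Println(strings.TrimPrefix(txt, "dnsaddr="))
		}
	}
}
```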
## Our metrics are reliable
- P0 We have separate metrics for our websites and gateway traffic
- P1 PromQL for Gateway Service Level Indicators (SLIs) dashboard is version controlled and reviewed by 4 people (see the sketch after this list)
- We have too many grafana dashboards, and some report the same stats differently. PromQL is complex; we need to audit and reduce the number of dashboards, to get clear signals from our systems.
- P1 We fix the netdata sync errors (causing erroneous metrics spikes)
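One way the version-controlled SLI queries could be exercised from CI (a sketch only: the Prometheus address and the metric name below are assumptions, not what the gateways actually export): evaluate each query against the Prometheus HTTP API and fail the review if it returns nothing.

```go
// Sketch of a CI sanity check for version-controlled SLI queries: run each
// one through Prometheus's /api/v1/query endpoint and fail if it returns no
// series. Prometheus address and metric names are placeholders.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

var slis = map[string]string{
	// Hypothetical gateway error-ratio SLI; substitute whatever we really export.
	"gateway_error_ratio": `sum(rate(nginx_http_requests_total{status=~"5.."}[5m])) / sum(rate(nginx_http_requests_total[5m]))`,
}

func main() {
	for name, q := range slis {
		resp, err := http.Get("http://prometheus.example:9090/api/v1/query?query=" + url.QueryEscape(q))
		if err != nil {
			fmt.Fprintln(os.Stderr, name, "query failed:", err)
			os.Exit(1)
		}
		var body struct {
			Status string `json:"status"`
			Data   struct {
				Result []json.RawMessage `json:"result"`
			} `json:"data"`
		}
		err = json.NewDecoder(resp.Body).Decode(&body)
		resp.Body.Close()
		if err != nil || body.Status != "success" || len(body.Data.Result) == 0 {
			fmt.Fprintln(os.Stderr, name, "returned no data")
			os.Exit(1)
		}
		fmt.Println(name, "ok")
	}
}
```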
## Our alerts are actionable
- P1 Service Level Objective (SLO) thresholds are researched, agreed on, and alerted on.
- P2 A runbook is published with initial triage steps
- P2 An Opsgenie / responsible-human rota is defined
## The community knows what to expect from the ipfs.io services
- P0 Gateway usage policy is published
- P1 We write the section on gateways (local vs ipfs.io) for the ipfs docs.
# 2020 goals
- average response time for content not on the gateways is < 30s
- average response time for content on the gateways is < 300ms
- extract our websites from our gateway
- separate domains for project websites and the ipfs.io gateway
- gateway timeout page
- public metrics dashboard
- open source our ansibles _carefully_, and reboot ipfs/infra
## Questions
- Alerting
- are we the first to know about failures?
- who should respond and how?
- Metrics & trends
- can we see problems coming?
- can we find the cause of failures?
- can we provide useful reports to the community?
- Experiments
- How can we best help go-ipfs improve?
- Do we **know** how go-ipfs config options affect the gateways?
- Do we have a good idea on how to run experiments so they give meaningful results, and the information is retained and compounds over time?
- Safety
- Are we protected from operator error taking down key infrastructure?
- Are secrets secure?
- Is it easy to onboard and offboard new operators?
- Performance
- Do we know how fast the gateways are, and how to make them go faster? (*sic* this is deliberately un-nuanced to prompt discussion.)
- Reliability
- Do we know how reliable the gateways are and how to make them more reliable?
- Do we know how reliable the gateway infrastructure is vs how reliable go-ipfs is?
- Technical Debt
- Do we have a peer-reviewed auto-deploy process for the Preloaders, bootstrappers, dht boosters, *-star libp2p services?
- Do we have metrics and alerting for the above?
- that we trust?
_These are to prompt thinking about our OKRs... I think the answer for most of them is negative right now._
## Suggested high-level key results from the IPFS org-level Q4 OKR report
_Please add your thoughts on these_
[P0] IPFS gateway/pinbots/clusters/preload nodes are ready to 10x scale with great QoS
- Entire team experts at using SLI dashboard to diagnose network changes / outages and respond appropriately
- ∴ Some of the team need to be experts in PromQL and SLI definition
- ∴ We all need to train to understand how to interpret the SLI graphs
- ∴ We need a shared understanding of what our SLIs are and why those numbers are chosen
- ∴ We need actionable alerting on our SLIs
- ∴ We need a defined "custodian of the SLIs" rota... who will wake up, and under what conditions, to deal with alerts
- ∴ We need a runbook of appropriate first responses to various failure conditions
- IPFS preload nodes prepared to handle traffic from companion launch within Brave (50K users?)
- ∴ We need to bring the preload nodes config and monitoring into bifrost-infra
- ∴ We need to load test to demonstrate readiness
- Gateway canary testing IPFS master vs prod behavior / CPU / response time / etc (see the sketch below)
- ∴ We need a staging environment
- Launch external communication of gateway purpose, status, SLOs, and terms of use
- ∴ We need to define our terms of use and SLOs
- ∴ We need to publish our status. Become observable.
- “Chaos monkey” (or similar) configured to run on infra and a week of emergency preparedness testing (if various services fail/spike, how do we maintain QoS?) scheduled for Q1
olizilla: "chaos monkey" & Spinnaker would be nice to shoot for, but I feel like we have a load of tidying up to do for the bootstrappers and the preloads etc. before moving house again.
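For the canary KR above, the comparison could start as simply as timing the same request against prod and a staging gateway running master. A hand-wavy sketch (the staging hostname is a placeholder; the CID is the default go-ipfs "readme" directory):

```go
// Hand-wavy starting point for gateway canary testing: fetch the same path
// from the production gateway and from a staging gateway running go-ipfs
// master, then log status and response time. Staging hostname is a
// placeholder.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func timeGet(url string) (time.Duration, int, error) {
	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		return 0, 0, err
	}
	resp.Body.Close()
	return time.Since(start), resp.StatusCode, nil
}

func main() {
	path := "/ipfs/QmQPeNsJPyVWPFDVHb77w8G42Fvo15z4bG2X8D2GhfbSXc/readme"
	for _, host := range []string{"https://ipfs.io", "https://gateway-staging.ipfs.io"} {
		elapsed, status, err := timeGet(host + path)
		if err != nil {
			fmt.Println(host, "error:", err)
			continue
		}
		fmt.Printf("%s -> %d in %s\n", host, status, elapsed)
	}
}
```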
## Your thoughts...
- [P0] Preloaders:
- Migrate provisioning to Ansible
- Are the preloaders hard-coded anywhere else other than https://github.com/ipfs/js-ipfs/blob/master/src/core/runtime/config-browser.js#L20-L25
- olizilla: also here https://github.com/ipfs/js-ipfs/blob/cf38aead2b0cb0b5f269daf265a2b868c50a81f8/src/core/index.js#L50-L56
- olizilla: related, the delegate nodes are offered up to the public in the docs here https://github.com/ipfs/js-ipfs/blob/e1c214f55ffacc0277cf73231476c50760923d80/README.md#configuring-delegate-routers
- How can we achieve load balancing if `node0.preload.ipfs.io` and `node1.preload.ipfs.io` are hard-coded?
- Current boxes (8core, 32GB) are CPU-bound:
`21:08:57 up 204 days, 21:23, 1 user, load average: 14.52, 11.04, 9.31`
---
- separating the websites
- 4 prod services, low touch, docker images
- bootstrappers are distributed. we don't have to deal with BIRD
- BGP is perfect for gateways...
- DNS libs are the worst.
- bootstrappers have hardcoded IP addresses
- dns/foobar/
- roll out pre-computed bootstrappers and preloaders
- we should use them.
- get away from the hard-coded IPs
- ansibles first
- mtail?
- we have multiple prometheuses...
- migrate the bootstrappers
- provisioning - preload & bootstrappers
- we have to preserve the IP
- preloaders -
- bootstrappers -
- stuff gets cached
- giving them pprof is dope
- go-ipfs was CPU-bound