# 🌈 IPFS Bifrost - 2019 Q4 - Objectives and Key Results
_What is the main thing?_
- Provide actionable insights to IPFS developers
- Provide a great service to the community
- Be safe, observable, reliable, happy & open.
## IPFS infrastructure informs and improves go-ipfs development
- P0 traffic is mirrored to a staging gateway running latest go-ipfs
- P1 all devs have access to ipfs logs from production and staging infrastructure
- P1 nightly and on-demand pprof profiles from all nodes are published to a shared repo (see the sketch after this list)
- P2 bifrost team identifies 5 performance issues with go-ipfs
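As a rough sketch of the profile-publishing step (assuming the default go-ipfs API address `127.0.0.1:5001` and the standard `net/http/pprof` endpoints the daemon exposes there; the duration and output naming are placeholders), a nightly cron/CI job could be as small as:

```go
// Rough sketch: grab a 30s CPU profile from a go-ipfs node via the pprof
// endpoints on the API port, and write it to a timestamped file that a
// nightly job could push to the shared repo. Host, duration and filename
// are assumptions, not the agreed setup.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// 30-second CPU profile; heap, goroutine, etc. live under the same /debug/pprof/ prefix.
	url := "http://127.0.0.1:5001/debug/pprof/profile?seconds=30"
	client := &http.Client{Timeout: 45 * time.Second}

	resp, err := client.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "profile request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Timestamped filename so nightly runs don't collide in the repo.
	out, err := os.Create("cpu-" + time.Now().UTC().Format("2006-01-02T15-04") + ".pprof")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote", out.Name())
}
```

Committing the resulting files from cron/CI would cover the "published to a shared repo" part; the on-demand case is the same request run by hand.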
## IPFS infrastructure is stable and scales to meet demand
- P0 2 preload/delegate nodes are provisioned, monitored and documented
- Companion is in Brave now, so we need to be ready for increased usage from js-ipfs
- P1 4 new DNS Bootstrap nodes are provisioned, monitored, and documented.
- dnsaddr bootstrap nodes are in the config but not yet deployed (see the resolution sketch after this list).
- P1 Ansible Tower manages our deployments
- Running Ansible from CI gets us so far, but it will cause friction as the preload and bootstrap nodes are brought into bifrost-infra
- P2 8 Solarnet Bootstrap nodes are rebuilt and monitored, or decommissioned.
- they are currently held together by systemd restarts.
- P2 Cluster node Ansible is merged into bifrost-infra and monitoring is consistent with the gateways.
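For the dnsaddr point above: a `/dnsaddr` bootstrap entry resolves via TXT records at `_dnsaddr.<host>` whose values look like `dnsaddr=<multiaddr>`. A minimal check of what a name resolves to (the hostname below is the existing libp2p example, not necessarily the record we end up publishing):

```go
// Minimal sketch of dnsaddr resolution: look up the TXT records under
// _dnsaddr.<host> and print the multiaddrs they advertise. The hostname
// is an example only.
package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	records, err := net.LookupTXT("_dnsaddr.bootstrap.libp2p.io")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, txt := range records {
		if strings.HasPrefix(txt, "dnsaddr=") {
			fmt.Println(strings.TrimPrefix(txt, "dnsaddr="))
		}
	}
}
```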
## Our metrics are reliable
- P0 We have separate metrics for our websites and gateway traffic
- P1 PromQL for Gateway Service Level Indicators (SLIs) dashboard is version controlled and reviewed by 4 people (see the sketch after this list)
- We have too many grafana dashboards, and some report the same stats differently. PromQL is complex; we need to audit and reduce the number of dashboards, to get clear signals from our systems.
- P1 We fix the netdata sync errors (causing erroneous metrics spikes)
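One way the version-controlled SLI queries could be exercised from CI (a sketch only: the Prometheus address and the metric name below are assumptions, not what the gateways actually export): evaluate each query against the Prometheus HTTP API and fail the review if it returns nothing.

```go
// Sketch of a CI sanity check for version-controlled SLI queries: run each
// one through Prometheus's /api/v1/query endpoint and fail if it returns no
// series. Prometheus address and metric names are placeholders.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

var slis = map[string]string{
	// Hypothetical gateway error-ratio SLI; substitute whatever we really export.
	"gateway_error_ratio": `sum(rate(nginx_http_requests_total{status=~"5.."}[5m])) / sum(rate(nginx_http_requests_total[5m]))`,
}

func main() {
	for name, q := range slis {
		resp, err := http.Get("http://prometheus.example:9090/api/v1/query?query=" + url.QueryEscape(q))
		if err != nil {
			fmt.Fprintln(os.Stderr, name, "query failed:", err)
			os.Exit(1)
		}
		var body struct {
			Status string `json:"status"`
			Data   struct {
				Result []json.RawMessage `json:"result"`
			} `json:"data"`
		}
		err = json.NewDecoder(resp.Body).Decode(&body)
		resp.Body.Close()
		if err != nil || body.Status != "success" || len(body.Data.Result) == 0 {
			fmt.Fprintln(os.Stderr, name, "returned no data")
			os.Exit(1)
		}
		fmt.Println(name, "ok")
	}
}
```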
## Our alerts are actionable
- P1 Service Level Objective (SLO) thresholds are researched, agreed on, and alerted on.
- P2 A runbook is published with initial triage steps
- P2 An Opsgenie / responsible-human rota is defined
## The community knows what to expect from the ipfs.io services
- P0 Gateway usage policy is published
- P1 We write the section on gateways (local vs ipfs.io) for the ipfs docs.
# 2020 goals
- average response time for content not on the gateways is < 30s
- average response time for content on the gateways is < 300ms
- extract our websites from our gateway
- separate domains for project websites and the ipfs.io gateway
- gateway timeout page
- public metrics dashboard
- open source our ansibles _carefully_, and reboot ipfs/infra
## Questions
- Alerting
- are we the first to know about failures?
- who should respond and how?
- Metrics & trends
- can we see problems coming?
- can we find the cause of failures?
- can we provide useful reports to the community?
- Experiments
- How can we best help go-ipfs improve?
- Do we **know** how go-ipfs config options affect the gateways?
- Do we have a good idea on how to run experiments so they give meaningful results, and the information is retained and compounds over time?
- Safety
- Are we protected from operator error taking down key infrastructure?
- Are secrets secure?
- Is it easy to onboard and offboard new operators?
- Performance
- Do we know how fast the gateways are, and how to make them go faster? (*sic* this is deliberately un-nuanced to prompt discussion.)
- Reliability
- Do we know how reliable the gateways are and how to make them more reliable?
- Do we know how reliable the gateway infrastructure is vs how reliable go-ipfs is?
- Technical Debt
- Do we have a peer-reviewed auto-deploy process for the Preloaders, bootstrappers, dht boosters, *-star libp2p services?
- Do we have metrics and alerting for the above?
- that we trust?
_These are to prompt thinking about our OKRs... I think the answer for most of them is negative right now._
## Suggested high-level key results from the IPFS org-level Q4 OKR report
_Please add your thoughts on these_
[P0] IPFS gateway/pinbots/clusters/preload nodes are ready to 10x scale with great QoS
- Entire team experts at using SLI dashboard to diagnose network changes / outages and respond appropriately
- ∴ Some of the team need to be experts in PromQL and SLI definition
- ∴ We all need to train to understand how to interpret the SLI graphs
- ∴ We need a shared understanding of what our SLIs are and why those numbers are chosen
- ∴ We need actionable alerting on our SLIs
- ∴ We need a defined "custodian of the SLIs" rota... who will wake up, and under what conditions, to deal with alerts
- ∴ We need a runbook of appropriate first responses to various failure conditions
- IPFS preload nodes prepared to handle traffic from companion launch within Brave (50K users?)
- ∴ We need to bring the preload nodes config and monitoring into bifrost-infra
- ∴ We need to load test to demonstrate readiness
- Gateway canary testing IPFS master vs prod behavior / CPU / response time / etc (see the sketch below)
- ∴ We need a staging environment
- Launch external communication of gateway purpose, status, SLOs, and terms of use
- ∴ We need to define our terms of use and SLOs
- ∴ We need to publish our status. Become observable.
- “Chaos monkey” (or similar) configured to run on infra and a week of emergency preparedness testing (if various services fail/spike, how do we maintain QoS?) scheduled for Q1
olizilla: "chaos monkey" & Spinnaker would be nice to shoot for, but I feel like we have a load of tidying up to do for the bootstrappers and the preloads etc. before moving house again.
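For the canary KR above, the comparison could start as simply as timing the same request against prod and a staging gateway running master. A hand-wavy sketch (the staging hostname is a placeholder; the CID is the default go-ipfs "readme" directory):

```go
// Hand-wavy starting point for gateway canary testing: fetch the same path
// from the production gateway and from a staging gateway running go-ipfs
// master, then log status and response time. Staging hostname is a
// placeholder.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func timeGet(url string) (time.Duration, int, error) {
	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		return 0, 0, err
	}
	resp.Body.Close()
	return time.Since(start), resp.StatusCode, nil
}

func main() {
	path := "/ipfs/QmQPeNsJPyVWPFDVHb77w8G42Fvo15z4bG2X8D2GhfbSXc/readme"
	for _, host := range []string{"https://ipfs.io", "https://gateway-staging.ipfs.io"} {
		elapsed, status, err := timeGet(host + path)
		if err != nil {
			fmt.Println(host, "error:", err)
			continue
		}
		fmt.Printf("%s -> %d in %s\n", host, status, elapsed)
	}
}
```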
## Your thoughts...
- [P0] Preloaders:
- Migrate provisioning to Ansible
- Are the preloaders hard-coded anywhere else other than https://github.com/ipfs/js-ipfs/blob/master/src/core/runtime/config-browser.js#L20-L25
- olizilla: also here https://github.com/ipfs/js-ipfs/blob/cf38aead2b0cb0b5f269daf265a2b868c50a81f8/src/core/index.js#L50-L56
- olizilla: related, the delegate nodes are offered up to the public in the docs here https://github.com/ipfs/js-ipfs/blob/e1c214f55ffacc0277cf73231476c50760923d80/README.md#configuring-delegate-routers
- How can we achieve load balancing if `node0.preload.ipfs.io` and `node1.preload.ipfs.io` are hard-coded?
- Current boxes (8core, 32GB) are CPU-bound:
`21:08:57 up 204 days, 21:23, 1 user, load average: 14.52, 11.04, 9.31`
---
- separating the websites
- 4 prod services, low touch, docker images
- bootstrappers are distributed. we don't have to deal with BIRD
- BGP is perfect for gateways...
- DNS libs are the worst.
- bootstrappers have hardcoded IP addresses
- dns/foobar/
- roll out pre-computed bootstrappers and preloaders
- we should use them.
- get away from the hard-coded IPs
- ansibles first
- mtail?
- we have multiple prometheuses...
- migrate the bootstrappers
- provisioning - preload & bootstrappers
- we have to preserve the IP
- preloaders -
- bootstrappers -
- stuff gets cached
- giving them pprof is dope
- go-ipfs was CPU-bound