# Project Phoenix book reflections

These are reflections refined from notes taken while listening to [Project Phoenix](https://www.amazon.com/Phoenix-Project-DevOps-Helping-Business/dp/0988262509) on Audible.

## On what basis do we decide if we accept a new project?

(chap 19, 18:00) "On what basis do we decide if we accept a new project" - this is a key question for 2i2c, I think. My take is that we should be careful about accepting a new project at this point in time, and that we currently lack a policy/procedure for accepting new complexity. The discussion below should clarify my view on this.

## Throughput and complexity tradeoff

I recall Erik looking at a plant floor saying "the goal is throughput" (chap 19, 28:30). This led to a discussion about how many different kinds of parts were manufactured. The more kinds of parts manufactured, the lower the throughput and the longer the order->delivery time.

I've often related to this idea, thinking that the simpler the service offering, the more communities we can serve it to. This has to impact the cost of the service as well. Expanding the service's features comes at the opportunity cost of scaling out the existing service to more communities at a lower cost.

## Tendency towards 100% firefighting and eliminatable toil

> **Debt**
>
> I'll refer to _debt_ here, which only partially means "technical debt"; it also includes debt in establishing practices for interacting with communities, onboarding them, etc.

If we don't find time to handle debt through preventive maintenance, eliminating toil, establishing documented procedures and practices etc., then we end up only doing responsive things without time to get unstuck. The more complexity we've introduced, for example by "producing more kinds of parts", the more debt there is to handle. Until we manage our debt well enough, I see it as very problematic to introduce more complexity!
I currently think 2i2c should avoid additional complexity: we have a lot of debt that makes us unable to scale or to provide a service of sufficient quality.

## Task distribution and stress of detecting sparks

> **Sparks**
>
> I'll refer to _sparks_, meaning things that could become, or are guaranteed to become, fires over time.

Depending on what kind of tasks we do, we may be more or less prone to detecting sparks. Of course they should ideally be handled before they turn into fires. Now what happens if the lead time for recruiting help from others to handle a spark before it turns into a fire is too long? That has at least compelled me to spring into action, squeezing in time to do something I've not been allocated time for. This is stressful.

## Open source commitments

**Deep contributions**

Working with 2i2c, where we upstream features to open source projects, mostly in github.com/jupyterhub, I feel a responsibility to provide deep contributions. This means that we don't only add the parts that meet our needs, but also help manage incoming issues, review other people's work, fix miscellaneous bugs, and ensure that documentation is comprehensive, tests are in place, and test/release automation and release procedures exist. If we are in 100% firefighting mode, we won't do this.

**A separate organization**

Providing deep contributions upstream means, for example, being able to prioritize a security matter, which requires that we have some time available and are not fully scheduled. This is a complexity not captured in Project Phoenix, where they mostly relate to the profits of the company, whereas we, as deep contributors to open source, have to find time for both 2i2c's and the jupyterhub org's priorities to some degree.

## Big pieces of debt I've observed

- **Billing**
- **Resource allocation**

  How much CPU/memory are users allocated? This is a problem influencing cloud costs, stability of service, and the user experience of servers starting fast or slow.
  It is a big topic and is hard to work on because it's both complicated and includes "breaking changes" to some degree, where communities may, for example, get a new UX when starting servers.
- **Community onboarding and communication**

  How do we communicate miscellaneous information to communities so that we don't have to explain things retroactively when they run into trouble from not having been onboarded, for example about the fact that we shut down servers that have been inactive for some amount of time?
- **k8s cluster maintenance**

  We've almost caught up with this, but it's not finalized yet because we don't yet have thorough docs on doing it - it's still a WIP for me to write this down.
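
To make the resource allocation item above a bit more concrete, here is a rough back-of-the-envelope sketch of how per-user CPU/memory guarantees could translate into how many user servers fit on a node, and therefore into cost per user. Everything in it is an assumption for illustration: the function names, the node size, and the hourly price are hypothetical, and it ignores the capacity that real nodes reserve for system pods. It is not how 2i2c actually calculates anything.

```python
import math


def users_per_node(node_cpu, node_mem_gib, cpu_guarantee, mem_guarantee_gib):
    """Return how many user servers fit on one node.

    The scarcer of CPU and memory decides the result. Real nodes reserve some
    capacity for system pods, which this sketch ignores.
    """
    if cpu_guarantee <= 0 or mem_guarantee_gib <= 0:
        raise ValueError("guarantees must be positive")
    fit_by_cpu = math.floor(node_cpu / cpu_guarantee)
    fit_by_mem = math.floor(node_mem_gib / mem_guarantee_gib)
    return min(fit_by_cpu, fit_by_mem)


def hourly_cost_per_user(node_hourly_cost, fit):
    """Spread the cost of one full node evenly over the users it holds."""
    if fit < 1:
        raise ValueError("at least one user must fit on the node")
    return node_hourly_cost / fit


# Hypothetical node: 4 CPU, 16 GiB memory, 0.20 (any currency) per hour.
NODE_CPU, NODE_MEM_GIB, NODE_HOURLY_COST = 4, 16, 0.20

# Same CPU guarantee, but a 1 GiB versus a 4 GiB memory guarantee per user.
small = users_per_node(NODE_CPU, NODE_MEM_GIB, cpu_guarantee=0.5, mem_guarantee_gib=1)
large = users_per_node(NODE_CPU, NODE_MEM_GIB, cpu_guarantee=0.5, mem_guarantee_gib=4)

print(small, hourly_cost_per_user(NODE_HOURLY_COST, small))
print(large, hourly_cost_per_user(NODE_HOURLY_COST, large))
```

With these made-up numbers, the 1 GiB guarantee fits 8 users on a node (CPU is the limit) while the 4 GiB guarantee fits 4 (memory is the limit), so the per-user cost doubles. Lower guarantees pack more users per node but leave less headroom, which is where the stability and startup-time concerns come in.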