# OSS "Telemetry" ## Hypothesis driven, not data driven I love this diagram! <a title="Efbrazil, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:The_Scientific_Method.svg"><img width="512" alt="The Scientific Method" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/82/The_Scientific_Method.svg/512px-The_Scientific_Method.svg.png"></a> Open source telemetry should be using *this* approach, rather than the 'big data' approach of 'collect everything we can, and sift through it later'. The 'collect everything' approach, while *easier* for implementers, leads to both privacy problems as well as sometimes inaccurate results ([p-hacking](https://en.wikipedia.org/wiki/Data_dredging) for example). It's also, IMO, less cool! And in 2023, being cool is very important. So the overall process in an open source project should really look like: 1. Come up with a specific question it wants to answer 2. Determine *if* data needs to be collected from users to answer this question, and which subset of users we would need data from. Sometimes, there are other ways to answer this question that don't require new data collection! Or maybe this question can not actually be answered even with data, at which point it's moot and we can move on. This is a *very* important step, as it lets us justify *why* we are collecting data at all. 3. Design the data collection so we collect *exactly* the minimal set of data required to answer our question, and no more. Just like how communities have developed coding & community best practices over time, they will also develop data collection best practices over time. 4. Instrument your application / library to collect this data, and send it over to *open infrastructure* that can collate, store and make this data available to everyone. More on this later. 5. Once you have enough data to answer your question, answer your question! Publish your result, then stop collecting the data. However, in some cases (such as measuring deprecated function usage, for example), this might basically continue forever - functions will always keep getting deprecated :) This is fine too, but should be an intentional decision. This isn't fundamentally new - it's mostly [pre-registration](https://www.cos.io/initiatives/prereg) codified. I'm not a scientist, so I am sure there is a lot more literature out there about how to do this well that I don't know about! I just want the Open Source Software community to *adopt* these practices, for both software they ship and infrastructure they run. It isn't the simplest or easiest thing to do, but given the age of surveillance we live in, it is the only way. ## Informed consent When you have a specific set of questions to answer, it is also much easier to convince users to send us data! If we can tell them 'it is to help answer these questions we have, and answering them will benefit you too!' vs 'give us the data, we will do cool things with it, trust us bro'. I fundamentally believe that users *want* to help open source software get better, but given that *most* data collection is at best neutral to the user whose data is being collected but in the default case is [actively harmful](https://www.theatlantic.com/technology/archive/2014/06/everything-we-know-about-facebooks-secret-mood-manipulation-experiment/373648/), us open source maintainers have a duty to explain why this data collection might benefit *users*. 
It could be as simple as 'knowing this is used will help us get more resources to develop this software, benefiting you' (as [GNU Parallel does](https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/citation-notice-faq.txt)). But it needs to be explicit, it needs to be specific to the purpose at hand, and the user needs to actively agree to it.

Informed consent is perhaps the *hardest* part of doing this, primarily because it is the newest. GDPR is forcing large companies to at least make an attempt at figuring out how much they can twist the words 'informed' and 'consent' to not eat into their profit margins too much. But we're pretty far away from having easy-to-use frameworks for:

1. Easily informing users what is being collected
2. Enough user education that they *understand* what is being collected
3. Easy infrastructure for the *developers* to do all this

This will also need to be different for libraries, CLI applications, GUI desktop applications, GUI web applications, things that go 'brrrrr' on HPC systems, things that make Jeff Bezos richer, etc. I don't think it's an unsolvable problem though, especially if we are looking at it through the lens of 'collect data to answer specific questions' rather than 'just do whatever it is that companies are doing right now'.
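
The in-application part, at least, can stay small once the question is narrow enough. As a purely illustrative sketch of step 4 from the list above - the endpoint, field name, and question are all made up - the payload contains *exactly* one field, and a telemetry failure never breaks the tool itself:

```python
import json
import urllib.request

# Hypothetical open collection endpoint: the service (and the data it stores)
# would be publicly accessible, so users can verify what was collected.
COLLECTION_URL = "https://telemetry.example.org/v1/deprecated-frobnicate"


def report_deprecated_usage(called_frobnicate: bool) -> None:
    """Send exactly the one field needed to answer the pre-registered question."""
    payload = json.dumps({"called_frobnicate": called_frobnicate}).encode()
    request = urllib.request.Request(
        COLLECTION_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        urllib.request.urlopen(request, timeout=3)
    except OSError:
        pass  # telemetry must never break the tool it instruments
```

Gating this on something like `ask_consent()` from the earlier sketch, and retiring the endpoint once the question is answered, is what separates it from 'collect everything' telemetry.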