# Why you should use open data for hackathons

It's common to see non-commercial (NC) and no derivatives (ND) clauses on industry hackathon dataset licenses. Please understand that these clauses are not 'open' according to the Open Definition (https://opendefinition.org/), or any other opinion on open data that I know of. Neither the NC nor the ND restriction is compatible with open data.

This is a problem because open data is one of the cornerstones of reproducible science and engineering. What's more, in my experience many hackathon participants would like to share their work with others, or perhaps publish their work in the literature, or even go off and develop commercial business solutions. In addition, many scientists employed in industry (e.g. at some of your sponsors) would like to participate in hackathons. The license on the SEM dataset makes all of these possibilities -- all of which would be great for the geothermal industry -- much harder or impossible. (I, for one, will not touch non-open datasets because the litigation risk is too great -- ask the seismic company GSI about this if you doubt it's real!)

So I wonder if you will consider changing the license on this dataset?

Personally I recommend CC BY or CC BY-SA for open data in the sciences. I even prefer them to 'public domain' licenses, because they require attribution and therefore preserve provenance.

## Does NC really prevent people from using the data?

Yes. Many data owners choose the NC restriction to prevent people from legally selling the data. But the NC license actually prevents a person making use of the dataset in almost any commercial context.
The CC BY-NC-ND text explicitly says this:

"for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database for NonCommercial purposes only and provided You do not Share Adapted Material;"

I suppose you could get into what is meant by 'noncommercial', 'extract' and 'adapted material' but my approach in these situations is to assume the worst. In short: I would not touch a dataset like this. It's far too easy to violate the terms of the license.

My response to the idea that the data might be sold is twofold:

- If the vendor adds value, then I would argue that selling the augmentation is only fair.
- If the owner does a good job of hosting the open data, then there's not a lot of scope for selling the unaugmented data.

## Organizer and participant points of view

In the past, we've seen subsurface hackathons and data science contests using non-open data from, for example, data vendors. These companies typically attend and participate in the hackathon, and get access to the results -- especially if the results are *required* to be open access (more on that later). So they derive all the benefits of open culture... without actually making their stuff open. This is fundamentally unfair.

In general, regardless of the license participants place on their contributions, I think hackathons should only use open data. Then there is no ambiguity at all about participants' rights in the dataset. They can use it freely for any purpose, publish papers, and so on, without worrying about what they are allowed to do with it. With a non-open dataset it is all too easy to accidentally commit something to GitHub, publish a table, use data in a software test, etc, etc. All easily done, and all violating the terms of the license.

I don't want to give the impression I think all hackathon products should be open -- I don't.
I strongly believe that hackathons should be able to produce commercial ventures, especially in an industrial endeavour, and requiring projects to be open substantially limits what people are prepared to contribute. Teams should be able to produce non-open projects. In hackathons, we must remember that people are donating their time and ingenuity to the event and should be able to participate on their own terms -- otherwise it just looks too much like free labour.

## None of this is easy

I've organized more than 30 hackathons and hosted hundreds of participants at conferences, universities, governments, and inside petroleum companies. I've formed these opinions over time, talking to various stakeholders about these kinds of events. It's definitely a tricky area, and part of the problem is that the general level of awareness about these issues is pretty low in our field. And that's partly why I think it's so important to set high standards for ourselves and make these events as impactful as possible. And I think that starts with open access code and data.

## So what should you do?

#### Data

I think the CC BY license is the right license for open data. It covers the 'sui generis' database rights (make sure you use version 4.0 International) and covers the normal attribution expectation in science.

It's not a requirement of the license, but I recommend including a file called README in the data package that explicitly names the copyright holder(s), names the license, and tells people where to get the data from. Be careful not to make it sound like you are *modifying* the license (which is a bad idea), just providing copyright information. You can suggest or recommend that people include the info file with the data. Basically you want to make it as easy as possible for people to understand their rights, and also point out that the data is available from the data owner for free.
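
As a rough illustration of the README idea, here is a minimal Python sketch that builds and writes such a file. The function names `build_readme` and `write_readme`, the holder name, the year, the URL, and the exact wording are all placeholders of mine, not anything the license prescribes, so adapt them to your own dataset. Note that the text only states facts and makes a polite request; it does not add conditions to the license.

```python
from pathlib import Path


def build_readme(holder: str, year: int, source_url: str) -> str:
    """Return README text naming the copyright holder, the license, and the data source."""
    return (
        f"Copyright (c) {year} {holder}\n"
        "\n"
        "This dataset is licensed under the Creative Commons Attribution 4.0\n"
        "International license (CC BY 4.0). See the LICENSE file for the full text.\n"
        "\n"
        f"The data is available free of charge from: {source_url}\n"
        "\n"
        # A request only: this must not read as an extra license condition.
        "Please consider keeping this file with the data if you pass it on.\n"
    )


def write_readme(package_dir: str, holder: str, year: int, source_url: str) -> Path:
    """Write the README into the data package directory and return its path."""
    path = Path(package_dir) / "README"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(build_readme(holder, year, source_url), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_readme("Example Data Owner Ltd", 2021, "https://example.com/dataset"))
```

Calling `write_readme("data_package", ...)` with your own values would place the file next to the data, which is where people are most likely to look for it.
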
You should also include a file called LICENSE that is just the full license text >> https://creativecommons.org/licenses/by/4.0/legalcode.txt

#### Code

It's good to encourage hackathon teams to make open tools, but I'm not sure about making it a requirement. For open projects, the MIT license is a great recommendation. In a way, it's the software-specific equivalent of CC BY, inasmuch as it's a popular and well-understood 'permissive' open license. I don't think you can *require* people to use it however, for the following reasons:

- Some people feel very strongly about 'share alike' terms (aka nonpermissive or copyleft licenses). Unfortunately the argument borders on religious warfare. For those people, I would recommend considering the LGPL (it's less restrictive than the GPL, but less widely known).
- Some people might build solutions that depend on software carrying the GPL, which requires derived works to be distributed under the same license. You can't build on GPL'd libraries and release the combined tool under the MIT license alone.
- Some people might prefer other permissive licenses, e.g. Apache 2.0 or whatever, for various reasons.

## Further reading

Read more about open data here > https://opendatahandbook.org/guide/en/what-is-open-data/

For what it's worth, you can read more about these things here > https://agilescientific.com/blog/2021/2/17/which-open-licence-should-i-choose

This is also a nice tool > https://choosealicense.com/ (except I wish it recommended the LGPL over the GPL, at least for libraries; the GPL is better for applications in my opinion).