ODC @ TTW Bookdash Session | 14 Nov 2022

---
tags: workshop
---

# ODC @ TTW Bookdash Session | 14 Nov 2022

[TOC]

### Useful links

* [BigScience Data Governance Framework](https://yjernite.github.io/content/LangDataGov.pdf)
* [The Stack](https://www.bigcode-project.org/docs/about/the-stack/)
* [Am I In the Stack?](https://huggingface.co/spaces/bigcode/in-the-stack)
* [ODC Repo](https://github.com/dingaaling/opendatacustodians)
* [TTW Personas Issue](https://github.com/alan-turing-institute/the-turing-way/issues/2255)

**What does the term “Data Custodian” mean to you? Do you consider yourself one?**

* Someone who thinks about the data holistically: not just as steps on the way to a research project, but as preparing the infrastructure to better support all projects
* Cleaning datasets
* Someone who manages data? I'm not sure if I am one, but it also feels like almost all of us are?
* Someone who responsibly uses the available data
* Someone in charge of managing and storing data
* Maybe me in the role of managing my own data that is out there? In that case yes, I am a Data Custodian, but if it is about a more general data manager then no - I'm not actively managing other data.
* Someone who looks after data for others, a bit like a museum curator.

### Discussion

![](https://i.imgur.com/Ctdf4c5.png)

------

**1. How do you think TTW fits into this Data Governance model? (or how do you fit into this Data Governance Model)**

* TTW can provide guidance on all parts of this framework +1
* In a sense TTW is a data custodian since it is managing/providing data/information on research processes.
* As data host and hence data provider, TTW does act as a custodian. But the infrastructure it uses doesn't belong t
* TTW provides data that other users can access

------

**2. How do you feel about your data being scraped to train an AI model? Does it matter to you what kind of data (e.g. images, code, writing) and whether it's public domain data?**

* I would love to have ways to explicitly give consent - can these crawlers let people know "hey we are using your data - have your say!". What would this involve?
* I think understanding what is being scraped and for what purpose is the most important thing in order to make informed decisions on a case by case basis (which is why the Copilot example is so worrisome)
* They have scraped my GitHub repos but only my rubbish ones - ones I use to teach GitHub, so not sure how useful this is to anyone really? I suppose if I am putting anything out there on the web, I should accept that it can be scraped. I'm happy really as long as it is a transparent process and also if there is an opt-out.
* If it is done like Hugging Face I think it is great: they are transparent about the process and they follow license requirements. What GitHub did is not cool. If they use TTW information they are allowed, as long as credit is provided. So as long as they do that I guess I'm happy :)
* For me, I think it depends a lot on the context of the data?
    * If it's open or not?
    * If it's providing the option to opt-out?
    * Is it for profit?
    * Does it take care of the license?
    * Are there accurate attributions?

**3. If you had the option to remove your data from AI training datasets, would you do so? And if so what kind of data would you prioritise for removal?**

* It depends on the context in which the data is being used, especially with images. However, I won't mind other forms of data being public
* I would remove it if the tool is harmful, or used for disadvantaging someone. Also, it would be expected that I have ways to know who is using my data and how, and if indeed there is a way to remove it.
* Any data containing personal information about the users or participants in the data, which could be a risk to their privacy. In that case, a consent form might be useful.
* Probably yes, will remove it, again depending on the context.
Priority would be to remove anything with personal identifiers.
* Yes, I would prefer there to be the option to remove my data even if I don't use it.

**4. Are there any other examples of use cases of TTW data for which you would want to opt-out your data contributions?**

* I can imagine if, let's say, TTW deviates from its current mission and starts using people's work to generate personal profit, people might not want to be associated with the final product. This might bring up issues around whether people would like to be attributed - for which version - in which version did TTW become problematic etc. We are going through something similar in NASA-TOPS.
* No, not really - open is open, and so as long as they stick to the license we have, I am happy.

### Post-Workshop

If you could query or download TTW repo, what kind of data might you be interested in? For what use cases?

* Over the past few months, have talked with folks about the possibility of tagging pages/sub-chapters by "theme" - being able to stitch together a "turing way" that includes everything related to "accessibility" and "data stewardship", for example
    * Tagging work was being explored in UX, they created something workable, see: https://deploy-preview-2246--the-turing-way.netlify.app/reproducible-research/reproducible-research.html (it's not perfect!)
    * Yes! The idea was slightly altered, not curated for certain personas in advance -> makes me think we need a session here at Book Dash to discuss!
* I wonder if we can use a CFF file for this, to add fields that ask for some info that relates to data custodianship.
* Definitely, persona-based queries would be great! :D
* Generating newsletter or other reports?
* Automating manual CM processes
* Identifying where TTW was mentioned in external resources
* Scraping RSS feed through Google Sheets?
* Open questions around attribution/acknowledgement for different versions (e.g. original vs. for-profit version)
* Data ownership
* Who is being asked for consent?
* Who benefits?
* Challenging norms & expectations around what is "normal practice"
* Web scraping as removing people from their social context (efficiency vs. dehumanisation)
* Alt text/data schemas/readme (data about data) play a role of adding context to data-driven processes
* Reintroducing the social context
* Current TTW contributors.md is not the same as the contributors table
* Info is in multiple places right now, in contrast to the OLS central contributors database
* Append data consent info to contributors table?
* Experiment: what does breaking up with TTW look like? How do we integrate that value into our work?
* Reasons for retraction
* Example - contributor to non-open access paper