Discussion group Humanities & Data Science @Turing
===

###### tags: `Humanities and Data Science`, `Ethics`, `Discussion group`

📣 ==[Joining Zoom Link](https://turing-uk.zoom.us/j/99339963748?pwd=TkEzNnZuQ2k2UTRqMTZwc2Jza2JSZz09)==

---

:::info
- **Next meeting date:** 20 July 2022 16:00 (BST)
- **Hosts:** Fede, Katie, Malvika, Valeria, Kalle, Anne
- **Contact:** fnanni@turing.ac.uk
- **Mailing list:** [Link](https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=TURINGINS-HUMANITIES-DATASCIENCE)
- **To Know More about the Special Interest Group:**
    - https://www.turing.ac.uk/research/interest-groups/humanities-and-data-science
    - https://github.com/fedenanni/HDS-DiscussionGroup
:::

[TOC]

## Humanities and Data Science Turing Interest Group

* The main aims of the group are to strengthen relationships and build collaborations at the intersection between data science and digital humanities
* [Website](https://www.turing.ac.uk/research/interest-groups/humanities-and-data-science)
* Join the [Mailing list](https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=TURINGINS-HUMANITIES-DATASCIENCE) for more events
* [GitHub repo](https://github.com/fedenanni/HDS-DiscussionGroup) for discussion group

## Notes 20 July 2022
## Topic: Humanities Computing Infrastructure

**Special guests: [Anna-Maria Sichani](https://amsichani.github.io/) and [Melodee Wood](http://drmhwood.co.uk/)**

**Participants (write your names below)**

*Name/Institute*

- Valeria / Turing
- Kalle / BL + Turing (Living with Machines)
- Fede / Turing
- Jez / BL

**Volunteers to take notes: Please add your name below**

- Fede

### Slides

[Link](https://docs.google.com/presentation/d/1I-dZUHJagF4NuwzaevaTgK4g-LuN26ms7McsERzq_hs/edit#slide=id.g12e660dee76_0_0)

### ✍ Initial Drafted Notes

*anyone can help taking notes*

<!-- All things discussed during the meeting can be entered here. -->

Intro on Melodee and Anna-Maria.

Anna-Maria: fellowship started delayed, by the end of Jan 2022.
Large investment to unlock the use of AH data and connect the wealth of the UK's cultural and creative offering

- Strand A: enhancement of existing data services
- Strand B: Research-community-led scoping and design of cloud-native data repos in emerging areas
- Strand C: DRI Policy and Engagement Fellowships
- Strand D: Observership fees and associated costs, CLARIN and DARIAH ERICs

Anna-Maria: findings:

![](https://i.imgur.com/Rw15XJy.png)

- Very fragmented ecosystem
- Still barriers for the uptake of HPC and AI
- The culture of deposit is still way more prevalent than the culture of re-use

Two sets of recommendations:

- have a governance model that works in the long run
- embed the use of HPC services
- include connectivity and collaboration

Second set:

- high-level technical requirements
- RSE taskforce and pool provision

RSE Steering Group: broad representation of RSE
permanent AH RSE taskforce

One of the main take-aways: infrastructure is mostly about people :heart:

Melodee

UKRI engagement in Humanities (through HPC)

HPC literature

Three types of articles:

1. They used HPC in the background (possibly someone else in the team did)
2. The authors worked on HPC but in a way that was disconnected from the discipline methodology
3. The authors connected HPC use with the research questions they wanted to answer (very few!!!)

Q: is this specific to HPC or does this apply to any computational method? Wouldn't we say the same for deep learning, for instance?

The language in HPC articles was very different and inconsistent, but the questions tended to be similar.

What does "computational" mean in different disciplines? Difference between computational problems and research questions.

Repercussions of short-term projects (average duration 3-5 years): loss of know-how, lack of longevity.

Melodee created a survey to understand more about how people use (or would like to use) HPC.
Please take part [here](https://forms.office.com/pages/responsepage.aspx?id=wE8mz7iun0SQVILORFQIS7LxqUXCIkhAjlni55pui4ZURFNITDJGUzZaRVRXOFhFUUVDU1VBUElORi4u)

Katie's first question: issue of career mobility, challenge of loss of knowledge, the role of postdocs. What could we do?

Anna-Maria: the fellowship is already very fragile. We only had 6 months to write something that will influence the top. Fragmented knowledge is there.

Melodee: the short life of projects is issue number one. It's not about making an HPC tool to do what I need; it's more about how to outsource some things from me to a computer. Having a discourse, a written record, is really needed.

Katie: Issue on the spectrum of computational resources.

Melodee: we need encouragement while learning.

David De Roure: things change when you put the word "computational" before the name of a discipline

Katie: for a lot of people in history what we do. There's also the issue of documentation as infrastructure.

Anna-Maria: we need permanent careers for RSEs in arts and humanities and also long-running / permanent fellowships on these topics.

David De Roure: you are demonstrating that this scheme works

Katie: the Turing as a hub. The relation between projects and universities and individuals is complex. DARIAH has been looking at national infrastructure for many years: what should we be learning from them? Can we work with them?

Melodee: tension between HPC and high-throughput computing.

Vera: head of skills at the Turing. Even in very computationally heavy fields it takes lots of time to run things. On a 3-5 year cycle things are really difficult

### Chat

Federico Nanni to Everyone (16:08)
Here are the meeting minutes: https://hackmd.io/@turing-hds/DiscussionGroup and here's the github of the discussion group: https://github.com/alan-turing-institute/HDS-DiscussionGroup

Jez Cope to Everyone (16:10)
There are no former SSI Fellows: once a fellow, always a fellow...
;P

Federico Nanni to Everyone (16:10)
<3

David De Roure to Everyone (16:11)
🙂

Katie McDonough to Everyone (16:11)
haha thanks Jez 🙂

Melodee Wood to Everyone (16:11)
:)

David De Roure to Everyone (16:15)
The scoping project on Born Digital was led by Oxford and TNA, we had 23 case studies including one from Turing.

Katie McDonough to Everyone (16:16)
Thanks, David! Yes, it would be cool to see a list of these.

Federico Nanni to Everyone (16:16)
in the hackmd you can add your name and help us with the notes if you'd like: https://hackmd.io/B05ta6rlRLig4auxDn5gkA?both#Topic-Humanities-Computing-Infrastructure

David De Roure to Everyone (16:17)
All the evidence base will be available - we're just giving authors a chance to confirm they're ok for it to be published. I'll find a list…

Katie McDonough to Everyone (16:18)
Fantastic!

David De Roure to Everyone (16:23)
These were the born-digital mini case studies:

David De Roure to Everyone (16:23)
Living with Machines National Library of Scotland Chapbooks Oxford Museums ALICE: The Aggregate Line Inspector & Collaborative Editor Knowledge graphs Rider Spoke – Riders Have Spoken King’s Digital Lab UK Web Archive Emerging formats Digital Voltaire Annotation: anəstor Redelivery: freizo Cambridge Digital Library PriSM SampleRNN HS2 Archive Complex e-theses Open Geospatial Data Application and Services Environmental monitoring system Literary and Linguistic Data Service

Malvika Sharan to Everyone (16:30)
Thank you Anna-Maria and Melodee. Your work is really valuable - thank you for sharing it here.

Federico Nanni to Everyone (16:31)
both presentations are so cool and touching on so many relevant points!

David De Roure to Everyone (16:32)
Interesting about the absence of papers about a research question and a computational approach. One of the ideas last week for a new strand at the DHOx summer school was "computational thinking' - might come back to this in discussion...
Federico Nanni to Everyone (16:32)
👍

John Stell to Everyone (16:33)
was just going to say computational thinking outside disciplines is a key thing that is needed somehow

Ruth Ahnert to Everyone (16:34)
Yes, hard agree John - that's been a key drum that has been beaten in the Steering Group Anna-Marie and James have been convening. :-)

Katie McDonough to Everyone (16:34)
The survey: https://forms.office.com/pages/responsepage.aspx?id=wE8mz7iun0SQVILORFQIS7LxqUXCIkhAjlni55pui4ZURFNITDJGUzZaRVRXOFhFUUVDU1VBUElORi4u

John Stell to Everyone (16:36)
does it have to be HPC to be important computational thinking? what about computational thinking with less fancy tech?

Melodee Wood to Everyone (16:36)
https://forms.office.com/r/wFRXVBM2Bt

David Beavan | I've got you on the big screen to Everyone (16:39)
Running with your comment John, for me, no. What I'd love to see is the barriers of entry to HPC to come down. So its easy to code a pipeline locally on small or sample data and move it to HPC on big data and have it just work. It's not like that, the barrier to entry is great

David Beavan | I've got you on the big screen to Everyone (16:39)
^ no as in computational thinking doesn’t just mean HPC

Thomas Padilla to Everyone (16:41)
I agree, Ive worked at multiple institutions where HPC are essentially nonusable for a number of disciplines. HPC engineers generally arent there to provide a ramp to use. Columbia University Libraries had been discussing a couple of years ago providing that, I dont know what to call it, "tier 1” support for HPC use. Not sure if it was implemented.

David Beavan | I've got you on the big screen to Everyone (16:43)
On the face of it, it's people and skills that are needed. In many respects that *should* be easier to put in place than a new compute facility or Pb of storage. But many HPC communities are fixed on equipment, because use of HPC is a given for them?
Thomas Padilla to Everyone (16:44)
need to run - thanks for making the meeting open for this librarian from Las Vegas!

David Beavan | I've got you on the big screen to Everyone (16:45)
👋

Valeria Vitale to Everyone (16:45)
Thank you for joining!

Jez Cope to Everyone (16:45)
also hei investment is biased towards capital because if you invest in capital you still have the asset on your balance sheet, whereas if you invest in hiring or training people that money disappears

Jez Cope to Everyone (16:46)
hiring people is risky because you commit to keep paying them or giving them a redundancy payment (unless you can make them fixed-term) and training people is risky because they may leave as soon as you've given them the skills

Claire Gorrara to Everyone (16:46)
Agree, Jez, there are really inhibiting institutional cultures

Vera Matser to Everyone (16:47)
My background is working with the European HPC Centre of Excellence and trying to close the skills gap. Many of the issues are very recognizable. Really enjoyed both your talks.

David Beavan | I've got you on the big screen to Everyone (16:47)
So, is that where a national (distributed?) centre can provide those skills for the long-run? Separated from projects that come and go, and plugging into existing infrastrusture capital investments?

Claire Gorrara to Everyone (16:48)
cultural translation is key - commercialisation is a term that is dissuasive for my A&H scholars. Other words are needed

Claire Gorrara to Everyone (16:49)
UKRI impact accelerator accounts can fund such documentary tool kits

Vera Matser to Everyone (16:49)
There are the national competence centres for HPC, but I haven't really heard humanities talked about in that context but that could be because I come from a life science/biomedical angle

Katie McDonough to Everyone (16:50)
That's a great point @claire , thanks!
Valeria Vitale to Everyone (16:50)
Also, an excellent documentation still wouldn't count as a publication, in most contexts

David Beavan | I've got you on the big screen to Everyone (16:51)
@vera - interesting, will google about a bit

Claire Gorrara to Everyone (16:51)
It would for a REF impact case study....

Melodee Wood to Everyone (16:51)
@Valeria That's the thing! And Where do you "publish" it? Not in Journal of Victorian Studies (although they are very nice who do publish a lot of my methodology ramblings!)

Kalle Westerling to Everyone (16:51)
@Barbara - something for Journal of Open Humanities Data? Publishing documentation - does that align with the journal's goals?

Claire Gorrara to Everyone (16:52)
Many of us in universities with A&H would like permanent roles such as yours Anna-Marie within our research investment ARIA is being launched

John Stell to Everyone (16:55)
general problem with academia: exposition, making things accessible, documentation, etc not valued as “original” research

Jez Cope to Everyone (16:56)
@John and similarly maintenance is not as interesting as innovation

David Beavan | I've got you on the big screen to Everyone (16:59)
There's also the question about how well suited HPC investments are for A&H jobs. We wrote up an article UCL, and found just how input/output bound A&H tasks were.... and the existing HPC wasn't well suited for that. Maybe it's changed now, but there's a call for A&H case studies to be part of shaping HPC

David De Roure to Everyone (17:01)
Agree with Vera. This was apparent when we did the UKRI "sciecne case for UK Supercomputing" a couple of years ago.
Melodee Wood to Everyone (17:01)
DOI:10.5281/zenodo.4985325

David Beavan | I've got you on the big screen to Everyone (17:02)
Thx all, we need a discord after chat for these things ;) Great talsk and siscussion

Vera Matser to Everyone (17:02)
Thank you

## Notes 25 May 2022
## Topic: Open Data

**Participants (write your names below)**

*Name/Institute*

- Fede / Turing
- Nilo / Turing
- Anne / Turing (The Turing Way)
- Kalle / BL + Turing (Living with Machines)
- Jen / Turing (ASG RAM)
- Katie / Turing (Living with Machines)
- Jez / BL
- Barbara / King's and Turing

**Volunteers to take notes: Please add your name below**

### Slides

[Link](https://docs.google.com/presentation/d/1qqCjjNTzA0iJUQycUl4fWq6WyKzke9y-Bu0vKriIy8g/edit?usp=sharing)

### ✍ Initial Drafted Notes

*anyone can help taking notes*

<!-- All things discussed during the meeting can be entered here. -->

Data sharing is a standard practice in the natural sciences but not in the social sciences and humanities

Humanities are empirical (are they?)

Humanities research should be replicable -> Publishing data is the first step

Practical methods for implementing open data in the humanities:

- publishing data online?
- publishing data papers?

Journal for data papers: JOHD https://openhumanitiesdata.metajnl.com/

Promote data reuse values in the humanities

A data paper accompanies a dataset, describing it

- Short data papers: 1k words
- Research papers: 3-5k words --> increased publication, showed there was a niche to fill

Storing in open access repositories: JOHD dataverse, zenodo, figshare, SND, DANS

The paper is reviewed, not the data

Bigger topics:

- Quality of data and the author's perspective
- Data review in open data publishing: should it be reviewed? who should do that?
- Open scholarship in science vs humanities:

Q: How do you define data, and is that tied to a research paper or data paper?

A: Big divide between humanities & natural sciences. Esp NLP+software using disciplines.
Not as systematised data (as compared to NLP for example).

Dora: There is a long tradition in corpus research of publishing a corpus as a resource, including cleaning, tagging data etc.

Barbara: in fact a lot of data papers we publish are from linguistics, but we see this spreading to other humanities fields. Sometimes it's a question of using the term "data", "corpus", "archive", "collection" depending on one's background

Katie: do I want my data to reproduce the specific experiment or do I want someone to reuse data that I have created? You make interpretative choices when turning sources into machine-readable data

Barbara: explain / contextualise the dataset in a data paper

Nilo: humane reproducibility: in certain fields, it has a very different face. In the humanities, maybe a separate or different movement is needed, because the humanities have separate needs. Re: whether people actually use open data - more often studies cannot be reproducible because they deal with unique objects. Ex: French Revolution. But it's possible to reproduce the way of investigating. Reproducibility =/= replicating a result or study.

Fede: We put lots of value on our disagreement - which becomes a contribution. Was presenting datasets from histories, colleagues in NLP would say that the problem is ill-defined. How to come with the value of disagreement.

Dora: corpus resources are for complementing work and data sources. Not about reproducibility to look at variables. Looking at contextual variables. Making data available in a way that is documented to cover biases.

Nilo: defining an annotation scheme is very complex and it's important to document the choice

Tom: historical musicology derives its origin from corpora

Glenn: there is value in biasing in data, diachronic linguistic research

Jez: chemistry research, there is one dataset on which we have done our studies. and then they also look at several publications.
you publish something which goes more into the meta-analysis level

Anne: Ethnography is a reproducible method.

Fede: What happens with the data afterwards?

Nilo: the responsibility is on the end users; the important thing is documenting the choices clearly

Dora: on the gold standard topic, this is a broader complex topic. Every dataset will be very contextualised.

Barbara: the question re: gold standard proves the importance of data papers.

Katie: Talking with historians working on world history, who say they want places to put data, but there's already a lot out there. Trying to 'do something with the data' has been done under different circumstances, contexts, the idea that this can be brought together & do global history with it.

Fede: Who guarantees quality? How to find the time to review the data of others?

Tom: Need to distinguish between data vs sources. Scares people without computational background.

Q from Jez: Practice research is another interesting avenue to consider wrt reproducibility: if you are trying to say/discover interesting things through examining your practice, is it necessary to document that practice in a way that we would recognise as "reproducible"?

### Chat

Federico Nanni to Everyone (16:03)
Minutes: https://hackmd.io/@turing-hds/DiscussionGroup

Federico Nanni to Everyone (16:08)
Minutes: https://hackmd.io/@turing-hds/DiscussionGroup

Me to Everyone (16:08)
link to the slides are also available in the minutes notes but for convenience: https://docs.google.com/presentation/d/1qqCjjNTzA0iJUQycUl4fWq6WyKzke9y-Bu0vKriIy8g/edit#slide=id.g9e15982f89_0_0

Dora Alexopoulou to Everyone (16:24)
There is a long tradition in corpus research publishing corpus as a resource, including cleaning, tagging data etc.

Barbara McGillivray to Everyone (16:25)
yes, I agree! and in fact a lot of data papers we publish are from linguistics, but we see this spreading to other humanities fields.
Sometimes it's a question of using the term “data" , "corpus”, "archive”, "collection” depending on one's background

Katie McDonough to Everyone (16:26)
For ex in social sciences in the US, there has been huge infrastructure supporting data reuse - https://www.icpsr.umich.edu/web/pages/index.html Sorry, didn't see Dora's comment before I spoke!

Mark Bilby to Everyone (16:31)
Data curation can also be a way of quantifying and assessing scholarly disagreement!

Katie McDonough to Everyone (16:31)
We are on the same page 🙂

Jez (he/him) Cope to Everyone (16:31)
I'm a mathematician by training: compared to that *everything* is ill-defined :D

Katie McDonough to Everyone (16:31)
haha

Katie McDonough to Everyone (16:38)
@Tom this is such an important consideration. We need to work harder to talk about how lots of humanistic research is about collecting resources, and here perhaps the jump is about making it machine-readable

Tom Irvine to Everyone (16:40)
@Katie, yes. The machine-readability piece is now much easier I think, there is a whole comp sci sub-discipline, Music Information Research, that is just about this jump…

Mark Bilby to Everyone (16:41)
@Tom Has the open access/data movement in general started to take hold in historical musicology? Are there concerted (pun intended) efforts to publish open musical scores and editions? Open data curation and publishing can be a powerful way to disrupt and transform for-profit / monopolistic publishing models that dominate many disciplines.

Tom Irvine to Everyone (16:43)
@Mark there are unfortunately some particular hurdles to this (copyright, monopolizing publishers, straight-up Luddites, jealous archivists etc…)

David De Roure to Everyone (16:43)
Sorry I had to step out during the talk (but I;ve looked at the slides!) I was ina meeing aout TREs, and it might be interetsing to mention this.

Me to Everyone (16:46)
@Tom - I'm from a background in theatre history, where I certainly have seen similar hurdles!
Tom Irvine to Everyone (16:48)
@Kalle, yes. What is the way around these problems? I see a lot of interesting open data work around pre-1500 Western repertoires, where no one is particularly fussed about “my composer your composer” but not as much after. I was a postdoc on an open Mozart edition in the 2000s and it failed…

Katie McDonough to Everyone (16:49)
my favorite topic!

Me to Everyone (16:49)
@Katie - Mozart editions? 😅

Katie McDonough to Everyone (16:49)
Ah, sorry, gold standards 🙂

Federico Nanni to Everyone (16:50)
ahahah

Katie McDonough to Everyone (16:50)
no shade on Mozart though

Me to Everyone (16:50)
😛

Jez (he/him) Cope to Everyone (16:50)
Practice research is another interesting avenue to consider wrt reproducibility: if you are trying to say/discover interesting things through examining your practice, is it necessary to document that practice in a way that we would recognise as "reproducible"?

David De Roure to Everyone (16:50)
@Jez good questin!

Tom Irvine to Everyone (16:52)
The issues with global history are similar to those with musicology!

Mark Bilby to Everyone (16:54)
On the downstream / reuse aspect of open data, I wonder if there is an opportunity for increased/formal coordination between the Kaggle data science community and open data academic publishers. Kaggle datasets vary widely in quality, pre-processing, curation, and contextualization, and yet they generate massive interest and use through question- or challenge-specific competitions or use in data science curricula and modeling. The Kaggle community could benefit from expert-curation of data by specialists, especially in the Humanities. Conversely, academic journal published datasets could benefit from the broad scale public interest and data science challenge approach that Kaggle brings.

Glenn to Everyone (16:55)
Mechanical Turk ;-)

Jez (he/him) Cope to Everyone (16:55)
Yet more unpaid labour...
:)

David De Roure to Everyone (16:55)
I introduced a discussion last year on the hidden labor of AI vs crowdsourcing 🙂

Dora Alexopoulou to Everyone (16:56)
Mechanical Turk does not pass ethics approval in many universities now.

Nilo Pedrazzini to Everyone (16:56)
Corpus is definitely data!! ;)

Katie McDonough to Everyone (16:57)
So many conversations about the word data.

Katie McDonough to Everyone (16:57)
https://miriamposner.com/blog/humanities-data-a-necessary-contradiction/

Glenn to Everyone (16:57)
We've definitely touched both senses of "data" and "information” in this discussion

Mark Bilby to Everyone (16:57)
Lor is scary, not Data.

Jez (he/him) Cope to Everyone (16:58)
Star Trek references >_<

Katie McDonough to Everyone (16:58)
https://library.stanford.edu/blogs/stanford-libraries-blog/2021/05/everything-data-except-when-it-isnt

Rossitza Atanassova to Everyone (16:58)
How are students taught about data in humanities courses if practitioners are still not sure about it?

Katie McDonough to Everyone (16:59)
They mostly aren’t…

Glenn to Everyone (16:59)
Thank you for organizing!
## Notes 30 June 2021
## Topic: Non-English NLP

:mailbox_with_mail: [Invitation prompt](https://hackmd.io/GGuqNbzpS2qGvwh28N5e9w)

**Participants (write your names below)**

*Name/Institute*

- Katie McDonough / Turing
- Federico Nanni / Turing
- Malvika Sharan / Turing
- Leontien Talboom / UCL & The National Archives, UK
- Quinn
- Daniel
- Kevin
- Javad
- Alex Brandsen / Faculty of Archaeology, Leiden University
- Martin
- Serge Sharoff / University of Leeds
- Ludovic Moncla / LIRIS, INSA Lyon, France
- Thao Do

**Volunteers to take notes: Please add your name below**

- Fede
- Leontien
- Malvika

### Slides

https://hackmd.io/_7ga9589T8OqQs1F6GWpEA

#### References from slides/invitation

- Just an example of why Wikipedia should not be always assumed to be an available corpus for each language: https://www.theguardian.com/uk-news/2020/aug/26/shock-an-aw-us-teenager-wrote-huge-slice-of-scots-wikipedia
- Quinn Dombrowski, "What's a "word": Multilingual DH and the English Default" (2020) http://www.quinndombrowski.com/?q=blog/2020/10/15/whats-word-multilingual-dh-and-english-default
- The Multilingual Digital Humanities initiative https://multilingualdh.org/en/
- Emily Bender on (her) Bender Rule (2019) https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/
- Right to Left Conference 2021 https://dhsi.org/dhsi-2021-online-edition/dhsi-2021-online-edition-aligned-conferences-and-events/dhsi-2021-right-to-left/
- NEH-funded project: New Languages for NLP https://newnlp.princeton.edu/
- https://gscl.org/en
- https://www.ai-lc.it/en/
- https://hausanlp.github.io/
- Domenico Fiormonte, "Towards a Cultural Critique of Digital Humanities" in Debates in the Digital Humanities 2016

### :dart: Quick Question

*Feel free to answer or add a '+1' next to a statement that you agree with and/or would like to discuss*

**What is your experience with non-English languages in NLP or DH?**

### ✍ Initial Drafted Notes

*anyone can help taking notes*

<!-- All
things discussed during the meeting can be entered here. -->

Non-English NLP. Lots of development on working with non-English languages happens outside of the NLP community and in DH. How do we approach working with non-English material in these communities?

Problems:

- getting data
- developing methods
- sharing work
- how we teach

Significant differences across centuries for many languages.

Many NLP methods are developed to work in English; there are gaps in tools / software libraries for working in other languages.

The imperative to present in English is a problem: if there is no attention to linguistic diversity in the classroom then there will be a lack of it in the public sphere.

Why do people work in other languages?

Quinn: Often there are no things out of the box that you can use, because most of the time these are things that are not really used in the NLP community.

Katie: the tools that are out there were terrible and don't work.

Alex: the Dutch archaeology pipeline is not generalisable and so very hard to publish

Serge: the diversity of language is quite large

![](https://i.imgur.com/f2TE1sD.jpg)

the rise of multilingual methods means fewer language resources are needed, and for some tasks this might work

Daniel: relation between business and law - if you have a model that assigns a tag and works well in English, does this work well in another language?

Serge: machine translation and multilingual transformers might work

Quinn: there is unavoidable labour for annotations. Often it is not scalable

Katie: many complex and time-consuming questions

Fede: finding good annotators across languages is hard

Alex: there are many gray areas when working in other languages

Leontien: archival selection is approached in a very engineering-driven way

Serge: simple engineering solutions might be a starting point to enable language-specific research

Katie: what are typical problems when teaching multilingual NLP?

Quinn: words are not words and sentences are not sentences.
Katie: certain types of linguistic work in the humanities are easier to do when working in English compared

### :books: Reference and other works mentioned during the discussion

*Please add links and references to any work that has been discussed and mentioned*

* Nekoto, W. et al. 2020. ‘Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages’, Findings of Association for Computational Linguistics, ACL Anthology. URL: https://www.aclweb.org/anthology/2020.findings-emnlp.195 DOI: 10.18653/v1/2020.findings-emnlp.195
    - This paper proposed participatory research for African languages
* This paper from researchers in Nigeria who are working to embed Hausa and other local languages for NLP: https://arxiv.org/abs/1911.10708 https://github.com/hausanlp
* Linguistic variety, cognate languages and NLP: http://corpus.leeds.ac.uk/serge/publications/2020-jnle.pdf

---

## Notes 28 April, 2021
## Topic: Open-source Journalism

**Participants (write your names below)**

*Name/Institute*

- Federico Nanni, The Alan Turing Institute
- Leontien Talboom, UCL & The National Archives UK
- Malvika Sharan, The Turing Way - Turing Institute
- Katie McDonough, The Alan Turing Institute
- David Beavan, The Alan Turing Institute
- Giovanni Maria Pala, University of Oxford/Magdalen college
- Bernard Ogden, The National Archives
- Camila
- Andre
- Rossitza Atanassova, British Library
- S. Sharoff
- Ridda
- Ismael
- Andre Piza - Alan Turing Institute
- Andrea Kocsis, The National Archives

**Volunteers to take notes: Please add your name below**

- Federico
- Malvika

### Slides

[Open Source Journalism](https://docs.google.com/presentation/d/1EWPeRaRDjYKbs8Y1P06rqKIB8Ul0AUhUMNOmHxD31pA)

- references:
    - https://www.bbc.com/news/technology-22214511
    - https://www.bellingcat.com/

### :dart: Quick Question

*Feel free to answer or add a '+1' next to a statement that you agree with and/or would like to discuss*

**What is your experience with open-source journalism?
Do you think it could be of any help for your discipline? If so, why?**

*Name / response*

* David / Along with Camila here on the call, I'm part of [Turing Data Stories](https://github.com/alan-turing-institute/TuringDataStories): a mix of open data, code, narrative 💬, visuals 📊📈 and knowledge 🧠 to help understand the world around us.
    * Blatant self promo: Come join us, with stories, data, ideas and community building
        * [name=Camila] 🙌 YASS!
* Ismael / Not very much, but I spoke with the Comms team today and it was interesting to hear how things have changed during the pandemic, as journalists now just look out for preprints. See Fox ([2020](https://www.sciencemediacentre.org/what-should-press-officers-advise-on-preprints-during-a-pandemic/)).
* David / One of my fave exhibitions ever was [Forensic Architecture](https://forensic-architecture.org/) '...undertake advanced spatial and media investigations into cases of human rights violations, with and on behalf of communities affected by political violence, human rights organisations, international prosecutors, environmental justice groups, and media organisations.'
* Camila: One of my main sources for Venezuelan news is an [OSINT account](https://twitter.com/conflictsw) that can't be censored by the regime. It has been extremely useful for knowing what is happening in certain events.
* Malvika: Positive journalism in the context of citizen science (researcher's night, road show, data reporting, blogging) and negative journalism in the context of public shaming of people on social media (twitter trends on open source related news - current one on basecamp).
    * I am thinking more in the direction of peer-reviewing of news where ideas are represented fairly (a more wholesome view of world events rather than living in a bubble).
* Andre: I'm a journalist by background and I'm running a project with the Bureau Local (a branch of the Bureau of Investigative Journalism) that works with a lot of open source or citizen journalism principles. The project I'm running is a collaboration with Coney (interactive theatre company) and they are working on innovative community engagement, investigation, storytelling and impact strategies. The idea is that this could help both organisations to tell contemporary stories that matter to people, co-creating with communities. It is based on the idea that traditional journalism is extractive of society and it needs to work better for communities affected by the issues that journalists cover. **For reference**: [FT's The Uber Game](https://ig.ft.com/uber-game/)
* Katie: I am familiar with a couple of projects at the Spatial History Project at Stanford that are nice combinations of mapping/data viz and journalist- and documentarian-driven projects. For ex: http://web.stanford.edu/group/spatialhistory/cgi-bin/site/project.php?id=1045
* Giovanni / No real experience but curious about the value as a source, both for current affairs and Social Science, but also within a more classic "historical source" framing. I wonder how much notions of legacy/obsolescence of the journalistic open-data are discussed within these groups.

### ✍ Initial Drafted Notes

*anyone can help taking notes.*

<!-- All things discussed during the meeting can be entered here. -->

- Starting point on the Boston bombing and picking the wrong suspect (discussion from online forums like Reddit). During the last 5 years, open-source journalism as a counter-movement to alt-facts.
- Transparency as a key aspect - very very clear where they got their sources from and there's lots of help from the public. Example of Bellingcat geolocating pictures. Discussion around information is open and public.
- S. Sharoff: texts published not by professional journalists but by everyday people.
The format and style are often similar but the type of content is different. The founder of Bellingcat is not a journalist. Andrea: she is trained as a journalist, but it's not the education that makes you a journalist, it's more about the practice.
- Andrea: internship is again learning through practice instead of professional training (trust in editing)
- Open source journalism is about building accountability. Quicker response to any concern compared to established outlets.
- S. Sharoff: there are some linguistic differences in the way information is presented. Often opinions and third-person viewpoints are not mixed. With fake news spreading, these are often more personal
- André Piza: open-source journalism has a great impact on organisations. They recognise that there's a big change in the field
- Andrea: the role of citizen journalism in anti-democratic countries.
- Malvika: open-source journalism and open-source development. Transparency seems to be the key. Peer reviewing, challenging the power imbalance and allowing users to become contributors.
- Here they define if blogging can be called open source journalism: [https://www.upstart.net.au/explainer-open-source-journalism/](https://www.upstart.net.au/explainer-open-source-journalism/)
* Crowdsourcing data and stories:
- Dave: transparency is the main aspect. Bellingcat is based on open source data. Traditional journalism doesn't show the sources and the flow so clearly ("protecting sources").
- Katie: writing about research and writing about journalism. Training graduate students how to write to non academics about their research - often it is done to replace the comms stuff.
- Dave: Turing Data Stories: https://github.com/alan-turing-institute/TuringDataStories
- Andre: Crowdsourcing data and stories: https://www.thebureauinvestigates.com/blog/2021-04-14/a-blueprint-for-investigative-journalism-how-the-bureau-worked-alongside-riders-to-investigate-deliveroo
- Leontien: FullFact and it's not about being trusted, it's about being trustworthy.
- Serge: writing is for an audience and for different audiences you'll have different messages.
- Andrea: the audience is more differentiated than just age. Older demographics are more exposed to fake news
- Camila: youtube as first source of news for young people.
- Andre: since 2019 social media is no longer growing as a source of news, and trust in people is also changing.

### :books: Reference and other works mentioned during the discussion

*Please add link and reference, any work that has been discussed and mentioned*

## Notes 24 February, 2021

### Topic: **The Role of Authorship** ([slides](https://docs.google.com/presentation/d/1oanQWbP_yg9UkLSI9OrNwSxST0K5mkgXvNiymz4ixqw/edit?usp=sharing))

**Volunteers to take notes: Please add your name below**

- Leontien
- Malvika

**Participants (write your names below)**

*Name / Institute*

- Federico Nanni, The Alan Turing Institute
- Leontien Talboom, UCL & The National Archives UK
- Malvika Sharan, The Turing Way - Turing Institute
- Katie McDonough, The Alan Turing Institute
- Emma Karoune, The Alan Turing Institute
- David Beavan, The Alan Turing Institute
- Ismael Kherroubi Garcia, The Alan Turing Institute
- Jez Cope, The British Library
- Glenn, The Alan Turing Institute
- Jenny Bunn, The National Archives
- John Moore (The National Archives)
- Alessandro Tirapani
- Isil Bilgin (University of Reading, Brainhack Global Organization Committee)
- Kaspar Beelen, The Alan Turing Institute
- Bernard Ogden
- Serge Sharoff, University of Leeds
- Becca Hutcheon
- James Smithies, King's Digital Lab, King's College London
- Rossitza
Atanassova, The British Library

### :dart: Quick Question

*Feel free to answer or add a '+1' next to a statement that you agree with and/or would like to discuss*

**Is authorship a topic you discuss openly with your colleagues / group? Or is it something that comes out only at the very end of a project (when you’re about to submit a paper for instance)?**

*Name / response*

- Jenny - Coming from the perspective of being an archivist we have been reluctant to claim ownership of our work (e.g. in the form of a catalogue). We operate in a different framework around recognition
    - For many years it has been seen as important that we are invisible, neutral in the process, but this is increasingly being questioned.
    - +1 DB: neutrality I suspect is a myth, the catalog is a reflection of individuals, and certainly society at time of writing/editing.
    - JB - Exactly this is what is being recognised and some are calling for colophons to be added - a sort of positionality statement for archivists.
* Emma - single author still fairly common in my field, archaeology, but I am speaking about this a lot at the moment with The Turing Way - discussing how many authors and contributions can be captured or attributed fairly, how to record collaborative projects etc.
- Ismael - Yes, with those I discuss philosophy of science with; not with the people I am currently preparing a conference paper with. Hmmm, may have caught myself there.
* Malvika: Yes, but because I am not currently in a job that sits in a traditional academic system. It is quite important in the open source community as a lot of work stays hidden and we would like it to receive acknowledgement.
* Isil: I guess authorship agreements differ greatly between fields, and even countries, depending on the a priori accepted settings regarding the order, who should be there, and so on. This seems like a challenge in many aspects to break when it becomes a backbone structure, especially when you are an ECR.
The question here is how to decide which contribution is worth more than another, or how to quantify them, given that we are all working under public funding, where the outputs and benefits of the research are more important to focus on than these discussions.
* James: We're taking a more ad hoc approach than we'd like in King's Digital Lab, due to other priorities. It's something we're starting to consider now, but on a case by case basis. We are running a 'Research Collaboration Framework pilot' that defines RSE research involvement in the pre-grant stage, and defines high level authorship / attribution expectations, but doesn't get down to the specifics of attribution in articles. In general, King's DH has long assumed a 'movie credits' approach to attribution on websites, but we do have work to do with other kinds of outputs. It would actually be useful to have someone other than me here, given I'm lab director and it's really an issue for the RSE team to decide (in my opinion).
* [name=Jez] Mostly discuss this as part of the writing process myself; very interested in recording the different roles in metadata, see e.g. [CASRAI Contributor Roles Taxonomy (CRediT)](https://casrai.org/credit/)
* Yes, and it gets amplified when there may be different linked outputs, e.g. papers and code. Getting agreement over who are authors, who are contributors and who are acknowledged is always hard +1 [name=Glenn]
* Alessandro: in social sciences, it is uncommon to have more than 3 authors. Most papers that are not single authored have quite clear rules. The person who did most work/had the idea/collected data is the first author, and the others are listed depending on contribution. At times a new author is added in the review process, and they would normally go last. So I think there is not much debate beforehand, but it can be discussed when more than one person collects data or does substantial work along the way. In a few cases, you can see a \* saying 'All authors contributed equally'.
### ✍ Initial Drafted Notes

*anyone can help taking notes.*

<!-- All things discussed during the meeting can be entered here. -->

The first discussion was about the contribution of code: how does this work? GitHub holds a history of who has worked on the code, but what if you move it?

Jenny talks about the 'invisible labour' of doing archival work and that archivists are complicit with that as they perceive themselves as neutral. This question has come up more now that there is more recognition for preprocessing datasets within academia. But archivists do not get recognised in the same way as someone who may be doing this preprocessing.

At King's Digital Lab they are looking at different ways that people may be recognised for their research contribution. They are able to choose between three options: co-investigator, research services or undertaking a project and getting recognition for it.

Other invisible work is done by, for example, students, who may not be recognised for the tedious work in projects. Jenny also talks about participants in this setting and that it is also about understanding when recognition is important.

Alessandro talks about how in social sciences the position is almost irrelevant. It doesn't really matter as much, whereas in other fields it matters quite a lot.

Fede then mentions how he would like to discuss interdisciplinary work and publishing it. What do you do? Do you publish in a highly prestigious interdisciplinary journal? Or do you publish in two different journals? What is the best way of doing this? Fede also asks if code is something that is seen as more general across disciplines. David said that it is not: there are different ways of coding and different ideas of good code.

Ismael asks if people really are seeking recognition and Jenny talks about how there are different levels of recognition; there is also a difference between people who want the recognition and people who need the recognition.
Isil talks about how every contribution is important. Creating a mindset and a community where any contribution is a contribution is important.

### :books: Reference and other works mentioned during the discussion

*Please add link and reference, any work that has been discussed and mentioned*

- [CASRAI Contributor Roles Taxonomy (CRediT)](https://casrai.org/credit/)
- [The Turing Way](https://the-turing-way.netlify.app) - a guide to reproducible data science that will support students and academics as they develop their code, with the aim of helping them produce work that will be regarded as gold-standard examples of trustworthy and reusable research.
- [Communicating Open Science](https://github.com/alan-turing-institute/the-turing-way/issues/1733)

## Notes 01 December, 2020

## Topic

**Ground Truth and the Humanities**: The Structured Representation of Places (and other Named Entities) in a Knowledge Base ([slides](https://docs.google.com/presentation/d/1PIddDoFrhQsSwxvfy715_5idNujxbnmv2Jo-ztS-8_s/edit?usp=sharing))

## Volunteers to take notes

- Fede
- Malvika

## Participants

**Participants (write your names below)**

*Name / Institute*

- Federico Nanni, The Alan Turing Institute
- Leontien Talboom, UCL & The National Archives UK
- Malvika Sharan, The Turing Way - Turing Institute
- Katie McDonough, The Alan Turing Institute
- Ludovic Moncla, INSA Lyon
- Carmen Brando, EHESS Paris
- Rossitza Atanassova, British Library
- Bruno Martins
- Matt
- Arno
- S.
Sharoff, Leeds
- Gethin Rees
- Beatrice Alex
- Janelle Jenstad, University of Victoria, Map of Early Modern London
- Daniel Wilson
- Kaspar Beelen
- Ruth Mostern
- Karl Grossner, World Historical Gazetteer
- Arianna Ciula, King's Digital Lab, King's College London (UK)
- Francesca Benatti, The Open University
- Enrico Daga, The Open University
- Miranda Lewis
- Mark Bell
- Katherine Bellamy
- Yann Ryan, QMUL, Networking Archives project
- Arno Bosse, KNAW Humanities Cluster, Amsterdam
- Alex Butterworth
- Barbara McGillivray
- Bekka Kahn

✍ Initial Drafted Notes
---

*anyone can help taking notes.*

<!-- All things discussed during the meeting can be entered here. -->

- Ground Truth in Humanities: came across this through [field paper](https://wiki.openstreetmap.org/wiki/Field_Papers)
- Remote sensing data and historical map: what counts as ground truth about historical places?
- The concept of ground truth, especially in digital humanities
- Reference for the concept in the humanities
- Etymology of the word ground truth: remote sensing, the truth "in the ground". Correcting the digital data (no longer or never represented) what is on the ground. Ground truth as returning to the ground. Going back to the landscape
- As per wiki: https://en.wikipedia.org/wiki/Ground_truth
    - 'records the use of the word "**Groundtruth**" in the sense of a "fundamental **truth**" from Henry Ellison's poem "The Siberian Exile's Tale", published in 1833.'
    - [Geographic information systems](https://en.wikipedia.org/wiki/Geographic_information_system "Geographic information system") such as GIS, GPS, and GNSS, have become so widespread that the term "ground truth" has taken on special meaning in that context. If the location coordinates returned by a location method such as GPS are an estimate of a location, then the "ground truth" is the actual location on Earth.
- Arianna Ciula: in terms of the computer science context of verifiability and the lack of data in the humanities to often establish ground truth: https://drops.dagstuhl.de/opus/volltexte/2013/4167/pdf/dagman-v002-i001-p014-12382.pdf
    - useful to have in some circumstances, e.g. when substantial training datasets are needed
    - Coincidentally today with some other colleagues we had a meeting with Prof Charlotte Roueche about re-building/refreshing this gazetteer project https://www.slsgazetteer.org/ and one of the discussion points focused on the constraints on measurement methods at the time when names were recorded (the point is that we take for granted even the identification of exact coordinates).
- Janelle Jenstad: the need for an authority of names for places. What counts as ground truth**s** about a place? Who wrote those documents and why?
    - Who has the right to name?
    - Experience working on historical mapping of early modern London, North America's indigenous land.
    - Who from the past has the right to name the land, who have we forgotten, how do we develop a gazetteer that brings people to get their voices heard (truth from the land/ground)
- Ruth Mostern: what we see in published works and what we see on the ground are not the same thing! Go to the ground to figure out what is there. Thinking about the spatial humanities as part of the humanities. Decolonizing college campuses - list of every building / road in every campus and studying all individuals commemorated.
- Definition: A **gazetteer** is a geographical **dictionary** or directory, an important reference for information about places and place names (see: toponymy), used in conjunction with a map or a full atlas.
- Bea: working on NLP and geoparsing in Edinburgh. Ground truth for verifying an algorithm. The project and task drive the decision on the gold standard / ground truth.
- Katie: is the gold standard the same thing as the ground truth?
- Bruno Martins: we use ground truth referring to annotations that can be considered as truth. We often work on projects that have ambiguity - we look at the level of agreement. There are tools to quantify uncertainty. Reaching something that is close to a consensus. In geography you want to go beyond categorical variables. Does probability theory
- S. Sharoff: another point of view from NLP. In the classification of genres, for instance, there is often the problem of reliability of annotations. Annotations offer different perspectives. If you have a task for translating, a question is what is the cognitive difficulty of this (this is really difficult to establish)
- Katie: there is lots of theory on how to handle interannotator agreement, but with visual material the reference point is going back to pre-digital geography.
- Arianna C: To react to Bruno’s comment: agreed but I also think it depends a lot on the dataset we are discussing; e.g. in the case of palaeography mentioned above there are simply not enough data to establish a ground truth in the computational sense
- Francesca Benatti: Another historical example is the Irish Ordnance Survey, which renamed every single place name in Ireland from Irish Gaelic into English during 1824-1842, while Ireland was part of the United Kingdom. Even after Irish independence in 1922, the English place names are the ones that have remained in common usage. Some of the Irish Gaelic names have only survived in the documents of the surveyors. In Northern Ireland, certain places have different names for the nationalist community and for the unionist community e.g. Derry/Londonderry. “Truth” cannot be separated from the political.
- Karl Grossner: it is a bit ironic that a group of humanists discusses how to establish truth. There is observational data, but this is different from truth.
- Katie: Who's on first as an interesting project (from the internet, not from academia). How do we capture multiple truths?
- Arianna: Data Information Knowledge pyramid quoted a lot but not always clear what it means in practice so enjoying Karl’s explanation
- Ruth: Devil's advocate position: the fact that there was a Gaelic name that was erased by British colonialism (as per the example above) IS a truth, which needs to be excavated. Truth is a good word for this process.
- Bekka: Knowledge as ‘belief’ also makes me think about holy/spiritual places which were real at historical moments such as Hades, or Jerusalem (I know the team at KIMA have thought about this a lot).
- Arno: be transparent on what we want to use the gazetteer for. Why are you building the gazetteer?
- Katie: what is important when you construct a gazetteer (both historically and now)
- Enrico: what is the identity of a place? Are London and Londinium the same place or different places? You use the term ground truth to make a point in computer science. What is the role of "ground truth" in the humanities? Maybe we should speak more about "evidence".
- Arianna: defined data models in a layered way (definition of strata depends on scope/purpose/user research) - even a minimum data model for place can be quite complex as the concept is relational (space-time)

Full Zoom chat available [here](https://drive.google.com/file/d/1oVRDSPYsBZpUzylSjjubTQJLibdu2LTD/view?usp=sharing)

:books: Reference and other works mentioned during the discussion
---

*Please add link and reference, any work that has been discussed and mentioned*

- [field paper](https://wiki.openstreetmap.org/wiki/Field_Papers)
- Gazetteer project: https://www.slsgazetteer.org/
- ref by Arianna: the humanities to often establish ground truth: https://drops.dagstuhl.de/opus/volltexte/2013/4167/pdf/dagman-v002-i001-p014-12382.pdf
- Ref by Janelle: https://www.w3.org/2009/12/rdf-ws/papers/ws21: Halpin and Hayes, "When owl:sameAs isn't the Same"
- Ref by Karl Grossner: https://asistdl.onlinelibrary.wiley.com/doi/pdf/10.1002/asi.24194
- Ref by Francesca: Logainm.ie The Placenames Database of Ireland, which is working to register both Irish and English place names https://www.logainm.ie/en/. It also links to historical sources of place names
- Recent Reassembling Republic of Letters publication also includes useful reference to modelling of places: https://www.univerlag.uni-goettingen.de/handle/3/isbn-978-3-86395-403-1?locale-attribute=en
- cf. World Historical Gazetteer: http://whgazetteer.org

## Notes: 14 October, 2020

## Topic

- Ethical implications of archiving the web, especially social media

## Aim of this meeting

- We would like to talk about the benefits and drawbacks of guaranteeing long-term access to this type of material, focusing in particular on the dichotomy between authorial consent and historical preservation.
- Here are a few starting points for the discussion:
    - [Guest Editorial: Reflections on the Ethics of Web Archiving](https://www.tandfonline.com/doi/full/10.1080/15332748.2018.1517589)
    - [We Could, but Should We? Ethical Considerations for Providing Access to GeoCities and Other Historical Digital Collections](https://uwspace.uwaterloo.ca/bitstream/handle/10012/11649/Milligan_etal_JCDL2016%281%29-s.pdf?sequence=1&isAllowed=y)
- Archiving social media for good:
    - [https://www.docnow.io/](https://www.docnow.io/)
    - [https://www.bellingcat.com/](https://www.bellingcat.com/)

## Participants

**Participants (write your names below)**

*Name / Institute*

- Federico Nanni, The Alan Turing Institute
- Leontien Talboom, UCL & The National Archives
- Jenny Bunn, The National Archives
- Andy
- Nicola
- Helena
- Ian

:books: Reference and other works mentioned during the discussion
---

Recent blog post on an experiment we ran using Webrecorder in the last UK General Election: https://blogs.bl.uk/webarchive/2020/05/using-webrecorder-to-archive-uk-political-party-leaders-social-media-after-the-uk-general-election-2.html

This is an older blog post discussing archiving social media through Heritrix: https://blogs.bl.uk/webarchive/2017/04/the-challenges-of-web-archiving-social-media.html

IIPC collections are cross national and multi lingual. All are open access: https://archive-it.org/home/IIPC

WARCnet: https://cc.au.dk/en/warcnet/

IIPC Research Working Group: https://netpreserve.org/about-us/working-groups/research-working-group/

:mag: Main arguments from the discussion
---

The discussion started with a few examples of how archiving of social media is done in practice, covering the UK Government Web Archive, Bellingcat and Document the Now.

- UK Government Web Archive is archiving YouTube and Twitter, for example from government pages. They are only able to archive the original content, none of the comments or other community parts of it.
- Document the Now doesn't archive material itself but offers a set of tools to empower people to archive material online.
- Bellingcat does archive the context around it, sometimes this being private or confidential information. But they archive it as evidence, which is a slightly different purpose.

The starting question was around the difference between archiving and capturing social media. This then led to the British Library (BL) outlining their approach to archiving social media. They also talked about the Legal Deposit, as this limits them from archiving at scale. Another question was asked around UK content and how this should be determined, as these boundaries may not be as visible on the web.

Then there was a discussion around metadata, especially focusing on losing context, such as the UK Government Web Archive only capturing what the government does online. BL keep specific metadata with their material to preserve the context, but what is considered enough and what are we actually capturing?

:closed_book: Closing thoughts
---

Although the discussion may not have given concrete answers to what should and shouldn't be archived, it was good to hear people's different approaches to this and the problems they encountered when doing this type of work.

### Additional Drafted Notes

<!-- Other important details discussed during the meeting can be entered here. -->

We will start from a few examples on how this is done in practice.
- The UKGWA archives twitter, youtube and is trying facebook
- They keep the tweets (no context / retweets / comments)

Another example is Document the Now (they don't archive material themselves but offer tools)

- Bellingcat is an investigative journalism organisation. They use social media to debunk certain things - examples on Covid misinformation
- They have more of a "don't ask for permission but ask for forgiveness" approach.

Is it useful to archive - difference between archiving and capture - from BL. Archiving social media: preserving the entire life cycle. Legal deposit - they don't archive social media at scale. How to assess what is UK content?

- Heritrix used for archiving at scale - not suited for archiving at scale
- The UKGWA doesn't keep the dynamic nature of social media
- Losing context around the tweet
- UK Web Archive at BL tries to get context from metadata
- Social media material is as much as possible publicly available (you would need an additional permission from the content owner)
- what do we create when we are doing it? What are we capturing? (who archives social media at scale?)
- build social network from the XVIII century through letters - you can do that, but it is not there in the same form
- what are we preserving and why?
- Research on social media - but how did you collect them?
- Discussion around anonymisation / deanonymisation
- And the role of the archive in this - it could act as a collaborator for the researchers for guaranteeing the way data is collected
- Prioritisation from the archivist point of view. This is usually not a priority
- Technical expertise is needed
- We can run the code from our holdings
- CommonCrawl / Internet Archive are available so people can run their code first there
- document the process of how data has been collected
- question for the researchers: what do they want?

Discussion around preserving cross-national events

- the distinction between private and public is dissolved now
- the social context is all mixed together
- discussion about the role of BL on archiving and making available
- archiving newsletter? discusses how BL currently archives Weibo
- moving more into contracting out to the users the type of contents to preserve
- who makes the choice? Delegating to the crowd - but are we replicating the old model?
- people are already contacting the BL for preserving their online activities before

Discussion on how to communicate to the crawler what you should not

## Notes: 02 September, 2020

## Aim of this meeting

The focus is on the **current and future role of preprints** as a way of sharing research findings, with examples from different communities.

## Participants

**Participants (write your names below)**

*Name / Institute*

- Federico Nanni, The Alan Turing Institute
- Leontien Talboom, UCL & The National Archives
- Jessica Polka, ASAPbio
- Anna Rogers, University of Copenhagen
- Dmytro Mishkin, CTU in Prague
- Demitra Ellina, F1000Research
- Barbara McGillivray, University of Cambridge
- David Beavan, The Alan Turing Institute
- Rennie Mapp, University of Virginia, US
- Martin O'Reilly, The Alan Turing Institute
- Callum Mole, The Alan Turing Institute
- Adam Tsakalidis, QMUL & The Alan Turing Institute
- Alessandro Tirapani, City, University of London
- Giulia Paci, UCL
- Amy Tabb, at the meeting as an independent scholar, USA

:dart: Quick Questions
---

*Feel free to answer them or add a '+1' next to a statement that you agree with and/or would like to discuss*

**In which cases do you post a preprint of your work?**

*Name / response*

- Dmytro / Anytime, unless my collaborators are against it
- Leontien / Never, it is not common in my field
- Jessica / Always (except some commissioned review articles)
- Adam T / faster access (post-acceptance)
- David B / does final author copy count here? i.e.
to satisfy open access - where it's mandated by funder
- Martin O / Pre-submission or post-acceptance (depending on journal policy) if journal paper is not open access, as I always want a freely available copy. I'd like to move my default to pre-submission pre-print as standard practice
- Callum / I publish a pre-print at paper submission. Primarily for faster access since the review process can be so sluggish sometimes. I also like that the review process is then transparent (if the paper changes a lot from pre-print to journal article).
- Amy Tabb / most of the time when publishing w/ CS/ECE researchers.

**And when you don't?**

*Name / response*

- Dmytro / Only if co-authors are not allowed to.
- Jessica / Review article requested by the journal (depending on journal policy)
- Alessandro / In our field (organisation studies) it is extremely uncommon to do it altogether. Multiple journals ask you to take it down (few do it nonetheless) and some do not accept articles already posted online
- Amy Tabb / when the lead authors are not in favor and/or it is not in the discipline's tradition (entomology).

**How do you select which new preprints to read?**

*Name / response*

- Leontien / Mainly shared by people on Twitter [name=DavidB] +1 [name=Martin O] +1
- Dmytro / http://www.arxiv-sanity.com/, twitter
- Jessica / Twitter - we have cataloged some efforts here: https://reimaginereview.asapbio.org/explore/?search_keywords=preprint&sort=latest
- Amy Tabb / twitter

**Do you ever question your approach?**

*Name / response*

- Dmytro / No. I thought about it a lot, but cannot find reasons not to for myself.
- Amy Tabb / Also no. Preprinting has been very positive for my work and allows me to transfer the technology.

**Do you then regularly read the final paper when it is published?**

*Name / response*

- David B / Nope, things have often moved on well before.
Makes it tricky as to what to cite, the preprint or the final
- Leontien / Depends on the type of work it is, as some work will be outdated quite quickly
- Dmytro / rarely, mostly if it is updated on arXiv and the paper is very relevant to me
- Jessica / we have recommended that preprint servers implement changelog in metadata, would be good to see this for journals as well: https://asapbio.org/biopreprints2020-report
- Demitra / At F1000Research we combine the preprint with the open post-pub peer review process, with all versions linked and an amendment box explaining what has changed between versions
- Amy Tabb / not frequently because of access issues.

:books: Initial Drafted Notes
---

The Current and Future Role of Preprints Across Research Communities

Preprints - scholarly or scientific papers that precede formal peer review and publication in a peer-reviewed scholarly or scientific journal.

Some communities use them a lot more than others; there has also been a rapid increase of preprints in the last decades.

Why publish preprints?

- Increased visibility
- Increased citations
- Faster dissemination of results
- May prevent scooping
- It might be an easy way to wrap up a side project
- Bypassing paywall

Anna discusses the behaviour in NLP regarding preprints. The main point of preprints here is to try and get results out faster. Preprints can be very different from the published version; Anna gives an example of her own work which turned out completely different from the initial preprint. She talks about how even if the published paper turns out much better, not a lot of people revisit it. Dmytro disagrees, and talks about the fact that he does read an updated version if it is published.

Martin would love to see how a preprint can change over time; a change log would be very helpful with that. Are they even still the same paper over time? There are very few journals in the life sciences that point to the preprints.
Jessica talks about how journals should acknowledge that preprints exist. Some fields seem to be more comfortable citing preprints than other fields. In NLP, citing preprints seems to be more common than citing the actual published work. Across our different disciplines there are different approaches to editing and finishing off the final version of a published paper.

Another point that we touched upon is the sheer amount of preprints, and this slightly touches on the topic of trust in preprints. How do you select them? What approach is used here?

Demitra from F1000 Research talks about how she approaches different disciplines. They are looking at the differences between fields; for example, for some fields you need a PhD to be considered an expert, whilst others view a Master's degree as sufficient. Because the reviews are open and citable, it encourages people to do a better job at reviewing. Dmytro is concerned about early career researchers doing open critical reviews, which could impact their career. Demitra talks about how you can team up and protect yourself from these types of situations.

Preprints are increasing the speed of research, but does this push research into a certain direction? And is that the direction that you want to go in? Is this harmful for the research community?

Jessica - making science openly accessible and making it possible for everyone to provide their expertise gives the writers much better feedback than when using a more traditional peer-review approach. But the downside of this is that it is easier to disseminate misinformation.

Rennie, from a digital humanities background, takes an example of a journal from her field (Cultural Analytics). This journal is against preprints, because they can disrupt the blind peer-review process. Both Dmytro and Martin question why this process is blind. Anna would like to preserve anonymity; her blogpost about this is linked in the section below for more details.
Alessandro talks about how preprints are not well regarded in his field, and are not very well known either. Also, the discussion differs depending on the research methods: qualitative and quantitative material will need different approaches. He closes by mentioning that we should rethink the journal process.

:mag: Main arguments from the discussion
---

- Preprints may change drastically between the time they are made available and the time the actual paper is published. This raises questions about what is actually being cited, as the finished paper could be very different.
- Peer review is perceived very differently across different fields. It is difficult to find the balance between giving researchers credit for this work and keeping people's reputations intact if they provide a critical review, especially if they are early career researchers.
- The benefits of preprints differ across fields; some examples are given in the section above. Overall, there was a strong positive perception of publishing preprints, although some fields may be more used to them than others.

:books: Reference and other works mentioned during the discussion
---

- There is a tool for [seeing changes in arxiv papers](https://github.com/temken/comparxiv)
- [ReImagineReview](https://reimaginereview.asapbio.org/explore/?search_keywords=preprint&sort=latest)
- [F1000 Research](https://f1000research.com/about)
- Anna Rogers - [Should the reviewers know who the authors are?](https://hackingsemantics.xyz/2020/anonymity/#BharadhwajTurpinEtAl_2020_De-anonymization_of_authors_through_arXiv_submissions_during_double-blind_review)
- Dmytro Mishkin & Amy Tabb - [(part I) Hands off Arxiv!](https://amytabb.com/ts/2020_06_29/)
- Dmytro Mishkin & Amy Tabb - [(part II) What does it mean to publish your scientific paper in 2020?](https://amytabb.com/ts/2020_08_21/)
- Amy Tabb - [arXiv paper explainer](https://amytabb.com/ts/2020_08_09/)
- [Data feminism](https://mitpress.mit.edu/books/data-feminism) as an example of sharing qualitative research before publication
- [Twitter thread](https://twitter.com/annargrs/status/1301204793235566600) by Anna Rogers wrapping up the discussion

:closed_book: Closing thoughts
--

Next aspects on the topic that we can discuss in further sessions:

* open peer review in the humanities
* preprints / working draft and qualitative research
* being a reviewer as a "job"

-----------

## Notes: 24 June, 2020

- Topic: ***Commercial organisations doing the job of libraries/archives***
- [Slides](https://docs.google.com/presentation/d/1ZfY0\_GyYBkRyvkrCt\_7hJhShFNdYGaRYQUsKpizUBAY/edit?usp=sharing)

## Aim of this meeting

- Having a conversation rather than a one-to-many reading group
- Discussing topics at the intersections of the two disciplines
- Trying to consider different / uncommon points of view
- Sign up to this mailing list: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=TURINGINS-HUMANITIES-DATASCIENCE

**Participants (write your names below)**

Name / Institute / What brought you here?
(answer in a short sentence)

- Federico Nanni / Turing / Leading the session
- Leontien Talboom / UCL / chairing the session
- Katie McDonough / Turing
- Malvika Sharan / Turing / Community discussions
- Sarah Gibson / Turing
- Scott Bailey
- Rossitza Atanassova
- Patricia Murrieta
- KBeelen
- Eirini Goudarouli (TNA)
- Daniel Wilson / Turing / Historian working with/on 'data'
- D Vanstrien
- David Beavan / Turing / Co-Organiser Humanities & Data Science interest group
- Bernard Ogden
- A Lang
- Barbara McGillivray

:dart: Discussion Goal
---

Commercial digitisation is not a library, or is it?

Example 1:

- The National Archives has an agreement with "Findmypast" to secure records on ancestry
- Pro: Findmypast does the digitisation and preserves the data in different formats
- Con: the records are only available when visiting the National Archives reading room; accessing them from home is behind a paywall

Example 2:

- Google Books: digital bookstores are not libraries
- The books are in copyright, and authors don't benefit from them
- You need to pay to access the books, hence it is Google that profits from this and not society

Example 3:

- Internet Archive: not a library but piracy, as there is no license to make these books available

![](https://i.imgur.com/sQpDrsM.png)

Questions:

2. What are the drawbacks of commercial organisations acting in this environment?
    - As their main goal is to make money, how do we ensure that our values also come across?
    - How do we guarantee long-term preservation (when the hype is gone)?
    - If the business is based on data, how do we ensure that data is open and fully available?
3. How should we be setting up such a relationship?
    - Which value do we recognise in their work, apart from the invested budget?
    - How can we ensure that our expertise is not lost?
    - Or is it something that academia should discourage as a whole?

:books: Reference and other works mentioned during the discussion
---

- [Google books](https://books.google.com/)
- [Torching the Modern-Day Library of Alexandria](https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/) “Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.”
- [Google & the Future of Books by Robert Darnton](http://www.nybooks.com/articles/22281)
- Gale Cengage came up with the Digital Scholar Lab to allow computation with the digitised collections behind a paywall: https://insights.uksg.org/articles/10.1629/uksg.482/
- [Removing Barriers to Digital Scholarship](https://www.gale.com/intl/primary-sources/digital-scholar-lab)

:mag: Main arguments from the discussion
---

**Questions 1.** What are the benefits of commercial organisations acting in this environment?

- How would we otherwise fund large digitisation projects?
- Does this mean the material is more widely available?
- Does this simplify cross-country efforts?

**Discussion on Google Books and the British Library (BL) contract on the digitisation of literature**:

- Google books: https://books.google.com/
- What is their business model? It is very unclear to libraries and archives
- Rossitza: Google Books are still going, and digitising collections at the BL and other institutions
- Daniel Wilson (in chat): As Rossitza said, they *are* digitising thousands of books a month: but the business model is more opaque. I assumed Google Books was meant to be a ‘loss leader’ for the wider operation: PR for their ambition to ‘organise all knowledge’ (aka advertising)
- A Lang (in chat): book historian Robert Darnton wrote a good piece on this in the NYRB some years ago (ironically enough, behind a paywall: http://www.nybooks.com/articles/22281)

Questions on Google digitisation:

- Patricia Murrieta (in chat): what is the arrangement between the BL and them, Rossitza? In terms of what you get and what they get?
- The contract is available online; it is quite inflexible: https://www.openrightsgroup.org/blog/access-to-the-agreement-between-google-books-and-the-british-library/
- What are they interested in digitising? What kind of material? Just speculating about what their goal is :D
- Curators had the freedom to select materials, but there are restrictions on dimension and condition to meet the requirements of the scanning equipment Google use
- A lot of the books can be rejected if the metadata is missing
- There is a focus on scale rather than content, e.g. a request to digitise some specific material was rejected because metadata was missing and the BL did not have the resources
- The goal seems to be text mining and OCR ...
- Libraries are having different dialogues with the Google group (separately), and it's not consistent
- Even if their goal is not the most charitable, how can communities benefit from it?
- David Beavan (in chat): They are hoovering up all of human knowledge. Born digital for them = web, they have got it covered. This is a way of going back in time. Language models, semantic change, OCR, gateway to knowledge. If they become the de facto place for knowledge/search and put libraries out of business (even if only by convenience) then you’ll get adverts between page turns etc. etc. Google are ultimately an advertising company
- Daniel Wilson: there is one buyer and no competition. We need to understand what it is that they gain from this, in order to value the resource they are being given.
- In any case, my point was more that the BL felt its hands were tied, even before it got to that point
- Katie: Ancestry is free from local libraries in the UK during the pandemic.
- Genealogy organisations hold a lot of power (personal information)
- Based on where they are (America or UK), they also compete for information
- Patricia: It's in a way like publishing companies. In order to change the model, holders of knowledge would have to choose not to go with them...
- Mia: Other organisations can access Google Books, but they don’t mind the unlimited liability that Turing didn’t agree to
- Kate M (in chat): Is there any writing/research about which countries have provided public funding for digitization vs. those that have gone (at least primarily) with commercial digitization?
- Mia: Really great overview on the efforts of different countries in digitizing their cultural heritage (France, Finland, Australia, New Zealand, Canada) in comparison with the UK

:closed_book: Closing remarks/questions/topics (for future discussions!)
--

- In science we have a strong open movement, on the basis that taxpayers' money (public funding) going into research should produce output that is publicly accessible.
- However, that kind of funding is missing in the humanities, which is shocking given that the humanities affect generations of scholars, researchers, politicians and citizens.
- We have also realised that some researchers work in a field not because that's what they want to do, but because that's the only field they can access papers in. I wonder if that pattern exists within the humanities as well.
- The embargoes for IP rights on research output are the same across all these fields

### Additional Drafted Notes

<!-- Other important details discussed during the meeting can be entered here.
-->
-

## Notes: 20 May, 2020

## Aim of this meeting

- Having a conversation rather than a one-to-many reading group
- Discussing topics at the intersections of the two disciplines
- Trying to consider different / uncommon points of view
- Chatting over lunch (a tea / beer) to make it as informal and relaxed as possible
- Trying to have this the first Wednesday of every month (up for discussion)

### Topic

- The Computational Humanities and Toxic Masculinity? A (long) reflection ([Original blogpost](https://latex-ninja.com/2020/04/19/the-computational-humanities-and-toxic-masculinity-a-long-reflection/), [Our slides](https://docs.google.com/presentation/d/11qi43HYFjogFJV36u2pS0CPLqP43d_Ypvv4pYvxuKVY/edit?usp=sharing))
- Katie McDonough (from Living with Machines) will introduce the topic and Fede chairs the debate.

**Participants (write your names below)**

Name / Institute / What brought you here? (answer in a short sentence)

- Federico Nanni / Turing / Leading the session
- Leontien Talboom / UCL / Leading the session
- Katie McDonough / Turing / Chairing the session
- Malvika Sharan / Turing / Wants to capture different perspectives in the Turing Way project
- Barbara McGillivray / Turing and Cambridge
- Ismael Kherroubi Garcia / Ethics Research Assistant at Turing
* Sarah Gibson / I'm a Research Software Engineer in the Research Engineering Group at the Turing. I'm an advocate for reproducible research and work on open projects like mybinder.org and The Turing Way. I'm also on the Living with Machines project at the Turing.
* James Smithies from King’s Digital Lab, King’s College London
* Glen Cameron / Illinois (US) working at HathiTrust Research Center
* Laura Carter / Human Rights Centre at the University of Essex, currently an Enrichment student at the Turing
* Arianna Ciula / Deputy Director and Senior Research Software Analyst at King’s Digital Lab, King’s College London (UK)
* Scott Bailey / Data and Visualization Librarian at NC State University Libraries, but previously worked at the Scholars’ Lab @ UVa, and at Stanford’s Center for Interdisciplinary Digital Research (CIDR)
* Eirini Goudarouli / Head of Digital Research Programmes at The National Archives, UK
* Jane Winters / School of Advanced Study, University of London.
* James Cummings / Newcastle University, DH, Late Medieval Drama, TEI geek, that sort of thing.
* Luca Scholz / Lecturer in Digital Humanities at the University of Manchester (UK)
* David Beavan / Turing Research Engineering. Amongst other things, I’m co-organiser of the Humanities & Data Science SIG at the Turing, find out more here: https://www.turing.ac.uk/research/interest-groups/humanities-and-data-science
* Kaspar Beelen / Research Associate at the Alan Turing Institute (Living with Machines project)
* Charlotte Tupman / Digital Humanities Lab at the University of Exeter. Into ancient inscriptions.
* Giulia Occhini / PhD student at the Turing in Data Science/NLP/Digital Humanities and other stuff
* Sarah Lang / (also known as The LaTeX Ninja and author of the post discussed today) - my non-Ninja-self works at the Centre for Information Modelling (Zentrum für Informationsmodellierung) in Graz, doing my PhD in Digital Humanities on early modern science / alchemy. My internet isn't always stable, so no permanent video ;)
* Melvin Wevers / DHLab of the KNAW Humanities Cluster in Amsterdam. One of the organizers of the Computational Humanities Research workshop.
* Kevin Xu / Research Software Engineer at the Turing.
* Glen Worthey / U. of Illinois, Urbana-Champaign, at the HathiTrust Research Center. Thanks to Katie (my former Stanford colleague) for the invitation from across the Atlantic. Great to see many old friends and colleagues, looking forward to meeting new ones!
* David De Roure / (Dave D as opposed to Dave B…), a Turing Fellow and my project is AI and music (I’m a digital musicologist, also known occasionally as a computational musicologist…). In Oxford I look after the Digital Humanities network (DH@Ox) which comes together annually in the DHOx Summer school (a cut-down 3 day online event this year). I’m a visiting prof at the Royal Northern College of Music working on science and music. I’m also involved in the UKRI research and innovation infrastructure exercise.
* Daniel van Strien / I work at the BL as a digital curator.
* Olivia Vane / Research Software Engineer at the British Library (Living with Machines project)

:dart: Discussion Goal
---

-

:books: Reference and other works mentioned during the discussion
---

- Gender bias before and after “Computational Humanities,” some starting points
- [Beyond the Margins: Intersectionality and the Digital Humanities](https://www.digitalhumanities.org/dhq/vol/9/2/000208/000208.html), DHQ (2015) by Roopika Risam
- [The Radical Potential of the Digital Humanities](https://blogs.lse.ac.uk/impactofsocialsciences/2015/08/12/the-radical-unrealized-potential-of-digital-humanities/), Miriam Posner
- [Bodies of Information](https://dhdebates.gc.cuny.edu/projects/bodies-of-information), ed. by Jacqueline Wernimont and Elizabeth Losh (2018)
- [DH-WoGeM](http://www.dhwogem.org/)
- [Data Feminism by Catherine D’Ignazio and Lauren F. Klein](https://bookbook.pubpub.org/data-feminism)
- [LaTeX Ninja blog post](https://latex-ninja.com/2020/04/19/the-computational-humanities-and-toxic-masculinity-a-long-reflection), 19 April 2020 (the author is here!)

### Initial Drafted Notes

<!-- Important details discussed during the meeting can be entered here. -->

Here we can collaboratively take notes on the main points of the conversation, which we can then organise as below.

- Does computational skill denote some power structure?
- Is it assumed to be masculine (and hence to exclude women or other genders)?
- Digital humanities has grown out of the humanities by techie people; similarly, computational humanities seems to have come out of folks in computation who are interested in the humanities. Not sure if that's what creates a niche (Comment from Zoom: Does CH vs DH play back into long-disproved stereotypes of The Two Cultures? Or is it different from that?)
- A lot of the points in the blog post echo the human rights approach that people in the legal space talk about
- Problem of binaries: the techie/fuzzy divide (techies are engineers, and fuzzies are history and literature folks) that separates the intellectual community on a campus
- There is indeed a gendered aspect that exists in many research spaces, and as a research community we should think about how we manage these privileges and power dynamics
- From the author, on what led her to draft the post:
- The motivation comes from a lack of full understanding of what Computational Humanities actually stands for
- In many languages this does not exist as a field
- She noticed that some jobs in the humanities are offered to computer scientists because they can do machine learning and a qualified humanities specialist might not
- Many conferences only highlight computational visualisation and focus much less on the humanities
- Women and men will have the same chances of getting selected if they work on the same topic, but what if a field is also gender biased and that's the field that gets more focus?
- Privilege hazard: when you have privilege you don't see the problem
- Often people get offended by others pointing out a lack of privilege. They are afraid to speak up, and therefore having a safe space is useful.
- ![](https://i.imgur.com/nTqIN1y.jpg)
- How power dynamics influence the way we do research
- When working in science there is an attitude of "verificationism": people who don't have the same lived experience want to understand what others are talking about (putting yourself in the shoes of other genders). This causes frustration on both sides of the debate
- People come for a value but stay for the ethos
- How is DH formed, and what aspects are being considered?
- Why is a separate community being formed?
- Melvin: I think it's not a clear separation, in our view it's more like a special interest / subcommunity within the larger community
- James: Glen, it may also be different in the different regionalities of DH, where there are different focuses.
- James: Sarah, maybe the perception of marginalisation is something we all have but to more/lesser degrees?

### Some comments from the chat

- From James Cummings to Everyone: (4:32 pm): I'm not sure women have the same chance of getting accepted to a conference. At least that isn't what the statistics over a long period seem to show. There is a PI-goes-to-the-conference, and then far too many of those are still men.
- From Ismael to Everyone: (4:34 pm): Having zero background in computational or digital humanities, I only learned the term when I saw this discussion advertised! I am happy to see the definitions are vague - I have a feeling that defining what either one is (or clarifying that they are the same) could be a first step (setting aside the enormous social background through which all concepts, names, etc. are interpreted for a moment)
- From Jez Cope to Everyone: (4:35 pm): My gut reaction is "we need to investigate this more" too, but I'm also aware that attitude tends to perpetuate the status quo, both because you don't have to change anything until you've investigated, but also there's a danger of confirmation bias
- From James Cummings to Everyone: (4:36 pm): Ismael: There is a long history (and publications like 'Defining DH') on what is or isn't DH. I've learned from experience when someone claims they are doing DH it isn't my place to say whether that is real 'DH' or not. ;-)
- From quinn dombrowski to Everyone: (4:37 pm): On representation in conference acceptance, there's this paper on DH (through 2015) which suggests underrepresentation https://scottbot.net/representation-at-digital-humanities-conferences-2000-2015/
- Sarah: Thanks! I think this is a good summary of where I wanted to go with the post - I get how continuous feminism debates can be somewhat annoying to men, but it's just like Laura said - if you have the privilege, you can't just "see" the perspective of those who don't
- From quinn dombrowski to Everyone: (4:42 pm) If the stats reflect fewer women submitting, isn't that a problem too?
- From Arianna Ciula to Everyone: (4:42 pm) My reactions: names ARE important, names often mean identity especially at certain stages in life (e.g. early career); society has problems with diversity (just look at figures on salaries across sectors); DH/RSE/Computational Humanities are right to question/problematise names and question bias/problematise reifications of societal bias/problems; however you would assume we had figured out by now that instrumental and intellectual are entangled - if it isn’t this community who can articulate it best, who else?
- James: Yeah, we are way past having to ask women to prove they don't feel comfortable/experience misogyny. Just believe them. (But also, forgive us privileged if we forget EDI sometimes. Prod us when we do.)
- From James Cummings to Everyone: (5:02 pm): (Hopefully we'll eventually get to a place where we don't need prodding, it is normal.)
- From Melvin Wevers to Everyone: (5:03 pm): @sarah, I see how in the grant-world, traditional hum is threatened by computational approaches. But having a community dealing with issues related to communities is not necessarily set up as something that invalidates this field of scholarship, communities = computation
- From James Smithies to Everyone: (5:03 pm): Thanks to the organisers and everyone who shared their thoughts - really valuable for me.

:mag: Main arguments from the discussion
---

-

:closed_book: Closing remarks/questions/topics (for future discussions!)
--

-

### Additional Drafted Notes

<!-- Other important details discussed during the meeting can be entered here.
-->
-

## Notes: 04 March, 2020

- **Hosts:**
    - Fede, Leontien

### Topic

- [Data-driven publications in the Humanities](https://docs.google.com/presentation/d/13nPK5f9Z6wEwOkjbNfLQI4WZ1cRJ9HfDcl6MmmuaJtY/edit?usp=sharing) - comments are open

**Participants (write your names below)**

- Malvika, Kasra, Laura, Katie, DanVan, Kaspar, Mariona, Giorgia Occhini, Tim, Amy

:dart: Discussion Goal
---

- Discussing the impact of data-driven research on the overall debate concerning methodology in the humanities

:books: Works mentioned during the discussion
---

- Gregory Crane, [What Do You Do with a Million Books?](http://www.dlib.org/dlib/march06/crane/03crane.html), 2006
- [The Culturomics paper](http://www.culturomics.org/), 2010
- Dan Cohen, [Initial Thoughts on the Google Books Ngram Viewer and Datasets](https://dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/ "https://dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/"), 2010
- Anthony Grafton, [Loneliness and Freedom](https://www.historians.org/publications-and-directories/perspectives-on-history/march-2011/loneliness-and-freedom), 2011
- Cameron Blevins, [Topic Modeling Martha Ballard's Diary](http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/), 2010
- Matthew Jockers and Annie Swafford discussion around the Syuzhet package, 2015. Starting points: [1](http://www.matthewjockers.net/2015/02/02/syuzhet/) and [2](https://annieswafford.wordpress.com/2015/03/02/syuzhet/)
- Scott Weingart, [“Digital History” Can Never Be New](https://scottbot.net/digital-history-can-never-be-new/), 2016

:mag: Main arguments from the discussion
---

- The different role that examples play in historical research compared to other disciplines (especially the social and natural sciences).
In the first, they are presented as evidence to sustain a specific narrative, while in the others they offer insights into a quantified property of the analysed data.
- Initially, a distant reading method like topic modeling was used for browsing and visualizing the collection, not for deriving evidence (however, the distinction is thin)
- History deals with questions that often cannot be answered using big data and quantitative approaches. For instance "how" questions, rather than "what".

:closed_book: Closing remarks/questions/topics (for future discussions!)
--

- The role of private companies in digitizing and making available collections (ethical, copyright and accessibility issues)
- Non-domain experts doing research in the humanities (as well as in biology, medicine, psychology) because they know how to work at scale
- The difference between discovery and justification in the Humanities (starting from [Trevor Owens, 2012](http://www.trevorowens.org/2012/11/discovery-and-justification-are-different-notes-on-sciencing-the-humanities/))
- How different disciplines answer "why" questions, and whether this is changing with the advent of data science.

### Additional Drafted Notes

<!-- Other important details discussed during the meeting can be entered here. -->

- Previous experience working/playing with tools for working with large datasets without any goal or questions in mind
- How can playing around be turned into actual research: making sense of the outcome?
- What happens when you find something you did not expect? E.g. a sentiment analysis of Dorian Gray that ends up showing that the book is sad in the first part and happy towards the end.

Is data exploration used in other fields?
- Linguistics researchers working on oral tradition: can create maps and names
- Engineers: hypothesis generation does not expect surprises, but going forward, exploration can surprise you, and you can then support or reject these surprises
- Historians go to the archive with some question to select the collections to look at; then they can lean on serendipity, which can lead to the crystallisation of bigger new questions
- In bioinformatics we can start with a set of data, e.g. multiple cancer sequencing data (transcriptomics, metagenomics, metabolomics, proteomics), and we can study patterns and draw conclusions about what kind of cancer they are, what causes them, and which are the genes or the drug targets. Serendipity and surprises are basically everything, but larger datasets allow us to separate noise from the actual signal in the data.
- Conclusion: it's hard to ignore surprises and not ask more questions when they appear as a side effect of an original question
- Dan Cohen's reaction to n-grams: are trends derived from big data historical evidence, or do we just want to search and learn?
- Human rights critical theory: this approach allows them to look at the latest status (what it is) and then go back to looking into data to see how it started.
- Social sciences: hypothesis generation in the social sciences is based on assumptions derived by a specific group of people working on selected cases and examples; it changes with people, their environment and cases/examples
- Engineer: we need tools for exploration and other tools for trends; trends can allow us to avoid averaging out (overfitting or underfitting) of observations.
- When you are working with millions of articles, you will find an article that matches your ideal observation
- Improving how we use methods to go from distant to close reading or vice versa: avoiding cherry-picking of observations by using computational methods that help avoid these biases
- Do computational approaches to history help historians ask new questions, or provide new methods to explore old questions?
- In other fields, general observations allow future predictions, e.g. for conflicts, infections
- It also depends on the relevance in our community, e.g. coronavirus vs infection in general
- Some questions can't be addressed with close reading because they are about trends (and vice versa)

---

## Notes: 05 February, 2020

- **Hosts:**
    - Fede, Leontien

### Topic

- [Data Science Tutorials & Humanities Scholars](https://docs.google.com/presentation/d/1ZfY0_GyYBkRyvkrCt_7hJhShFNdYGaRYQUsKpizUBAY/edit?usp=sharing)

**Participants (write your names below)**

- Malvika, Kasra, Katie, Daniel W, DanVan, Dave, Kaspar, Mariona, Olivia ...

:books: References
---

- The Programming Historian: https://programminghistorian.org/
-

:dart: Discussion Goal
---

- The focus will be on the benefits and drawbacks of tutorials enabling humanities scholars to easily use data science methods.

:mag: Main arguments from the discussion
---

1. Tutorials are useful for very specific tasks, to learn what a tool could/should do, not necessarily to learn data science. They are a first step into the field, but from the discussion it became apparent that it is easier to learn data science from books, courses or internships.
2. Different methodological frameworks between science and humanities education. We talked about whether data science will ever become part of humanities curricula, given the demand from students who see it as necessary for entering the job market.
We had a comparison with the training in biology, a science that (partly) relies on qualitative methods.
3. Tutorials often don’t have an interactive component (compared to working in groups). This leads to less of a community feeling; it is also unclear how reliable they generally are. The Programming Historian addresses many of these issues, with peer-review, frequent updates and an active Twitter community.

:closed_book: Closing thoughts
--

-

### Additional Drafted Notes

<!-- Other important details discussed during the meeting can be entered here. -->

- None for this meeting.

# Template
--

## Notes: dd Month, yyyy

## Topic

-

## Aim of this meeting

-

## Volunteer to take notes

-
-
-
-
-

## Participants

**Participants (write your names below)**

*Name / Institute / What's your experience with preprints?* (answer in ONE short sentence)

-
-
-
-
-

:dart: Discussion Goal
---

-

**2 minutes silent note-taking: personal reflection**

"Add '+1' (plus 1) next to a statement that you agree with and would like to discuss"

*Name / response*

-
-

:books: Reference and other works mentioned during the discussion
---

*Please add link and reference, any work that has been discussed and mentioned*

-

:mag: Main arguments from the discussion
---

*anyone can help taking notes.*

-

:closed_book: Closing thoughts
--

-

### Additional Drafted Notes

<!-- Other important details discussed during the meeting can be entered here. -->

-