Peeter Tinits
---
tags: dh2022, juhend, nlib
---

Examples for DH2022 participants:
- https://osf.io/hbfmy/ - Central overview page
- https://osf.io/asmbu/ - Tinits dataset - Aur, Elekter, Hobu
- https://osf.io/d8e2c/ - Kaukonen dataset - Esinaine

# How to make a reproducible NLIB study

The Open Science movement has recently come to greatly influence the way research in the humanities and social sciences is done. For computational studies, such as those in digital humanities or computational social science, an ideal of reproducible research has been proposed: any study should strive to be as easily repeatable and transparent as possible. In practice, this means publishing code and data alongside the research manuscript. This can be hard work if done as an afterthought, but fairly easy if kept in mind during the project.

Research articles have often included the phrase "data is available upon [reasonable] request", written with good intentions. However, even with good intentions, software changes, hard drives crash and other engagements stop people from following through. A recent study found that only 6% of researchers were actually able to provide the data they had promised to share on request [https://www.nature.com/articles/d41586-022-01692-1], [https://royalsocietypublishing.org/doi/10.1098/rsos.210450]. Practically, it is fair to expect that if data is not published with the paper, it will be very difficult or impossible to gain access to it later on, and even in the best case it means a lot of additional work for the authors.

![](https://i.imgur.com/BVSeUXH.png)
Graph: Michener et al. 1997

We have compiled a small guide on how to structure your data so that it can easily be reused later, by you or anyone else, to reproduce the results and build on them in future research. In particular, we offer a way to make studies that use digitized materials from the National Library of Estonia reproducible and easy to share.

# 1 Collect your code and data

The first step in making a study easy to share and reproduce is simply to gather all the files in one place. In collaborations and longer projects, the conversation tends to spread across different channels. Not everything needs to be preserved, but when you later reconstruct what was done for a colleague, the materials may prove difficult to find (or may have become inaccessible). To avoid this, gather and keep all the relevant files in one place so that they can easily be shared if needed. Where relevant, you may want to keep some files separate from the shared set, but these too should sit in one place for in-group use. For online documents, such as Google documents or spreadsheets, make sure you keep a link around, and make a safe copy if you are concluding the project for a while.

## 1.1 Structure the files

Commonly, when doing data-intensive research on digital collections, you will have some relevant data files and some relevant processing files. These files should be clearly structured and clearly named.

For data files, we recommend any common file format, e.g. .csv, .tsv or .json, with a name that clearly describes the contents. Don't use 'data.csv', use 'sports_articles_metadata.csv'.

For script files, comment your code and add explanations within the code that describe what you do and why. This is for both you and your future readers: you are unlikely to remember exactly what you did once the project is completed, and good comments mean you won't need to reconstruct these steps again. .Rmd or .ipynb files offer nice ways to organize a script into text and code blocks, but any format can work.

The code files should also use clear names that indicate their contents. Don't use script1.R, script2.R, but use preprocess_data.R, analyse.R. If the code is run in a particular sequence, it may help to number the files so that they are always in order, e.g. 1_preprocess.R, 2_analyse.R.
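
One caveat with numbered file names is that a plain alphabetical listing puts a tenth script before the second one. The following is a minimal Python sketch of ordering scripts by their numeric prefix; the function name `order_by_prefix` and the file `10_report.Rmd` are hypothetical and only serve the illustration.

```python
import re


def order_by_prefix(filenames):
    """Return the names that start with '<number>_', sorted by that number."""
    numbered = []
    for name in filenames:
        match = re.match(r"(\d+)_", name)
        if match:
            numbered.append((int(match.group(1)), name))
    # Sort numerically, so that 10_report.Rmd comes after 2_analyse.R.
    return [name for _, name in sorted(numbered)]


# Hypothetical contents of a code folder; files without a number are skipped.
scripts = ["10_report.Rmd", "2_analyse.R", "1_preprocess.R", "notes.txt"]
print(order_by_prefix(scripts))
```

An alternative that needs no code at all is to zero-pad the numbers (01_, 02_, ... 10_), which keeps alphabetical and numeric order identical.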

## 1.2 Use a simple directory structure

It is helpful to use a simple, standard directory structure for the files. One common way to structure data science projects is to use separate folders for 1) code files, 2) data files, 3) documentation and reports, 4) plots and visualizations. For larger projects, a more fine-grained structure can help (e.g. keeping raw data and test data in separate directories). An example structure is given below, but the details will depend on the particular project.

📦project
┣ 📂 code
┃ ┣ 📜 1_preprocess.Rmd
┃ ┗ 📜 2_analyse.Rmd
┣ 📂 data
┃ ┣ 📂 raw
┃ ┃ ┣ 📜 raw_corpus.txt
┃ ┃ ┗ 📜 raw_corpus_metadata.tsv
┃ ┣ 📂 processed
┃ ┃ ┣ 📜 testset_simple.tsv
┃ ┃ ┗ 📜 testset_detailed.tsv
┃ ┗ 📂 annotated
┃ ┃ ┣ 📜 testset_detailed_annots_UT.tsv
┃ ┃ ┗ 📜 testset_detailed_annots_CD.tsv
┣ 📂 docs
┃ ┣ 📜 logs.md
┃ ┣ 📜 blogpost.md
┃ ┗ 📜 article1.md
┣ 📂 plots
┃ ┣ 📜 results1.png
┃ ┗ 📜 results2.png
┣ 📜 LICENSE
┗ 📜 readme.md

Changing the directory structure will change your file paths, so make sure you keep them up to date. A good practice is to keep the working directory at the project root and refer to files from there. In R, the package 'here' can assist with this.

## 1.3 Store and download the data

When dealing with the DIGAR text data, the same query can give the next user the same dataset, but only if the raw dataset on the server is unchanged. If the dataset is compiled across general features (e.g. all Estonian newspapers in the 1920s) and new sources are later added to the collections, the query will no longer replicate your initial result exactly. For this reason it is necessary to store the data that your query returned as a separate file, to be included with the code and data.

If you use only open data, there is no issue in simply including the raw data in the set. If you use data that you accessed through a contract, then including the article ids or publication names can do part of the job.
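
As a minimal sketch of storing a query result together with a record of when and how it was obtained, the Python function below writes the rows to a TSV file and a small JSON manifest next to it. Everything specific here is an assumption made for illustration: the function name `save_query_snapshot`, the manifest fields, and the column names are hypothetical, and the rows are assumed to have already been downloaded as a list of dictionaries by whatever means the collection offers.

```python
import csv
import hashlib
import json
from datetime import date
from pathlib import Path


def save_query_snapshot(rows, fieldnames, out_path, query):
    """Write query results to a TSV file plus a small JSON manifest."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

    # The manifest records what was asked, when, and a checksum of the file,
    # so that a later rerun of the query can be compared against this snapshot.
    manifest = {
        "query": query,
        "retrieved_on": date.today().isoformat(),
        "n_rows": len(rows),
        "sha256": hashlib.sha256(out_path.read_bytes()).hexdigest(),
    }
    manifest_path = out_path.parent / (out_path.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

A call such as `save_query_snapshot(rows, ["article_id", "year"], "data/raw/query_result.tsv", "description of the query")` would then leave both the data and its manifest in the raw data folder. For contract-restricted data, the same idea could be applied to a file containing only the article ids.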

## 1.4 Zip what you can

If your study uses large data files, e.g. queries across the large corpus, it may become unwieldy to publish them in raw format. However, they are easy to compress: raw text files, for example, can be zipped to roughly 40% of their original size. Most data readers can decompress them on the fly, causing no noticeable delay in processing. See some options for R here - https://www.roelpeters.be/how-to-read-zip-files-in-r/ (e.g. the package vroom for tidyverse). An example code that also zips up the raw text files in an R notebook is available here.

## 1.5 Save images as separate files

When analysing data, the results are usually also given as a visual overview. While there are many ways to get a figure from the code into the manuscript (e.g. just exporting and pasting), the best way is to save the visualization into an image file, e.g. .png, .svg or .pdf, depending on which format you need. If needed, the file can then be linked into your manuscript or copied into a Word file; often it is added directly by the editors. This lets the image keep the same parameters and the same size across all platforms, and preserve its shape when you make minor adjustments. If you use vector graphics (e.g. .svg, .pdf) you will also be able to zoom into the graph without losing quality.

## 1.6 Store intermediary steps, all steps you checked

Often your data analysis workflow will include steps where the results are manually checked and annotated, e.g. when you create a clean test set of bicycle advertisements or annotate sentences for their grammatical forms. These are important creative steps and decisions, and it is good to keep a record of them in case you need to backtrack, or a colleague wants to follow up with another study. For transparent research, these should all be preserved along with the raw data and code. They may also serve as derived datasets that can be useful for future studies.
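
As a minimal sketch that combines this with the compression advice in section 1.4, an intermediate, manually checked dataset could be stored as a gzip-compressed TSV file and read back without unpacking it first. The function names and the example columns below are hypothetical; only Python's standard `csv` and `gzip` modules are used.

```python
import csv
import gzip


def write_tsv_gz(path, fieldnames, rows):
    """Write rows (a list of dicts) to a gzip-compressed TSV file."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


def read_tsv_gz(path):
    """Read a gzip-compressed TSV file back into a list of dicts."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, `write_tsv_gz("testset_annotated.tsv.gz", ["sentence_id", "label"], rows)` followed by `read_tsv_gz("testset_annotated.tsv.gz")` should return the same rows, with all values read back as strings.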

## 1.7 Rerun study

Once you have gathered all the files in one place and adjusted their locations and contents, it is important to check that your files still run. It is much easier for you to troubleshoot errors now than for a reader to do so later (a reader will likely give up quickly). Rerun the code files once more to make sure they work, even if you don't update any data files in the process.

# 2 Store and document the contents

## 2.1 Store your materials in a safe storage space for preservation

There are a number of ways to share files online, some more permanent than others. It is best to choose a storage location that is built to store scientific data, as these usually have frameworks in place to ensure the long-term preservation and accessibility of the contents. Personal websites and file storage services like Dropbox or OneDrive tend to move, get updated or change in structure, so that links that worked at the time of publication end up broken.

Some good options used for cultural heritage are [OSF](https://osf.io/) and [Zenodo](https://zenodo.org/), as well as an Estonian scientific infrastructure, [DataDOI](http://datadoi.ee/). We highly recommend OSF, which has a framework and funding in place to preserve the data for 50 years. These services also give the repository a DOI, a permanent Digital Object Identifier that will continue to link to the repository even if its hosting website changes.

Some examples on OSF built on the DIGAR collections can be seen here: https://osf.io/hbfmy/

## 2.2 Store the information on software versions used

The code used in scientific computing is subject to frequent updates and changes. It may therefore be important to know exactly which package versions were used to run your study. The latest version may work fine and give identical results, but in some cases the commands used may become obsolete, or bug fixes within packages may change the computation methods.

In R, it is possible to run sessionInfo() in the .Rmd file and copy the results, or to use the command packageVersion("packagename"). In Python, the session_info package offers an equivalent:

```
# Print the versions of the packages loaded in the current session
import session_info
session_info.show()
```

For more advanced ways of doing this, have a look at the packrat package https://rstudio.github.io/packrat/. Some more ideas on storing this information, and on citing the packages used, are given here https://ropensci.org/blog/2021/11/16/how-to-cite-r-and-r-packages/.

## 2.3 License

To be clear about what future users can do with your code, you need to include a license file in your scientific repository. This file usually contains a standard software license (e.g. MIT or GPL) and/or a license for textual materials (e.g. CC-BY). Get to know the different licenses here - https://help.osf.io/article/148-licensing or here https://help.figshare.com/article/what-is-the-most-appropriate-licence-for-my-data or here https://www.kent.ac.uk/guides/open-access-at-kent/choose-a-licence-for-your-research-works.

If you do not specify a license, a diligent user must conclude that the code is not open for use and that the author's copyright restricts it. In scientific computing, this may mean that a potential user moves on and does not use the data, just in case.

# 3 Don't wait until the end of the project

These steps can be very easy to follow if done during the project, but quite hard work if done afterwards. This is often the reason they never get done at all: there is rarely time for them once the project has concluded. It also means that your code and data will likely not be "available upon [reasonable] request", even if you meant to allow this.

## 3.1 Keep your code ready at all intermediate steps

Most first drafts never get updated and cleaned up, because the work needed to do so grows with the time that has passed since the code was written.
Sometimes your code needs outside input that becomes impossible to get, or the project member responsible for writing a part of it is not available for comment at the time you need them. Clean up your code at every intermediate step of the work, whenever something worth preserving is completed. Do not wait until the end of the project.

![](https://i.imgur.com/TUzjOZ9.png)
Quotes from Minocher et al. 2021 https://royalsocietypublishing.org/doi/10.1098/rsos.210450#d1e1219

## 3.2 Develop the materials with an end goal in mind

When working on scientific projects, a clear overview of authorship and of what the project aims to achieve greatly reduces possible tensions arising from differing ideas. Does the project aim to have open data? Then open data practices are very helpful to use from the beginning. Does the project involve a computational aspect and a theoretical aspect done by different people? It may be useful to clarify the order of authors and the intended publication venues, as authors with different backgrounds may have different ideas here (e.g. some disciplines use alphabetic authorship, some strongly emphasise first authorship, some allow for "equal authorship", and some even suggest more complex frameworks, e.g. CRediT https://credit.niso.org/). As tasks emerge during the project, it will then be clearer who should do them and for what aims.

## 3.3 Keep contact with the intended audience

You may be doing this research for a particular discipline or societal group, and they may be interested in particular aspects of your data. Computational researchers may want to know about statistical distributions in your data, social scientists may want to know whether your data is representative for the question at hand, and humanities scholars may want to know whether this is really new information. Keep these expectations in mind when developing the project and make sure these bases are covered.

Also think about whether and how you want to share this with the general public. E.g. if you use visual materials, it may help to create specific ones for popular science purposes, and the easiest time to do that is during the project.
