# RSH 020: data preparation
## Intro
- Welcome to the session
- Introduce us
- Purpose of this session: Imagine if someone comes to us and says "I have to release some data". We'll go over what we would discuss with them.
- How RSH works:
- hackmd, will be public, ALWAYS at end
- questions anytime
- this is like a chat.
## Part 1: data management plans
- Why DMPs?
- What is wrong with historical process?
- I guess this came from the open science movement, pressuring you to
- Should a DMP be mostly copy-pasted from best practices, as a sort-of default?
- What should you actually do?
- First, learn the best way to handle your data
- Then, write a plan for funder
- Hopefully, use the plan. But as they say, plans are useless but planning is essential
- How do you typically do them (what do your university's instructions say?)
## Part 2: tidy data
- I have 2 slides as an example of this, maybe I can manage to create a slide-less example
- Discuss how to store tabular data
- excel
- standard formats: csv, numpy array, netcdf
- pandas
- Discuss problems with non-tidy data storage
- Tidy data: [H. Wickham, "Tidy Data"](http://vita.had.co.nz/papers/tidy-data.pdf)
- Advantages/disadvantages of tidy data
- Metadata
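
To make the tidy vs. non-tidy discussion concrete, here is a minimal sketch of reshaping a "wide" table (one column per year) into tidy form (one row per observation). The table contents, column names, and the `wide_to_tidy` helper are all made up for illustration; only the Python standard library is used.

```python
import csv
import io

# Hypothetical "wide" table: one row per site, one column per year.
# The year is a variable, but here it is hidden in the column headers.
WIDE_CSV = """site,2019,2020
A,10,12
B,7,9
"""


def wide_to_tidy(wide_text, id_column, variable_name, value_name):
    """Return one dict per observation (hypothetical helper, not a library API)."""
    reader = csv.DictReader(io.StringIO(wide_text))
    tidy_rows = []
    for row in reader:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row instead
            tidy_rows.append({
                id_column: row[id_column],
                variable_name: column,  # former column header becomes a value
                value_name: value,      # values stay strings, as read from CSV
            })
    return tidy_rows


tidy = wide_to_tidy(WIDE_CSV, "site", "year", "count")
for observation in tidy:
    print(observation)
```

In the tidy form each variable (`site`, `year`, `count`) has its own column and each observation its own row, so adding a new year means adding rows rather than changing the table's shape. If you already work in pandas, `DataFrame.melt` performs the same kind of reshaping.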
## Part 3: release
* Findable
* Does someone know your data exists only if they find your paper first? Or can they search directly for it?
* What's the minimum required to find it?
* General search engine? Domain-specific search that understands the concepts of the data?
* RD: NOMAD repository
* Accessible
* You can somehow request access to it
* It's in a place that won't disappear later
* In a good repository
* Or maybe, it's in some isolated permanent place, disconnected from a searchable repository
* re3data.org - find a repository
* Is your own website acceptable?
* Is github acceptable?
* Interoperable
* If someone can get value by combining it with other data, can they?
* This is mostly covered in 'preparing data for release'
* If there are no existing standards, you can't make it interoperable
* There may be standards, but they may not be common enough to be useful. It depends on your particular case.
* Or it may not be worth putting that much effort.
* Reusable
* Someone has permission to reuse it
* Licensing
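
The Findable and Reusable points both depend on the data carrying a machine-readable description and an explicit license. As a rough sketch of what that could look like, the snippet below builds a small metadata record and checks it before release. The field names, the `missing_fields` helper, and the example values are assumptions for illustration, not the schema of any particular repository.

```python
import json

# Hypothetical field names; a real repository will define its own schema.
REQUIRED_FIELDS = ("title", "creators", "description", "license")


def missing_fields(record):
    """List required fields that are absent or empty (hypothetical helper)."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Made-up example record describing a dataset.
record = {
    "title": "Example yearly counts per site",
    "creators": ["A. Researcher"],
    "description": "Counts per site and year, one row per observation.",
    "license": "CC-BY-4.0",  # an SPDX license identifier
    "keywords": ["example", "tidy data"],
}

problems = missing_fields(record)
sidecar_text = json.dumps(record, indent=2)
print(sidecar_text if not problems else f"Missing: {problems}")
```

Most repositories collect this information through their own upload form, so in practice a record like this is mainly useful as a checklist to fill in before you start the upload.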
## Next stream
What should we talk about in two weeks?
Possible topics: https://researchsoftwarehour.github.io/