# RSH 020: data preparation

## Intro

- Welcome to the session
- Introduce us
- Purpose of this session: Imagine someone comes to us and says "I have to release some data". We'll go over what we would discuss with them.
- How RSH works:
  - HackMD, will be public, ALWAYS at end
  - questions anytime - this is like a chat.

## Part 1: data management plans

- Why DMPs?
  - What is wrong with the historical process?
  - I guess this came from the open science movement, pressuring you to
  - Should a DMP be fully copy-pasted from best practices, by sort-of default?
- What should you actually do?
  - First, learn the best way to handle your data
  - Then, write a plan for the funder
  - Hopefully, use the plan. But as they say, plans are useless but planning is essential
- How do you typically do them (what do your university's instructions say?)

## Part 2: tidy data

- I have 2 slides as an example of this, maybe I'll manage to create a slide-less example
- Discuss how to store tabular data
  - Excel
  - standard formats: CSV, NumPy array, NetCDF
  - pandas
- Discuss problems with non-tidy data storage
- Tidy data: [H. Wickham, "Tidy Data"](http://vita.had.co.nz/papers/tidy-data.pdf)
- Advantages/disadvantages of tidy data
- Metadata

## Part 3: release

* Findable
  * Does someone know your data exists only if they find your paper first? Or can they search directly for it?
  * What's the minimum required to find it?
  * General search engine? Domain-specific search that understands the concepts of the data?
  * RD: NOMAD repository
* Accessible
  * You can somehow request access to it
  * It's in a place that won't disappear later
    * In a good repository
    * Or maybe, it's in some isolated permanent place, disconnected from a searchable repository
  * re3data.org - find a repository
  * Is your own website acceptable?
  * Is GitHub acceptable?
* Interoperable
  * If someone can get value by combining it with other data, can they?
  * This is mostly covered in 'preparing data for release'
  * If there are no existing standards, you can't make it interoperable
  * There may be standards, but they may not be common enough to be useful. It depends on your particular case.
  * Or it may not be worth putting in that much effort.
* Reusable
  * Someone has permission to reuse it
  * Licensing

## Next stream

What should we talk about in two weeks? Possible topics: https://researchsoftwarehour.github.io/
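
For the Part 2 discussion of tidy vs. non-tidy storage, here is a possible slide-less sketch, assuming pandas is available. The table, the column names (`site`, `year`, `temperature`) and the values are made up for illustration and do not come from the session. It reshapes a "wide" table (one column per year) into tidy form (one row per observation, one column per variable) with `DataFrame.melt`.

```python
import pandas as pd

# Hypothetical non-tidy table: the "year" variable is hidden in the column names.
wide = pd.DataFrame({
    "site": ["A", "B"],
    "2019": [1.2, 3.4],
    "2020": [1.5, 3.1],
})

# Tidy form: every row is one observation, every column is one variable.
tidy = wide.melt(id_vars="site", var_name="year", value_name="temperature")
tidy["year"] = tidy["year"].astype(int)

# Once tidy, grouping and filtering need no knowledge of the column layout.
mean_per_site = tidy.groupby("site")["temperature"].mean()

# Plain CSV text, suitable for depositing in a repository.
csv_text = tidy.to_csv(index=False)
print(csv_text)
```

One trade-off worth raising in the discussion: the tidy table repeats the `site` values and is longer, but adding a new year means adding rows rather than changing the set of columns.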