# FAIR Data and Software at [TIB][fds]
Welcome to this workshop :-) We will use this Etherpad-like document to:
- share links and snippets of code,
- take notes,
- ask and answer questions, and
- whatever else comes to mind.
It's in the Markdown format, which you can learn in a few minutes on [commonmark.org/help][MD] for example.
[fds]: https://events.tib.eu/fair-data-software/
[MD]: http://commonmark.org/help/
## [Schedule, with hyperlinked lesson material](https://tibhannover.github.io/2018-07-09-FAIR-Data-and-Software/#schedule)
## Monday's discussion over lunch: What does “FAIR” mean for your domain of study / research / work or field of expertise?
### Group 1
FAIR data:
* Sample/patient identifiers must be unique
* Data must be available after funding runs dry
* Data must be described in some way, e.g. with ontologies and consistent units (all values in cm, or all in m)
* Documentation must be provided for anyone accessing the data
Discussion point: There is a lot of funding for repositories right now, which might make it difficult to find the data researchers are looking for. Two long-term solutions (?):
* European Open Science Cloud (EOSC) https://ec.europa.eu/research/openscience/index.cfm?pg=open-science-cloud
* Every discipline has a different solution. But should work off repositories.
Noted decentralized web... there is a project out of MIT led by Tim Berners-Lee that might be of interest https://solid.mit.edu/
### Group 2
- FAIR means lowering barriers and not "reinventing the wheel" in order to enable meaningful (novel) research
- standardized practices / terminology
- allow easier cross validation for e.g. statistical methods
- tracking of software changes/version control (especially for experimental work)
- implementation of research protocols
- reproducibility (to save time especially) / solutions like Docker to reproduce the environment in which the code/project was created
- access to (source) data without connections/a network
- FAIR is relevant for the future of scientific publishing
- open access model
- DMPs and SMPs assist thorough, critical peer review
### Group 3
* Open Access (ArrayExpress EBI)
* reusable for other researchers to address specific and new questions
* Apply new meta-analyses
* Reproducibility of the analysis (create R-package, version control)
### Group 4
IR: FAIR for energy economics
Key focus: results should be understandable, reproducible and used for further research
(i) infancy stage (I feel) - researchers are not motivated enough
However, great initiatives and cooperations already exist: http://www.openmod-initiative.org/
(ii) papers that focus on similar questions have (sometimes) different results. With FAIR principles widely applied the situation should get better
(iii) now: open research is "cool/fair"; in 10 years: open research is a common (or almost required) practice
KB: FAIR for environmental modeling, planning and development of decision-support tools based on model data - interdisciplinary environmental studies:
- Important for data sharing and interpretation. Proper metadata documentation is of key importance for the interpretation and communication of results
- Access to FAIR data is an important resource for research in the field, since research projects often don't include funding for data collection and production
- Interdisciplinarity requires transferability of data
- There is low motivation and capacity in the community to properly share and document data and models, often due to lack of information and incentives
- I learned that FAIR principles are not a standard, but perhaps they should become at least a good practice example for the community to move forward
- DMPs are often required in project development.
TD: FAIR principles should be used to compare results in the community. Therefore, proper data management is necessary; in particular, metadata should be available to other scientists in the community. Documentation of this metadata and of the methods used for processing the data is necessary as well.
Helpful links for open publishing:
https://ask-open-science.org/
-> can usually suggest some funding sources
open-science@lists.okfn.org
-> are always willing to help with anything open science related :)
Frontiers has a fee-waiver program for authors unable to pay publication fees - see ‘Fee support for authors’ on this page https://www.frontiersin.org/about/publishing-fees
The Open Access Directory has a list of journals and funding opportunities: http://oad.simmons.edu/oadwiki/OA_publication_funds
For example, PeerJ is a nice journal with a different funding scheme - depending on number of authors, it might be cheaper.
### Group 5
Gene Expression Omnibus https://www.ncbi.nlm.nih.gov/geo/ - Common repo in discipline, kind of machine readable, so many different fields using these experiments -> metadata structures are broad w/ no standards -> importance of JSON/CSV as a common format, but what about the context?
Codemeta as a metadata file that follows the data and provides context
fdz.DZHW https://metadata.fdz.dzhw.eu/#!/de/search?page=1&type=studies
https://github.com/dzhw/metadatamanagement
Scrum in the devops team of the research data center, but hopefully in the research institution in the future as well: https://dzhw.github.io/agile-devops
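To make the Codemeta point above a bit more concrete, here is a minimal sketch of generating such a metadata record with Python. The `@context` and `@type` values and the property names are CodeMeta 2.0 terms; the helper `build_codemeta` and all project values (name, author, URLs) are hypothetical and only there for illustration.

```python
import json


def build_codemeta(name, version, description, authors, license_url, repository):
    """Return a dict laid out as a CodeMeta 2.0 JSON-LD record."""
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "version": version,
        "description": description,
        "license": license_url,
        "codeRepository": repository,
        # One Person entry per (given name, family name) pair.
        "author": [
            {"@type": "Person", "givenName": given, "familyName": family}
            for given, family in authors
        ],
    }


# Hypothetical project values, for illustration only.
metadata = build_codemeta(
    name="example-analysis",
    version="0.1.0",
    description="Scripts that accompany an example dataset.",
    authors=[("Ada", "Example")],
    license_url="https://spdx.org/licenses/MIT",
    repository="https://example.org/example-analysis",
)

# Serialize to the JSON text that would be stored next to the code or data.
codemeta_json = json.dumps(metadata, indent=2)
print(codemeta_json)
```

By convention the resulting text is saved as `codemeta.json` in the top-level folder of the project, so the context travels with the files it describes.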
### Group 6
Difficulty of documenting all working steps
What to share and what not? And when?
Full dataset is too big to share
Standards lacking: who has to create them? How can they be made acceptable? Where to get help to develop/contribute standards?
What is the benefit of doing data management?
A metadata structure cannot be valid forever
People know what data they have and what laws apply, especially with personal data
Data management allows data to survive projects and persons
We want to know who uses our data / scripts
## GitHub demo
- GitHub Desktop for Linux: https://github.com/shiftkey/desktop/releases
- Carpentry-style lesson: https://github.com/caltechlibrary/git-desktop
## Wikidata API Access
- Overview: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/de
- Python Tutorial: http://ramiro.org/notebook/us-presidents-causes-of-death/
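A minimal sketch of querying the Wikidata SPARQL endpoint from Python using only the standard library. The endpoint URL and the `query`/`format` parameters are those of the public query service; the example query (five items that are instances of "house cat"), the helper names and the `User-Agent` string are assumptions made for illustration, not part of the linked tutorial.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Example query: five items that are an instance of (P31) house cat (Q146).
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5
"""


def build_request(query, user_agent="fair-workshop-example/0.1"):
    """Build a GET request that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"User-Agent": user_agent},
    )


def run_query(query):
    """Send the query over the network and return the parsed JSON document."""
    with urllib.request.urlopen(build_request(query), timeout=30) as response:
        return json.load(response)


def extract_labels(results):
    """Pull the ?itemLabel strings out of a SPARQL JSON result document."""
    return [row["itemLabel"]["value"] for row in results["results"]["bindings"]]
```

Calling `extract_labels(run_query(QUERY))` performs the actual network request; the request building and the result parsing are kept separate so they can be tried out offline.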
## Cookiecutter template for Python projects
https://github.com/audreyr/cookiecutter-pypackage
## [SWCarpentry.GitHub.io/make-novice](http://swcarpentry.github.io/make-novice/)
## Python Packaging
- Tutorial on Python Packaging, "Packaging from start to finish for both PyPI and conda", (Scipy 2018 Tutorial): https://python-packaging-tutorial.readthedocs.io/en/latest/
- "Inside The Cheeseshop: How Python Packaging Works" (Scipy 2018 Talk): https://speakerdeck.com/di_codes/inside-the-cheeseshop-how-python-packaging-works
## Testing in Python
https://travis-ci.org/
in `.travis.yml`
```
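# Run the build in Travis CI's Python environment
language: python
# Python version(s) to test against
python:
- "3.6"
# Command that runs the test suite
script: pytest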
```
The testing lesson: https://katyhuff.github.io/python-testing/
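A minimal sketch of a test file that the `script: pytest` step above would pick up. pytest collects files named `test_*.py` and runs the functions in them whose names start with `test_`. The file name `test_mean.py` and the `mean` function are hypothetical examples. The last test uses plain `try`/`except` so that the file also runs without pytest installed; with pytest, `pytest.raises(ValueError)` would be the more usual way to write it.

```python
# test_mean.py (hypothetical file name)


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


def test_mean_of_integers():
    assert mean([1, 2, 3]) == 2


def test_mean_of_single_value():
    assert mean([5]) == 5


def test_mean_of_empty_list_raises():
    # An empty input has no mean, so the function is expected to refuse it.
    try:
        mean([])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for empty input")
```

In a real project the function would live in its own module and the test file would import it; both are kept in one file here to make the sketch self-contained.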
# R session
- [Session material](https://tibhannover.github.io/FAIR-R/)
- [Setup instructions](https://tibhannover.github.io/FAIR-R/setup/)
## Useful resources:
- [R Packages book (free and online)](http://r-pkgs.had.co.nz/)
- [R for Data Science book (online and free)](http://r4ds.had.co.nz/)
- [Advanced R book (online and free)](http://adv-r.had.co.nz/); for programmers familiar with another language
- [usethis package website](http://usethis.r-lib.org/)
## [PANGAEA example](https://raw.githubusercontent.com/TIBHannover/2018-07-09-FAIR-Data-and-Software/gh-pages/_episodes_rmd/FAIR-remix-PANGAEA.Rmd) (Download or copy into a `.Rmd` file)