---
title: Earth scientists first meeting
tags: meetings,earth science
---

## When?

Thursday, 24 November, 11:00 – 12:00 (CET)

## Who?

James, Rike, @maxulysse, Jonathan Bader, Kevin Styp-Rekowski, Florian Katerndahl, Fabian Lehmann

Part of https://fonda.hu-berlin.de/

## Agenda

- Introducing nf-core:
  - Best-practice pipelines
  - Shared modules/subworkflows that can be used outside of nf-core as well
  - Shared configs
  - Tooling around all of that to make it easy to use
  - Guidelines/requirements
- Current use:
  - Pipelines are used to test computational hypotheses, at the moment not always focused on a specific domain
  - Domain is earth science
- Overall question: feasibility of opening nf-core to earth sciences as a discipline?
  - Currently we are very bioinformatics-focused; exploring opening up
  - Inspired by:
    - https://arxiv.org/abs/2210.08897
    - It cites nf-core pipelines -> possible candidates for whether nf-core can expand to earth science
- Discuss with them:
  - Which workflow managers are actually used?
    - Geomagnetism: none; workflow managers are currently not used at all. Manual Python/Fortran scripts, chained via SLURM.
    - HPC is used. Sharing of workflows and reproducibility are concerns.
    - Kubernetes
  - How standardised are earth-science pipelines (very fragmented, or do people like sharing?)
    - Nothing is standardised (neither in geomagnetism nor in remote sensing); often no real workflow management system is used
    - FORCE (a remote-sensing-specific pipeline framework, https://davidfrantz.github.io/code/force/): wrapping it into Nextflow worked well (https://github.com/CRC-FONDA/FORCE2NXF-Rangeland)
  - How familiar is the field with the command line? How much do people want a GUI?
    - Varies by group: some prefer GUI interfaces (e.g. remote sensing), some are very experienced with low-level calls
  - How widely is open source adopted in earth sciences?
    - ELIXIR?
    - Mixed bag
  - Is there already a similar concept/community in this field?
    - If not, what are things we can improve on our side?
  - What is nf-core missing for earth-science researchers to add their pipelines to nf-core?
    - One container to rule them all (FORCE): splitting it up into smaller containers might be challenging --> this could probably be circumvented by reusing the same container for all processes (see the sketch at the end of these notes)
    - Proprietary software that is not publishable at the moment
  - What types of data are typically used (files? streaming?)
  - Sizes of data? Typical database sizes?
    - Relevant because it will influence compatibility with GitHub Actions, test datasets, etc.
    - TB to PB depending on the group (varies widely)
  - Are tools typically available via conda/containerised?
    - pip and conda are used; difficulties with different gcc versions
    - Some tools need to be pre-compiled, or are proprietary (the field is slowly transitioning to open software/open code; ESA is a major funder in the field and is pushing towards it)
    - Similar challenges to ours: code is not always shared and usable in papers
- Tangential question:
  - https://arxiv.org/abs/2211.12076
  - Cool! What was the test data? Publish the more optimised output?
  - Idea: voluntarily share trace files to a shared repo to allow benchmarking/optimising based on historical runs --> awesome idea; needs a bigger infrastructure setup (see the trace config sketch at the end of these notes)
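## Sketches

On the single-container point above: a minimal sketch of how one FORCE step could be wrapped as a Nextflow DSL2 process while every process reuses the same monolithic container, sidestepping the need to split it up. The image tag, the `force-higher-level` invocation, parameter name, and output layout are illustrative assumptions, not taken from FORCE2NXF-Rangeland.

```nextflow
// nextflow.config (excerpt): point every process at one shared image
// instead of per-tool containers. Image name/tag is an assumption;
// pin whatever FORCE release is actually used.
process.container = 'davidfrantz/force:latest'
docker.enabled    = true
```

```nextflow
// main.nf (excerpt): hypothetical wrapper around a single FORCE command.
nextflow.enable.dsl = 2

process FORCE_HIGHER_LEVEL {
    input:
    path param_file          // FORCE parameter file (assumed input)

    output:
    path 'output/**'         // assumes the parameter file writes below output/

    script:
    """
    mkdir -p output
    force-higher-level ${param_file}
    """
}

workflow {
    // params.param_file is a hypothetical pipeline parameter
    FORCE_HIGHER_LEVEL(Channel.fromPath(params.param_file))
}
```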
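On the trace-sharing idea: Nextflow can already emit a per-task trace report, which is the artifact such a shared repository would collect. A minimal sketch of the relevant `nextflow.config` scope; the file path and field selection are illustrative:

```nextflow
// nextflow.config (excerpt): emit a trace file for every run so it can be
// contributed to a common repository for benchmarking/optimisation.
trace {
    enabled   = true
    file      = 'pipeline_info/trace.txt'   // illustrative path
    overwrite = true
    // a subset of the available fields; see the Nextflow docs for the full list
    fields    = 'task_id,process,status,exit,realtime,cpus,%cpu,peak_rss'
}
```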