---
title: Earth scientists first meeting
tags: meetings,earth science
---
## When?
Thursday, 24 November, 11:00–12:00 (CET)
## Who?
James, Rike, @maxulysse, Jonathan Bader, Kevin Styp-Rekowski, Florian Katerndahl, Fabian Lehmann
Part of https://fonda.hu-berlin.de/
## Agenda
- Introducing nf-core:
- best practice pipelines
  - Shared modules/subworkflows that can be used outside of nf-core as well
- Shared configs
  - Tooling around all of this to make it easy to use
- Guidelines/Requirements
- Current use:
  - Use pipelines to test computational hypotheses; at the moment not always focused on a specific domain
- Domain in earth science
- Overall question: feasibility of opening nf-core to earth sciences as discipline?
- Currently we are very bioinformatics focused, exploring opening up
- Inspired by:
- https://arxiv.org/abs/2210.08897
  - Cited nf-core pipelines -> possible candidates for testing whether nf-core can expand to earth science
- Discuss with them:
- What workflow managers are actually used?
  - Geomagnetism: none, currently not used at all. Manual Python/Fortran scripts, chained via SLURM.
  - HPC is used. Sharing of workflows and reproducibility are concerns
- Kubernetes
- How standardised are earth science pipelines etc. (very fragmented, or do people like sharing?)
  - Nothing is standardised (neither geomagnetism nor remote sensing); often no real workflow management system is used
- FORCE (remote sensing specific pipeline framework) https://davidfrantz.github.io/code/force/: wrapped into nextflow worked well (https://github.com/CRC-FONDA/FORCE2NXF-Rangeland)
- How familiar is the field with the command line, and how much do people want a GUI?
  - Varies by group: some prefer GUI interfaces (e.g. remote sensing), some are very experienced with low-level calls
- How much is open-source adopted in earth sciences?
- ELIXIR?
- mixed bag
- Is there already a similar concept/community in this field?
- If not what are things we can improve on our side?
- What is nf-core missing for earth science researchers to add their pipelines to nf-core?
  - One container to rule them all (FORCE); there might be challenges in splitting it up into smaller containers --> this could probably be circumvented by reusing the same container for all processes
- Proprietary software that is not publishable at the moment
- What types of data do they typically use (files? streaming?)
- Sizes of data? Typical database(?) sizes?
- Because will influence compatibility with github actions, test datasets etc.
  - TB to PB, depending on the group (varies widely)
- Are tools typically part of conda/containerised?
  - pip, conda; difficulties with different gcc versions
  - Some tools need to be pre-compiled or are proprietary (slowly transitioning to open software/open code; ESA is a major funder in the field and is pushing towards it)
- Similar challenges: code is not always shared and usable in papers
- Tangential question?
- https://arxiv.org/abs/2211.12076
- Cool! What was the test data? Publish the more optimised output?
  - Idea: share trace files voluntarily in a shared repo to allow benchmarking/optimising based on historical runs --> awesome idea, but needs a bigger infrastructure setup
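A minimal sketch of the "reuse one container for all processes" idea mentioned above, as a Nextflow config fragment. The image tag is an assumption for illustration (FORCE publishes Docker images, but the exact tag to pin would need checking):

```groovy
// nextflow.config -- sketch only; assumes a published FORCE image tag
// Setting one container at the process scope applies it to every process,
// which sidesteps having to split a monolithic container into per-tool images.
process {
    container = 'davidfrantz/force:latest' // assumption: replace with a pinned version
}
docker.enabled = true
```

For nf-core proper, per-module containers are the norm, so this would be a pragmatic interim pattern rather than the end goal.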
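The trace-sharing idea above builds on a feature Nextflow already has: every run can emit a trace file with per-task resource metrics. A config sketch (field list chosen for illustration):

```groovy
// nextflow.config -- enable trace output that could be collected in a shared repo
trace {
    enabled = true
    file    = 'pipeline_trace.txt'
    // per-task metrics useful for benchmarking/optimising future runs
    fields  = 'task_id,name,status,realtime,%cpu,peak_rss'
}
```

The same can be done ad hoc with `nextflow run ... -with-trace`; the open question from the meeting is the shared infrastructure for collecting and comparing these files across groups.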