# Archive
---
### Other relevant data services
- SSCWeb & CDAWeb
- through HAPI http://hapi-server.org/servers/ (raw access sketched after this list)
- Cluster?
- CDAWeb http://hapi-server.org/servers/#server=CDAWeb&dataset=C1_CP_FGM_SPIN
- AMDA http://hapi-server.org/servers/#server=AMDA&dataset=clust1-fgm-prp
- CAIO http://hapi-server.org/servers-dev/#server=CAIO&dataset=C1_CP_FGM_SPIN
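For reference, a HAPI server can also be queried directly over HTTP. A minimal sketch with `curl` against the CDAWeb endpoint (using HAPI 2.x request parameters; the hapiclient Python package wraps this more conveniently):
```
SERVER="https://cdaweb.gsfc.nasa.gov/hapi"
curl "$SERVER/catalog"                  # list the available datasets
curl "$SERVER/info?id=C1_CP_FGM_SPIN"   # describe one dataset
# fetch one day of data (CSV is the default output format)
curl "$SERVER/data?id=C1_CP_FGM_SPIN&time.min=2004-01-01T00:00:00Z&time.max=2004-01-02T00:00:00Z"
```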
### Similar/relevant projects
- Panhelio: https://heliocloud.org/
- Resen: https://github.com/EarthCubeInGeo
- Pangeo: https://pangeo.io/
...
- ESA Datalabs
- ESA EO Cloud https://openeo.cloud/
- SERPENTINE (e.g. solarmach)
- https://serpentine-h2020.eu/hub/
- https://github.com/serpentine-h2020/serpentine
- https://github.com/jgieseler/solarmach
---
## Python packages
### Swarm related
See https://notebooks.vires.services/docs/vre-details.html#swarm-community-software for the list of packages in the VRE
- Good (but could use help with maintenance):
- PyAMPS https://github.com/klaundal/pyAMPS/
- ChaosMagPy https://github.com/ancklo/ChaosMagPy/
- To review:
- ibpmodel - ionospheric bubble probability - https://igit.iap-kborn.de/iw01/ibp-model/ (formerly at https://git.gfz-potsdam.de/rother/ibp-model)
- SwarmFACE, Blagau et al.: https://github.com/ablagau/SwarmFACE
- Needs work:
- SwarmPyFAC https://github.com/Swarm-DISC/SwarmPyFAC
Of limited use, as it only reproduces exactly what is already given by the FAC product. TODO: Collect this functionality into a reimplemented version within SwarmPAL
- Upcoming:
- What should be a new package and what could be built within SwarmPAL?
- Refer to SwarmPAL [Contributory Projects](/34PMux-dSVadL0K_D0rvqQ)
### Swarm adjacent
- Currently installed in VRE:
- MagPySV https://github.com/gracecox/MagPySV (needs maintenance)
- ApexPy, SpacePy from PyHC - TODO: expand supported packages in VRE
See https://heliopython.org (PyHC) for packages across heliophysics. The software landscape in geophysics is a bit more complex, but one avenue to explore is https://github.com/fatiando/, or others from https://github.com/softwareunderground/awesome-open-geoscience
---
See [VirES & VRE evolution](/Vo0wku3wRIWkiObmrUYBIw) for more
# Development environments
## The Problem
We must produce and test software so that it works on the [VRE system](https://swarm.magneticearth.org/docs/vre-details.html), but can also be easily deployed in other environments. This is complicated because the VRE is updated periodically (~every 6 months), so the target environment is changing, while the many community packages used are also changing. At minimum, a Swarm Python package or notebook collection should be automatically tested against the latest VRE version and maintained to keep working with it. It should also contain a minimal environment specification (containing only the packages required), used both for development work and for users to install elsewhere.
To keep ahead of the curve, the software will be automatically tested against newer versions of its dependencies. If an upgrade of a dependency causes a problem that cannot quickly be solved with a maintenance update of our software before the next VRE update, we can intentionally hold back the version of that dependency in the VRE specification. Correct usage of semantic versioning in the package metadata should mitigate this, but we need to pay attention so that Swarm packages do not hold back necessary updates to the VRE.
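As a sketch of what an intentionally held-back dependency could look like in a minimal environment specification (the pin shown is hypothetical):
```
# write a minimal spec and build the environment from scratch with mamba
cat > environment.yml <<'EOF'
name: swarm-dev
channels:
  - conda-forge
dependencies:
  - python=3.10
  - jupyterlab
  - xarray <2023.9   # hypothetical pin: held back until dependents are updated
  - pip
  - pip:
      - viresclient
EOF
mamba env create --file environment.yml
```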
The Docker image used in the VRE (which is not publicly accessible right now, sorry) is based on [scipy-notebook](https://github.com/jupyter/docker-stacks/tree/master/scipy-notebook). Inside is a conda environment containing Jupyter and all the other libraries in use. Most of the software is obtained from conda or pip without modification, but a minority is installed by script to fix some portability issues with the libraries. I am planning to provide VRE-compatible environment specifications and recipes at [gh/smithara/containers](https://github.com/smithara/containers/) (work in progress) for developers & users who want to work locally with a shared environment containing many toolboxes, but each project repository should contain its own minimally-required specification for development and testing.
Developers can use any tools they want to create a compatible environment on their machine; problems that arise should be caught by the CI strategy (i.e. running tests on GitHub Actions). These pages are recommendations to help us arrive at a common understanding and to provide instructions for getting set up with something that should work!
---
For sustainability, Swarm software should embed well within the [Python in Heliophysics Community (PyHC)](https://heliopython.org/) - this means we should encourage use of (and contribution to) the PyHC core packages, and follow their [standards](https://github.com/heliophysicsPy/standards/blob/main/standards.md). The development workflow planned here is compliant with the PyHC standards.
Note that other communities exist (e.g. [Fatiando](https://www.fatiando.org/)), which may also be relevant in some cases.
# Tools & Technologies
## pip
The Python package manager. If possible, we should create packages that build "simply" as Python wheels and can be installed on any platform using pip. Such packages also install fine within a conda environment, but be careful how that changes the environment, as conda and pip do not talk to each other.
Ideally, dependencies for a project can be acquired with pip as well as conda: this means that where conda is not desired, the setup using pip (within venv or similar) should also be simple. For a more complete development environment, or where there are non-Python dependencies, it is nice to use conda.
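A minimal sketch of the pip-only route (using viresclient as the example package):
```
# create an isolated environment and install from PyPI wheels
python -m venv .venv
source .venv/bin/activate
pip install viresclient
```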
Can use https://github.com/conda-incubator/grayskull to generate conda-forge recipes from PyPI packages, e.g. as was done for viresclient: https://github.com/conda-forge/staged-recipes/pull/20918
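A sketch of that workflow (grayskull writes the recipe into a folder named after the package):
```
pip install grayskull
grayskull pypi viresclient   # generates ./viresclient/meta.yaml, ready to adapt for conda-forge
```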
## [conda](https://docs.conda.io/en/latest/)
- Purpose:
- Manage environments containing packages from many languages (not just Python) - required for non-Python dependencies
- Easy installation of many libraries across platforms (Linux/Mac/Windows)
- Problems:
- conda packages are not always good for HPC systems
- environment management is complex (be careful with conda+pip configurations)
- `conda` can become very slow -> use [`mamba`](https://mamba.readthedocs.io/); and build environments from scratch from an environment.yml or conda-lock file, instead of layering in new packages over time with `install` (this also prevents breaking your environment when a newly added package scrambles the versions of those already installed)
- packages are still platform-dependent so exact environment is not portable between systems (though the popular libraries generally work well)
Recommendation: Install conda/[mamba](https://mamba.readthedocs.io/) to manage your development environment, using a specific `environment.yml` based on the package versions used within the VRE. If on Windows, use this within WSL. If on Mac, use the Mac conda packages, or use Docker?
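A sketch of creating such an environment, optionally locking the exact solve with conda-lock for reproducibility (environment and file names here are placeholders):
```
# build the environment from scratch from the spec file
mamba env create --file environment.yml --name swarm-dev
# optionally: pin the exact solved versions with conda-lock
pip install conda-lock
conda-lock --file environment.yml --platform linux-64   # writes conda-lock.yml
conda-lock install --name swarm-dev-locked conda-lock.yml
```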
## Docker / podman
- Purpose:
- Can create totally portable environment
- Problems:
- More complex to use
- Most HPC systems do not support it (though some have Singularity)
This can be used on the CI side but it is maybe not necessary for local use (as long as the conda environments are sufficient).
## IDEs: VSCode, Pycharm, (Jupyter)
*TODO*
---
## Testing and Automation
- nice intro to the point of using pytest, tox, and GH Actions: https://www.youtube.com/watch?v=DhUpxWjOhME
- more specific guidance here: https://scikit-hep.org/developer/tasks (a minimal local sketch of the workflow follows below)
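As a minimal local sketch of that workflow (assuming the package defines a `test` extra and keeps its tests in `tests/`):
```
pip install -e ".[test]"   # editable install, pulling in the test dependencies
pytest                     # run the test suite
pip install tox && tox     # or run the full matrix defined in tox.ini
```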
---
# Using Windows (WSL)
:::info
These steps describe how to set up a (conda/mamba)-based Jupyter development environment on Windows using WSL (["Windows Subsystem for Linux"](https://docs.microsoft.com/en-gb/windows/wsl/)).
:::
Why WSL? Some scientific libraries are difficult to install on Windows and your exact environment wouldn't be reproducible on other systems - WSL lets us bypass this problem by running all the Linux software directly. It also means that your development environment is isolated from whatever quirks you might have in your Windows installation.
Why conda? Python environments are complex - conda helps us to manage the dependencies (particularly non-Python dependencies) and separate your development environment(s) from the base system and from each other. Doing this inside WSL means that you use the Linux conda packages instead of the Windows ones.
Why mamba? The conda environment/package manager is very slow when working with the `conda-forge` package channel because there are so many packages to search. [mamba](https://mamba.readthedocs.io/) is a drop-in replacement for conda which performs much faster. mamba can be installed within an existing base conda environment, but these steps will show how to install mambaforge: an alternative to miniconda (the only difference is that mamba and conda-forge are configured already).
TODO:
- extend this to cover using docker?
- steps for close-to identical environments in linux & mac?
- produce vre-compatible conda env or dockerfile?
- conda/mamba within windows directly: check https://dev.to/voodu/windows-terminal-conda-d3e to configure within Windows Terminal
- how to use Spyder? install it in Windows then somehow point its Python interpreter option to within WSL?
## Setup instructions
0. Using an up-to-date Windows 10 system...
1. Install *Windows Terminal* from the Microsoft Store
2. Open *Windows Terminal* as admin and install Ubuntu:
- Right click on *Windows Terminal* in the start menu and click `More/Run as administrator`
- Now install WSL with Ubuntu 20.04:
```
wsl --install -d Ubuntu-20.04
```
After a restart, the Linux system should install and you will be asked to create a username and password to use within that system.
3. Check Ubuntu is running okay and update it. Open *Windows Terminal* and run `bash` or `Ubuntu` to enter it and then:
```
sudo apt update && sudo apt upgrade --yes
```
The Linux filesystem is separate from the Windows system, but you can access the Windows files within `/mnt/c/`.
:::warning
Sometimes we found that Ubuntu could not access the internet using WSL2 (see [here](https://stackoverflow.com/q/62314789)). We worked around this by using WSL1 instead, but this may introduce a drop in performance (and other issues, like access to your GPU), so it is better avoided if you can fix the problem instead. To check the version of WSL being used:
`wsl -l -v`
To convert the already-installed Ubuntu-20.04 to use WSL1 instead of WSL2:
`wsl --set-version Ubuntu-20.04 1`
:::
4. Install mambaforge inside Ubuntu (https://github.com/conda-forge/miniforge#mambaforge)
```
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
```
Accept the defaults and answer `yes` to initializing Mambaforge. Once complete, exit and re-enter bash so that the base conda environment is active (your command prompt begins with `(base)`).
:::info
I suggest using mambaforge over conda/miniconda because:
1. the base environment is set up directly with [mamba](https://mamba.readthedocs.io) (a faster drop-in replacement for [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/index.html))
2. only the conda-forge channel is pre-configured, reducing the chance of broken environments caused by mixing with the anaconda channel.
:::
5. Check mamba is working and update it if you want
```
mamba update --all --yes
```
:::info
At this stage, you should have a working setup with mamba, so you can continue setting up a new environment using it, or follow the steps below to experiment with a minimal jupyter+vires setup
:::
6. Create a new conda environment with the packages we want (in this case, I call the environment *jupyter* and configure it with a minimal setup for viresclient and jupyterlab)
```
mamba create --name jupyter jupyterlab xarray jinja2=2 pytables tqdm pandas requests urllib3 attrs
mamba activate jupyter
pip install viresclient
```
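A quick sanity check that the new environment works:
```
python -c "import viresclient"   # no output means the import succeeded
pip show viresclient             # confirm the installed version
```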
TODO: best practices and tips with conda environments and ready-to-use environment files
7. Now we can run jupyterlab from that environment (pointing to our Windows home as the working directory), e.g.:
```
mamba activate jupyter
jupyter lab --no-browser --notebook-dir /mnt/c/Users/ash
```
(replace `ash` with your Windows username or replace with a specific path for your notebooks)
Open your lab by holding Ctrl and clicking the link shown, like `http://127.0.0.1:8888/lab?token=04fsj3u.....`
8. Automate the lab startup with a script. Create a file called `start_jupyter.bat` containing:
```
wt.exe bash -c "/home/ash/mambaforge/envs/jupyter/bin/jupyter lab --no-browser --notebook-dir /mnt/c/Users/ash"
```
Replace `jupyter` (in `envs/jupyter`) with the name of the environment you want, `/home/ash` with your home directory inside WSL, and `c/Users/ash` with the location of your notebooks
## Usage instructions
TODO
---
# conda environments
TODO: Guide for using conda & some demo environments at https://github.com/smithara/containers/ (basic commands sketched below)
* https://coderefinery.github.io/reproducible-research/dependencies/#conda
* https://the-turing-way.netlify.app/reproducible-research/renv/renv-package.html
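A sketch of the basic commands for working from an environment file (file name assumed to be `environment.yml`):
```
mamba env create --file environment.yml           # create the environment the first time
mamba env update --file environment.yml --prune   # re-sync it after editing the file
```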
## vscode setup hints

# SwarmPAL planning notes
Things to discuss
- Why should we centralise toolboxes into one repository & package? (SwarmX)
- Pros:
- Can take care of maintenance and all the CI etc in one place
- Hope: Easier to get funding from ESA for long-term support / large enough community base that volunteers solve it
- Dependency management is quite critical for a library that should be installable in many places including the VRE. This is a long-term issue for being able to keep using the library in new environments - the library *will* need to be updated over the years.
- Most toolbox environment requirements are very similar - we can better share the burden of maintaining those when they are in one place
- Easier to share functionalities between toolboxes (especially data i/o)
- More coherent for the end user
- Cons:
- More difficult to coordinate
- Less freedom for individual toolboxes
- Needs long term sustainable solution
- Should toolboxes be plugins instead of just submodules? I'm not familiar with how that works - too much extra complication at this stage.
- Why GitHub and public?
- It's where everyone is
- Easy to cross-reference issues in other projects, and other integration
- Excellent tooling for automation with GitHub Actions
- Be discoverable
- Stimulate community development under https://github.com/Swarm-DISC
- Easy for others to contribute (e.g. we might ask EOX to make performance improvements; get external code review)
- How do we mesh better with existing packages (https://heliopython.org/projects/)
- Should we instead be building a toolbox as part of one of those (pyHC: e.g. pySPEDAS, pysat, spacepy)? (No(?))
- Would need to add viresclient to their data sources (requiring a significant rework of how they handle things, as their data access is typically file-based rather than service-based... though there is a HAPI adapter in SPEDAS at least...)
- Difficult to figure out and coordinate
- But we risk fragmenting the landscape further
- Anyway: Swarm isn't just a heliophysics mission (are there other relevant communities like PyHC?), it is an Earth Explorer and has its own divergent needs
- Looking carefully to see what we can use from PyHC; how PyHC users can also use elements from swarmpal
- We need to figure out a workflow & contribution process for everyone to follow that is not too difficult
- A large part of a scientific library is use case examples, showing how to apply the low level tools available within the library. This can be realised as Jupyter notebooks, either:
- as part of the `./docs/` for the swarmpal package
- as a separate Jupyter Book [(like this one)](https://swarm-fac-exploration.magneticearth.org/):
- good for when there are other dependencies: shows how to blend libraries together
- easier for others to contribute to (tooling will come soon for VRE that makes this more user friendly)
## Dependencies
- What dependencies do we require? What other projects are relevant?
- trivial with pip:
- xarray, scipy, numpy, matplotlib
- viresclient
- trivial with conda, but with extra steps for pip:
- cartopy
- a bit awkward:
- spacepy (installation needs improvement)
- to investigate possibilities:
- [hvplot (&holoviz)?](https://hvplot.holoviz.org/)
- [pytplot?](https://pytplot.readthedocs.io/en/latest/introduction.html)
- [Kamodo?](https://ccmc.gsfc.nasa.gov/Kamodo/)
- [pyaurorax?](https://github.com/aurorax-space/pyaurorax)
- dependent on needed functionality:
- pyproj
- [apexpy](https://github.com/aburrell/apexpy)
- [chaosmagpy](https://github.com/ancklo/ChaosMagPy)
- [pyamps](https://github.com/klaundal/pyAMPS/)
- [eoxmagmod](https://github.com/ESA-VirES/MagneticModel/)
- [click](https://click.palletsprojects.com/) for easy CLI
- data access:
- hapiclient
- pooch
- Should probably split the dependencies up so that a user does not need to install all of them, e.g. (a sketch of the matching package metadata follows after this block):
```
pip install .          # only numpy & scipy (the core algorithms)
pip install ".[tfa]"   # + spacepy
pip install ".[io]"    # + viresclient, cdflib ...
pip install ".[viz]"   # + matplotlib, cartopy, hvplot ...
pip install ".[all]"   # everything
```
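A sketch of the matching package metadata, assuming a `pyproject.toml`-based build and that the package is called `swarmpal` (extras names taken from above):
```
# append the optional-dependency groups to the project metadata
cat >> pyproject.toml <<'EOF'
[project.optional-dependencies]
tfa = ["spacepy"]
io = ["viresclient", "cdflib"]
viz = ["matplotlib", "cartopy", "hvplot"]
all = ["swarmpal[tfa,io,viz]"]   # self-referencing extra: supported by recent pip
EOF
```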
- Hopefully we can make a pure Python package (no compiled extensions, e.g. C++), but if that becomes necessary, review how to handle the packaging (probably using [pybind11](https://github.com/pybind/pybind11)) and [automate the builds](https://scikit-hep.org/developer/gha_wheels)
---