---
tags: icenet
---
Welcome 👋
# IceNet Project
- This shared document: https://hackmd.io/@climate/icenet
## Project related links
* original idea: https://github.com/usegalaxy-eu/project-ideas/issues/35
* Notebook: https://edsbook.org/notebooks/gallery/ac327c3a-5264-40a2-8c6e-1e8d7c4b37ef/notebook.html
* Planemo tutorial: https://planemo.readthedocs.io/en/latest/writing.html
* Tool schema: https://docs.galaxyproject.org/en/latest/dev/schema.html
* `<configfiles>` could be useful to create the notebook code on the fly: https://docs.galaxyproject.org/en/latest/dev/schema.html#tool-configfiles-configfile
* Paper: https://www.nature.com/articles/s41467-021-25257-4
* Galaxy IceNet, https://github.com/vanessa-tamara/galaxy-tools/tree/prepare-tools-for-icenet/tools/regrid
## The Team
*Name, Institution, location*
- Anne Fouilloux, Simula Research Laboratory, Oslo, Norway
- Vanessa Tamara, University of Freiburg, Germany
- Bjoern Gruening, Galaxy Hub, Germany
- Alejandro Coca-Castro, the Alan Turing Institute
- Jean Iaquinta, IT Department, University of Oslo, Norway
## Meeting, Every Two Weeks
- [x] Kick-off, 12th January
- [x] Meeting 1, Friday 27th January
- [x] Meeting 2, Friday 10th February
- [x] Meeting 3, Friday 3rd March
- [x] Meeting 4, Friday 10th March
- [x] Meeting 5, Friday 24th March
- [x] Meeting 6, Tuesday 11th April
- [x] Meeting 7, Tuesday 19th May
- [x] Meeting 8, Tuesday 25th May
- [x] Meeting 9, Tuesday 5th June
- [x] Meeting 10, Friday 16th June
- [x] Meeting 11, Friday 4th August
- [x] Meeting 12, Friday 18th August
## Kick-off
Thursday 12th January at 10:00 CET
**Agenda**:
- Get to know each other
- Overview of IceNet
- Project goals
- Showcase a workflow of seasonal ice forecasting using Pretrained Models
- Outputs: consultation with the main author
- Roadmap
### Notes
- Vanessa's skills in Python and Jupyter notebook, ok with both
- Galaxy, containerized fully annotated tool, graphical interface (input/output)
- Timeline: up to 6 months
### Roadmap
#### Overview

Some more details about what is available in Galaxy, and what needs to be developed:

#### Getting familiar
- [x] Galaxy
- [x] Jupyter Notebook
- [x] IceNet
- [x] Paper
- [x] Notebook
- [ ] Python package
##### Training useful for this step
To go through the Galaxy training material, register on Galaxy Europe instance: https://usegalaxy.eu
To get familiar with Galaxy, you can go through the training material in the [Introduction to Galaxy analyses](https://training.galaxyproject.org/training-material/topics/introduction/). For instance:
- [Introduction to Galaxy](https://training.galaxyproject.org/training-material/topics/introduction/tutorials/galaxy-intro-short/tutorial.html)
- [Galaxy 101 for everyone](https://training.galaxyproject.org/training-material/topics/introduction/tutorials/galaxy-intro-101-everyone/tutorial.html)
#### Parametrisation and creation of Galaxy Tool
- [x] Check key steps to parametrise (mandatory/preset) in XML according to target end-users
##### Training useful for this step
Learn about creating new Galaxy Tools:
* Planemo tutorial: https://planemo.readthedocs.io/en/latest/writing.html
* Tool schema: https://docs.galaxyproject.org/en/latest/dev/schema.html
* `<configfiles>` could be useful to create the notebook code on the fly: https://docs.galaxyproject.org/en/latest/dev/schema.html#tool-configfiles-configfile
There are other tutorials in the section "[Development in Galaxy](https://training.galaxyproject.org/training-material/topics/dev/)".
#### Deployment
- [ ] Publication of the Galaxy tool in Galaxy Tool Shed
- [ ] Installation on Galaxy Climate (Galaxy Europe)
##### Training useful for this step
#### Workflow showcasing sea-ice forecast
##### Training useful for this step
- [Extracting workflow from Galaxy histories](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/history-to-workflow/tutorial.html)
- [Automating Galaxy workflows using the command line](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/workflow-automation/tutorial.html)
## Some useful links and references
* IceNet
* Video
* [General, 4 min](https://www.youtube.com/watch?v=06fT46YklVY&ab_channel=TheAlanTuringInstitute)
* [Technical, 60 min](https://www.youtube.com/watch?v=JAKWhEU09Xo&ab_channel=OxfordMLandPhysicsSeminars)
* Galaxy
* Paper
* Open Science Communities
* Turing Way
* Pangeo
### Q & A
#### 3rd October 2023
Present: Anne, Jean, Vanessa
Add a summary section providing links to the repository, tools, workflow, history, etc.
Between the implementation and execution sections of the thesis.
A draft for the EGD presentation is there: https://docs.google.com/presentation/d/1cH5DLsWZomjhTk36aQdfAvB2FfxwG3E-i8anr8MvWPE/edit?usp=sharing
#### 28th September 2023
Present: Anne, Jean, Vanessa
The workflow now runs (there were typos in the tools due to copy/paste from Vanessa's code, but that has been fixed now)!
Vanessa is now adding instructions to assess the running time:
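```
import time

start = time.time()  # wall-clock time before the step being measured
...  # the code to be timed goes here
end = time.time()
print(end - start)  # elapsed time in seconds
```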
For the validation it should be possible to compare intermediate (.csv) files, but this is not a priority.
For now the most urgent task is to finish editing the thesis and have something in all the sections (if not, merge/delete the empty sections).
Next meeting Monday 2nd October 14:30-15:30.
#### 26th September 2023
Present: Anne, Jean, Vanessa
Still issues with the workflow: the output of the "Preprocess IceNET tool" is not listed as a possible input, perhaps because it is not of the type expected by the "IceNET forecast tool" (it is listed as an "npy" dataset in the history) -> declaring it as "binary" might work?
The script used to download the sea ice concentration was "cleaned" so it can be referenced in the thesis (old bits of code not used made it confusing).
An indication of the performance with Galaxy vs. Jupyter could be given, along with the benefits: using more models, getting data automatically, publishing dashboards (html)
Next meeting 28th 15:00-16:00
#### 22nd September 2023
Present: Anne, Jean, Vanessa
Older version of Vanessa's bachelor thesis: https://www.overleaf.com/2379276164qfdpcdfjfpqd
Current version: https://www.overleaf.com/7928186492myfmfzmmkvhx
There was an issue with the name of the file downloaded from Copernicus: it has to be called "download.nc"
Next meeting Tuesday 26th at 11:00
#### 19th September 2023
Present: Jean, Vanessa and Anne
Vanessa had issues:
* when installing the tools (with xarray, etc.): this may be linked to how Bjoern has installed the tool? - that now seems to work again.
* with the data uploaded by Bjoern in the Galaxy Data Library with "auto detect", probably because it was seen as HDF5 instead of NetCDF (https://zenodo.org/record/8328634). However, the problem remains: there must be something wrong with the data in the Data Library (all empty), and we should ask Bjoern to delete the old files and load them again.
The current workflow can be found at: https://usegalaxy.eu/u/vstoeckl/w/icenet
The thesis is still mainly in German and will have to be translated into English before we (Anne/Jean) can review it, otherwise Bjoern will have to do it since we cannot read German. Deadline to deliver it is 4th October.
#### 8th September 2023
Present: Jean, Vanessa and Anne
Vanessa made progress with her thesis (in German, so it needs to be translated into English).
For the workflow the climatologies are also needed (18MB for each variable): they are on Zenodo but it makes sense to have them in the data library.
Next meeting 18th September 15:00 to 16:00.
#### 28th August 2023
Present: Vanessa and Anne
Finalise workflow: still need one tool (concatenate).
Worked on the workflow inputs for users (parameterize Copernicus request with start_date and end_date)
Vanessa will share the workflow once finalised.
Thesis report: needs to focus on IceNet. The related work section can cover the IceNet paper and the IceNet notebook.
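
A minimal sketch of how `start_date` and `end_date` could be turned into the list of months to request. It only builds the year/month values; the function names and the per-year grouping are hypothetical, and the actual keys expected by the Copernicus tool still need to be checked.

```python
from datetime import date


def months_between(start_date: date, end_date: date) -> list[tuple[int, int]]:
    """Return every (year, month) from start_date to end_date, inclusive."""
    if end_date < start_date:
        raise ValueError("end_date must not be before start_date")
    months = []
    year, month = start_date.year, start_date.month
    while (year, month) <= (end_date.year, end_date.month):
        months.append((year, month))
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def months_per_year(start_date: date, end_date: date) -> dict[str, list[str]]:
    """Group the months by year as zero-padded strings (hypothetical layout)."""
    grouped: dict[str, list[str]] = {}
    for year, month in months_between(start_date, end_date):
        grouped.setdefault(str(year), []).append(f"{month:02d}")
    return grouped


example = months_per_year(date(2022, 11, 15), date(2023, 2, 1))
print(example)  # {'2022': ['11', '12'], '2023': ['01', '02']}
```

Grouping per year avoids asking for a full year-by-month grid when the period spans a year boundary, which would pull in months outside the requested range.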
#### 18th August 2023
Present: Alejandro, Vanessa and Anne
The Turing Way: https://the-turing-way.netlify.app/index.html
Vanessa started to write her Thesis.
Link to the thesis: https://www.overleaf.com/2379276164qfdpcdfjfpqd
Deadline for the thesis is end of September.
Is it part of Vanessa's thesis to write a github action to automate the download of input data (from CDS)? It could also be done afterwards.
#### 4th August 2023
Present: Anne, Jean, Vanessa
Vanessa had an incident with her computer, which had to be replaced, and hence she lost several days but could recover all the data.
Bjoern is in Australia for another 2 months at least.
For updating the data which has to be downloaded we will have to write a script and apply it regularly (possibly every year). Bjoern prefers to do that in the data library. Jean will have a similar issue with the FArLiG (Forecasting of Arctic Lichen browninG) use case and will experiment with that.
Vanessa started to work on her report in overleaf (she will share a link so we can view it). In the method section introduce Galaxy, then ICENET, the work done from the Jupyter, the data used, etc.
We meet again next week (Friday 11th August), then people will be on holiday but still able to have short meetings and read/comment on the report until September.
#### 16 June 2023
Present: Anne, Jean, Vanessa
For GCC Vanessa got an email for the poster, the deadline is the next Monday (26th) and a template was provided. It will be displayed on a screen (28cm x 50cm).
https://galaxyproject.org/events/gcc2023/abstracts
There is a #poster-chat channel on the GCC2023 conference Slack. This channel will be used by virtual attendees and allows asynchronous communication about the posters, so please join this Slack channel and upload your poster.
Here are examples of the posters we presented at EGU (https://drive.google.com/drive/folders/1pD-3M93M-9aER_O6_FA6O7JkOoNkacSm?usp=share_link, https://docs.google.com/presentation/d/1eyXMrGgpYsFYOKgz1lpoVMEtWVWyZVs6y7bhNhRlv0Y/edit#slide=id.g22d30fe875b_0_0), it does not mean that you have to follow the same format.
There can be links to external material, like Jupyter notebook, Research Object, Galaxy history with the workflow (address only, not the individual elements), video, etc.
Here is a link to generate QR code: https://www.qrcode-monkey.com for the external link(s)
The title of the poster submitted was IceNet@GalaxyClimate (#21)
Possible titles:
- A Galaxy workflow to forecast seasonal sea-ice concentration using Artificial Intelligence
- Pipeline ...
Issue with the file on Zenodo, which contains more than one variable: use something like `open_mfdataset("*.nc")`
in prepare_sic_data.py (https://github.com/tom-andersson/icenet-paper/blob/main/icenet/download_sic_data.py):
replace `open_dataarray` with `open_mfdataset`, then extract the relevant variable.
Next week: 26th 09-10
#### 5 June 2023
Jean downloaded the sea ice concentration and produced monthly means which are now on Zenodo.
However no mask is used there, and it is not clear what the mask is used for
See https://raw.githubusercontent.com/tom-andersson/icenet-paper/main/icenet/gen_masks.py
- These masks were already downloaded by Vanessa.
- https://github.com/vanessa-tamara/galaxy-tools/tree/icenet-visualization-tool/tools/visualize-icenet-forecast/masks
- so we should be able to apply them and generate the file called "siconca_EASE.nc"
We also need all the ERA5 variables
The user provides a date and then the rest is done automatically (calculation of the trends and anomalies).
We also need a separate workflow to download new sea-ice and ERA5 data (every month) and perhaps merge it with the already available data.
There is no support for the creation of collections, so we will have to keep one field per variable?
Next meeting: 16th June 16:00 CET
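
For the automatic calculation of anomalies mentioned above, here is a minimal sketch of the usual definition: the monthly value minus the mean of the same calendar month over a reference period. It works on a single point with made-up values; the function names and the reference years are assumptions, and the real tool would apply the same idea to whole gridded fields.

```python
from collections import defaultdict
from statistics import fmean


def monthly_climatology(series: dict[tuple[int, int], float],
                        first_year: int, last_year: int) -> dict[int, float]:
    """Mean value per calendar month over the reference years (inclusive)."""
    by_month = defaultdict(list)
    for (year, month), value in series.items():
        if first_year <= year <= last_year:
            by_month[month].append(value)
    return {month: fmean(values) for month, values in by_month.items()}


def anomaly(series, climatology, year, month):
    """Departure of one monthly value from its climatological mean."""
    return series[(year, month)] - climatology[month]


# Toy single-point series keyed by (year, month); the values are made up.
toy = {(2000, 1): 1.0, (2001, 1): 3.0, (2002, 1): 5.0,
       (2000, 2): 2.0, (2001, 2): 2.0, (2002, 2): 8.0}
clim = monthly_climatology(toy, 2000, 2001)
print(clim)                         # {1: 2.0, 2: 2.0}
print(anomaly(toy, clim, 2002, 1))  # 3.0
```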
#### Data download
- ice_conc:long_name = "fully filtered concentration of sea ice using atmospheric correction of brightness temperatures and open water filters"
- From ftp://osisaf.met.no/reprocessed/ice/conc/v2p0
- 1979/01/02 to 2015/12/31
- one file per day ~10MB
- nv = 2 xc = 432 yc = 432 (whole Earth?)
Theoretical total size for 37 years = 10 * 365.25 * 37 = 135.142GB
- Issue : nearly 50% of the days missing before 1991
```
import os

# Fetch the northern-hemisphere files, most recent year first.
for year in range(2015, 1978, -1):
    for month in range(12, 0, -1):
        command = 'wget -r --user="anonymous" ftp://osisaf.met.no/reprocessed/ice/conc/v2p0/' + str(year) + '/' + str(month).zfill(2) + '/*nh*.nc'
        print(command)
        os.system(command)
```
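
The theoretical sizes quoted in this section can be reproduced with a few lines. This is only the arithmetic from the notes (file size in MB times days per year times number of years), with 1 GB taken as 1000 MB; the function name is made up.

```python
DAYS_PER_YEAR = 365.25


def archive_size_gb(mb_per_file: float, years: float) -> float:
    """Approximate size in GB of a daily archive (1 GB taken as 1000 MB)."""
    return mb_per_file * DAYS_PER_YEAR * years / 1000


# Figures taken from the notes in this section.
first_period = archive_size_gb(10, 37)           # 1979-2015, ~10 MB per file
second_period = archive_size_gb(6, 7 + 4 / 12)   # 2016-01 to 2023-04, ~6 MB per file
print(f"{first_period:.1f} GB, {second_period:.1f} GB")
```

These are upper bounds, since the days missing from the archive are not subtracted.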
- from ftp://osisaf.met.no/reprocessed/ice/conc-cont-reproc/v2p0
- 2016/01/01 to 2023/04/30
- one file per day ~ 6MB
- nv = 2 xc = 432 yc = 432 (whole Earth?)
Theoretical total size for 7 years and 4 months = 6 * 365.25 * (7 + 4/12) = 16.071GB
- Issue : 2 days missing in 2022 (20 & 21 February) and 1 day missing in 2021 (9th November)
```
import os

# Same loop for the continuous reprocessing archive (2016 onwards).
for year in range(2023, 2015, -1):
    for month in range(12, 0, -1):
        command = 'wget -r --user="anonymous" ftp://osisaf.met.no/reprocessed/ice/conc-cont-reproc/v2p0/' + str(year) + '/' + str(month).zfill(2) + '/*nh*.nc'
        print(command)
        os.system(command)
```
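
To list the missing days after a download, something like the sketch below could be used. The file names are hypothetical and the parser simply assumes that each name ends in `_YYYYMMDDhhmm.nc`; adapt it to the names actually present on disk.

```python
from datetime import date, timedelta


def missing_days(available: set[date], first: date, last: date) -> list[date]:
    """List every day between first and last (inclusive) absent from available."""
    n_days = (last - first).days + 1
    expected = (first + timedelta(days=i) for i in range(n_days))
    return [day for day in expected if day not in available]


def date_from_filename(name: str) -> date:
    """Parse the date from a name ending in _YYYYMMDDhhmm.nc (assumed pattern)."""
    stamp = name.rsplit("_", 1)[-1]  # e.g. "202202191200.nc"
    return date(int(stamp[0:4]), int(stamp[4:6]), int(stamp[6:8]))


# Hypothetical file names, used only to illustrate the check.
names = ["ice_conc_nh_202202191200.nc", "ice_conc_nh_202202221200.nc"]
have = {date_from_filename(n) for n in names}
gaps = missing_days(have, date(2022, 2, 19), date(2022, 2, 22))
print([d.isoformat() for d in gaps])  # ['2022-02-20', '2022-02-21']
```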
The monthly means from 1979-01 to 2023-04 can now be downloaded from https://zenodo.org/record/8004596
#### 25 May 2023
- Meeting with Tom
- ICENET tools on Galaxy: https://usegalaxy.eu/?tool_id=icenet_forecast&version=0.1.0%20galaxy0
- Still need to precompute 35 years
- Download the raw data (try with AWS, perhaps with https://github.com/planet-os/notebooks/blob/master/aws/era5-s3-via-boto.ipynb)
- Climatologies, netCDF
- Only hard, linear trend
- Computing and map linear
- Config files customisable? It should be fixed
- Only change if you retrain a different
- Deadline
- 1st September
- NetCDF, more convenient, coordinates (time of year)
##### Actions
- Code to :
- download ERA5 and ice concentration data
- compute linear trend & anomalies
- Test tool online and make field types right
- convert individual files into a collection (follow the tutorial https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html)
Next meeting: Friday 02 June <- Jean will not be able to attend (coming back from Bergen by train)
#### 19 May 2023
- Meeting with IceNet team
- Showcase progress so far
- Questions on the codebase
- Anne F suggests keeping all workflows in a single folder
##### Actions
- Invite Tom and James to meeting
#### 11 April 2023
Present: Anne, Jean, Alejandro, Vanessa
- Vanessa managed to make a forecast, for various leadtimes, but only the 6-month worked :dancer:
- The pull request she made is still waiting for approval, and then Bjoern will have the final word
- There is a Galaxy tool which can help with editing Python scripts (Galaxy language server extension, https://github.com/galaxyproject/galaxy-language-server) and which highlights errors in Visual Studio Code.
- Test data cannot exceed 1MB, so for now there will be no extensive test on the input data (only something for the usage, with command lines, on the shape of the outputs, etc.)
- Bjoern asked about having an Icenet folder containing the 3 tools (since they are not generic) - Do it in the branch with the pull request
- The tool for the visualization is ready, the next step would be to automate the process from a GitHub repo with an action that would generate a webpage
- we can look into keeping small files on Galaxy or github with linear trend values calculated each month (or each year) over the previous 35 years
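
A minimal sketch of what "linear trend values over the previous 35 years" means for one grid cell and one calendar month: fit a straight line by least squares and evaluate it at the target year. The function name and the sample values are made up, and this is not the `icenet-paper` implementation.

```python
def extrapolate_linear_trend(years: list[int], values: list[float],
                             target_year: int) -> float:
    """Fit y = a + b * year by least squares and evaluate it at target_year."""
    n = len(years)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (year, value) pairs of equal length")
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("all years are identical, no trend can be fitted")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * target_year


# Made-up values for one grid cell and one calendar month, 35 years of history.
history_years = list(range(1985, 2020))
history_values = [0.9 - 0.01 * (y - 1985) for y in history_years]
print(extrapolate_linear_trend(history_years, history_values, 2020))
```

Storing only the slope and intercept per grid cell and month would keep the files small, at the cost of recomputing them whenever the 35-year window moves.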
Next meeting: Tuesday 18th April 12:00-12:45 CET
#### 24 March 2023
Present: Anne, Jean, Alejandro, Vanessa
- Discussion on Git with Tom about the input data required, including the linear trends since 1979 and the climatologies (statistics for all the variables, the so-called anomalies)
- a priori not needed if we only re-use the weights from the pre-trained model
- Vanessa would like to have a chat with Tom and Jim to clarify a few points
- norm_params.json we use for now as it is. It contains mean and std values used to pre-process the input data and normalize it. The general formulas are:
normalized_parameter = (parameter - mean) / std
or normalized_parameter = (parameter - min) / (max - min)
- The Jupyter notebook runs and uses the input dataset from Zenodo
- We should not need future data: that was used in the paper for validation purposes only (do not reproduce this "validation" stage in the Galaxy tool)
- Alejandro calculated the linear trends for 35 years before the year he was forecasting
- there is a section in the code with:
```
if data_format != 'linear_trend':
    ...
    input_months = [present_date - relativedelta(months=lb) for lb in lbs]
elif data_format == 'linear_trend':
    input_months = [present_date + relativedelta(months=forecast_leadtime) for forecast_leadtime in np.arange
...
```
which makes it look as if the model is using future data to make predictions!?!
- Vanessa needs to finish in June
- work on finalizing the forecasting Galaxy tool
- including all the input data required for testing (may be too large for github: we could use "dummy" tests, like checking that things are present instead of running them)
- work on the visualization
- make a small workflow showing how it performs
- ignore the icenet package for now (still under development)
- Anne is to look into that, to find out what is actually computed on-the-fly
- Questions for the authors (Tom & Jim?):
- Why do we need to compute linear trends for future dates?
- Are the **future** linear trends used to make the actual forecast?
or only for "validation"
(and hence not needed unless we wanted to reproduce the validation graphs from the paper)?
- For the Galaxy tool we only need to worry about preparing the input data (from the past) and running the model with existing weights to make a forecast (output); we do not repeat the validation
Next meeting: 11th April 11:00 (CET)
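
The two normalisation formulas noted above can be sketched as follows. The parameter values and the dictionary layout are hypothetical; the real values come from `norm_params.json`, whose exact structure is not reproduced here.

```python
def standardise(values, mean, std):
    """Apply (value - mean) / std to each value."""
    return [(v - mean) / std for v in values]


def min_max_scale(values, minimum, maximum):
    """Apply (value - min) / (max - min) to each value."""
    span = maximum - minimum
    return [(v - minimum) / span for v in values]


# Hypothetical parameters, for illustration only.
params = {"t2m": {"mean": 270.0, "std": 10.0},
          "siconca": {"min": 0.0, "max": 1.0}}

scaled_t2m = standardise([260.0, 270.0, 285.0], **params["t2m"])
scaled_sic = min_max_scale([0.0, 0.25, 1.0],
                           params["siconca"]["min"], params["siconca"]["max"])
print(scaled_t2m)  # [-1.0, 0.0, 1.5]
print(scaled_sic)  # [0.0, 0.25, 1.0]
```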
##### Actions
- [name=Alejandro]: to arrange a meeting with the authors
- [name=Anne]: to look into the icenet package & linear trend calculations
#### 10 March 2023
Present: Vanessa, Anne, Jean, Bjoern, Alejandro
- Errors when trying to run the code without linear trends: 6 input variables are missing (incompatible shape, with only 44 out of the 50 expected). Table 2 in the supporting material of the Nature paper lists all the variables, showing there the "Linear trend forecast" with 1-6: this is obtained from linear regression (with a script provided) and will have to be adapted every time the reference period is modified.
- [name=Alejandro] There is a function called `linear_trend_forecast(forecast_month, n_linear_years='all', da=None, dataset='obs')` in icenet-paper/icenet/models.py.
- Another issue was that when forecasting for 2020, data from 2022 was also requested, which seems strange.
- For collections as inputs, is it possible to provide only the folder instead of every file there is in the folder? Not possible says Bjoern, it needs every single file listed.
- Do we want the user to download all the datasets?
- As a start we want to reproduce what is in the paper.
- When someone downloads some data we should keep it in a "cache" to avoid repeating the download every time.
- Regarding the THREDDS server, the syntax to download data is quite complicated; it may be possible to get global files and mirror them? There is an intake interface for THREDDS (https://github.com/intake/intake-thredds/tree/main)
- The problem Vanessa has is that the file is very large (dates start from 1979): we could download it and store it as netCDF? Anne will check
- Put the slide in Zenodo so that Vanessa gets a DOI which can be referenced (it is a live document for which there can be various versions)
- So far Vanessa only used the dataset from Zenodo corresponding to what was published, not with the actual data downloaded directly
- The actual input data is needed for testing, but for outputs using *hashes* could be sufficient
- Use a small area (one point?) for the test
- New custom domain and logo for EDS book: www.edsbook.org
- Next meeting 2023-03-24
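
For the idea of testing outputs through *hashes*, a minimal sketch using only the standard library is given below. The file written here is a stand-in; a real test would hash the file produced by the tool and compare it with a stored reference value.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a tool output; the name and content are made up.
with tempfile.TemporaryDirectory() as tmp:
    output = Path(tmp) / "forecast.bin"
    output.write_bytes(b"example output")
    checksum = sha256_of(output)

print(checksum)
```

A hash only matches when the output is identical byte for byte, so this works best for outputs that do not embed timestamps or other run-dependent metadata.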
##### Actions
- [name=Anne]: to check the size of the data from met.no and try to generate a single file from it
- [name=Alejandro]: try https://github.com/intake/intake-thredds?
- [name=Vanessa]:
- to contact **Tom**, briefly introduce herself and ask about the linear trend data needs. Do it in the issues on the same git repository
- to push the tool as it is (in a new branch, which makes it easier to review) - Issue with the test data (probably too large to be on Github). In the future could be referenced as *url*
- [name=Bjoern] To send Galaxy stickers to Alejandro
#### 3 March 2023
Present: Vanessa, Anne, Jean, Alejandro
- Vanessa concerned about weird messages with `planemo serve`: this is normal behaviour, you have to start it and open a browser (http://127.0.0.1:9090, for instance)
- For the sake of reproducibility take note of which version of the packages was installed (tensorflow == 2.10, etc.) so that they can be pinned
- Problem to download from met.no the sea-ice concentration - Ask Bjoern about adding it as a remote file in Galaxy (since it is a thredds server)
- Anomalies ('amon') are computed on the fly
- Uploading a small netCDF file to Galaxy takes ages: always specifying that the format is netCDF (instead of relying on automated detection) normally works better
- Linear trend data (for comparison) not needed now
- CMIP6 data not relevant either
- Try to make a prediction over the same period as in the paper, for validation of the output
- In March there is a meeting organised by the Turing, and it would be nice to highlight what Vanessa is doing and get feedback; perhaps better later, with sufficient notice
- Weird that the output files with variables all have the same size in bytes (746624): re-load the data from file (in a Python notebook) with Numpy to check that they are correct
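
One possible explanation for the identical sizes, to be confirmed by actually loading the files: if every output is an uncompressed `.npy` file holding one 432 x 432 field of 4-byte values, all files would have the same size whatever the variable. The sketch below is only the arithmetic; the 4-byte item size and the 128-byte header are assumptions.

```python
def npy_file_size(shape: tuple[int, ...], itemsize: int,
                  header_bytes: int = 128) -> int:
    """Expected size in bytes of an uncompressed .npy file.

    header_bytes=128 is an assumption: it is a common header length for
    small arrays but is not guaranteed for every NumPy version or array.
    """
    n_values = 1
    for length in shape:
        n_values *= length
    return n_values * itemsize + header_bytes


# 432 x 432 grid (xc and yc of the sea-ice files), 4-byte values assumed.
expected_size = npy_file_size((432, 432), itemsize=4)
print(expected_size)  # 746624
```

If this is the explanation, equal file sizes say nothing about the content, so checking the values after loading is still needed.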
We keep the next meeting originally planned for next week unless Vanessa decides to shift by one week (to be decided Monday/Tuesday).
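
To take note of the installed package versions for pinning, a short standard-library sketch is shown below. The package list is only an example and should be adjusted to what the tool really imports.

```python
from importlib import metadata


def pinned_requirements(packages: list[str]) -> list[str]:
    """Return 'name==version' for installed packages, flagging missing ones."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name} is not installed in this environment")
    return lines


# Package names are examples only.
for line in pinned_requirements(["tensorflow", "numpy", "xarray"]):
    print(line)
```

The printed versions can then be copied into the requirements of the Galaxy tool.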
##### Actions
- [name=Alejandro]: notify IceNet team about the progress of the implementation in Galaxy
- [name=Anne]: Check with Bjoern about the sea-ice data & provide commands to load the data from file and check with Numpy
#### 10 February 2023
Present: Vanessa, Anne, Alejandro
- Using IceNet source code
- Progress in Galaxy and tools
- No longer using the icenet Python package
##### Actions
- [name=Vanessa]:
- Preprocessing
#### 27 January 2023
Present: Vanessa, Anne, Alejandro, Jean
- Start with the icenet conda package
- the prediction notebook is not very demanding (ran on MyBinder) but the training probably requires a GPU for Tensorflow
- CMIP6 data not needed at this stage (only if we run the training again)
- Lead-time possible values 1 to 12 months?
##### Actions
- [name=Alejandro]
- ~~Ask pretrained model for daily forecasting~~
- ~~Difference between icenet-pipeline vs icenet~~
- Confirm input data for prediction
- In dataloader, why is `max_lag` equal to 12, compared to 3 for the other input vars?
- 'siconca': {'abs': {'include': True, 'max_lag': 12}
- [name=Vanessa]
- ~~Find and match icenet python package and notebook functions, validate functions with the IceNet research team~~
- Galaxy Climate tools: https://github.com/NordicESMhub/galaxy-tools
- [x] Fork it and create a new branch
- [ ] Galaxy tool for preparation
- [x] Repo with the tool: https://github.com/vanessa-tamara/galaxy-tools/
- [x] Regrid ERA5 standard grid
- [ ] Convert to numpy for ML
- [ ] Galaxy tool for running the monthly forecast
- [ ] Static data from Galaxy Library (AF)
- [ ] Galaxy tool for one ensemble model
- [ ] Galaxy tool for elaborating statistics
- [ ] Ask Bjoern tools folder name and structure
- Problems with preprocessing
- downloading sic data with the copernicus tool doesn't work
- add object storage : https://thredds.met.no/thredds/dodsC/osisaf/met.no/reprocessed/ice/conc_v2p0_nh_agg?xc[0:1:431],yc[0:1:431],lat[0:1:431][0:1:431],lon[0:1:431][0:1:431],time[0:1:1],ice_conc[0:1:1][0:1:431][0:1:431]
- uploading a netcdf file on galaxy takes long?
- incompatible package versions (tensorflow)
- is linear-trend data or cmip6 data needed?
- for variable anomalies, training data is needed; should I just take the normalization parameter file from the notebook?
- planemo test works, but planemo serve doesn't ?
- [name=Anne]
- [ ] Check and prepare static data in Galaxy Library.
- [ ] Check how to properly test netCDF outputs (similar to what is done for HDF5)
- [ ] extract netcdf file from download_sic_data.py (thredds)
- [ ] Publish slides in Zenodo and add comment in issue
- [ ] Test Galaxy tool (and check inputs for size)
#### IceNet Support
- tag James Byrne, @JimCircadian, questions of the icenet package
- tag Tom Andersson, @tom-andersson, general questions of input data, icenet paper
### To develop Galaxy Tools
https://github.com/galaxyproject/galaxy-language-server