# Containers and Workflows in Bioinformatics
This is a collaborative "notebook" for the course on "Containers and workflows in bioinformatics", organised in November 2021 at CSC - IT Center for Science.
Some useful links for the course:
* [Course page in eLena platform](https://e-learn.csc.fi/course/view.php?id=73)
* [Zoom-link for sessions](https://cscfi.zoom.us/j/66318334180?pwd=ZGRwYWVsMUdyOWQ5a0FMWmV4UzVFZz09)
:::info
:bulb: **Hint:** HackMD is great for sharing information in this kind of course, as code formatting is nice and easy with [Markdown](https://www.markdownguide.org/basic-syntax/)! Just add 3 backticks (``` ` ```) around your ``` code blocks ```.
Otherwise, it's like a Google doc: it allows simultaneous editing. There's a section for practice down there ⬇️
:::
[ToC]
## 📅 Agenda
### Day 1: 23rd, November
:::spoiler
| Time | Content |
|-------|---------|
| 9:00 | Course preliminaries |
| 9:20 | Warm up with HackMD Environment |
| 9:30 | **Topic 1: Introduction to CSC HPC environment** |
| 10:00 | **Topic 2: Tutorial - HPC basics** |
| 10:30 | _Break_ |
| 10:40 | **Topic 3: Fundamentals of containers** |
| 11:10 | **Topic 4: Tutorial - Hello-World example** |
| 11:40 | _wrap-up_ |
:::
### Day 2: 24th, November
:::spoiler
| Time | Content |
|-------|---------|
| 9:00 | **Topic 6: Using container images in HPC environment** |
| 9:30 | **Topic 6:** Tutorials: Using existing Singularity images|
| 10:30 | _Break_ |
| 10:40 | **Topic 8: Containerised bio applications** |
| 10:50 | _Break_ |
| 11:10 | **Topic 8:** Tutorial - Containerised bio applications|
| 11:40 | Recap |
| 12:00 | Finish |
:::
### Day 3: 25th, November
:::spoiler
| Time | Content |
|-------|---------|
| 9:00 | **Topic 7: Converting docker images to singularity images** |
| 9:30 | **Topic 7:** Tutorials: Converting docker images to singularity images |
| 10:30 | _Break_ |
| 10:40 | **Topic 7:** Building singularity container images |
| 11:10 | **Topic 9: Tutorials: Building singularity container images** |
| 11:40 | Recap |
| 12:00 | Finish |
:::
### Day 4: 26th, November
:::spoiler
| Time | Content |
|-------|---------|
| 9:00 | **Topic 7: Introduction to nextflow** |
| 9:30 | **Topic 7:** Tutorial - hello-world example |
| 10:00 | **Topic 9: Using singularity containers in nextflow** |
| 10:30 | _Break_ |
| 10:40 | **Topic 9:** Tutorial - Using singularity containers in nextflow |
| 11:30 | Running nextflow at CSC |
| 12:00 | Finish |
:::
---
## 📝 Q & A
Your questions are answered here. We will answer them, and this document will store the answers for you for later use! :rocket:
:::info
Scroll :arrow_down: to the bottom of the page to submit a question
:::
## General and practical matters
- [x] **Q: I have difficulty pasting my questions into HackMD (here). Do you have some instructions on how to write here?**
- A: Can you see the three icons in the top left corner, next to the HackMD text? There's a pencil, a side-by-side symbol, and an eye. In the eye view you can't edit, you are just viewing. The other two reveal the Markdown (MD) version of the page, which you can edit. I find it easiest to edit in the side-by-side view.
:::info
:bulb: **Hint:** You can also apply styling from the toolbar at the top :arrow_upper_left: of the editing area.
:::
- [x] **Q: Slides available after the course?**
- A: They sure are! You will also have access to this HackMD document (save the link).
- The slides and tutorials are available here: **link here**
- Access to eLena (and the quizzes there) may be discontinued.
- We encourage you to share and use the material also in your own courses. The material is in git, and we welcome all feedback and edit suggestions (pull requests)!
- [x] **Q: Has the Zoom meeting already started? I'm just getting a "The meeting has not started" note.**
- A: Make sure you have the full Zoom link when connecting! It's long and contains the password :)
- [x] **Q: Should we be able to access the slides in eLena? I can only see the first slide.**
- A: Try navigating with the arrow keys :)
## CSC HPC ENVIRONMENT
- [x] **Q: How can you place your own scratch directory in a global variable (like $WRKDIR), so that you can easily navigate there in each session? I always forget to export between sessions.**
- A: One way is to define an alias in your bash profile, or to put `export` commands in your `.bashrc` file so that the variables are set every time you log in:
```
export WRKDIR="/path/to/folder"
```
- Note that GENERALLY we don't recommend adding stuff to your `.bashrc`, as that is likely to cause problems down the line, but this is OK :) Some care should be taken if you are a member of more than one project, of course.
- [x] **Q: Where does so called standard out and error go from SLURM jobs?**
- A: Files are created in the folder where the batch job is submitted. By default both stdout and stderr go to the file `slurm-<jobid>.out`. You can also direct them to specific file names in batch scripts with the `--output` and `--error` parameters.
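As a minimal sketch of the `--output` and `--error` options mentioned above, a batch script could separate the two streams like this. The project name, time limit and file names are placeholders, not course settings; `%j` is replaced by the job ID.

```shell
#!/bin/bash
# Placeholder project name and time limit
#SBATCH --account=project_xxxx
#SBATCH --time=00:05:00
# stdout goes to myjob_<jobid>.out, stderr to myjob_<jobid>.err
#SBATCH --output=myjob_%j.out
#SBATCH --error=myjob_%j.err

report() {
    echo "normal output"                 # written to stdout
    echo "something went wrong" >&2      # written to stderr
}

report
```

Without the `--error` line, both messages would end up in the same file.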
- [x] **Q: Will these work in ePouta as well?**
- A: ePouta is a cloud service, and things work differently there. What has been described so far concerns CSC's supercomputers, Puhti and Mahti.
- [ ] **Q: I'm quite impressed by the Puhti web interface. Do you plan to add more kernels to Jupyter? I regularly use R, sometimes also python2 and bash, in my own Jupyter installations. At least supporting R would be nice.**
- A: Yes, we are planning to add more. Any requests and comments are very welcome. The system is still new and developing.
- [x] **Q: There is no emacs, nano or pico in interactive mode. How do I get those?**
- A: Try ```module load nano```
- [x] **Q: Running ```esearch -db protein -query "Pythium [ORGN]"``` leads to: ```perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").```
And yes, I guess this is again the same thing I often end up with on Puhti. I am using a Mac and I think I fixed this once, but apparently not. Or then this is something new. It works anyhow, it just gives errors on the way.**
- A: On a Mac the locale is sometimes not set. When you ssh from a Mac to the CSC environment, you may then see warnings like these on the HPC systems. You can export the locale settings:
```
export LANG=en_US.UTF-8
# or
export LC_ALL=C.UTF-8
```
On Mac you can stop forwarding locale settings as below:
```
sudo vi /etc/ssh/ssh_config
```
Comment out the following line in the `ssh_config` file with `#`:
```
SendEnv LANG LC_*
```
After that, you can ssh into the HPC system again and check whether the issue is resolved. Usually the issue should not affect the results, but the warnings are annoying.
- Participant comment: add an entry for Puhti to ```~/.ssh/config``` (before any general entry)
```
Host puhti.csc.fi
SetEnv LC_ALL=en_US.UTF-8
SetEnv LANG=en_US.UTF-8
User cscuserid
```
- [ ] **Q: Can the account for SBATCH be set in an environment variable for the session?**
- A: You can at least do the following:
```
export account=project_xxxx
sbatch --account=$account batchscript.sh
```
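Slurm also reads some defaults from the environment: according to the `sbatch` manual, `SBATCH_ACCOUNT` is used as the value of `--account`. A minimal sketch, with `project_xxxx` and `batchscript.sh` as placeholders:

```shell
# Set the default account for every sbatch call in this session
export SBATCH_ACCOUNT=project_xxxx

# Submit only where sbatch and the script are actually available
if command -v sbatch >/dev/null 2>&1 && [ -f batchscript.sh ]; then
    sbatch batchscript.sh
fi
```

Per the same manual, environment variables override `#SBATCH` lines in the script, and command-line options override environment variables, so an explicit `--account` on the command line still wins.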
- [x] **Q: I got ```sbatch: error: AssocMaxSubmitJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)``` message when trying to submit the job.**
- A: To me it looks like you are launching two interactive sessions, or you have the wrong project name in the batch script. We can surely help you after the presentation if needed.
- I (Julia) had the same error and just solved it with Tuomo's tip: copy and paste had added rogue spaces. (Could you please share your script here if it is a batch job?)
- Course participant tip: for me the problem was that when I copied the Slurm script directly into nano, comments were broken onto separate lines. Fixing that fixed the error **<- try this!** (Great, many thanks! The idea is that a **batch script should not have empty spaces or lines before the # (shebang) symbol**.)
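The copy-paste problems described above can be checked for before submitting. The following is only a heuristic sketch: `check_script` is a hypothetical helper, not a Slurm or CSC tool, and it looks for just two symptoms (something before the shebang, and indented `#SBATCH` lines, which Slurm may not recognise as directives).

```shell
# check_script FILE: report common copy-paste damage in a batch script
check_script() {
    # The very first two bytes must be the shebang, with nothing before it
    if [ "$(head -c 2 "$1")" != '#!' ]; then
        echo "first line does not start with #!"
        return 1
    fi
    # #SBATCH directives are expected to start in the first column
    if grep -qE '^[[:space:]]+#SBATCH' "$1"; then
        echo "indented #SBATCH line found"
        return 1
    fi
    echo "ok"
}
```

Running `check_script batchscript.sh` before `sbatch batchscript.sh` prints `ok` when neither symptom is found.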
- [x] **Q: Any special steps to end the ```sinteractive``` session?**
- A: Just use `exit` to leave the `sinteractive` environment.
- [ ] **Q: What is the most convenient way of getting a notice when a sbatch job is finished, its metrics and output?**
- A: You can request e-mail notifications in the `#SBATCH` directives (using `--mail-type=END` and optionally setting an address with `--mail-user`). On the command line, you can always monitor your jobs with `squeue -u $USER`, and `seff <job id>` shows the status of a job.
- [ ] **Q: The job log that is emailed, is it available in the HPC environment? Like billing units used, memory and CPU time used**
- A: Yes, try the `seff <job-id>` command. You will get more details about the job, including the ones you asked about.
- [ ] **Q: Ok that is good, how do I find the job id when it is finished? Now I look in the out file name?**
- A: Yes. You can check whether you have file names like `out_<job id>.txt` in the folder where the job was submitted. But there are more professional ways of finding your jobs, using commands like `sacct`. You can get information on all of your jobs with e.g. `sacct -u $USER --format=JobID,JobName,MaxRSS,Elapsed -S 2021-11-20` (the last flag, `-S`, is the start time; you can change it as needed).
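If the output files still have the default name `slurm-<jobid>.out`, the job ID can be pulled out of the file name with plain shell. This is a small sketch; `jobid_from_file` is a hypothetical helper, and it assumes the default naming, not a custom `--output` pattern.

```shell
# Extract the job ID from a default Slurm output file name (slurm-<jobid>.out)
jobid_from_file() {
    name=${1##*/}        # drop any leading directories
    name=${name#slurm-}  # drop the "slurm-" prefix
    echo "${name%.out}"  # drop the ".out" suffix
}

# Look at every default output file in the current directory
for f in slurm-*.out; do
    [ -e "$f" ] || continue          # no matching files: skip
    jobid=$(jobid_from_file "$f")
    echo "found job $jobid"
    # On the cluster you could now run: seff "$jobid"
done
```

Alternatively, `sbatch --parsable batchscript.sh` prints only the job ID at submission time, so it can be stored in a variable or a file right away.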
- [ ] **Q: What is the e-mail notification setting? Is it `#SBATCH --mail-user` with the e-mail address and `#SBATCH --mail-type=END`?**
- A: It works with `--mail-user` set to your e-mail address and `--mail-type` set to END (we want a notification when the job finishes). `--mail-user` does not seem to be needed; `--mail-type` is enough. (I think by default it will send to your CSC-registered e-mail.)
- [ ] **Q: can you comment out SBATCH statements?**
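- A: Yes. Put an extra `#` and a space in front of the directive (e.g. `# #SBATCH --mail-type=END`), so that the line no longer starts with `#SBATCH` and is treated as an ordinary comment.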
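The notification and commenting-out answers above can be combined in one sketch. The project name, time limit and e-mail address are placeholders; the `##SBATCH` form is a common alternative way to disable a directive, since the line no longer begins with `#SBATCH`.

```shell
#!/bin/bash
# Placeholder project name and time limit
#SBATCH --account=project_xxxx
#SBATCH --time=00:05:00
# Active directive: send an e-mail when the job ends
#SBATCH --mail-type=END
# Disabled directive: the extra # turns it into a plain comment
##SBATCH --mail-user=user@example.org

status="finished"
echo "job $status"
```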
- [ ] **Q: What's the overhead (space, time) of SquashFS?**
- A: I haven't really tested e.g. SquashFS vs NVMe yet. The benefits over Lustre are clear. There is some time and memory overhead related to mounting the file, but at this point I don't have any specifics. It does not seem to be too large. I'm planning on doing more benchmarking later this year when I have time.
- According to [Wikipedia](https://en.wikipedia.org/wiki/SquashFS) it's a read-only, compressed file system. So I guess there's overhead in building it, and it also partially duplicates the data. Useful nevertheless.
## DOCKER CONTAINERS
- [ ] **Q: What if I get ```docker: failed to register layer: Error processing tar file(exit status 1): write /usr/bin/wall: no space left on device.``` on PWD?**
- A: Use the `docker images` command on the host machine to check whether you have downloaded many images.
- You can delete any images you don't need (use `docker rmi <image id>`)
- [x] **Q: I have a [program](https://github.com/ariloytynoja/pagan2-msa#1-using-the-precompiled-docker-image-from-docker-hub) that runs with the docker command ```docker run --rm -v `pwd`:/data pagan2 --pileup --homopolymer -q 454_reads.fas -o 454_aligned ```. I can run it with singularity as ```singularity run -B /scratch:/scratch pagan2.sif --pileup --homopolymer -q /scratch/project_2003682/aloytyn/singularity/454_reads.fas -o /scratch/project_2003682/aloytyn/singularity/454_aligned.fas```. Is there possibility to map ```$PWD``` to ```/data``` as with docker.**
- A: It works with ```singularity run -B `pwd`:/data pagan2.sif --pileup --homopolymer -q /data/454_reads.fas -o /data/454_aligned.fas```. Singularity doesn't understand Docker's WORKDIR, which is why the full `/data/...` paths are needed.
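Singularity has a `--pwd` option that sets the initial working directory inside the container, which may get closer to the Docker behaviour. The following is an untested sketch that assumes the image expects to work in `/data`, as the Docker command in the question suggests:

```shell
# Bind the current host directory to /data, as `-v $(pwd):/data` does in Docker
bind_arg="$PWD:/data"

# Run only where Singularity and the image are available
if command -v singularity >/dev/null 2>&1 && [ -f pagan2.sif ]; then
    # --pwd sets the working directory inside the container,
    # so relative file names resolve under /data
    singularity run -B "$bind_arg" --pwd /data pagan2.sif \
        --pileup --homopolymer -q 454_reads.fas -o 454_aligned
fi
```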
- [ ] **Q: What is an entrypoint script?**
- A: ENTRYPOINT is an instruction in a Dockerfile that sets the executable that is run when the container starts.
- [ ] **Q: A general question: What are the drawbacks of Singularity -- or why is it so much less used than Docker? Or is it superior and going to replace Docker over time?**
- A: Docker is more flexible if you have root access on the machine you are running on (writable containers etc.). The main benefit of Singularity is that it can be run with normal user rights. Docker will probably remain the most popular technology for the foreseeable future, and Singularity will be used mostly for "niche" uses like HPC.
- [ ] **Q: I've downloaded and converted a docker container to a singularity container for the proteomics program MaxQuant. I can't figure out how to properly run it if I type ```singularity exec ../maxquant_2.0.1.0.sif maxquant -help``` I get: ```/usr/local/bin/maxquant: line 4: dotnet: command not found```. If I just run the env ```singularity run ../maxquant_2.0.1.0.sif``` and then launch maxquant ```Singularity> maxquant``` I get ```Failed to initialize CoreCLR, HRESULT: 0x80004005```. How should I try to run this?**
## SINGULARITY CONTAINERS
- [ ] **Q: Differences between Singularity and Docker?**
- A: Docker can't be run with normal user rights, so it can't be run on Puhti. Singularity doesn't need any root rights.
- [X] **Q: There was a mention of ```okd``` container support in Rahti, will singularity containers run there?**
- A: On the Rahti container cloud, Singularity containers would work fine. The main thing is that the containerised application should run as a normal user, not with root privileges.
- [ ] **Q: How do I find the mounted paths from within a singularity session?**
- A: Can you please specify a bit more what you want to find out? At least you can run `df`.
- I want to know which paths are writable, e.g. for command outputs. Side question: is there some other sink than ```/dev/null``` which is not writable?
- A: Nothing is writable except what you bind from the host. Bind targets need not exist inside the container; they will be created if necessary. So you can do something like `$PWD:$PWD`, and the current directory is writable from the container. Stdout from the container can be directed to the host's /dev/null. For other devices you could bind the host's /dev: `--bind /dev:/dev`.
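To see for yourself which paths are writable, a small shell function can be pasted into a `singularity shell` session (or run on the host for comparison). `check_writable` is a hypothetical helper name, and the paths in the last line are only examples.

```shell
# Report whether each given path is writable by the current user
check_writable() {
    for p in "$@"; do
        if [ -w "$p" ]; then
            echo "$p: writable"
        else
            echo "$p: not writable"
        fi
    done
}

check_writable "$PWD" /tmp /
```

Comparing the output with and without a `--bind` option shows the effect of the bind.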
- [x] **Q: Where is the "hello2" executable (suggested by the `singularity run-help tutorial.sif` output)? Can't find it in singularity.sif container**
- A: It is in `found/me/hello2`
- A: How to find it is covered in tomorrow's tutorial
- [X] **Q: Are headless X11 sessions available in Singularity?**
- A: Yes, if all the necessary libraries are available in the container. The $DISPLAY issue I mentioned is at least partly related to this.
- [ ] **Q: I'm getting an error when I run:**
```
singularity exec --bind $SCRATCH:/scratch tutorial.sif ls /scratch
WARNING: skipping mount of /scratch/project_2003682/training185: stat
/scratch/project_2003682/training185: no such file or directory
FATAL: container creation failed: mount /scratch/project_2003682
/training185->/scratch error: while mounting /scratch/project_2003682
/training185: mount source /scratch/project_2003682/training185 doesn't
exist
```
- A: Make sure the folder exists on the host before you bind it in Singularity.
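A minimal sketch of that fix: create the bind source on the host first, then bind it. The fallback directory `scratch_demo` is a placeholder used only when `$SCRATCH` is not set.

```shell
# Use $SCRATCH if it is set; otherwise fall back to a demo directory (placeholder)
bind_src=${SCRATCH:-$PWD/scratch_demo}

# Create the host directory first, so that the bind source exists
mkdir -p "$bind_src"

# Run only where Singularity and the tutorial image are available
if command -v singularity >/dev/null 2>&1 && [ -f tutorial.sif ]; then
    singularity exec --bind "$bind_src":/scratch tutorial.sif ls /scratch
fi
```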
- [X] **Q: In a batch file can --cleanenv be set in #SBATCH statements?**
- A: `--cleanenv` is an option of the `singularity` command rather than an `#SBATCH` directive, but you can use it on the `singularity` command line inside a batch job. Just make sure that nothing inside the container relies on host environment variables.
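As a sketch of how `--cleanenv` might look inside a batch script: host variables are dropped, but Singularity passes variables with the `SINGULARITYENV_` prefix into the container. The project name, time limit and the variable `MYVAR` are placeholders chosen for this example.

```shell
#!/bin/bash
# Placeholder project name and time limit
#SBATCH --account=project_xxxx
#SBATCH --time=00:05:00

# Variables with the SINGULARITYENV_ prefix are set inside the container
# (without the prefix), even when --cleanenv removes the host environment
export SINGULARITYENV_MYVAR="hello"

# Run only where Singularity and the tutorial image are available
if command -v singularity >/dev/null 2>&1 && [ -f tutorial.sif ]; then
    singularity exec --cleanenv tutorial.sif sh -c 'echo "$MYVAR"'
fi
```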
- [ ] **Q: When using the command `singularity shell` in sinteractive mode, ls shows the file system of the container, but when running `singularity shell` from a login node, ls lists the home directory. What is the explanation?**
- A: Use a full path when listing files. Only the login nodes have the HOME directories, and there they are mounted by default. `ls /` shows the container root, etc.
- [X] **Q: What would be the best practices you could recommend when using an containerized environment (my specific reference is R) that you want to enrich with project-specific libraries that should be available to all users in the project in a similar manner. The project-specific libraries may be rapidly evolving.**
- A: In the CSC environment, the R tools are actually provided as Singularity containers and are updated with new versions of R all the time. For project-specific installations, we have good documentation on the CSC docs pages: https://docs.csc.fi/apps/r-env-singularity/. The most relevant information for your question is in the section "R package installations". Please keep track of package versions for reproducibility. If you plan to use a completely different set of packages that is somehow incompatible with the container versions currently available at CSC, it is best to build a container yourself by modifying one of the Dockerfiles from Rocker (https://github.com/rocker-org/rocker-versioned2/tree/master/dockerfiles) and adding your custom packages on top of it, and then to convert it into a Singularity container image. The latter approach is more involved and requires some Docker/Singularity skills.
- [ ] **Q: What is the difference between Docker or Singularity containers and Anaconda environments?**
- A: When it's in a container, it's all there (from the host's point of view). Anaconda is a folder inside the host system, so it's less isolated from the host system. Conda will use host libraries, so updates might cause problems, unlike with Docker and Singularity. We will learn about this during the course :)
- [ ] **Q: Can you explain again what the XDG_RUNTIME_DIR is?**
- A: It's an environment variable needed by the host system. By default it is set to `/run/user/<userid>`. Since that folder does not exist inside the container, you get some (harmless but annoying) warnings.
- [X] **Q: Is there an image with Singularity installed, and shown to be usable, available in some image repository?**
- A: I think there are some Docker images with Singularity installed. I have tested the following image, and it has Singularity installed in it:
```
docker pull quay.io/singularity/singularity:v3.9.1-slim-arm64
```
You can pick different flavours/versions on quay.io registry
- [X] **Q: (Bonus) I installed singularity with default settings in an ubuntu docker container. When trying to build an image I get the error `Failed to create user namespace: not allowed to create user namespace`. Any immediate ideas to resolve this?**
- A: Are you doing this in PWD? PWD is docker-in-docker and might have some limitations.
- No, doing it in local Docker installation on laptop. Singularity install page says "Nested installations inside containers are not recommended, and require the outer container to be run with full privilege."
- A: The docker container needs to be running with the `--privileged` flag and you may want to look at model dockerfiles here: https://github.com/singularityhub/singularity-docker/blob/v3.9.0-slim/Dockerfile
- Is this the definition for the image above?
- Yes. You can check similar dockerfiles in the same github for different flavours and corresponding images in quay.io
## NEXTFLOW
- [ ] **Q: Where is the `fastqc_demo` folder for Tutorial 2?**
- A: It was included in the tar file you downloaded in tutorial 1. Go up one directory: `cd ../fastqc_demo`
- [ ] **Q: What is the benefit over Snakemake?**
- A: There are some comparisons in the article mentioned in the slides (https://www.nature.com/articles/nbt.3820). According to that article, the main benefit is better integration of containers.
## ☃️ ICE BREAKER (HackMD -practice)
Let's learn how to use this HackMD document by answering an ice breaker question!
- [x] Do you need a course participation certificate from this course? (Format: First name (yes/no))
- Laxman (No)
- Jonas (No) :smile:
- Matias (yes)
- Santtu (Yes)
- Nick (no)
- Lisa (yes)
- Kari (yes)
- Tuomo (yes)
- Eva (yes)
- Julia (yes)
- Sonja (yes)
- Tiina (yes)
- Panu (no)
- Sami (no)
- Kisun (no)
- Carlos (yes)
- Binisha (Yes)
- Tia (no)
- Antti (no)
- Tapani (no)
- Ari (no)
- Neha Goel (yes)
- Katariina (no)
## ✏️ Add your questions here
Please type your questions here. We will answer them, and organise the document topically.
- [x] **Q: Have I clicked the edit mode on?**
- A: Probably not yet.. ↖️
- [x] **Q: Should I copy-paste an old question to get started with a new one?**
- A: A really good idea! Here's a template for you ⬇️
- [x] **Q: I think there are typos: should be ```python3-dev``` and ``` --recurse-submodules```. Also ```ln -s /usr/bin/python3 /usr/bin/python``` is needed.**
- A:
- [x] **Q: Can you explain again what the XDG_RUNTIME_DIR is?**
- A: It's an environment variable needed by the host system. By default it is set to `/run/user/<userid>`. Since /run does not exist inside the container, you get some (harmless but annoying) warnings. Tip with environment variables: you can use the `echo` command to see their values!
- [ ] **Q: Can you somehow wrap-up here those steps you demoed to persist changes in container, or is that in some later tutorial?**
- A: Ok, it is basically in tutorial 6, check that one :)
- [ ] **Q: Can you please walk through the fastqc example. Why are the files handled as a pair and not as a list? Where does sample_id come from?**
- A: Thanks. I just think that fastqc handles one file at time and considering them a pair is not necessary.
- [x] **Q: We learned that in NextFlow the channel is implemented as files that are passed on to processes. Could the channels be read and witten from other sources, eg a database or a message queue?**
- A: As long as you can get the database contents as variables, you can pass that information on to create channels. Reading databases directly as channel objects can be tricky, though, and may require plug-ins. Please read more here: https://github.com/nextflow-io/nf-sqldb
- [ ] **Q: Can you run some of the nextflow processes as array jobs (like sbatch_commandlist usage)?**
- A: I have not seen this case before. Because nextflow processes are triggered by the data in channels, data will be automatically submitted to processes thanks to the implicit parallelism in nextflow. There is no need to submit array jobs separately (nextflow does it for you).
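One way to let Nextflow hand each task to Slurm as its own job is the `slurm` executor. This is only a sketch under assumptions: the file name `slurm_demo.config`, the queue `small`, the account `project_xxxx` and the pipeline script `main.nf` are placeholders, and this is not necessarily the setup used in the course tutorials.

```shell
# Write a small Nextflow config that sends each task to Slurm as a separate job.
# Queue and account values are placeholders.
cat > slurm_demo.config <<'EOF'
process {
    executor = 'slurm'
    queue = 'small'
    clusterOptions = '--account=project_xxxx'
}
EOF

# Use it only where Nextflow and a pipeline script are available
if command -v nextflow >/dev/null 2>&1 && [ -f main.nf ]; then
    nextflow run main.nf -c slurm_demo.config
fi
```

With such a config, every item emitted by a channel becomes one Slurm job, so the parallelism comes from the channel rather than from an explicit array job.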
- [ ] **Q: I'm trying the sarek example but get an error that "tabix" is not available. I supposed that it is in "samtools" and loaded that but it doesn't help. Any idea why is that?
More generally, how can I find the content of a specific module? For example, does "samtools" also contain tabix and bcftools?**
- A: As all the necessary tools for the Sarek workflow should be in the Singularity image, it is possible that the container image has not been built or downloaded properly. Restarting from scratch could help. There is no need to load any modules.
---
**Write your questions ABOVE this line**
The questions are moved upwards :arrow_up: into their categories when they get an answer.