---
tags: Overview
---
# Environment
(Overview, Pipeline, Notebooks, Containers, Naming strategies, Data security)
## Overview
**Reproducibility:**
From the point of view of research reproducibility, the environment in which you carry out an analysis is just as important as the analysis itself. If the version of a library used at some step differs from the one installed by the person reproducing your work, if they provide the wrong input to a task, or if the computer used to run the analysis is different, the results can diverge substantially, or the analysis may not run at all. This matters all the more in metabolomics, where almost every lab has its own internal protocols and there is little consensus around data processing.
Links:
- [Reproducible research course at NBIS (free participation)](https://nbis-reproducible-research.readthedocs.io/en/latest/)
## Pipeline
The various scripts can be arranged into a pipeline. Pipelines are a way to structure code that gives a clean overview of the inputs and outputs of every task, as well as handling the task dependency graph. They can be implemented using specialised software that allows job control at massive scale. On Bianca, for example, the most commonly used workflow tools are:
- [Snakemake](https://snakemake.github.io/)
- [Nextflow](https://www.nextflow.io/)
Both tools offer job control on Bianca, as well as plotting and analysis utilities. However, as long as inputs and outputs are specified in a uniform manner across the scripts, it is also possible to build a custom pipeline in any programming language.
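As an illustration, a minimal Snakemake rule set might look like the sketch below. All file names and the R script path are hypothetical; a real workflow would list your actual inputs and outputs.

```
# Snakefile (sketch) -- file names and script path are hypothetical
rule all:
    input: "results/peaks.csv"

# one task: pick peaks from a raw data file using a placeholder R script
rule pick_peaks:
    input: "data/sample.mzML"
    output: "results/peaks.csv"
    shell: "Rscript scripts/pick_peaks.R {input} {output}"
```

Snakemake derives the dependency graph from the declared inputs and outputs, so adding a new task is just a matter of declaring what it consumes and produces.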
## Notebooks
Remember paper lab notebooks? They are hard to share with someone who wants to follow up on your study, and it is easy to lose track of your ideas or simply forget things. Rather than making disorganised notes in code or on paper, it is advisable to adopt notebook software. Unfortunately, running such software on Bianca is currently *hard* because the cluster is locked down for security (for example, no HTTPS traffic is allowed). But a notebook does not need to be kept on Bianca, provided it contains no sensitive material.
- [Jupyter](https://jupyter.org/)
- [Rmarkdown](https://rmarkdown.rstudio.com/)
## Containers
Containers are a way of freezing a snapshot of the system, so that a collaborator can simply take your data, or their own data, and use the container to run your workflow, without needing to install and configure their own environment (which, as noted above, is a major source of irreproducibility). A container can be run on HPC clusters such as Bianca, but the same container can also be rolled out in public or private clouds, or on a personal computer. You can also version container images, which lets you roll back to a previous environment for a certain task in your pipeline (different tasks can be associated with different containers).
The most common containers are Docker containers, but to run them on Bianca they first need to be converted into [Singularity](https://sylabs.io/docs/) containers. Containers also allow the exact specification of an environment in text files called recipes, which can then be used to rebuild the environment precisely. It has become common practice to attach a container to a publication, bundling data, code and notebooks together!
(TODO: link the full recipe code)
Here is how you would build an environment in a Singularity container. Below is a small example; please follow the link for the full recipe.
```
BootStrap: docker
From: centos:7
%runscript
Rscript /tmp/run_script.R >> /tmp/log.txt
%environment
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
%post
yum clean all
yum -y update
yum -y install sudo wget python-devel redhat-lsb-core which
yum -y install epel-release
yum -y install libxml2-devel netcdf-devel openssl-devel libcurl-devel
yum -y install libjpeg-turbo-devel
yum -y install R
# installing packages from cran
sudo R --slave -e 'if (!require("tidyverse")) install.packages("tidyverse",repos="https://cran.rstudio.com/")'
sudo R --slave -e 'if (!require("stringi")) install.packages("stringi",repos="https://cran.rstudio.com/")'
# installing from BioC, xcms and IPO
sudo R --slave -e 'if (!requireNamespace("BiocManager",quietly=TRUE)) install.packages("BiocManager", repos="https://cran.rstudio.com/")'
sudo R --slave -e 'BiocManager::install()'
sudo R --slave -e 'if (!require("xcms")) BiocManager::install("xcms",ask=FALSE)'
sudo R --slave -e 'if (!require("IPO")) BiocManager::install("IPO",ask=FALSE)'
```
Once such a recipe is saved to a file, for example "container.txt", you can build the container on your own PC and then move its image, which is a single file, to Bianca, or build it on Bianca directly. Example commands:
```
# example build command, building an image file called "peak.simg"
sudo singularity build peak.simg container.txt >>install_log.txt 2>&1
```
And an example job script on Bianca:
```
#!/bin/bash -l
#SBATCH -A sens2018586
#SBATCH -p core -n 10
#SBATCH -t 5:00:00
singularity run --bind /proj/data/location:/data /proj/container_image/location/peak.simg
```
Notice above that the data is bound to the container at run time; this helps with reproducibility and avoids locking the container image to Bianca!
## Naming strategies
**Instrument file names in A_B_C_D_E_F format**
This naming strategy is important for CMSITools to extract all relevant information from the filename properly.
- A is the date written as YYYY-MM-DD
- B is the batch number and the week in which it was analysed, written as BZZWXX (e.g. B02W43).
- C is for chromatography, either RP (Reversed Phase) or HILIC.
- D is for polarity, either POS or NEG.
- E is the sample identifier, marking the sample as a sQC, ltQC, blank, cond (conditioning plasma) or a sample (named so that it can be traced back to the sample it belongs to). Each element here needs a unique name within every batch (e.g. sQC01, or 125c).
- F is the injection number, marking the order in which the samples in each batch were injected.
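For illustration, the six fields can be recovered by splitting the filename on underscores. The filename below is hypothetical and only follows the convention described above:

```
# split a hypothetical instrument file name into its A..F fields
fname="2019-10-25_B02W43_RP_POS_sQC01_012"
IFS='_' read -r acq_date batch chrom polarity sample injection <<< "$fname"
echo "$batch $polarity $injection"   # prints: B02W43 POS 012
```

Note that the date (field A) uses hyphens, not underscores, so a plain underscore split is safe.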
**Features in ABc@d format**
- A is either H or R for chromatography (**H**ILIC / **R**eversed phase)
- B is either P or N for polarity (**P**ositive / **N**egative)
- c is the recorded m/z
- @ is a separator between m/z and rt
- d is the recorded retention time (rt)
NB! To facilitate tracking of features between different modules in the pipeline, *m/z and rt should be given with full resolution* - i.e. not truncated to a certain number of decimals.
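As a sketch, a feature name in this format can be assembled from its parts; the m/z and rt values below are made up for illustration:

```
# assemble a feature name in ABc@d format; values are hypothetical
mz="123.456789"
rt="678.912345"
feature="HP${mz}@${rt}"   # HILIC, positive mode
echo "$feature"   # prints: HP123.456789@678.912345
```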
**RAMClusters in ABCn@d format**
- A is either H or R for chromatography (**H**ILIC / **R**eversed phase)
- B is either P or N for polarity (**P**ositive / **N**egative)
- C for cluster
- n is the cluster number
- @ is a separator between the cluster number and rt
- d is the recorded retention time (rt)
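Analogously, a RAMCluster name for, say, reversed-phase positive-mode cluster 12 (the numbers are hypothetical):

```
# assemble a RAMCluster name in ABCn@d format; values are hypothetical
cluster=12
rt="678.912345"
cname="RPC${cluster}@${rt}"   # reversed phase, positive mode, cluster 12
echo "$cname"   # prints: RPC12@678.912345
```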
## Data security (working with sensitive data)
In Sweden, sensitive data is processed differently than in other countries. Our group needs to keep such data on Bianca, an HPC cluster, and the handling of this machine is described in another document.
(TODO: describe at what point data analysis can leave this machine)