# Linux for Bioinformatics - Crash Course

[toc]

## Small Bash cheat sheet

````bash
# change directory
cd /scratch/$USER

# show content of current directory
ls

# show the path you are currently located in
pwd

# make a new directory called 'myfolder'
mkdir myfolder

# show full content of a file
cat some-file.txt

# show first 10 lines of a file
head some-file.txt

# concatenate the content of files into a new file
cat file1.txt file2.txt file3.txt > file123.txt
# ... or via a wild card
cat file?.txt > file123.txt

# compress a file
gzip sample.fastq
# this will generate a compressed file 'sample.fastq.gz'

# uncompress a gz file
gunzip sample.fastq.gz

# make conda environment and activate it
conda create -n nanoplot
conda activate nanoplot

# install a tool that is available on conda into the activated env
conda install -c bioconda nanoplot

# run a program, show the help page to check how to use the tool and its parameters
NanoPlot --help
````

## Practices and examples

Here are some example commands and things you can try on your own. Always remember:

* use _auto completion_ as often as possible, you can always use the _tab_/_tabulator_ key to get suggestions for a command you are typing and to auto-complete folder/file names and paths - it's much faster and less error prone! (preventing typos!)
* avoid whitespace in all folder and file names! Whitespace is allowed in general, but it will complicate your work on a Linux system! Use `-`, `_`, etc. instead, e.g. `new-file.txt`
* always be careful when you delete a folder or file! It's not as easy as on Windows to get your data back!

Now, open a terminal and try the following commands.

```bash
# when opening a new terminal, you always start in your home directory
# the following command shows the path you are currently located in (remember the tree-like structure of folders on a Linux system!)
pwd

# create a new directory
mkdir testdir

# change into that new directory
cd testdir

# check where you are located now
pwd

# generate a new empty file
touch genome.fasta

# list content of the current directory
ls

# list more details, in a human readable format
ls -lah

# write some content into that file
printf ">Sequence\nATCGTACGTACGTAC\n" > genome.fasta

# check content of the file
cat genome.fasta

# change to your home directory
# ~ is a short version of /home/$USER
cd ~

# check again the content of the file you created
# now you have to type the path to find the file! Use auto-complete! Here we use the so-called relative path
cat testdir/genome.fasta

# you can also use the absolute path
cat /home/$USER/testdir/genome.fasta

# Hint: $USER is a so-called variable. To see the content of a variable you can also use echo:
echo $USER

# in $USER your terminal stores the name of the current user running the session.
# You can also define your own variables, for example you could store the absolute path to your file in a variable for easier re-use:
GENOME=/home/$USER/testdir/genome.fasta
cat $GENOME

# please notice that we always use a leading $ sign when we want to access the content of a variable! See the difference:
echo GENOME
echo $GENOME

# generate another file
touch genome2.fasta

# copy the file to the test folder
cp genome2.fasta testdir/

# list the content of the test folder
ls -lah testdir/

# remove the original file we just generated in your home dir
rm genome2.fasta

# is it gone?
ls -lah

# however, remember we copied the file, so a copy of the file we just deleted is still in the test folder
ls -lah testdir/
```

## Install conda

* Conda is a package manager that will help us to install bioinformatics tools and to handle their dependencies automatically
* In the terminal enter:

````bash
# Download conda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Run conda installer
bash Miniconda3-latest-Linux-x86_64.sh
# Use space to scroll down the license agreement
# then type 'yes'
# accept the default install location with ENTER
# when asked whether to initialize Miniconda3 type 'yes'
# ATTENTION: the space in your home directory might be limited (e.g. 10 GB) and per default conda installs tools into ~/.conda/envs
# Thus, take care of your disk space, if necessary (e.g. on a high-performance cluster)!

# Now start a new shell or simply reload your current shell via
bash

# You should now be able to create environments, install tools and run them
````

* Set up conda

````bash
# add repository channels for bioconda
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
````

* Create and activate a new conda environment

````bash
# -n parameter to specify the name of your environment
conda create -n nanoplot

# activate this environment
conda activate nanoplot

# You should now see (nanoplot) at the start of each line.
# You switched from the default 'base' environment to the 'nanoplot' environment.
````

__Hint 1:__ You can create as many environments as you want! It is often convenient to have separate environments for separate tasks, pipelines, or even tools.

__Hint 2:__ An often faster and more stable alternative to `conda` is `mamba`. Funnily enough, `mamba` can be installed via `conda` and then used in a similar way. Just replace `conda` with `mamba` (as shown in the bioinformatics tool slides, linked below).
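The installer notes above warn that home directories can have limited space. A minimal sketch for keeping an eye on this is shown below; `report_size` is a hypothetical helper name introduced only for illustration, and the directories you pass to it are assumptions that depend on where conda was installed on your system.

```bash
# Hypothetical helper: print the disk usage of a directory,
# or a short note if the directory does not exist.
report_size() {
    local dir="$1"
    if [ -d "$dir" ]; then
        # -s: one summary line, -h: human readable sizes
        du -sh "$dir"
    else
        echo "not found: $dir"
    fi
}
```

For example, `report_size ~/.conda/envs` reports how much space your environments take, and `df -h ~` shows how much space is left on the file system that holds your home directory. If you are unsure where your environments are stored, `conda info` lists the environment directories conda is using.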
## Install and use analysis tools

* Here we just install one example tool ([NanoPlot](https://anaconda.org/bioconda/nanoplot)) in the environment we just created

````bash
# in activated 'nanoplot' environment!
conda activate nanoplot
conda install nanoplot

# test the tool you just installed
NanoPlot --help
````

__You can also install specific versions of a tool!__

* important for full reproducibility
* e.g. `conda install nanoplot=1.40.0` (the version number is written without a leading `v`)
* per default, `conda` will try to install the newest tool version based on your configured channels and system architecture
* you can also create a new environment and install a tool in one step, for example: `conda create -n minimap2 -c bioconda minimap2`

## Useful additional links

* [Comprehensive introduction to remote computing & Linux](https://ngs-docs.github.io/2021-august-remote-computing/index.html)
* [Conda in general](https://docs.conda.io/en/latest/)
* [Bioconda](https://anaconda.org/bioconda)
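Coming back to installing specific tool versions: the sketch below shows one possible way to keep one environment per pinned tool version. Both `pin_spec` and `create_pinned_env` are hypothetical helper names that are not part of conda, and the `<tool>-<version>` naming scheme for environments is just an assumption for illustration.

```bash
# Hypothetical helper (not part of conda): build a "tool=version" spec,
# dropping a leading "v" as in "v1.40.0".
pin_spec() {
    local tool="$1" version="$2"
    printf '%s=%s\n' "$tool" "${version#v}"
}

# Hypothetical helper: create an environment named "<tool>-<version>"
# that contains the requested version of the tool.
# Only defined here, not called, because it needs a working conda installation.
create_pinned_env() {
    local tool="$1" version="${2#v}"
    conda create -y -n "${tool}-${version}" -c conda-forge -c bioconda \
        "$(pin_spec "$tool" "$version")"
}

pin_spec nanoplot v1.40.0   # prints: nanoplot=1.40.0
```

With conda set up, `create_pinned_env nanoplot 1.40.0` would create an environment called `nanoplot-1.40.0`, and `conda list -n nanoplot-1.40.0` would show which versions were actually installed.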