
Linux for Bioinformatics - Crash Course

Small Bash cheat sheet

# change directory
cd /scratch/$USER
# show content of current directory
ls
# show the path of the directory you are currently in
pwd
# make a new directory called 'myfolder'
mkdir myfolder
# show full content of a file
cat some-file.txt
# show first 10 lines of a file
head some-file.txt
# concatenate the content of files into a new file
cat file1.txt file2.txt file3.txt > file123.txt
# ... or via a wildcard (? matches exactly one character)
cat file?.txt > file123.txt
# compress a file 
gzip sample.fastq # this will generate a compressed file 'sample.fastq.gz'
# uncompress a gz file
gunzip sample.fastq.gz

# make conda environment and activate it
conda create -n nanoplot
conda activate nanoplot
# install a tool that is available on conda into the activated env
conda install -c bioconda nanoplot
# run a program: show the help page to check how to use the tool and its parameters
NanoPlot --help
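
The difference between the `?` and `*` wildcards is easy to try out in a throwaway directory. The following is only a minimal sketch: the file names and the temporary working directory are made up for illustration.

```shell
# work in a temporary directory so nothing in your real folders is touched
WORKDIR=$(mktemp -d)
cd "$WORKDIR"

# create four small example files (hypothetical names)
printf "one\n" > file1.txt
printf "two\n" > file2.txt
printf "three\n" > file3.txt
printf "ten\n" > file10.txt

# '?' matches exactly one character: file1.txt, file2.txt, file3.txt
cat file?.txt > merged_single.out

# '*' matches any number of characters: file10.txt is included as well
cat file*.txt > merged_all.out

# count the lines in both results
wc -l < merged_single.out
wc -l < merged_all.out
```

The output files deliberately end in `.out` so that they cannot be matched by the `file*.txt` pattern themselves.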

Practices and examples

Here are some example commands and things you can try on your own. Always remember:

  • use auto-completion as often as possible: press the tab key to get suggestions for the command you are typing and to auto-complete folder names, file names and paths. It's much faster and prevents typos!
  • avoid whitespace in all folder and file names! Whitespace is allowed, but it will complicate your work on a Linux system. Use -, _, etc. instead, e.g. new-file.txt
  • always be careful when you delete a folder or file! There is no recycle bin on the command line, so it's not as easy as on Windows to get your data back!
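
To see why whitespace in file names is annoying and how to delete a little more carefully, here is a small sketch you can run safely. It assumes nothing but a temporary directory; the file names are only examples.

```shell
# work in a temporary directory so you can experiment safely
WORKDIR=$(mktemp -d)
cd "$WORKDIR"

# a file name with whitespace has to be quoted every time
touch "my file.txt"

# without quotes the shell sees two separate names, 'my' and 'file.txt'
ls my file.txt 2> /dev/null || echo "not found without quotes"

# with quotes it works
ls "my file.txt"

# the same name with a dash needs no special treatment
touch my-file.txt
ls my-file.txt

# rm deletes immediately, so -i asks for confirmation before each deletion
# (here the answer 'y' is piped in so the example runs without typing)
printf "y\n" | rm -i my-file.txt
```

In an interactive terminal you would simply run `rm -i my-file.txt` and answer the question yourself.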

Now, open a terminal and try the following commands.

# when opening a new terminal, you always start in your home directory

# the following command shows the path of the directory you are currently in (remember the tree-like structure of folders on a Linux system!)
pwd

# create a new directory
mkdir testdir

# change into that new directory
cd testdir

# check where you are located now
pwd

# generate a new empty file
touch genome.fasta

# list content of the current directory
ls

# list more details, in a human readable format
ls -lah

# write some content into that file
printf ">Sequence\nATCGTACGTACGTAC" > genome.fasta

# check content of the file
cat genome.fasta

# change to your home directory
#     ~ is short for your home directory (usually /home/$USER)
cd ~

# check again the content of the file you created
# now you have to type the path to the file! Use auto-complete! Here we use a so-called relative path (relative to the directory you are in)
cat testdir/genome.fasta

# you can also use the absolute path
cat /home/$USER/testdir/genome.fasta

# Hint: $USER is a so-called variable. To see the content of a variable you can also use echo:
echo $USER

# the shell stores the name of the user running the current session in $USER. You can also define your own variables, for example you could store the absolute path to your file in a variable for easier reuse:
GENOME=/home/$USER/testdir/genome.fasta
cat $GENOME

# please note that we always use a leading $ sign when we want to access the content of a variable! See the difference:
echo GENOME
echo $GENOME

# generate another file
touch genome2.fasta

# copy the file to the test folder
cp genome2.fasta testdir/

# list the content of the test folder
ls -lah testdir/

# remove the original file we just generated in your home dir 
rm genome2.fasta

# is it gone?
ls -lah

# however, remember we copied the file so a copy of the file we just deleted is still in the test folder
ls -lah testdir/
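
The absolute paths above assume that your home directory is /home/$USER. On some systems (e.g. clusters) it is located somewhere else, so a slightly more robust variant is to use the variable `$HOME`. The sketch below also shows why it is a good habit to put variables in double quotes; the path stored in `GENOME` is just the example file from above and does not need to exist.

```shell
# $HOME holds the path of your home directory, wherever it is located
echo "$HOME"

# store a path in your own variable
GENOME="$HOME/testdir/genome.fasta"

# double quotes keep the path in one piece, even if it contains whitespace
echo "$GENOME"

# dirname and basename split a path into its folder and its file name
dirname "$GENOME"
basename "$GENOME"

# single quotes prevent the variable from being replaced by its content
echo '$GENOME'
```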

Install conda

  • Conda is a package manager that helps us install bioinformatics tools and handles their dependencies automatically
  • In the terminal enter:
# Download conda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh 

# Run conda installer
bash Miniconda3-latest-Linux-x86_64.sh
# Use space to scroll down the license agreement
# then type 'yes'
# accept the default install location with ENTER
# when asked whether to initialize Miniconda3 type 'yes'
# ATTENTION: the space in your home directory might be limited (e.g. 10 GB), and by default conda installs environments and tools inside your home directory (e.g. in ~/miniconda3/envs or ~/.conda/envs)
# Thus, keep an eye on your disk space, especially on a high-performance cluster!

# Now start a new shell or simply reload your current shell via
bash

# You should now be able to create environments, install tools and run them
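
To keep an eye on the disk space mentioned in the warning above, something like the following sketch can help. The two folder names are an assumption (the usual default locations of a Miniconda installation); adjust them if you chose a different install location.

```shell
# free and used space on the file system that holds your home directory
df -h "$HOME"

# size of the conda-related folders, if they exist
# (assumption: default locations ~/miniconda3 and ~/.conda)
for dir in "$HOME/miniconda3" "$HOME/.conda"; do
    if [ -d "$dir" ]; then
        du -sh "$dir"
    else
        echo "$dir does not exist (yet)"
    fi
done
```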
  • Set up conda
# add repository channels for bioconda
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
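
Each `conda config --add channels` call puts the new channel at the top of the list, so the channel added last (here conda-forge) ends up with the highest priority. A quick way to check the result is sketched below; the `if` guard is only there so the snippet does not fail on a machine without conda.

```shell
# only run the conda commands if conda is available in this shell
if command -v conda > /dev/null 2>&1; then
    # show the configured channels, highest priority first
    conda config --show channels
    # show how strictly conda follows this order
    conda config --show channel_priority
else
    echo "conda not found, install it first"
fi
```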
  • Create and activate a new conda environment
# -n parameter to specify the name of your environment
conda create -n nanoplot

# activate this environment
conda activate nanoplot

# You should now see (nanoplot) at the start of each line.
# You switched from the default 'base' environment to the 'nanoplot' environment.

Hint 1: You can create as many environments as you want! It is often convenient to have separate environments for separate tasks, pipelines, or even tools.
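
With several environments it is easy to lose track, so here is a small sketch for getting an overview. The commands in the comments are meant to be typed by hand in an interactive terminal; the environment name is the example from above.

```shell
# only run the conda command if conda is available in this shell
if command -v conda > /dev/null 2>&1; then
    # list all environments; the active one is marked with '*'
    conda env list
else
    echo "conda not found, install it first"
fi

# further commands to try by hand:
#   conda deactivate                # leave the currently active environment
#   conda env remove -n nanoplot    # delete the environment 'nanoplot' and all tools in it
```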

Hint 2: An often faster and more stable alternative to conda is mamba. Funnily enough, mamba can be installed via conda and then used in the same way. Just replace conda with mamba in your commands (as shown in the bioinformatics tool slides, linked below).

Install and use analysis tools

  • Here we just install one example tool (NanoPlot) in the environment we just created
# make sure the 'nanoplot' environment is activated!
conda activate nanoplot
conda install nanoplot
# test the tool you just installed
NanoPlot --help
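
If the tool is not found, the most common reason is that the environment is not activated. A minimal sketch for checking this is shown below; the variable name `TOOL` is only used for illustration.

```shell
# name of the program we want to check
TOOL=NanoPlot

# command -v prints the path of a program if the shell can find it
if command -v "$TOOL" > /dev/null 2>&1; then
    command -v "$TOOL"
    # print the installed version
    "$TOOL" --version
else
    echo "$TOOL not found: is the 'nanoplot' environment activated?"
fi
```

The printed path should point into the folder of your 'nanoplot' environment.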

You can also install specific versions of a tool!

  • important for full reproducibility
  • e.g. conda install nanoplot=1.40.0 (the version number is given without a leading 'v')
  • by default, conda will try to install the newest tool version based on your configured channels and system architecture
  • you can also create a new environment and install a tool in one step, for example: conda create -n minimap2 -c bioconda minimap2
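
For reproducibility it also helps to write down which versions actually ended up in an environment. The following sketch assumes that conda is installed; the file name `environment.yml` is just a common choice, not a requirement.

```shell
# file that will hold the description of the environment (example name)
ENV_FILE="environment.yml"

# only run the conda commands if conda is available in this shell
if command -v conda > /dev/null 2>&1; then
    # list all packages and versions in the currently active environment
    conda list
    # write the full description of the active environment into a file
    conda env export > "$ENV_FILE"
else
    echo "conda not found, install it first"
fi

# on another machine the environment can be recreated from this file with:
#   conda env create -f environment.yml
```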