---
tags: 'workshops'
---
2023-06 Nanopore Workshop, MISSION 2023 -- Bioinformatics part
===
###### tags: `bioinformatics` `sequencing` `workshop` `nanopore` `bacteria`
**Martin Hölzer, Hugues Richard**
MF1, _Experimental & Bioinformatics Innovations_
[toc]
# Day #1
## General information & Setup
* this document is publicly available here: https://hackmd.io/@GqOnlbqgSdKAMwgCUU_ljQ/rJQDt4EPh
* Refresh this document throughout to always see the newest version
* Please raise your hand to ask a question at any time
* We continue to work with the laptops you already used in the Illumina part
* In the future, you might also use the HPC (High-performance Cluster) at RKI, which we can also give a try during the course
#### Short Linux and bash re-cap
* Linux/Bash basics + conda setup ([slides](https://docs.google.com/presentation/d/14W8YPnMPd0GUmL6HvzHJTCjrigfbz9z1EUHr2P9-200/edit?usp=sharing))
* [Introduction to the UNIX command line](https://ngs-docs.github.io/2021-august-remote-computing/introduction-to-the-unix-command-line.html)
* small Bash cheat sheet:
````bash=
# change directory
cd /home/$USER
# show content of current directory
ls
# make a new directory called 'myfolder'
mkdir myfolder
# make conda environment and activate it
conda create -n nanoplot
conda activate nanoplot
# run a program
NanoPlot reads.fq.gz ...
````
#### Install conda (should be already done, skip)
* Conda is a packaging manager that will help us to install bioinformatics tools and to handle their dependencies automatically
* In the terminal enter:
````bash=
# Switch to a directory with enough space
cd /scratch/$USER
# make a new folder called 'workshop'
mkdir workshop
# switch to this folder
cd workshop
# Download conda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Run conda installer
bash Miniconda3-latest-Linux-x86_64.sh
# Use space to scroll down the license agreement
# then type 'yes'
# accept the default install location with ENTER
# when asked whether to initialize Miniconda3 type 'yes'
# ATTENTION: the space in your home directory might be limited (e.g. 10 GB) and per default conda installs tools into ~/.conda/envs
# Thus, take care of your disk space!
# Now start a new shell or simply reload your current shell via
bash
# You should now be able to create environments, install tools and run them
````
* Set up conda
````bash=
# add repository channels for bioconda
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
````
* Create and activate a new conda environment
````bash=
# -n parameter to specify the name
conda create -n workshop
# activate this environment
conda activate workshop
# You should now see (workshop) at the start of each line.
# You switched from the default 'base' environment to the 'workshop' environment.
````
__Hint:__ An often faster and more stable alternative to `conda` is `mamba`. Funningly, `mamba` can be installed via `conda` and then used in the similar way. Just replace `conda` then with `mamba` (like shown in the bioinformatics tool slides, linked below).
## Bacterial _de novo_ genome assembly from Nanopore data
__[Slides: Intro](https://docs.google.com/presentation/d/1hb3P6RIPsmyRiJkZoSQHQMf6FgFJrORhB4Pjk0FZTO4/edit?usp=sharing)__
#### Install and use analysis tools
* Bioinformatics tools overview ([slides](https://docs.google.com/presentation/d/1I3Z2ArbBWAAm3NCVLbldcaByXvVAamyWSe2T-8Q-tDo/edit?usp=sharing))
* some command are a bit different to the examples here
* also keep in mind that tools are regulary updated and input parameters can change (use `--help` or `-h` to see the manual for a tool!)
* Install most of them into our environment
````bash=
# in activated 'workshop' enviroment!
conda create -n workshop # if not already created
conda activate workshop
conda install nanoplot filtlong flye bandage minimap2 tablet racon samtools igv
# test
NanoPlot --help
flye --version
````
__Reminder: You can also install specific versions of a tool!__
* important for full reproducibility
* e.g. `conda install flye=2.9.0`
* per default, `conda` will try to install the newest tool version based on your configured channels and system architecture and dependencies to other tools
#### Get some example long-read data
We will get some example data and save the path to the read file in a variable called `SAMPLE`. __This is important__! We will use from now on this variable to refer to the read file when we start analyzing it.
For example, find some public Nanopore read data for _E. coli_ on [ENA](https://www.ebi.ac.uk/ena/browser/view/SRR12012232).
````bash=
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR120/032/SRR12012232/SRR12012232_1.fastq.gz
ls -lah SRR12012232_1.fastq.gz
# You just downloaded 397MB of compressed Nanopore raw reads
SAMPLE=`pwd`/SRR12012232_1.fastq.gz
````
### Quality control (NanoPlot)
```bash=
NanoPlot -t 4 --fastq $SAMPLE --title "Raw reads" \
--color darkslategrey --N50 --plots hex --loglength -f png -o nanoplot/raw
```
[Publication](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty149/4934939) | [Code](https://github.com/wdecoster/NanoPlot)
### Read filtering (Filtlong)
```bash=
# we use quite strict parameters here to reduce the sample size and be faster with the assembly. ATTENTION: for your "real" samples other parameters might be better
filtlong --min_length 5000 --keep_percent 90 \
--target_bases 500000000 $SAMPLE > clean_reads.fq
NanoPlot -t 4 --fastq clean_reads.fq --title "Filtered reads" \
--color darkslategrey --N50 --plots hex --loglength -f png -o nanoplot/clean
```
[Code](https://github.com/rrwick/Filtlong)
### Assembly (Flye)
```bash=
flye --nano-raw clean_reads.fq -o flye_output -t 8 --meta --genome-size 5M
cp flye_output/assembly.fasta assembly-long.fasta
```
[Publication](https://doi.org/10.1038/s41587-019-0072-8) | [Code](https://github.com/fenderglass/Flye)
### Visualization of assembly (Bandage)
```bash=
# open the GUI
Bandage &
# load graph file generated by flye:
-> flye_output/assembly_graph.gfa
# click "draw graph"
```
[Publication](http://bioinformatics.oxfordjournals.org/content/31/20/3350) | [Code](https://rrwick.github.io/Bandage/)
__Tools that have a graphical user interface can cause problems on a cluster machine__.
Alternative:
* go to https://rrwick.github.io/Bandage/
* download the correct version for your Operating system, e.g.
* download Windows version
* or do a `wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Windows_v0_8_1.zip`
* unzip
* start `Bandage.exe`
* load graph file produced by flye and "Draw graph"
## Homework
For the following tasks, you can use Nanopore FASTQ data of _Salmonella_ from the European Nucleotide Archive (ENA, project ID https://www.ebi.ac.uk/ena/browser/view/PRJNA887350). The Nanopore data corresponds to the Illumina samples you already worked on. There are three Nanopore samples, you can work on all of them or pick one! The data is a bit older, from 2019 and was sequenced on a MinION flow cell (FLO-MIN106) . Basecalling was done with the `FAST` basecalling model. **You should also find the FASTQ files on your laptops, in case the internet is too slow for downloading.**
* 8640-Nanopore, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR218/090/SRR21833890/SRR21833890_1.fastq.gz (928 MB file size)
* 9866-12-Nanopore, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR218/071/SRR21833871/SRR21833871_1.fastq.gz (2.6 GB)
* 8640-41-Nanopore, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR218/078/SRR21833878/SRR21833878_1.fastq.gz (1.5 GB)
Now, perform quality control on the long reads. How does your data look like? How long are the reads? Do you see any problems? Also try `fastqc` with the appropriate long-read parameter on your sample. What is the GC content? Does it match the expected GC content of _Salmonella_?
Next, filter the long reads via `filtlong`. Choose appropriate parameters. When you run QC again, how does the data compare to the raw data?
Now we want to assemble the genome. Read the [`flye` paper](https://www.nature.com/articles/s41587-019-0072-8) (**Maybe first start the assembly, then read the paper while it is running**). Install `flye` and run on the filtered reads. Investigate the results via `Bandage`. How good is your assembly?
Finally, annotate genes in your assembly like learned for Illumina data before (e.g. `Prokka`, `Bakta`, `Abricate` ...). How many genes do you find? Can you compare that to Illumina? Is it better? Worse?
**Bonus**: Try a different assembly tool, e.g. other long-read assemblers are given and compared here: https://www.frontiersin.org/articles/10.3389/fmicb.2022.796465/full
# Day #2
## Re-cap content and tasks from yesterday
Explain Hugues what you learned. : )
How good are your assemblies? How many genes did you find? Any problems with the tools?
## Bacterial _de novo_ assembly improvement and variant calling from Nanopore data
* Bioinformatics tools overview ([slides](https://docs.google.com/presentation/d/1I3Z2ArbBWAAm3NCVLbldcaByXvVAamyWSe2T-8Q-tDo/edit?usp=sharing))
* some command are a bit different to the examples here
* also keep in mind that tools are regulary updated and input parameters can change (use `--help` or `-h` to see the manual for a tool!)
### Mapping (minimap2)
```bash=
minimap2 -ax map-ont assembly-long.fasta clean_reads.fq > mapping.sam
```
[Publication](https://doi.org/10.1093/bioinformatics/bty191) | [Code](https://github.com/lh3/minimap2)
Check the [SAM format specification](https://samtools.github.io/hts-specs/SAMv1.pdf).
### Visualization of mapping (IGV)
```bash!
samtools view -bS mapping.sam | samtools sort -@ 8 > mapping.sorted.bam
samtools index mapping.sorted.bam
# start IGV browser
igv &
```
### Visualization of mapping (Tablet)
```bash=
# open the GUI
tablet &
# load mapping file as 'primary assembly'
-> mapping.sam
# load assembly file as 'Reference/consensus file'
-> assembly-long.fasta
```
[Publication](http://dx.doi.org/10.1093/bib/bbs012) | [Code](https://ics.hutton.ac.uk/tablet/)
__An alterantive nice way to visualize such a mapping is given by Geneious! Or IGV (Integrative Genomics Viewer)__
### Assembly polishing (Racon)
```bash=
# run racon
racon -t 4 clean_reads.fq mapping.sam assembly-long.fasta > consensus-long.fasta
# map to new consensus
minimap2 -ax map-ont consensus-long.fasta clean_reads.fq > consensus_mapping.sam
# now look at it in tablet again
```
[Publication](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411768/) | [Code](https://github.com/isovic/racon)
* a common procedure is 2x racon polishing followed by 1x Medaka (see below)
### Assembly polishing and final consensus (Medaka)
* Make a new environment for `medaka`
* `medaka` might have many dependencies that conflict
* an alternative to `conda` is `mamba`
* `mamba` can be much faster in solving your environment, e.g. here for the tool `medaka`
* thus, let us install `mamba` via `conda` and then install `medaka`
````bash=
conda create -n medaka mamba python=3.8
conda activate medaka
mamba install -c bioconda medaka
````
````bash=
# Run Medaka
# ATTENTION: it is always good to assign an appropriate Medaka model based on
# the performed basecalling! Here, we dont do that for illustration
# (and because the RKI HPC is restricted and can not download the model when first used ...)
medaka_consensus -i clean_reads.fq -d consensus-long.fasta -o medaka -t 4
# Exercise: look at it in tablet
# Hint: first need a mapping to the new consensus
````
[Code](https://github.com/nanoporetech/medaka)
## Homework
* continue with the same Nanopore data from yesterday
* polish your assembly from yesterday. Did the per-base quality improve? Annotate genes again! How many genes (CDS) do you find now?
* Now, call variants for your Nanopore sample in comparison to a reference sequence
* To do so, download a reference genome for _Salmonella_ from [NCBI](https://www.ncbi.nlm.nih.gov/genome/)
* use `Medaka` now for variant calling and not directly for consensus calculation, here are some hints (that you can/must adjust! Check also https://github.com/nanoporetech/medaka):
```bash=
# Call variants with Medaka
#__Important__: Always use the matching `medaka` model based on how the `guppy` basecalling was done! You can check which `medaka` models are available via:
medaka tools list_models | grep -v Default
# first, use the `medaka consensus` command like before
# for the Salmonelle ONT data from 2019 MinION device was used and the FAST model
medaka consensus --model r941_min_sup_g507 --threads 4 --chunk_len 800 --chunk_ovlp 400 <your-assembl> output.consensus.hdf
# actually call the variants
medaka variant <reference-genome> output.consensus.hdf medaka.vcf
```
* inspect the resulting [VCF file](https://www.ebi.ac.uk/training/online/courses/human-genetic-variation-introduction/variant-identification-and-analysis/understanding-vcf-format/), read about the format and its structure
* can you find the called variants that you see in the VCF file also in a genome browser, when you load the mapping file? Do you see any differences/problems?
# Appendix
## Contig taxonomic classification (Sourmash)
* use `Sourmash` to quickly compare your assembled sequences against known ones
* an alternative could be `kraken2` on read level, for example
```bash=
conda activate workshop # we installed sourmash before already
# download sourmash GTDB index v202 w/ taxonomy
# GTDB genomic representatives (47.8k genomes), LCA, kmer 31 --> https://sourmash.readthedocs.io/en/latest/databases.html
# already downloaded and placed here:
DB='/scratch/autumn_school/gtdb-rs202.genomic-reps.k31.lca.json.gz'
# calculate signatures for your genome
sourmash compute --scaled 10000 -k 31 medaka/consensus.fasta -o sourmash.sig
sourmash lca classify --db ${DB} --query sourmash.sig -o taxonomic-classification.txt
# check results quickly
cat taxonomic-classification.txt
```
[Code](https://github.com/sourmash-bio/sourmash) | [Publication](https://joss.theoj.org/papers/10.21105/joss.00027)
## Anti-microbial resistance gene detection (ABRicate)
* annotation of AMR genes, as an example annotation task for the fresh genome assembly
* Make a new environment for __ABRicate__
```bash=
conda deactivate
conda create -n abricate_env abricate
conda activate abricate_env
```
[Code](https://github.com/tseemann/abricate)
```bash=
abricate medaka/consensus.fasta
```
## Clean-up
* we only did a few commands and you can already see that good folder structure and file management are key
* also, delete/zip files that you dont need (intermediate results) and that can be easily recalculated (space is always limited)
* it is also a good idea to clean-up your conda installations once in a while:
* `conda clean --all` will remove unused packages and installation files, you can still run all tools
## Bioinformatics Analysis: _de novo_ genome assembly from Illumina data
This course is mainly focused on long-read Nanopore data but here also some commands are given to analyse short-read Illumina data.
#### Install and use analysis tools
* Install most of them into our environment
````bash=
# create a new environment and install tools simultaneously
conda create -n workshop-short fastqc fastp spades bandage bwa tablet sourmash
conda activate workshop-short
# test
fastqc --help
spades --version
````
#### Get some example short-read data or directly use data generated during the course
We will get some example data and save the path to the __paired-end__ read files in variables called `SAMPLE_R1` and `SAMPLE_R2`. __This is important__! We will use from now on these variables to refer to the read files when we start analyzing the sample.
For example, find some public Illumina read data for _E. coli_ on [ENA](https://www.ebi.ac.uk/ena/browser/view/SRR1928200) project.
````bash=
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR192/000/SRR1928200/SRR1928200_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR192/000/SRR1928200/SRR1928200_2.fastq.gz
ls -lah SRR1928200*.fastq.gz
# You just downloaded 2x ~670MB of compressed Illumina raw reads in paired-end format
SAMPLE_R1=`pwd`/SRR1928200_1.fastq.gz
SAMPLE_R2=`pwd`/SRR1928200_2.fastq.gz
````
__Or directly use the data generated during the course!__
```bash=
# get data generated during the course, for example:
SAMPLE_R1=/scratch/autumn_school/Illumina_Sequenzen/renamed/group01/illumina_IMT44627_R1.fastq.gz
SAMPLE_R2=/scratch/autumn_school/Illumina_Sequenzen/renamed/group01/illumina_IMT44627_R2.fastq.gz
# this is important! We store the path to the read file in a variable called "SAMPLE" that we will use from now on
```
#### Quality control (FastQC)
```bash=
fastqc -t 2 $SAMPLE_R1 $SAMPLE_R2
```
[Publication](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
#### Read quality trimming (fastp)
```bash=
fastp -i $SAMPLE_R1 -I $SAMPLE_R2 -o clean_reads.R1.fastq.gz -O clean_reads.R2.fastq.gz --thread 4 -M 28 --cut_right
fastqc -t 2 clean_reads.R{1,2}.fastq.gz
```
[Publication](https://doi.org/10.1093/bioinformatics/bty560) | [Code](https://github.com/OpenGene/fastp)
#### Assembly (SPAdes)
```bash=
spades --careful --cov-cutoff auto -1 clean_reads.R1.fastq.gz -2 clean_reads.R2.fastq.gz -t 8 -o spades_output
cp spades_output/scaffolds.fasta assembly-short.fasta
```
[Publication](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3342519/) | [Code](https://github.com/ablab/spades)
#### Visualization of assembly (Bandage)
See long-read analyses part. `SPAdes` also produces a graph file you can visualize with `Bandage`: e.g. `assembly_graph_with_scaffolds.gfa`.
#### Mapping (BWA)
* here we will use `bwa` for short-read mapping to the reconstructed consensus
* and directly pipe the output in SAM format into another tool called `samtools` to directly generate a BAM file
* BAM is the binary format of SAM and much smaller and easier to handle by certain downstream software tools
```bash=
# generate an index for the assembly that will be used for the mapping algorithm
bwa index assembly-short.fasta
# map and convert the SAM to BAM output
bwa mem -t 6 assembly-short.fasta clean_reads.R1.fastq.gz clean_reads.R2.fastq.gz | samtools view -Sb -@ 4 -o mapping-short
```
[Publication](http://arxiv.org/abs/1303.3997) | [Code](https://github.com/lh3/bwa)
#### Visualization of mapping (Tablet)
See long-read analyses part.
#### Clean-up
See long-read analyses part.
## Comparison with Illumina data
There are different ways, metrics and tools to compare short- and long-read data. Here, we will focus on the comparison of the resulting assemblies calculated before.
__Here you will also find the pre-calculated short-read assemblies based on the commands above:__
```bash
## Group 01
ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_IMT44627/assembly-short.fasta
ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_IMT46687/assembly-short.fasta
## Group 02
ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_DSM30054/assembly-short.fasta
ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_IMT44628/assembly-short.fasta
## Group 03
ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_49658-1/assembly-short.fasta
ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_IMT44666/assembly-short.fasta
## Group 04
ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_IMT44660/assembly-short.fasta
ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_49658-3/assembly-short.fasta
```
#### Compare assembly stats (QUAST)
```bash=
# create a new env and install quast
conda create -n workshop-compare quast
conda activate workshop-compare
# select assemblies to compare, e.g.
ASSEMBLY_LR=assembly-long.fasta
#ASSEMBLY_LR_MEDAKA=medaka/consensus.fasta
echo $ASSEMBLY_SR # see list above to set this variable, or use something you calculated
quast $ASSEMBLY_SR $ASSEMBLY_LR $ASSEMBLY_LR_MEDAKA
```
[Publication](https://pubmed.ncbi.nlm.nih.gov/23422339/) | [Code](https://github.com/ablab/quast)
#### Compare number of complete ORFs via IDEEL
A neat way to do so is using so-called IDEEL plots as [initially proposed by Mick Watson](http://www.opiniomics.org/a-simple-test-for-uncorrected-insertions-and-deletions-indels-in-bacterial-genomes/). Different people implemented this approach, for example [github.com/phiweger/ideel](https://github.com/phiweger/ideel).
Here, we will combine various tasks to get this approach running.
Lets first install the code from [this repository](https://github.com/phiweger/ideel) and the necessary dependencies via `mamba`. Then, we generate the protein database index, prepare the input genome files and run the workflow.
```bash
git clone https://github.com/phiweger/ideel.git
cd ideel
conda create -n ideel mamba
conda activate ideel
mamba install snakemake prodigal diamond r-ggplot2 r-readr
# get a reference database of protein sequences
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
# generate an index with diamond
diamond makedb --in uniprot_sprot.fasta.gz -d database_uniprot
# edit the config.json file that was copied from the ideel github repository
# specifying e.g. the path to the Diamond database.
# open the config.json with any texteditor
# The output of the workflow will be written to --directory.
mkdir ideel-results
# In there, make a directory called "genomes"
mkdir ideel-results/genomes
# put assemblies in there with .fa file extension
cp /scratch/$USER/workshop/*.fasta ideel-results/genomes/
# The workflow wants the files to have .fa instead of .fasta!
rename 's/\.fasta$/.fa/' ideel-results/genomes/*.fasta
# run the workflow
snakemake --configfile config.json --directory ideel-results --cores 4
```
##### Short-read assembly

##### Long-read assembly (no polishing)

##### Long-read assembly (with long-read polishing)

## Hybrid assembly
When you have both Nanopore and Illumina data available for the same sample it can be worth to assemble the data "together". Commonly used for this task is `Unicycler`
__Exercise__
* make a `conda` environment and install `Unicycler`
* check out the [`Unicycler` code repository](https://github.com/rrwick/Unicycler)
* .. and the `--help` to learn how to use the tool
* provide short and long reads as input
* you can decide to use the raw or length-filtered long reads
* In the section above about Illumina short-read assembly you will find the paths to the Illumina reads
* how does the resulting assembly compare to the short- and long-read-only assemblies?
---
## Additional test data files
#### Artificial data sets
These data sets contain combined data of a groundwater metagenome, a _Chlamydia_ singlestrain, and a _Mycoplasma bovis_ single strain.
* Large: [fastq.gz (390 MB)](https://www.rna.uni-jena.de/supplements/nanolog/reads/meta.fastq.gz)
* Smaller: [fastq.gz (166 MB)](https://www.rna.uni-jena.de/supplements/nanolog/reads/meta10k.fastq.gz)
The above data sets are combined from these files:
- groundwater [fastq.gz (110 MB)](https://www.rna.uni-jena.de/supplements/nanolog/reads/aquadiva.fastq.gz)
- _M. bovis_: [fastq.gz (90 MB)](https://www.rna.uni-jena.de/supplements/nanolog/reads/myco.fastq.gz)
- _Chlamydia_: [fastq.gz (192 MB)](https://www.rna.uni-jena.de/supplements/nanolog/reads/chlamy.fastq.gz)
#### Artificial data sets with pre-selected long reads (>= 20k)
* Large [fastq.gz (318 MB)](https://www.rna.uni-jena.de/supplements/nanolog/new/aq4k_my4k_ch10k.fastq.gz), runs longer [40min on 24 cores], 50 contigs with genome size 10M
* Medium_1 [fastq.gz (116 MB)](https://www.rna.uni-jena.de/supplements/nanolog/new/aq500_my500_ch500.fastq.gz), 33 contigs with genome size 10M
* Medium_2 [fastq.gz (60 MB)](https://www.rna.uni-jena.de/supplements/nanolog/new/aq250_my250_ch250.fastq.gz), 12 contigs with genome size 10M
* Small [fastq.gz (24 MB)](https://www.rna.uni-jena.de/supplements/nanolog/new/aq400_my400_ch400.fastq.gz), fast and 3 contigs with genome size 10M