2023-06 Nanopore Workshop, MISSION 2023 -- Bioinformatics part

--- tags: 'workshops' --- 2023-06 Nanopore Workshop, MISSION 2023 -- Bioinformatics part === ###### tags: `bioinformatics` `sequencing` `workshop` `nanopore` `bacteria` **Martin Hölzer, Hugues Richard** MF1, _Experimental & Bioinformatics Innovations_ [toc] # Day #1 ## General information & Setup * this document is publicly available here: https://hackmd.io/@GqOnlbqgSdKAMwgCUU_ljQ/rJQDt4EPh * Refresh this document throughout to always see the newest version * Please raise your hand to ask a question at any time * We continue to work with the laptops you already used in the Illumina part * In the future, you might also use the HPC (High-performance Cluster) at RKI, which we can also give a try during the course #### Short Linux and bash re-cap * Linux/Bash basics + conda setup ([slides](https://docs.google.com/presentation/d/14W8YPnMPd0GUmL6HvzHJTCjrigfbz9z1EUHr2P9-200/edit?usp=sharing)) * [Introduction to the UNIX command line](https://ngs-docs.github.io/2021-august-remote-computing/introduction-to-the-unix-command-line.html) * small Bash cheat sheet: ````bash= # change directory cd /home/$USER # show content of current directory ls # make a new directory called 'myfolder' mkdir myfolder # make conda environment and activate it conda create -n nanoplot conda activate nanoplot # run a program NanoPlot reads.fq.gz ... ```` #### Install conda (should be already done, skip) * Conda is a packaging manager that will help us to install bioinformatics tools and to handle their dependencies automatically * In the terminal enter: ````bash= # Switch to a directory with enough space cd /scratch/$USER # make a new folder called 'workshop' mkdir workshop # switch to this folder cd workshop # Download conda installer wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh # Run conda installer bash Miniconda3-latest-Linux-x86_64.sh # Use space to scroll down the license agreement # then type 'yes' # accept the default install location with ENTER # when asked whether to initialize Miniconda3 type 'yes' # ATTENTION: the space in your home directory might be limited (e.g. 10 GB) and per default conda installs tools into ~/.conda/envs # Thus, take care of your disk space! # Now start a new shell or simply reload your current shell via bash # You should now be able to create environments, install tools and run them ```` * Set up conda ````bash= # add repository channels for bioconda conda config --add channels defaults conda config --add channels bioconda conda config --add channels conda-forge ```` * Create and activate a new conda environment ````bash= # -n parameter to specify the name conda create -n workshop # activate this environment conda activate workshop # You should now see (workshop) at the start of each line. # You switched from the default 'base' environment to the 'workshop' environment. ```` __Hint:__ An often faster and more stable alternative to `conda` is `mamba`. Funningly, `mamba` can be installed via `conda` and then used in the similar way. Just replace `conda` then with `mamba` (like shown in the bioinformatics tool slides, linked below). ## Bacterial _de novo_ genome assembly from Nanopore data __[Slides: Intro](https://docs.google.com/presentation/d/1hb3P6RIPsmyRiJkZoSQHQMf6FgFJrORhB4Pjk0FZTO4/edit?usp=sharing)__ #### Install and use analysis tools * Bioinformatics tools overview ([slides](https://docs.google.com/presentation/d/1I3Z2ArbBWAAm3NCVLbldcaByXvVAamyWSe2T-8Q-tDo/edit?usp=sharing)) * some command are a bit different to the examples here * also keep in mind that tools are regulary updated and input parameters can change (use `--help` or `-h` to see the manual for a tool!) * Install most of them into our environment ````bash= # in activated 'workshop' enviroment! conda create -n workshop # if not already created conda activate workshop conda install nanoplot filtlong flye bandage minimap2 tablet racon samtools igv # test NanoPlot --help flye --version ```` __Reminder: You can also install specific versions of a tool!__ * important for full reproducibility * e.g. `conda install flye=2.9.0` * per default, `conda` will try to install the newest tool version based on your configured channels and system architecture and dependencies to other tools #### Get some example long-read data We will get some example data and save the path to the read file in a variable called `SAMPLE`. __This is important__! We will use from now on this variable to refer to the read file when we start analyzing it. For example, find some public Nanopore read data for _E. coli_ on [ENA](https://www.ebi.ac.uk/ena/browser/view/SRR12012232). ````bash= wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR120/032/SRR12012232/SRR12012232_1.fastq.gz ls -lah SRR12012232_1.fastq.gz # You just downloaded 397MB of compressed Nanopore raw reads SAMPLE=`pwd`/SRR12012232_1.fastq.gz ```` ### Quality control (NanoPlot) ```bash= NanoPlot -t 4 --fastq $SAMPLE --title "Raw reads" \ --color darkslategrey --N50 --plots hex --loglength -f png -o nanoplot/raw ``` [Publication](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty149/4934939) | [Code](https://github.com/wdecoster/NanoPlot) ### Read filtering (Filtlong) ```bash= # we use quite strict parameters here to reduce the sample size and be faster with the assembly. ATTENTION: for your "real" samples other parameters might be better filtlong --min_length 5000 --keep_percent 90 \ --target_bases 500000000 $SAMPLE > clean_reads.fq NanoPlot -t 4 --fastq clean_reads.fq --title "Filtered reads" \ --color darkslategrey --N50 --plots hex --loglength -f png -o nanoplot/clean ``` [Code](https://github.com/rrwick/Filtlong) ### Assembly (Flye) ```bash= flye --nano-raw clean_reads.fq -o flye_output -t 8 --meta --genome-size 5M cp flye_output/assembly.fasta assembly-long.fasta ``` [Publication](https://doi.org/10.1038/s41587-019-0072-8) | [Code](https://github.com/fenderglass/Flye) ### Visualization of assembly (Bandage) ```bash= # open the GUI Bandage & # load graph file generated by flye: -> flye_output/assembly_graph.gfa # click "draw graph" ``` [Publication](http://bioinformatics.oxfordjournals.org/content/31/20/3350) | [Code](https://rrwick.github.io/Bandage/) __Tools that have a graphical user interface can cause problems on a cluster machine__. Alternative: * go to https://rrwick.github.io/Bandage/ * download the correct version for your Operating system, e.g. * download Windows version * or do a `wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Windows_v0_8_1.zip` * unzip * start `Bandage.exe` * load graph file produced by flye and "Draw graph" ## Homework For the following tasks, you can use Nanopore FASTQ data of _Salmonella_ from the European Nucleotide Archive (ENA, project ID https://www.ebi.ac.uk/ena/browser/view/PRJNA887350). The Nanopore data corresponds to the Illumina samples you already worked on. There are three Nanopore samples, you can work on all of them or pick one! The data is a bit older, from 2019 and was sequenced on a MinION flow cell (FLO-MIN106) . Basecalling was done with the `FAST` basecalling model. **You should also find the FASTQ files on your laptops, in case the internet is too slow for downloading.** * 8640-Nanopore, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR218/090/SRR21833890/SRR21833890_1.fastq.gz (928 MB file size) * 9866-12-Nanopore, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR218/071/SRR21833871/SRR21833871_1.fastq.gz (2.6 GB) * 8640-41-Nanopore, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR218/078/SRR21833878/SRR21833878_1.fastq.gz (1.5 GB) Now, perform quality control on the long reads. How does your data look like? How long are the reads? Do you see any problems? Also try `fastqc` with the appropriate long-read parameter on your sample. What is the GC content? Does it match the expected GC content of _Salmonella_? Next, filter the long reads via `filtlong`. Choose appropriate parameters. When you run QC again, how does the data compare to the raw data? Now we want to assemble the genome. Read the [`flye` paper](https://www.nature.com/articles/s41587-019-0072-8) (**Maybe first start the assembly, then read the paper while it is running**). Install `flye` and run on the filtered reads. Investigate the results via `Bandage`. How good is your assembly? Finally, annotate genes in your assembly like learned for Illumina data before (e.g. `Prokka`, `Bakta`, `Abricate` ...). How many genes do you find? Can you compare that to Illumina? Is it better? Worse? **Bonus**: Try a different assembly tool, e.g. other long-read assemblers are given and compared here: https://www.frontiersin.org/articles/10.3389/fmicb.2022.796465/full # Day #2 ## Re-cap content and tasks from yesterday Explain Hugues what you learned. : ) How good are your assemblies? How many genes did you find? Any problems with the tools? ## Bacterial _de novo_ assembly improvement and variant calling from Nanopore data * Bioinformatics tools overview ([slides](https://docs.google.com/presentation/d/1I3Z2ArbBWAAm3NCVLbldcaByXvVAamyWSe2T-8Q-tDo/edit?usp=sharing)) * some command are a bit different to the examples here * also keep in mind that tools are regulary updated and input parameters can change (use `--help` or `-h` to see the manual for a tool!) ### Mapping (minimap2) ```bash= minimap2 -ax map-ont assembly-long.fasta clean_reads.fq > mapping.sam ``` [Publication](https://doi.org/10.1093/bioinformatics/bty191) | [Code](https://github.com/lh3/minimap2) Check the [SAM format specification](https://samtools.github.io/hts-specs/SAMv1.pdf). ### Visualization of mapping (IGV) ```bash! samtools view -bS mapping.sam | samtools sort -@ 8 > mapping.sorted.bam samtools index mapping.sorted.bam # start IGV browser igv & ``` ### Visualization of mapping (Tablet) ```bash= # open the GUI tablet & # load mapping file as 'primary assembly' -> mapping.sam # load assembly file as 'Reference/consensus file' -> assembly-long.fasta ``` [Publication](http://dx.doi.org/10.1093/bib/bbs012) | [Code](https://ics.hutton.ac.uk/tablet/) __An alterantive nice way to visualize such a mapping is given by Geneious! Or IGV (Integrative Genomics Viewer)__ ### Assembly polishing (Racon) ```bash= # run racon racon -t 4 clean_reads.fq mapping.sam assembly-long.fasta > consensus-long.fasta # map to new consensus minimap2 -ax map-ont consensus-long.fasta clean_reads.fq > consensus_mapping.sam # now look at it in tablet again ``` [Publication](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411768/) | [Code](https://github.com/isovic/racon) * a common procedure is 2x racon polishing followed by 1x Medaka (see below) ### Assembly polishing and final consensus (Medaka) * Make a new environment for `medaka` * `medaka` might have many dependencies that conflict * an alternative to `conda` is `mamba` * `mamba` can be much faster in solving your environment, e.g. here for the tool `medaka` * thus, let us install `mamba` via `conda` and then install `medaka` ````bash= conda create -n medaka mamba python=3.8 conda activate medaka mamba install -c bioconda medaka ```` ````bash= # Run Medaka # ATTENTION: it is always good to assign an appropriate Medaka model based on # the performed basecalling! Here, we dont do that for illustration # (and because the RKI HPC is restricted and can not download the model when first used ...) medaka_consensus -i clean_reads.fq -d consensus-long.fasta -o medaka -t 4 # Exercise: look at it in tablet # Hint: first need a mapping to the new consensus ```` [Code](https://github.com/nanoporetech/medaka) ## Homework * continue with the same Nanopore data from yesterday * polish your assembly from yesterday. Did the per-base quality improve? Annotate genes again! How many genes (CDS) do you find now? * Now, call variants for your Nanopore sample in comparison to a reference sequence * To do so, download a reference genome for _Salmonella_ from [NCBI](https://www.ncbi.nlm.nih.gov/genome/) * use `Medaka` now for variant calling and not directly for consensus calculation, here are some hints (that you can/must adjust! Check also https://github.com/nanoporetech/medaka): ```bash= # Call variants with Medaka #__Important__: Always use the matching `medaka` model based on how the `guppy` basecalling was done! You can check which `medaka` models are available via: medaka tools list_models | grep -v Default # first, use the `medaka consensus` command like before # for the Salmonelle ONT data from 2019 MinION device was used and the FAST model medaka consensus --model r941_min_sup_g507 --threads 4 --chunk_len 800 --chunk_ovlp 400 <your-assembl> output.consensus.hdf # actually call the variants medaka variant <reference-genome> output.consensus.hdf medaka.vcf ``` * inspect the resulting [VCF file](https://www.ebi.ac.uk/training/online/courses/human-genetic-variation-introduction/variant-identification-and-analysis/understanding-vcf-format/), read about the format and its structure * can you find the called variants that you see in the VCF file also in a genome browser, when you load the mapping file? Do you see any differences/problems? # Appendix ## Contig taxonomic classification (Sourmash) * use `Sourmash` to quickly compare your assembled sequences against known ones * an alternative could be `kraken2` on read level, for example ```bash= conda activate workshop # we installed sourmash before already # download sourmash GTDB index v202 w/ taxonomy # GTDB genomic representatives (47.8k genomes), LCA, kmer 31 --> https://sourmash.readthedocs.io/en/latest/databases.html # already downloaded and placed here: DB='/scratch/autumn_school/gtdb-rs202.genomic-reps.k31.lca.json.gz' # calculate signatures for your genome sourmash compute --scaled 10000 -k 31 medaka/consensus.fasta -o sourmash.sig sourmash lca classify --db ${DB} --query sourmash.sig -o taxonomic-classification.txt # check results quickly cat taxonomic-classification.txt ``` [Code](https://github.com/sourmash-bio/sourmash) | [Publication](https://joss.theoj.org/papers/10.21105/joss.00027) ## Anti-microbial resistance gene detection (ABRicate) * annotation of AMR genes, as an example annotation task for the fresh genome assembly * Make a new environment for __ABRicate__ ```bash= conda deactivate conda create -n abricate_env abricate conda activate abricate_env ``` [Code](https://github.com/tseemann/abricate) ```bash= abricate medaka/consensus.fasta ``` ## Clean-up * we only did a few commands and you can already see that good folder structure and file management are key * also, delete/zip files that you dont need (intermediate results) and that can be easily recalculated (space is always limited) * it is also a good idea to clean-up your conda installations once in a while: * `conda clean --all` will remove unused packages and installation files, you can still run all tools ## Bioinformatics Analysis: _de novo_ genome assembly from Illumina data This course is mainly focused on long-read Nanopore data but here also some commands are given to analyse short-read Illumina data. #### Install and use analysis tools * Install most of them into our environment ````bash= # create a new environment and install tools simultaneously conda create -n workshop-short fastqc fastp spades bandage bwa tablet sourmash conda activate workshop-short # test fastqc --help spades --version ```` #### Get some example short-read data or directly use data generated during the course We will get some example data and save the path to the __paired-end__ read files in variables called `SAMPLE_R1` and `SAMPLE_R2`. __This is important__! We will use from now on these variables to refer to the read files when we start analyzing the sample. For example, find some public Illumina read data for _E. coli_ on [ENA](https://www.ebi.ac.uk/ena/browser/view/SRR1928200) project. ````bash= wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR192/000/SRR1928200/SRR1928200_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR192/000/SRR1928200/SRR1928200_2.fastq.gz ls -lah SRR1928200*.fastq.gz # You just downloaded 2x ~670MB of compressed Illumina raw reads in paired-end format SAMPLE_R1=`pwd`/SRR1928200_1.fastq.gz SAMPLE_R2=`pwd`/SRR1928200_2.fastq.gz ```` __Or directly use the data generated during the course!__ ```bash= # get data generated during the course, for example: SAMPLE_R1=/scratch/autumn_school/Illumina_Sequenzen/renamed/group01/illumina_IMT44627_R1.fastq.gz SAMPLE_R2=/scratch/autumn_school/Illumina_Sequenzen/renamed/group01/illumina_IMT44627_R2.fastq.gz # this is important! We store the path to the read file in a variable called "SAMPLE" that we will use from now on ``` #### Quality control (FastQC) ```bash= fastqc -t 2 $SAMPLE_R1 $SAMPLE_R2 ``` [Publication](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) #### Read quality trimming (fastp) ```bash= fastp -i $SAMPLE_R1 -I $SAMPLE_R2 -o clean_reads.R1.fastq.gz -O clean_reads.R2.fastq.gz --thread 4 -M 28 --cut_right fastqc -t 2 clean_reads.R{1,2}.fastq.gz ``` [Publication](https://doi.org/10.1093/bioinformatics/bty560) | [Code](https://github.com/OpenGene/fastp) #### Assembly (SPAdes) ```bash= spades --careful --cov-cutoff auto -1 clean_reads.R1.fastq.gz -2 clean_reads.R2.fastq.gz -t 8 -o spades_output cp spades_output/scaffolds.fasta assembly-short.fasta ``` [Publication](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3342519/) | [Code](https://github.com/ablab/spades) #### Visualization of assembly (Bandage) See long-read analyses part. `SPAdes` also produces a graph file you can visualize with `Bandage`: e.g. `assembly_graph_with_scaffolds.gfa`. #### Mapping (BWA) * here we will use `bwa` for short-read mapping to the reconstructed consensus * and directly pipe the output in SAM format into another tool called `samtools` to directly generate a BAM file * BAM is the binary format of SAM and much smaller and easier to handle by certain downstream software tools ```bash= # generate an index for the assembly that will be used for the mapping algorithm bwa index assembly-short.fasta # map and convert the SAM to BAM output bwa mem -t 6 assembly-short.fasta clean_reads.R1.fastq.gz clean_reads.R2.fastq.gz | samtools view -Sb -@ 4 -o mapping-short ``` [Publication](http://arxiv.org/abs/1303.3997) | [Code](https://github.com/lh3/bwa) #### Visualization of mapping (Tablet) See long-read analyses part. #### Clean-up See long-read analyses part. ## Comparison with Illumina data There are different ways, metrics and tools to compare short- and long-read data. Here, we will focus on the comparison of the resulting assemblies calculated before. __Here you will also find the pre-calculated short-read assemblies based on the commands above:__ ```bash ## Group 01 ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_IMT44627/assembly-short.fasta ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_IMT46687/assembly-short.fasta ## Group 02 ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_DSM30054/assembly-short.fasta ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_IMT44628/assembly-short.fasta ## Group 03 ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_49658-1/assembly-short.fasta ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_IMT44666/assembly-short.fasta ## Group 04 ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_IMT44660/assembly-short.fasta ASSEMBLY_SR=/scratch/autumn_school/Illumina_assemblies/illumina_49658-3/assembly-short.fasta ``` #### Compare assembly stats (QUAST) ```bash= # create a new env and install quast conda create -n workshop-compare quast conda activate workshop-compare # select assemblies to compare, e.g. ASSEMBLY_LR=assembly-long.fasta #ASSEMBLY_LR_MEDAKA=medaka/consensus.fasta echo $ASSEMBLY_SR # see list above to set this variable, or use something you calculated quast $ASSEMBLY_SR $ASSEMBLY_LR $ASSEMBLY_LR_MEDAKA ``` [Publication](https://pubmed.ncbi.nlm.nih.gov/23422339/) | [Code](https://github.com/ablab/quast) #### Compare number of complete ORFs via IDEEL A neat way to do so is using so-called IDEEL plots as [initially proposed by Mick Watson](http://www.opiniomics.org/a-simple-test-for-uncorrected-insertions-and-deletions-indels-in-bacterial-genomes/). Different people implemented this approach, for example [github.com/phiweger/ideel](https://github.com/phiweger/ideel). Here, we will combine various tasks to get this approach running. Lets first install the code from [this repository](https://github.com/phiweger/ideel) and the necessary dependencies via `mamba`. Then, we generate the protein database index, prepare the input genome files and run the workflow. ```bash git clone https://github.com/phiweger/ideel.git cd ideel conda create -n ideel mamba conda activate ideel mamba install snakemake prodigal diamond r-ggplot2 r-readr # get a reference database of protein sequences wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz # generate an index with diamond diamond makedb --in uniprot_sprot.fasta.gz -d database_uniprot # edit the config.json file that was copied from the ideel github repository # specifying e.g. the path to the Diamond database. # open the config.json with any texteditor # The output of the workflow will be written to --directory. mkdir ideel-results # In there, make a directory called "genomes" mkdir ideel-results/genomes # put assemblies in there with .fa file extension cp /scratch/$USER/workshop/*.fasta ideel-results/genomes/ # The workflow wants the files to have .fa instead of .fasta! rename 's/\.fasta$/.fa/' ideel-results/genomes/*.fasta # run the workflow snakemake --configfile config.json --directory ideel-results --cores 4 ``` ##### Short-read assembly ![](https://i.imgur.com/TtcnW1I.png) ##### Long-read assembly (no polishing) ![](https://i.imgur.com/k3igrbO.png) ##### Long-read assembly (with long-read polishing) ![](https://i.imgur.com/NUJUo0j.png) ## Hybrid assembly When you have both Nanopore and Illumina data available for the same sample it can be worth to assemble the data "together". Commonly used for this task is `Unicycler` __Exercise__ * make a `conda` environment and install `Unicycler` * check out the [`Unicycler` code repository](https://github.com/rrwick/Unicycler) * .. and the `--help` to learn how to use the tool * provide short and long reads as input * you can decide to use the raw or length-filtered long reads * In the section above about Illumina short-read assembly you will find the paths to the Illumina reads * how does the resulting assembly compare to the short- and long-read-only assemblies? --- ## Additional test data files #### Artificial data sets These data sets contain combined data of a groundwater metagenome, a _Chlamydia_ singlestrain, and a _Mycoplasma bovis_ single strain. * Large: [fastq.gz (390 MB)](https://www.rna.uni-jena.de/supplements/nanolog/reads/meta.fastq.gz) * Smaller: [fastq.gz (166 MB)](https://www.rna.uni-jena.de/supplements/nanolog/reads/meta10k.fastq.gz) The above data sets are combined from these files: - groundwater [fastq.gz (110 MB)](https://www.rna.uni-jena.de/supplements/nanolog/reads/aquadiva.fastq.gz) - _M. bovis_: [fastq.gz (90 MB)](https://www.rna.uni-jena.de/supplements/nanolog/reads/myco.fastq.gz) - _Chlamydia_: [fastq.gz (192 MB)](https://www.rna.uni-jena.de/supplements/nanolog/reads/chlamy.fastq.gz) #### Artificial data sets with pre-selected long reads (>= 20k) * Large [fastq.gz (318 MB)](https://www.rna.uni-jena.de/supplements/nanolog/new/aq4k_my4k_ch10k.fastq.gz), runs longer [40min on 24 cores], 50 contigs with genome size 10M * Medium_1 [fastq.gz (116 MB)](https://www.rna.uni-jena.de/supplements/nanolog/new/aq500_my500_ch500.fastq.gz), 33 contigs with genome size 10M * Medium_2 [fastq.gz (60 MB)](https://www.rna.uni-jena.de/supplements/nanolog/new/aq250_my250_ch250.fastq.gz), 12 contigs with genome size 10M * Small [fastq.gz (24 MB)](https://www.rna.uni-jena.de/supplements/nanolog/new/aq400_my400_ch400.fastq.gz), fast and 3 contigs with genome size 10M