# Taxonomic Classification using EPI2ME wf-metagenomics pipeline
# Link https://bit.ly/NGS_TaxClass
## Setup
Setting up a computational environment is a substantial task in and of itself. **We have done this for you to avoid the need to troubleshoot software installation during the course.**
<img src="https://hackmd.io/_uploads/SJ2hs6jqJg.png" width="300">
However, these are the steps used to setup a work environment to run the EPI2ME wf-metagenomics workflow if you want to repeat it on another computer.
1) #### Install Nextflow
Instructions to install can be found here: https://www.nextflow.io/docs/latest/install.html
This consists of installing an up-to-date version of Java as well as the Nextflow executable.
2) #### Install Docker or Singularity
Docker and Singularity are both containerization technologies that allow software and its dependencies to be packaged as portable units (containers). Singularity is specifically tailored to High Performance Computing (HPC) environments, since Docker requires user privileges that typically cannot be granted on those systems.
<div style="display: flex; align-items: center;">
<img src=https://miro.medium.com/v2/resize:fit:1358/0*O8xKdkdiOOPfYllp.png width="80" />
<p> Docker installation instructions: <a href="https://docs.docker.com/engine/install/" target="_blank"> https://docs.docker.com/engine/install/ </a> </p>
</div>
<div style="display: flex; align-items: center;">
<img src=https://docs.sylabs.io/guides/3.0/user-guide/_static/logo.png width="80" />
<p> Singularity installation instructions: <a href="https://docs.sylabs.io/guides/3.0/user-guide/installation.html" target="_blank"> https://docs.sylabs.io/guides/3.0/user-guide/installation.html </a> </p>
</div>
For Docker you may need to add your user to the docker group using the following command:
```
sudo usermod -aG docker {username} && newgrp docker
```
Also, for macOS, see the Docker-related instructions here: https://labs.epi2me.io/epi2me-docs/installation/ as there are some parameters related to CPU and memory usage that need to be altered.
3) #### Download the pipeline using the following command:
```
nextflow run tslaird/wf-metagenomics --help
```
**Note**: You will be downloading a slightly modified version of the EPI2ME wf-metagenomics workflow. We have modified certain steps in the workflow to output additional files that we can further analyze.
Alternatively, EPI2ME offers a desktop application (https://labs.epi2me.io/downloads/) which can download and run EPI2ME workflows through a graphical user interface (GUI).
4) #### Download and unpack databases
While the workflow can download certain databases itself, we pre-downloaded four of them (from here: https://benlangmead.github.io/aws-indexes/k2) into a folder called ```databases``` using the following commands:
* ###### 8GB Standard Database
```
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20241228.tar.gz
mkdir k2_standard_8gb && tar -xvf k2_standard_08gb_20241228.tar.gz -C k2_standard_8gb
```
* ###### 16GB Standard Database
```
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_16gb_20241228.tar.gz
mkdir k2_standard_16gb && tar -xvf k2_standard_16gb_20241228.tar.gz -C k2_standard_16gb
```
* ###### 8GB plusPFP
```
wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_08gb_20241228.tar.gz
mkdir k2_plusPFP_08gb && tar -xvf k2_pluspfp_08gb_20241228.tar.gz -C k2_plusPFP_08gb
```
* ###### Viral
```
wget https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20241228.tar.gz
mkdir k2_viral && tar -xvf k2_viral_20241228.tar.gz -C k2_viral
```
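Each download above follows the same ```mkdir``` + ```tar -C``` idiom: create a target directory, then extract the archive into it. The toy example below demonstrates the pattern without the multi-gigabyte downloads (the file and directory names here are made up for illustration; a real Kraken2 database directory contains files such as ```hash.k2d```, ```opts.k2d```, and ```taxo.k2d```):

```shell
# Build a scratch archive standing in for a database tarball
mkdir -p demo_src
echo "placeholder" > demo_src/hash.k2d
tar -czf demo_db.tar.gz -C demo_src hash.k2d

# Unpack it into a fresh directory, mirroring the commands above
mkdir -p demo_db && tar -xzf demo_db.tar.gz -C demo_db
ls demo_db
```

The ```-C``` flag tells ```tar``` to change into the given directory before extracting, which keeps each database's files in its own folder.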
# Background
## Why use Nextflow for Bioinformatics?
Bash scripts can often be used to string together one bioinformatic process after the next in an automated way. However, in recent years, dedicated workflow management tools have been developed to better handle the complexities of workflow installation, sharing, and deployment across various compute environments. One such workflow management tool is Nextflow.
<div style="display: flex; align-items: center;">
<img src=https://repository-images.githubusercontent.com/9052236/ecd9481e-f4b3-4324-b832-a08ee1d99564 width="150" /> in a nutshell
</div>
* Nextflow simplifies the orchestration and execution of complex, reproducible workflows in bioinformatics and data science.
* It enables seamless scaling across diverse computing environments, from local machines to cloud and high-performance computing clusters, ensuring portability and scalability.
* It has a robust support network, which includes the nf-core community (https://nf-co.re/) aimed at creating standardized bioinformatics pipelines.
#### Main features of Nextflow:

<small>image source: (https://hpcbio.github.io/workflows-nextflow/01-getting-started-with-nextflow.html)</small>
Here is an example of a different taxonomic profiling workflow than what we will be using (created by nf-core):

#### Anatomy of a basic Nextflow script
* **Parameter specification**
* **Processes**
* **Workflow**
```
#!/usr/bin/env nextflow

params.in = "$baseDir/data/sample.fa"

// Split a fasta file into multiple files
process splitSequences {
    input:
    path 'input.fa'

    output:
    path 'seq_*'

    """
    awk '/^>/{f="seq_"++d} {print > f}' < input.fa
    """
}

// Reverse the sequences
process reverse {
    input:
    path x

    output:
    stdout

    """
    cat $x | rev
    """
}

// Define the workflow
workflow {
    splitSequences(params.in) \
        | reverse \
        | view
}
```
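The ```awk``` one-liner inside ```splitSequences``` opens a new output file each time it sees a header line (```>```), then writes every line to whichever file is current. You can try it on a toy FASTA outside of Nextflow (the file names here are just for illustration):

```shell
# Two-record toy FASTA
printf '>s1\nACGT\n>s2\nTTGG\n' > input.fa

# /^>/ starts a new file (seq_1, seq_2, ...); every line goes to the current file
awk '/^>/{f="seq_"++d} {print > f}' < input.fa

ls seq_*     # one file per record
cat seq_2    # header plus sequence of the second record
```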
#### Other workflow platforms
While EPI2ME utilizes Nextflow to deploy its pipelines, there are many other workflow languages that you may find more suitable if you ever plan to develop your own. These include:
* Snakemake (https://snakemake.readthedocs.io/en/stable/)
* Galaxy (https://galaxyproject.org/)
* CWL (https://www.commonwl.org/)
* Apache Airflow (https://airflow.apache.org/)
* and many more (see: https://en.wikipedia.org/wiki/Scientific_workflow_system)
## Taxonomic classifiers 101:
The goal of taxonomic classifiers is to assign a taxonomic category to a DNA sequence (in this case, a read output from the MinION sequencer).
Over the years, many tools have been developed which utilize different techniques aimed at achieving this goal.
### Sequence alignment based methods
Some utilize scoring algorithms to find the best alignments between query and database sequences (e.g. BLAST, DIAMOND, MMseqs2)
```
Sequence alignments with genomes in the genus Anaplasma:

100% identity
Query sequence     TATGGTAGGCAGGGTACCAGTTGTAGATGAGGTTT
                   |||||||||||||||||||||||||||||||||||
A. phagocytophilum TATGGTAGGCAGGGTACCAGTTGTAGATGAGGTTT
_______________________________________________________
94% identity
Query sequence     TATGGTAGGCAGGGTACCAGTTGTAGATGAGGTTT
                   ||||||*|||||||||||||||||*||||||||||
A. marginale       TATGGTGGGCAGGGTACCAGTTGTGGATGAGGTTT
_______________________________________________________
91% identity
Query sequence     TATGGTAGGCAGGGTACCAGTTGTAGATGAGGTTT
                   ||||||*||*||*||||||||||||||||||||||
A. ovis            TATGGTGGGTAGAGTACCAGTTGTAGATGAGGTTT
```
...
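Percent identity is simply the fraction of positions where the two aligned sequences agree. As a sanity check, the `A. marginale` comparison above can be recomputed with a short ```awk``` script (sequences copied from the example):

```shell
awk 'BEGIN {
    q = "TATGGTAGGCAGGGTACCAGTTGTAGATGAGGTTT"   # query read
    s = "TATGGTGGGCAGGGTACCAGTTGTGGATGAGGTTT"   # A. marginale reference
    # Count positions where the two sequences carry the same base
    for (i = 1; i <= length(q); i++)
        if (substr(q, i, 1) == substr(s, i, 1)) m++
    printf "%d/%d identical positions (%.0f%% identity)\n", m, length(q), 100 * m / length(q)
}'
```

This prints ```33/35 identical positions (94% identity)```, matching the 94% shown in the alignment.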
### k-mer based methods
Some utilize methods that break sequences up into small words of a specified length (k), known as k-mers, and rely on exact matches between query and database sequences.
```
k = 31
TATGGTAGGCAGGGTACCAGTTGTAGATGAGGTTT
TATGGTAGGCAGGGTACCAGTTGTAGATGAG
ATGGTAGGCAGGGTACCAGTTGTAGATGAGG
TGGTAGGCAGGGTACCAGTTGTAGATGAGGT
GGTAGGCAGGGTACCAGTTGTAGATGAGGTT
GTAGGCAGGGTACCAGTTGTAGATGAGGTTT
A. phagocytophilum 5/5 31-mer matches
A. marginale 0/5 31-mer matches
A. ovis 0/5 31-mer matches
```
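The sliding window above is easy to reproduce in bash: a 35 bp read yields 35 − 31 + 1 = 5 overlapping 31-mers.

```shell
read_seq="TATGGTAGGCAGGGTACCAGTTGTAGATGAGGTTT"   # 35 bp example read from above
k=31
n=$(( ${#read_seq} - k + 1 ))                    # number of k-mers: 35 - 31 + 1 = 5
for (( i = 0; i < n; i++ )); do
    echo "${read_seq:i:k}"                       # substring expansion slides the window one base at a time
done
```

Each printed line is one k-mer that a tool like Kraken2 would look up for an exact match in its database.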
The EPI2ME metagenomics workflow can utilize either a sequence alignment based method (minimap2: https://github.com/lh3/minimap2) or a k-mer based method (Kraken2: https://ccb.jhu.edu/software/kraken2/). For this workshop, we will be utilizing the k-mer based Kraken2 software.
This diagram represents the basic steps of the wf-metagenomics workflow:

1) Reads will be taxonomically assigned by Kraken2 and then "corrected" using Bayesian re-estimation in Bracken to obtain a relative abundance for each taxon. This helps reduce the number of misclassified reads.
2) Antibiotic resistance genes will be detected using ABRicate (https://github.com/tseemann/abricate), which uses an alignment-based approach (BLAST) against specialized databases.
### Database selection
Not only are there different tools, but also different databases that can be paired with each tool. We have pre-downloaded four different databases you can try; however, we will use the 8 GB Standard database first. These are stored in ```/home/data/databases/```
|Database|Description|
|----|----|
|8 GB Standard| RefSeq archaea, bacteria, viral, plasmid, human, UniVec_Core capped at 8 GB (originally 84.1 GB) |
|16 GB Standard| RefSeq archaea, bacteria, viral, plasmid, human, UniVec_Core capped at 16 GB |
|8 GB PlusPFP| Standard plus RefSeq protozoa, fungi & plant capped at 8 GB (originally 195.2 GB) |
|Viral| RefSeq viral |
Here is a nice article that deals with common problems and potential solutions when using various reference databases (https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2024.1278228/full)
# Running the EPI2ME wf-metagenomics workflow
<img src="https://hackmd.io/_uploads/BJY2p6j51l.png" width="200">
Now we can get to running the commands!
In order to best support your learning we have post-it notes that you can use to indicate when you are done with a task or if you need help troubleshooting a task.
Blue Heart = You've completed the current command/step.
Yellow Square = You would like assistance troubleshooting a particular command or computer issue.
### Access the Rstudio AWS instance
1) Go to one of the following urls (depending on which one you are assigned) and enter the username and password you are given to log into the Rstudio server:
* http://10.209.238.178:8787/
* http://10.209.238.48:8787/
* http://10.209.238.99:8787/
* http://10.209.238.19:8787/
**Note:** Two students will be working on each of those instances, so for the morning session you will be sharing a computer with a partner.
2) Click on the Terminal tab in order to interact with the instance via the command line
3) Within the terminal, change to the /home/data directory
```
cd /home/data
```
4) Check your input files.
**Note:** We already transferred the data obtained from the minion (on the laptops) into the ```NGS_class_data``` folder on the AWS instance. Check to see what is in that directory:
```
ls NGS_class_data
```
You should hopefully see something like this:
```
deinococcus_radiodurans omnivore staphylococcus_pasteuri vegetarian
```
Each folder should be filled with fastq files. You can check one of them using the following command:
```
ls NGS_class_data/omnivore
```
You should see a bunch of fastq files
```
FBA48668_pass_barcode02_194ec089_915c9f14_286.fastq.gz FBA48668_pass_barcode02_194ec089_915c9f14_92.fastq.gz
FBA48668_pass_barcode02_194ec089_915c9f14_287.fastq.gz FBA48668_pass_barcode02_194ec089_915c9f14_93.fastq.gz
FBA48668_pass_barcode02_194ec089_915c9f14_288.fastq.gz FBA48668_pass_barcode02_194ec089_915c9f14_94.fastq.gz
FBA48668_pass_barcode02_194ec089_915c9f14_289.fastq.gz FBA48668_pass_barcode02_194ec089_915c9f14_95.fastq.gz
FBA48668_pass_barcode02_194ec089_915c9f14_29.fastq.gz
...
```
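A quick way to gauge how many reads a folder contains is to count lines across the gzipped FASTQ files and divide by four, since each FASTQ record is exactly four lines. The demo below builds a tiny two-read file rather than assuming the class data is present (the file and read names are made up):

```shell
# Build a toy gzipped FASTQ with two records
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n' | gzip > toy.fastq.gz

# zcat streams the decompressed lines; 4 lines per read
echo "$(( $(zcat toy.fastq.gz | wc -l) / 4 )) reads"
```

The same pattern applied to the class data would be, e.g., ```zcat NGS_class_data/omnivore/*.fastq.gz | wc -l``` divided by four.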
5) Now that you have checked your input data, you can get ready to run the workflow. Make a new directory where you will run the workflow. **Since two people are on each instance, include your username in the directory name.**
```
mkdir TaxClass_student
```
or
```
mkdir TaxClass_student2
```
Then change into that directory:
```
cd TaxClass_student
```
or
```
cd TaxClass_student2
```
6) Run the workflow using the following command:
```
nextflow run tslaird/wf-metagenomics --fastq /home/data/NGS_class_data/ \
--out_dir output_k2_standard_08gb/ \
--kraken2_memory_mapping \
--database /home/data/databases/k2_standard_08gb \
--amr --amr_db resfinder \
--include_read_assignments
```
**Note:** There are many additional flags that can be used when running the pipeline. Those can be found by running the following command at a later time:
```
nextflow run tslaird/wf-metagenomics --help
```
or by viewing the documentation at https://github.com/epi2me-labs/wf-metagenomics
## Watching the pipeline run
When running on the command line, Nextflow logs the various processes in the terminal along with how many tasks of each are complete. For the wf-metagenomics workflow the terminal should look something like the output below.
Notably, the hexadecimal numbers in the log, like ```28/1158db```, identify unique process executions (called tasks). These numbers are also the prefixes of the directories within the ```work/``` directory where each task is executed. When troubleshooting the construction or execution of a workflow, it is sometimes helpful to navigate to those directories and examine the files in them.
```
N E X T F L O W ~ version 24.10.4
Launching `https://github.com/epi2me-labs/wf-metagenomics` [pedantic_cori] DSL2 - revision: 30e979bf75 [master]
WARN: NEXTFLOW RECURSION IS A PREVIEW FEATURE - SYNTAX AND FUNCTIONALITY CAN CHANGE IN FUTURE RELEASES
|||||||||| _____ ____ ___ ____ __ __ _____ _ _
|||||||||| | ____| _ \_ _|___ \| \/ | ____| | | __ _| |__ ___
||||| | _| | |_) | | __) | |\/| | _| _____| |/ _` | '_ \/ __|
||||| | |___| __/| | / __/| | | | |__|_____| | (_| | |_) \__ \
|||||||||| |_____|_| |___|_____|_| |_|_____| |_|\__,_|_.__/|___/
|||||||||| wf-metagenomics v2.12.1-g30e979b
--------------------------------------------------------------------------------
Core Nextflow options
revision : master
runName : pedantic_cori
containerEngine : docker
container : [withLabel:wfmetagenomics:ontresearch/wf-metagenomics:sha1adcb94088d2397b0d61c0393035f51b1013226a, withLabel:wf_common:ontresearch/wf-common:shaabceef445fb63214073cbf5836fdd33c04be4ac7, withLabel:amr:ontresearch/abricate:sha2ae47c3f988b71f7cf81c37c000f6cb046e18357]
launchDir : /home/tlaird/epi2me_test
workDir : /home/tlaird/epi2me_test/work
projectDir : /home/tlaird/.nextflow/assets/epi2me-labs/wf-metagenomics
userName : tlaird
profile : standard
configFiles : /home/tlaird/.nextflow/assets/epi2me-labs/wf-metagenomics/nextflow.config
Input Options
fastq : input_fastq/
Reference Options
database : databases/k2_pluspfp_08gb_20241228
Kraken2 Options
kraken2_memory_mapping: true
Antimicrobial Resistance Options
amr : true
Output Options
out_dir : output_k2_pluspfp/
!! Only displaying parameters that differ from the pipeline defaults !!
--------------------------------------------------------------------------------
If you use epi2me-labs/wf-metagenomics for your analysis please cite:
* The nf-core framework
https://doi.org/10.1038/s41587-020-0439-x
--------------------------------------------------------------------------------
This is epi2me-labs/wf-metagenomics v2.12.1-g30e979b.
--------------------------------------------------------------------------------
Checking inputs.
Note: Reference/Database are custom.
Note: Memory available to the workflow must be slightly higher than size of the database custom index.
Note: Or consider to use the --kraken2_memory_mapping.
Note: Memory available to the workflow must be slightly higher than size of the database Standard-8 index (8GB) or consider to use --kraken2_memory_mapping
Searching input for [.fastq, .fastq.gz, .fq, .fq.gz] files.
[- ] fastcat -
[- ] fastcat -
executor > local (1)
executor > local (1)
executor > local (2)
executor > local (2)
executor > local (2)
[e7/01e9f8] fastcat (1) | 2 of 2, cached: 2 ✔
[skipped ] prepare_databases:download_unpack_taxonomy | 1 of 1, stored: 1 ✔
[skipped ] prepare_databases:determine_bracken_length | 1 of 1, stored: 1 ✔
[c4/505d29] kraken_pipeline:run_common:getVersions | 1 of 1, cached: 1 ✔
[11/196b50] kraken_pipeline:run_common:abricateVersion | 1 of 1, cached: 1 ✔
[3b/69e0b4] kraken_pipeline:run_common:getParams | 1 of 1 ✔
[3c/932b59] kraken_pipeline:run_kraken2 (omnivore) | 2 of 2, cached: 2 ✔
[c3/ad56e5] kraken_pipeline:run_bracken (omnivore) | 2 of 2, cached: 2 ✔
[8b/5f6d26] kraken_pipeline:createAbundanceTables | 1 of 1, cached: 1 ✔
[6c/1e25b5] kraken_pipeline:run_amr:abricate (omnivore) | 2 of 2, cached: 2 ✔
[a7/0ceb1d] kraken_pipeline:run_amr:abricate_json (2) | 2 of 2, cached: 2 ✔
[28/1158db] kraken_pipeline:makeReport | 1 of 1 ✔
Note: Empty files or those files whose reads have been discarded after filtering based on read length and/or read quality will not appear in the report and will be excluded from subsequent analysis.
Kraken2 pipeline.
Preparing databases.
Using default taxonomy database.
Checking custom kraken2 database exists
Using the bracken dist file within your custom database directory.
```
## Discussion points
The wf-metagenomics output will individually assign each read but also provide aggregated metrics and statistics including:
* Read quality and length metrics
* Taxonomic breakdowns at various levels (e.g. Domain, Phylum, Class ... Species) and displayed using various tables and plots
* Diversity indices
* Antimicrobial resistance genes that were detected
**Do you have any hypotheses regarding the metrics listed above when comparing omnivore and vegetarian samples?**
**Do you have any predictions as to how the bacterial isolate (D. radiodurans and S. pasteuri) data will be classified?**
#### Pipeline outputs
After successfully running the workflow, see what folders are now in the TaxClass directory:
```
ls
```
You should see:
```
output_k2_standard_08gb store_dir work
```
```output_k2_standard_08gb``` is where the workflow output files are stored
```store_dir``` is used to store downloaded taxonomy information from NCBI
```work``` is where all the pipeline steps were executed. Upon finishing each step, the results were copied over to ```output_k2_standard_08gb```
## Analyzing the output
Thankfully, the wf-metagenomics pipeline automatically generates a report file summarizing various metrics and outputs from the Kraken2 analysis. This is the ```wf-metagenomics-report.html``` file within the output folder (```output_k2_standard_08gb```).
Let's examine that!
You can open up the file by navigating to it with the RStudio file navigator, hovering over the file, and clicking "View in Web Browser"

### A quick run down of the report
Take some time to go through the report. Come up with at least one question or surprising finding based on the output.
Here are some more discussion points:
* Do the results support or refute any of the hypotheses that you suggested about gut metagenome samples? What about the isolate samples?
* How is the quality of the reads? Is the quality ideal for organism identification? Is it ideal for genome assembly? What do you think could be done to improve quality?
* Is there any evidence of contamination in the isolate samples? If so, how can you remedy that?
## Running with a different database
As noted before, database selection can have an impact on the resulting taxonomic classification. We will run the same tool, Kraken2, using a database that is double the size (16 GB) (also found here: https://benlangmead.github.io/aws-indexes/k2).
**Note:** the command below is very similar to the one you previously ran. The only differences are the selected database, the specified output folder, and the ```-resume``` flag. Since the workflow already ran once, some of the data can be reused.
```
nextflow run tslaird/wf-metagenomics --fastq /home/data/NGS_class_data/ \
--out_dir output_k2_standard_16gb/ \
--kraken2_memory_mapping \
--database /home/data/databases/k2_standard_16gb_20241228 \
--amr --amr_db resfinder \
--include_read_assignments -resume
```
Examine the output, **how does it compare to the results obtained with the 8 GB database?**
## A look at unclassified reads
As you probably noticed, a large portion of the reads were unclassified. Can we get a better sense of what those reads might be? Let's try!
We modified the original wf-metagenomics pipeline to return the reads left unclassified by Kraken2 in the files ```output_k2_standard_16gb/kraken2/{sample_name}.kraken2.unclassified```, where ```{sample_name}``` stands for each of the samples (omnivore, vegetarian, D. radiodurans, and S. pasteuri)
These are fastq files, which have a particular structure (https://en.wikipedia.org/wiki/FASTQ_format) consisting of:
1) A sequence name prefixed by an ```@``` symbol
2) The actual DNA sequence (As, Ts, Gs, and Cs)
3) A ```+``` separator character
4) A Phred quality score represented by ASCII characters (this website has a neat Phred score to ASCII converter: https://jamiemcgowan.ie/bioinf/phred.html)
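The Phred encoding itself is simple: a quality character's ASCII code minus 33 gives the score Q, where the error probability is 10^(−Q/10). A small bash loop illustrates the conversion for a few example quality characters:

```shell
# Convert FASTQ quality characters to Phred scores: ASCII code minus 33
for c in '!' '+' '5' 'I'; do
    ascii=$(printf '%d' "'$c")              # printf "'X" idiom: numeric ASCII value of character X
    printf "%s -> Q%d\n" "$c" $(( ascii - 33 ))
done
```

This prints ```!``` as Q0, ```+``` as Q10, ```5``` as Q20, and ```I``` as Q40, i.e. error probabilities of 1, 0.1, 0.01, and 0.0001 respectively.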
Let's use a command to extract the top 5 reads from that file:
```
sed -n '1~4s/^@/>/p;2~4p' output_k2_standard_16gb/kraken2/omnivore.kraken2.unclassified | head -n 10 > top5_unclassified_reads.fasta
```
**Note:** above we used some built-in Linux file manipulation commands to extract these sequences. However, there is dedicated bioinformatics software for such tasks of converting fastq files to fasta files (see: https://github.com/lh3/seqtk or https://bioinf.shenwei.me/seqkit/usage/)
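To see what the ```sed``` one-liner is doing, run it on a toy FASTQ (read names below are made up): GNU sed's ```first~step``` addressing selects every 4th line starting at line 1 (the ```@``` headers, rewritten to ```>``` and printed) and every 4th line starting at line 2 (the sequences), which is exactly a FASTQ-to-FASTA conversion.

```shell
# Two toy FASTQ records
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\nJJJJ\n' > toy.fastq

# 1~4: header lines, @ replaced with >; 2~4: sequence lines; -n suppresses everything else
sed -n '1~4s/^@/>/p;2~4p' toy.fastq
```

The ```+``` separator and quality lines (lines 3 and 4 of each record) are simply never printed.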
Now take the DNA sequence of one of those reads and BLAST it against the NCBI database through the online portal: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome
You can upload the file directly or copy and paste the text from the terminal after first using the ```cat``` command to display the text:
```
cat top5_unclassified_reads.fasta
```
When the search completes, what are the results? Are they similar to the classification of other reads by the EPI2ME pipeline?
## ? ? ? Identification of mystery samples ? ? ?
As an additional exercise, we have simulated nanopore reads from clinically relevant genomes using pbsim2 (https://github.com/yukiteruono/pbsim2).
These can be found in the ```/home/data/SimData/``` directory.
To run this analysis:
Make a new directory ```SimTaxClass_student``` or ```SimTaxClass_student2```
```
mkdir /home/data/SimTaxClass_student
```
or
```
mkdir /home/data/SimTaxClass_student2
```
and change to that directory
```
cd /home/data/SimTaxClass_student
```
or
```
cd /home/data/SimTaxClass_student2
```
Now, using the analyses above as a guideline, write a Nextflow command to classify the simulated reads. Below is a template that you can edit. The curly brackets mark where you should substitute your own text; the brackets themselves should not appear in the finalized command.
```
nextflow run tslaird/wf-metagenomics --fastq {path/to/simulated/data} \
--out_dir {custom_name_for_output directory} \
--kraken2_memory_mapping \
--database /home/data/databases/{kraken_database name} \
--amr --amr_db {ABRicate_database_name} \
--include_read_assignments -resume
```
If you want to run the analysis more than once, you can change the output directory along with whatever parameters you would like to alter.
##### Some questions to think of:
What Kraken2 database will you choose? Is it worth changing the ABRicate database from the default one we have been using (resfinder)? What about running the analyses with more than one database?
Post-analysis questions:
* What do you think each sample is composed of?
* What is the clinical relevance of each sample?
* Were you able to identify all samples?
* How might taxonomic naming interfere with clinical treatments?
* Can this workflow differentiate a pathogen from a commensal strain?
### Downloading markdown files for genome assembly exercises (after lunch)
1)
```
cd /home/data
```
2)
```
git clone https://github.com/tslaird/NIST_Microbial_NGS_Course
```
This should create a folder ```NIST_Microbial_NGS_Course``` which has several markdown files.
### Some external links/articles that may be helpful or interesting:
Available nanopore sequence prep kits (https://store.nanoporetech.com/us/sample-prep.html)
If You're Not Confused, You're Not Paying Attention: Ochrobactrum Is Not Brucella (https://journals.asm.org/doi/full/10.1128/jcm.00438-23)
Expansion of the Brucella Genus to include Ochrobactrum:
Clinical Considerations
https://www.cdc.gov/brucellosis/media/pdfs/CDC-Reclassification-of-Brucella-Genus2022.pdf
Changes in fungal taxonomy: mycological rationale and clinical implications (https://journals.asm.org/doi/10.1128/cmr.00099-22)
Seven-year performance of a clinical metagenomic next-generation sequencing test for diagnosis of central nervous system infections (https://www.nature.com/articles/s41591-024-03275-1#Sec2)
## Answer key for mystery samples
SampleA: Pichia kudriavzevii
SampleB: Zaire ebolavirus
SampleC: Escherichia coli O157:H7
SampleD: Escherichia coli K-12
SampleE: Brucella anthropi
SampleF: Balamuthia mandrillaris