---
title: 'Metatranscriptome analysis and e-probe design (Wheat rhizosphere)'
disqus: hackmd
tags: edna, wheat, metatranscriptomes
---

Metatranscriptome analysis and e-probe design (wheat rhizosphere)
===

[repository](https://github.com/andrese52/wheat-microbiome-beta)

## Table of Contents

[TOC]

Data flowchart
---

```sequence
Raw fastq Data (reads) --> Reads quality report: FastQC
Raw fastq Data (reads) --> Trimmed reads: Trimmomatic
Trimmed reads --> Normalized reads: interleave, normalize and split
Normalized reads --> Dehosted reads: bowtie2, samtools, bedtools
Dehosted reads --> Assembled metatranscriptome: Trinity, MetaVelvet, metaSPAdes
```

> Read more about sequence-diagrams here: http://bramp.github.io/js-sequence-diagrams/

## Data Quality Check

First, copy all the data from its original location to our working directory, which in this case is `raw-data`:

```bash=1
workdir=/project02/transgenomics/andrese52/edna-wheat/raw-data
for i in *.fq.gz
do
cp $i $workdir/$i
done
```

### FastQC

We can use FastQC for this. From the folder that contains all the fastq files, run:

```bash=1
fastqc -t 12 *.fq.gz
```

The command above runs serially. Since we don't want to run the same code many times, remember this quote:

> I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it. [name=Bill Gates]

We can submit many `fastqc` jobs to a Torque cluster as follows:

1. Create a job script and save it as `fastq.sh` in the same folder where your `fq.gz` files are saved.

```bash=1
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=12
#PBS -l walltime=1:00:00
#PBS -j oe
#PBS -m abe -M andres.espindola@okstate.edu
cd $PBS_O_WORKDIR
echo "YOUR fastq analysis IS" $name
fastqc -t 12 $name
```

2. Submit one job per file by targeting the `fq.gz` extension of your files:

```bash=1
for i in *.fq.gz
do
qsub -v name=$i -N fastq_$i fastq.sh
echo $i
sleep 1
done
```

`-t 12` sets the number of threads (processors) FastQC uses.

> Read more about FastQC here: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

## Trimming low quality reads

### Trimmomatic

Trimmomatic removes low-quality bases and adapter sequences from the reads. More information about its options can be found [here](http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf). The trimming must be run separately for each paired-end run.

1. Create a `PBS` job script named `trimomatic.sh` inside the directory that will hold the trimmed output (later steps assume this directory is called `trim`). Make sure the Phred encoding is +33 or +64 by checking the FastQC output.
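If you prefer to confirm the encoding from the command line rather than opening the FastQC report, a quick check like the sketch below works (the file name is just one example from this dataset):

```bash
# List the distinct quality characters in the first 100 reads (line 4 of each
# fastq record). Characters below ';' (e.g. '#', '5') only occur with Phred+33;
# if the lowest character you see is '@' or higher, the data are Phred+64.
zcat T1R1_1.fq.gz | head -n 400 | awk 'NR % 4 == 0' | fold -w1 | sort -u | tr -d '\n'; echo
```

For reference, the encodings and their quality ranges are: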
- Phred+33: Sanger [0,40], Illumina 1.8 [0,41], Illumina 1.9 [0,41] ([source](http://bioinfo.cipf.es/courses/mda13genomics/_media/mda13:qc_talk.pdf))
- Phred+64: Illumina 1.3 [0,40], Illumina 1.5 [3,40] ([source](http://bioinfo.cipf.es/courses/mda13genomics/_media/mda13:qc_talk.pdf))

```bash
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=12
#PBS -l walltime=12:00:00
#PBS -j oe
#PBS -m abe -M andres.espindola@okstate.edu
cd $PBS_O_WORKDIR
module load jdk/1.8.0_45
module load trimmomatic/0.33
java -jar /opt/trimmomatic/0.33/trimmomatic-0.33.jar PE -phred33 -trimlog $name-log.txt \
../raw-data/$name\_1.fq.gz ../raw-data/$name\_2.fq.gz \
$name\_output_paired_1.fq.gz $name\_output_unpaired_1.fq.gz $name\_output_paired_2.fq.gz $name\_output_unpaired_2.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:25
```

2. Create a list of your paired-end runs that looks like the one below and save it as `run-ids.txt` in your `raw-data` directory. Keep this file in `raw-data` because you will use it continuously.

```
T1R1
T1R2
T2R1R
T2R1S
T2R2R
T2R2S
```

3. Submit the jobs using the `../raw-data/run-ids.txt` file.

```bash
while read i; do
qsub -v name=$i -N trimo_$i trimomatic.sh
sleep 1
done <../raw-data/run-ids.txt
```

## Normalizing reads

### 1. Interleaving reads and normalizing

[`interleave-reads.py`](https://github.com/dib-lab/khmer/blob/master/scripts/interleave-reads.py) merges each pair of trimmed files into a single interleaved `.fq` file, which [`normalize-by-median.py`](https://github.com/dib-lab/khmer/blob/master/scripts/normalize-by-median.py) then digitally normalizes. Create a directory in the parent project called `normalized_reads` and, inside it, create the following `PBS` script named `normalize.sh`:

```bash
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=12
#PBS -l walltime=72:00:00
#PBS -j oe
#PBS -m abe -M andres.espindola@okstate.edu
cd $PBS_O_WORKDIR
module load khmer/1.4.1
interleave-reads.py ../trim/$name\_output\_paired\_1.fq.gz ../trim/$name\_output\_paired\_2.fq.gz -o $name\_paired.fq
normalize-by-median.py -C 30 -k 25 -N 4 -x 6e9 -p --savegraph $name\_norm.ct -o - $name\_paired.fq >> $name\_norm30x.fq
```

Submit the job as we submitted previous jobs:

```bash
while read i; do
qsub -v name=$i -N norm_$i normalize.sh
sleep 1
done <../raw-data/run-ids.txt
```

### 2. Splitting normalized reads

The output is a `*_norm30x.fq` file that contains both forward and reverse reads. We want to split them for downstream analysis using [split-paired-reads.py](https://github.com/dib-lab/khmer/blob/master/scripts/split-paired-reads.py). Create another folder called `splitted` inside the `normalized_reads` folder and, inside it, a `PBS` script named `spliter.sh` (the de-hosting step later expects the outputs to be named `<run>_left.fq` and `<run>_right.fq`):

```bash
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=12
#PBS -l walltime=10:00:00
#PBS -j oe
#PBS -m abe -M andres.espindola@okstate.edu
cd $PBS_O_WORKDIR
module load khmer/1.4.1
# Work on a symlink so the split outputs (<input>.1 and <input>.2) land in this
# folder regardless of where the khmer version writes them.
ln -s ../$name\_norm30x.fq .
split-paired-reads.py $name\_norm30x.fq
# Rename the outputs to the file names expected by the de-hosting step.
mv $name\_norm30x.fq.1 $name\_left.fq
mv $name\_norm30x.fq.2 $name\_right.fq
rm $name\_norm30x.fq
```
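Before submitting, you can optionally sanity-check that a normalized interleaved file still holds complete pairs; a minimal sketch (using sample `T1R1` as an example, run from the `splitted` folder) is:

```bash
# An interleaved fastq stores pairs consecutively (4 lines per read), so the
# total read count should be even; normalize-by-median.py -p keeps pairs intact.
reads=$(( $(wc -l < ../T1R1_norm30x.fq) / 4 ))
echo "$reads reads, $(( reads % 2 )) unpaired remainder"
```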
Submit the jobs as before; in this case we need to go up two directories to reach `run-ids.txt`.

```bash
while read i; do
qsub -v name=$i -N split_norm_$i spliter.sh
sleep 1
done <../../raw-data/run-ids.txt
```

## De-hosting

Since we are mainly interested in the microbial fraction, we want to remove as many host (wheat) sequences as possible. We will use `bowtie2` for this. First we need to index the host genome with the following `PBS` script (`bowtie_dehost_index.sh`), run from inside a directory named `dehosting-index`:

```bash
#!/bin/bash
#PBS -N indexbuild_bowtie
#PBS -q bigmem
#PBS -l nodes=1:ppn=12
#PBS -l walltime=120:00:00
#PBS -j oe
#PBS -m abe -M andres.espindola@okstate.edu
cd $PBS_O_WORKDIR
module load bowtie2/2.3.4
bowtie2-build /project02/transgenomics/161010_Chinese_Spring_v1.0_pseudomolecules.fasta Spring_Wheat_index_bowtie2 --large-index --threads 12
```

### Bowtie2 mapping to the reference genome (wheat)

Move to the working directory named `dehosting` and use the following script, `dehosting.sh`, to map all the reads to the indexed reference genome and keep only the pairs in which both mates are unmapped (i.e. non-host reads):

```bash
#!/bin/bash
#PBS -q bigmem
#PBS -l nodes=1:ppn=12
#PBS -l walltime=120:00:00
#PBS -j oe
#PBS -m abe -M andres.espindola@okstate.edu
cd $PBS_O_WORKDIR
module load bowtie2/2.3.4
module load samtools/1.6
module load bedtools
bowtie2 -p 12 -x ../dehosting-index/Spring_Wheat_index_bowtie2 -1 ../normalized_reads/splitted/$name\_left.fq -2 ../normalized_reads/splitted/$name\_right.fq -S $name\_mapped_and_unmapped.sam
samtools view -bS $name\_mapped_and_unmapped.sam > $name\_mapped_and_unmapped.bam
samtools view -b -f 12 -F 256 $name\_mapped_and_unmapped.bam > $name\_bothEndsUnmapped.bam
samtools sort -n $name\_bothEndsUnmapped.bam -o $name\_bothEndsUnmapped_sorted
bedtools bamtofastq -i $name\_bothEndsUnmapped_sorted -fq $name\_host_removed_left.fastq -fq2 $name\_host_removed_right.fastq
```

Now submit the job for all the files with the following loop:

```bash
while read i; do
qsub -v name=$i -N dehosting_$i dehosting.sh
sleep 1
done <../raw-data/run-ids.txt
```

## Assembling with MetaVelvet

### Subsampling datasets to test MetaVelvet

We want to test `metavelvet` for its capacity to rapidly and accurately assemble soil metatranscriptomes, so we took a random subsample from the original `fastq` files. From the folder where the dehosted fastq files are found, run:

```bash
for i in *left.fastq
do
suffix="_left*"
foo=${i%$suffix}
echo "${foo}">> filenames.txt
done
```

This creates a `filenames.txt` file that we can use to run `seqtk` and subsample the reads. Place the following [PBS script](/scripts/seqtk_subsample.sh) in the same directory and submit it with this loop:

```bash
while read i; do
qsub -v name=$i -N subsample_$i seqtk_subsample.sh
sleep 1
done < filenames.txt
```

This takes a random subsample of 100k reads from each dehosted file and writes it to a new file.

#### Checking if the subsampled paired-end reads are correct

We can use [fastq_pair](https://github.com/linsalrob/fastq-pair) as follows:

```bash
fastq_pair -t 1000 left.fastq right.fastq
```

We can automate this by creating a [PBS script](/scripts/subsample_check.sh) that checks the paired reads.
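The exact contents of `subsample_check.sh` are not reproduced in these notes; a minimal sketch of what such a wrapper might look like is shown below (the `*_left.sub.fastq` / `*_right.sub.fastq` names are placeholders; match them to whatever `seqtk_subsample.sh` actually writes):

```bash
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=1
#PBS -l walltime=1:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
# Hypothetical body for subsample_check.sh: pair-check one subsampled sample,
# where $name is passed in with `qsub -v name=...` (e.g. T1R1_host_removed).
# -t is fastq_pair's hash-table size; a value near the read count (100k here)
# keeps lookups efficient.
fastq_pair -t 100000 $name\_left.sub.fastq $name\_right.sub.fastq
```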
We can submit the above jobs as follows:

```bash
while read i; do
qsub -v name=$i -N checksub_$i subsample_check.sh
sleep 1
done < filenames.txt
```

In the output, the `paired` files should contain the reads that are properly paired, while the `single` files contain only singletons.

### Running `metavelvet` on 100k paired-end reads

In some cases Trinity might not work for assembling large metatranscriptome datasets, so we decided to go with MetaVelvet. Running MetaVelvet requires running `velveth` and `velvetg` first, followed by `meta-velvetg`; the [PBS script](/scripts/met-assembly.sh) I created contains all of these commands. Now that we have `filenames.txt`, we can submit the jobs to the queue:

```bash
while read i; do
qsub -v name=$i -N metavelvet_$i met-assembly.sh
sleep 1
done < filenames.txt
```

Running the previous jobs took only 2 minutes because I was using a random subset of the total reads. The average coverage for these runs was 1; ideally, when assembling the total number of reads, coverage should increase to at least 10.

### Assembling with total reads

We must use a [different PBS script](/scripts/met-assembly-fullreads.sh) in this case; the script differs only in the filenames. We submit the jobs as follows:

```bash
while read i; do
qsub -v name=$i -N metavelvet_$i met-assembly-fullreads.sh
sleep 1
done < filenames.txt
```

### Assessing assembly quality

We can assess the assemblies with [QUAST](http://bioinf.spbau.ru/quast). Let us create a folder with soft links to all the assemblies.

```bash
mkdir assemblies
```

Then create the soft links:

```bash
while read i; do
currentdir=`pwd`
ln -s $currentdir/$i-out-fullreads/meta-velvetg.contigs.fa $currentdir/assemblies/$i-metavelvet.fasta
done < filenames.txt
```

Finally we can run `quast` inside the `assemblies` folder:

```bash
quast.py *.fasta
```

This creates a PDF report with metrics about the assemblies.

### Blasting the largest contig on each subsample

First we need the length of the largest contig in each assembly so that we can `grep` the fasta headers for it. Let's create a file (`largestcontigs.txt`) with the lengths of the largest contigs:

```
T1R1_host_removed 6165
T1R2_host_removed 6918
T2R1R_host_removed 9451
T2R1S_host_removed 5999
T2R2R_host_removed 5698
T2R2S_host_removed 6202
```

Now we can retrieve the corresponding contig IDs from the fasta files. Keep in mind that these lengths had to be taken from `meta-velvetg.LastGraph-stats.txt`, because `quast` did not report the same number for the largest contig.

```bash
while read i; do
name=$(echo $i|awk '{print $1}')
length=$(echo $i|awk '{print $2}')
echo $name $length
grep "\<NODE.*length_$length.*\>" $name-metavelvet.fasta | sed "s/\(.*\)/\1 $name/" | sed 's/^.\{1\}//' >> transcriptid.txt
done < largestcontigs.txt
```

Now retrieve the largest contig for each file:

```bash
while read i; do
id=$(echo $i|awk '{print $1}')
name=$(echo $i|awk '{print $2}')
echo $name $id
samtools faidx $name-metavelvet.fasta $id > $name-largestcontig.fasta
done < transcriptid.txt
```
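The retrieved contigs can then be given an identity with BLAST. The exact command is not recorded in these notes; a minimal sketch, assuming BLAST+ is installed and a local `nt` database is configured (or swap in `-remote` and drop `-num_threads` to query NCBI's servers), could be:

```bash
while read i; do
name=$(echo $i|awk '{print $2}')
# blastn each largest contig against nt; tabular output for easy parsing.
blastn -query $name-largestcontig.fasta -db nt \
       -outfmt "6 qseqid sseqid pident length evalue stitle" \
       -max_target_seqs 5 -num_threads 12 -out $name-largestcontig.blastn.tsv
done < transcriptid.txt
```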
### Retrieving contigs larger than 1000 bp

We must be in the `assemblies` folder to run the following command. It only works with velvet-style assemblies because their headers encode the contig length.

```bash
while read i; do
grep "\<NODE.*\>" $i-metavelvet.fasta | sed 's/^.\{1\}//' | awk -F "_" '$4 >=1000' > $i-largerthan1000.txt
done < ../filenames.txt
```

Now we can retrieve those large contigs using samtools:

```bash
while read i; do
echo $i
xargs samtools faidx $i-metavelvet.fasta < $i-largerthan1000.txt > $i-largerthan1000.fasta
done < ../filenames.txt
```

### Retrieving functional annotation and contig IDs

We will use [diamond](https://github.com/bbuchfink/diamond) for this. First we need a DIAMOND-formatted database. We can download the sequences in `fasta` format from the NCBI [ftp site](ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) using the [following script](/scripts/download-fasta-nt.sh) (note that DIAMOND builds its databases from protein sequences, so the protein `nr` FASTA is the appropriate reference rather than `nt`). We then align the contigs against the database with DIAMOND using [this script](/scripts/diamond.sh) and the loop below:

```bash
while read i; do
echo $i
qsub -v name=$i -N diamond_$i diamond.sh
sleep 1
done < ../filenames.txt
```

### Using Kraken with raw reads

First we create a folder called `kraken` and, inside it, a folder called `database`. Kraken was installed following [these guidelines](http://ccb.jhu.edu/software/kraken/MANUAL.html#installation). Jellyfish v1 is also required; it was installed following [these guidelines](http://www.cbcb.umd.edu/software/jellyfish/README) and version 1 can be downloaded [from here](http://www.cbcb.umd.edu/software/jellyfish/jellyfish-1.1.11.tar.gz). Download the Kraken database using [this script](/scripts/kraken-database.sh). Once the database is ready we can run `kraken` with the [following script](/scripts/kraken-run.sh):

```bash
while read i; do
echo $i
qsub -v name=$i -N kraken_$i kraken-run.sh
sleep 1
done < ../filenames.txt
```

## E-probe design

A [control file](/scripts/ProbeDesignCtrlFile.ctrl) is required to generate e-probes with EDNA; it holds the basic design parameters. The job can be submitted to a computer cluster using PBS with [this script](/scripts/probedes.sh). Alternatively, a cluster running Slurm can use the [following script](/scripts/probedes-slurm.sh).

## Running EDNA with the generated e-probes

To run EDNA 2.0 we need a [control file](/scripts/edna2.0-control.ctrl) that contains all the settings for the search to be performed. EDNA can then be run with a [PBS script](/scripts/ednarun.sh).

Raw data usually comes in `.fastq.gz` format, so it must be decompressed and converted from `fastq` to `fasta`. For this I made a [small bash script](/scripts/gztofasta.sh) that determines whether a file is `fastq` or `.fastq.gz` and converts it to `fasta` accordingly. The script has to be in the same folder and must be executable. To make it executable run:

```bash
chmod +x gztofasta.sh
```

In this case I have paired-end reads for 6 samples, which means I need to uncompress 12 files. I can use `gnu-parallel` for this as follows:

```bash
parallel --gnu ./gztofasta.sh ::: *gz
```

Alternatively, when there are many millions of reads per file you might want to submit the conversion as a [PBS script](/scripts/gztofasta-pbs.sh).

## Appendix and FAQ

:::info
**Find this document incomplete?** Leave a comment!
:::