---
title: 'Metatranscriptome analysis and e-probe design (Wheat rhizosphere)'
disqus: hackmd
tags: edna, wheat, metatranscriptomes
---
Metatranscriptome analysis and e-probe design (wheat rhizosphere)
===
[repository](https://github.com/andrese52/wheat-microbiome-beta)



## Table of Contents
[TOC]
Data flowchart
---
```sequence
Raw fastq Data (reads) --> Reads quality report: FastQC
Raw fastq Data (reads) --> Trimmed reads: Trimmomatic
Trimmed reads --> Normalized reads:interleave, normalize and split
Normalized reads --> Dehosted reads:bowtie2,samtools,bedtools
Dehosted reads --> Assembled metatranscriptome: Trinity, metavelvet, metaSpades
```
> Read more about sequence-diagrams here: http://bramp.github.io/js-sequence-diagrams/
> Read more about mermaid here: http://knsv.github.io/mermaid/
## Data Quality Check
First, let's copy all our data from its original directory to our working directory, which in this case is `raw-data`:
```bash=1
workdir=/project02/transgenomics/andrese52/edna-wheat/raw-data
for i in *.fq.gz
do
cp $i $workdir/$i
done
```
### FastQC
We can use FastQC for this. Move into the folder where all the fastq files are located and run FastQC as follows; the command below processes the files serially:
```bash=1
fastqc -t 12 *.fq.gz
```
Since we don't want to run the same code many times, remember this quote:
> I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it. [name=Bill Gates]
We can submit many `fastqc` jobs to a torque cluster as follows:
1. Create a job script and save it as `fastq.sh` in the same folder where your `fq.gz` files are saved.
```bash=1
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=12
#PBS -l walltime=1:00:00
#PBS -j oe
#PBS -m abe -M andres.espindola@okstate.edu
cd $PBS_O_WORKDIR
echo "YOUR fastq analysis IS" $name
fastqc -t 12 $name
```
2. Submit your jobs by targeting the fq.gz extension of your files
```bash=1
for i in *.fq.gz
do
qsub -v name=$i -N fastq_$i fastq.sh
echo $i
sleep 1
done
```
`-t 12` sets the number of FastQC threads and should match the number of processors requested in the job (`ppn=12`).
> Read more about FastQC here: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
## Trimming low quality reads
### Trimmomatic
This tool trims low-quality bases and adapter sequences from the reads. More information about its options can be found [here](http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf). The trimming must be run separately for each paired-end run.
1. Create a `PBS` job script named `trimomatic.sh` that includes information about the runs. Make sure the Phred encoding is +33 or +64 by checking the `fastqc` output.
Phred +33: `Sanger [0,40]`, `Illumina 1.8 [0,41]`, `Illumina 1.9 [0,41]` ([source](http://bioinfo.cipf.es/courses/mda13genomics/_media/mda13:qc_talk.pdf))
Phred +64: `Illumina 1.3 [0,40]`, `Illumina 1.5 [3,40]` ([source](http://bioinfo.cipf.es/courses/mda13genomics/_media/mda13:qc_talk.pdf))
```bash
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=12
#PBS -l walltime=12:00:00
#PBS -j oe
#PBS -m abe -M andres.espindola@okstate.edu
cd $PBS_O_WORKDIR
module load jdk/1.8.0_45
module load trimmomatic/0.33
java -jar /opt/trimmomatic/0.33/trimmomatic-0.33.jar PE -phred33 -trimlog $name-log.txt \
../raw-data/$name\_1.fq.gz ../raw-data/$name\_2.fq.gz \
$name\_output_paired_1.fq.gz $name\_output_unpaired_1.fq.gz $name\_output_paired_2.fq.gz $name\_output_unpaired_2.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:25
```
2. Create a list of your `paired end` run IDs that looks like the one below and save it as `run-ids.txt` in your `raw-data` directory. Keep this file in the `raw-data` folder because you will use it continuously.
```
T1R1
T1R2
T2R1R
T2R1S
T2R2R
T2R2S
```
3. Submit the jobs by using the `../raw-data/run-ids.txt` file.
```bash
while read i; do
qsub -v name=$i -N trimo_$i trimomatic.sh
sleep 1
done <../raw-data/run-ids.txt
```
## Normalizing reads
### 1. Interleaving reads and normalizing
The khmer script `interleave-reads.py` creates a single interleaved `.fq` file that we can then normalize.
Create a directory in the parent project called `normalized_reads`, then create a `PBS` script named `normalize.sh` that runs [`interleave-reads.py`](https://github.com/dib-lab/khmer/blob/master/scripts/interleave-reads.py) and [`normalize-by-median.py`](https://github.com/dib-lab/khmer/blob/master/scripts/normalize-by-median.py).
```bash
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=12
#PBS -l walltime=72:00:00
#PBS -j oe
#PBS -m abe -M andres.espindola@okstate.edu
cd $PBS_O_WORKDIR
module load khmer/1.4.1
interleave-reads.py ../trim/$name\_output\_paired\_1.fq.gz ../trim/$name\_output\_paired\_2.fq.gz -o $name\_paired.fq
normalize-by-median.py -C 30 -k 25 -N 4 -x 6e9 -p --savegraph $name\_norm.ct -o - $name\_paired.fq >> $name\_norm30x.fq
```
Submit the job as we submitted previous jobs
```bash
while read i; do
qsub -v name=$i -N norm_$i normalize.sh
sleep 1
done <../raw-data/run-ids.txt
```
### 2. Splitting normalized reads
Our output is a `*_norm30x.fq` file that contains interleaved forward and reverse reads. We want to split them for downstream analysis using [split-paired-reads.py](https://github.com/dib-lab/khmer/blob/master/scripts/split-paired-reads.py).
Create another folder called `splitted` inside the `normalized_reads` folder. Inside the newly created `splitted` folder, create a `PBS` script named `spliter.sh`.
```bash
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=12
#PBS -l walltime=10:00:00
#PBS -j oe
#PBS -m abe -M andres.espindola@okstate.edu
cd $PBS_O_WORKDIR
module load khmer/1.4.1
# split the interleaved, normalized reads back into separate left/right files;
# khmer's split-paired-reads.py writes <input>.1 and <input>.2 to the current directory by default
split-paired-reads.py ../$name\_norm30x.fq
mv $name\_norm30x.fq.1 $name\_left.fq
mv $name\_norm30x.fq.2 $name\_right.fq
```
Submit the job as we submitted previous jobs; in this case we need to go up two directories to reach `run-ids.txt`.
```bash
while read i; do
qsub -v name=$i -N split_norm_$i spliter.sh
sleep 1
done <../../raw-data/run-ids.txt
```
## De-hosting
Since we are mainly interested in microbial interactions, we want to remove as many host sequences as possible. We will use `bowtie2` for this. First we need to index the host genome with the following `PBS` script (`bowtie_dehost_index.sh`), which is run from a directory named `dehosting-index`.
```bash
#!/bin/bash
#PBS -N indexbluid_bowtie
#PBS -q bigmem
#PBS -l nodes=1:ppn=12
#PBS -l walltime=120:00:00
#PBS -j oe
#PBS -m abe -M andres.espindola@okstate.edu
cd $PBS_O_WORKDIR
module load bowtie2/2.3.4
bowtie2-build /project02/transgenomics/161010_Chinese_Spring_v1.0_pseudomolecules.fasta Spring_Wheat_index_bowtie2 --large-index --threads 12
```
### Bowtie2 Mapping to reference genome (Wheat)
Move to the working directory named `dehosting` and run the following script, called `dehosting.sh`, to map all the reads to the indexed reference genome.
```bash
#!/bin/bash
#PBS -q bigmem
#PBS -l nodes=1:ppn=12
#PBS -l walltime=120:00:00
#PBS -j oe
#PBS -m abe -M andres.espindola@okstate.edu
cd $PBS_O_WORKDIR
module load bowtie2/2.3.4
module load samtools/1.6
module load bedtools
# map the read pairs against the wheat index
bowtie2 -p 12 -x ../dehosting-index/Spring_Wheat_index_bowtie2 -1 ../normalized_reads/splitted/$name\_left.fq -2 ../normalized_reads/splitted/$name\_right.fq -S $name\_mapped_and_unmapped.sam
# convert SAM to BAM
samtools view -bS $name\_mapped_and_unmapped.sam > $name\_mapped_and_unmapped.bam
# keep only pairs where both mates are unmapped (-f 12) and drop secondary alignments (-F 256)
samtools view -b -f 12 -F 256 $name\_mapped_and_unmapped.bam > $name\_bothEndsUnmapped.bam
# sort by read name so mates are adjacent for bamtofastq
samtools sort -n $name\_bothEndsUnmapped.bam -o $name\_bothEndsUnmapped_sorted
# write the host-depleted read pairs back to fastq
bedtools bamtofastq -i $name\_bothEndsUnmapped_sorted -fq $name\_host_removed_left.fastq -fq2 $name\_host_removed_right.fastq
```
Now to submit the job for all the files we must run the following loop.
```bash
while read i; do
qsub -v name=$i -N dehosting_$i dehosting.sh
sleep 1
done <../raw-data/run-ids.txt
```
## Assembling with MetaVelvet
### Subsampling datasets to test Metavelvet
We want to test `metavelvet` for its capacity to rapidly and accurately assemble soil metatranscriptomes, so we take a random subsample of each original `fastq` dataset.
Run the following command from the folder that contains the fastq files:
```bash
for i in *left.fastq
do
suffix="_left*"
foo=${i%$suffix}
echo "${foo}">> filenames.txt
done
```
This creates a `filenames.txt` file that we can use with `seqtk` to subsample the reads. Place the [PBS script](/scripts/seqtk_subsample.sh) in the same directory and submit it with the following loop:
```bash
while read i; do
qsub -v name=$i -N subsample_$i seqtk_subsample.sh
sleep 1
done < filenames.txt
```
The script takes a random subsample of 100k reads from each of the de-hosted read files and writes it to a new file.
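The subsampling commands themselves live in the linked script; a minimal sketch of what `seqtk_subsample.sh` might contain is shown below (the `module` name, the seed, and the `_sub_` output filenames are assumptions, not the original script):
```bash
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=12
#PBS -l walltime=1:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
module load seqtk   # assumed module name; adjust to your cluster

# use the same seed (-s) for both mates so the subsampled reads stay paired
seqtk sample -s 11 $name\_left.fastq 100000 > $name\_sub_left.fastq
seqtk sample -s 11 $name\_right.fastq 100000 > $name\_sub_right.fastq
```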
#### Checking if the subsampled paired end reads are correct
We can use [fastq_pair](https://github.com/linsalrob/fastq-pair) as follows:
```bash
fastq_pair -t 1000 left.fastq right.fastq
```
We can automate the above check by creating a [PBS script](/scripts/subsample_check.sh) that runs `fastq_pair` on each pair. We can submit these jobs as follows:
```bash
while read i; do
qsub -v name=$i -N checksub_$i subsample_check.sh
sleep 1
done < filenames.txt
```
In the output, the `paired` files contain the reads that are properly paired and the `single` files contain only `singletons`.
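For reference, a possible sketch of `subsample_check.sh`, assuming `fastq_pair` is on the `PATH` and using the hypothetical `_sub_` filenames from the subsampling sketch above:
```bash
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=12
#PBS -l walltime=1:00:00
#PBS -j oe
cd $PBS_O_WORKDIR

# re-pair the subsampled mates; fastq_pair writes *.paired.fq and *.single.fq files
# (-t is the hash table size, set roughly to the number of reads)
fastq_pair -t 100000 $name\_sub_left.fastq $name\_sub_right.fastq
```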
### Running `metavelvet` on 100k paired-end reads
In some cases Trinity might not work for assembling large metatranscriptome datasets, so we have decided to go with metavelvet.
To run metavelvet we must run `velveth`, `velvetg`, and then `meta-velvetg`. I have created a [PBS script](/scripts/met-assembly.sh) that contains all of these commands; a sketch of its core steps is shown below.
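For reference, a sketch of the core assembly steps such a script might contain; the k-mer size, insert length, output directory name, and input filenames are assumptions, so check them against the actual /scripts/met-assembly.sh:
```bash
#!/bin/bash
#PBS -q bigmem
#PBS -l nodes=1:ppn=12
#PBS -l walltime=24:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
# load velvet/meta-velvet modules as available on your cluster

# 1) hash the subsampled paired reads (k-mer size 31 is an assumption)
velveth $name-out 31 -fastq -shortPaired -separate $name\_sub_left.fastq $name\_sub_right.fastq

# 2) build the initial de Bruijn graph with read tracking enabled
velvetg $name-out -exp_cov auto -read_trkg yes

# 3) partition the graph by coverage and write meta-velvetg.contigs.fa
meta-velvetg $name-out -ins_length 300
```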
With the `filenames.txt` metadata for the `fastq` files in hand, we can submit the assembly jobs to the queue using the [PBS script](/scripts/met-assembly.sh):
```bash
while read i; do
qsub -v name=$i -N metavelvet_$i met-assembly.sh
sleep 1
done < filenames.txt
```
Running the previous jobs took only about 2 minutes because we used a random subset of the total reads. The average coverage for these runs was 1; ideally, when assembling the full read set, coverage should increase to at least 10x.
### Assembling with total reads
We must use a [different PBS script](/scripts/met-assembly-fullreads.sh) in this case; only the input filenames change.
We submit the jobs as follows:
```bash
while read i; do
qsub -v name=$i -N metavelvet_$i met-assembly-fullreads.sh
sleep 1
done < filenames.txt
```
### Assessing assembly quality
We can assess assembly quality with [quast](http://bioinf.spbau.ru/quast). Let's create a folder with soft links to all the assemblies.
```bash
mkdir assemblies
```
Then create soft links
```bash
while read i; do
currentdir=`pwd`
ln -s $currentdir/$i-out-fullreads/meta-velvetg.contigs.fa $currentdir/assemblies/$i-metavelvet.fasta
done < filenames.txt
```
Finally, from inside the `assemblies` folder, we can run `quast` as follows:
```bash
quast.py *.fasta
```
This will create a PDF report with metrics about the assemblies.
### Blasting the largest contig on each subsample
First we need the length of the largest contig in each assembly. We can take these lengths from the `quast` output and then `grep` the fasta files for contigs of that length. Create a file with the largest-contig lengths and save it as `largestcontigs.txt`:
```
T1R1_host_removed 6165
T1R2_host_removed 6918
T2R1R_host_removed 9451
T2R1S_host_removed 5999
T2R2R_host_removed 5698
T2R2S_host_removed 6202
```
Now we can retrieve the contig IDs that match the lengths above. Keep in mind that these lengths had to be taken from `meta-velvetg.LastGraph-stats.txt` because `quast` did not report the same length for the largest contig.
```bash
while read i; do
name=$(echo $i|awk '{print $1}')
length=$(echo $i|awk '{print $2}')
echo $name $length
# find the fasta header with the target length, strip the leading '>' and append the sample name
grep "\<NODE.*length_$length.*\>" $name-metavelvet.fasta | sed "s/\(.*\)/\1 $name/" | sed 's/^.\{1\}//' >> transcriptid.txt
done < largestcontigs.txt
```
Now retrieve the largest contig from each assembly with `samtools faidx`:
```bash
while read i; do
id=$(echo $i|awk '{print $1}')
name=$(echo $i|awk '{print $2}')
echo $name $id
samtools faidx $name-metavelvet.fasta $id > $name-largestcontig.fasta
done < transcriptid.txt
```
### Retrieving contigs larger than 1000 bp
We must be in the `assemblies` folder to run the following command. It works only with velvet-style assemblies because their fasta headers encode the contig length (`NODE_<id>_length_<len>_cov_<cov>`).
```bash
while read i; do
grep "\<NODE.*\>" $i-metavelvet.fasta | sed 's/^.\{1\}//' | awk -F "_" '$4 >=1000' > $i-largerthan1000.txt
done < ../filenames.txt
```
Now we can retrieve those large contigs using samtools.
```bash
while read i; do
echo $i
xargs samtools faidx $i-metavelvet.fasta < $i-largerthan1000.txt > $i-largerthan1000.fasta
done < ../filenames.txt
```
### Retrieving functional annotations and contig IDs
We will use [diamond](https://github.com/bbuchfink/diamond) for this purpose.
First we need a database formatted by diamond. We can download the `nt` database in `fasta` format from the NCBI [ftp site](ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) using the [following script](/scripts/download-fasta-nt.sh).
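For reference, the core of `download-fasta-nt.sh` might be no more than fetching and decompressing the fasta from that ftp site (a sketch, not the actual script):
```bash
#!/bin/bash
# hypothetical sketch: download the nt database in fasta format and decompress it
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz
```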
We then align the contigs against the database with diamond by submitting [this script](/scripts/diamond.sh) with the loop below; a sketch of the script's core steps follows the loop.
```bash
while read i; do
echo $i
qsub -v name=$i -N diamond_$i diamond.sh
sleep 1
done < ../filenames.txt
```
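For reference, the core DIAMOND steps inside `diamond.sh` might look like the sketch below. Note that DIAMOND builds its databases from protein sequences, so this sketch assumes a protein fasta such as `nr`; the module name, database name, and output filename are placeholders.
```bash
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=12
#PBS -l walltime=24:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
module load diamond   # assumed module name

# one-time step: format the reference protein fasta into a DIAMOND database
diamond makedb --in nr.faa -d nr --threads 12

# translated search of the large contigs against the protein database (tabular output)
diamond blastx -d nr -q $name-largerthan1000.fasta -o $name-diamond.tsv --threads 12
```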
### Using Kraken with raw reads
First we create a folder called `kraken` and, inside it, a folder called `database`. Kraken was installed following [these guidelines](http://ccb.jhu.edu/software/kraken/MANUAL.html#installation).
Jellyfish v1 is also required; I installed it following [these guidelines](http://www.cbcb.umd.edu/software/jellyfish/README). Version 1 can be downloaded [from here](http://www.cbcb.umd.edu/software/jellyfish/jellyfish-1.1.11.tar.gz).
Download the kraken database using [this script](/scripts/kraken-database.sh).
Once the database is ready we can run `kraken` without any issue using the [following script](/scripts/kraken-run.sh), sketched below after the submission loop.
```bash
while read i; do
echo $i
qsub -v name=$i -N kraken_$i kraken-run.sh
sleep 1
done < ../filenames.txt
```
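For reference, a hedged sketch of what `kraken-database.sh` and `kraken-run.sh` might contain; the database path, walltime, and input filenames are assumptions:
```bash
#!/bin/bash
#PBS -q bigmem
#PBS -l nodes=1:ppn=12
#PBS -l walltime=48:00:00
#PBS -j oe
cd $PBS_O_WORKDIR

DB=../kraken/database   # assumed location of the kraken database

# kraken-database.sh would contain something like:
# kraken-build --standard --db $DB --threads 12

# classify the de-hosted paired reads (input filenames are assumptions)
kraken --db $DB --threads 12 --fastq-input --paired \
  $name\_left.fastq $name\_right.fastq --output $name.kraken

# summarize the classifications per taxon
kraken-report --db $DB $name.kraken > $name.kraken.report
```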
## E-probe design
A [control file](/scripts/ProbeDesignCtrlFile.ctrl) is required to generate e-probes with EDNA; it contains the basic metrics that EDNA uses for probe design. The job can be submitted to a computer cluster using PBS with [this script](/scripts/probedes.sh); alternatively, a cluster running slurm can use the [following script](/scripts/probedes-slurm.sh).
## Running EDNA with the generated e-probes
To run EDNA 2.0 we need a [control file](/scripts/edna2.0-control.ctrl) that contains all the metrics for the search to be performed. EDNA can then be run with a [PBS script](/scripts/ednarun.sh).
Raw data usually comes in the `.fastq.gz` format, so it must be decompressed and converted from `fastq` to `fasta`. For this purpose I have made a [small bash script](/scripts/gztofasta.sh) that determines whether a file is `fastq` or `.fastq.gz` and converts it to `fasta` accordingly; a sketch of what it might look like is shown below. The script has to be in the same folder and must be executable.
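A minimal sketch of what `gztofasta.sh` could look like (an assumption of the actual script's contents; it uses GNU `sed` for the fastq-to-fasta conversion):
```bash
#!/bin/bash
# hypothetical sketch of gztofasta.sh: convert one fastq or fastq.gz file to fasta
f=$1
base=${f%.gz}        # strip .gz if present
base=${base%.*}      # strip the .fastq/.fq extension
if [[ $f == *.gz ]]; then
    zcat "$f" | sed -n '1~4s/^@/>/p;2~4p' > "$base.fasta"
else
    sed -n '1~4s/^@/>/p;2~4p' "$f" > "$base.fasta"
fi
```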
To make the script executable run the following command:
```bash
chmod +x gztofasta.sh
```
In this case I have paired-end reads for 6 samples, which means 12 files need to be uncompressed in total. I can use `gnu-parallel` for this as follows:
```bash
parallel --gnu ./gztofasta.sh ::: *gz
```
Alternatively, when there are many millions of reads per file, you may want to submit the conversion as a [pbs script](/scripts/gztofasta-pbs.sh).
## Appendix and FAQ
:::info
**Find this document incomplete?** Leave a comment!
:::