# PLASTICIDADE - EUSCHISTUS (DENIS)
## Table of Contents
[TOC]
## Basic commands
Conectando no servidor
```
ssh denis@143.107.244.181 -p 2205
```
Transferencia de arquivos de servidor para local
```
#de local para servidor
#a sintaxe é simples: scp -P <porta> <caminho e arquivo de origem> <diretorio de destino>
sudo scp -P 2205 /home/bruno/GoogleDrive/[ARTIGO]Denis/T26_1.fq.gz denis@143.107.244.181:/home/denis/raw_seqs
#de servidor para servidor
sudo scp caminho/do/arquivo.txt usuário@servidor-de-destino:/caminho/destino/
#At: /home/denis/results/Trim_Reads
sudo scp ./T4_1.trim.fq.gz cunha@lem.ib.usp.br:/home/cunha/denis/raw_seqs
```
Checar md5
```
md5sum <arquivo>
md5sum * #checar todos de uma vez
md5sum * > checklist.chk # generates a list of checksums for any file that matches *
md5sum -c checklist.chk # runs through the list to check them
```
Análise em screen
```
#rodar análises em screen
screen
#iniciar um screen e nomea-lo
screen -S session_name
#sair do screen sem fechar
ctrl+a+d
#encerrar o screen em definitivo
exit
#recuperar screen
screen -r
#se tiver mais de um aberto
screen -r <numeros>
```
Verificar se um programa está instalado (which is a BASH program that looks through everything you have installed, and tells you what folder it is installed to. If it can’t find the program you asked for, it returns nothing, i.e. gives you no results.)
```
which <programa>
```
Verificar a versão de um programa
```
<programa> -v
```
Instalar um programa
```
sudo apt-get install <programa>
```
man (short for manual) displays detailed documentation (also referred as man page or man file) for bash commands. It is a powerful resource to explore bash commands, understand their usage and flags.
```
man ls
```
In most commands the flags can be combined together in no particular order to obtain the desired results/output.
```
ls -Fa
ls -laF
```
Contar o número de linhas em um arquivo
```
wc -l <arquivo>
```
## Description of the data
| Sample | Sex ratio | Batch | QC result |
|:------:|:---------:|:-----:|:---------:|
| 4 | 2MF | A | Hold |
| 7 | 2MF | A | Hold |
| 26 | M2F | A | Hold |
| 29 | 2MF | B | Pass |
| 30 | 2MF | C | Pass |
| 31 | M2F | B | Pass |
| 32 | 2MF | C | Pass |
| 34 | M2F | C | Pass |
| 36 | M2F | C | Fail |
| 37 | M2F | C | Pass |
| 38 | M2F | C | Fail |
| 39 | 2MF | C | Pass |
Files:
T4_1.fq.gz
T4_2.fq.gz
T7_1.fq.gz
T7_2.fq.gz
T26_1.fq.gz
T26_2.fq.gz
T29_1.fq.gz
T29_2.fq.gz
T30_1.fq.gz
T30_2.fq.gz
T31_1.fq.gz
T31_2.fq.gz
T32_1.fq.gz
T32_2.fq.gz
T34_1.fq.gz
T34_2.fq.gz
T36_1.fq.gz
T36_2.fq.gz
T37_1.fq.gz
T37_2.fq.gz
T38_1.fq.gz
T38_2.fq.gz
T39_1.fq.gz
T39_2.fq.gz
Our samples are paired-end. Paired end reads are produced when the fragment size used in the sequencing process is much longer (typically 250 - 500 bp long) and the ends of the fragment are read in towards the middle. This produces two “paired” reads. One from the left hand end of a fragment and one from the right with a known separation distance between them. (The known separation distance is actually a distribution with a mean and standard deviation as not all original fragments are of the same length.) This extra information contained in the paired end reads can be useful for helping to tie pieces of sequence together during the assembly process.

## 0 - Quality control of raw reads
FastQC will will process one sample at a time and give you an output report for each sample separately. Don’t need to unzip raw read files because fastqc can cope with zipped files (.gz). MultiQC will combine all the outputs from FastQC analysis and give you one QC report for all processed samples, making them more easily comparable.
```
fastqc <arquivo>
fastqc *.fq.gz #analisar todos de uma vez
multiqc . #the dot stands for the local directory, and is obligatory.
mv *fastqc* /dados/home/denis/results/fastQC
#Passando o arquivo html para o local (the command is run from your local machine)
scp -P 2205 denis@143.107.244.181:/home/denis/results/fastQC/multiqc_report.html .
```
About FastQC:
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/
About MultiQC:
https://multiqc.info/
Resultados:



Este gráfico indica que as amostras provavelmente estão contaminadas:

"Warnings in this module usually indicate a problem with the library. Sharp peaks on an otherwise smooth distribution are normally the result of a specific contaminant (adapter dimers for example), which may well be picked up by the overrepresented sequences module. Wider or multiple peaks may represent contamination with a different species."
## 1 - Preprocessing (trimming)
We will use a program called Trimmomatic to filter poor quality reads and trim poor quality bases from our samples.
About Trimmomatic:
https://datacarpentry.org/wrangling-genomics/03-trimming/
http://www.usadellab.org/cms/index.php?page=trimmomatic
Manual:
http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
We must first specify whether we have paired end (PE) or single end (SE) reads. You can specify threads to indicate the number of processors on your computer that you want Trimmomatic to use. In most cases using multiple threads (processors) can help to run the trimming faster.
```
#Running Trimmomatic version 0.39:
for infile in *_1.fq.gz
> do
> base=$(basename ${infile} _1.fq.gz)
> trimmomatic PE -threads 12 -phred33 ${infile} ${base}_2.fq.gz \
> ${base}_1.trim.fq.gz ${base}_1un.trim.fq.gz \
> ${base}_2.trim.fq.gz ${base}_2un.trim.fq.gz \
> SLIDINGWINDOW:4:20 MINLEN:36 ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10
> done
```
I tested quality control again (for only paired reads):
```
fastqc *1.trim*
fastqc *2.trim*
mv *fastqc* ~/results/fastQc_Trim_Reads/
multiqc .
```
Results:




As amostras continuam contaminadas:

## 2 - Transcriptome assembly
The goal of a sequence assembler is to produce long contiguous pieces of sequence (contigs) from reads. The contigs are sometimes then ordered and oriented in relation to one another to form scaffolds. The distances between pairs of a set of paired end reads is useful information for this purpose. Even if a reference genome is available, de novo assembly should be performed, as it can recover transcripts that are transcribed from segments of the genome that are missing from the reference genome assembly.
### 2.1 - Genome-free De novo transcriptome assembly
RNA sequencing data is typically used for gene expression analysis via mapping reads
to a reference genome. However, for organisms without a high-quality reference genome
or gene annotation, de novo transcriptome assembly is a viable alternative
We will be using Spades v3.13.0, which contains the pipeline rnaSPades. rnaSPAdes is a tool for de novo transcriptome assembly from RNA-Seq data and is suitable for all kind of organisms. It accepts gzip-compressed files.
Manual:
https://cab.spbu.ru/files/release3.14.1/rnaspades_manual.html
```
#assembling a transcriptome for each sample
nano script_spades
rnaspades.py -1 T4_1.trim.fq.gz -2 T4_2.trim.fq.gz -o rnaSpades_T4
rnaspades.py -1 T7_1.trim.fq.gz -2 T7_2.trim.fq.gz -o rnaSpades_T7
rnaspades.py -1 T26_1.trim.fq.gz -2 T26_2.trim.fq.gz -o rnaSpades_T26
rnaspades.py -1 T29_1.trim.fq.gz -2 T29_2.trim.fq.gz -o rnaSpades_T29
rnaspades.py -1 T31_1.trim.fq.gz -2 T31_2.trim.fq.gz -o rnaSpades_T31
rnaspades.py -1 T30_1.trim.fq.gz -2 T30_2.trim.fq.gz -o rnaSpades_T30
rnaspades.py -1 T32_1.trim.fq.gz -2 T32_2.trim.fq.gz -o rnaSpades_T32
rnaspades.py -1 T34_1.trim.fq.gz -2 T34_2.trim.fq.gz -o rnaSpades_T34
rnaspades.py -1 T36_1.trim.fq.gz -2 T36_2.trim.fq.gz -o rnaSpades_T36
rnaspades.py -1 T37_1.trim.fq.gz -2 T37_2.trim.fq.gz -o rnaSpades_T37
rnaspades.py -1 T38_1.trim.fq.gz -2 T38_2.trim.fq.gz -o rnaSpades_T38
rnaspades.py -1 T39_1.trim.fq.gz -2 T39_2.trim.fq.gz -o rnaSpades_T39
bash script_spades
```
```
#assembling a transcriptome using all samples
rnaspades.py --pe1-1 T26_1.trim.fq.gz --pe1-2 T26_2.trim.fq.gz --pe2-1 T29_1.trim.fq.gz --pe2-2 T29_2.trim.fq.gz --pe3-1 T31_1.trim.fq.gz --pe3-2 T31_2.trim.fq.gz --pe4-1 T4_1.trim.fq.gz --pe4-2 T4_2.trim.fq.gz --pe5-1 T7_1.trim.fq.gz --pe5-2 T7_2.trim.fq.gz -o Transcriptome_denovo
```
````
#assembling a transcriptome including the new samples
#An alternative way to specify an input data set for SPAdes is to create a YAML data set file. By using a YAML file you can provide an unlimited number of paired-end libraries.
nano libraries.yaml
[
{
orientation: "fr",
type: "paired-end",
right reads: [
"/home/denis/results/Trim_Reads/T4_1.trim.fq.gz",
"/home/denis/results/Trim_Reads/T7_1.trim.fq.gz",
"/home/denis/results/Trim_Reads/T26_1.trim.fq.gz",
"/home/denis/results/Trim_Reads/T29_1.trim.fq.gz",
"/home/denis/results/Trim_Reads/T30_1.trim.fq.gz",
"/home/denis/results/Trim_Reads/T31_1.trim.fq.gz",
"/home/denis/results/Trim_Reads/T32_1.trim.fq.gz",
"/home/denis/results/Trim_Reads/T34_1.trim.fq.gz",
"/home/denis/results/Trim_Reads/T36_1.trim.fq.gz",
"/home/denis/results/Trim_Reads/T37_1.trim.fq.gz",
"/home/denis/results/Trim_Reads/T38_1.trim.fq.gz",
"/home/denis/results/Trim_Reads/T39_1.trim.fq.gz"
],
left reads: [
"/home/denis/results/Trim_Reads/T4_2.trim.fq.gz",
"/home/denis/results/Trim_Reads/T7_2.trim.fq.gz",
"/home/denis/results/Trim_Reads/T26_2.trim.fq.gz",
"/home/denis/results/Trim_Reads/T29_2.trim.fq.gz",
"/home/denis/results/Trim_Reads/T30_2.trim.fq.gz",
"/home/denis/results/Trim_Reads/T31_2.trim.fq.gz",
"/home/denis/results/Trim_Reads/T32_2.trim.fq.gz",
"/home/denis/results/Trim_Reads/T34_2.trim.fq.gz",
"/home/denis/results/Trim_Reads/T36_2.trim.fq.gz",
"/home/denis/results/Trim_Reads/T37_2.trim.fq.gz",
"/home/denis/results/Trim_Reads/T38_2.trim.fq.gz",
"/home/denis/results/Trim_Reads/T39_2.trim.fq.gz"
]
}
]
rnaspades.py --dataset libraries.yaml -o Transcriptome_denovo2
````
#### Doing the same in Trinity to compare
````
#At: /home/denis/results/Trim_Reads
Trinity --max_memory 50G --seqType fq --left T4_1.trim.fq.gz,T7_1.trim.fq.gz,T26_1.trim.fq.gz,T29_1.trim.fq.gz,T30_1.trim.fq.gz,T31_1.trim.fq.gz,T32_1.trim.fq.gz,T34_1.trim.fq.gz,T36_1.trim.fq.gz,T37_1.trim.fq.gz,T38_1.trim.fq.gz,T39_1.trim.fq.gz --right T4_2.trim.fq.gz,T7_2.trim.fq.gz,T26_2.trim.fq.gz,T29_2.trim.fq.gz,T30_2.trim.fq.gz,T31_2.trim.fq.gz,T32_2.trim.fq.gz,T34_2.trim.fq.gz,T36_2.trim.fq.gz,T37_2.trim.fq.gz,T38_2.trim.fq.gz,T39_2.trim.fq.gz --CPU 15 --min_contig_length 199 --full_cleanup
````
### 2.2 - Genome-Guided De novo transcriptome Assembly
We will be using Trinity v2.1.1. If a genome sequence is available, Trinity offers a method whereby reads are first aligned to the genome, partitioned according to locus, followed by de novo transcriptome assembly at each locus. In this use-case, the genome is only being used as a substrate for grouping overlapping reads into clusters that will then be separately fed into Trinity for de novo transcriptome assembly. This is very much unlike typical genome-guided approaches (eg. cufflinks) where aligned reads are stitched into transcript structures and where transcript sequences are reconstructed based on the reference genome sequence. Here, transcripts are reconstructed based on the actual read sequences.
Why do this? You may have a reference genome, but your sample likely comes from an organism with a genome that isn't an exact match to the reference genome. Genome-guided de novo assembly should capture the sequence variations contained in your RNA-Seq sample in the form of the transcripts that are de novo reconstructed. In comparison to genome-free de novo assembly, it can also help in cases where you have paralogs or other genes with shared sequences, since the genome is used to partition the reads according to locus prior to doing any de novo assembly. If you have a highly fragmented draft genome, then you are likely better off performing a genome-free de novo transcriptome assembly.
Users must provide read alignments to Trinity as a coordinate-sorted bam file. Use GSNAP, TopHat, STAR or other favorite RNA-Seq read alignment tool to generate the bam file, and be sure it's coordinate sorted by running 'samtools sort' on it.
We will be using STAR v2.7.10a.
Manual:
https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf
Aligning reads using STAR is a two step process:
#### Generating genome indeces files
````
#creating a directory for the indices
mkdir star_index
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /home/denis/Ref_genome/star_index --genomeFastaFiles /home/denis/Ref_genome/GCA_003667255.1_E_heros_v1.0_sep_2018_genomic.fna
````
#### Mapping reads to the genome
````
mkdir STAR
#At: /home/denis/results/Trim_Reads
#Creating a loop to go through the entire directory and run each aligment in serial:
for i in *_1.trim.fq.gz
> do
> STAR --genomeDir /home/denis/Ref_genome/star_index \
> --runThreadN 10 \
> --readFilesCommand zcat \
> --readFilesIn $i ${i%_1.trim.fq.gz}_2.trim.fq.gz \
> --outFileNamePrefix /home/denis/results/STAR/${i%_1.trim.fq.gz} \
> --outSAMtype BAM SortedByCoordinate \
> --outSAMunmapped Within #output unmapped reads within the main SAM file (i.e. Aligned.out.sam)
> done
#Mapping all the samples togheter
STAR --genomeDir /home/denis/Ref_genome/star_index \
--runThreadN 10 \
--readFilesCommand zcat \
--readFilesIn T26_1.trim.fq.gz,T29_1.trim.fq.gz,T31_1.trim.fq.gz,T4_1.trim.fq.gz,T7_1.trim.fq.gz T26_2.trim.fq.gz,T29_2.trim.fq.gz,T31_2.trim.fq.gz,T4_2.trim.fq.gz,T7_2.trim.fq.gz \
--outFileNamePrefix /home/denis/results/STAR/All \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within
````
#### Assessing Alignment Quality
````
#At: /home/denis/results/STAR
multiqc .
#Passando o arquivo html para o local
scp -P 2205 denis@143.107.244.181:/home/denis/results/STAR/multiqc_report.html .
````
Results:

#### Running Trinity
About Trinity
https://github.com/trinityrnaseq/trinityrnaseq/wiki/Running-Trinity
````
Trinity --genome_guided_bam AllAligned.sortedByCoord.out.bam --max_memory 50G --genome_guided_max_intron 10000 --CPU 15
````
## 3 - Assessing transcriptome assembly quality with busco
BUSCO (Benchmarking Universal Single-Copy Orthologs) provides measures for quantitative assessment of genome assembly, gene set, and transcriptome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs.
Manual:
https://busco.ezlab.org/busco_userguide.html#conda-package
Creating a new environment with BUSCO installed
```
conda create -n teste -c conda-forge -c bioconda busco=5.3.1
# To activate this environment, use
conda activate teste
# To deactivate an active environment, use
conda deactivate
```
Running BUSCO 5.3.1 for the transcriptome of each sample
```
# I am here now: ~/results/Trasncriptome_assembly
#vamos usar como input o transcriptoma com maior númeor de contigs (soft filtered)
nano script_busco
busco -i rnaSpades_T4/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_T4 -m transcriptome
busco -i rnaSpades_T7/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_T7 -m transcriptome
busco -i rnaSpades_T26/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_T26 -m transcriptome
busco -i rnaSpades_T29/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_T29 -m transcriptome
busco -i rnaSpades_T31/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_T31 -m transcriptome
busco -i rnaSpades_T30/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_T30 -m transcriptome
busco -i rnaSpades_T32/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_T32 -m transcriptome
busco -i rnaSpades_T34/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_T34 -m transcriptome
busco -i rnaSpades_T36/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_T36 -m transcriptome
busco -i rnaSpades_T37/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_T37 -m transcriptome
busco -i rnaSpades_T38/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_T38 -m transcriptome
busco -i rnaSpades_T39/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_T39 -m transcriptome
bash script_busco #tive que ativar o ambiente teste novamente dentro da screen
```
Now we use a python script (generate_plot.py) that builds R scripts for the graphs, this will build an .R file that may be directly run in R.
```
#first create a folder and then copy the BUSCO short summary file from each of the runs you want to plot into this folder.
mkdir BUSCO_summaries
cp short_summary.specific.hemiptera_odb10.busco_T4.txt ../BUSCO_summaries/
cp short_summary.specific.hemiptera_odb10.busco_T7.txt ../BUSCO_summaries/
cp short_summary.specific.hemiptera_odb10.busco_T26.txt ../BUSCO_summaries/
cp short_summary.specific.hemiptera_odb10.busco_T29.txt ../BUSCO_summaries/
cp short_summary.specific.hemiptera_odb10.busco_T31.txt ../BUSCO_summaries/
#Then simply run the script giving as argument the name of the folder you created containing the summaries you wish to plot.
python3 /home/denis/anaconda3/envs/teste/bin/generate_plot.py -wd BUSCO_summaries
#Passando os arquivos gerados para o local
scp -P 2205 denis@143.107.244.181:/home/denis/results/Busco/BUSCO_summaries/busco_figure.png .
scp -P 2205 denis@143.107.244.181:/home/denis/results/Busco/BUSCO_summaries/busco_figure.R .
```
results:

Running BUSCO 5.3.1 for the denovo and genome-guided transcriptomes, and for E. heros genome
```
#At: ~/results/Trasncriptome_assembly
busco -i Transcriptome_denovo/soft_filtered_transcripts.fasta -l hemiptera_odb10 -o busco_denovo -m transcriptome
busco -i Transcriptome_gguided/trinty_out_dir/Trinity-GG.fasta -l hemiptera_odb10 -o busco_gguided -m transcriptome
#At: ~/Ref_genome
busco -i GCA_003667255.1_E_heros_v1.0_sep_2018_genomic.fna -l hemiptera_odb10 -o busco_genome -m genome
#At: ~/results/Busco
python3 /home/denis/anaconda3/envs/teste/bin/generate_plot.py -wd BUSCO_summaries2
```
results:

## 4 - Keeping only the longest isoforms
many of contigs assembled are “repeated” versions of the same gene with little variation (i.e. in few nucleotides). Although these repetitions may be biologically significant (i.e. isoforms), probably most of them are assembly artifacts. With this step we remove those, keeping only the longest isoforms.
````
#Spades
#Passando o script para o servidor
scp -P 2205 ./get_longest_isoforms_rnaspades.sh denis@143.107.244.181:/home/denis/results/Trasncriptome_assembly/Transcriptome_denovo
#Rodando o script
bash ./get_longest_isoforms_rnaspades.sh hard_filtered_transcripts.fasta
````
````
#Trinity
perl ./get_longest_isoform_seq_per_trinity_gene.pl Trinity.fasta > Trinity.longest
````
## 5 - Transcriptome Annotation
Transcriptome annotation provides insight into the function and biological process of transcripts and the proteins they encode.
### 5.1 - Transcriptome Annotation with Trinotate
Trinotate v3.2.2 is a comprehensive annotation suite designed for automatic functional annotation of transcriptomes, particularly de novo assembled transcriptomes, from model or non-model organisms.
About Trinotate:
https://github.com/Trinotate/Trinotate.github.io/blob/master/index.asciidoc
#### Obtaining the protein database files by running this Trinotate build process.
````
/home/denis/anaconda3/bin/Build_Trinotate_Boilerplate_SQLite_db.pl Trinotate
#Prepare the protein database for blast searches by:
makeblastdb -in uniprot_sprot.pep -dbtype prot
#Uncompress and prepare the Pfam database for use with hmmscan like so:
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
````
#### Extracting the longest open reading frames
````
TransDecoder.LongOrfs -t /home/denis/results/Trasncriptome_assembly/Trinity_denovo/Trinity.longest
````
#### predicting the likely coding region
```
TransDecoder.Predict -t /home/denis/results/Trasncriptome_assembly/Trinity_denovo/Trinity.longest --retain_pfam_hits /home/denis/results/HMMER/TrinotatePFAM.out --retain_blastp_hits /home/denis/results/Blast/Trinity_transcriptome/blastp.outfmt6
```
The set of candidate coding regions can be found as files '.transdecoder.' where extensions include .pep, .cds, .gff3, and .bed
#### Capturing BLAST Homologies
````
#Search Trinity transcripts
blastx -query /home/denis/results/Trasncriptome_assembly/Trinity_denovo/Trinity.longest -db /home/denis/results/Annotation/Trinotate/uniprot_sprot.pep -num_threads 15 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6
#I got this warning: "[blastx] Examining 5 or more matches is recommended", but it does not stop the job.
# search Transdecoder-predicted proteins
blastp -query /home/denis/results/TransDecoder/Trinity_transcriptome/Trinity.longest.transdecoder.pep -db /home/denis/results/Annotation/Trinotate/uniprot_sprot.pep -num_threads 15 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
````
#### Running HMMER to identify protein domains
````
hmmscan --cpu 20 --domtblout TrinotatePFAM.out /home/denis/results/Annotation/Trinotate/Pfam-A.hmm /home/denis/results/TransDecoder/Trinity_transcriptome/Trinity.longest.transdecoder.pep > pfam.log
````
#### Running signalP to predict signal peptides
````
signalp -f short -n signalp2.out /home/denis/results/TransDecoder/Trinity_transcriptome/Trinity.longest.transdecoder.pep
````
#### Running tmHMM to predict transmembrane regions
````
tmhmm --short < /home/denis/results/TransDecoder/Trinity_transcriptome/Trinity.longest.transdecoder.pep > tmhmm.out
````
#### Running RNAMMER to identify rRNA transcripts
````
RnammerTranscriptome.pl --transcriptome /home/denis/results/Trasncriptome_assembly/Trinity_denovo/Trinity.longest --path_to_rnammer /home/denis/results/RNAmmer/Installation/rnammer
````
#### Loading Above Results into a Trinotate SQLite Database
````
#1. Load transcripts and coding regions
/home/denis/anaconda3/bin/get_Trinity_gene_to_trans_map.pl /home/denis/results/Trasncriptome_assembly/Trinity_denovo/Trinity.longest > gene_trans_map
Trinotate Trinotate.sqlite2 init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta /home/denis/results/Trasncriptome_assembly/Trinity_denovo/Trinity.longest --transdecoder_pep /home/denis/results/TransDecoder/Trinity_transcriptome/Trinity.longest.transdecoder.pep
````
````
#2. Loading BLAST homologies
# load protein hits
Trinotate Trinotate.sqlite LOAD_swissprot_blastp /home/denis/results/Blast/Trinity_transcriptome/blastp.outfmt6
# load transcript hits
Trinotate Trinotate.sqlite LOAD_swissprot_blastx /home/denis/results/Blast/Trinity_transcriptome/blastx.outfmt6
````
````
#3. Load Pfam domain entries
Trinotate Trinotate.sqlite LOAD_pfam /home/denis/results/HMMER/TrinotatePFAM.out
````
````
#4. Load transmembrane domains
Trinotate Trinotate.sqlite LOAD_tmhmm /home/denis/results/tmHMM/tmhmm.out
````
````
#5. Load signal peptide predictions
Trinotate Trinotate.sqlite LOAD_signalp /home/denis/results/signalP/signalp.out
````
#### Output an Annotation Report
````
Trinotate Trinotate.sqlite report > trinotate_annotation_report2.xls
````
### 5.2 - Transcriptome Annotation with eggNOG mapper
We will be using eggnog-mapper v2.1.7 which is a tool for fast functional annotation of novel sequences (genes or proteins) using precomputed eggNOG-based orthology assignments. To start an annotation job, provide a FASTA file containing your query sequences (-i option), specify a project name which will be used as a prefix for all the output files (-o option), and run emapper.py
About eggnog-mapper:
https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.7#Conda_bioconda_channel_version
I installed version 2.1.7 using a conda environment.
````
conda create -n eggnog -c bioconda -c conda-forge eggnog-mapper
conda activate eggnog
conda deactivate
````
#### Running emapper.py
````
emapper.py --cpu 15 -i /home/denis/results/TransDecoder/Trinity.longest.transdecoder.cds -o E_heros --itype CDS --translate -m diamond --evalue 0.000001 --pident 50 --query_cover 70 --seed_ortholog_evalue 0.000001 --target_orthologs all --report_orthologs --go_evidence all --md5 --decorate_gff yes --excel
python3 /home/pedro/Programs/eggnog-mapper/emapper.py --cpu 6 -i /home/pedro/Calliphoridae_pedro_gi/Cmac_ANNOT/Cmac_unnanotated_transcripts.fasta -o Cbez_annot_eggnog --data_dir /home/pedro/Programs/eggnog-mapper/data/ --itype CDS --translate -m diamond --evalue 0.000001 --pident 50 --query_cover 70 --seed_ortholog_evalue 0.000001 --target_orthologs all --report_orthologs --go_evidence all --md5 --decorate_gff yes
````
#### Doing the same for spades transcriptome
````
TransDecoder.LongOrfs -t /home/denis/results/Trasncriptome_assembly/Spades_denovo/Spades.longest.hf
blastp -query /home/denis/results/TransDecoder/Spades_transcriptome/Spades.longest.hf.transdecoder_dir/longest_orfs.pep \
-db /home/denis/results/Annotation/Trinotate/uniprot_sprot.pep -max_target_seqs 1 \
-outfmt 6 -evalue 1e-5 -num_threads 15 > blastp.outfmt6
hmmscan --cpu 15 --domtblout pfam.domtblout /home/denis/results/Annotation/Trinotate/Pfam-A.hmm /home/denis/results/TransDecoder/Spades_transcriptome/Spades.longest.hf.transdecoder_dir/longest_orfs.pep
TransDecoder.Predict -t /home/denis/results/Trasncriptome_assembly/Spades_denovo/Spades.longest.hf --retain_pfam_hits /home/denis/results/HMMER/pfam.domtblout --retain_blastp_hits /home/denis/results/Blast/Spades_transcriptome/blastp.outfmt6
emapper.py --cpu 15 -i /home/denis/results/TransDecoder/Spades_transcriptome/Spades.longest.hf.transdecoder.cds -o E_heros --itype CDS --translate -m diamond --evalue 0.000001 --pident 50 --query_cover 70 --seed_ortholog_evalue 0.000001 --target_orthologs all --report_orthologs --go_evidence all --md5 --decorate_gff yes --excel
````
### 5.2 - Transcriptome Annotation with Diamond
We will be using Diamond v0.9.21. Diamond is an almostdrop-in replacement for blastp and blastx, both due to its speed, and due to the fact that it mimics the BLAST command line function calls and output formats. The main drawback of the tool is that it can only operate with amino acid sequences as targets. However, it does accept both nucleotide and protein queries. Therefore, it is a great choice for performing protein versus protein (or translated nucleotide versus protein) searches while annotating de novo assembled transcriptomes.
About:
https://github.com/bbuchfink/diamond
First I downloaded the protein sequences from *Halyomorpha halys* and made a database with them:
````
diamond makedb --in Hhal_2.0_protein.faa -d Hhal
````
Running a search in blastx mode:
````
diamond blastx -d Hhal.dmnd -q /home/denis/results/Trasncriptome_assembly/Trinity_denovo/Trinity.longest -o matches.tsv --threads 20
````
reporting only the alignments whose score is at most 10% lower than the best alignment score for a query:
````
diamond blastx -d Hhal.dmnd -q /home/denis/results/Trasncriptome_assembly/Trinity_denovo/Trinity.longest -o matches2.tsv --top 10 --threads 20
````
In both ways, 50911 queries were aligned.
Para saber quantas das ~25.000 proteínas do proteoma de H. halys deram match com os inputs:
````
#Isolando a segunda coluna do arquivo matches2.tsv
awk '{print $2}' matches2.tsv > contig_codes.txt
#removendo os contigs repetidos
sort contig_codes.txt | uniq -c > couts.txt
#contando o número de linhas
wc -l counst.txt
````
Resultado: 22.167
## 6 - Measuring expression levels
We will be using Kallisto v0.46.2.
About:
https://github.com/pachterlab/kallisto
kallisto is a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for alignment.
#### Building an index
Pseudoalignment requires processing a transcriptome file to create a “transcriptome index”.
````
kallisto index -i transcripts.idx /home/denis/results/Trasncriptome_assembly/Trinity_denovo/Trinity.longest
````
#### Quantification
Now you can quantify abundances of the transcripts using the two read files reads_1.fastq.gz and reads_2.fastq.gz (kallisto can read in either plain-text or gzipped read files).
Important note: only supply one sample at a time to kallisto. The multiple FASTQ (pair) option is for users who have samples that span multiple FASTQ files.
````
nano script_kallisto
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/T4 -b 100 /home/denis/results/Trim_Reads/T4_1.trim.fq.gz /home/denis/results/Trim_Reads/T4_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/T7 -b 100 /home/denis/results/Trim_Reads/T7_1.trim.fq.gz /home/denis/results/Trim_Reads/T7_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/T26 -b 100 /home/denis/results/Trim_Reads/T26_1.trim.fq.gz /home/denis/results/Trim_Reads/T26_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/T29 -b 100 /home/denis/results/Trim_Reads/T29_1.trim.fq.gz /home/denis/results/Trim_Reads/T29_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/T30 -b 100 /home/denis/results/Trim_Reads/T30_1.trim.fq.gz /home/denis/results/Trim_Reads/T30_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/T31 -b 100 /home/denis/results/Trim_Reads/T31_1.trim.fq.gz /home/denis/results/Trim_Reads/T31_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/T32 -b 100 /home/denis/results/Trim_Reads/T32_1.trim.fq.gz /home/denis/results/Trim_Reads/T32_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/T34 -b 100 /home/denis/results/Trim_Reads/T34_1.trim.fq.gz /home/denis/results/Trim_Reads/T34_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/T36 -b 100 /home/denis/results/Trim_Reads/T36_1.trim.fq.gz /home/denis/results/Trim_Reads/T36_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/T37 -b 100 /home/denis/results/Trim_Reads/T37_1.trim.fq.gz /home/denis/results/Trim_Reads/T37_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/T38 -b 100 /home/denis/results/Trim_Reads/T38_1.trim.fq.gz /home/denis/results/Trim_Reads/T38_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/T39 -b 100 /home/denis/results/Trim_Reads/T39_1.trim.fq.gz /home/denis/results/Trim_Reads/T39_2.trim.fq.gz -t 20
bash script_kallisto
````
#### Doing the same for spades transcriptome
````
kallisto index -i transcripts.idx /home/denis/results/Trasncriptome_assembly/Spades_denovo/Spades.longest.hf
nano script_kallisto
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Spades.transcriptome/T4 -b 100 /home/denis/results/Trim_Reads/T4_1.trim.fq.gz /home/denis/results/Trim_Reads/T4_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Spades.transcriptome/T7 -b 100 /home/denis/results/Trim_Reads/T7_1.trim.fq.gz /home/denis/results/Trim_Reads/T7_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Spades.transcriptome/T26 -b 100 /home/denis/results/Trim_Reads/T26_1.trim.fq.gz /home/denis/results/Trim_Reads/T26_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Spades.transcriptome/T29 -b 100 /home/denis/results/Trim_Reads/T29_1.trim.fq.gz /home/denis/results/Trim_Reads/T29_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Spades.transcriptome/T30 -b 100 /home/denis/results/Trim_Reads/T30_1.trim.fq.gz /home/denis/results/Trim_Reads/T30_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Spades.transcriptome/T31 -b 100 /home/denis/results/Trim_Reads/T31_1.trim.fq.gz /home/denis/results/Trim_Reads/T31_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Spades.transcriptome/T32 -b 100 /home/denis/results/Trim_Reads/T32_1.trim.fq.gz /home/denis/results/Trim_Reads/T32_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Spades.transcriptome/T34 -b 100 /home/denis/results/Trim_Reads/T34_1.trim.fq.gz /home/denis/results/Trim_Reads/T34_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Spades.transcriptome/T36 -b 100 /home/denis/results/Trim_Reads/T36_1.trim.fq.gz /home/denis/results/Trim_Reads/T36_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Spades.transcriptome/T37 -b 100 /home/denis/results/Trim_Reads/T37_1.trim.fq.gz /home/denis/results/Trim_Reads/T37_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Spades.transcriptome/T38 -b 100 /home/denis/results/Trim_Reads/T38_1.trim.fq.gz /home/denis/results/Trim_Reads/T38_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Spades.transcriptome/T39 -b 100 /home/denis/results/Trim_Reads/T39_1.trim.fq.gz /home/denis/results/Trim_Reads/T39_2.trim.fq.gz -t 20
bash script_kallisto
````
#### Doing the same only for coding sequences
````
kallisto index -i transcripts.idx /home/denis/results/TransDecoder/Trinity_transcriptome/Trinity.longest.transdecoder.cds
nano script_kallisto
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/cds/T4 -b 100 /home/denis/results/Trim_Reads/T4_1.trim.fq.gz /home/denis/results/Trim_Reads/T4_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/cds/T7 -b 100 /home/denis/results/Trim_Reads/T7_1.trim.fq.gz /home/denis/results/Trim_Reads/T7_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/cds/T26 -b 100 /home/denis/results/Trim_Reads/T26_1.trim.fq.gz /home/denis/results/Trim_Reads/T26_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/cds/T29 -b 100 /home/denis/results/Trim_Reads/T29_1.trim.fq.gz /home/denis/results/Trim_Reads/T29_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/cds/T30 -b 100 /home/denis/results/Trim_Reads/T30_1.trim.fq.gz /home/denis/results/Trim_Reads/T30_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/cds/T31 -b 100 /home/denis/results/Trim_Reads/T31_1.trim.fq.gz /home/denis/results/Trim_Reads/T31_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/cds/T32 -b 100 /home/denis/results/Trim_Reads/T32_1.trim.fq.gz /home/denis/results/Trim_Reads/T32_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/cds/T34 -b 100 /home/denis/results/Trim_Reads/T34_1.trim.fq.gz /home/denis/results/Trim_Reads/T34_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/cds/T36 -b 100 /home/denis/results/Trim_Reads/T36_1.trim.fq.gz /home/denis/results/Trim_Reads/T36_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/cds/T37 -b 100 /home/denis/results/Trim_Reads/T37_1.trim.fq.gz /home/denis/results/Trim_Reads/T37_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/cds/T38 -b 100 /home/denis/results/Trim_Reads/T38_1.trim.fq.gz /home/denis/results/Trim_Reads/T38_2.trim.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/cds/T39 -b 100 /home/denis/results/Trim_Reads/T39_1.trim.fq.gz /home/denis/results/Trim_Reads/T39_2.trim.fq.gz -t 20
bash script_kallisto
````
#### Doing the same only for M samples
````
nano script_kallisto
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Initial_males/T6M -b 100 /home/denis/raw_seqs/machos_e_femeas/T6M_1.fq.gz /home/denis/raw_seqs/machos_e_femeas/T6M_2.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Initial_males/T8M -b 100 /home/denis/raw_seqs/machos_e_femeas/T8M_1.fq.gz /home/denis/raw_seqs/machos_e_femeas/T8M_2.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Initial_males/T9M -b 100 /home/denis/raw_seqs/machos_e_femeas/T9M_1.fq.gz /home/denis/raw_seqs/machos_e_femeas/T9M_2.fq.gz -t 20
kallisto quant -i transcripts.idx -o /home/denis/results/Kallisto/Initial_males/T11M -b 100 /home/denis/raw_seqs/machos_e_femeas/T11M_1.fq.gz /home/denis/raw_seqs/machos_e_femeas/T11M_2.fq.gz -t 20
bash script_kallisto
````
## 7 - Evaluating differentially expressed genes
first lets keep only columns 1 and 4 from the outputs of Kallisto, these are (1) the contig code and (4) estimated read counts
````
#tail -2 list all the file starting from 2nd line
#sort put it in alphabetical order using the first column
#cut -f 1,4 keeps only columns 1 and 4
nano script_import
tail -n +2 /home/denis/results/Kallisto/T4/abundance.tsv | cut -f 1,4 > T4_abundance.txt
tail -n +2 /home/denis/results/Kallisto/T7/abundance.tsv | cut -f 1,4 > T7_abundance.txt
tail -n +2 /home/denis/results/Kallisto/T26/abundance.tsv | cut -f 1,4 > T26_abundance.txt
tail -n +2 /home/denis/results/Kallisto/T29/abundance.tsv | cut -f 1,4 > T29_abundance.txt
tail -n +2 /home/denis/results/Kallisto/T30/abundance.tsv | cut -f 1,4 > T30_abundance.txt
tail -n +2 /home/denis/results/Kallisto/T31/abundance.tsv | cut -f 1,4 > T31_abundance.txt
tail -n +2 /home/denis/results/Kallisto/T32/abundance.tsv | cut -f 1,4 > T32_abundance.txt
tail -n +2 /home/denis/results/Kallisto/T34/abundance.tsv | cut -f 1,4 > T34_abundance.txt
tail -n +2 /home/denis/results/Kallisto/T36/abundance.tsv | cut -f 1,4 > T36_abundance.txt
tail -n +2 /home/denis/results/Kallisto/T37/abundance.tsv | cut -f 1,4 > T37_abundance.txt
tail -n +2 /home/denis/results/Kallisto/T38/abundance.tsv | cut -f 1,4 > T38_abundance.txt
tail -n +2 /home/denis/results/Kallisto/T39/abundance.tsv | cut -f 1,4 > T39_abundance.txt
bash script_import
````
````
paste T4_abundance.txt T7_abundance.txt T26_abundance.txt T29_abundance.txt T30_abundance.txt T31_abundance.txt T32_abundance.txt T34_abundance.txt T36_abundance.txt T37_abundance.txt T38_abundance.txt T39_abundance.txt > all_samples_expression.txt
````
Lets try using columns 1 and 5 from the outputs of Kallisto, these are (1) the contig code and (5) Transcripts Per Million (TPM)
````
nano script_import2
tail -n +2 /home/denis/results/Kallisto/Trinity.transcriptome/T4/abundance.tsv | cut -f 1,5 > T4_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Trinity.transcriptome/T7/abundance.tsv | cut -f 1,5 > T7_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Trinity.transcriptome/T26/abundance.tsv | cut -f 1,5 > T26_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Trinity.transcriptome/T29/abundance.tsv | cut -f 1,5 > T29_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Trinity.transcriptome/T30/abundance.tsv | cut -f 1,5 > T30_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Trinity.transcriptome/T31/abundance.tsv | cut -f 1,5 > T31_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Trinity.transcriptome/T32/abundance.tsv | cut -f 1,5 > T32_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Trinity.transcriptome/T34/abundance.tsv | cut -f 1,5 > T34_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Trinity.transcriptome/T36/abundance.tsv | cut -f 1,5 > T36_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Trinity.transcriptome/T37/abundance.tsv | cut -f 1,5 > T37_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Trinity.transcriptome/T38/abundance.tsv | cut -f 1,5 > T38_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Trinity.transcriptome/T39/abundance.tsv | cut -f 1,5 > T39_abundance.txt
bash script_import2
````
````
paste T4_abundance.txt T7_abundance.txt T26_abundance.txt T29_abundance.txt T30_abundance.txt T31_abundance.txt T32_abundance.txt T34_abundance.txt T36_abundance.txt T37_abundance.txt T38_abundance.txt T39_abundance.txt > all_samples_expression2.txt
````
M samples
````
nano script_import
tail -n +2 /home/denis/results/Kallisto/Initial_males/T6M/abundance.tsv | cut -f 1,4 > T6M_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Initial_males/T8M/abundance.tsv | cut -f 1,4 > T8M_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Initial_males/T9M/abundance.tsv | cut -f 1,4 > T9M_abundance.txt
tail -n +2 /home/denis/results/Kallisto/Initial_males/T11M/abundance.tsv | cut -f 1,4 > T11M_abundance.txt
bash script_import
````
````
paste T6M_abundance.txt T8M_abundance.txt T9M_abundance.txt T11M_abundance.txt > M_samples_expression.txt
`````
## TransPi
https://github.com/PalMuc/TransPi
````
nextflow run TransPi.nf --all --reads '/home/denis/raw_seqs/plasticidade/*_[1,2].fq.gz' \
--k 25,35,55,75,85 --maxReadLen 150 -profile conda
nextflow run TransPi.nf --onlyAsm --reads '/home/denis/results/Trim_Reads/*_[1,2].trim.fq.gz' \
--k 25,35,55,75,85 --maxReadLen 150 -profile conda
````
## rnaQUAST
https://github.com/ablab/rnaquast
````
python ./rnaQUAST.py --transcripts /home/denis/results/Trasncriptome_assembly/Trinity_denovo/Trinity.longest /home/denis/results/Trasncriptome_assembly/Spades_denovo/hard_filtered_transcripts_long_isof.fa --busco hemiptera_odb10 --gene_mark
````