# Revision session - Day 4

## Feedback

* Not sure my sense of humour is the same as Bastian's... so bear with me :sweat_smile:
* Flowchart:

```mermaid
flowchart TD
    data --> FastQC
    data --> SortMeRNA
    SortMeRNA --> FastQC
    SortMeRNA --> Trimmomatic
    Trimmomatic --> FastQC
    FastQC --> MultiQC
    SortMeRNA --> MultiQC
    Trimmomatic --> MultiQC
    Trimmomatic --> rCorrector
    rCorrector --> Trinity
    Trinity --> Detonate
    Trinity --> Transrate
    Trinity --> BUSCO
    Trinity --> transdecoder
    transdecoder --> eggnog
    Trimmomatic --> salmon
    Trinity --> salmon
    salmon --> DESeq2
```

* eggnog output: we follow the trail from the GitHub [repository](https://github.com/eggnogdb/eggnog-mapper) to the [wiki](https://github.com/eggnogdb/eggnog-mapper/wiki) to the current version's [output description](https://github.com/eggnogdb/eggnog-mapper/wiki)

From `Elsa`:

```
## Wed Nov 30 20:03:08 2022
## emapper-2.1.9
## /usr/local/bin/eggnog-mapper-2.1.9/emapper.py --cpu 4 -i /home/elza/day3/transcriptome/annotation/decoded/longest_orfs.pep --translate -m diamond -o /home/elza/day3/transcriptome/annotation/eggnog
##
#query	seed_ortholog	evalue	score	eggNOG_OGs	max_annot_lvl	COG_category	Description	Preferred_name	GOs	EC	KEGG_ko	KEGG_Pathway	KEGG_Module	KEGG_Reaction	KEGG_rclass	BRITE	KEGG_TC	CAZy	BiGG_Reaction	PFAMs
TRINITY_DN1_c0_g1_i1.p1	3880.AES99404	2.71e-78	261.0	COG5099@1|root,KOG1488@2759|Eukaryota,37HNU@33090|Viridiplantae,3GAB4@35493|Streptophyta,4JMHX@91835|fabids	35493|Streptophyta	J	pumilio homolog	-	GO:0003674,GO:0003676,GO:0003723,GO:0003729,GO:0003730,GO:0005488,GO:0005575,GO:0005622,GO:0005623,GO:0005737,GO:0005829,GO:0044424,GO:0044444,GO:0044464,GO:0097159,GO:1901363	-	ko:K17943	-	-	-	-	ko00000	-	-	-	NABP,PUF
TRINITY_DN1_c0_g1_i2.p1	3880.AES99404	1.08e-92	301.0	COG5099@1|root,KOG1488@2759|Eukaryota,37HNU@33090|Viridiplantae,3GAB4@35493|Streptophyta,4JMHX@91835|fabids	35493|Streptophyta	J	pumilio homolog	-	GO:0003674,GO:0003676,GO:0003723,GO:0003729,GO:0003730,GO:0005488,GO:0005575,GO:0005622,GO:0005623,GO:0005737,GO:0005829,GO:0044424,GO:0044444,GO:0044464,GO:0097159,GO:1901363	-	ko:K17943	-	-	-	-	ko00000	-	-	-	NABP,PUF
TRINITY_DN10_c0_g1_i3.p1	13333.ERM97415	3.78e-153	462.0	28HBY@1|root,2QPQB@2759|Eukaryota,37NM8@33090|Viridiplantae,3G8U4@35493|Streptophyta	35493|Streptophyta	S	ETO1-like protein 1	-	-	-	-	-	-	-	-	-	-	-	-	BTB,TPR_8
TRINITY_DN10_c0_g1_i4.p1	13333.ERM97415	3.78e-153	462.0	28HBY@1|root,2QPQB@2759|Eukaryota,37NM8@33090|Viridiplantae,3G8U4@35493|Streptophyta	35493|Streptophyta	S	ETO1-like protein 1	-	-	-	-	-	-	-	-	-	-	-	-	BTB,TPR_8
## 4 queries scanned
## Total time (seconds): 0.9135375022888184
## Rate: 4.38 q/s
```

## Assessment

1. Oxford Nanopore Direct Sequencing is an amplification-free protocol.

    **TRUE**/False

2. Which one is not a step of the Trinity short-read de novo assembly?
    * **circular consensus sequence**
    * linear contig generation (inchworm)
    * de Bruijn graph generation (chrysalis)
    * de Bruijn graph traversal (butterfly)

    The last three are all steps of Trinity.

3. List one advantage and one limitation of 3rd generation sequencing over 2nd generation sequencing in the context of de novo transcriptome assembly.

    `3rd gen will do MORE for me! :) ... if I have the $$$$$`

4. Which is NOT a method that could be used to assess the quality of an assembly?
    * **SortMeRNA**
    * transrate
    * BUSCO
    * detonate

    Well, you got it right, although we did not cover that topic yesterday... Why are we there? :-D

5. Describe in ONE sentence the principle of EITHER detonate or BUSCO.

    `detonate: the state of my mind after a full day of lectures`

    `detonate is so fancy and awesome that it will blow my mind tomorrow :-)`

## Questions?

## A bit more on reproducible research

You have heard about making the data FAIR, and Bastian mentioned numerous methods to improve reproducibility (containers, conda). You have also seen Nextflow. We are still missing one piece to have all the minimal skills.
Here it comes:

### GitHub bonus!

1. I will create a public repository for the course
2. You will clone that repository
3. I will create a script to run salmon
4. I will create a script to iterate through the files
5. I will commit and push my changes
6. You will pull the changes
7. You will run the script

[There](https://gist.github.com/nicolasDelhomme/46a1053d277510b95692318bd1732b6d) you will find the Git instructions we provide to onboard users at the UPSC. Most of the information there will be relevant to you if you ever want to use GitHub in your research.

### CLI commands

```{bash}
mkdir salmon
mkdir indices

# look at the help of salmon and of its subcommands
salmon -h
salmon index -h
salmon quant -h
salmon quant --help-reads

# build the index using salmon index - check the lecture for additional pointers.
salmon index -t /data/raw_data/reference/fasta/Pabies1.0-all-phase.gff3.CDSandLTR-TE.fa.gz -i indices/Pabies1.0-all-phase.gff3.CDSandLTR-TE_salmon_1.8.0

# iterate on all samples, run salmon
for f in $(find /data/raw_data/trimmomatic -name "*_trimmomatic_1.fq.gz"); do
  # sample name: file name without the _1.fq.gz suffix
  fnam=$(basename "${f/_1.fq.gz/}")
  # -i points at the index built above (same relative path as in salmon index)
  salmon quant -l A \
  -i indices/Pabies1.0-all-phase.gff3.CDSandLTR-TE_salmon_1.8.0 \
  -1 "$f" -2 "/data/raw_data/trimmomatic/${fnam}_2.fq.gz" \
  -o "salmon/$fnam" --seqBias --gcBias \
  -p 4
done
```
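Once the loop has finished, a quick sanity check of the results could look like the sketch below. It assumes that every sample directory under `salmon/` contains the standard `quant.sf` table (tab-separated, with `NumReads` in the fifth column). The function name `summarise_quant` is made up for this example and is not part of salmon.

```bash
#!/usr/bin/env bash
# Sketch: print one line per sample with the number of transcripts listed in
# quant.sf and how many of them received at least one read.
# Assumed layout (as produced by the loop above): <dir>/<sample>/quant.sf
summarise_quant() {
  local dir=${1:-salmon}
  local qf
  printf 'sample\ttranscripts\texpressed\n'
  for qf in "$dir"/*/quant.sf; do
    # skip the unexpanded glob when no sample directory exists
    [ -e "$qf" ] || continue
    # NR > 1 skips the header line; column 5 of quant.sf is NumReads
    awk -F'\t' -v s="$(basename "$(dirname "$qf")")" '
      NR > 1 { n++; if ($5 > 0) e++ }
      END { printf "%s\t%d\t%d\n", s, n, e }' "$qf"
  done
}
```

Run it as `summarise_quant salmon` from the directory in which the loop was started. A sample with very few expressed transcripts is worth a second look before moving on to DESeq2.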