# RNA-Seq analysis in non model organisms - Day 3 Today's topic is **Third generation sequencing, de-novo transcriptome assembly, assessment and evalutation** ## Course overview ```mermaid gantt title Course progression dateFormat YYYY-MM-DD axisFormat %e section Preprocessing Intro :a1, 2022-11-28, 12h FastQC :a2, after a1 , 12h Trimming, Sorting :after a2 , 1d section De-novo assembly Assembly :2022-11-29 , 1d Evaluation : 12h Annotation :12h section Analysis Quantification : 2022-12-01 , 1d Differential Gene Expression : 1d ``` ```mermaid flowchart TD p0((Channel.fromFilePairs)) p1((Channel.fromPath)) p2([collect]) p3((Channel.fromPath)) p4((Channel.fromPath)) p5([collect]) p6((Channel.fromPath)) p7([collect]) p8((Channel.fromPath)) p9[INDEX_RNA] p10[REMOVE_RNA] p11(( )) p12[TRIM] p13(( )) p14[RCORRECTOR] p15(( )) p16([collect]) p17([collect]) p18[TRINITY] p19([collect]) p20([collect]) p21[CAT_LIBRARIES] p22[DETONATE] p23(( )) p24(( )) p25[TRANSRATE] p26(( )) p27(( )) p28[BUSCO] p29(( )) p30[TRANSDECODER] p31(( )) p32(( )) p33[EGGNOG] p34(( )) p35(( )) p36[SALMON_INDEX] p37(( )) p38[SALMON] p39(( )) p40[QC_RAW] p41[QC_RRNA] p42[QC_TRIM] p43([concat]) p44([collect]) p45[MULTIQC] p46(( )) p0 --> p10 p1 --> p2 p2 --> p9 p3 --> p12 p4 --> p5 p5 --> p30 p6 --> p7 p7 --> p30 p8 --> p33 p9 --> p10 p2 -->|fasta_files| p10 p10 --> p12 p10 --> p11 p10 --> p43 p12 --> p14 p12 --> p13 p12 --> p43 p14 --> p16 p14 --> p15 p16 --> p18 p14 --> p17 p17 --> p18 p18 --> p22 p14 --> p19 p19 --> p21 p14 --> p20 p20 --> p21 p21 --> p22 p22 --> p24 p22 --> p23 p18 -->|transcriptome| p25 p21 -->|reads| p25 p25 --> p27 p25 --> p26 p18 -->|transcriptome| p28 p28 --> p43 p28 --> p29 p18 -->|transcriptome| p30 p30 --> p33 p30 --> p32 p30 --> p31 p33 --> p35 p33 --> p34 p18 -->|transcriptome| p36 p36 --> p38 p36 --> p37 p12 --> p38 p38 --> p43 p38 --> p39 p0 --> p40 p40 --> p43 p10 --> p41 p41 --> p43 p12 --> p42 p42 --> p43 p43 --> p44 p44 --> p45 p45 --> p46 ``` ## Learning objectives By the end of the day, you will be able: * Compare 2nd and 3rd generation sequencing methods for de-novo transcriptome assembly * Retrieve the necessary instructions to perform the assembly * Compute different assembly assessment metrics perform the assembly evaluation * Describe the principles of functional annotation ## Schedule | Time Slot | Title | Instructor | |-----------|-------|------------| | 14:00 - 14:30 | [Revision session](/YL2Lj2nUTwqAJVVoeNLnSg) | Bastian | | 14:30 - 16:30 | [Transcriptome QC](https://bschiffthaler.github.io/TranscriptomeEvaluation/) **Guided practical** | Bastian | | | [Transcriptome QC](https://gist.github.com/bschiffthaler/515619bbb65de786ebb156462b21cca4) | 16:30 - 19:40 | [Transcriptome Annotation](https://bschiffthaler.github.io/TranscriptomeAnnotation/) **Lecture...** | Bastian | | | [Transcriptome Annotation](https://gist.github.com/bschiffthaler/23d183eff6d085c40b3ddbdd56c47e86) **...and tutorial** | Bastian | | 19:40 - 20:00 | [Feedback and Assessment](https://hackmd.io/@bschiffthaler/S1cNpLZvs/%2FsLweTbo8Q0OtrW5Llnh5PQ)| Nico| ## Rooms Poll results: 25% No, 25% Yes, 50% Undecided. Trial: We will create four rooms, join a room according to your preference based on the table below: |Room number|Preference| |---|---| |1|Individual work| |2|Some interaction| |3|More interaction| |4|Collaborative work| ## Questions ### Revision session ### Data preprocessing * Could you please summarise what each number represents exactly in the file name? * Nico: An example filename (HiSeq 2500): ``` 6_150123_BC6KU1ANXX_P1790_101_1.fastq.gz 6_150123_BC6KU1ANXX_P1790_101_2.fastq.gz ``` The format is Lane_YearMonthDay_FlowCellID_sampleID_Fwd/Rev_read.fq.gz - sample here is `P1790_101` * Nico: An example filename (NovaSeq 6000): ``` P12108_262_S125_L002_R1_001.fastq.gz ``` The format is Sample_ID_Lane_Read_00X.fastq.gz. Read can be R1 or R2 * We will have multiple reads and we can run Trimmomatic as an array by creating a rawlist. Can we do the same in Rcorrector? * Nico: not sure what you mean by array? * Both Trimmomatic and Rcorrector will work on a individual sample at a time. What we can do is create a for loop to iterate across all samples. * If we creat variables for file1 and file2 we don't need a for loop I think. Not sure if I can due it with Rcorrector... * Nico: not exactly. The variable file1 and file2 will only contain one value at a time, so you will still need a loop to iterate other the list/array of filenames. The same would apply for Rcorrector * Yesterday, the slidingwindow was 5:20 today 5:5, so keep the reads with phred 5. Will this values depend if we are using the "sample" that we will use to assemble de novo or if we are preparing the samples for DEG? In DEG we want to keep more info so 5:5... * Nico: the first value is the number of base in the sliding window. The second is the quality score for trimming. If the average quality across the window size is below the quality threshold, the read is cut at that point. Note that the scanning starts from the beginning of the read. The quality value is an arbitrary choice. You might use 20 for one project and 5 for another. As long as you can justify your choice... any value is fine :slightly_smiling_face: * What are the rcorrector additions to the ID line? * From [GitHub](https://github.com/mourisl/Rcorrector) ``` "cor": some bases of the sequence are corrected "unfixable_error": the errors could not be corrected "l:INT m:INT h:INT": the lowest, median and highest kmer count of the kmers from the read ``` * If you assume that the corrected reads are 'the truth', why dont you use them also for quantification? * Because Rcorrector might affect the properties of the data (similar to what the diginorm does). Hence the data is no longer the raw data that expression quantification tools expect. ### De-novo transcriptome assembly * Using the genome_guided mode, can we still call our assembly "de novo"? * Nico: Sure :slightly_smiling_face: It is still de-novo as you start from the raw reads. It just has some extra guidance to reduce the generation of chimeras. Same as if you use longer reads, they are just used to resolve complexities in the graph. * Not sure if I understood... So, my transcriptome assembly should be made with all the read files (control and treatment condictions)? * Nico: You can do an assembly with all your samples or you can do assemblies one with all control and one with ll treatment. The latter if you assume that there are a lot of different transcripts between the conditions. That is mostly relevant if you work with different tissues / strains / _etc._ If you use the same tissue and "just" apply a treatment, you commonly do not expect such changes in the transcriptome. There running all samples in a single assembly is the most relevant approach IMO. * Do you treat long reads similar as the short reads prior to assembly? I'm referring to phred cutoff, rRNA filtering and running rcorrector? * Nico: Not needed as the filtering/processing done on the long read data is going to have taken care of that. rRNA filtering is important for short reads but unlikely to be in mRNA long read libraries. rCorrector is not needed as that is to correct issues in short reads. The fact the PacBio sequences the mRNA multiple times in one go will give very high quality sequences. ONT has a model that also gives very good quality sequences (99%) accuracy. Hence, both do some type of "read correction" already, if you like. * If you have a transcriptome assembly (e.g., a public one), but do not have raw reads. How can you estimate the quality of it? * Nico: You could use the reads that you generate for that. If it is before you generate your data, then running BUSCO, then transdecoder to extract the protein sequence and performing an alignment against UniRef90 (Uniprot, but the redundancy limited to 90% - so sequences with similarity > 90% are clustered together and a single reference sequence is kept) using diamond. Then you can from the UniRef IDs extract the taxon and some description. * If there are any related organisms, with a genome / transcriptome comparing to it. * Looking at the sequence properties (length, GC, _etc._) are all set of evidences you can build to assert the quality. * If you assemple a transcriptome of a polyplois species (e.g., 8x), do you need to use some additional parameters to set the number of possible variants of a gene? * Nico: No there is no such options. That's because the assembler uses a de-bruijn graph and bubbles introduced by SNPs, Indels are "popped" based on read coverage. So you would get a set of transcripts that are a "mix" of your 8 alleles. You would need to run something like GATK post-hoc to identify the variants. * Does it mean I can't just run Trinity? * Sure you can! You just need to know that as a caveat. Spruce is diploid, so less of an issue, but it still works fine. We have a hybrid between two species with different species as "parents" and it works well do. Transcripts that are different enough are well identified, and those that are not, we can identify using GATK. So more work, but doable. * In this assebling process,how many fragments did we use? Not only those two (ze_1_trimmed_1.cor.fq.gz and ze_1_trimmed_2.cor.fq.gz),right? Trinity \ --seqType fq \ --left ~/transcriptome/ze_1_trimmed_1.cor.fq.gz \ --right ~/transcriptome/ze_1_trimmed_2.cor.fq.gz \ --output trinity_short_only \ --max_memory 32G \ --CPU 8 * if you run `gunzip -c ~/transcriptome/ze_1_trimmed_1.cor.fq.gz | wc -l` you will get the number of lines in the fastq file. Divide that by 4 and you will know the number of fragments. I believe 100,000.