REWIRE - HackMD

# REWIRE ###### tags: `REWIRE` `TE_expression` `sitophilus` `piRNA` ## People involved (unordered) Elisa Dell'Aglio Agnès Vallier Mariana G. Ferrarini Théo Chancy Caroline Blanc Cristina Vieira Claire Hoede Christine Gaspin Rita Rebollo ## Data ### Experimental setting Produced by Elisa Dell'Aglio from Nov 2020 to January 2021. Check Elisa's report (C:\Users\rrebollo\Documents\work\BF2i\OngoingProjects\AAP_REWIRE\smallRNA) for all the material and methods prior to sequencing. For info, day 3 old adults were injected and day 10 were dissected. Small RNAs were extracted using a column based method. Below the qPCR confirmation of the RNAi against AGO2, AGO3 and SIWI. ![](https://i.imgur.com/FewtmyJ.png) #### RNA-seq RNAseq library construction was performed by the platform, while sRNAseq library construction was performed by Agnès with Elisa's help. Sequencing on a Hiseq 3000, 100bp paired end reads, the data was received on July 2021 ![](https://i.imgur.com/1u74p8Q.png) ``` wget -r --user=S20189 --password=fC7c978Fcaf8AA6eF618 https://transfert-ngs.igbmc.fr/210528_K00201_0492_AHLF3WBBXY_HS703/ ``` MD5sum check : ``` md5sum *.fastq.gz > md5sumToCheck.txt ``` The data is present in : /home/cdata/rebollo_cold/REWIRE/ and the working directory is /home/data/rrebollo/sitophilus/REWIRE/ The number of reads is really variable, from 29M to 171M... The most sequenced library is Lib45 (RTLR73) with 171M reads which is Gut Ago3_3. ![](https://i.imgur.com/tIW11Td.png) #### sRNA-seq **From REWIRE** ![](https://i.imgur.com/ifMtJJp.png) ![](https://i.imgur.com/vhEAVjn.png) Received data on Thursday 6th May 2021 ``` wget -r --user=S20188 --password=7PqmEinIOHRyNR0kz0VV ftp://ngs.igbmc.fr/210505_K00201_0487_AHLCT5BBXY_HS699 ``` MD5sum check : ok ``` md5sum *.fastq.gz > md5sumToCheck.txt ``` **From UNLEASh** D9 wild type RTRL11, RTRL12, RTRL12. These are not published yet but should be asap. However there are major differences: these are day 9 (instead of day 10) but most importantly, they were obtained by size selecting the small RNAs and not based on protein complexes as the REWIRE samples. ### Bioinformatic tool versions FastQC v0.11.9 MultiQC v1.10.1 Trimmomatic v0.39 STAR v2.7.1a TEtools v1.0.0 featureCounts v2.0.1 ## Quality control ### RNAseq ![](https://i.imgur.com/UIVg3G0.png) ![](https://i.imgur.com/u75Xwsm.png) Illumina universal adaptor present: ![](https://i.imgur.com/JDE5SBB.png) ### sRNAseq - [ ] add the info on the sRNAseq raw data ## RNAseq ### Trimming library.txt is a file with the libraries name where I added in the first column the name of read 1 and the second column the name of read 2. For library.txt do `ls *fastq.gz > library.txt` and manually edit to look like this: ![](https://i.imgur.com/FNzsmOf.png) ```ssh #trimming.sh while read line; do fileNameR1=$(echo $line|awk '{print $1}'); fileNameR2=$(echo $line|awk '{print $2}'); ## Trim java -jar /home/data/rrebollo/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 12 -phred33 $fileNameR1.fastq.gz $fileNameR2.fastq.gz $fileNameR1.paired.fastq.gz $fileNameR1.unpaired.fastq.gz $fileNameR2.paired.fastq.gz $fileNameR2.unpaired.fastq.gz LEADING:20 TRAILING:20 AVGQUAL:25 SLIDINGWINDOW:10:30 MINLEN:35; done < library.txt ``` `bash trimming.sh ` #### FastQC Still a bit of adaptors of course, but it should be fine with RNAseq data. See if everyone agrees. Otherwise add the illumina adaptor into the trimming step. ![](https://i.imgur.com/2zcQ44s.png) ### Mapping With help from Mariana G. Ferrarini. #### TEtools ```ssh #Need to create a STAR index of the TE fasta file STAR --runThreadN 8 --runMode genomeGenerate --genomeDir TE_STAR_INDEX/ --genomeSAindexNbases 8 --genomeChrBinNbits 6 --genomeFastaFiles /home/data/rrebollo/sitophilus/TE_expression/TEtools/GCF_002938485.1_Soryzae_2.0_genomic.fna.TEs_v3.fasta #TEtools script by Mariana ls *.paired.fastq.gz > labels.txt cat labels.txt | xargs -n 2 sh run-te-tools.sh ``` run-te-tools.sh ```ssh # Run STAR on each sample STAR --genomeDir /home/data/rrebollo/sitophilus/REWIRE/RNAseq/TE_STAR_INDEX/ --runThreadN 2 --readFilesIn ./$1 ./$2 --readFilesCommand zcat --outFileNamePrefix ./$1_TEs_STAR_ -outSAMtype BAM SortedByCoordinate --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100 --outMultimapperOrder Random --outSAMmultNmax 1 # Run TE-tools on SAM files # Check if out files from STAR indeed have suffix _TEs_STAR_Aligned.out.sam python3 ~/TEtools/TEcount.py -rosette /home/data/rrebollo/sitophilus/TE_expression/TEtools/GCF_002938485.1_Soryzae_2.0_genomic.fna.TEs_v3.rosette -column 2 -TE /home/data/rrebollo/sitophilus/TE_expression/TEtools/GCF_002938485.1_Soryzae_2.0_genomic.fna.TEs_v3.fasta -count $1_TE.count -sam $1_TEs_STAR_Aligned.out.sam ``` Mapping rate varies from 2.7% to 5%. The very sequenced library (RTRL73) shows 3.4% of aligned reads so in line with the rest of the libraries but it corresopnds to 2.4M aligned reads, while the other libraries have around 1. In general the ovary samples show higher TE read mapping than gut samples (I think this is contrary to what was observed in the genome paper). ![](https://i.imgur.com/lgdWFhp.png) ![](https://i.imgur.com/qhxe8iu.png) We can already compare the % of TE mapped reads between libraries but remember these are not normalized counts, are just total % of reads. We can see how the ovarie's KD, specially SIWI are potentially able to deregulate the TEs. For gut there is one nice SIWI library but all the rest seems pretty the same (maybe AGO3 too). Of course this is to be followed up by a proper statistical count analyses. ![](https://i.imgur.com/l2yFG2o.png) #### Gene counts Trimmed data in */home/cdata/rebollocold/REWIRE/RNAseq/trimmmed/* Genome in */home/data/rrebollo/sitophilus/genomeinfo/GCF002938485.1Soryzae2.0genomic.fna.gz* ##### STAR index the genome `STAR --runThreadN 12 --runMode genomeGenerate --genomeSAindexNbases 9 --genomeFastaFiles /home/data/rrebollo/sitophilus/genome_info/GCF_002938485.1_Soryzae_2.0_genomic.fna --genomeDir /home/data/rrebollo/sitophilus/genome_info/SO2_STAR/` genomeSAindexNbases 9 : because of genome size ##### Mapping & counting STAR command `STAR --genomeDir /home/data/rrebollo/sitophilus/genome_info/SO2_STAR/ --runThreadN 12 --readFilesIn $1 $2 --readFilesCommand zcat --outFileNamePrefix ./$1_gene_STAR_ -outSAMtype BAM SortedByCoordinate` ```ssh cd /home/data/rrebollo/sitophilus/REWIRE/RNAseq/gene_counts/ ##mapping cat trimmed.txt | xargs -n 2 sh run_star.sh ##Feature counts tests #conda activate RNAseq featureCounts -a /home/data/rrebollo/sitophilus/genome_info/GCF_002938485.1_Soryzae_2.0_genomic_modifiedMGF.gff -t exon -g "Parent" -T 12 -p -o genes-parent.count *.sam featureCounts -a /home/data/rrebollo/sitophilus/genome_info/GCF_002938485.1_Soryzae_2.0_genomic_modifiedMGF.gff -t exon -g "Parent" -O -T 12 -p -o genesO.count *.sam featureCounts -a /home/data/rrebollo/sitophilus/genome_info/GCF_002938485.1_Soryzae_2.0_genomic_modifiedMGF.gff -t exon -g "Parent" -s 1 -T 12 -p -o genes-parents1.count *.sam featureCounts -a /home/data/rrebollo/sitophilus/genome_info/GCF_002938485.1_Soryzae_2.0_genomic_modifiedMGF.gff -t exon -g "Parent" -s 2 -T 12 -p -o genes-parents2.count *.sam featureCounts -a /home/data/rrebollo/sitophilus/genome_info/GCF_002938485.1_Soryzae_2.0_genomic_modifiedMGF.gff -t exon -g "Parent" -s 1 -O -T 12 -p -o genesOs1.count *.sam featureCounts -a /home/data/rrebollo/sitophilus/genome_info/GCF_002938485.1_Soryzae_2.0_genomic_modifiedMGF.gff -t exon -g "Parent" -s 2 -O -T 12 -p -o genesOs2.count *.sam featureCounts -p -s 1 -F "SAF" -a /home/data/rrebollo/sitophilus/genome_info/GCF_002938485.1_Soryzae_2.0_genomic_modifiedMGF_flatten.saf -o genes-pairsS1.counts -T 12 *.sam featureCounts -p -F "SAF" -a /home/data/rrebollo/sitophilus/genome_info/GCF_002938485.1_Soryzae_2.0_genomic_modifiedMGF_flatten.saf -o genes-pairs.counts -T 12 *.sam featureCounts -p -s 2 -F "SAF" -a /home/data/rrebollo/sitophilus/genome_info/GCF_002938485.1_Soryzae_2.0_genomic_modifiedMGF_flatten.saf -o genes-pairsS2.counts -T 12 *.sam ``` -p to specify paired end data For paired-end data is better to count fragments rather than reads. --countReadPairs shoul dbe used to count fragments instead of reads. Unfortunately, I can't install the latest version in conda, so I can't use countReadPairs. Counting reads only. -s 1 strand in the same gene sense -O to get meta feature count (genes) because otherwise it gave me transcript levels so I can make plots for AGO2 and other genes that have multiple isoforms. SAF: simplified annotation format All STAR reports look good. In the table below, aligned = uniquely mapped in multiqc. RTRL73 the largest library shows 88% aligned, which is equivalent to the other libraries. Of course, this still means 43K reads instead of ~10K like the other libraries. ![](https://i.imgur.com/fGp1is0.png) ![](https://i.imgur.com/Iv1kZ96.png) - [x] Comparison of all featurecounts | | Exon | Exon+S1 | Exon+S2 | O | O+S1 | O+S2 | SAF | SAF+S1 | SAF+S2 | |:------------------------------------ |:---- | ------- |:------- |:---- |:---- |:---- |:---- |:------ |:------ | | Successfuly assigned alignemnts | 65% | 0.5% | 65% | 77% | 0.7% | 80% | 80% | ~0.6% | 74% | | SIWI XM_030894834 | 1301 | -- | 1297 | 1301 | -- | 1297 | 1301 | -- | 1297 | | AGO3 XM_030896763 | 447 | -- | 443 | 450 | -- | 443 | 447 | -- | 443 | | AGO2 XM_030904 304 & ...305 & ...303 | 4-0 | -- | 4 | 6267 | -- | 6259 | 6267 | -- | 6259 | So clearly the strand is **-s 2**. Regarding the best count for genes, O works pretty well but it does make an extra computation step. SAF+2 seems to be the most straightforward way to make directly gene counts in the proper annotated strand. - [x] Get the counts of SIWI/AGO3/AGO2 between conditions (counts from counts-O) **SIWI:** LOC115878363 - XM_030894834 **AGO3:** LOC115879776 - XM_030896763 **AGO2:** LOC115885402 - XM_030904304 - XM_030904305 - XM_030904303 ![](https://i.imgur.com/Zc68P82.png) #### IGV confirmation of KD Transform SAM into BAM file: `samtools view -S -b sample.sam > sample.bam` then sort `samtools sort -o sorted.bam initial.bam` and finally into a bigwig with bamCoverage : ``` samtools index *.bam bamCoverage -b reads.bam -o coverage.bw --normalizeUsing BPM -p 12 ``` **SIWI:** LOC115878363 (GFP in green, KD SIWI in purple) Gut : KD ok, residual expression, argue for effect but not 100% ![](https://i.imgur.com/5GwijWf.png) Ovaries : gorgeous KD ![](https://i.imgur.com/Ez6IUmy.png) **AGO3:** LOC115879776 (GFP in green, KD AGO3 in purple) Gut : KD works, although vestigial expression ![](https://i.imgur.com/hPEu0CI.png) Ovaries: gorgeous ![](https://i.imgur.com/p0XWa7A.png) **AGO2:** LOC115885402 (GFP in green, KD AGO2 in purple) Gut : residual expression ![](https://i.imgur.com/Mtyz8ta.png) Ovaries : residual expression ![](https://i.imgur.com/cbv88Hm.png) **conclusion** The KD look really good, and of course they are not 100% KD but the decrease seems to be > 90%. Need to check the normalized counts, but we can safely continue the analysis. ### SARTools With Mariana G. Ferrarini help. ##### Preparing the joint TE & gene count table On the TEcount files: cut 2nd column and 5th column ```ssh ls *.count > countfiles.txt #in cut.sh = awk '{print $2,$5}' $1.R1.paired.fastq.gz_TE.count > $1.count cat countfiles.txt | xargs -n 1 sh cut.sh ``` Example of count file: ![](https://i.imgur.com/si4E77K.png) Example of modified count file: ![](https://i.imgur.com/YpS5OA0.png) On the gene count file : did it by hand on the genes_parent file, that represents the transcripts. Also did the same on the genes_O file so I can make graphs for genes. ##### Running SARTools - whole data Then run SARTools from Mariana's script on my laptop. Need a colData_REWIRE_RNAseq.txt file: > label files group > Gut_GFP_1 RTRL65.Allcounts.txt GutGFP > Gut_GFP_2 RTRL66.Allcounts.txt GutGFP > Gut_GFP_3 RTRL67.Allcounts.txt GutGFP > Gut_Siwi_1 RTRL68.Allcounts.txt GutSiwi > Gut_Siwi_2 RTRL69.Allcounts.txt GutSiwi > Gut_Siwi_3 RTRL70.Allcounts.txt GutSiwi > Gut_Ago3_1 RTRL71.Allcounts.txt GutAgo3 > Gut_Ago3_2 RTRL72.Allcounts.txt GutAgo3 > Gut_Ago3_3 RTRL73.Allcounts.txt GutAgo3 > Gut_Ago2_1 RTRL74.Allcounts.txt GutAgo2 > Gut_Ago2_2 RTRL75.Allcounts.txt GutAgo2 > Gut_Ago2_3 RTRL76.Allcounts.txt GutAgo2 > Ova_GFP_1 RTRL77.Allcounts.txt OvaGFP > Ova_GFP_2 RTRL78.Allcounts.txt OvaGFP > Ova_GFP_3 RTRL79.Allcounts.txt OvaGFP > Ova_Siwi_1 RTRL80.Allcounts.txt OvaSiwi > Ova_Siwi_2 RTRL81.Allcounts.txt OvaSiwi > Ova_Siwi_3 RTRL82.Allcounts.txt OvaSiwi > Ova_Ago3_1 RTRL83.Allcounts.txt OvaAgo3 > Ova_Ago3_2 RTRL84.Allcounts.txt OvaAgo3 > Ova_Ago3_3 RTRL85.Allcounts.txt OvaAgo3 > Ova_Ago2_1 RTRL86.Allcounts.txt OvaAgo2 > Ova_Ago2_2 RTRL87.Allcounts.txt OvaAgo2 > Ova_Ago2_3 RTRL88.Allcounts.txt OvaAgo2 I ran twice, once for the TE+transcripts file and one for the gene file (-O feature count) => deprecated, but same script, only used the TE+gene with -s 2 and SAF. ```r ################################################################################ ### R script to compare several conditions with the SARTools and DESeq2 packages ### Hugo Varet ### May 9th, 2016 ### designed to be executed with SARTools 1.4.0 ################################################################################ ################################################################################ ### parameters: to be modified by the user ### ################################################################################ rm(list=ls()) # remove all the objects from the R session workDir <- "~/work/BF2i/OngoingProjects/AAP_REWIRE/RNAseq/counts" # working directory for the R session projectName <- "2021-09-10-REWIRE-RNAseq-TEtools" # name of the project author <- "RR" # author of the statistical analysis/report targetFile <- "~/work/BF2i/OngoingProjects/AAP_REWIRE/RNAseq/counts/colData_REWIRE_RNAseq.txt" # path to the design/target file rawDir <- "~/work/BF2i/OngoingProjects/AAP_REWIRE/RNAseq/counts" # path to the directory containing raw counts files featuresToRemove <- c("alignment_not_unique", # names of the features to be removed "ambiguous", "no_feature", # (specific HTSeq-count information and rRNA for example) "not_aligned", "too_low_aQual")# NULL if no feature to remove varInt <- "group" # factor of interest condRef <- "GutGFP" # reference biological condition batch <- NULL # blocking factor: NULL (default) or "batch" for example fitType <- "parametric" # mean-variance relationship: "parametric" (default) or "local" cooksCutoff <- TRUE # TRUE/FALSE to perform the outliers detection (default is TRUE) independentFiltering <- TRUE # TRUE/FALSE to perform independent filtering (default is TRUE) alpha <- 0.01 # threshold of statistical significance pAdjustMethod <- "BH" # p-value adjustment method: "BH" (default) or "BY" typeTrans <- "VST" # transformation for PCA/clustering: "VST" or "rlog" locfunc <- "median" # "median" (default) or "shorth" to estimate the size factors #colors <- c("dodgerblue","firebrick1", # vector of colors of each biological condition on the plots #"MediumVioletRed","SpringGreen") colors <- c('#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231', '#911eb4', '#46f0f0', '#f032e6') #, '#bcf60c', '#fabebe', '#008080', '#e6beff', '#9a6324', '#fffac8', '#800000', '#aaffc3', '#808000', '#ffd8b1', '#000075', '#808080', '#ffffff', '#000000') ################################################################################ ### running script ### ################################################################################ setwd(workDir) library(SARTools) # checking parameters checkParameters.DESeq2(projectName=projectName,author=author,targetFile=targetFile, rawDir=rawDir,featuresToRemove=featuresToRemove,varInt=varInt, condRef=condRef,batch=batch,fitType=fitType,cooksCutoff=cooksCutoff, independentFiltering=independentFiltering,alpha=alpha,pAdjustMethod=pAdjustMethod, typeTrans=typeTrans,locfunc=locfunc,colors=colors) # loading target file target <- loadTargetFile(targetFile=targetFile, varInt=varInt, condRef=condRef, batch=batch) # loading counts counts <- loadCountData(target=target, rawDir=rawDir, featuresToRemove=featuresToRemove) # description plots majSequences <- descriptionPlots(counts=counts, group=target[,varInt], col=colors) # analysis with DESeq2 out.DESeq2 <- run.DESeq2(counts=counts, target=target, varInt=varInt, batch=batch, locfunc=locfunc, fitType=fitType, pAdjustMethod=pAdjustMethod, cooksCutoff=cooksCutoff, independentFiltering=independentFiltering, alpha=alpha) # PCA + clustering exploreCounts(object=out.DESeq2$dds, group=target[,varInt], typeTrans=typeTrans, col=colors) # summary of the analysis (boxplots, dispersions, diag size factors, export table, nDiffTotal, histograms, MA plot) summaryResults <- summarizeResults.DESeq2(out.DESeq2, group=target[,varInt], col=colors, independentFiltering=independentFiltering, cooksCutoff=cooksCutoff, alpha=alpha) # save image of the R session save.image(file=paste0(projectName, ".RData")) # generating HTML report writeReport.DESeq2(target=target, counts=counts, out.DESeq2=out.DESeq2, summaryResults=summaryResults, majSequences=majSequences, workDir=workDir, projectName=projectName, author=author, targetFile=targetFile, rawDir=rawDir, featuresToRemove=featuresToRemove, varInt=varInt, condRef=condRef, batch=batch, fitType=fitType, cooksCutoff=cooksCutoff, independentFiltering=independentFiltering, alpha=alpha, pAdjustMethod=pAdjustMethod, typeTrans=typeTrans, locfunc=locfunc, colors=colors) ``` All results are present in *C:\Users\rrebollo\Documents\work\BF2i\OngoingProjects\AAPREWIRE\RNAseq\counts* in two folders created by SARTools *figures\ and tables\ and in the countsO* folder KD of normalized counts (using -O counts): SIWI, AGO3 look great, both Gut and Ovaries. Ago2, as expected, is a really harder to KD, we can see residual expression up to 40% for gut. In general, the KD of ovaries are better, and this could be due to the difference in expression between gut and ovaries (as you can see, there is a 10 times more expression in the ovaries, this will be discussed later on). Redo this without the bugged libraries (see later on) ![](https://i.imgur.com/QD7ovuW.jpg) Read count, we can see Gut_Ago3_3 is really high compared to others. ![](https://i.imgur.com/I97FLHs.png) Clustered dendogram before normalization, some samples are mixed : Gut Ago3_3 & Ova_Ago3_3 cluster together. Gut_Siwi_1 clusters with the Ova_Siwi samples while the other Gut_Siwi_2 and Gut_Siwi_3 cluster with the Gut samples. Now for me it is not problematic that SIWI samples cluster together, even though one is gut and the other is ovaries. But it does bug me that the SIWI gut samples do not cluster together. The same thing goes for the Gut_AGO3_3 sample clustering with the Ova_Ago3_3 sample. I don't mind because if the derepression is not tissue specific this is fine, but it is strange that the other Gut or ovary samples are not clustering together. ![](https://i.imgur.com/MF6LPWH.png) PCA normalized counts. Same thing here, some of the samples are mixed but overall all good. Gut_Ago_3 & Ova_Ago_3 really cluster together, and not with the other samples. The "**gut siwi 1**" green sample is seen on the bottom right, with the Ova samples. On the second PCA this is more obvious but the AGO3 seem fine, while the **Ova AGO3** is the one clustering more in the middle. I wonder if this would change if we would only take into account the gene counts and not the genes + TEs. ![](https://i.imgur.com/9j2S0Fz.png) Also did this with genes and it looks exactly the same so it is not a problem of TE+gene manual file. - [x] Check if the count files were properly made (i.e. I didn't manually paste a TE read count from a file into another because I did this manually...) - [x] Check STAR reports - [x] Check ovary specific expression * LOC115882565 (XM_030900738.1) vitellogenin-like : expressed in gut* * LOC115882294 (XM_030900262.1) vitellogenin-like: expressed in gut* * LOC115875670 (XM_030891202.1) vitellogenin-like : not expressed in either tissues * LOC115878934 : could be but too many transcripts Heatmap of transcript counts to search for patterns between libraries and conclude on Gut_Siwi_1 and Ova_Ago3_3. Code based on Mariana's heatmap of the *S. oryzae* genome paper. I removed the mitochondrial and the TE counts. I also only included genes where total counts > 10 to remove genes that wer silent in all samples. I added 1 to every count, and log10 transformed. Installing a conda environment with everything that is needed : `conda create -n python pandas seaborn numpy matplotlib`. ```python ### Read annotation file with species, family, genus, gc content python #it simply opens a python "app" import pandas as pd import seaborn as sns import numpy as np import matplotlib.pyplot as plt df=pd.read_table("counts.filtered.txt",index_col=0) df ``` ``` sns_plot = sns.clustermap(df.iloc[:,0:23],yticklabels=0,xticklabels=1,cmap=sns.color_palette("RdYlBu_r", as_cmap=True),tree_kws=dict(linewidths=0)) sns_plot.savefig('test.png', dpi=400) ``` We can see that Gut Siwi 1 is extremely similar to ovaries but I find it hard to see all the difference with so many genes that show similar expression patterns. Also it seriously doesn't seem like a ![](https://i.imgur.com/fgmmUoP.png) Try to get only the genes that are differentially expressed between GUT GFP and OVA GFP (padj <0.05). Well, it looks exactly the same... ![](https://i.imgur.com/dnrELBk.png) Trying another heatmap with Daniel's code => he only do means. But I think the conclusion is the same, I certainly need to remove Gut Siwi 1 and potentially gut_Ago3_3 and Ova Ago3_3. Retry the analysis without these samples and discuss with the others later. ##### Running SARTools - robust datasets only Since there are three datasets that look contaminated with gut/ovary samples, I will remove them from the analysis. FYI, if you want to see DE plots of the whole datasets, check below the #deprecated analysis. To remove * Gut_Siwi_1 because it clusters with ovarian samples. * Gut_Ago3_3 because it doesn't look like neither ova or gut samples * Ova_Ago3_3 for the same reason as Gut_Ago3_3, and they cluster together. I ran SARtools as before, but removing the samples that seemed contaminated. ![](https://i.imgur.com/EnfhkM6.png) ![](https://i.imgur.com/2vIwT9K.png) While the data now looks good, there is still the question of keeping the replicates that did show too much residual expression on their KD. This is a concern only for AGO2 where we see 40% of residual expression for Gut_Ago2_2; and 34% and 28% residual expression for Ago2_1 and Ago2_3. I do find weird that the IGV does not illustrate this as the KD seem to really be at least 80% effecient. I wonder if this is not a problem of feature counts with -O parameter. Nayway, if we do remove these samples, we won't be able to take AGO2 into account, so maybe maintain it and explain the mild effect we could see because of the RNAi efficiency. ## sRNAseq ### Trimming This was performed by Mariana + mapping to sitophilus/sodalis with bbsplit. Ask Mariana to add the info here. ### BBsplit #### Size distribution This is just for information as it shows all the reads mapping against the *S. oryzae* genome. In green the Gut samples from REWIRE, in blue gut D9 WT samples (mirvana), and in purple the ovaries samples. ![](https://i.imgur.com/dvN4ZVN.png) Within **Guts**, here is the breakdown of samples GFP (green) vs WT with mirvana method (black): similar landscape, our size selected libraries are really good. ![](https://i.imgur.com/fQVqJFr.png) GFP vs SIWI : piRNA sizes are present in gut and ovary GFP, and we can see a clear decrease in piRNA-size reads while the si/miRNA sized reads remain the same. ![](https://i.imgur.com/2NhsqyO.png) GFP vs AGO3: the decrease in piRNA size reads is not visually clear ![](https://i.imgur.com/riENFju.png) GFP vs AGO2: the decrease in siRNA size reads is not visually clear ![](https://i.imgur.com/9IIzEvg.png) Within **Ovaries**, here is the breakdown of samples GFP vs Siwi: we can see more piRNA like than siRNA size reads and the decrease in piRNA size in the SIWI RNAi. ![](https://i.imgur.com/YrmW5rP.png) GFP vs AGO3: not obvious the decrease in piRNA. ![](https://i.imgur.com/jStIeMz.png) GFP vs AGO2 : not obvious the siRNA decrease ![](https://i.imgur.com/BBM6YSd.png) ### TEtools #### Read filtering Need to first filter the reads per size because TEtools will count all reads in the file. From Marie's paper > Using PRINSEQ lite version 0.20.4 (80), we filtered reads of size 23–30 nt and considered these as piRNAs, and reads of size 21 nt were considered as siRNAs. I followed Marie's size selection but used another method copied from [this](https://www.biostars.org/p/374959/) to filter the reads per size. ```ssh while read line; do fileName=$(echo $line|awk '{print $1}'); ## piRNAs # Step 1: make a list of sequence names that are within the length range /home/data/rrebollo/seqtk/seqtk comp $fileName.fq.gz | awk '{ if (($2 >= 23) && ($2 <= 30)) { print} }' | cut --fields 1 > $fileName.piRNA.list # Step 2: subset the fastq file for those sequences only /home/data/rrebollo/seqtk/seqtk subseq $fileName.fq.gz $fileName.piRNA.list > $fileName.piRNA.fq ## siRNAs # Step 1: make a list of sequence names that are within the length range /home/data/rrebollo/seqtk/seqtk comp $fileName.fq.gz | awk '{ if ($2 == 21) { print} }' | cut --fields 1 > $fileName.siRNA.list # Step 2: subset the fastq file for those sequences only /home/data/rrebollo/seqtk/seqtk subseq $fileName.fq.gz $fileName.siRNA.list > $fileName.siRNA.fq rm $fileName.piRNA.list rm $fileName.siRNA.list mv $fileName.piRNA.fq ../piRNA/ cd ../piRNA ~/FastQC/fastqc $fileName.piRNA.fq cd ../bbsplit mv $fileName.siRNA.fq ../siRNA/ cd ../siRNA ~/FastQC/fastqc $fileName.siRNA.fq cd ../bbsplit done < list.txt ``` I verified by a FastQC that all samples are as expected. #### Running TEtools Make a list of the files (siRNA and piRNAs separately), then run `cat list.txt | xargs -n 1 sh tetools.sh` tetools.sh piRNA folder ``` python3 ~/TEtools/TEcount.py -rosette /home/data/rrebollo/sitophilus/TE_expression/TEtools/GCF_002938485.1_Soryzae_2.0_genomic.fna.TEs_v3.rosette -column 2 -TE /home/data/rrebollo/sitophilus/TE_expression/TEtools/GCF_002938485.1_Soryzae_2.0_genomic.fna.TEs_v3.fasta -count $1_piRNA.count -RNA $1.fq ``` I would like to know what is the % of mapped reads. siRNA folder ``` python3 ~/TEtools/TEcount.py -rosette /home/data/rrebollo/sitophilus/TE_expression/TEtools/GCF_002938485.1_Soryzae_2.0_genomic.fna.TEs_v3.rosette -column 2 -TE /home/data/rrebollo/sitophilus/TE_expression/TEtools/GCF_002938485.1_Soryzae_2.0_genomic.fna.TEs_v3.fasta -count $1_siRNA.count -RNA $1.fq ``` Once the mappings are ready, before diving into DESEQ2 I need to get the miRNA counts to add to the DESEQ2 input. For the miRNAs, I need to check with Mariana as I might as well check if she needs me to redo this for the UNLEASh datasets. Mirdeep2: https://github.com/rajewsky-lab/mirdeep2 Got the average expresion for each confident miRNA (again check the other HackMD) and added miRNA_ piRNA_ siRNA_ prefix for each of the counts. Ran SARTools on that count, same script as for RNAseq. The data looks properly clustered. Lots of DE sRNAs. Check the rest of the HackMD for the proper pi/si analysis. ![](https://i.imgur.com/iRkWR3f.png) ![](https://i.imgur.com/tdSSTFo.png) --- Now that the data looks good (*well, I did exclude three libraries!*), and the KD looks ok, I will dive into the scientific questions. I organized the following results into a potential article. # Transposable element regulation in the rice weevil, *Sitophilus oryzae* In the introduction, we need to say we have previously showed differences in TE expression between gut vs soma (genome paper) and that we annotated the piRNA machinary and looked at its expression (genome paper). ### Expression of piRNA related genes in soma and germ tissues In Diptera, the piRNA machinary is specific to gonads, while in other insects, somatic piRNAs have recurrently been observed. The duplication of the SIWI protein into PIWI/AUB is the culprit of such tissue specificity in diptera. In coleoptera piRNAs were previously found in somatic tissues suggesting the piRNA machinary is not gonad specific. Thanks to the recently published genome of *S. oryzae*, we annotated all piRNA-associated genes and compared by real time RT-PCR their expression between gut and ovaries. Despite the presence of SIWI, associated in other species to somatic piRNAs, we found small expression of all piRNA-associated genes in gut compared to ovaries. Here we took advantage of genome wide transcriptomic data between gut and ovaries to confirm the expression of piRNA-associated genes between soma and germ tissues. Mariana gave me the data obtained in the genome paper (D10 gut and ovaries RNAseq in triplicates) but her files only shows the TE counts. When I checked the cluster, I found at /home/cdata/ferrarini_cold/GenomePaper/Trimmed/BAM/ a file named Genes_Ovaries_Gut_Genome.txt that is the output of feature counts. However, the counts are made per transcript so I can not count for multiple isoforms. I redid the feature counts with -O and made a graph of the piRNA associated genes. Got the data from /home/cdata/ferrarini_cold/GenomePaper/Trimmed/BAM/ (Check this is this ok with **Mariana**) ![](https://i.imgur.com/ZPQEFXB.png) ```ssh ##counting for graphs of genes conda activate RNAseq featureCounts -a /home/data/rrebollo/sitophilus/genome_info/GCF_002938485.1_Soryzae_2.0_genomic_modifiedMGF.gff -t exon -g "Parent" -O -T 12 -p -o WT_O.count *.bam ``` Run SARTools/DESEQ2 to get normalized counts. | Gene | LOC | RNA | Gut | Ovaries | |:--------- | ------------ |:--------------------------------------------------- |:----- |:------- | | SIWI | LOC115878363 | XM_030894834 | 9797 | 26126 | | AGO3 | LOC115879776 | XM_030896763 | 1683 | 3416 | | ARMITAGE | LOC115890098 | XM_030910239 | 679 | 850 | | HEN1 | LOC115884554 | XM_030903175 | 294 | 1061 | | SPINDLE-E | LOC115886157 | XM_030905230 | 2474 | 2927 | | MAELSTROM | LOC115888111 | XM_030907693 XM_030907694 XM_030907692 | 3195 | 5213 | | VASA | LOC115889396 | XM_030909376 | 1287 | 8044 | | ZUCCHINI | LOC115891431 | XM_030911883 | 105 | 163 | | AGO2 | LOC115885402 | XM_030904304 XM_030904305 XM_030904303 | 26987 | 2920 | | DICER2 | LOC115880804 | XM_030898095 XM_030898094 XM_030898092 XM_030898096 | 1873 | 662 | | DROSHA | LOC115876974 | XM_030893027 | 258 | 741 | Log10 normalized counts, not clustered. `sns_plot = sns.clustermap(df.iloc[:,0:2],yticklabels=True,xticklabels=1,cmap=sns.color_palette("RdYlBu_r", as_cmap=True),tree_kws=dict(linewidths=0), row_cluster=FALSE)` ![](https://i.imgur.com/wuSl0sf.png) We can see how SIWI and most piRNA associated genes are expressed higher at ovaries compared to gut. Nevertheless, the piRNA associated genes **are** expressed in gut, as obsevred in the "genome paper" AGO2 is more expressed in gut than in ovaries, same thing for DICER2. We can also plot individual graphs, which I prefer, I find the difference between gut and ovaries clearler than the heatmap. ![](https://i.imgur.com/OTJOjZg.png) ![](https://i.imgur.com/WpkScxB.png) ##### Update of 2025 I think I need to add something that shows the analysis includes the tissue as a parameter. I need both the normalized counts and the DEGs. #### RT-PCR *in situ* of piRNA associated genes There is no antibodies for immunflurescence, so we aimed at performing RT PCR *in situ*. Hopefully the method will be ready by 2022. If it is, need to check Severine Balmand/Agnès Vallier availability to test AGO3/SIWI/AGO2 in ovaries/gut of D10 adults. I think this is a mandatory experiment as it would show which cells in the gut and ovaries are actually expressing the genes studied. ### piRNA & siRNA expression in soma vs germ tissues #### Size distribution of *S. oryzae* small RNAs The entire REWIRE data was previously processed by Mariana. She took the FASTQ, trimmed adaptors off without quality trimming, and mapped with bbsplit against the *S. oryzae* and *S. pierantonius* (the bacterial endosymbiont), and send me the insect fraction here /home/cdata/rebollo_cold/REWIRE/small_RNAseq/BBSPLIT/ We do not have small RNA sequencing of gut and ovaries in wild type animals. We have gut small RNA that we could use to show the presence of piRNAs. We also have GFP injected animals, gut and ovaries small RNAs. So here is what I suggest: 1. Show that the distribution of wild type and GFP piRNA is similar in the gut (put this in sup) : the distribution of sRNA sizes is equivalent between WT and GFP. Don't forget the WT is a mirvana sample at D9, while the GFP is a columns sample at D10. Still, they look quite similar! Red : Gut WT Green: Gut GFP ![](https://i.imgur.com/K8uyhBt.png) 2. Use the GFP samples to make the comparisons between gut and ovaries : there are piRNA-sized sRNAs present in gut samples, albeit at a lower level than in ovaries. Don't forget these data are raw size distributions, no normalization, no mapping filtered. ![](https://i.imgur.com/r477wul.png) ![](https://i.imgur.com/P9VqqJY.png) ##### Small RNAs associated with TE copies in gut vs ovaries - [x] counts thanks to TEtools on the TE copies - [x] used miRNA for normalization Heatmap log10 normalized counts of TE superfamilies. How many superfamilies? **superfamilies**: 52 including satellites Academ-1 CMC-Chapaev CMC-Chapaev-3 CMC-Transib Copia CR1 Crypton Crypton-A Crypton-I DIRS Dong-R4 Ginger-2 Gypsy hAT hAT-Ac hAT-Blackjack hAT-Charlie hAT-hAT19 hAT-hATm hAT-hATx hAT-Tag1 hAT-Tip100 Helitron I I-Jockey IS3EU L2 Maverick MULE-MuDR P Pao Penelope PIF-Harbinger PIF-ISL2EU PiggyBac R1 R1-LOA R2 RTE-BovB RTE-RTE RTE-X Sola-1 Sola-2 TcMar TcMar-Fot1 TcMar-m44 TcMar-Mariner TcMar-Pogo TcMar-Tc1 TcMar-Tigger Unknown Zisupton **classes/subtypes**: 8 (with Satellite...) * DIRS * DNA * LINE * LTR * PLE * RC * Satellite * Unknown **Families:** 3,399 lines including satellites So remove Satellites from count files, and make a heatmap of the log10 normalized counts. => ongoing, asked Mariana's help! All graphs done in my local R session of sRNAseq ```r #REWIRE_sRNAseq # Library library(ggplot2) library(tidyr) library(tm) library(gridExtra) library(readr) # plots DE piRNAs # num.pi correspond aux numéros de lignes des piRNA, num.si aux siRNA num.pi = 603:4001 num.si = 4002:7400 OvaAgo3 = read.table("OvaAgo3vsOvaGFP.complete.txt", h=T) OvaAgo2 = read.table("OvaAgo2vsOvaGFP.complete.txt", h=T) OvaSiwi = read.table("OvaSiwivsOvaGFP.complete.txt", h=T) GutAgo3 = read.table("GutAgo3vsGutGFP.complete.txt", h=T) GutAgo2 = read.table("GutAgo2vsGutGFP.complete.txt", h=T) GutSiwi = read.table("GutSiwivsGutGFP.complete.txt", h=T) GutOva = read.table("OvaGFPvsGutGFP.complete.txt", h=T) #filtering for pi GutSiwipi=GutSiwi[num.pi,] GutAgo3pi=GutAgo3[num.pi,] GutAgo2pi=GutAgo2[num.pi,] OvaSiwipi=OvaSiwi[num.pi,] OvaAgo3pi=OvaAgo3[num.pi,] OvaAgo2pi=OvaAgo2[num.pi,] GutOvapi=GutOva[num.pi,] #filtering for si GutSiwisi=GutSiwi[num.si,] GutAgo3si=GutAgo3[num.si,] GutAgo2si=GutAgo2[num.si,] OvaSiwisi=OvaSiwi[num.si,] OvaAgo3si=OvaAgo3[num.si,] OvaAgo2si=OvaAgo2[num.si,] GutOvasi=GutOva[num.si,] #all small RNAs names contain the prefix piRNA- or siRNA-. In order to get family correspondance, I need to remove these GutSiwipi$Id=removeWords(GutSiwipi$Id, "piRNA-") GutAgo3pi$Id=removeWords(GutAgo3pi$Id, "piRNA-") GutAgo2pi$Id=removeWords(GutAgo2pi$Id, "piRNA-") OvaSiwipi$Id=removeWords(OvaSiwipi$Id, "piRNA-") OvaAgo3pi$Id=removeWords(OvaAgo3pi$Id, "piRNA-") OvaAgo2pi$Id=removeWords(OvaAgo2pi$Id, "piRNA-") GutOvapi$Id=removeWords(GutOvapi$Id, "piRNA-") GutSiwisi$Id=removeWords(GutSiwisi$Id, "siRNA-") GutAgo3si$Id=removeWords(GutAgo3si$Id, "siRNA-") GutAgo2si$Id=removeWords(GutAgo2si$Id, "siRNA-") OvaSiwisi$Id=removeWords(OvaSiwisi$Id, "siRNA-") OvaAgo3si$Id=removeWords(OvaAgo3si$Id, "siRNA-") OvaAgo2si$Id=removeWords(OvaAgo2si$Id, "siRNA-") GutOvasi$Id=removeWords(GutOvasi$Id, "siRNA-") #Adding the family names to the TEs #imported rosetta file GCF_002938485_1_Soryzae_2_0_genomic_fna_TEs_v3 GCF_002938485_1_Soryzae_2_0_genomic_fna_TEs_v3 <- read_delim("GCF_002938485.1_Soryzae_2.0_genomic.fna.TEs_v3.rosette", delim = "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE) TEfamily=GCF_002938485_1_Soryzae_2_0_genomic_fna_TEs_v3[,3:5] colnames(TEfamily)=c("Id","Class","Superfamily") TEfamily=unique(TEfamily) GutSiwipifamily=merge(GutSiwipi,TEfamily, by="Id") GutAgo3pifamily=merge(GutAgo3pi,TEfamily, by="Id") GutAgo2pifamily=merge(GutAgo2pi,TEfamily, by="Id") OvaSiwipifamily=merge(OvaSiwipi,TEfamily, by="Id") OvaAgo3pifamily=merge(OvaAgo3pi,TEfamily, by="Id") OvaAgo2pifamily=merge(OvaAgo2pi,TEfamily, by="Id") GutOvapifamily=merge(GutOvapi,TEfamily, by="Id") GutSiwisifamily=merge(GutSiwisi,TEfamily, by="Id") GutAgo3sifamily=merge(GutAgo3si,TEfamily, by="Id") GutAgo2sifamily=merge(GutAgo2si,TEfamily, by="Id") OvaSiwisifamily=merge(OvaSiwisi,TEfamily, by="Id") OvaAgo3sifamily=merge(OvaAgo3si,TEfamily, by="Id") OvaAgo2sifamily=merge(OvaAgo2si,TEfamily, by="Id") GutOvasifamily=merge(GutOvasi,TEfamily, by="Id") #remove satellites GutSiwipifamily=GutSiwipifamily[!grepl("Satellite", GutSiwipifamily$Class),] GutAgo3pifamily=GutAgo3pifamily[!grepl("Satellite", GutAgo3pifamily$Class),] GutAgo2pifamily=GutAgo2pifamily[!grepl("Satellite", GutAgo2pifamily$Class),] OvaSiwipifamily=OvaSiwipifamily[!grepl("Satellite", OvaSiwipifamily$Class),] OvaAgo3pifamily=OvaAgo3pifamily[!grepl("Satellite", OvaAgo3pifamily$Class),] OvaAgo2pifamily=OvaAgo2pifamily[!grepl("Satellite", OvaAgo2pifamily$Class),] GutOvapifamily=GutOvapifamily[!grepl("Satellite", GutOvapifamily$Class),] GutSiwisifamily=GutSiwisifamily[!grepl("Satellite", GutSiwisifamily$Class),] GutAgo3sifamily=GutAgo3sifamily[!grepl("Satellite", GutAgo3sifamily$Class),] GutAgo2sifamily=GutAgo2sifamily[!grepl("Satellite", GutAgo2sifamily$Class),] OvaSiwisifamily=OvaSiwisifamily[!grepl("Satellite", OvaSiwisifamily$Class),] OvaAgo3sifamily=OvaAgo3sifamily[!grepl("Satellite", OvaAgo3sifamily$Class),] OvaAgo2sifamily=OvaAgo2sifamily[!grepl("Satellite", OvaAgo2sifamily$Class),] GutOvasifamily=GutOvasifamily[!grepl("Satellite", GutOvasifamily$Class),] #plot si vs pi within gut and within ovaries #piRNA vs siRNA GFP Gutpi_si = read.table("pi-si_gut.txt", h=T) Ovapi_si = read.table("pi-si_ova.txt", h=T) Gutpi_sifamily=merge(Gutpi_si,TEfamily, by="Id") Ovapi_sifamily=merge(Ovapi_si,TEfamily, by="Id") #remove satellite Gutpi_sifamily=Gutpi_sifamily[!grepl("Satellite", Gutpi_sifamily$Class),] Ovapi_sifamily=Ovapi_sifamily[!grepl("Satellite", Ovapi_sifamily$Class),] #plots grid.arrange( ggplot(data=Gutpi_sifamily)+ geom_point(aes(x=piRNA+1, y=siRNA+1, col = Class), alpha=0.4)+ scale_x_log10(limits = c(1,1e6))+ scale_y_log10(limits = c(1,1e6))+ annotation_logticks(base = 10)+ geom_abline(linetype="dashed", color = "red")+ theme(axis.line = element_line(colour = "black"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank()), ggplot(data=Ovapi_sifamily)+ geom_point(aes(x=piRNA+1, y=siRNA+1, col = Class), alpha=0.4)+ scale_x_log10(limits = c(1,1e6))+ scale_y_log10(limits = c(1,1e6))+ annotation_logticks(base = 10)+ geom_abline(linetype="dashed", color = "red")+ theme(axis.line = element_line(colour = "black"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank()), nrow = 1) #plot pi gut vs ova & si gut vs ova grid.arrange( ggplot(data=GutOvapifamily)+ geom_point(aes(x=(log10((GutGFP)+1)), y=log2(OvaGFP/GutGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-12,12) + xlim(0,6) + theme(axis.line = element_line(colour = "black"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank()), ggplot(data=GutOvasifamily)+ geom_point(aes(x=(log10((GutGFP)+1)), y=log2(OvaGFP/GutGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-12,12) + xlim(0,6) + theme(axis.line = element_line(colour = "black"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank()), nrow = 1) ``` pi left, si right ![](https://i.imgur.com/8QTjeUA.png) Gut left, ovary right ![](https://i.imgur.com/h7n4NhL.png) Class specific graphs piRNA vs siRNA in each tissue ![](https://i.imgur.com/ng04kR9.png) piRNA or siRNA between tissues ![](https://i.imgur.com/3LJ7wn3.png) Size distribution of TE mapped reads. - [x] signature.py Trying with TEtools output in one sample (Sitophilus.Trimmed.RTRL11.piRNA.sam). I installed a conda environment with numpy (because it didn't work without it) named sRNA. `python signature.py Sitophilus.Trimmed.RTRL11.piRNA.sam 23 30 1 30 test.txt` It works, so run the script on all samples. piRNA ```ssh while read line; do fileName=$(echo $line|awk '{print $1}'); python signature.py $fileName 23 30 1 30 $fileName.txt done < sam.list ``` siRNA ```ssh while read line; do fileName=$(echo $line|awk '{print $1}'); python signature.py $fileName 21 21 1 21 $fileName.txt done < sam.list ``` ![](https://i.imgur.com/o86qijn.png) Clear ping pong signal for gut and ovaries. siRNA overlap is supposed to be 19 and here we see increase in overlap for guts, and overlap at 18 for ovaries. I did this on TEtools sam file, so reads are mapped allover the copies. - [ ] check double stranded presence for siRNAs? **Need to show sense and antisense mapping per family, but there are many families, so need to think of a better way of showing this.** - [x] piRNA cluster identification Ongoing, check this hakcmd https://hackmd.io/@unleash/piRNAcluster #### Small RNA seq analysis WT vs GFP => I don't have it using the same age and extraction method so I am not focusing on this for now. #### Link TE expression & small RNAs - [ ] sup: TE heatmap D10 genomic data = GFP data. Ask **Mariana** for the TE counts of the RNAseq of the genome paper. - [ ] 2D plot TE RNAseq with sRNAseq - [ ] TEs expressed => enriched in clusters? ## Impairment of siRNA and piRNA machineries ### Survival Data present in C:\Users\rrebollo\Documents\work\BF2i\OngoingProjects\CrossTalk-TESodalis\Theo Théo made 5 independent injections (there is also AGO1 in the lot, that we can certainly add to the paper and suggest we did for the sake of positive control). Below the example of one survival curve only, all the others look the same. Injections are done at the larval stage and the survival is followed for at least 30 days (all the RTqPCR showing decrease in expresiosn of AGO1/AGO2/AGO3 and SIWI are also available). * AGO1 : lethal * AGO2 : no effect * AGO3 : lethal * SIWI : lethal ![](https://i.imgur.com/vRU8RaS.png) These were all done at the larval stage. The subsequent analysis we have done for RNA and sRNAseq are performed at day 3 adults. I don't think we need to make survival on the injected adults. Ideas? Also need to talk about the SIWI phenotype of the survivals? They were all strangey looking and didn't walk very well. ### Phenotypes showing empty gonads (instead of fecundity tests) - [ ] Take pictures of ovaries and gut on GFP, Siwi, Ago3 and Ago2 Day 7. See if Agnès/Severine could do this. ### piRNA expression on KD RTqPCR confirmation of KD before sequencing (I think it could be a figure in the paper) ![](https://i.imgur.com/FewtmyJ.png) Normalized read count confirmation after sequencing (this could be in the supplementary data). ![](https://i.imgur.com/AmY3tik.jpg) - [x] small RNA total counts - [ ] small RNA size distribution (not sure we need this) - [x] potentially a coverage of piRNA clusters ![](https://i.imgur.com/Vv5hh3g.png) ### siRNA expression on KD ![](https://i.imgur.com/KucEu32.png) ### TE expression on KD - [x] Marie's plot ```r library(ggplot2) library(gridExtra) #import counts OvaAgo3 = read.table("OvaAgo3vsOvaGFP.complete.txt", h=T) OvaAgo2 = read.table("OvaAgo2vsOvaGFP.complete.txt", h=T) OvaSiwi = read.table("OvaSiwivsOvaGFP.complete.txt", h=T) GutAgo3 = read.table("GutAgo3vsGutGFP.complete.txt", h=T) GutAgo2 = read.table("GutAgo2vsGutGFP.complete.txt", h=T) GutSiwi = read.table("GutSiwivsGutGFP.complete.txt", h=T) # num.te correspond aux numéros de lignes des ET num.te = 27310:30708 #filtering for TE counts GutSiwiTE=GutSiwi[num.te,] GutAgo3TE=GutAgo3[num.te,] GutAgo2TE=GutAgo2[num.te,] OvaSiwiTE=OvaSiwi[num.te,] OvaAgo3TE=OvaAgo3[num.te,] OvaAgo2TE=OvaAgo2[num.te,] #making plots grid.arrange( ggplot(data=GutSiwiTE)+geom_point(aes(x=(log10((GutGFP)+1)), y=log2(GutSiwi/GutGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8), ggplot(data=GutAgo3TE)+geom_point(aes(x=(log10((GutGFP)+1)), y=log2(GutAgo3/GutGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8), ggplot(data=GutAgo2TE)+geom_point(aes(x=(log10((GutGFP)+1)), y=log2(GutAgo2/GutGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8), ggplot(data=OvaSiwiTE)+geom_point(aes(x=(log10((OvaGFP)+1)), y=log2(OvaSiwi/OvaGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8), ggplot(data=OvaAgo3TE)+geom_point(aes(x=(log10((OvaGFP)+1)), y=log2(OvaAgo3/OvaGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8), ggplot(data=OvaAgo2TE)+geom_point(aes(x=(log10((OvaGFP)+1)), y=log2(OvaAgo2/OvaGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8), nrow = 2) ``` We can see SIWI KD results in massive upregulation of TE families, regardless of the tissue. There is some TE upregulation in AGO3, both tissues again. However, there no differences between AGO2 and GFP. Indeed the RNAi is not extremely efficient but this might specially mean that the TEs are not regulated by the siRNA pathway, and are mainly controlled by the piRNA pathway, regardless of the tissue. This might be confirmed by the piRNA/siRNA mapping to TEs. There is a lot of AGO2 padj NA in the gut, and this was not there when all samples were taken into account (see deprecated analysis). This could be because there is too much variation in between the replicates, but they seem ok on the PCA and dendogram. ![](https://i.imgur.com/doDARB9.png) - [x] Global TE deregulation ![](https://i.imgur.com/Abt0kex.png) If we break down by TE class (sorry the colors are ugly, I need to get the same colors as the genome paper) we can see LTR elements are particularly upregulated on all conditions, however they represent a small amount of the 72% TE content of *S. oryzae* genome. TIR elements, the most abundant, are also upregulated. Finally, the fact that Unknown elemnts are upregulated also confirm them as being "real" TEs. ![](https://i.imgur.com/c0K50XA.png) - [x] Venn diagram of upregulated TEs between conditions and tissues. There is nearly half of all upregulated TEs that are specific to ovaries and SIWI RNAi, and 21% to Gut Siwi RNAi. There is 34% common to both SIWI RNAi. All AGO3 upregulated TEs in gut are upregulated in ovaries. ![](https://i.imgur.com/1gQeoKi.png) - [ ] heatmap - [ ] violin plot Ova GFP vs Gut GFP ```r GutOva = read.table("OvaGFPvsGutGFP.complete.txt", h=T) GutOvaTE=GutOva[num.te,] ggplot(data=GutOvaTE)+geom_point(aes(x=(log10((GutGFP)+1)), y=log2(OvaGFP/GutGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8) ``` ![](https://i.imgur.com/FFG2agW.png) Comparison of TE log2FC of KDs vs GFP. Make a table with log2C of all KD vs their respective GFP. Try heatmap with ggplot. The best would be to separate the TEs into categories later on, DNA/LINE/LTR etc. ### TE vs piRNA expression on KD # Should we also take a look at the genes that are differentially expressed in the KDs? Yes, need to do this... There are some genes for instance in the AGO2 datasets... Maybe associated with immunity? ![](https://i.imgur.com/nLmlKLr.png) ## code to be used For TE data later on: ```python annotation=pd.read_table("GUT_OVS_BaseMean_FC_Pvalue_v2.txt",index_col=0) annot=annotation.iloc[:,1:2] k=len(df.Class.unique()) for i in range(0, k): te_class=df.Class.unique()[i] plot_df=df[df.Class==te_class] sns_plot=sns.clustermap(plot_df.iloc[:,1:8],yticklabels=0,xticklabels=1,cmap=sns.color_palette("RdYlBu_r", as_cmap=True),figsize=(5,12),vmin=0,vmax=5.5,tree_kws=dict(linewidths=0) ) plt.show sns_plot.savefig(te_class+".pdf") ``` ## Deprecated analysis **Log2FC comparisons** (Marie Fablet's inspired plots). Questions to Marie: - [x] Why when I plot log2FC the results do not look the same? Marie tells me it is because DESEQ2 makes a bit of gymnastics to fit the data but it should not change that much. Try with both and see what changes. ```r library(ggplot2) library(gridExtra) #import counts OvaAgo3 = read.table("OvaAgo3vsOvaGFP.complete.txt", h=T) OvaAgo2 = read.table("OvaAgo2vsOvaGFP.complete.txt", h=T) OvaSiwi = read.table("OvaSiwivsOvaGFP.complete.txt", h=T) GutAgo3 = read.table("GutAgo3vsGutGFP.complete.txt", h=T) GutAgo2 = read.table("GutAgo2vsGutGFP.complete.txt", h=T) GutSiwi = read.table("GutSiwivsGutGFP.complete.txt", h=T) # num.te correspond aux numéros de lignes des ET num.te = 27310:30708 #filtering for TE counts GutSiwiTE=GutSiwi[num.te,] GutAgo3TE=GutAgo3[num.te,] GutAgo2TE=GutAgo2[num.te,] OvaSiwiTE=OvaSiwi[num.te,] OvaAgo3TE=OvaAgo3[num.te,] OvaAgo2TE=OvaAgo2[num.te,] #making plots grid.arrange( ggplot(data=GutSiwiTE)+geom_point(aes(x=(log10((GutGFP)+1)), y=log2(GutSiwi/GutGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8), ggplot(data=GutAgo3TE)+geom_point(aes(x=(log10((GutGFP)+1)), y=log2(GutAgo3/GutGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8), ggplot(data=GutAgo2TE)+geom_point(aes(x=(log10((GutGFP)+1)), y=log2(GutAgo2/GutGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8), ggplot(data=OvaSiwiTE)+geom_point(aes(x=(log10((OvaGFP)+1)), y=log2(OvaSiwi/OvaGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8), ggplot(data=OvaAgo3TE)+geom_point(aes(x=(log10((OvaGFP)+1)), y=log2(OvaAgo3/OvaGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8), ggplot(data=OvaAgo2TE)+geom_point(aes(x=(log10((OvaGFP)+1)), y=log2(OvaAgo2/OvaGFP), color=padj<0.05), alpha=0.4) + geom_hline(yintercept=0, linetype="dashed", color = "red") + ylim(-8,8), nrow = 2) ``` ![](https://i.imgur.com/01d0Wb3.png) Things I am not sure and to ask **Mariana**: - [x] Why are are all those Ago3 TEs without padj? Too much difference between the replicates? - [x] How can I add the info on the TE type? Grey for padj > 0.05, then colored based on TE type? Or make different plots for each class? The Gut SIWI_1 and the Ova AGO3_3 wer strange in the PCA + dendogram. I am going to remove them from the analysis, and see if PCA/dendogram and these DE plots look better (because we can see all those grey dots that are probably samples where DESEQ couldn't get a proper padj due to too much variation between samples).