# ChIP-seq (cold DMEL) ###### tags: `ChIP-seq` `DMEL` `Cold` Started April 8th 2021 Updated December 2021 ## Pipeline inspiration * https://www.bioconductor.org/packages/devel/data/experiment/vignettes/systemPipeRdata/inst/doc/systemPipeChIPseq.html * https://hbctraining.github.io/Intro-to-ChIPseq/lessons/03_align_and_filtering.html ## Data All info related to the project can be found [here](https://drive.google.com/drive/folders/1qsBFzDWty0SYJYRqjL1kDAB50DFyHo7n?usp=sharing). Data is on /home/cdata/rebollo_cold/VieiraSeq/pbil.univ-lyon1.fr/pub/datasets/vieira/. For now I only have the H3K9me3, H3K4me3 and input for controls and cold stress of *Drosophila melanogaster* ovaries. **Checking md5sum** Checked md5 on donwloaded fastq files. Cristina sent me along with the files, the original md5sum to check. All good :heavy_check_mark: ```ssh md5sum *.fastq.gz > md5sumToCheck.txt ``` *D. melanogaster* genome : ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.16_FB2017_03/fasta/dmel-all-chromosome-r6.16.fasta.gz Also donwloaded the GTF, from the same link (dmel-all-r6.16). ## Quality check A couple of libraries have very few reads, despite the correct donwload (md5sum), see [here](https://drive.google.com/file/d/14qYC-hFl3BVuzpySxJ0Ax8v_6sbyrrSF/view?usp=sharing). There are three libraries that seem problematic (most libraries have around 1G) : DmGoth6.3 CS2 K4s and DmSJ23_cont2 are around ~4M, and DmSJ23_cont1-K4 are ~1M. :negative_squared_cross_mark: ```ssh du -sh *.fastq.gz >file.size.txt ``` FastQC shows the presence of illumina adaptors, but the overall quality is good. All FastQC reports can be found [here](https://drive.google.com/drive/folders/1sOu00XzjeK8jR3sAmX4cEGrZ93liJWvc?usp=sharing), along with a multiQC report. ```ssh #fastQC on all samples ~/FastQC/fastqc *.fastq.gz #multiqc source ~/miniconda2/bin/activate multiqc multiqc . source ~/miniconda2/bin/deactivate multiqc ``` In a couple of samples, there is a weird GC content distribution... need to see if this is removed once adaptor cleaning is made. There are also loads of duplicated crappy sequences that I hope will be gone with trimming. If the GC content remains after trimming, check if the K9me3 samples are the ones with this weird profile. It could be tandem repeats, satellites etc. ![](https://i.imgur.com/P7FsEMY.png) ![](https://i.imgur.com/v2WFCtR.png) Most samples have a crap 5' that should be removed by trimming (around 8-9 bases). ![](https://i.imgur.com/RffcpO9.png) ## Quality and adaptor Trimming I use [TrimGalore](https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md) because it recognizes the adaptors automatically and works quite fast. Built a library.txt with file names (column 1 = R1, column 2 = R2, without .fastq.gz extension). All fastQC reports and multiQC of trimmed files can be found [here](https://drive.google.com/drive/folders/1MuKb4Y0z5lhsEddRyrXHzb-2O_fuQ0J7?usp=sharing). ```ssh #print file names and manually create columns ls *.fastq.gz > library.txt while read line; do R1=$(echo $line|awk '{print $1}'); R2=$(echo $line|awk '{print $2}'); #Trimming with defaults (min quality of 20, adapter automatically recognized, min length of 20), 4 cores, clip 9bp of 5' because of FastQC results. ~/TrimGalore-0.6.0/trim_galore --path_to_cutadapt /usr/bin/cutadapt --paired --max_n 0 --cores 4 --clip_R1 9 --clip_R2 9 $R1.fastq.gz $R2.fastq.gz done <library.txt #FastQC because mine doesn't work from Trim_galore ~/FastQC/fastqc *fq.gz #multiqc source ~/miniconda2/bin/activate multiqc multiqc . source ~/miniconda2/bin/deactivate multiqc ``` Trimmed 1 to 6% of reads. Clipped 5' works well and we can see the A depletion at the 3' because of adaptor removal. ![](https://i.imgur.com/apIiSiU.png) The GC content is still strange, so I'm checking what these red samples are. ![](https://i.imgur.com/gDGMQeF.png) The samples correpond rouhgly to the ones that show high duplicated seq (a lot of simple repeats: GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAA TCTTCTCTTCTCTTCTCTTCTCTTCTCTTCTCTTCTCTTCTCTTCTCTTC for instance). I think it is ok, these reads won't properly map to the genome. ![](https://i.imgur.com/eQ2It92.png) I'm running Fastq screen to check for potential contaminants in case of: Adaptors, Drosophila, Human, phix, and vectors. `./fastq_screen CNVR31.R1.fastq.gz` CNVR31 - bad GC - over represented >3% ![](https://i.imgur.com/MuUVuOe.png) CNVR54 - bad GC - over represented >1% ![](https://i.imgur.com/ReQAHph.png) CNVR37 bad GC - over represented >3% ![](https://i.imgur.com/WsR55P8.png) CNVR7 - ok ![](https://i.imgur.com/Uh5WtE9.png) CNVR11 - ok ![](https://i.imgur.com/8GGqhbp.png) In conclusion the reads that are probably over represented do map against Drosophila but also against human and vectors. For me, these reads are bringing noise to the libraries but apparently it is not recommended to remove such reads... Regarding the libraries that had smaller sizes, they end up with ~8M reads (DmGoth6.3.CS2.K4/CNVR41 & DmSJ23.cont2/CNVR46) or 1.5M reads (DmSJ23.cont1.K4/CNVR44) compared to the other libraries that show >10/15M reads. :negative_squared_cross_mark: ## Mapping ### Bowtie2 [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) reports multimapped reads randomly, so a filtering step to get only uniquely mapped reads is needed. The option for Bowtie2 `--local` is the same as `--sensitive-local` and it means `-D 15 -R 2 -N 0 -L 20 -i S,1,0.75` . BAM sorting was done with [Sambamba](http://lomereiter.github.io/sambamba/index.html) instead of samtools because it is faster and suggested [here](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/03_align_and_filtering.html). Lastly, I also removed "[blacklisted regions](https://www.nature.com/articles/s41598-019-45839-z)" that have shown to be artifact regions. The regions are from dm6 that should work on the genome used here (dmel-all-chromosome-r6.16.fasta.gz). I donwloaded the dm6 blacklists files [here](https://github.com/Boyle-Lab/Blacklist/). To cite : Amemiya, H.M., Kundaje, A. & Boyle, A.P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, 9354 (2019). https://doi.org/10.1038/s41598-019-45839-z ```ssh #build bowtie2 index bowtie2-build dmel-all-chromosome-r6.16.fasta.gz dmel-all-chromosome-r6.16 #sambamba for BAM filtering is within a conda environment source ~/miniconda2/bin/activate sambamba ####loop script while read line; do R1=$(echo $line|awk '{print $1}'); R2=$(echo $line|awk '{print $2}'); #mapping bowtie2 -p 4 --local -x /home/data/rrebollo/FRM/dmel-all-chromosome-r6.16 -1 ${R1}_val_1.fq.gz -2 ${R2}_val_2.fq.gz -S ${R1}.sam #convert to BAM samtools view -h -S -b -o ${R1}.unsorted.bam ${R1}.sam #sort BAM by coordinates using sambamba sambamba sort -t 2 -o ${R1}.sorted.bam ${R1}.unsorted.bam #filter uniquely mapping reads with sambamba view sambamba view -h -t 2 -f bam -F "[XS] == null and not unmapped and not duplicate" ${R1}.sorted.bam > ${R1}.filtered01.bam #filter blacklisted regions, regions with chr names modified, I removed the chr to match the BAM bedtools intersect -v -a ${R1}.filtered01.bam -b /home/data/rrebollo/FRM/dm6-blacklist.v2.bed > ${R1}.filtered02.bam done <library.txt #### End of loop script #deactivate conda environment source ~/miniconda2/bin/deactivate sambamba ``` From [here](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/03_align_and_filtering.html) we can understand the filtering procedure to get only uniquely mapped reads. > We filtered out unmapped reads by specifying in the filter not unmapped, and duplicates with not duplicate. Also, among the reads that were aligned, we filtered out multimappers by specifying [XS] == null. ‘XS’ is a tag generated by Bowtie2 that gives an alignment score for the second-best alignment, and it is only present if the read is aligned and more than one alignment was found for the read. Example of mapping stats (CNVR14): > 17220109 reads; of these: > 17220109 (100.00%) were paired; of these: > 4715520 (27.38%) aligned concordantly 0 times > 10120277 (58.77%) aligned concordantly exactly 1 time > 2384312 (13.85%) aligned concordantly >1 times > ---- > 4715520 pairs aligned concordantly 0 times; of these: > 3932777 (83.40%) aligned discordantly 1 time > ---- > 782743 pairs aligned 0 times concordantly or discordantly; of these: > 1565486 mates make up the pairs; of these: > 430814 (27.52%) aligned 0 times > 63405 (4.05%) aligned exactly 1 time > 1071267 (68.43%) aligned >1 times > 98.75% overall alignment rate ### Mapping quality control #### Visualisation (IGV) Things to check: - [x] Input mapping is all over the place - [x] K4me3 present within promoters - [x] K9me3 all over the place - [x] Blacklisted regions are absent Checking IGV with DmSJ7 controls (CNVR7, CNVR8), K4 (CNVR23, CNVR25), K9 (CNVR24,CNVR26). H3K4me3 enriched on gene promoters, while input and H3K9me3 are low everywhere. ![](https://i.imgur.com/DJUq4oM.png) ![](https://i.imgur.com/VMmD5Af.png) Black listed regions are indeed removed from the BAM files. See below for an example of a low mappability region and a high signal region. ![](https://i.imgur.com/cX8g7Wu.png) ![](https://i.imgur.com/wwfJBdb.png) Since DmSJ7 samples are fine, I checked all the other samples compared to DmSJ7 controls (check all the screenshots [here](https://docs.google.com/presentation/d/1XSWbEoyvXlYu7hXIMpLHh9u9e1Lg7FZ5kBFH0kTE9t4/edit?usp=sharing)). All SJ7 and Goth10.1 are fine. But all the inputs, and some K9 of DmGoth6.3 and SJ23 are looking as K4 datasets, with narrow peaks, often at the K4 peaks. * DmSj7 control :heavy_check_mark: * DmSj7 CS :heavy_check_mark: * DmSJ23 control :negative_squared_cross_mark: * DMSJ23 CS :negative_squared_cross_mark: * DmGoth10.1 control :heavy_check_mark: * DmGoth10.1 CS :heavy_check_mark: * DmGoth6.3 control :negative_squared_cross_mark: * DmGoth6.3 CS :negative_squared_cross_mark: These results are in agreement with the Fingerprints done below (Deeptools section). #### Qualimap Using [qualimap](http://qualimap.conesalab.org/) to check mapping statistics. Qualimap is a graphic tool, and I did a multiBAM QC on the final BAM (filtered for uniquely mapping, non duplicated, and outside backlisted regions). Check the entire reports [here](https://drive.google.com/drive/folders/1VHQ5RqfD1eTrXB5PPs1_r3uq9ovBu6yu?usp=sharing). ```ssh ./qualimap --java-mem-size=4G ``` One of the things that was bugging me, was the GC content distribution. Here we can see that while the GC content might be different between samples, it is homogeneous within a library. ![](https://i.imgur.com/GiX4qWs.png) The PCA of all samples is really hard to grasp... It would be better to make this analysis with bigwig files (it would at least remove the input information), and label K4 and K9 separately. ![](https://i.imgur.com/N5SHt7i.png) #### Deeptools [Deeptools](https://deeptools.readthedocs.io/en/develop/content/list_of_tools.html#tools-for-qc) can be used to assess ChIP-seq quality with their QC tools, and it might be more suited than Qualimap. Fingerprints of ChIP are done with bam files (filtered, last output of mapping script). ```ssh ##start loop script## while read line; do R1=$(echo $line|awk '{print $1}'); #need to index the BAM files samtools index ${R1}.filtered02.bam done <library.txt ##end loop script## #activate deeptools environment source ~/miniconda2/bin/activate deeptools #plotFingerprint with BAM files for each group (for instance, the duplicates for DMSJ7 control, K4 and K9) plotFingerprint -p 4 -b CNVR7.R1.filtered02.bam CNVR8.R1.filtered02.bam CNVR23.R1.filtered02.bam CNVR25.R1.filtered02.bam CNVR24.R1.filtered02.bam CNVR26.R1.filtered02.bam --labels input-01 input-02 H3K4me3-01 H3K4me3-02 H3K9me3-01 H3K9me3-02 -T "Fingerprints of DMSJ7_Control" -plot DMSJ7_Control.pdf plotFingerprint -p 4 -b CNVR9.R1.filtered02.bam CNVR10.R1.filtered02.bam CNVR27.R1.filtered02.bam CNVR29.R1.filtered02.bam CNVR28.R1.filtered02.bam CNVR30.R1.filtered02.bam --labels input-01 input-02 H3K4me3-01 H3K4me3-02 H3K9me3-01 H3K9me3-02 -T "Fingerprints of DMSJ7_Cold" -plot DMSJ7_Cold.pdf plotFingerprint -p 4 -b CNVR11.R1.filtered02.bam CNVR12.R1.filtered02.bam CNVR15.R1.filtered02.bam CNVR17.R1.filtered02.bam CNVR16.R1.filtered02.bam CNVR18.R1.filtered02.bam --labels input-01 input-02 H3K4me3-01 H3K4me3-02 H3K9me3-01 H3K9me3-02 -T "Fingerprints of DMGoth10.1_Control" -plot DMGoth10.1_Control.pdf plotFingerprint -p 4 -b CNVR13.R1.filtered02.bam CNVR14.R1.filtered02.bam CNVR19.R1.filtered02.bam CNVR21.R1.filtered02.bam CNVR20.R1.filtered02.bam CNVR22.R1.filtered02.bam --labels input-01 input-02 H3K4me3-01 H3K4me3-02 H3K9me3-01 H3K9me3-02 -T "Fingerprints of DMGoth10.1_Cold" -plot DMGoth10.1_Cold.pdf plotFingerprint -p 4 -b CNVR31.R1.filtered02.bam CNVR34.R1.filtered02.bam CNVR32.R1.filtered02.bam CNVR35.R1.filtered02.bam CNVR33.R1.filtered02.bam CNVR36.R1.filtered02.bam --labels input-01 input-02 H3K4me3-01 H3K4me3-02 H3K9me3-01 H3K9me3-02 -T "Fingerprints of DMGoth6.3_Control" -plot DMGoth6.3_Control.pdf plotFingerprint -p 4 -b CNVR37.R1.filtered02.bam CNVR40.R1.filtered02.bam CNVR38.R1.filtered02.bam CNVR41.R1.filtered02.bam CNVR39.R1.filtered02.bam CNVR42.R1.filtered02.bam --labels input-01 input-02 H3K4me3-01 H3K4me3-02 H3K9me3-01 H3K9me3-02 -T "Fingerprints of DMGoth6.3_Cold" -plot DMGoth6.3_Cold.pdf plotFingerprint -p 4 -b CNVR43.R1.filtered02.bam CNVR46.R1.filtered02.bam CNVR44.R1.filtered02.bam CNVR47.R1.filtered02.bam CNVR45.R1.filtered02.bam CNVR48.R1.filtered02.bam --labels input-01 input-02 H3K4me3-01 H3K4me3-02 H3K9me3-01 H3K9me3-02 -T "Fingerprints of DMSJ23_Control" -plot DMSJ23_Control.pdf plotFingerprint -p 4 -b CNVR49.R1.filtered02.bam CNVR52.R1.filtered02.bam CNVR50.R1.filtered02.bam CNVR53.R1.filtered02.bam CNVR51.R1.filtered02.bam CNVR54.R1.filtered02.bam --labels input-01 input-02 H3K4me3-01 H3K4me3-02 H3K9me3-01 H3K9me3-02 -T "Fingerprints of DMSJ23_Cold" -plot DMSJ23_Cold.pdf #deactivate conda environment source ~/miniconda2/bin/deactivate deeptools ``` Fingerprints DmSJ7 :heavy_check_mark: ![](https://i.imgur.com/hEN2YTl.png) ![](https://i.imgur.com/YDZOXDN.png) Fingerprints DmGoth10.1 :heavy_check_mark: ![](https://i.imgur.com/XicortI.png) ![](https://i.imgur.com/wf5ptj0.png) Fingerprints DmGoth6.3 (CNVR41 8M reads only, DmGoth6.3.CS2.K4). Inputs are bad, and one K9me3 from control (second replicate) is not great :negative_squared_cross_mark:. ![](https://i.imgur.com/Jqa1a8q.png) ![](https://i.imgur.com/UQ2LcSp.png) Fingerprints DmSJ23 control and CS (cont2 has 8M reads, while control1_K4 has only 1.5M). All inputs are bad and one K9 CS is also crappy :negative_squared_cross_mark:. ![](https://i.imgur.com/Kz3q5Ut.png) ![](https://i.imgur.com/G6to5ML.png) In conclusion, both the IGV and the fingerprints show that DMSJ23 and DMGoth6.3 have crap input samples for all of their replicates, control and cold shock. Furthermore, it does not seem that the file size has any influence on the samples, as for instance, the two inputs of DmSJ23_control are similar on the fingerprints, while one of them has only 8M reads; same for the K4, where one has 1.5M reads and is very similar to the one that has 1GB. There are a couple of K9me3 that don't look ok, but let's see how they are in normalized samples. Cristina asked Judit to see how the samples were processed. The method she used does not explain the differences we are seeing in DmSJ23 and DmGoth10.1. Indeed, Judit dit the Goth, and Angelo did the SJ so there is no way they could have messed up one sample each. ### BBsplit In order to verify if the "overrepresented sequences" are responsible for the "bad" fingerprint of a couple of DMel populations, I decided to filter for melanogaster only read using BBsplit (as we do with sitophilus oryzae). So I simply used the trimmed fastq for one good repliate (input/K4/K9) and one bad replicate (input/K4/K9), against drosophila and human. Libraries to be used: CNVR7 DmSJ7_Cont1 CNVR23 DmSJ7_Cont1-K4 CNVR24 DmSJ7_Cont1-K9 CNVR43 DmSJ23_Cont1 CNVR44 DmSJ23_Cont1-K4 CNVR45 DmSJ23_Cont1-K9 https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/installation-guide/ http://seqanswers.com/forums/showthread.php?t=41288 `ambig2=toss` Ambiguous reads are considered unmapped. `ambig2=split` Write a copy to the AMBIGUOUS_ output file for each reference to which it maps. ```ssh bash /home/data/rrebollo/bbmap/bbsplit.sh in1=CNVR23.R1_val_1.fq.gz in2=CNVR23.R2_val_2.fq.gz ref=/home/data/rrebollo/bbmap/resources/hg38.fa.gz,/home/data/rrebollo/bbmap/resources/dmel-all-chromosome-r6.16.fasta.gz ambig2=split basename=CNVR23splitout_%.fq outu1=CNVR23splitclean1.fq outu2=CNVR23splitclean2.fq bash /home/data/rrebollo/bbmap/bbsplit.sh in1=CNVR24.R1_val_1.fq.gz in2=CNVR24.R2_val_2.fq.gz ref=/home/data/rrebollo/bbmap/resources/hg38.fa.gz,/home/data/rrebollo/bbmap/resources/dmel-all-chromosome-r6.16.fasta.gz ambig2=split basename=CNVR24out_%.fq outu1=CNVR24clean1.fq outu2=CNVR24clean2.fq bash /home/data/rrebollo/bbmap/bbsplit.sh in1=CNVR7.R1_val_1.fq.gz in2=CNVR7.R2_val_2.fq.gz ref=/home/data/rrebollo/bbmap/resources/hg38.fa.gz,/home/data/rrebollo/bbmap/resources/dmel-all-chromosome-r6.16.fasta.gz ambig2=split basename=CNVR7out_%.fq outu1=CNVR7clean1.fq outu2=CNVR7clean2.fq bash /home/data/rrebollo/bbmap/bbsplit.sh in1=CNVR43.R1_val_1.fq.gz in2=CNVR43.R2_val_2.fq.gz ref=/home/data/rrebollo/bbmap/resources/hg38.fa.gz,/home/data/rrebollo/bbmap/resources/dmel-all-chromosome-r6.16.fasta.gz ambig2=split basename=CNVR43out_%.fq outu1=CNVR43clean1.fq outu2=CNVR43clean2.fq bash /home/data/rrebollo/bbmap/bbsplit.sh in1=CNVR44.R1_val_1.fq.gz in2=CNVR44.R2_val_2.fq.gz ref=/home/data/rrebollo/bbmap/resources/hg38.fa.gz,/home/data/rrebollo/bbmap/resources/dmel-all-chromosome-r6.16.fasta.gz ambig2=split basename=CNVR44out_%.fq outu1=CNVR44clean1.fq outu2=CNVR44clean2.fq bash /home/data/rrebollo/bbmap/bbsplit.sh in1=CNVR45.R1_val_1.fq.gz in2=CNVR45.R2_val_2.fq.gz ref=/home/data/rrebollo/bbmap/resources/hg38.fa.gz,/home/data/rrebollo/bbmap/resources/dmel-all-chromosome-r6.16.fasta.gz ambig2=split basename=CNVR45out_%.fq outu1=CNVR45clean1.fq outu2=CNVR45clean2.fq ``` - [x] Calculate the number of reads lost (FastQC) | Library | Lib. quality | Raw | Trimmed | BBsplit/2 | Lib. quality | |:------- |:------------:| --- | ------- | --------:| ---------------------------- | | CNVR7 | pass | 16292888 | 16259510 | 15894439 | pass | | CNVR23 | pass | 17609074 | 17557994 | 17320227 | pass | | CNVR24 | GC spikey - few over. seq. | 17620817 | 17557435 | 16991811 | pass - no over. seq. | | CNVR43 | fail | 14609819 | 14505996 | 13516046 | GC fail- over. seq. | | CNVR44 | fail | 1546406 | 1538733 | 1450227 | GC a bit better - no over. seq. | | CNVR45 | fail | 15926411 | 15855911 | 15220190 | GC a bit better - few over. seq. | - [x] Map against the dm6 genome So it seems some of the libraries could be better after bbsplit. The fastq paired-end reads from bbsplit are in one single file... see how to split it or change the bowtie command. :heavy_check_mark: ``` bash /home/data/rrebollo/bbmap/reformat.sh in=CNVR45out_dmel-all-chromosome-r6.16.fq out1=DmSJ23_Cont1-K9_R1.fq out2=DmSJ23_Cont1-K9_R2.fq bash /home/data/rrebollo/bbmap/reformat.sh in=CNVR43out_dmel-all-chromosome-r6.16.fq out1=DmSJ23_Cont1_R1.fq out2=DmSJ23_Cont1_R2.fq bash /home/data/rrebollo/bbmap/reformat.sh in=CNVR24out_dmel-all-chromosome-r6.16.fq out1=DmSJ7_Cont1-K9_R1.fq out2=DmSJ7_Cont1-K9_R2.fq bash /home/data/rrebollo/bbmap/reformat.sh in=CNVR7out_dmel-all-chromosome-r6.16.fq out1=DmSJ7_Cont1_R1.fq out2=DmSJ7_Cont1_R2.fq bash /home/data/rrebollo/bbmap/reformat.sh in=CNVR23splitout_dmel-all-chromosome-r6.16.fq out1=DmSJ7_Cont1-K4_R1.fq out2=DmSJ7_Cont1-K4_R2.fq bash /home/data/rrebollo/bbmap/reformat.sh in=CNVR44out_dmel-all-chromosome-r6.16.fq out1=DmSJ23_Cont1-K4_R1.fq out2=DmSJ23_Cont1-K4_R2.fq ``` ```ssh #sambamba for BAM filtering is within a conda environment #sambamba for BAM filtering is within a conda environment export PATH="/home/data/rrebollo/miniconda2/bin:$PATH" source activate sambamba ####loop script while read line; do R1=$(echo $line|awk '{print $1}'); R2=$(echo $line|awk '{print $2}'); #mapping bowtie2 -p 12 --local -x /home/data/rrebollo/drosophila/FRM/dmel-all-chromosome-r6.16 -1 ${R1}.fq -2 ${R2}.fq -S ${R1}.sam #convert to BAM samtools view -h -S -b -o ${R1}.unsorted.bam ${R1}.sam #sort BAM by coordinates using sambamba sambamba sort -t 2 -o ${R1}.sorted.bam ${R1}.unsorted.bam #filter uniquely mapping reads with sambamba view sambamba view -h -t 2 -f bam -F "[XS] == null and not unmapped and not duplicate" ${R1}.sorted.bam > ${R1}.filtered01.bam #filter blacklisted regions, regions with chr names modified, I removed the chr to match the BAM bedtools intersect -v -a ${R1}.filtered01.bam -b /home/data/rrebollo/drosophila/FRM/dm6-blacklist.v2.bed > ${R1}.filtered02.bam done <list1.txt #### End of loop script #deactivate conda environment conda deactivate ##start loop script## while read line; do R1=$(echo $line|awk '{print $1}'); #need to index the BAM files samtools index ${R1}.filtered02.bam done <list1.txt ##end loop script## #activate deeptools environment source activate deeptools #plotFingerprint with BAM files for each group plotFingerprint -p 4 -b DmSJ23_Cont1_R1.bam DmSJ23_Cont1-K4_R1.bam DmSJ23_Cont1-K9_R1.bam -T "Fingerprints of DmSJ23_Cont1" -plot DmSJ23_Cont1.pdf plotFingerprint -p 4 -b DmSJ7_Cont1_R1.bam DmSJ7_Cont1-K4_R1.bam DmSJ7_Cont1-K9_R1.bam -T "Fingerprints of DmSJ7_Cont1" -plot DmSJ7_Cont1.pdf #deactivate conda environment conda deactivate ``` - [x] Plot Fingerprint DMSJ7 as expected, all good ![](https://i.imgur.com/BwRAnaN.png) DmSJ23 did not improve ![](https://i.imgur.com/LkmwaMa.png) - [x] Check IGV We can see again the input + K9 have peks, even though they are smaller than the K9 Peak. ![](https://i.imgur.com/aDwohqs.png) In conclusion, I do not see how to improve the input analysis and will therefore make a input-normalized analysis only with DmSJ7 and DMGoth10.1 and an read cover normalized analysis with all samples. ## ChIP-seq analysis with inputs With samples Dm SJ7 and Dm Goth 10.1 only since the other populations show bad inputs. ### Normalization Use [bamCompare](https://deeptools.readthedocs.io/en/develop/content/tools/bamCompare.html) from deeptools. b1 = treatment b2 = control - [x] need to do with --extendReads - [x] use normalization with RPKM, scaling none ```ssh conda activate deeptools bamCompare -b1 CNVR15.R1.filtered02.bam -b2 CNVR11.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmGoth10.1_Con1-K4_log2.bw bamCompare -b1 CNVR17.R1.filtered02.bam -b2 CNVR12.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmGoth10.1_Con2-K4_log2.bw bamCompare -b1 CNVR19.R1.filtered02.bam -b2 CNVR13.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmGoth10.1_CS1-K4_log2.bw bamCompare -b1 CNVR21.R1.filtered02.bam -b2 CNVR14.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmGoth10.1_CS2-K4_log2.bw bamCompare -b1 CNVR23.R1.filtered02.bam -b2 CNVR7.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmSJ7_Cont1-K4_log2.bw bamCompare -b1 CNVR25.R1.filtered02.bam -b2 CNVR8.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmSJ7_Cont2-K4_log2.bw bamCompare -b1 CNVR27.R1.filtered02.bam -b2 CNVR9.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmSJ7_CS1-K4_log2.bw bamCompare -b1 CNVR29.R1.filtered02.bam -b2 CNVR10.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmSJ7_CS2-K4_log2.bw bamCompare -b1 CNVR16.R1.filtered02.bam -b2 CNVR11.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmGoth10.1_Con1-K9_log2.bw bamCompare -b1 CNVR18.R1.filtered02.bam -b2 CNVR12.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmGoth10.1_Con2-K9_log2.bw bamCompare -b1 CNVR20.R1.filtered02.bam -b2 CNVR13.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmGoth10.1_CS1-K9_log2.bw bamCompare -b1 CNVR22.R1.filtered02.bam -b2 CNVR14.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmGoth10.1_CS2-K9_log2.bw bamCompare -b1 CNVR24.R1.filtered02.bam -b2 CNVR7.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmSJ7_Cont1-K9_log2.bw bamCompare -b1 CNVR26.R1.filtered02.bam -b2 CNVR8.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmSJ7_Cont2-K9_log2.bw bamCompare -b1 CNVR28.R1.filtered02.bam -b2 CNVR9.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmSJ7_CS1-K9_log2.bw bamCompare -b1 CNVR30.R1.filtered02.bam -b2 CNVR10.R1.filtered02.bam --extendReads --scaleFactorsMethod None --normalizeUsing RPKM -p 12 --verbose -o DmSJ7_CS2-K9_log2.bw #multiBigwigSummary multiBigwigSummary bins -b DmGoth10.1_CS1-K4_log2.bw DmGoth10.1_CS1-K9_log2.bw DmGoth10.1_CS2-K4_log2.bw DmGoth10.1_CS2-K9_log2.bw DmGoth10.1_Con1-K4_log2.bw DmGoth10.1_Con1-K9_log2.bw DmGoth10.1_Con2-K4_log2.bw DmGoth10.1_Con2-K9_log2.bw DmSJ7_CS1-K4_log2.bw DmSJ7_CS1-K9_log2.bw DmSJ7_CS2-K4_log2.bw DmSJ7_CS2-K9_log2.bw DmSJ7_Cont1-K4_log2.bw DmSJ7_Cont1-K9_log2.bw DmSJ7_Cont2-K4_log2.bw DmSJ7_Cont2-K9_log2.bw -p 12 --smartLabels -o normalized-bw.npz #plotCorrelation of readCount plotCorrelation -in normalized-bw.npz -c spearman -p heatmap --skipZeros -T "ChIP-seq library correlation with input" -o bw_correlation.pdf #Plot PCA plotPCA -in normalized-bw.npz -T "Dmel control and cold ChIP-seq libraries without input" -o bw_PCA.pdf #deactivate conda environment conda deactivate ``` #### IGV K9me3 enriched near telomeres ![](https://i.imgur.com/aG0n8zF.png) K4me3 enriched at promoters, even though we can spot some K9 also enriched. The data has negative values because I calculated the log2 ration. For proper vizualizatio in IGV for future figures, it is possible to use 1X normalized values, instead of input-normalized RPKM. ![](https://i.imgur.com/5Iy47zH.png) #### Correlations Correlations are as expected. :+1: ![](https://i.imgur.com/ApQepgv.png) ![](https://i.imgur.com/DWNgeFo.png) ### Enrichment at gene promoters and TEs #### Genes Because K4me3 is mainly present around TSSs, I wanted to take a look at the enrichment at the gene promoters, the correlation between samples and the overall K4me3/K9me3 density at TSSs. So first, I made a bed file with all gene promoters using [gencode_regions](https://github.com/saketkc/gencode_regions)). It creates many different files, including a 2Kb promoter bed region which I will use. ```ssh Rscript create_regions_from_gencode.R ./dmel-all-r6.16.gtf . ``` :warning: there is a bug in the R code of *gencode_regions* so one needs to tweak a bit as described [here](https://github.com/saketkc/gencode_regions/issues/3): > By simply removing the lines as mentioned above > `transcripts.data <- unlist(transcripts.data)` > and > `genes.data <- unlist(genes.data)` > > as well as replacing > `promoters.data <- unlist(promoters(TxDb, upstream=len, downstream=len))` > with > `promoters.data <- promoters(TxDb, upstream=len, downstream=len)` I checked the promoter file in the IGV :heavy_check_mark: To see if the data is good, let's look at H3K4me3 and H3K9me3 enrichment at promoter regions using plotprofile (average enrichment at the TSS +/- 2Kb). I tried looking at all the libraries at the same time, but the graph is too crowded, so I made plots per population, then per K4 and per K9. ```ssh #activate deeptools environment export PATH="/home/data/rrebollo/miniconda2/bin:$PATH" source activate deeptools #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth10.1 computeMatrix reference-point --referencePoint center -S DmGoth10.1_Con1-K4_log2.bw DmGoth10.1_Con2-K4_log2.bw DmGoth10.1_CS1-K4_log2.bw DmGoth10.1_CS2-K4_log2.bw DmGoth10.1_Con1-K9_log2.bw DmGoth10.1_Con2-K9_log2.bw DmGoth10.1_CS1-K9_log2.bw DmGoth10.1_CS2-K9_log2.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R /home/data/rrebollo/drosophila/FRM/promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_DmGoth10.1-ComputeMatrix.gz #plotProfile DmGoth10.1 plotProfile --matrixFile Promoters_DmGoth10.1-ComputeMatrix.gz --perGroup --outFileName Promoters_DmGoth10.1_PlotProfile.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ7 computeMatrix reference-point --referencePoint center -S DmSJ7_Cont1-K4_log2.bw DmSJ7_Cont2-K4_log2.bw DmSJ7_CS1-K4_log2.bw DmSJ7_CS2-K4_log2.bw DmSJ7_Cont1-K9_log2.bw DmSJ7_Cont2-K9_log2.bw DmSJ7_CS1-K9_log2.bw DmSJ7_CS2-K9_log2.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R /home/data/rrebollo/drosophila/FRM/promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_DmSJ7-ComputeMatrix.gz #plotProfile DmSJ7 plotProfile --matrixFile Promoters_DmSJ7-ComputeMatrix.gz --perGroup --outFileName Promoters_DmSJ7_PlotProfile.pdf #Can also plot all K4me3 together to show they all have the same profile #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb All K4 computeMatrix reference-point --referencePoint center -S DmGoth10.1_CS1-K4_log2.bw DmGoth10.1_CS2-K4_log2.bw DmGoth10.1_Con1-K4_log2.bw DmGoth10.1_Con2-K4_log2.bw DmSJ7_CS1-K4_log2.bw DmSJ7_CS2-K4_log2.bw DmSJ7_Cont1-K4_log2.bw DmSJ7_Cont2-K4_log2.bw --smartLabels -R /home/data/rrebollo/drosophila/FRM/promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_K4-ComputeMatrix.gz #plotProfile All K4 plotProfile --matrixFile Promoters_K4-ComputeMatrix.gz --perGroup --legendLocation upper-right --outFileName Promoters_K4_PlotProfile.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb All K9 computeMatrix reference-point --referencePoint center -S DmGoth10.1_CS1-K9_log2.bw DmGoth10.1_CS2-K9_log2.bw DmGoth10.1_Con1-K9_log2.bw DmGoth10.1_Con2-K9_log2.bw DmSJ7_CS1-K9_log2.bw DmSJ7_CS2-K9_log2.bw DmSJ7_Cont1-K9_log2.bw DmSJ7_Cont2-K9_log2.bw --smartLabels -R /home/data/rrebollo/drosophila/FRM/promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_K9-ComputeMatrix.gz #plotProfile All K9 plotProfile --matrixFile Promoters_K9-ComputeMatrix.gz --perGroup --legendLocation upper-right --outFileName Promoters_K9_PlotProfile.pdf #deactivate conda environment source ~/miniconda2/bin/deactivate deeptools ``` It is strange because the K4 is expected but there is an enrichment in K9 seen, specially for one K9 library CS, upstream of TSSs. This is weird. (need to create the panel with the figures, but here are the unedited figures) ![](https://i.imgur.com/zC1npgf.png) I have also ploted K4me3 and K9me3 separately, to compare the different samples. ![](https://i.imgur.com/YdvkEt3.png) H3K9me3 as expected : depleted on TSSs ![](https://i.imgur.com/99heOco.png) #### TEs I ran repeat masker with Clément full consensus list of drosophila TEs : https://gitlab.in2p3.fr/stephane.delmotte/dnapipete/-/raw/master/Drosophila_Transposable_Element_all.fasta ```ssh nohup /home/data/rrebollo/RepeatMasker/RepeatMasker -pa 15 -a -s -gccalc -cutoff 200 -no_is -lib Drosophila_Transposable_Element_all.fasta dm6_modifiedforrepeatmasker.fasta nohup /home/data/rrebollo/RepeatMasker/RepeatMasker -pa 15 -a -s -gccalc -cutoff 200 -no_is -lib Drosophila_Transposable_Element_all.fasta dsim-all-chromosome-r2.02.fasta.gz ``` Then used [Clément's script](https://raw.githubusercontent.com/clemgoub/So2/master/TE/TE_colored_gff/rm2gff3.sh) to make a GTF file: `bash rm2gff3.sh dm6_modifiedforrepeatmasker.fasta.out > dmel.TE.gtf` ```ssh #activate deeptools environment export PATH="/home/data/rrebollo/miniconda2/bin:$PATH" source activate deeptools #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth10.1 computeMatrix scale-regions -S DmGoth10.1_Con1-K4_log2.bw DmGoth10.1_Con2-K4_log2.bw DmGoth10.1_CS1-K4_log2.bw DmGoth10.1_CS2-K4_log2.bw DmGoth10.1_Con1-K9_log2.bw DmGoth10.1_Con2-K9_log2.bw DmGoth10.1_CS1-K9_log2.bw DmGoth10.1_CS2-K9_log2.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R /home/data/rrebollo/drosophila/FRM/dmel.TE.bed -a 2000 -b 2000 --skipZeros -p 4 -o TEs_DmGoth10.1-ComputeMatrix.gz #plotProfile DmGoth10.1 plotProfile --matrixFile TEs_DmGoth10.1-ComputeMatrix.gz --perGroup --outFileName TEs_DmGoth10.1_PlotProfile.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ7 computeMatrix scale-regions -S DmSJ7_Cont1-K4_log2.bw DmSJ7_Cont2-K4_log2.bw DmSJ7_CS1-K4_log2.bw DmSJ7_CS2-K4_log2.bw DmSJ7_Cont1-K9_log2.bw DmSJ7_Cont2-K9_log2.bw DmSJ7_CS1-K9_log2.bw DmSJ7_CS2-K9_log2.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R /home/data/rrebollo/drosophila/FRM/dmel.TE.bed -a 2000 -b 2000 --skipZeros -p 4 -o TEs_DmSJ7-ComputeMatrix.gz #plotProfile DmSJ7 plotProfile --matrixFile TEs_DmSJ7-ComputeMatrix.gz --perGroup --outFileName TEs_DmSJ7_PlotProfile.pdf #Can also plot all K4me3 together to show they all have the same profile #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb All K4 computeMatrix scale-regions -S DmGoth10.1_CS1-K4_log2.bw DmGoth10.1_CS2-K4_log2.bw DmGoth10.1_Con1-K4_log2.bw DmGoth10.1_Con2-K4_log2.bw DmSJ7_CS1-K4_log2.bw DmSJ7_CS2-K4_log2.bw DmSJ7_Cont1-K4_log2.bw DmSJ7_Cont2-K4_log2.bw --smartLabels -R /home/data/rrebollo/drosophila/FRM/dmel.TE.bed -a 2000 -b 2000 --skipZeros -p 4 -o TEs_K4-ComputeMatrix.gz #plotProfile All K4 plotProfile --matrixFile TEs_K4-ComputeMatrix.gz --perGroup --legendLocation upper-right --outFileName TEs_K4_PlotProfile.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb All K9 computeMatrix scale-regions -S DmGoth10.1_CS1-K9_log2.bw DmGoth10.1_CS2-K9_log2.bw DmGoth10.1_Con1-K9_log2.bw DmGoth10.1_Con2-K9_log2.bw DmSJ7_CS1-K9_log2.bw DmSJ7_CS2-K9_log2.bw DmSJ7_Cont1-K9_log2.bw DmSJ7_Cont2-K9_log2.bw --smartLabels -R /home/data/rrebollo/drosophila/FRM/dmel.TE.bed -a 2000 -b 2000 --skipZeros -p 4 -o TEs_K9-ComputeMatrix.gz #plotProfile All K9 plotProfile --matrixFile TEs_K9-ComputeMatrix.gz --perGroup --legendLocation upper-right --outFileName TEs_K9_PlotProfile.pdf #deactivate conda environment conda deactivate ``` ![](https://i.imgur.com/9YytDWN.png) ### Peaks Using MACS3 ``` screen -S macs3 python3 -m venv macs3pythonenv/ source macs3pythonenv/bin/activate pip install macs3 ``` ##### [Call peaks](https://github.com/macs3-project/MACS/blob/master/docs/callpeak.md), broad for K9, default for K4 ```bash #activate environement source macs3pythonenv/bin/activate #callpeaks macs3 callpeak -t CNVR15.R1.filtered02.bam -c CNVR11.R1.filtered02.bam -f BAMPE -g dm -B -n DmGoth10.1_Cont1-K4 macs3 callpeak -t CNVR16.R1.filtered02.bam -c CNVR11.R1.filtered02.bam --broad -f BAMPE -B -g dm -n DmGoth10.1_Cont1-K9 macs3 callpeak -t CNVR17.R1.filtered02.bam -c CNVR12.R1.filtered02.bam -f BAMPE -g dm -B -n DmGoth10.1_Cont2-K4 macs3 callpeak -t CNVR18.R1.filtered02.bam -c CNVR12.R1.filtered02.bam --broad -f BAMPE -B -g dm -n DmGoth10.1_Cont2-K9 macs3 callpeak -t CNVR19.R1.filtered02.bam -c CNVR13.R1.filtered02.bam -f BAMPE -g dm -B -n DmGoth10.1_CS1-K4 macs3 callpeak -t CNVR20.R1.filtered02.bam -c CNVR13.R1.filtered02.bam --broad -f BAMPE -B -g dm -n DmGoth10.1_CS1-K9 macs3 callpeak -t CNVR21.R1.filtered02.bam -c CNVR14.R1.filtered02.bam -f BAMPE -g dm -B -n DmGoth10.1_CS2-K4 macs3 callpeak -t CNVR22.R1.filtered02.bam -c CNVR14.R1.filtered02.bam --broad -f BAMPE -B -g dm -n DmGoth10.1_CS2-K9 macs3 callpeak -t CNVR23.R1.filtered02.bam -c CNVR7.R1.filtered02.bam -f BAMPE -g dm -B -n DmSJ7_Cont1-K4 macs3 callpeak -t CNVR24.R1.filtered02.bam -c CNVR7.R1.filtered02.bam --broad -f BAMPE -B -g dm -n DmSJ7_Cont1-K9 macs3 callpeak -t CNVR25.R1.filtered02.bam -c CNVR8.R1.filtered02.bam -f BAMPE -g dm -B -n DmSJ7_Cont2-K4 macs3 callpeak -t CNVR26.R1.filtered02.bam -c CNVR8.R1.filtered02.bam --broad -f BAMPE -B -g dm -n DmSJ7_Cont2-K9 macs3 callpeak -t CNVR27.R1.filtered02.bam -c CNVR9.R1.filtered02.bam -f BAMPE -g dm -B -n DmSJ7_CS1-K4 macs3 callpeak -t CNVR28.R1.filtered02.bam -c CNVR9.R1.filtered02.bam --broad -f BAMPE -B -g dm -n DmSJ7_CS1-K9 macs3 callpeak -t CNVR29.R1.filtered02.bam -c CNVR10.R1.filtered02.bam -f BAMPE -g dm -B -n DmSJ7_CS2-K4 macs3 callpeak -t CNVR30.R1.filtered02.bam -c CNVR10.R1.filtered02.bam --broad -f BAMPE -B -g dm -n DmSJ7_CS2-K9 ``` Example of output ``` # ARGUMENTS LIST: # name = DmGoth10.1_Cont1-K4 # format = BAMPE # ChIP-seq file = ['CNVR15.R1.filtered02.bam'] # control file = ['CNVR11.R1.filtered02.bam'] # effective genome size = 1.20e+08 # band width = 300 # model fold = [5, 50] # qvalue cutoff = 5.00e-02 # The maximum gap between significant sites is assigned as the read length/tag size. # The minimum length of peaks is assigned as the predicted fragment length "d". # Larger dataset will be scaled towards smaller dataset. # Range for calculating regional lambda is: 1000 bps and 10000 bps # Broad region calling is off # Paired-End mode is on ``` - [x] look at IGV ![](https://i.imgur.com/ja3Ig6z.png) ![](https://i.imgur.com/HJihVor.png) ![](https://i.imgur.com/CEQTgpf.png) - [x] calculate number of peaks see section below ##### Differential peaks between control and cold. Trying [diffbind](https://bioconductor.org/packages/release/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf) following this [tutorial](https://github.com/hbc/NGS_Data_Analysis_Summer2016/blob/master/sessionV/lessons/08_differential_peaks.md). We need a sample sheet, here are all samples, but I did a separate analysis for K4 and K9. ![](https://i.imgur.com/DJWc4Oh.png) ```r export PATH="/home/data/rrebollo/miniconda2/bin:$PATH" source activate r4-base #install the required tools if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DiffBind") library(DiffBind) library(parallel) install.packages("reshape2") library(reshape2) install.packages("tidyverse") library(tidyverse) install.packages("UpSetR") library(UpSetR) #load the samples. here the program computes "consensus peaks" by returning only peaks that are found in at least two samples. K4samples <- read.csv('K4samplesheet.csv') K4 <- dba(sampleSheet=K4samples) K4 ``` Below we can see the number of peaks for every library and the number of consensus peaks (6045). ![](https://i.imgur.com/gYdKHHE.png) ```r #count coverage for the consensus peaks K4 <- dba.count(K4, bUseSummarizeOverlaps=TRUE, bRemoveDuplicates = TRUE, bParallel = TRUE) K4 ``` FRiP stands for Fraction of Reads in Peaks ![](https://i.imgur.com/0JKHRKM.png) ```r png('K4pcaplot2.png') dba.plotPCA(K4, attributes=DBA_TISSUE, label=DBA_ID) dev.off() png('K4pcaplot3.png') dba.plotPCA(K4, attributes=DBA_TREATMENT, label=DBA_ID) dev.off() ``` ![](https://i.imgur.com/LDNSDib.png) In order to find peaks that are differentially observed between CS and control, I then filtered the data to make population specific analysis. ```r #DmSJ7 #load the samples. here the program computes "consensus peaks" by returning only peaks that are found in at least two samples. K4DmSJ7samples <- read.csv('K4DmSJ7samplesheet.csv') K4DmSJ7 <- dba(sampleSheet=K4DmSJ7samples) #count coverage for the consensus peaks K4DmSJ7 <- dba.count(K4DmSJ7, bUseSummarizeOverlaps=TRUE, bRemoveDuplicates = TRUE, bParallel = TRUE) png('K4DmSJ7pcaplot.png') dba.plotPCA(K4DmSJ7, attributes=DBA_TREATMENT, label=DBA_ID) dev.off() #setting the comparisons K4DmSJ7 <- dba.contrast(K4DmSJ7, categories=DBA_TREATMENT, minMembers = 2) #running DE analysis K4DmSJ7 <- dba.analyze(K4DmSJ7, method=DBA_ALL_METHODS) png('K4DmSJ7pcaplotDeseq2.png') dba.plotPCA(K4DmSJ7, contrast=1, method=DBA_EDGER, attributes=DBA_TREATMENT, label=DBA_ID) dev.off() #couldn't plot it, no enough data dba.plotVenn(K4DmSJ7,contrast=1,method=DBA_ALL_METHODS) #only one peak set meets specified criteria png('K4DmSJ7maplotDeseq.png') dba.plotMA(K4DmSJ7, method=DBA_DESEQ2) dev.off() #extract results for each method K4DmSJ7.edgeR <- dba.report(K4DmSJ7, method=DBA_EDGER, contrast = 1, th=1) K4DmSJ7.deseq <- dba.report(K4DmSJ7, method=DBA_DESEQ2, contrast = 1, th=1) # Write to file out.edgeR <- as.data.frame(K4DmSJ7.edgeR) write.table(out.edgeR, file="K4DmSJ7_edgeR.txt", sep="\t", quote=F, row.names=F) out.deseq <- as.data.frame(K4DmSJ7.deseq) write.table(out.deseq, file="K4DmSJ7_deseq.txt", sep="\t", quote=F, row.names=F) ``` K4 DmSJ7 dataset: 5460 consensus peaks ![](https://i.imgur.com/iYNIXlZ.png) PCA of consensus peaks ![](https://i.imgur.com/UYOWqD4.png) There is only **3 significantly different peaks** and only through Deseq2 ![](https://i.imgur.com/S5iplMg.png) Example of the edgeR or Deseq2 tables ![](https://i.imgur.com/2Jsj3fn.png) MA plot deseq ![](https://i.imgur.com/lWVaIHG.png) ```r #DmGoth101 #load the samples. here the program computes "consensus peaks" by returning only peaks that are found in at least two samples. K4DmGoth101samples <- read.csv('K4DmGoth101samplesheet.csv') K4DmGoth101 <- dba(sampleSheet=K4DmGoth101samples) #count coverage for the consensus peaks K4DmGoth101 <- dba.count(K4DmGoth101, bUseSummarizeOverlaps=TRUE, bRemoveDuplicates = TRUE, bParallel = TRUE) png('K4DmGoth101pcaplot.png') dba.plotPCA(K4DmGoth101, attributes=DBA_TREATMENT, label=DBA_ID) dev.off() #setting the comparisons K4DmGoth101 <- dba.contrast(K4DmGoth101, categories=DBA_TREATMENT, minMembers = 2) #running DE analysis K4DmGoth101 <- dba.analyze(K4DmGoth101, method=DBA_ALL_METHODS) dba.plotVenn(K4DmGoth101,contrast=1,method=DBA_ALL_METHODS) png('K4DmGoth101maplotDeseq.png') dba.plotMA(K4DmGoth101, method=DBA_DESEQ2) dev.off() png('K4DmGoth101pvals.png') pvals <- dba.plotBox(K4DmGoth101) dev.off() #extract results for each method K4DmGoth101.edgeR <- dba.report(K4DmGoth101, method=DBA_EDGER, contrast = 1, th=1) K4DmGoth101.deseq <- dba.report(K4DmGoth101, method=DBA_DESEQ2, contrast = 1, th=1) # Write to file out.edgeR <- as.data.frame(K4DmGoth101.edgeR) write.table(out.edgeR, file="K4DmGoth101_edgeR.txt", sep="\t", quote=F, row.names=F) out.deseq <- as.data.frame(K4DmGoth101.deseq) write.table(out.deseq, file="K4DmGoth101_deseq.txt", sep="\t", quote=F, row.names=F) save.image('diffPeaksDm.RData') ``` 5485 consensus sites, pretty much the same as DmSJ7. However, here we see 493 DE EdgeR and 1073 DE DESEQ2. ![](https://i.imgur.com/LG14IvJ.png) ![](https://i.imgur.com/H8HnZhz.png) There are 481 peaks that overlap between both methods and hence should be used for further analysis ![](https://i.imgur.com/Vz8mMYq.png) DESEQ2 ![](https://i.imgur.com/KsYzIwB.png) EdgeR ![](https://i.imgur.com/UhLqLaM.png) ![](https://i.imgur.com/dXBgyUN.png) - [ ] For DmGoth101, plot the significant different peaks with K4 bw in control & CS. - [ ] I won't do this now, but I think we could go into the function and location of the genes. Same analysis but with K9. ``` export PATH="/home/data/rrebollo/miniconda2/bin:$PATH" source activate r4-base library(DiffBind) library(parallel) library(reshape2) library(tidyverse) library(UpSetR) #load the samples. here the program computes "consensus peaks" by returning only peaks that are found in at least two samples. K9samples <- read.csv('K9samplesheet.csv') K9 <- dba(sampleSheet=K9samples) #count coverage for the consensus peaks K9 <- dba.count(K9, bUseSummarizeOverlaps=TRUE, bRemoveDuplicates = TRUE, bParallel = TRUE) png('K9pcaplot2.png') dba.plotPCA(K9, attributes=DBA_TISSUE, label=DBA_ID) dev.off() png('K9pcaplot3.png') dba.plotPCA(K9, attributes=DBA_TREATMENT, label=DBA_ID) dev.off() #DmSJ7 #load the samples. here the program computes "consensus peaks" by returning only peaks that are found in at least two samples. K9DmSJ7samples <- read.csv('K9DmSJ7samplesheet.csv') K9DmSJ7 <- dba(sampleSheet=K9DmSJ7samples) #count coverage for the consensus peaks K9DmSJ7 <- dba.count(K9DmSJ7, bUseSummarizeOverlaps=TRUE, bRemoveDuplicates = TRUE, bParallel = TRUE) png('K9DmSJ7pcaplot.png') dba.plotPCA(K9DmSJ7, attributes=DBA_TREATMENT, label=DBA_ID) dev.off() #setting the comparisons K9DmSJ7 <- dba.contrast(K9DmSJ7, categories=DBA_TREATMENT, minMembers = 2) #running DE analysis K9DmSJ7 <- dba.analyze(K9DmSJ7, method=DBA_ALL_METHODS) png('K9DmSJ7pcaplotDeseq2.png') dba.plotPCA(K9DmSJ7, contrast=1, method=DBA_EDGER, attributes=DBA_TREATMENT, label=DBA_ID) dev.off() dba.plotVenn(K9DmSJ7,contrast=1,method=DBA_ALL_METHODS) #only one peak set meets specified criteria png('K9DmSJ7maplotDeseq.png') dba.plotMA(K9DmSJ7, method=DBA_DESEQ2) dev.off() #extract results for each method K9DmSJ7.edgeR <- dba.report(K9DmSJ7, method=DBA_EDGER, contrast = 1, th=1) K9DmSJ7.deseq <- dba.report(K9DmSJ7, method=DBA_DESEQ2, contrast = 1, th=1) # Write to file out.edgeR <- as.data.frame(K9DmSJ7.edgeR) write.table(out.edgeR, file="K9DmSJ7_edgeR.txt", sep="\t", quote=F, row.names=F) out.deseq <- as.data.frame(K9DmSJ7.deseq) write.table(out.deseq, file="K9DmSJ7_deseq.txt", sep="\t", quote=F, row.names=F) #DmGoth101 #load the samples. here the program computes "consensus peaks" by returning only peaks that are found in at least two samples. K9DmGoth101samples <- read.csv('K9DmGoth101samplesheet.csv') K9DmGoth101 <- dba(sampleSheet=K9DmGoth101samples) #count coverage for the consensus peaks K9DmGoth101 <- dba.count(K9DmGoth101, bUseSummarizeOverlaps=TRUE, bRemoveDuplicates = TRUE, bParallel = TRUE) png('K9DmGoth101pcaplot.png') dba.plotPCA(K9DmGoth101, attributes=DBA_TREATMENT, label=DBA_ID) dev.off() #setting the comparisons K9DmGoth101 <- dba.contrast(K9DmGoth101, categories=DBA_TREATMENT, minMembers = 2) #running DE analysis K9DmGoth101 <- dba.analyze(K9DmGoth101, method=DBA_ALL_METHODS) png('K9DmGoth101venn.png') dba.plotVenn(K9DmGoth101,contrast=1,method=DBA_ALL_METHODS) dev.off() png('K9DmGoth101maplotDeseq.png') dba.plotMA(K9DmGoth101, method=DBA_DESEQ2) dev.off() png('K9DmGoth101pvals.png') pvals <- dba.plotBox(K9DmGoth101) dev.off() #extract results for each method K9DmGoth101.edgeR <- dba.report(K9DmGoth101, method=DBA_EDGER, contrast = 1, th=1) K9DmGoth101.deseq <- dba.report(K9DmGoth101, method=DBA_DESEQ2, contrast = 1, th=1) # Write to file out.edgeR <- as.data.frame(K9DmGoth101.edgeR) write.table(out.edgeR, file="K9DmGoth101_edgeR.txt", sep="\t", quote=F, row.names=F) out.deseq <- as.data.frame(K9DmGoth101.deseq) write.table(out.deseq, file="K9DmGoth101_deseq.txt", sep="\t", quote=F, row.names=F) save.image('diffPeaksDm.RData') ``` ![](https://i.imgur.com/Ipm2kUh.png) ![](https://i.imgur.com/nsS9UcV.png) ![](https://i.imgur.com/3ZGrlXo.png) ![](https://i.imgur.com/bvqSnvz.png) ![](https://i.imgur.com/nSi8viD.png) ![](https://i.imgur.com/CmeZqab.png) ![](https://i.imgur.com/nhjZP3D.png) Because only DmGoth 10.1 has enough peaks to be plotted, this analysis is on stand by. ## ChIP-seq analysis without inputs Input libraries to ignore > CNVR7 > CNVR8 > CNVR9 > CNVR10 > CNVR11 > CNVR12 > CNVR13 > CNVR14 > CNVR31 > CNVR34 > CNVR37 > CNVR40 > CNVR43 > CNVR46 > CNVR49 > CNVR52 ### Correlation First, let's look at the correlation of genome-wide read coverage between all samples. ```ssh #activate deeptools environment source ~/miniconda2/bin/activate deeptools #multiBamSummary multiBamSummary bins --bamfiles CNVR15.R1.filtered02.bam CNVR16.R1.filtered02.bam CNVR17.R1.filtered02.bam CNVR18.R1.filtered02.bam CNVR19.R1.filtered02.bam CNVR20.R1.filtered02.bam CNVR21.R1.filtered02.bam CNVR22.R1.filtered02.bam CNVR23.R1.filtered02.bam CNVR24.R1.filtered02.bam CNVR25.R1.filtered02.bam CNVR26.R1.filtered02.bam CNVR27.R1.filtered02.bam CNVR28.R1.filtered02.bam CNVR29.R1.filtered02.bam CNVR30.R1.filtered02.bam CNVR32.R1.filtered02.bam CNVR33.R1.filtered02.bam CNVR35.R1.filtered02.bam CNVR36.R1.filtered02.bam CNVR38.R1.filtered02.bam CNVR39.R1.filtered02.bam CNVR41.R1.filtered02.bam CNVR42.R1.filtered02.bam CNVR44.R1.filtered02.bam CNVR45.R1.filtered02.bam CNVR47.R1.filtered02.bam CNVR48.R1.filtered02.bam CNVR50.R1.filtered02.bam CNVR51.R1.filtered02.bam CNVR53.R1.filtered02.bam CNVR54.R1.filtered02.bam --smartLabels -p 4 -out readCount.npz #plotCorrelation of readCount plotCorrelation -in readCount.npz -c spearman -p heatmap --skipZeros -T "ChIP-seq library correlation without input" -o readCount_correlation.pdf #Plot PCA plotPCA -in readCount.npz -T "DMel control and cold ChIP-seq libraries without input" -o readCount_PCA.pdf #deactivate conda environment source ~/miniconda2/bin/deactivate deeptools ``` The correlation shows three groups. The K9 of DmSJ7 and DMGoth10.1 well separated from all other datasets. Then there are two groups : the K4s all together, which is good, and the K9 of DmGoth6.3 and DmSJ23. These K9 are similar to the K4 data too, which is strange but predictable given that some of the K9 fingerprints do not look good... This is not "signal" and is not normalized. The most clear aspect is the lack of global difference between controls and their cold shocks. ![](https://i.imgur.com/M1dlefB.png) The PCA shows three distinct groups, as the correlation. Firstly the K4me3 and the K9me3 data are clearly separated. All K9me3 data cluster together, CS and controls. For the K4me3 data, CS and controls are also clustered together but we see two groups : DmGoth10.1 + DmSJ7; and DmGoth6.3 + DmSJ23. ![](https://i.imgur.com/pm1PZbB.png) In conclusion: the correlation and PCA agree with the sample's biology. Let's compare these results, with normalized data, using bamCoverage. The binSize is set at 50 by default. Effective genome size is taken from the deepTools suite, for uniquely mapping reads to dm6. The normalization used is RPGC reads per genomic content (1x normalization, instead of RPKM thanks to [this](https://groups.google.com/g/deeptools/c/th96gaftAXQ)), and all reads are extended. ```ssh #activate deeptools environment source ~/miniconda2/bin/activate deeptools ##start loop script## while read line; do R1=$(echo $line|awk '{print $1}'); #bamCoverage bamCoverage --bam ${R1}.filtered02.bam -p 4 --effectiveGenomeSize 129789873 --normalizeUsing RPGC --extendReads -o ${R1}.bw done <bam_list.txt ##end loop script## #multiBigwigSummary multiBigwigSummary bins -b CNVR15.R1.bw CNVR16.R1.bw CNVR17.R1.bw CNVR18.R1.bw CNVR19.R1.bw CNVR20.R1.bw CNVR21.R1.bw CNVR22.R1.bw CNVR23.R1.bw CNVR24.R1.bw CNVR25.R1.bw CNVR26.R1.bw CNVR27.R1.bw CNVR28.R1.bw CNVR29.R1.bw CNVR30.R1.bw CNVR32.R1.bw CNVR33.R1.bw CNVR35.R1.bw CNVR36.R1.bw CNVR38.R1.bw CNVR39.R1.bw CNVR41.R1.bw CNVR42.R1.bw CNVR44.R1.bw CNVR45.R1.bw CNVR47.R1.bw CNVR48.R1.bw CNVR50.R1.bw CNVR51.R1.bw CNVR53.R1.bw CNVR54.R1.bw -p 4 --smartLabels -o normalized-bw.npz #plotCorrelation of readCount plotCorrelation -in normalized-bw.npz -c spearman -p heatmap --skipZeros -T "ChIP-seq library correlation without input" -o bw_correlation.pdf #Plot PCA plotPCA -in normalized-bw.npz -T "DMel control and cold ChIP-seq libraries without input" -o bw_PCA.pdf #deactivate conda environment source ~/miniconda2/bin/deactivate deeptools ``` **IGV of normalized data** Clearly the enrichements are prettier and clearer for DmGoth10.1 and DmSJ7. The K9 is weak and broad for these samples while for the others there are some peaks at the same place as K4me3. I am finding really hard to get a grasp on these dataset, so let's look at the global enrichments and then at specific cases. DmGoth10.1 and DmSJ7 ![](https://i.imgur.com/Et84EXk.png) DmGoth3.6 & DmSJ23 ![](https://i.imgur.com/4lkxj3H.png) **Correlation & PCA** The correlation plot looks exactly like the raw data. The PCA further divides the DmGoth6.3 and DmSJ27 into two groups, but they are all mixed up (there are CS and controls, and strains in both groups).This is reassuring because the K9s cluster together on the PCA. ![](https://i.imgur.com/9BO3vlg.png) ![](https://i.imgur.com/BKDPQ4u.png) ### Enrichment at gene promoters Now we can look at the correlation between samples for K4me3 and K9me3 at gene promoters (TSS +/- 2Kb). At first I got an error with multiBigwigSummary "invalid interval bounds". So I looked at my bed file and found some mitochondrial, rRNA and other weird entries. I kept only the known chr entries. ```ssh #activate deeptools environment source ~/miniconda2/bin/activate deeptools #multiBigwigSummary at promoters multiBigwigSummary BED-file -b CNVR15.R1.bw CNVR16.R1.bw CNVR17.R1.bw CNVR18.R1.bw CNVR19.R1.bw CNVR20.R1.bw CNVR21.R1.bw CNVR22.R1.bw CNVR23.R1.bw CNVR24.R1.bw CNVR25.R1.bw CNVR26.R1.bw CNVR27.R1.bw CNVR28.R1.bw CNVR29.R1.bw CNVR30.R1.bw CNVR32.R1.bw CNVR33.R1.bw CNVR35.R1.bw CNVR36.R1.bw CNVR38.R1.bw CNVR39.R1.bw CNVR41.R1.bw CNVR42.R1.bw CNVR44.R1.bw CNVR45.R1.bw CNVR47.R1.bw CNVR48.R1.bw CNVR50.R1.bw CNVR51.R1.bw CNVR53.R1.bw CNVR54.R1.bw -p 4 --smartLabels -o normalized_promoters_bw.npz --BED ../promoters.2000.bed #plotCorrelation of promoter enrichment plotCorrelation -in normalized_promoters_bw.npz -c spearman -p heatmap --skipZeros -T "ChIP-seq library correlation without input" -o correlation_Promoters.pdf #Plot PCA plotPCA -in normalized_promoters_bw.npz -T "DMel control and cold ChIP-seq libraries without input" -o PCA_Promoters.pdf #deactivate conda environment source ~/miniconda2/bin/deactivate deeptools ``` Again looks similar to the genome-wide non-normalized, and normalized data. ![](https://i.imgur.com/lJx605V.png) ![](https://i.imgur.com/NZ3n4tY.png) And finally, to see if the data is good, let's look at H3K4me3 and H3K9me3 enrichment at promoter regions using plotprofile (average enrichment at the TSS +/- 2Kb). I tried looking at all the libraries at the same time, but the graph is too crowded, so I made plots per population, then per K4 and per K9. ```ssh #activate deeptools environment source ~/miniconda2/bin/activate deeptools #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth10.1 computeMatrix reference-point --referencePoint center -S CNVR15.R1.bw CNVR17.R1.bw CNVR19.R1.bw CNVR21.R1.bw CNVR16.R1.bw CNVR18.R1.bw CNVR20.R1.bw CNVR22.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_DmGoth10.1-ComputeMatrix.gz #plotProfile DmGoth10.1 plotProfile --matrixFile Promoters_DmGoth10.1-ComputeMatrix.gz --perGroup --outFileName Promoters_DmGoth10.1_PlotProfile.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth6.3 computeMatrix reference-point --referencePoint center -S CNVR32.R1.bw CNVR35.R1.bw CNVR38.R1.bw CNVR41.R1.bw CNVR33.R1.bw CNVR36.R1.bw CNVR39.R1.bw CNVR42.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_DmGoth6.3-ComputeMatrix.gz #plotProfile DmGoth6.3 plotProfile --matrixFile Promoters_DmGoth6.3-ComputeMatrix.gz --perGroup --outFileName Promoters_DmGoth6.3_PlotProfile.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ7 computeMatrix reference-point --referencePoint center -S CNVR23.R1.bw CNVR25.R1.bw CNVR27.R1.bw CNVR29.R1.bw CNVR24.R1.bw CNVR26.R1.bw CNVR28.R1.bw CNVR30.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_DmSJ7-ComputeMatrix.gz #plotProfile DmSJ7 plotProfile --matrixFile Promoters_DmSJ7-ComputeMatrix.gz --perGroup --outFileName Promoters_DMSJ7_PlotProfile.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ23 computeMatrix reference-point --referencePoint center -S CNVR44.R1.bw CNVR47.R1.bw CNVR50.R1.bw CNVR53.R1.bw CNVR45.R1.bw CNVR48.R1.bw CNVR51.R1.bw CNVR54.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_DmSJ23-ComputeMatrix.gz #plotProfile DmSJ23 plotProfile --matrixFile Promoters_DmSJ23-ComputeMatrix.gz --perGroup --outFileName Promoters_DmSJ23_PlotProfile.pdf #Can also plot all K4me3 together to show they all have the same profile #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb All K4 computeMatrix reference-point --referencePoint center -S CNVR44.R1.bw CNVR47.R1.bw CNVR50.R1.bw CNVR53.R1.bw CNVR23.R1.bw CNVR25.R1.bw CNVR27.R1.bw CNVR29.R1.bw CNVR15.R1.bw CNVR17.R1.bw CNVR19.R1.bw CNVR21.R1.bw CNVR32.R1.bw CNVR35.R1.bw CNVR38.R1.bw CNVR41.R1.bw --smartLabels -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_K4-ComputeMatrix.gz #plotProfile All K4 plotProfile --matrixFile Promoters_K4-ComputeMatrix.gz --perGroup --legendLocation upper-right --outFileName Promoters_K4_PlotProfile.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb All K9 computeMatrix reference-point --referencePoint center -S CNVR45.R1.bw CNVR48.R1.bw CNVR51.R1.bw CNVR54.R1.bw CNVR24.R1.bw CNVR26.R1.bw CNVR28.R1.bw CNVR30.R1.bw CNVR16.R1.bw CNVR18.R1.bw CNVR20.R1.bw CNVR22.R1.bw CNVR33.R1.bw CNVR36.R1.bw CNVR39.R1.bw CNVR42.R1.bw --smartLabels -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_K9-ComputeMatrix.gz #plotProfile All K9 plotProfile --matrixFile Promoters_K9-ComputeMatrix.gz --perGroup --legendLocation upper-right --outFileName Promoters_K9_PlotProfile.pdf #deactivate conda environment source ~/miniconda2/bin/deactivate deeptools ``` The K4me3 and K9me3 profiles at gene TSSs are as expected. K4me3 enrichment around TSSs and very little K9me3. Most importantly, the profiles all look similar, suggesting the genome-wide K4me3 data is good. No global differences between CS and controls. :man-with-bunny-ears-partying: ![](https://i.imgur.com/XlbJgaE.png) I have also ploted K4me3 and K9me3 separately, to compare the different samples. H3K4me3 is equivalent between samples, however we can see the H3K9me3 data differs. There are a couple of libraries that do not show any enrichment. In any case the coverage at promoters is really small. We should check for TE enrichment for the K9me3. ![](https://i.imgur.com/n1xj4bu.png) In conclusion : H3K4me3 data without inputs is suitable for analysis. H3K9me3 seems messy but it will depend on further analysis done on TEs. ### Including differentially expressed genes Daniel sent me the promoter regions of UP & down genes of controls vs CS in Dsim & Dmel. > Apliquei para os genes o threshold de padj < 0.05 e log2FC > |1| . Dentro da pasta de cada espécie você vai entrar os folders "up" e "down" e dentro destes você vai encontrar "tss_boundaries". Os GTFs são as coordenadas referentes à + e - 2kb do TSS de cada gene. I tried with GTF and it doesn't work. Will transform into bed files (in excell!) ```bash #activate deeptools environment export PATH="/home/data/rrebollo/miniconda2/bin:$PATH" source activate deeptools #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth10.1 for K4 only UP and DOWN promoters computeMatrix reference-point --referencePoint center -S CNVR15.R1.bw CNVR17.R1.bw CNVR19.R1.bw CNVR21.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 -R dmel_Cold_UP_Goth10_BOUNDARIES.gtf dmel_Cold_DOWN_Goth10_BOUNDARIES.gtf -a 2000 -b 2000 --skipZeros -p 12 -o PromotersDEG_K4_DmGoth10.1-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file plotHeatmap --matrixFile PromotersDEG_K4_DmGoth10.1-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName PromotersDEG_K4_DmGoth10.1_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth10.1 for K9 only only UP and DOWN promoters computeMatrix reference-point --referencePoint center -S CNVR16.R1.bw CNVR18.R1.bw CNVR20.R1.bw CNVR22.R1.bw --samplesLabel Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R dmel_Cold_UP_Goth10_BOUNDARIES.gtf dmel_Cold_DOWN_Goth10_BOUNDARIES.gtf -a 2000 -b 2000 --skipZeros -p 4 -o PromotersDEG_K9_DmGoth10.1-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file plotHeatmap --matrixFile PromotersDEG_K9_DmGoth10.1-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName PromotersDEG_K9_DmGoth10.1_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth6.3 K4 only UP and DOWN promoters computeMatrix reference-point --referencePoint center -S CNVR32.R1.bw CNVR35.R1.bw CNVR38.R1.bw CNVR41.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 -R dmel_Cold_UP_Goth6_BOUNDARIES.gtf dmel_Cold_DOWN_Goth6_BOUNDARIES.gtf -a 2000 -b 2000 --skipZeros -p 4 -o PromotersDEG_K4_DmGoth6.3-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file K4 plotHeatmap --matrixFile PromotersDEG_K4_DmGoth6.3-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName PromotersDEG_K4_DmGoth6.3_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth6.3 K9 only UP and DOWN promoters computeMatrix reference-point --referencePoint center -S CNVR33.R1.bw CNVR36.R1.bw CNVR39.R1.bw CNVR42.R1.bw --samplesLabel Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R dmel_Cold_UP_Goth6_BOUNDARIES.gtf dmel_Cold_DOWN_Goth6_BOUNDARIES.gtf -a 2000 -b 2000 --skipZeros -p 4 -o PromotersDEG_K9_DmGoth6.3-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file K9 plotHeatmap --matrixFile PromotersDEG_K9_DmGoth6.3-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName PromotersDEG_K9_DmGoth6.3_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ7 K4 only UP and DOWN promoters TO DO computeMatrix reference-point --referencePoint center -S CNVR23.R1.bw CNVR25.R1.bw CNVR27.R1.bw CNVR29.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 -R dmel_Cold_UP_Sj7_BOUNDARIES.gtf dmel_Cold_DOWN_Sj7_BOUNDARIES.gtf -a 2000 -b 2000 --skipZeros -p 4 -o PromotersDEG_K4_DmSJ7-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file K4 plotHeatmap --matrixFile PromotersDEG_K4_DmSJ7-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName PromotersDEG_K4_DmSJ7_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ7 K9 only UP and DOWN promoters computeMatrix reference-point --referencePoint center -S CNVR24.R1.bw CNVR26.R1.bw CNVR28.R1.bw CNVR30.R1.bw --samplesLabel Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R dmel_Cold_UP_Sj7_BOUNDARIES.gtf dmel_Cold_DOWN_Sj7_BOUNDARIES.gtf -a 2000 -b 2000 --skipZeros -p 4 -o PromotersDEG_K9_DmSJ7-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file K9 plotHeatmap --matrixFile PromotersDEG_K9_DmSJ7-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName PromotersDEG_K9_DmSJ7_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ23 K4 only UP and DOWN promoters computeMatrix reference-point --referencePoint center -S CNVR44.R1.bw CNVR47.R1.bw CNVR50.R1.bw CNVR53.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 -R dmel_Cold_UP_Sj23_BOUNDARIES.gtf dmel_Cold_DOWN_Sj23_BOUNDARIES.gtf -a 2000 -b 2000 --skipZeros -p 4 -o PromotersDEG_K4_DmSJ23-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file K4 plotHeatmap --matrixFile PromotersDEG_K4_DmSJ23-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName PromotersDEG_K4_DmSJ23_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ23 K9 only UP and DOWN promoters computeMatrix reference-point --referencePoint center -S CNVR45.R1.bw CNVR48.R1.bw CNVR51.R1.bw CNVR54.R1.bw --samplesLabel Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R dmel_Cold_UP_Sj23_BOUNDARIES.gtf dmel_Cold_DOWN_Sj23_BOUNDARIES.gtf -a 2000 -b 2000 --skipZeros -p 4 -o PromotersDEG_K9_DmSJ23-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file K9 plotHeatmap --matrixFile PromotersDEG_K9_DmSJ23-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName PromotersDEG_K9_DmSJ23_Heatmap.pdf #deactivate conda environment conda deactivate deeptools ``` K4 ![](https://i.imgur.com/Q1BafVg.jpg) K9 ![](https://i.imgur.com/H2xvkVQ.jpg) ### TEs Plot profile of K4 and K9 data on TEs as for the normalized data ```ssh #activate deeptools environment export PATH="/home/data/rrebollo/miniconda2/bin:$PATH" source activate deeptools #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth10.1 computeMatrix scale-regions -S CNVR15.R1.bw CNVR17.R1.bw CNVR19.R1.bw CNVR21.R1.bw CNVR16.R1.bw CNVR18.R1.bw CNVR20.R1.bw CNVR22.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R /home/data/rrebollo/drosophila/FRM/melanogaster/genome/dmel.TE.bed -a 2000 -b 2000 --skipZeros -p 4 -o TEs_DmGoth10.1-ComputeMatrix.gz #plotProfile DmGoth10.1 plotProfile --matrixFile TEs_DmGoth10.1-ComputeMatrix.gz --perGroup --outFileName TEs_DmGoth10.1_PlotProfile.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth6.3 computeMatrix scale-regions -S CNVR32.R1.bw CNVR35.R1.bw CNVR38.R1.bw CNVR41.R1.bw CNVR33.R1.bw CNVR36.R1.bw CNVR39.R1.bw CNVR42.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R /home/data/rrebollo/drosophila/FRM/melanogaster/genome/dmel.TE.bed -a 2000 -b 2000 --skipZeros -p 4 -o TEs_DmGoth6.3-ComputeMatrix.gz #plotProfile plotProfile --matrixFile TEs_DmGoth6.3-ComputeMatrix.gz --perGroup --outFileName TEs_DmGoth6.3_PlotProfile.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ7 computeMatrix scale-regions -S CNVR23.R1.bw CNVR25.R1.bw CNVR27.R1.bw CNVR29.R1.bw CNVR24.R1.bw CNVR26.R1.bw CNVR28.R1.bw CNVR30.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R /home/data/rrebollo/drosophila/FRM/melanogaster/genome/dmel.TE.bed -a 2000 -b 2000 --skipZeros -p 4 -o TEs_DmSJ7-ComputeMatrix.gz #plotProfile DmSJ7 plotProfile --matrixFile TEs_DmSJ7-ComputeMatrix.gz --perGroup --outFileName TEs_DmSJ7_PlotProfile.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ23 computeMatrix scale-regions -S CNVR44.R1.bw CNVR47.R1.bw CNVR50.R1.bw CNVR53.R1.bw CNVR45.R1.bw CNVR48.R1.bw CNVR51.R1.bw CNVR54.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R /home/data/rrebollo/drosophila/FRM/melanogaster/genome/dmel.TE.bed -a 2000 -b 2000 --skipZeros -p 4 -o TEs_DmSJ23-ComputeMatrix.gz #plotProfile plotProfile --matrixFile TEs_DmSJ23-ComputeMatrix.gz --perGroup --outFileName TEs_DmSJ23_PlotProfile.pdf #Can also plot all K4me3 together to show they all have the same profile #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb All K4 computeMatrix scale-regions -S CNVR44.R1.bw CNVR47.R1.bw CNVR50.R1.bw CNVR53.R1.bw CNVR23.R1.bw CNVR25.R1.bw CNVR27.R1.bw CNVR29.R1.bw CNVR15.R1.bw CNVR17.R1.bw CNVR19.R1.bw CNVR21.R1.bw CNVR32.R1.bw CNVR35.R1.bw CNVR38.R1.bw CNVR41.R1.bw --smartLabels -R /home/data/rrebollo/drosophila/FRM/melanogaster/genome/dmel.TE.bed -a 2000 -b 2000 --skipZeros -p 4 -o TEs_K4-ComputeMatrix.gz #plotProfile All K4 plotProfile --matrixFile TEs_K4-ComputeMatrix.gz --perGroup --legendLocation upper-right --outFileName TEs_K4_PlotProfile.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb All K9 computeMatrix scale-regions -S CNVR45.R1.bw CNVR48.R1.bw CNVR51.R1.bw CNVR54.R1.bw CNVR24.R1.bw CNVR26.R1.bw CNVR28.R1.bw CNVR30.R1.bw CNVR16.R1.bw CNVR18.R1.bw CNVR20.R1.bw CNVR22.R1.bw CNVR33.R1.bw CNVR36.R1.bw CNVR39.R1.bw CNVR42.R1.bw --smartLabels -R /home/data/rrebollo/drosophila/FRM/melanogaster/genome/dmel.TE.bed -a 2000 -b 2000 --skipZeros -p 4 -o TEs_K9-ComputeMatrix.gz #plotProfile All K9 plotProfile --matrixFile TEs_K9-ComputeMatrix.gz --perGroup --legendLocation upper-right --outFileName TEs_K9_PlotProfile.pdf #deactivate conda environment conda deactivate ``` ![](https://i.imgur.com/B5HUecR.png) Trying 5Kb and 10Kb to see if the K9 comes down => exactly the same thing. ## Manuscript Text ### Methods Illumina adaptors and quality trimming were performed with TrimGalore (ref) with default parameters, and 5' trimming (clip_R1 9 and --clip_R2 9). Cleaned reads were mapped with Bowtie2 (ref) using --sensitive-local parameters against dmel-all-chromosome-r6.16.fasta. All unmapped, multimapped, and duplicated reads were filtered out using Sambamba (ref) with the following parameters (-F "[XS] == null and not unmapped and not duplicate"). Reads within blacklisted regions (ref) were removed thanks to bedtools (ref). All surviving reads were then subjected to ChIP-seq quality control. *to write on the deeptools and other analysis* ### Results *to write* ## Deprecated ### Get reproducible peaks *No need because Diffbind does it. But keep it in mind for go analysis and other peak annotation.* I thought about using IDR, but the literature is not as clear if it works with histone marks so I'm using [ChIPpeakAnno](http://bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html). I followed this [workflow](https://www.bioconductor.org/packages/devel/bioc/vignettes/ChIPpeakAnno/inst/doc/ChIPpeakAnno.html). To install this package, start R (version "4.1"). Did it in my compute. ```r # set working directory setwd("~/work/Collaborations/C.Vieira/FRM/melanogaster/macs") #install ChIPpeakAnno #if (!requireNamespace("BiocManager", quietly = TRUE)) # + install.packages("BiocManager") #BiocManager::install("ChIPpeakAnno") #load library library(ChIPpeakAnno) #melanogaster #DmSJ7 #Load peaks K4 (narrow Peaks) DmSJ7Cont1K4input <- "DmSJ7_Cont1-K4_peaks.narrowPeak" DmSJ7Cont1K4peaks <- toGRanges(DmSJ7Cont1K4input, format="narrowPeak") DmSJ7Cont2K4input <- "DmSJ7_Cont2-K4_peaks.narrowPeak" DmSJ7Cont2K4peaks <- toGRanges(DmSJ7Cont2K4input, format="narrowPeak") DmSJ7CS1K4input <- "DmSJ7_CS1-K4_peaks.narrowPeak" DmSJ7CS1K4peaks <- toGRanges(DmSJ7CS1K4input, format="narrowPeak") DmSJ7CS2K4input <- "DmSJ7_CS2-K4_peaks.narrowPeak" DmSJ7CS2K4peaks <- toGRanges(DmSJ7CS2K4input, format="narrowPeak") #Load peaks K9 (broad Peaks) DmSJ7Cont1K9input <- "DmSJ7_Cont1-K9_peaks.broadPeak" DmSJ7Cont1K9peaks <- toGRanges(DmSJ7Cont1K9input, format="broadPeak") DmSJ7Cont2K9input <- "DmSJ7_Cont2-K9_peaks.broadPeak" DmSJ7Cont2K9peaks <- toGRanges(DmSJ7Cont2K9input, format="broadPeak") DmSJ7CS1K9input <- "DmSJ7_CS1-K9_peaks.broadPeak" DmSJ7CS1K9peaks <- toGRanges(DmSJ7CS1K9input, format="broadPeak") DmSJ7CS2K9input <- "DmSJ7_CS2-K9_peaks.broadPeak" DmSJ7CS2K9peaks <- toGRanges(DmSJ7CS2K9input, format="broadPeak") #Finding overlapping peaks DmSJ7ContK4_ol <- findOverlapsOfPeaks(DmSJ7Cont1K4peaks, DmSJ7Cont2K4peaks, connectedPeaks = "keepAll") DmSJ7CSK4_ol <- findOverlapsOfPeaks(DmSJ7CS1K4peaks, DmSJ7CS2K4peaks, connectedPeaks = "keepAll") DmSJ7ContK9_ol <- findOverlapsOfPeaks(DmSJ7Cont1K9peaks, DmSJ7Cont2K9peaks, connectedPeaks = "keepAll") DmSJ7CSK9_ol <- findOverlapsOfPeaks(DmSJ7CS1K9peaks, DmSJ7CS2K9peaks, connectedPeaks = "keepAll") #Venn Diagrams makeVennDiagram(DmSJ7ContK4_ol, fill=c("#009E73", "#F0E442"), # circle fill color col=c("#D55E00", "#0072B2"), #circle border color cat.col=c("#D55E00", "#0072B2")) makeVennDiagram(DmSJ7CSK4_ol, fill=c("#009E73", "#F0E442"), # circle fill color col=c("#D55E00", "#0072B2"), #circle border color cat.col=c("#D55E00", "#0072B2")) makeVennDiagram(DmSJ7ContK9_ol, fill=c("#009E73", "#F0E442"), # circle fill color col=c("#D55E00", "#0072B2"), #circle border color cat.col=c("#D55E00", "#0072B2")) makeVennDiagram(DmSJ7CSK9_ol, fill=c("#009E73", "#F0E442"), # circle fill color col=c("#D55E00", "#0072B2"), #circle border color cat.col=c("#D55E00", "#0072B2")) #venn diagrams with absolute numbers makeVennDiagram(DmSJ7ContK4_ol, fill=c("#009E73", "#F0E442"), col=c("#D55E00", "#0072B2"), cat.col=c("#D55E00", "#0072B2"), connectedPeaks = "keepAll") makeVennDiagram(DmSJ7CSK4_ol, fill=c("#009E73", "#F0E442"), col=c("#D55E00", "#0072B2"), cat.col=c("#D55E00", "#0072B2"), connectedPeaks = "keepAll") makeVennDiagram(DmSJ7ContK9_ol, fill=c("#009E73", "#F0E442"), col=c("#D55E00", "#0072B2"), cat.col=c("#D55E00", "#0072B2"), connectedPeaks = "keepAll") makeVennDiagram(DmSJ7CSK9_ol, fill=c("#009E73", "#F0E442"), col=c("#D55E00", "#0072B2"), cat.col=c("#D55E00", "#0072B2"), connectedPeaks = "keepAll") #Pies of overlapping distribution pie1(table(DmSJ7ContK4_ol$overlappingPeaks[["DmSJ7Cont1K4peaks///DmSJ7Cont2K4peaks"]]$overlapFeature)) pie1(table(DmSJ7CSK4_ol$overlappingPeaks[["DmSJ7CS1K4peaks///DmSJ7CS2K4peaks"]]$overlapFeature)) pie1(table(DmSJ7ContK9_ol$overlappingPeaks[["DmSJ7Cont1K9peaks///DmSJ7Cont2K9peaks"]]$overlapFeature)) pie1(table(DmSJ7CSK9_ol$overlappingPeaks[["DmSJ7CS1K9peaks///DmSJ7CS2K9peaks"]]$overlapFeature)) #Create mergedpeaks file DmSJ7ContK4 <- as.data.frame(DmSJ7ContK4_ol$mergedPeaks) DmSJ7CSK4 <- as.data.frame(DmSJ7CSK4_ol$mergedPeaks) DmSJ7ContK9 <- as.data.frame(DmSJ7ContK9_ol$mergedPeaks) DmSJ7CSK9 <- as.data.frame(DmSJ7CSK9_ol$mergedPeaks) #export mergedpeaks file write.table(DmSJ7ContK4,"DmSJ7ContK4.txt", row.names = FALSE, sep = "\t", quote=F) write.table(DmSJ7CSK4,"DmSJ7CSK4.txt", row.names = FALSE, sep = "\t", quote=F) write.table(DmSJ7ContK9,"DmSJ7ContK9.txt", row.names = FALSE, sep = "\t", quote=F) write.table(DmSJ7CSK9,"DmSJ7CSK9.txt", row.names = FALSE, sep = "\t", quote=F) #DmGoth10.1 #DmGoth10.1 #Load peaks K4 (narrow Peaks) DmGoth10.1Cont1K4input <- "DmGoth10.1_Cont1-K4_peaks.narrowPeak" DmGoth10.1Cont1K4peaks <- toGRanges(DmGoth10.1Cont1K4input, format="narrowPeak") DmGoth10.1Cont2K4input <- "DmGoth10.1_Cont2-K4_peaks.narrowPeak" DmGoth10.1Cont2K4peaks <- toGRanges(DmGoth10.1Cont2K4input, format="narrowPeak") DmGoth10.1CS1K4input <- "DmGoth10.1_CS1-K4_peaks.narrowPeak" DmGoth10.1CS1K4peaks <- toGRanges(DmGoth10.1CS1K4input, format="narrowPeak") DmGoth10.1CS2K4input <- "DmGoth10.1_CS2-K4_peaks.narrowPeak" DmGoth10.1CS2K4peaks <- toGRanges(DmGoth10.1CS2K4input, format="narrowPeak") #Load peaks K9 (broad Peaks) DmGoth10.1Cont1K9input <- "DmGoth10.1_Cont1-K9_peaks.broadPeak" DmGoth10.1Cont1K9peaks <- toGRanges(DmGoth10.1Cont1K9input, format="broadPeak") DmGoth10.1Cont2K9input <- "DmGoth10.1_Cont2-K9_peaks.broadPeak" DmGoth10.1Cont2K9peaks <- toGRanges(DmGoth10.1Cont2K9input, format="broadPeak") DmGoth10.1CS1K9input <- "DmGoth10.1_CS1-K9_peaks.broadPeak" DmGoth10.1CS1K9peaks <- toGRanges(DmGoth10.1CS1K9input, format="broadPeak") DmGoth10.1CS2K9input <- "DmGoth10.1_CS2-K9_peaks.broadPeak" DmGoth10.1CS2K9peaks <- toGRanges(DmGoth10.1CS2K9input, format="broadPeak") #Finding overlapping peaks DmGoth10.1ContK4_ol <- findOverlapsOfPeaks(DmGoth10.1Cont1K4peaks, DmGoth10.1Cont2K4peaks, connectedPeaks = "keepAll") DmGoth10.1CSK4_ol <- findOverlapsOfPeaks(DmGoth10.1CS1K4peaks, DmGoth10.1CS2K4peaks, connectedPeaks = "keepAll") DmGoth10.1ContK9_ol <- findOverlapsOfPeaks(DmGoth10.1Cont1K9peaks, DmGoth10.1Cont2K9peaks, connectedPeaks = "keepAll") DmGoth10.1CSK9_ol <- findOverlapsOfPeaks(DmGoth10.1CS1K9peaks, DmGoth10.1CS2K9peaks, connectedPeaks = "keepAll") #Venn Diagrams makeVennDiagram(DmGoth10.1ContK4_ol, fill=c("#009E73", "#F0E442"), # circle fill color col=c("#D55E00", "#0072B2"), #circle border color cat.col=c("#D55E00", "#0072B2")) makeVennDiagram(DmGoth10.1CSK4_ol, fill=c("#009E73", "#F0E442"), # circle fill color col=c("#D55E00", "#0072B2"), #circle border color cat.col=c("#D55E00", "#0072B2")) makeVennDiagram(DmGoth10.1ContK9_ol, fill=c("#009E73", "#F0E442"), # circle fill color col=c("#D55E00", "#0072B2"), #circle border color cat.col=c("#D55E00", "#0072B2")) makeVennDiagram(DmGoth10.1CSK9_ol, fill=c("#009E73", "#F0E442"), # circle fill color col=c("#D55E00", "#0072B2"), #circle border color cat.col=c("#D55E00", "#0072B2")) #venn diagrams with absolute numbers makeVennDiagram(DmGoth10.1ContK9_ol, fill=c("#009E73", "#F0E442"), col=c("#D55E00", "#0072B2"), cat.col=c("#D55E00", "#0072B2"), connectedPeaks = "keepAll") makeVennDiagram(DmGoth10.1CSK9_ol, fill=c("#009E73", "#F0E442"), col=c("#D55E00", "#0072B2"), cat.col=c("#D55E00", "#0072B2"), connectedPeaks = "keepAll") #Pies of overlapping distribution pie1(table(DmGoth10.1ContK4_ol$overlappingPeaks[["DmGoth10.1Cont1K4peaks///DmGoth10.1Cont2K4peaks"]]$overlapFeature)) pie1(table(DmGoth10.1CSK4_ol$overlappingPeaks[["DmGoth10.1CS1K4peaks///DmGoth10.1CS2K4peaks"]]$overlapFeature)) pie1(table(DmGoth10.1ContK9_ol$overlappingPeaks[["DmGoth10.1Cont1K9peaks///DmGoth10.1Cont2K9peaks"]]$overlapFeature)) pie1(table(DmGoth10.1CSK9_ol$overlappingPeaks[["DmGoth10.1CS1K9peaks///DmGoth10.1CS2K9peaks"]]$overlapFeature)) #Create mergedpeaks file DmGoth10.1ContK4 <- as.data.frame(DmGoth10.1ContK4_ol$mergedPeaks) DmGoth10.1CSK4 <- as.data.frame(DmGoth10.1CSK4_ol$mergedPeaks) DmGoth10.1ContK9 <- as.data.frame(DmGoth10.1ContK9_ol$mergedPeaks) DmGoth10.1CSK9 <- as.data.frame(DmGoth10.1CSK9_ol$mergedPeaks) #export mergedpeaks file write.table(DmGoth10.1ContK4,"DmGoth10.1ContK4.txt", row.names = FALSE, sep = "\t", quote=F) write.table(DmGoth10.1CSK4,"DmGoth10.1CSK4.txt", row.names = FALSE, sep = "\t", quote=F) write.table(DmGoth10.1ContK9,"DmGoth10.1ContK9.txt", row.names = FALSE, sep = "\t", quote=F) write.table(DmGoth10.1CSK9,"DmGoth10.1CSK9.txt", row.names = FALSE, sep = "\t", quote=F) ``` ### Trials on non normalized data Now that we looked at correlation and all seems fine (even though there is a slighlty difference between the K9 datasets) let's look at K4 and K9 enrichment on promoters between controls, cold shocks and populations. We can first plot a clustered heatmap of K4me3 and K9me3 enrichment at gene promoters in order to observe genes that have switched, or maintained their chromatin state. I did a test with all samples from one population but clearly, the K4me3 enrichment flattens out the K9me3. So I decided to make a separate heatmap for K4 and K9. The example below is for DmGoth10.1. ![](https://i.imgur.com/H3v73Pa.png) ```ssh #activate deeptools environment source ~/miniconda2/bin/activate deeptools #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth10.1 for K4 only computeMatrix reference-point --referencePoint center -S CNVR15.R1.bw CNVR17.R1.bw CNVR19.R1.bw CNVR21.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_K4_DmGoth10.1-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file plotHeatmap --matrixFile Promoters_K4_DmGoth10.1-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName Promoters_K4_DmGoth10.1_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth10.1 for K9 only computeMatrix reference-point --referencePoint center -S CNVR16.R1.bw CNVR18.R1.bw CNVR20.R1.bw CNVR22.R1.bw --samplesLabel Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_K9_DmGoth10.1-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file plotHeatmap --matrixFile Promoters_K9_DmGoth10.1-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName Promoters_K9_DmGoth10.1_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth6.3 K4 computeMatrix reference-point --referencePoint center -S CNVR32.R1.bw CNVR35.R1.bw CNVR38.R1.bw CNVR41.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_K4_DmGoth6.3-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file K4 plotHeatmap --matrixFile Promoters_K4_DmGoth6.3-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName Promoters_K4_DmGoth6.3_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmGoth6.3 K9 computeMatrix reference-point --referencePoint center -S CNVR33.R1.bw CNVR36.R1.bw CNVR39.R1.bw CNVR42.R1.bw --samplesLabel Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_K9_DmGoth6.3-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file K9 plotHeatmap --matrixFile Promoters_K9_DmGoth6.3-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName Promoters_K9_DmGoth6.3_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ7 K4 computeMatrix reference-point --referencePoint center -S CNVR23.R1.bw CNVR25.R1.bw CNVR27.R1.bw CNVR29.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_K4_DmSJ7-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file K4 plotHeatmap --matrixFile Promoters_K4_DmSJ7-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName Promoters_K4_DmSJ7_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ7 K9 computeMatrix reference-point --referencePoint center -S CNVR24.R1.bw CNVR26.R1.bw CNVR28.R1.bw CNVR30.R1.bw --samplesLabel Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_K9_DmSJ7-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file K9 plotHeatmap --matrixFile Promoters_K9_DmSJ7-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName Promoters_K9_DmSJ7_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ23 K4 computeMatrix reference-point --referencePoint center -S CNVR44.R1.bw CNVR47.R1.bw CNVR50.R1.bw CNVR53.R1.bw --samplesLabel Con1-K4 Con2-K4 CS1-K4 CS2-K4 -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_K4_DmSJ23-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file K4 plotHeatmap --matrixFile Promoters_K4_DmSJ23-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName Promoters_K4_DmSJ23_Heatmap.pdf #computeMatrix for plotprofiles using referencePoint on TSS+/-2Kb DmSJ23 K9 computeMatrix reference-point --referencePoint center -S CNVR45.R1.bw CNVR48.R1.bw CNVR51.R1.bw CNVR54.R1.bw --samplesLabel Con1-K9 Con2-K9 CS1-K9 CS2-K9 -R ../promoters.2000.bed -a 2000 -b 2000 --skipZeros -p 4 -o Promoters_K9_DmSJ23-ComputeMatrix.gz #plotHeatmaps on promoter +/- 2Kb file K9 plotHeatmap --matrixFile Promoters_K9_DmSJ23-ComputeMatrix.gz --boxAroundHeatmaps no --outFileName Promoters_K9_DmSJ23_Heatmap.pdf #deactivate conda environment source ~/miniconda2/bin/deactivate deeptools ``` Below the heatmap of K4 normalized coverage on promoters. We can see most promoters are not associated with K4. We can see there is no visible difference between replicates, and between control and CS. As we have seen before, despite the normalization, the number of promoters that seem covered by K4me3 in DmGoth6.3 and DmSJ23 are smaller than DmGoth10.1 and DmSJ7, even between controls. Is this something that Marie sees when analising controls? ![](https://i.imgur.com/Nm2dRKn.png) The K9 data at promoters is harder to check, as it would be as very few genes are the target of K9me3. Need to redo this with TEs. ![](https://i.imgur.com/9fgqSnl.png)