Generating HiC Heatmaps for VGP Genome Assemblies

# Generating HiC Heatmaps for VGP Genome Assemblies By Tanya Lama HiC align plots are part of scaffold4_salsa quality checks in the VGP 1.6 assembly pipeline ``` ssh tl50a@ghpcc06.umassrc.org {password} module load anaconda2/4.4.0 source activate vgp cd /project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1 ``` ## 1. Setting up a new conda environment A few dependencies are required for us to generate our heatmap, so we will create a new conda environment called "*vgp*". The == syntax specifies the version you want to install. Use the *conda search* function to ensure packages are available before installing. **Note:** Python 3.6 is recommended because the end-of-life for Python 2 is scheduled for 2020. ``` conda create --name vgp python==3.6 samtools==1.9 pretext-suite==0.0.1 pretextmap==0.1.1 libdeflate==1.2 source activate vgp ``` ## 2. Download bam Navigate to scaffold4_salsa folder in your intermediates and locate the bam file species_ILKL1AMS7R1.bam. Download into your local project space. ![](https://i.imgur.com/EJ7GoQ2.png) ``` wget {url} ``` ## 3. Submit job This will take a few hours to run... Important: A SAM header with contig info must be present for SAM format (-h option for samtools). ``` bsub -q long -W 8:00 -R rusage[mem=8000] -n 4 -R span\[hosts=1\] "samtools view -h fAntMac1_ILKL1AMS7_R1.bam | PretextMap -o fAntMac1_ILKL1AMS7_R1_2.pretext" ``` ## 4. Download the output locally scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fAntMac1_ILKL1AMS714_R1.pretext fAntMac1_ILKL1AMS714_R1.pretext ## 5. Download and install PretextView (do not download PretextView 0.1.0) The new release (0.1.0) is missing at least one key function. Select the correct version under Assets https://github.com/wtsi-hpag/PretextView/releases ## 6. Open your .pretext file using PretextView the file fAntMac1_ILKL1AMS7_R1_2.pretext should be ~30+Mb ## 7. Interpreting the final product The three-dimensional structure of a complete chromosome or genome is usually visualized by some method. The most straightforward visualization of the Hi-C contact matrix is a square heat map, where **the color corresponds to the contact count**. Heat maps can be constructed for the full genome (Fig. 1a) or for individual chromosomes (Fig. 1 b). ![](https://i.imgur.com/JeQHv9K.png) ![](https://i.imgur.com/abJmiI8.jpg) The dominant visual feature in every Hi-C heat map is the strong diagonal, which represents the 3D proximity of pairs of loci that are adjacent in genomic coordinates. The number of rows and columns is equal to the length of the genome divided by the bin size. Whole genome visualizations can reveal potential rearrangements of the genome (Fig.1a) See: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1161-y ## Troubleshooting fAntMac1 ## 1. Calculate n mapped reads ``` bsub -q short -W 3:59 -R rusage[mem=8000] -n 4 -R span\[hosts=1\] "samtools flagstat /project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fAntMac1_ILKL1AMS714_R1.bam" ``` ### fAntMac1 > 506609438 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary ***23958488*** + 0 duplicates 506609438 + 0 mapped (100.00% : N/A) 506609438 + 0 paired in sequencing 253304719 + 0 read1 253304719 + 0 read2 506609438 + 0 properly paired (100.00% : N/A) 506609438 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 62021090 + 0 with mate mapped to a different chr 62021090 + 0 with mate mapped to a different chr (mapQ>=5) ### fEsoLuc1 > 579281612 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary ***80053582*** + 0 duplicates 579281612 + 0 mapped (100.00% : N/A) 579281612 + 0 paired in sequencing 289640806 + 0 read1 289640806 + 0 read2 579281612 + 0 properly paired (100.00% : N/A) 579281612 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 220888442 + 0 with mate mapped to a different chr 220888442 + 0 with mate mapped to a different chr (mapQ>=5) ## 2. Count the number of reads in your fastq file The bam only contains only mapped reads so you need to count the number of reads in the fastq ``` echo $(zcat fAntMac1_ILKL1AMS714_R1.fastq.gz |wc -l)/4|bc echo $(zcat fEsoLuc1_241128L001_R1.fastq.gz |wc -l)/4|bc fEsoLuc1_241128L001_R1.fastq.gz ``` ### fAntMac1 23512944 #this was wrong because we used cat 376902687 #this is correct ### fEsoLuc1 37277321 #this was wrong because we should have used zcat 453200188 #this is correct ## 3. View the first 1000 records of a bam ``` samtools view fAntMac1_ILKL1AMS714_R1.bam | head -n 1000 ``` ## 4. Extract contact lengths from .bam Based on the bam file, a first approximation of the *contact lengths* should be the 9th column the insert size. Extract it with: samtools view bam | awk ``` samtools view fEsoLuc1_241128L001_R1.bam | awk '{print $9}' > fEsoLuc1output ``` #awk '{print $9} NR==10000{exit}' fEsoLuc1output > fEsoLuc10k #print the first 10,000 rows ## Final commands: ``` awk '{print $9}' fAntMac1_ILKL1AMS714_R1.bam > output1.txt awk '{print $9}' fEsoLuc1_241128L001_R1.bam > fEsoLuc1output.txt ``` ## 5. Plot the contact lengths in a frequency plot When contact length = 0 it is an interscaffold interaction ``` scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fEsoLuc10k fEsoLuc10k 196083712 scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fAntMac1output fAntMac1output scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/newfile.txt newfile.txt scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fEsoLucfreq.txt fEsoLucfreq.txt fEsoLucfreq.txt ``` # Generating the necessary data to create a frequency plot ``` sort fAntMac10k -n | uniq -c > newfile.txt ``` # Downloading data from DNANexus using dxpy (conda package) dx login ``` dx download -r "fEsoLuc1:/genomic_data/arima/fEsoLuc1_241128L001_R1.fastq.gz" dx download -r "fEsoLuc1:/genomic_data/arima/fEsoLuc1_241128L001_R1.fastq.gz" /project/uma_lisa_komoroske/Tanya/download/vgp ``` # Formatting read pair distance data for a frequency plot ``` ##original command from Giulio: cat newfile.txt | sed 's/-//g' | sort -n | uniq -c | column -t -O "_" | sed 's/ *//g' | sed 's/_/\t/g' ``` # What do all these sed commands mean? sed is a text editor that has so many commands I have a hard time keeping them all straight. A laymans interpretation: ``` cat newfile.txt | #show me the contents of this file sed 's/-//g' | #replace every - with a space character sort -n | #sort the contents line by line **numerically** uniq -c | #how many times is a line repeated? column -t "/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1" #display the contents of a file in columns -t create a table -O define the order in which columns are displayed | sed 's/ *//g' | #replace * with a space globally sed 's/_/\t/g' #replace _ with a horizontal tab \t ``` ### We actually didn't need all that. Here's our final command: ``` cat file.txt | sed 's/-//g'| sort -n | uniq -c | column -t > fEsoLuc1_file.txt ``` # Download the output and load into R ``` scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fEsoLuc1_file.txt fEsoLuc1_file.txt ``` ###### tags: `VGP`