# Generating HiC Heatmaps for VGP Genome Assemblies
By Tanya Lama
HiC align plots are part of scaffold4_salsa quality checks in the VGP 1.6 assembly pipeline
```
ssh tl50a@ghpcc06.umassrc.org
{password}
module load anaconda2/4.4.0
source activate vgp
cd /project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1
```
## 1. Setting up a new conda environment
A few dependencies are required for us to generate our heatmap, so we will create a new conda environment called "*vgp*". The == syntax specifies the version you want to install. Use the *conda search* function to ensure packages are available before installing. **Note:** Python 3.6 is recommended because the end-of-life for Python 2 is scheduled for 2020.
```
conda create --name vgp python==3.6 samtools==1.9 pretext-suite==0.0.1 pretextmap==0.1.1 libdeflate==1.2
source activate vgp
```
## 2. Download bam
Navigate to scaffold4_salsa folder in your intermediates and locate the bam file species_ILKL1AMS7R1.bam. Download into your local project space.

```
wget {url}
```
## 3. Submit job
This will take a few hours to run...
Important: A SAM header with contig info must be present for SAM format (-h option for samtools).
```
bsub -q long -W 8:00 -R rusage[mem=8000] -n 4 -R span\[hosts=1\] "samtools view -h fAntMac1_ILKL1AMS7_R1.bam | PretextMap -o fAntMac1_ILKL1AMS7_R1_2.pretext"
```
## 4. Download the output locally
scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fAntMac1_ILKL1AMS714_R1.pretext fAntMac1_ILKL1AMS714_R1.pretext
## 5. Download and install PretextView (do not download PretextView 0.1.0)
The new release (0.1.0) is missing at least one key function.
Select the correct version under Assets https://github.com/wtsi-hpag/PretextView/releases
## 6. Open your .pretext file using PretextView
the file fAntMac1_ILKL1AMS7_R1_2.pretext should be ~30+Mb
## 7. Interpreting the final product
The three-dimensional structure of a complete chromosome or genome is usually visualized by some method. The most straightforward visualization of the Hi-C contact matrix is a square heat map, where **the color corresponds to the contact count**.
Heat maps can be constructed for the full genome (Fig. 1a) or for individual chromosomes (Fig. 1 b).


The dominant visual feature in every Hi-C heat map is the strong diagonal, which represents the 3D proximity of pairs of loci that are adjacent in genomic coordinates. The number of rows and columns is equal to the length of the genome divided by the bin size.
Whole genome visualizations can reveal potential rearrangements of the genome (Fig.1a)
See: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1161-y
## Troubleshooting fAntMac1
## 1. Calculate n mapped reads
```
bsub -q short -W 3:59 -R rusage[mem=8000] -n 4 -R span\[hosts=1\] "samtools flagstat /project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fAntMac1_ILKL1AMS714_R1.bam"
```
### fAntMac1
> 506609438 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
***23958488*** + 0 duplicates
506609438 + 0 mapped (100.00% : N/A)
506609438 + 0 paired in sequencing
253304719 + 0 read1
253304719 + 0 read2
506609438 + 0 properly paired (100.00% : N/A)
506609438 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
62021090 + 0 with mate mapped to a different chr
62021090 + 0 with mate mapped to a different chr (mapQ>=5)
### fEsoLuc1
> 579281612 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
***80053582*** + 0 duplicates
579281612 + 0 mapped (100.00% : N/A)
579281612 + 0 paired in sequencing
289640806 + 0 read1
289640806 + 0 read2
579281612 + 0 properly paired (100.00% : N/A)
579281612 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
220888442 + 0 with mate mapped to a different chr
220888442 + 0 with mate mapped to a different chr (mapQ>=5)
## 2. Count the number of reads in your fastq file
The bam only contains only mapped reads so you need to count the number of reads in the fastq
```
echo $(zcat fAntMac1_ILKL1AMS714_R1.fastq.gz |wc -l)/4|bc
echo $(zcat fEsoLuc1_241128L001_R1.fastq.gz |wc -l)/4|bc
fEsoLuc1_241128L001_R1.fastq.gz
```
### fAntMac1
23512944 #this was wrong because we used cat
376902687 #this is correct
### fEsoLuc1
37277321 #this was wrong because we should have used zcat
453200188 #this is correct
## 3. View the first 1000 records of a bam
```
samtools view fAntMac1_ILKL1AMS714_R1.bam | head -n 1000
```
## 4. Extract contact lengths from .bam
Based on the bam file, a first approximation of the *contact lengths* should be the 9th column
the insert size. Extract it with:
samtools view bam | awk
```
samtools view fEsoLuc1_241128L001_R1.bam | awk '{print $9}' > fEsoLuc1output
```
#awk '{print $9} NR==10000{exit}' fEsoLuc1output > fEsoLuc10k #print the first 10,000 rows
## Final commands:
```
awk '{print $9}' fAntMac1_ILKL1AMS714_R1.bam > output1.txt
awk '{print $9}' fEsoLuc1_241128L001_R1.bam > fEsoLuc1output.txt
```
## 5. Plot the contact lengths in a frequency plot
When contact length = 0 it is an interscaffold interaction
```
scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fEsoLuc10k fEsoLuc10k
196083712
scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fAntMac1output fAntMac1output
scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/newfile.txt newfile.txt
scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fEsoLucfreq.txt fEsoLucfreq.txt
fEsoLucfreq.txt
```
# Generating the necessary data to create a frequency plot
```
sort fAntMac10k -n | uniq -c > newfile.txt
```
# Downloading data from DNANexus using dxpy (conda package)
dx login
```
dx download -r "fEsoLuc1:/genomic_data/arima/fEsoLuc1_241128L001_R1.fastq.gz"
dx download -r "fEsoLuc1:/genomic_data/arima/fEsoLuc1_241128L001_R1.fastq.gz" /project/uma_lisa_komoroske/Tanya/download/vgp
```
# Formatting read pair distance data for a frequency plot
```
##original command from Giulio:
cat newfile.txt | sed 's/-//g' | sort -n | uniq -c | column -t -O "_" | sed 's/ *//g' | sed 's/_/\t/g'
```
# What do all these sed commands mean?
sed is a text editor that has so many commands I have a hard time keeping them all straight. A laymans interpretation:
```
cat newfile.txt | #show me the contents of this file
sed 's/-//g' | #replace every - with a space character
sort -n | #sort the contents line by line **numerically**
uniq -c | #how many times is a line repeated?
column -t "/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1" #display the contents of a file in columns -t create a table -O define the order in which columns are displayed
| sed 's/ *//g' | #replace * with a space globally
sed 's/_/\t/g' #replace _ with a horizontal tab \t
```
### We actually didn't need all that. Here's our final command:
```
cat file.txt | sed 's/-//g'| sort -n | uniq -c | column -t > fEsoLuc1_file.txt
```
# Download the output and load into R
```
scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fEsoLuc1_file.txt fEsoLuc1_file.txt
```
###### tags: `VGP`