Troubleshooting fAntMac1 assembly VGP

# Troubleshooting fAntMac1 assembly VGP ``` ssh tl50a@ghpcc06.umassrc.org module load anaconda2/4.4.0 module load java/1.8.0_77 module load gcc/8.1.0 #required for meryl source activate vgp ``` The low signal from our original HiC heatmap was my error -- I needed to manually cat the forward and reverse read files. However, I re-ran the hic heatmap, and the signal is still incredibly low. ## fAntMac1 ![](https://i.imgur.com/3dKRYje.png) ![](https://i.imgur.com/FyPvD20.jpg) ## Compare to fEsoLuc1 ![](https://i.imgur.com/0VKx6vB.png) We need to determine whether this is the product of poor quality data OR a species data swap. fAntMac1 c1 and c2 files were originally swapped with another species, so this is not outside the realm of possibility. # Let's explore data quality first: ## 1. Compare mapping rate between fAntMac1 and fEsoLuc1 ### Extract mapped reads ``` bsub -q short -W 3:59 -R rusage[mem=8000] -n 4 -R span\[hosts=1\] "samtools flagstat /project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fAntMac1_ILKL1AMS714_R1.bam" ``` #### fAntMac1 > 506609438 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary ***23958488*** + 0 duplicates #this is the number of interest 506609438 + 0 mapped (100.00% : N/A) 506609438 + 0 paired in sequencing 253304719 + 0 read1 253304719 + 0 read2 506609438 + 0 properly paired (100.00% : N/A) 506609438 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 62021090 + 0 with mate mapped to a different chr 62021090 + 0 with mate mapped to a different chr (mapQ>=5) #### fEsoLuc1 > 579281612 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary ***80053582*** + 0 duplicates #this number 579281612 + 0 mapped (100.00% : N/A) 579281612 + 0 paired in sequencing 289640806 + 0 read1 289640806 + 0 read2 579281612 + 0 properly paired (100.00% : N/A) 579281612 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 220888442 + 0 with mate mapped to a different chr 220888442 + 0 with mate mapped to a different chr (mapQ>=5) ### Extract n reads from hic fastq ``` echo $(cat fAntMac1_ILKL1AMS714_R1.fastq.gz |wc -l)/4|bc echo $(cat fEsoLuc1_241128L001_R1.fastq.gz |wc -l)/4|bc ``` #### fAntMac1 376902687 #### fEsoLuc1 453200188 ### Divide mapped reads/total reads #### fAntMac1 23958488/376902687 = ~6% mapping rate #### fEsoLuc1 80053582/453200188= ~17.6% mapping rate ### **Synopsis**: We find that fAntMac1 has a much lower mapping rate. Is this within the expected range of values for VGP assemblies? The low mapping rate may be because the data is poor quality or due to a species swap. Mapping rate isn't a very conclusive indicator of what's going on. ## 2. Let's plot the distribution of read pair distance to have a closer look ### View the first 10,000 records of a bam ``` samtools view fAntMac1_ILKL1AMS714_R1.bam | head -n 10000 ``` ### Extract read pair distances from .bam Based on the bam file, read pair distance (contact length) should be the 9th column. Extract it with: ``` samtools view fEsoLuc1_241128L001_R1.bam | awk '{print $9}' > fEsoLuc1output samtools view fAntMac1_ILKL1AMS714_R1.bam | awk '{print $9 > "fAntMac1output"}' ``` ### Select the first 10k reads ``` awk '{print $1} NR==10000{exit}' fEsoLuc1output > fEsoLuc10k awk '{print $1} NR==10000{exit}' fAntMac1output > fAntMac10k ``` ### Sort read pair distances by frequency (count) ``` sort fAntMac10k -n | uniq -c > newfile.txt sort fAntMac10k -n | uniq -c > fEsoLucfreq.txt ``` ### Download the output locally ``` scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/newfile.txt newfile.txt scp -r tl50a@ghpcc06.umassrc.org:/project/uma_lisa_komoroske/Tanya/download/vgp/fAntMac1/fEsoLucfreq.txt fEsoLucfreq.txt ``` ### Plot the distribution of read pair distance in a frequency plot (in R) You can see my R script here: {link to Rpubs} When contact length = 0 it is an interscaffold interaction ![](https://i.imgur.com/3gkCbWw.png) ![](https://i.imgur.com/qmqDRWH.png) ### **Synopsis:** The distribution of read pair distances for fAntMac1 is unusually short and there's just fewer reads overall.fAntMac1 has far fewer (n=1195) interchromosomal interactions (read pair distance =0) compared to fEsoLuc1 (n=4874) ## 3. Check the quality of the fastq data *(we should have done this first)* We might just be working with poor quality data here, in which case we would lose a lot of reads during pre-processing (filtering). ``` bsub -q short -W 1:00 -R rusage[mem=16000] -n 4 -R span\[hosts=1\] "./step1_fastqc.sh fEsoLuc1_241128L001_R1" ``` #### fEsoLuc1 #this is the hic fastq data ![](https://i.imgur.com/GGIIHNH.png) #### fAntMac1 ![](https://i.imgur.com/QXk3hvO.png) *In the future we need to determine a distribution of mapping% across all of our hiC datasets. This would give us a good baseline.* ### **Synopsis**: Read length for fAntMac1 was 101bp vs 150bp for fEsoLuc1. You'll also notice that fAntMac1 has a substantial portion of low quality reads. We are likely losing lots of reads prior to mapping. This doesn't necessarily rule out a species swap, but is certainly a contributing factor: 1) fewer reads 2) shorter reads = less unique mapping (when reads map to multiple locations they are discarded) 3) poor quality reads = lower mapping score in bwa, so discarded. When reads are discarded at this stage they are discarded as a pair. This is certainly contributing to the poor signal seen in our hic heatmap. 2. *In the future we need to determine a distribution of mapping% across all of our hiC datasets. This would give us a good baseline.* # Is it possible that this is species swap? We need to find a QC method that produces a reliable signature when we make a mistake. With 4 data types, multiple sequencing centers, and hundreds of species, it's too easy to have accidental sample swapping. Kerstin suspects these results have signatures of species swapping, so that may still be an explanation for our anomalous results. ## 4. Can mashdist detect a species swap? take 10x, Hic and Pacbio data from two well-assembled fish and run masmap. 2. take the same fish and swap hic runs and run masmap again 3. this will tell us if masmap produces a noticeable signature of species swapping https://files.slack.com/files-pri/T9VF096LR-FRPULMR9B/fesoluc1.png https://files.slack.com/files-pri/T9VF096LR-FRRQUGDDM/mash.combined_genomes.png There are obvious differences between these plots, but file size is affecting the plot too. Whatever is going on the differences between the two datasets looks like night and day. So the values with m....subread as the PacBio sequence runs? The trimmer is the 10XG? And the ones that begin with the species abbreviation the HiC? (yes). What I see in the fAntMac1 plot is three major clusters. The first is PacBio only data. The second has all the HiC data plus some PacBio data, and only two 10X runs. The third has the third cohort of PacBio data plus all of the remaining 10XG data. I believe the Bionano data for this species (warty frogfish) was obtained form another individual because of running out of sample. I wonder if any other data was obtained from another animal? (is the hic data from a third individual) ### **Synopsis** mashdist is not informative enough to say conclusively whether this is a species swap or now. We should switch to a kmer-based approach ## 5. Can kmer distribution detect a species swap? 1. Generate a kmer database for each dataset (hic, 10x) 2. Filter kmers based on their coverage (filter out low coverage kmers 3. If both datasets are from the same individual/species, the kmers from the hic dataset will be present in the 10x dataset. 4. If they are not, it's a different individual/species ### hiC job k=$1 21 input_fofn=$2 dataset/object out_prefix=$3 fAntMac1hic mem_opt=$4 8 -threads n -memory mMb Original submission: bsub -q long -W 8:00 -R rusage[mem=8000] -n 8 -R span\[hosts=1\] "/project/uma_lisa_komoroske/bin/meryl/scripts/_submit_meryl2_build2.sh 21 fAntMac1_hic.fastq.gz fAntMac1 48 threads=$cores" ``` bsub -q long -W 8:00 -R rusage[mem=48000] -n 64 "/project/uma_lisa_komoroske/bin/meryl/scripts/_submit_meryl2_build2.sh 21 fAntMac1_hic.fastq.gz fAntMac1hic 48000 threads=64" #should be 48000 (64 cores x 48000 MB) ``` ### Build meryl dbs for 10x data fAntMac1 trim barcodes off of 10x reads, and build meryl db ### 10x job 2nd try (success) Maybe try using _submit_meryl2_build_10x.sh _submit_meryl2_build_10x.sh trims the 10X barcodes from illumina reads, and runs the 1. Build 2. Merge steps. The barcode trimming step uses scaff_reads from scaff10x. k=$1 21 R1=$2 fAntMac1_R1_001.fastq.gz R2=$3 fAntMac1_R2_001.fastq.gz out_prefix=$4 fAntMac10x ``` bsub -q long -W 8:00 -R rusage[mem=48000] -n 64 "/project/uma_lisa_komoroske/bin/meryl/scripts/_submit_meryl2_build_10x.sh 21 fAntMac1_R1_001.fastq.gz fAntMac1_R2_001.fastq.gz fAntMac10x memory=48000 threads=64" ``` Final merged meryl db will be named as <out_prefix>.meryl ls *_R1_001.fastq.gz >input.fofn #maybe rename the input as this ### Check the output with meryl statistics your_meryl_db | head st ### **Synopsis** ###### tags: `VGP`