---
title: 'mEleMax1'
---

*Asian elephant* VGP Genome
===

###### tags: `Elephant Project` `VGP`

:::construction::: UNDER CONSTRUCTION :::construction:::

Table of Contents
---

[TOC]

Aim
---

:::info
**Generate a VGP genome for the Asian elephant using the latest available version of the assembly pipeline (2.0+)**
:::

Data
---

* ***Transcriptomic data for annotation***

Total RNA-Seq (TruSeq total RNA protocol) with rRNA depletion using baits. Sequencing platform: HiSeq4000 (to be confirmed), PE 101 bp

|Sample ID|Paired reads #|Total throughput (Gbp)|
|:-:|:-:|:-:|
|1_291_Lung|203758250|41.1|
|2_445_thyroid|196528506|39.7|
|3_445_lymphnode|174584352|35.2|
|4_445_salivarygland|162864238|32.9|
|5_466_ovary|181593456|36.7|

:warning: Samples *2_445_thyroid*, *3_445_lymphnode* and *4_445_salivarygland* show a non-ideal GC distribution (among other issues):

![](https://i.imgur.com/KBj2ap8.png)
![](https://i.imgur.com/yIx9IHA.png)
![](https://i.imgur.com/pQCcZjp.png)

:mag_right: rRNA reads were extracted by mapping against 131 sequences from the [RNAcentral](https://rnacentral.org/) database, obtained using the query filter:

```
Loxodonta africana AND so_rna_type_name:"RRNA" AND rna_type:"rRNA" AND qc_warning_found:"False" AND so_rna_type_name:"Cytosolic_rRNA" AND TAXONOMY:"9785"
```

:+1: Low amount of rRNA found:

```csvpreview {header="true"}
Sample ID,Paired reads #, %
1_291_Lung,"967,478", 0.474
2_445_thyroid,"426,906", 0.217
3_445_lymphnode,"746,815",0.427
4_445_salivarygland,"511,723",0.314
5_466_ovary,"495,435",0.272
```

* ***Genomic data***

Male Asian elephant (origin: India), 2n=56 (27+1)

To download the available data:

```
mkdir -p ~/mEleMax1/genomic_data/pacbio_hifi
aws s3 cp s3://genomeark/species/Elephas_maximus/mEleMax1/genomic_data/pacbio_hifi/ ~/mEleMax1/genomic_data/pacbio_hifi --recursive --no-sign-request
mkdir -p ~/mEleMax1/genomic_data/bionano
aws s3 cp s3://genomeark/species/Elephas_maximus/mEleMax1/genomic_data/bionano/ ~/mEleMax1/genomic_data/bionano --recursive --no-sign-request
mkdir -p ~/mEleMax1/genomic_data/
aws s3 cp s3://genomeark/species/Elephas_maximus/mEleMax1/genomic_data/arima/ ~/mEleMax1/genomic_data/arima --recursive --no-sign-request
```

:::info
HiFi total Gbp: 171.14
HiFi **raw** coverage: ~53x
:::

|Sample Name|% Dups|% GC|Length|M Seqs|
|:-|:-:|:-:|:-:|:-:|
|m54306U_210321_092024.Q20|0.4%|40%|16,355 bp|0.5|
|m54306U_210421_201740.Q20|0.4%|40%|12,485 bp|0.4|
|m54306Ue_210627_165734.hifi_reads|0.3%|40%|10,374 bp|0.3|
|m54306Ue_210705_134503.hifi_reads|0.4%|40%|8,033 bp|0.4|
|m64055_210419_095605.Q20|0.6%|41%|16,152 bp|0.8|
|m64055_210425_003744.Q20|0.5%|40%|10,970 bp|0.7|
|m64330e_211019_195149.hifi_reads|1.4%|41%|15,931 bp|1.9|
|m64334e_211022_161721.hifi_reads|1.4%|41%|16,119 bp|2.1|
|m64334e_211024_014441.hifi_reads|1.4%|41%|16,346 bp|2.1|
|m64334e_211025_111049.hifi_reads|1.3%|41%|15,972 bp|2.0|

Results
---

:mag: All scripts used during the assembly are available on [**GitHub**](https://github.com/diegomics/VGP_Asian_elephant)

:black_circle: **_QC_**

:::success
**PacBio HiFi** quality looks GOOD! :slightly_smiling_face:
- Only 0.1% of reads were discarded during trimming
:::

---

![](https://i.imgur.com/SR53hdI.png)

```csvpreview {header="false"}
Estimated genome size (bp),"3,350,136,551"
Transition parameter*,36
Maximum read depth,108
```

*transition between haploid and diploid coverage depths

:::success
Genome size estimation is **very** reasonable
:::

:black_circle: **_Contig assembly (hifiasm) & Purge haplotigs and overlaps (purge_dups)_**

:::info
:bulb: Based on extensive testing with this data, a good and flexible way to approach the contigging stage may be to test the different **duplicate-purging options** within hifiasm (none `-l0`, light `-l1`, aggressive `-l2`, and aggressive with high heterozygosity rate `-l3`) and then run purge_dups, all with the VGP-suggested cutoffs (i.e., *Maximum read depth* for the **coverage upper bound** `--purge-max` in hifiasm, and *Transition parameter* and *Maximum read depth* for `-m` and `-u` in purge_dups, respectively)
:::

:point_right: Best results were obtained with the `-l2` and `-l3` options (see the full results of the options testing [here](https://hackmd.io/_mfIxNV4QYGnF5HD9LAUJA))

![](https://i.imgur.com/8jfTt4R.png)

* purge_dups cutoffs histograms for p1 and q2

**mEleMax1_l2**
![](https://i.imgur.com/J2BbO4X.png)

**mEleMax1_l3**
![](https://i.imgur.com/yca8iEC.png)

* K-mer Multiplicity of c1c2 and p1q2 (stacked)

**mEleMax1_l2**
![](https://i.imgur.com/MJmgfyl.png)

**mEleMax1_l3**
![](https://i.imgur.com/yvuwd0s.png)

* QV Score (detailed)

![](https://i.imgur.com/ljJAvU9.png)

* K-mer Completeness (detailed)

![](https://i.imgur.com/bVCxhPi.png)

* BUSCOv5 database: mammalia (detailed)

**mEleMax1_l2 p1**
```
C:96.1%[S:95.3%,D:0.8%],F:1.1%,M:2.8%,n:9226
8861 Complete BUSCOs (C)
8789 Complete and single-copy BUSCOs (S)
72 Complete and duplicated BUSCOs (D)
105 Fragmented BUSCOs (F)
260 Missing BUSCOs (M)
9226 Total BUSCO groups searched
```

**mEleMax1_l3 p1**
```
C:96.1%[S:95.3%,D:0.8%],F:1.1%,M:2.8%,n:9226
8862 Complete BUSCOs (C)
8789 Complete and single-copy BUSCOs (S)
73 Complete and duplicated BUSCOs (D)
105 Fragmented BUSCOs (F)
259 Missing BUSCOs (M)
9226 Total BUSCO groups searched
```

:::success
Assemblies with either the `-l2` or the `-l3` hifiasm option look **amazing**
Fewer than 120 contigs: considering that n=28, we may have about 4 contigs per chromosome :exploding_head:
Let's continue with both in parallel :muscle:
:::

:black_circle: **_Bionano scaffolding_**

:::info
Trailing N-bases were not trimmed from the scaffolded assembly (apparently there are none)
:::

![](https://i.imgur.com/0BC78yq.png)

* BUSCOv5 database: mammalia (detailed)

**mEleMax1_l2 s1**
```
C:96.0%[S:95.2%,D:0.8%],F:1.1%,M:2.9%,n:9226
8860 Complete BUSCOs (C)
8785 Complete and single-copy BUSCOs (S)
75 Complete and duplicated BUSCOs (D)
101 Fragmented BUSCOs (F)
265 Missing BUSCOs (M)
9226 Total BUSCO groups searched
```

**mEleMax1_l3 s1**
```
C:96.0%[S:95.2%,D:0.8%],F:1.1%,M:2.9%,n:9226
8859 Complete BUSCOs (C)
8785 Complete and single-copy BUSCOs (S)
74 Complete and duplicated BUSCOs (D)
103 Fragmented BUSCOs (F)
264 Missing BUSCOs (M)
9226 Total BUSCO groups searched
```

:::success
Assemblies are looking very good so far: contiguity is very good and the BUSCO scores are within the range sought. Hopefully, the HiC data will allow a good final result.
:::

:black_circle: **_HiC scaffolding_**

:::info
:bulb: Based on testing, the `-l3` option ended up being the best
:::

![](https://i.imgur.com/PBGYh5c.png)
![](https://i.imgur.com/CNx3WrR.png)
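
:black_circle: **_Appendix: sketch of the contigging option sweep_**

As a rough illustration of the contigging strategy described in the contig assembly step above, the following dry-run sketch only prints the commands for each hifiasm purge level and for the purge_dups cutoffs, without running anything. The read file name (`hifi_reads.fastq.gz`), the thread count, the output prefixes and the `PB.stat` input are assumptions made for the example, not the settings actually used; the scripts in the GitHub repository linked above remain the reference.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) one hifiasm command per purge level
# and the purge_dups calcuts command, using the cutoffs from the k-mer profile.
set -eu

MAX_DEPTH=108                 # "Maximum read depth" reported above
TRANSITION=36                 # "Transition parameter" reported above
THREADS=32                    # assumption: adjust to the machine
READS="hifi_reads.fastq.gz"   # hypothetical input file name

# Build the hifiasm command for a given purge level (0, 1, 2 or 3).
hifiasm_cmd() {
  echo "hifiasm -o mEleMax1_l$1 -t ${THREADS} -l$1 --purge-max ${MAX_DEPTH} ${READS}"
}

# Build the purge_dups calcuts command; PB.stat is assumed to come from pbcstat.
calcuts_cmd() {
  echo "calcuts -m ${TRANSITION} -u ${MAX_DEPTH} PB.stat > cutoffs"
}

for level in 0 1 2 3; do
  hifiasm_cmd "$level"
done
calcuts_cmd
```

Printing the commands first makes it easy to review the cutoffs before committing compute time; the printed lines can then be copied into a job script.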