---
title: 'mEleMax1'
---
*Asian elephant* VGP Genome
===
###### tags: `Elephant Project` `VGP`
:construction: UNDER CONSTRUCTION :construction:
Table of Contents
---
[TOC]
Aim
---
:::info
**Generate a VGP genome for the Asian elephant using the latest available version of the assembly pipeline (2.0+)**
:::
Data
---
* ***Transcriptomic data for annotation***
Total RNA-seq (TruSeq total RNA protocol) with bait-based rRNA depletion. Sequenced on a HiSeq 4000 (platform to be confirmed), PE 101 bp
|Sample ID|Paired reads #|Total throughput (Gbp)|
|:-:|:-:|:-:|
|1_291_Lung|203758250|41.1|
|2_445_thyroid|196528506|39.7|
|3_445_lymphnode|174584352|35.2|
|4_445_salivarygland|162864238|32.9|
|5_466_ovary|181593456|36.7|
:warning: Samples *2_445_thyroid*, *3_445_lymphnode* and *4_445_salivarygland* show a non-ideal GC content distribution (among other issues).



:mag_right: rRNA reads were identified by mapping against 131 sequences from the [RNAcentral](https://rnacentral.org/) database, obtained with the query filter:
```
Loxodonta africana AND so_rna_type_name:"RRNA" AND rna_type:"rRNA" AND qc_warning_found:"False" AND so_rna_type_name:"Cytosolic_rRNA" AND TAXONOMY:"9785"
```
:+1: Low amount of rRNA found:
```csvpreview {header="true"}
Sample ID,rRNA paired reads #,% of total paired reads
1_291_Lung,"967,478", 0.474
2_445_thyroid,"426,906", 0.217
3_445_lymphnode,"746,815",0.427
4_445_salivarygland,"511,723",0.314
5_466_ovary,"495,435",0.272
```
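The percentages can be re-derived from the two tables as a quick sanity check. This is a minimal sketch that assumes each percentage is the number of rRNA read pairs divided by the total read pairs of that sample; the function name `rrna_percent` is hypothetical. The values in the table look truncated rather than rounded, so the last digit may differ by one.

```bash
#!/usr/bin/env bash
# Percentage of read pairs assigned to rRNA, printed with three decimals.
# Arguments: <rRNA read pairs> <total read pairs>
rrna_percent() {
  awk -v rrna="$1" -v total="$2" 'BEGIN { printf "%.3f\n", 100 * rrna / total }'
}

# Counts taken from the two tables above.
rrna_percent 967478 203758250   # 1_291_Lung
rrna_percent 426906 196528506   # 2_445_thyroid
```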
* ***Genomic data***
Male Asian elephant (origin: India)
2n = 56 (27 autosome pairs + 1 sex-chromosome pair)
To download the available data:
```
# Download each data type into a matching local directory (public bucket, no credentials needed)
mkdir -p ~/mEleMax1/genomic_data/pacbio_hifi
aws s3 cp s3://genomeark/species/Elephas_maximus/mEleMax1/genomic_data/pacbio_hifi/ ~/mEleMax1/genomic_data/pacbio_hifi --recursive --no-sign-request
mkdir -p ~/mEleMax1/genomic_data/bionano
aws s3 cp s3://genomeark/species/Elephas_maximus/mEleMax1/genomic_data/bionano/ ~/mEleMax1/genomic_data/bionano --recursive --no-sign-request
mkdir -p ~/mEleMax1/genomic_data/arima
aws s3 cp s3://genomeark/species/Elephas_maximus/mEleMax1/genomic_data/arima/ ~/mEleMax1/genomic_data/arima --recursive --no-sign-request
```
:::info
HiFi total Gbp: 171.14
HiFi **raw** coverage: ~53x
:::
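Raw coverage is the total HiFi yield divided by the genome size. The genome size behind the ~53x figure is not stated here, so the sketch below assumes the estimated genome size reported under Results (3,350,136,551 bp), which gives a slightly lower value; ~53x would correspond to a genome of about 3.23 Gbp. The function name `coverage` is hypothetical.

```bash
#!/usr/bin/env bash
# Raw coverage = total sequenced bases / genome size.
# Arguments: <total yield in Gbp> <genome size in bp>
coverage() {
  awk -v gbp="$1" -v size="$2" 'BEGIN { printf "%.1f\n", gbp * 1e9 / size }'
}

# Assumption: genome size taken from the k-mer based estimate in the QC section.
coverage 171.14 3350136551
```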
|Sample Name|% Dups|% GC|Length|M Seqs|
|:-|:-:|:-:|:-:|:-:|
|m54306U_210321_092024.Q20|0.4%|40%|16,355 bp|0.5|
|m54306U_210421_201740.Q20|0.4%|40%|12,485 bp|0.4|
|m54306Ue_210627_165734.hifi_reads|0.3%|40%|10,374 bp|0.3|
|m54306Ue_210705_134503.hifi_reads|0.4%|40%|8,033 bp|0.4|
|m64055_210419_095605.Q20|0.6%|41%|16,152 bp|0.8|
|m64055_210425_003744.Q20|0.5%|40%|10,970 bp|0.7|
|m64330e_211019_195149.hifi_reads|1.4%|41%|15,931 bp|1.9|
|m64334e_211022_161721.hifi_reads|1.4%|41%|16,119 bp|2.1|
|m64334e_211024_014441.hifi_reads|1.4%|41%|16,346 bp|2.1|
|m64334e_211025_111049.hifi_reads|1.3%|41%|15,972 bp|2.0|
Results
---
:mag: All scripts used during the assembly are available on [**GitHub**](https://github.com/diegomics/VGP_Asian_elephant)
:black_circle: **_QC_**
:::success
**PacBio HiFi** quality looks GOOD! :slightly_smiling_face:
- Only 0.1% of reads were discarded during trimming
:::
---

```csvpreview {header="false"}
Estimated genome size (bp),"3,350,136,551"
Transition parameter*,36
Maximum read depth,108
```
*transition between haploid and diploid coverage depths
:::success
The genome size estimate is **very** reasonable
:::
:black_circle: **_Contig assembly (hifiasm) & Purge haplotigs and overlaps (purge_dups)_**
:::info
:bulb: Based on extensive testing with this data, a good and flexible approach to the contigging stage may be to test the different **duplicate-purging levels** within hifiasm (none `-l0`, light `-l1`, aggressive `-l2`, and aggressive for high heterozygosity `-l3`) and then run purge_dups, all with the VGP-suggested cutoffs (i.e., *Maximum read depth* for the **coverage upper bound** `--purge-max` in hifiasm, and *Transition parameter* and *Maximum read depth* for `-m` and `-u` in purge_dups, respectively)
:::
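The parameter sweep described above can be sketched as a dry run that only prints the commands. The flags (`-l`, `--purge-max` for hifiasm; `-m`, `-u` for the purge_dups `calcuts` step) and the cutoff values come from the text and the QC table; the read file name, output prefixes, thread count and `PB.stat` paths are hypothetical placeholders, and the remaining purge_dups steps are omitted.

```bash
#!/usr/bin/env bash
# Cutoffs from the QC table.
MAX_DEPTH=108      # Maximum read depth
TRANSITION=36      # Transition parameter (haploid/diploid)
READS="hifi_reads.fastq.gz"   # hypothetical input file name

# Print (do not run) one hifiasm and one calcuts command per purge level.
print_contigging_commands() {
  local level
  for level in 0 1 2 3; do
    echo "hifiasm -o mEleMax1_l${level} -t 32 -l ${level} --purge-max ${MAX_DEPTH} ${READS}"
    echo "calcuts -m ${TRANSITION} -u ${MAX_DEPTH} mEleMax1_l${level}/PB.stat > mEleMax1_l${level}/cutoffs"
  done
}

print_contigging_commands
```

Printing the commands first makes it easy to review the four parameter combinations before submitting them to a cluster.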
:point_right: The best results were obtained with the `-l2` and `-l3` options (see the full results of the options testing [here](https://hackmd.io/_mfIxNV4QYGnF5HD9LAUJA))

* purge_dups cutoffs histograms for p1 and q2
**mEleMax1_l2**

**mEleMax1_l3**

* K-mer Multiplicity of c1c2 and p1q2 (stacked)
**mEleMax1_l2**

**mEleMax1_l3**

* QV Score (detailed)

* Kmer Completeness (detailed)

* BUSCOv5 database: mammalia (detailed)
**mEleMax1_l2 p1**
```
C:96.1%[S:95.3%,D:0.8%],F:1.1%,M:2.8%,n:9226
8861 Complete BUSCOs (C)
8789 Complete and single-copy BUSCOs (S)
72 Complete and duplicated BUSCOs (D)
105 Fragmented BUSCOs (F)
260 Missing BUSCOs (M)
9226 Total BUSCO groups searched
```
**mEleMax1_l3 p1**
```
C:96.1%[S:95.3%,D:0.8%],F:1.1%,M:2.8%,n:9226
8862 Complete BUSCOs (C)
8789 Complete and single-copy BUSCOs (S)
73 Complete and duplicated BUSCOs (D)
105 Fragmented BUSCOs (F)
259 Missing BUSCOs (M)
9226 Total BUSCO groups searched
```
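The one-line BUSCO summaries can be re-derived from the counts as a consistency check. In this sketch, S and D are rounded to one decimal first and C is taken as their sum; that is an assumption, chosen because it reproduces the reported 96.1% (dividing 8861 by 9226 directly would give 96.0%). The function name `busco_summary` is hypothetical.

```bash
#!/usr/bin/env bash
# Rebuild the one-line BUSCO summary from raw counts.
# Arguments: <single-copy> <duplicated> <fragmented> <missing>
busco_summary() {
  awk -v s="$1" -v d="$2" -v f="$3" -v m="$4" 'BEGIN {
    n  = s + d + f + m                 # total BUSCO groups searched
    sp = sprintf("%.1f", 100 * s / n)  # S, rounded to one decimal
    dp = sprintf("%.1f", 100 * d / n)  # D, rounded to one decimal
    c  = sp + dp                       # assumption: C is the sum of rounded S and D
    printf "C:%.1f%%[S:%s%%,D:%s%%],F:%.1f%%,M:%.1f%%,n:%d\n",
      c, sp, dp, 100 * f / n, 100 * m / n, n
  }'
}

busco_summary 8789 72 105 260   # mEleMax1_l2 p1
busco_summary 8789 73 105 259   # mEleMax1_l3 p1
```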
:::success
Assemblies with either the `-l2` or the `-l3` hifiasm option look **amazing**
Fewer than 120 contigs: with n = 28, that is roughly 4 contigs per chromosome :exploding_head:
Let's continue with both in parallel :muscle:
:::
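The "about 4 contigs per chromosome" figure is simple division. The sketch below uses 120 as an upper bound for the contig count (the exact number is not given here) and the haploid number n = 28; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Average contigs per chromosome.
# Arguments: <number of contigs> <haploid chromosome number>
contigs_per_chromosome() {
  awk -v contigs="$1" -v n="$2" 'BEGIN { printf "%.1f\n", contigs / n }'
}

# 120 is an upper bound, since the assemblies have fewer than 120 contigs.
contigs_per_chromosome 120 28
```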
:black_circle: **_Bionano scaffolding_**
:::info
Trailing N bases were not trimmed from the scaffolded assembly (apparently there are none)
:::

* BUSCOv5 database: mammalia (detailed)
**mEleMax1_l2 s1**
```
C:96.0%[S:95.2%,D:0.8%],F:1.1%,M:2.9%,n:9226
8860 Complete BUSCOs (C)
8785 Complete and single-copy BUSCOs (S)
75 Complete and duplicated BUSCOs (D)
101 Fragmented BUSCOs (F)
265 Missing BUSCOs (M)
9226 Total BUSCO groups searched
```
**mEleMax1_l3 s1**
```
C:96.0%[S:95.2%,D:0.8%],F:1.1%,M:2.9%,n:9226
8859 Complete BUSCOs (C)
8785 Complete and single-copy BUSCOs (S)
74 Complete and duplicated BUSCOs (D)
103 Fragmented BUSCOs (F)
264 Missing BUSCOs (M)
9226 Total BUSCO groups searched
```
:::success
The assemblies look very good so far: contiguity is very good and the BUSCO scores are within the target range. Hopefully, the Hi-C data will allow a good final result.
:::
:black_circle: **_HiC scaffolding_**
:::info
:bulb: Based on testing, the `-l3` option ended up being the best
:::

