---
title: 'Andean condor Project'
---
*Pehuel* Genome Assembly
===
###### tags: `Condor Project`
*Pehuel* is a captive male Andean condor used as a primary genetic founder in the breeding program of the Buenos Aires Zoo. The approximately 65-year-old individual is originally from the Andean province of Mendoza (Argentina), a region encompassing the highest mountains in the Americas (up to 6,961 masl).
### Available Data
* WGS: 1 lane HiSeq 4000 2x151bp (75.14 Gbp)
* ONT: 2 MinION runs (2.39 Gbp)
* HiC: 1 lane HiSeq 2000v4 2x76bp (30.27 Gbp)
## QC and trimming
|Sample|Length (bp)|M Seqs|
|-|:-:|:-:|
|WGS.a.1|151|55.4|
|WGS.a.2|151|55.4|
|WGS.b.1|151|193.4|
|WGS.b.2|151|193.4|
|ONT|13436|0.3|
|HiC.1|76|199.2|
|HiC.2|76|199.2|
:scissors: After mitochondrial read removal (demito mapping) and trimming
|Sample|Length (bp)|M Seqs|
|-|:-:|:-:|
|WGS.1|150|246.3|
|WGS.2|147|246.3|
|ONT|13437|0.3|
|HiC.1|75|198.3|
|HiC.2|75|198.3|
:::info
Quality of the WGS and HiC reads is OK. ONT sequencing was **very** low throughput.
:::
---

```csvpreview {header="false"}
Estimated genome size,1 315 114 606 bp
Transition parameter*,33
Maximum read depth,99
```
*transition between haploid and diploid coverage depths
:::success
The genome size estimate is reasonable. The heterozygosity is low.
:::
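As a rough cross-check of the throughput comments above, the expected mean depth of each library can be sketched as yield divided by estimated genome size. This is a minimal sketch, assuming the raw yields from the "Available Data" list and the genome size estimate reported here; the function and variable names are illustrative only.

```python
# Raw yields in Gbp, copied from the "Available Data" list.
YIELDS_GBP = {"WGS": 75.14, "ONT": 2.39, "HiC": 30.27}
# Estimated genome size in bp, copied from the table above.
GENOME_SIZE_BP = 1_315_114_606


def depth(yield_gbp, genome_size_bp=GENOME_SIZE_BP):
    """Expected mean depth: sequenced bases divided by genome size."""
    return yield_gbp * 1e9 / genome_size_bp


for library, gbp in YIELDS_GBP.items():
    print(f"{library}: {depth(gbp):.1f}x")
```

With these inputs the WGS library comes out at roughly 57x and the ONT runs at under 2x, which is consistent with the ONT data being too shallow for a hybrid assembly.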
[Assembly pipeline](https://github.com/diegomics/Pahuel_assembly_pipeline)
===

MaSuRCA assembly & purge_dups
===
WGS library insert size values used for Illumina-only assembly
```csvpreview {header="false"}
Mean,350
SD,77
```
* Assembly output (only the primary)
```csvpreview {header="false"}
N50,216580
Sequence,1220326965
Average,33478.6
E-size,277595
Count,36451
```
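For reference, the contiguity metrics in this table can be sketched from a list of contig lengths. This is a minimal illustration on toy data, assuming the usual definitions (N50 as the length at which the cumulative sorted length reaches half of the assembly, E-size as sum(L^2)/sum(L)); it is not the code MaSuRCA itself uses.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of the total."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("empty list of lengths")


def e_size(lengths):
    """Expected length of the contig containing a randomly chosen base."""
    return sum(l * l for l in lengths) / sum(lengths)


toy = [2, 3, 4, 5, 6]          # toy contig lengths, not real data
print(n50(toy), e_size(toy))   # 5 4.5

# The reported Average agrees with Sequence / Count from the table above.
print(round(1_220_326_965 / 36_451, 1))  # 33478.6
```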
:::warning
The Illumina-only assembly explains the fragmentation and low N50. The long reads are not enough for a hybrid approach.
:::
---
Pseudohaplotype.1

Both pseudohaplotypes

:::info
The amount of retained haplotypes looks low.
:::
---
:::info
:bulb: **Cutoffs:** The ++low++ coverage value marks any contig with an average coverage below that number as a junk contig, while the ++high++ coverage value marks contigs with an average coverage above that number as likely repeats. The ++middle++ value is the diploid coverage.
> If there is a single peak, the first cutoff should be placed just after the valley, the middle cutoff at 0.75\*seq-coverage, and the last one at 3\*seq-coverage. We could try this using the sequencing coverage value from purge_dups (57). Another possibility is to follow the VGP recommendation and use the Transition parameter (33) as the middle value and the Maximum read depth (99) as the high value.
:::
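The single-peak rule of thumb quoted above can be written out as a small calculation. This is a sketch under the assumptions stated in the note (middle = 0.75 \* coverage, high = 3 \* coverage, low chosen by eye from the histogram); the function names and the labels returned by `classify` are hypothetical and are not purge_dups output.

```python
def single_peak_cutoffs(seq_coverage, low):
    """Return (low, middle, high) cutoffs from the rule of thumb.

    `low` must be read off the coverage histogram (just after the valley),
    so it is passed in by hand rather than computed.
    """
    return low, round(0.75 * seq_coverage), round(3 * seq_coverage)


def classify(mean_cov, low, high):
    """Illustrative labelling of a contig by its average coverage."""
    if mean_cov < low:
        return "junk"
    if mean_cov > high:
        return "repeat-like"
    return "retained"


print(single_peak_cutoffs(57, low=10))   # (10, 43, 171)
print(classify(4, low=10, high=171))     # junk
```

With the purge_dups coverage of 57 this reproduces the 10-43-171 combination tested in the comparison below; the 33 and 99 alternatives come directly from the genome size estimation table rather than from this formula.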

`mean: 56, peak: 57, mean not larger than peak, treat as haploid assembly`
* Assembly size comparison:
|Assembly|purged (n)|haplotigs (n)|
|---|:---:|:---:|
|c1|1220.32 Mbp (36451)|-|
|c2|16.71 Mbp (17562)|-|
|cutoffs (default) 5-43-129|1210.30 Mbp (27022)|10.03 Mbp (9486)|
|cutoffs 10-43-129 <--|1210.18 Mbp (26769)|10.14 Mbp (9740)|
|cutoffs 10-43-171|1210.99 Mbp (27205)|9.38 Mbp (9304)|
|cutoffs 12-43-171|1210.92 Mbp (27097)|9.40 Mbp (9414)|
|cutoffs 10-43-99|1209.62 Mbp (26491)|10.71 Mbp (10019)|
|cutoffs 5-33-99|1210.26 Mbp (26913)|10.06 Mbp (9594)|
|cutoffs 10-33-99|1210.15 Mbp (26665)|10.18 Mbp (9843)|
|cutoffs 12-33-99|1210.08 Mbp (26556)|10.24 Mbp (9954)|
---
* BUSCO (aves_odb10)
||masurca.c1|cutoffs (default)|cutoffs* 10-43-129|
|-|:-:|:-:|:-:|
|Complete BUSCOs|7882|7879|7879|
|Complete & single-copy BUSCOs|7841|7838|7838|
|Complete & duplicated|41|41|41|
|Fragmented|205|206|206|
|Missing|251|253|253|
|Total|8338|8338|8338|
:::info
Purging does not improve dramatically over c1. I'm going to use the softest setting (cutoffs 10-43-129).
:::
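To put the BUSCO counts above on a percentage scale, the following sketch converts them using the total of 8338 and checks that the categories add up. The counts are copied from the table; the dictionary layout and helper name are illustrative only.

```python
TOTAL = 8338  # total BUSCO groups searched (aves_odb10), from the table above

counts = {
    "masurca.c1": {"complete": 7882, "single": 7841, "duplicated": 41,
                   "fragmented": 205, "missing": 251},
    "cutoffs 10-43-129": {"complete": 7879, "single": 7838, "duplicated": 41,
                          "fragmented": 206, "missing": 253},
}


def percent(n, total=TOTAL):
    """Share of the BUSCO set, in percent."""
    return 100 * n / total


for name, c in counts.items():
    # Sanity checks: the categories must partition the BUSCO set.
    assert c["single"] + c["duplicated"] == c["complete"]
    assert c["complete"] + c["fragmented"] + c["missing"] == TOTAL
    summary = ", ".join(f"{k}: {percent(v):.2f}%" for k, v in c.items())
    print(f"{name} -> {summary}")
```

The two assemblies differ by only 3 complete BUSCOs out of 8338, which is why purging is described here as not changing completeness dramatically.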
c1c2
p1q2
merqury and all the rest
RagTag & LongStitch
===
p1 vs s1
|Assembly|p1|s1|
|-|:-:|:-:|
|# contigs (total)|||
|# contigs (>= 5 Kbp)|||
|# contigs (>= 25 Kbp)|||
|Length (total)|||
|Length (>= 5 Kbp)|||
|Length (>= 25 Kbp)|||
|Largest contig|||
|Total length|||
|Estimated reference length|||
|GC (%)|||
|N50|||
|N75|||
|L50|||
|L75|||
|# N's per 100 kbp|||

||p1|s1|
|-|:-:|:-:|
|Complete BUSCOs|||
|Complete & single-copy BUSCOs|||
|Complete & duplicated|||
|Fragmented|||
|Missing|||
|Total|||

|Assembly Name|QV|Completeness (%)|
|-|:-:|:-:|
|p1|||
|s1|||
3D-DNA
===
s1 vs s2
|Assembly|s1|s2|
|-|:-:|:-:|
|# contigs (total)|||
|# contigs (>= 5 Kbp)|||
|# contigs (>= 25 Kbp)|||
|Length (total)|||
|Length (>= 5 Kbp)|||
|Length (>= 25 Kbp)|||
|Largest contig|||
|Total length|||
|Estimated reference length|||
|GC (%)|||
|N50|||
|N75|||
|L50|||
|L75|||
|# N's per 100 kbp|||

||s1|s2|
|-|:-:|:-:|
|Complete BUSCOs|||
|Complete & single-copy BUSCOs|||
|Complete & duplicated|||
|Fragmented|||
|Missing|||
|Total|||

|Assembly Name|QV|Completeness (%)|
|-|:-:|:-:|
|s1|||
|s2|||
DENTIST & PILON
===
|Assembly|s2|Pehuel.pri|
|-|:-:|:-:|
|# contigs (total)|||
|# contigs (>= 5 Kbp)|||
|# contigs (>= 25 Kbp)|||
|Length (total)|||
|Length (>= 5 Kbp)|||
|Length (>= 25 Kbp)|||
|Largest contig|||
|Total length|||
|Estimated reference length|||
|GC (%)|||
|N50|||
|N75|||
|L50|||
|L75|||
|# N's per 100 kbp|||

||s2|Pehuel.pri|
|-|:-:|:-:|
|Complete BUSCOs|||
|Complete & single-copy BUSCOs|||
|Complete & duplicated|||
|Fragmented|||
|Missing|||
|Total|||
### 093
| ASM_ID | Bases_Mb | GC_% | Scaff | Cont | Gaps | Longest_Scaff_Mb | Scaff_N50_Mb | Scaff L50 | Scaff_N95_Mb | Scaff L95 | Longest_Cont_Mb | Cont N50 | Cont L50 | Cont N95 | Cont L95 | QV | KComplet | BUSCO-C % | BUSCO-S % | hap | both Hap | shared |
|:--------- | --------:| -----:| ---------:| ---------:| ------:| ----------------:| ------------:| ---------:| ------------:| ---------:| ---------------:| -----------:| --------:| -----------:| --------:| -----:| ------------:| -------------:| --------------------:| ------------------------------------ | -------- | ------ |
| MaS | 1,219.46 | 42.36 | 38,187 | 43,005 | 4,818 | 1.51 | 0.19 | 0.19 | 0.02 | 0.02 | 0.99 | 0.14 | 0.14 | 0.02 | 0.02 | 50.14 | 97.72 | 94.2 | 93.7 | | | |
| ALGA | 1,209.06 | 42.23 | 213,310 | 213,310 | 0 | 0.16 | 0.02 | 0.02 | 0 | 0 | 0.16 | 0.02 | 0.02 | 0 | 0 | inf | 97.55 | 61.9 | 61.7 |  | - | - |
| Plat | 1,369.29 | 43.46 | 2,098,110 | 2,098,110 | 0 | 0.14 | 0.01 | 0.01 | 0 | 0 | 0.14 | 0.01 | 0.01 | 0 | 0 | inf | 98.12 | 58.9 | 58.7 | | | |
| Plat_Scaf | 1,254.98 | 42.62 | 649,266 | 730,697 | 81,431 | 0.83 | 0.06 | 0.06 | 0 | 0 | 0.23 | 0.02 | 0.02 | 0 | 0 | 66.36 | 98.03 | 88.2 | 87.9 | | | |
| Plat_Gap | 1,255.45 | 42.62 | 649,266 | 686,184 | 36,918 | 0.83 | 0.06 | 0.06 | 0 | 0 | 0.28 | 0.03 | 0.03 | 0 | 0 | 64.92 | 98.06 | 88.3 | 87.9 | | | |

| ASM_ID | Total Mb | Scaf | Cont | Gaps | Gaps Mb | Longest Scaf | Scaf N50 | Scaf L50 | Scaf N95 | Scaf L95 | Longest Cont | Cont N50 | Cont L50 | Cont N95 | Cont L95 | QV | KCompl | BUSCO-C % | BUSCO-S % |
|:---------- | --------:| ------:| ------:| ------:| --- | ----------------:| ------------:| -------------:| ------------:| -------------:| ---------------:| -----------:| ------------:| -----------:| ------------:| -----:| ------------:| -------------:| --------------------:|
| AL.pur.Rag | 1,241.86 | 69,106 | 210,892 | 141,786 | 33.54 | 219.88 | 85.35 | 5 | 6.41 | 27 | 0.16 | 0.02 | 22,831 | 0 | 102,052 | 82.79 | 97.51 | 96.6 | 96.3 |
| MA.pur.Rag | 1228.86 | 24605 | 41567 | 16962 | 11.46 | 70.32 | 16.53 | 24 | 0.95 | 109 | 1.08 | 0.16 | 2,301 | 0.02 | 10,339 | 50.14 | 97.63 | 97.1 | 96.8 |
| PL.pur.Rag | 1279.74 | 545216 | 730531 | 185315 | 27.45 | 219.16 | 84.94 | 5 | 0 | 34,096 | 0.23 | 0.02 | 20,314 | 0 | 224,725 | 66.31 | 98.01 | 97.2 | 97 |
| AL.Rag.pur | 1237.88 | 68307 | 205527 | 137220 | 32.78 | 219.89 | 85.4 | 5 | 6.4 | 27 | 0.16 | 0.02 | 22,724 | 0 | 100,562 | 82.78 | 97.42 | 96.6 | 96.3 |
| MA.Rag.pur | 1226.77 | 24113 | 40279 | 16166 | 11.32 | 70.32 | 17.08 | 23 | 1.07 | 107 | 1.08 | 0.16 | 2,295 | 0.02 | 10,240 | 50.19 | 97.59 | 97.1 | 96.8 |
| PL.Rag.pur | 1271.58 | 543737 | 706237 | 162500 | 24.41 | 219.15 | 84.81 | 5 | 0 | 35,255 | 0.23 | 0.02 | 20,170 | 0 | 203,494 | 66.32 | 97.86 | 97.2 | 97 |

| ASM_ID | ASM_LEVEL | Bases_Mb | Est_Size_Mb | Het_% | GC_% | Scaff | Cont | Gaps | Longest_Scaff_Mb | Scaff_N50_Mb | Scaff_NG50_Mb | Scaff_N95_Mb | Scaff_NG95_Mb | Longest_Cont_Mb | Cont_N50_Mb | Cont_NG50_Mb | Cont_N95_Mb | Cont_NG95_Mb | QV | Completeness | Comp_BUSCOs_% | Comp_Single_BUSCOs_% |
|:---------|:------------|-----------:|--------------:|--------:|-------:|--------:|-------:|-------:|-------------------:|---------------:|----------------:|---------------:|----------------:|------------------:|--------------:|---------------:|--------------:|---------------:|------:|---------------:|----------------:|-----------------------:|
| AL_yahs | scaff | 1222.45 | 1222.95 | 0.26 | 42.06 | 15324 | 152596 | 137272 | 219.89 | 85.4 | 85.4 | 7.65 | 7.65 | 0.16 | 0.02 | 0.02 | 0 | 0 | 82.73 | 97.21 | 96.6 | 96.3 |
| MA_yahs | scaff | 1225.06 | 1222.95 | 0.26 | 42.31 | 18999 | 35391 | 16392 | 220.03 | 84.87 | 84.87 | 7.78 | 7.78 | 1.08 | 0.16 | 0.16 | 0.02 | 0.02 | 50.21 | 97.58 | 97.2 | 96.9 |
| PL_yahs | scaff | 1201.75 | 1222.95 | 0.26 | 41.97 | 4105 | 166570 | 162465 | 219.15 | 85.4 | 84.81 | 8.48 | 7.69 | 0.23 | 0.02 | 0.02 | 0 | 0 | 66.11 | 97.05 | 97.2 | 97 |
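
To help read the QV column in the tables above, the sketch below converts between QV and per-base error rate. It assumes the values are Phred-scaled consensus quality (QV = -10 \* log10(error rate)), as k-mer based tools such as Merqury report it; under that assumption an `inf` entry would correspond to an estimated error rate of zero. The function names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error probability implied by a Phred-scaled QV."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Phred-scaled QV for a per-base error probability (inf if no errors)."""
    if error_rate == 0:
        return math.inf
    return -10 * math.log10(error_rate)


for qv in (50.14, 66.36, 82.79):  # example values taken from the tables above
    print(f"QV {qv}: about one error every {1 / qv_to_error_rate(qv):,.0f} bases")
```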