---
title: 'B. torquatus - collomics'
---
Collomics
===
###### tags: `collomics` `testing`
**Three-toed sloth (Bradypus sp.)**
|Bradypus sp.|[Distribution map](https://doi.org/10.1002/ecy.2663)|
|---|---|
||
New species:
***Bradypus torquatus* = *Bradypus crinitus*** + ***Bradypus torquatus*** [Miranda et al 2022](https://doi.org/10.1093/jmammal/gyac059)
|||
|---|---|
| |
"The estimated split between the two Bradypus torquatus lineages occurred likely during the Gelasian (2.5–1.8 MYA), a period characterized by the last stages of a global cooling trend that led to the quaternary ice ages (International Commission on Stratigraphy, 2007). As quaternary climatic oscillations have been related to a series of contractions and expansions of habitats and species ranges all over the world (see review in Hewitt, 2000), it could be expected that the global cooling at the end of the Pliocene also led to forest contractions that ultimately resulted in the current observed distribution of B. torquatus populations. Indeed, some authors (Haffer, 1969; de Vivo and Carmignotto, 2004) have proposed that these climatic oscillations produced the forest retraction that isolated the northern and southern portions of the Atlantic Forest since the Pliocene." [Lara-Ruiz et al 2008](https://doi.org/10.1016/j.biocon.2008.03.002)
Phylogenetic tree based on 3 mtDNA and 3 nuclear genes: [Schetino et al. 2017](https://doi.org/10.1016/j.gecco.2017.07.002)

# Samples available

[Samples Inventory](https://docs.google.com/spreadsheets/d/1mTs5sQOekC183iTROsOy4gD2IKsKsYNv/edit?usp=sharing&ouid=112605992940717016602&rtpof=true&sd=true)
[New samples received at 09/01/23](https://docs.google.com/spreadsheets/d/1DCU7sd0ewRMJ0Bpfls15Gth-zSlVqkGr/edit?usp=sharing&ouid=112605992940717016602&rtpof=true&sd=true)
Samples sent to Max-Planck-Institut, Dresden
Species|Sample|Population|DNA extraction (Circulomics)|PhenolCloro extraction|Status
|---|---|---|---|---|---|
B. torquatus|M178(S116349)|Aracruz, ES|Extracted from 56mg tissue, preserved in EtOH, Circulomics kit gDNA yield: 426 ng total||30Gb of HiFi data (08/09/22)|
B. torquatus|M184|Una (BA)|27.4 ng - too low|Yield 22 ng eluted after 1st precipitation| |
B. torquatus|M198|Silva Jardim (RJ)|389.4 ng|
B. variegatus|M584(S115804)|Rondonia|Extracted from 20mg heart tissue, preserved in EtOH, Circulomics kit (recovered gDNA from supernatant is included) gDNA yield: 3,5 ug total / decent gDNA (average size circa 14kb, circa 19ug yield)||Low input PacBio HiFi library preparation -> 1st SMRT data 01.05.22 -> BUT short polymerase reads (14 kb) and low local base rate (0.92 -> expected 2,8) -> Polymerase is very slow or even drops of the library (contaminants, nicks, secondary structures?)
B. variegatus|534M|Rondonia|rather short gDNA (average size circa 5kb, circa 16 ug yield, circulomics tissue kit for extraction)|
B. variegatus|383M|Rondonia|rather short gDNA and low yield, not persued further|
# Genome assembly Nov 2022 -> M178 (Aracruz, ES)
:::warning
**Ultra low input protocol:**
Library was barcoded and amplified -> bioinformatics pipeline needs to be extended by barcode and PCR primer trimming followed by duplication removal.
:::
## Karyotype
Paper [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3383463/)
"The diploid numbers of Bradypus species are better characterized: Bradypus torquatus showed 2n = 50, Bradypus variegatus had 2n = 54 and Bradypus tridactylus had 2n = 52. The karyotype of B. pygmaeus has not been described."

## 1st SMRT data for Bradypus torquatus available at 08/09/2022
Notes from Martin Pippel (Max Planck Dresden):
* Duplication rate was 1.3%.
* In total we lost only 9.1% bases due to the two demultiplexing steps and duplication removal.
* PacBio suggests to run the deduplication step with the combined data sets (all SMRTs) across all libraries. That means we can provide you with THE final data set, once the complete sequencing is finished. Or we can deliver the data per SMRT and you will run pbmarkdup on your own.
### 1) QC
|file |num_seqs |sum_len |min_len|avg_len |max_len|Q1 |Q2 |Q3 |sum_gap|N50 |Q20(%)|Q30(%)|GC(%)|
|:-------------------------------------------|:--------|:-------------|:------|:-------|:------|:----|:-----|:-----|:------|:-----|:-----|:-----|:----|
|m54345U_220829_075455.dc.demult2.dedup|2,768,793|30,040,568,793|425 |10,849.7|32,410 |9,088|10,578|12,199|0 |11,060|99.5 |98.58 |38.86|
|m54345U_220829_075455.dc.demult2.dedup.CUTADAPT|2,768,792|30,040,558,864|425 |10,849.7|32,410 |9,088|10,578|12,199|0 |11,060|99.5 |98.58 |38.86|
|m54345U_220829_075455.deepc|2,999,956|32,760,545,012|2|10,920.3|32,531|9,153|10,644|12,273|0|11,129|99.49|94.3|38.86|
Quality|Lenght|GC distribution|
|---|---|---|
||
### 2) Genome profile
```csvpreview {header="true"}
,Kmer=32*,Kmer=21
Estimated genome size (bp),"463,479,661","539,852,903"
Transition parameter**,21,21
Maximum read depth,63,63
```
*Kmer=32 (new recommended VGP value)
**transition between haploid and diploid coverage depths
Kmer=32 | Kmer=21
---|---
|
:::info
Seems like low coverage makes genomescope classify kmers belonging to the haploid peak as errors
:::
### 3) Hifiasm
C1 stats:
```csvpreview {header="false"}
hifiasm mode*,l1,l2,l3
sum,"3,641,041,817","2,928,580,447","2,894,961,549"
n,"74,064","40,115","38,817"
average,"49,160.75","73,004.62","74,579.73"
largest,"716,830","949,645","841,067"
N50,"58,710","103,830","105,400"
L50,"18,253","8,459","8,257"
N60,"47,876","83,656","85,437"
L60,"25,135","11,604","11,307"
N70,"39,112","65,179","66,775"
L70,"33,556","15,570","15,141"
N80,"31,881","48,175","49,649"
L80,"43,872","20,793","20,157"
N90,"25,333","33,132","34,074"
L90,"56,660","28,093","27,169"
N100,"6,680","9,428","9,428"
```
\* Level of purge duplication: l0 = disable (default for trio-binning assembly), l1 = only purge contained haplotigs (VGP), l2 = purge all types of haplotigs (Bat1K), l3 = purge all types of haplotigs in the most aggressive way (default for non-trio assembly).
BandageNG graph of 'l1' assembly:

:::info
At this coverage, hifiasm assemblies resulted fragmented. It reminds a short reads assembly
:::
### 4) Contamination analysis using BlobPlot
Plot of 'l1' assembly:

:::info
There are 24 contigs (out of 74,064) spanning ~1Mb flagged as contamination. A closer inspection of these sequences is needed to confirm it
:::
## Dataset for *Bradypus torquatus* available at 02/11/2022
Scripts[ here](https://github.com/diegomics/Collomics)
[Galaxy tutorial](https://training.galaxyproject.org/training-material/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html)
* Description of the data:
In total we ran 4 SMRTs for 2 libraries:
* L112870:
* m54345U_220829_075455 (First run analyzed above)
* m64046_220920_112413 (weak run)
* L117834:
* m64046_221018_095104
* m54345U_221028_141615
L112870 run 25.08.22 and 17.09.22 (weak run) -> 20 ng input DNA -> 13 cycles for PCR
L117834 run 17.10.22 and 27.11.22 -> 20 ng input DNA -> 13 cycles for PCR
Split into 2 PCR reactions and pooled
As these two libraries were prepared separately, we have separated and processed them according to library. The deduplication step is designed to remove artefacts from the PCR amplification step in the library prep, as a result we ran pbmarkdup on the combined files for L112870 and L117834 separately. In theory, there should not be any duplications across the libraries, apart from biological ones. The **duplication rates** for each library were quite low (**L112870 1.8% & L117834 3.1%**) so it doesn't look like this should be a huge issue for these samples. (Thomas e-mail)
---
Comparison of k-mer spectra between libraries:


GC bias visual analysis:
|Lib L112870|Lib L117834|
|---|---|
||
> Observations:
---
### 1) QC
|file | format | type|num_seqs |sum_len |min_len|avg_len |max_len|Q1 |Q2 |Q3 |sum_gap|N50 |Q20(%)|Q30(%)| GC(%)|
|:-------------------------------------------|:--------|:-------------|:------|:-------|:------|:----|:-----|:-----|:------|:-----|:-----|:-----|:----|:-------|:-------|
L112870.Final_DC.dedup.fq.gz | FASTQ | DNA | 3,755,693| 40,578,753,373 | 414 | 10,804.6 | 32,755| 9,050| 10,532 | 12,161 | 0| 11,021 | 99.53| 98.67|38.87|
L112870.Final_DC.dedup.fq.trimmed.fq.gz |FASTQ | DNA| 3,755,692 |40,578,743,444|414| 10,804.6 |32,755 | 9,050.0|10,532.0|12,161.0| 0 |11,021|99.53 |98.67|38.87
L117834.Final_DC.dedup.fq.gz | FASTQ | DNA | 5,875,788 | 64,484,424,801 |450 | 10,974.6 |27,889 | 9,517| 10,603 | 12,082 | 0 | 10,960 | 99.46 | 98.49 |38.87|
|L117834.Final_DC.dedup.fq.trimmed.fq.gz| FASTQ| DNA| 5,875,786| 64,484,388,768|450| 10,974.6| 27,889| 9,517.0 |10603.0 |12,082.0| 0 | 10,960|99.46|98.49| 38.87
Library|Quality|Lenght|GC distribution|
|---|---|---|---|
L112870 |||
L117834 |||
### 2) Genome profile
```csvpreview {header="true"}
,Kmer=31
Estimated genome size (bp),"2,999,586,646"
Transition parameter,25
Maximum read depth,75
```
Frequency|Coverage*Frequency
---|---
|
Per library:
L112870|L117834
---|---
|
|
|
### 3) Hifiasm
Genome assembly summary stats comparing different purging parameters in Hifiasm:
| | l3_primary | l3_altern | l2_primary | l2_altern | l1_primary | l1_altern | l0_primary | l0_altern
--- | --- | --- | --- | --- | --- | --- | --- | ---
contigs | 32.825 | 133.470 | 33.986 | 131.504 | 44,829 | 110,482 | 96,378 | 73,987
Total contig length | 5.196.709.304 | 5.123.297.302 | 5.247.411.469 | 5.063.717.858 | 5,855,788,680 | 4,337,487,095 | 9,067,555,161 | 1,896,813,738
Average contig length | 158315.59 | 38385.38 | 154399.21 | 38506.19 | 130,625 | 39,260 | 94,083 | 25,637
Contig N50 | 273.414 | 51.674 | 269.252 | 51.880 | 232,771 | 52,105 | 162,283 | 25,698
Contig auN | 387234.69 | 82948.80 | 381270.15 | 82967.82 | 325,104 | 82,276 | 228,721 | 36,013
Contig L50 | 5.210 | 25.140 | 5.328 | 24.809 | 6,945 | 21,317 | 15,435 | 23,005
Largest contig | 3.354.422 | 1.196.395 | 2.743.282 | 1.082.057 | 2,624,380 | 1,082,057 | 2,624,380 | 540,054
Smallest contig | 8.822 | 3.651 | 7.959 | 3.651 | 6,674 | 5,879 | 6,674 | 5,879
GC content % | 39.21 | 39.17 | 39.19 | 39.19 | 39 | 39 | 39 | 39
segments | 32.825 | 133.470 | 33.986 | 131.504 | 44,829 | 110,482 | 96,378 | 73,987
Total segment length | 5.196.709.304 | 5.123.297.302 | 5.247.411.469 | 5.063.717.858 | 5,855,788,680 | 4,337,487,095 | 9,067,555,161 | 1,896,813,738
Average segment length | 158315.59 | 38385.38 | 154399.21 | 38506.19 | 130,625 | 39,260 | 94,083 | 25,637
gaps | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
paths | 32.825 | 133.470 | 33.986 | 131.504 | 44,829 | 110,482 | 96,378 | 73,987
**L3:** [GEP final report](https://drive.google.com/file/d/1NbioCscWuIT_7xNyoax2YeVbZZSa5YEj/view?usp=share_link)
BUSCO 79,2%
Frequency|Coverage*Frequency|
|---|---|
|
log|log
|
Primary|Alternate|
|---|---|
|
**Spectra pri/alt**|**Spectra copy**|
||
### 4) Purge dups
Genome assembly summary stats (primary) comparing different purging parameters (l3 and l2) in Hifiasm and different parameters (25/75 and default) in purgedups:
| | l3_purge_25_75 | l3_purge_default | l2_purge_25_75 | l2_purge_default
--- | --- | --- | --- | ---
contigs | 18293 | 18650 | 18639 | 19010
Total contig length | 3.219.946.939 | 3.273.225.045 | 3.229.574.756 | 3.285.327.118
Average contig length | 176020.71 | 175508.05 | 173269.74 | 172821
Contig N50 | 320.234 | 317.799 | 316995 | 314629
Contig auN | 435613.57 | 432944.97 | 428525.7 | 426015.21
Contig L50 | 2.799 | 2.864 | 2844 | 2910
Largest contig | 3.354.422 | 3.354.422 | 2.743.282 | 2.743.282
Smallest contig | 8.941 | 8.941 | 9549 | 9549
GC content % | 39.43 | 39.43 | 39.43 | 39.42
soft-masked bases | 0 | 0 | 0 | 0
segments | 18.293 | 18.650 | 18639 | 19010
Total segment length | 3.219.946.939 | 3.273.225.045 | 3.229.574.756 | 3.285.327.118
Average segment length | 176020.71 | 175508.05 | 173269.74 | 172821
gaps | 0 | 0 | 0 | 0
paths | 18.293 | 18.650 | 18639 | 19010
* Plots for assembly l3 25/75:
**l3 primary only: [GEP final report](https://drive.google.com/file/d/1yKP659WPr7Dsf1PIpXYmT4gCTwm5ojcd/view?usp=share_link)**
**l3 primary + alternate: [GEP final report](https://drive.google.com/file/d/1Oa9te4qTNBLMHH3YKWeRwLLFDpa0xqCW/view?usp=share_link)**
BUSCO 77,6%
Primary|Alternate|
|---|---|
|||
**Spectra pri/alt**|**Spectra copy**|
|||
### 5) 2nd round of purge dups for the assembly l3
| | l3_primary_2ndPurge
--- | ---
contigs | 16,734
Total contig length | 3,035,912,178
Average contig length | 181,422
Contig N50 | 334,832
Contig auN | 445,782
Contig L50 | 2,561
Largest contig | 3,354,422
Smallest contig | 9,841
GC content % | 39
soft-masked bases | 0
segments | 16,734
Total segment length | 3,035,912,178
Average segment length | 181,422
gaps | 0
paths | 16,734
**L3 primary + alternate: [GEP final report](https://drive.google.com/file/d/1d2kZxEK-wvVix6ecRcDdMeh_-1NhLHr0/view?usp=share_link)**
BUSCO 77.3%
Primary|Alternate|
|---|---|
|||
**Spectra pri/alt**|**Spectra copy**|
|||
:::info
:information_source: Heterozygous peak at 32 and homozygous peak at 8 is weird!
:::
**Busco comparison - l3 assembly**
n=9226
Species | Complete BUSCOs ( C ) | Single-copy (S) | Duplicated (D) | Fragmented (F) | Missing (M)
--- | --- | --- | --- | --- | ---
l3| 7309 (79.2%) | 46.20% | 33.00% | 5.40% | 15.40%
l3_1stPurg| 7148 (77.5%) | 70.10% | 7.40% | 5.60% | 16.90%
l3_2ndPurg| 7134 (77.3%) | 72.90% | 4.40% | 5.60% | 17.10%
### 6) [BandageNG](https://github.com/asl/BandageNG)
by Tom
*De novo assembly graphs contain not only assembled contigs but also the connections between those contigs, which were previously not easily accessible. Bandage visualises assembly graphs, with connections, using graph layout algorithms.*

### 7) Contamination analysis using BlobPlot

No evidence of contamination
---------------------------------------------
### Brainstorming possible reasons for the fragmented genome
1) Parts of the reads are “polyploid” due to some weird changes (preservation conditions?) on the sample DNA.
2) Contamination (other taxon? other individual?)
3) Undetected PCR duplicates in large scale
4) Bias creating regions of lower coverage and hence, gaps (Bias in the genome coverage introduced by the amplification during lib prep)
--------------------------------------------
### Running tests...
#### 1) Trimming by min quality 41 to remove weird peak at 39
|Lib L112870|Lib117834|
|---|---|
|||
Hifiasm l3 assembly stats:
-> 2 round of purging
| | BraTor_QualTrim_l3_primary | l3_1purg_primary | l3_2purg_primary
--- | --- | --- | ---
contigs | 37.054 | 20133 | 18677
Total | 4.963.955.612 | 3.112.905.224 | 2.988.969.117
Average | 133965.45 | 154617.06 | 160034.75
Contig N50 | 230.014 | 279297 | 288707
Contig auN | 319192.93 | 371470.54 | 377835.13
Contig L50 | 5.997 | 3161 | 2977
Largest | 2.258.432 | 2258432 | 2258432
Smallest | 6.582 | 9198 | 9651
GC | 39.11 | 39.39 | 39.43
soft-masked | 0 | 0 | 0
segments | 37.054 | 20133 | 18677
Total | 4.963.955.612 | 3.112.905.224 | 2.988.969.117
Average | 133965.45 | 154617.06 | 160034.75
gaps | 0 | 0 | 0
paths | 37.054 | 20133 | 18677
||Primary|Alternate|
---|---|---|
l3 | |
l3 1 purg ||
l3 2 purg ||
#### 2) Mapping Bradypus reads to Choloepus didactylus genome

##### Check unmapped reads
Of ~9,600,000 input reads, ~340,000 (3.5 %) did not map to the Choloepus genome. NCBI Blast of 20 reads randomly selected from all unmapped reads:
14 reads did not correspond to anything in the database
6 reads yielded hits:
Bradypus variegatus (sloth) BVA1 satellite, 84 % identity
2x Myrmecophaga tridactyla (anteater) MTR1 satellite, 82 % and 96 %
2x Homo sapiens chr 20, 75 % and 63 %
Chrysolina oricalcia (a beetle ) chr 8, 13 %
:warning: Hard to believe that 13% are completely random. Can we align them more sensitively?
Mapping unmapped reads to human, chicken and E. coli resulted in the following number of mapped reads:
human: 3000 (0.86% of unmapped reads)
chicken: 300 (0.08%)
E. coli: 100 (0.03 %)So overall, the unmapped reads contain some mammalian sequences, but also a lot of junk(?).
#### 3) Mapping Bradypus reads to human hg38 genome
For comparison, we also show the alignments for the VGP Choloepus didactylus.
On the right, there is an unpurged dup.

Here is a 40kb region where choDid but not braTor1 has any aligning contig.

Here is another region. From the high similarity of the chain block structure between the 2 sloths, we know that nearly all regions are present in the assembly. Except a 10kb and a 14kb highlighted region that overlap genes --> these are most likely missing from the assembly.
But instead of having everthing left and right of the missing region as a contig, we have many fragmented contigs.

Here the two contigs are separated by 200 bp. If we jump to choDid for this position, we see overlapping reads, but less coverage and the blue highlight=overlap is only 200 bp. Clearly not enough for HiFiasm.

Here is the NRBP1 (14kb) missing region in choDid2. The blue highlight is the region that is missing in braTor1. There are no reads, but left and right, plenty of reads.
This is the VGP choDid browser, where Bernhard mapped the braTor reads to it. Given that there is left and right high coverage, it looks like this region was either omitted systematically in the lib prep. Or it was already degraded in the input DNA.

Here is another example of a 50kb region with several genes that is missing. choDid genome shows that we really lack any reads in that region


:::info
We have 2 related problems:
1) Regions that have sparse coverage, not giving large-enough read overlaps for building larger contigs.
2) Regions that lack any reads at all.
Both are likely related to bias introduced by PacBio ultralow (likely PCR bias??). There can still be other issues (adapters etc).
Ideas:
:bulb: Lets discuss running another SMRT that would ideally fix some of the biases.
:bulb: Lets pick 3-5 regions that look like the NRBP1 region, where a few kb are missing. Then lets design braTor1 primers using the contigs that are left/right and PCR it in the input sample. If we get a PCR of the expected size, the region is present in the input but gets lost during UL lib prep. If we don't get a PCR, the region was already not in the input.
:bulb: Camila suggested to do a Illumina run (short insert WGS, no PCR) on the braTor sample used for PacBio. @Sylke Winkler Do we have enough tissue left?
Decision:
-Sequence another SMRT
-Sequence Hi-C
:::
#### 4) Assembly with deep consensus
*"To get the most out of DeepConsensus, we highly recommend that you run ccs with the parameters given in the quick start. This is because ccs by default filters out reads below a predicted quality of 20, which then cannot be rescued by DeepConsensus."* https://github.com/google/deepconsensus
* Summary of the data:
|file |format|type|num_seqs|sum_len |min_len|avg_len|max_len|Q1 |Q2 |Q3 |sum_gap|N50 |Q20(%)|Q30(%)|GC(%)|
|:-----------------------------------|:-----|:---|:-------|:----------|:------|:------|:------|:-----|:------|:------|:------|:----|:-----|:-----|:----|
|m54345U_220829_075455.deepc.fastq.gz|FASTQ |DNA |2999956 |32760545012|2 |10920.3|32531 |9153.0|10644.0|12273.0|0 |11129|99.49 |94.30 |38.86|
|m54345U_221028_141615.deepc.fastq.gz|FASTQ |DNA |3175949 |35166615560|10 |11072.8|29570 |9608.0|10701.0|12179.0|0 |11055|99.48 |94.74 |38.87|
|m64046_220920_112413.deepc.fastq.gz |FASTQ |DNA |1079629 |11615211163|73 |10758.5|32849 |9026.0|10478.0|12133.0|0 |10987|99.62 |95.83 |38.87|
|m64046_221018_095104.deepc.fastq.gz |FASTQ |DNA |3277481 |35948248969|115 |10968.3|27944 |9519.0|10591.0|12075.0|0 |10949|99.43 |93.82 |38.89|
:exclamation: increase of 8,6% in read number
- Trimming using Hi-Fi index sequence:
|file|format|type|num_seqs|sum_len|min_len|avg_len|max_len|Q1|Q2|Q3|sum_gap|N50|Q20(%)|Q30(%)|GC(%)|
|:-----------------------------------|:-----|:---|:-------|:----------|:------|:------|:------|:-----|:------|:------|:------|:----|:-----|:-----|:----|
m54345U_220829_075455.deepc.trimmed.fq.gz | FASTQ| DNA | 2999955 | 32760533623 | 2| 10920.3 | 32531 | 9153.0 | 10644.0 | 12273.0 | 0 | 11129 | 99.49 | 94.30 | 38.86
m54345U_221028_141615.deepc.trimmed.fq.gz | FASTQ | DNA | 3175949 | 35166615560 | 10 | 11072.8 | 29570 | 9608.0 | 10701.0 | 12179.0 | 0 | 11055 |99.48 | 94.74 | 38.87
m64046_220920_112413.deepc.trimmed.fq.gz | FASTQ | DNA | 1079629 | 11615211163 | 73 | 10758.5 | 32849 | 9026.0 | 10478.0 | 12133.0 | 0 | 10987 | 99.62 | 95.83 | 38.87
m64046_221018_095104.deepc.trimmed.fq.gz | FASTQ | DNA | 3277481 | 35948248969 | 115 | 10968.3 | 27944 | 9519.0 | 10591.0 | 12075.0 |0 | 10949 | 99.43 | 93.82 | 38.89
- Trimming (cutadapt) using primer sequence:
- AAGCAGTGGTATCAACGCAGAGTACT and AGTACTCTGCGTTGATACCACTGCTT
|file|format|type|num_seqs|sum_len|min_len|avg_len|max_len|Q1|Q2|Q3|sum_gap|N50|Q20(%)|Q30(%)|GC(%)|
|:-----------------------------------|:-----|:---|:-------|:----------|:------|:------|:------|:-----|:------|:------|:------|:----|:-----|:-----|:----|
m54345U_220829_075455.deepc.primer.trimmed.fq.gz | FASTQ | DNA | 2352 | 24784465 | 2 | 10537.6 |24623 | 8514.5 | 10101.5 | 12274.5 | 0 | 10947 | 98.30 | 83.56 | 38.95
m54345U_221028_141615.deepc.primer.trimmed.fq.gz | FASTQ | DNA | 16283 | 180824139 | 10 | 11105.1 | 23169 | 9484.0 | 10687.0 | 12419.0 | 0 | 11152 | 97.53 | 78.44 | 38.79
m64046_220920_112413.deepc.primer.trimmed.fq.gz | FASTQ | DNA | 899 | 9041992 | 73 | 10057.8 | 23640 | 8358.5 | 10089.0 | 12143.5 | 0 | 11077 | 98.79 | 87.86 | 39.88
m64046_221018_095104.deepc.primer.trimmed.fq.gz | FASTQ | DNA | 15771 | 172870323 | 364 | 10961.3 | 21951 | 9381.0 | 10535.0 | 12244.5 | 0 | 11009 | 97.26 | 74.24 | 38.74
:::warning
:question: Check how cutadapt is trimming primer sequence
:::
- Genome profile
Kmer=31
```csvpreview {header="true"}
,dc reads, subreads
Estimated genome size (bp),"2,999,586,646","2,998,779,404"
Transition parameter,25,28
Maximum read depth,75,84
```
||Frequency|Coverage*Frequency
---|---|---
|fq reads||
|subreads||
* Hifiasm l3 assembly (compared to l3 assembly)
| | l3_primary | l3_altern | Subreads l3_primary | Subreads l3_altern |
--- | --- | --- | --- | --- |
contigs | 32.825 | 133.470 | 47.466| 130.466
Total contig length | 5.196.709.304 | 5.123.297.302 | 6.051.501.415 | 4.676.924.082
Average contig length | 158.315,59 | 38.385,38 | 127.491,29 |35.847,84
Contig N50 | 273.414 | 51.674 | 224.255 | 44.981
Contig auN | 387.234,69 | 82.948,80 | 312.599,97 | 71.372,60
Contig L50 | 5.210 | 25.140 | 7.522 | 26.655
Largest contig | 3.354.422 | 1.196.395 | 2.811.890 | 755.281
Smallest contig | 8.822 | 3.651 | 6.133 | 3.922
GC content % | 39,21 | 39,17 | 39,25 | 39,12
segments | 32.825 | 133.470 | 47.466 | 130.466
Total segment length | 5.196.709.304 | 5.123.297.302 | 6.051.501.415 | 4.676.924.082
Average segment length | 158.315,59 | 38.385,38 | 127.491,29 | 35.847,84
gaps | 0 | 0 | 0 | 0
paths | 32.825 | 133.470 | 47.466 | 130.466
#### 5) Mitogenome
MitoHiFi results:
|Input data | Length
|---|---|
|1) Reads| 16924
|2) Contigs| 78739
MitoHiFi for two different libraries:
|Input data | Length
|---|---|
|1) Reads of L112870 | 16908 |
|2) Reads of L117834 | 16920 |
:::success
Alignment of the 2 mitogenomes yielded the same mitochondrial sequence
:::
#### 6) Test ploidy with [Smudgeplot](https://github.com/KamilSJaron/smudgeplot)
This tool extracts heterozygous kmer pairs from kmer count databases and performs gymnastics with them. We are able to disentangle genome structure by comparing the sum of kmer pair coverages (CovA + CovB) to their relative coverage (CovB / (CovA + CovB)). Such an approach also allows us to analyze obscure genomes with duplications, various ploidy levels, etc.
Requires 1,2T (at least) -> 1st trial: 2 days and stopped because of time limit
2nd trial: running :woman-running: for only one library (L112870) (Dec 1st to 6th - in Nile)
Alternative: Try [PloydyPlot](https://github.com/thegenemyers/MERQURY.FK#PloidyPlot) program
Results for lib L112870 - lower coverage


-we extract heterozygous kmer pairs from kmer count databases (kmers that are 1SNP from each other)
-compare the sum of kmer pair coverages (CovA + CovB) to their relative coverage (CovB / (CovA + CovB))
-Smudgeplots show the haplotype structure using heterozygous kmer pairs: every haplotype structure has a unique smudge on the graph and the heat of the smudge indicates how frequently the haplotype structure is represented in the genome compared to the other structures.
#### 7) Assembly with HiCanu
Assembly summary | Hicanu_DefaultErrorRate_0.045 | Hicanu_ErrorRate_0.1
--- | --- | ---
contigs | 222,441 |Running... :woman-running:|
Total contig length | 11,524,632,377 |
Average contig length | 51809.84 |
Contig N50 | 84712 |
Contig auN | 144115.99 |
Contig L50 | 33497 |
Largest contig | 2042847 |
Smallest contig | 5065 |
#### 8) Assembly with [Flye](https://github.com/fenderglass/Flye/blob/flye/docs/USAGE.md)
* Testing different reads errors: 0.001 (default) and 0.01
--read-error float: adjust parameters for given read error rate (as fraction e.g. 0.03)
Assembly summary | Flye_DefaultErrorRate0.001 | Flye_ErrorRate0.01 | Hifiasm l3 2 purg
--- | --- | --- | ---
contigs |100,413| 23,822 | 18,677 |
Total contig length |4,671,452,665| 3,256,142,183 | 2.988.969.117 |
Average contig length |46,522.39| 136,686.35 | 160,034.75|
Contig N50 |102,409| **355,718** | 288,707
Contig auN |181,216.36| 538,889.55 | 377,835.13
Contig L50 |10,339| 2,324 | 2,977
Largest contig |2,351,091| **4,294,707** | 2,258,432
Smallest contig |75| 25 | 9,651
**Busco comparison - Flye x 'l3 2 purg' assemblies**
n=9226
Species | Complete BUSCOs ( C ) | Single-copy (S) | Duplicated (D) | Fragmented (F) | Missing (M)
--- | --- | --- | --- | --- | ---
l3_2ndPurg| 7134 (77.3%) | 72.90% | 4.40% | 5.60% | 17.10%
Flye error rate 0.01 | 7529 (81.6%) | 76.9% | 4.7% | 5.0% | 13.4% |
||Primary|Alternate|
---|---|---|
l3 2 purg ||
Flue 0.01 || Currently Flye will likely produce a collapsed assembly of diploid genomes, levaing only one mosaic allele per haplotype. If heterzygosity is low (e.g. ~0.1%), it should not be a problem for contiguity; however SNPs and structural variations between the alternative haplotypes will not be captured.
GEP final repor,t [here](https://drive.google.com/file/d/1-NFME2r1WEM0HNrHCD19cTM3eppr72Qs/view?usp=share_link)
:::warning
:question: Should I try a higher error rate?
:question: Should I purge Flye assembly? No indications in the K-mer analysis of duplications
:::
#### 9) Clean adapter with [FSC_adaptor](https://github.com/ncbi/fcs/wiki/FCS-adaptor)
```
#accession length action range name
ptg002819l_1_1 69976 ACTION_TRIM 69941..69976 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00325.1:Adaptor used in I.M.A.G.E. library NIH_MGC_126 and other libraries
ptg009737l_1_1 65604 ACTION_TRIM 65580..65604 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00325.1:Adaptor used in I.M.A.G.E. library NIH_MGC_126 and other libraries; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00323.1:Deoxy-UMP adaptor used in I.M.A.G.E. library Zebrafish SJD 2 day embryo and other libraries
ptg015024l_1_1 55246 ACTION_TRIM 55221..55246 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00325.1:Adaptor used in I.M.A.G.E. library NIH_MGC_126 and other libraries; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00323.1:Deoxy-UMP adaptor used in I.M.A.G.E. library Zebrafish SJD 2 day embryo and other libraries
ptg015334l_1_1 80481 ACTION_TRIM 1..26 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00325.1:Adaptor used in I.M.A.G.E. library NIH_MGC_126 and other libraries; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00323.1:Deoxy-UMP adaptor used in I.M.A.G.E. library Zebrafish SJD 2 day embryo and other libraries
ptg019465l_1_1 166659 ACTION_TRIM 1..26 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00325.1:Adaptor used in I.M.A.G.E. library NIH_MGC_126 and other libraries; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00323.1:Deoxy-UMP adaptor used in I.M.A.G.E. library Zebrafish SJD 2 day embryo and other libraries
ptg020220l_1_1 157830 ACTION_TRIM 1..25 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00325.1:Adaptor used in I.M.A.G.E. library NIH_MGC_126 and other libraries; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00323.1:Deoxy-UMP adaptor used in I.M.A.G.E. library Zebrafish SJD 2 day embryo and other libraries
ptg031987l_1_1 30495 ACTION_TRIM 1..26 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00325.1:Adaptor used in I.M.A.G.E. library NIH_MGC_126 and other libraries; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00323.1:Deoxy-UMP adaptor used in I.M.A.G.E. library Zebrafish SJD 2 day embryo and other libraries
```
| | l3_primary_2ndPurge | after adapter trimming
--- | --- | --- |
contigs | 16,734 | 16,734
Total contig length | **3,035,912,178** | **3,035,911,988**
Average contig length | **181,422** | **181,421.78**
Contig N50 | 334,832 |334832
Contig auN | 445,782 |445,782.42
Contig L50 | 2,561 | 2,561
Largest contig | 3,354,422 |3,354,422
Smallest contig | 9,841 |9,841
GC content % | 39 | 39.48
soft-masked bases | 0 | 0
segments | 16,734| 16,734
Total segment length | 3,035,912,178 |3,035,911,988
Average segment length | 181,422 | 181,421.78
gaps | 0| 0
paths | 16,734| 16,734
### Questions for Sylke:
:question: **New B. variegatus samples**
Regarding the Bradypus variegatus samples, we have other 3 muscle samples available at BeGenDiv that we could try to perform the DNA extraction aiming higher molecular weight DNA. I selected the 3 first samples based on the quality of the DNA extracts that we did a long time ago at BeGenDiv, so they are supposed to be the best 3 samples. However, we can have different results using more appropriated protocols. The alternative samples come from the same collection and same population, so they were exposed to the similar conditions, but I think it is worth to give it a try. What do you think?
# Obtaining genome variation
## 1) Repeat masking

## 2) SNP calling
https://drive.google.com/file/d/10vPMHI4RBm_vcAZuryRV0ELTL1uzYCY2/view?usp=sharing


x/x -> polishing should solve
More info on DeepVariant Calling Report here: https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-vcf-stats-report.md
# Comparative genomes
Genomes available:
+ Pilosa
* Megalonychidae (two-toed sloths)
- Choloepus didactylus - PRJNA678727
- Choloepus hoffmani - PRJNA30809
(published - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3207357/
https://www.diark.org/diark/species_list/Choloepus_hoffmanni)
* Bradypodidae (three-fingered sloths)
- Bradypus variegatus - PRJNA399370
- Bradypus tridactylus - Marcela/Jacqui assembly)
* Myrmecophagidae (anteaters)
- Myrmecophaga tridactyla - PRJNA399407
- Tamandua tetradactyla - PRJNA399420
+ Cingulata
* Chlamyphoridae
- Chaetophractus vellerosus - PRJNA399381 - too bad
- Tolypeutes matacus - PRJNA399423
* Dasypodidae
- Dasypus novemcinctus (nine-banded armadillo) - PRJNA12594 - I could not find the raw data
## BUSCO comparison
* 9226 Total BUSCO groups searched (mammalia_odb)
| Species | Complete BUSCOs ( C ) | Single-copy (S) | Duplicated (D) | Fragmented (F) | Missing (M) | Assembly | link to the data |
| --------------------------------- | ---------------- | ----------- | ---------- | ----------- | ----------- | ------------------------------------ | ------------------------------------------------------ |
| Bradypus torquatus l3 2nd purge | 7134 (77.3%) | [72.9%] | [4.4%] | 5.6% | 17.1% | Our genome | - |
| Choloepus didactylus | 7892 (85.5%) | [82.7%] | [2.8%] | 1.7% | 12.8% | Chromosomal level, 10x data, [GCF_015220235.1](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_015220235.1/) | |
| Choloepus hoffmani | 7535 (81,6%) | [79.6%] | [2%] | 2.3% | 16.1% | Scaffold level, Illumina short reads data, [GCA_000164785.2](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_000164785.2/) | |https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR857572/SRR857572 |
| Bradypus variegatus | 2921 (31.6%) | [31.2%] | [0.4%] | 5.9% | 62.5% | Scaffold level, Illumina short reads 57.9x, [GCA_004027775.1](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_004027775.1/) | | | |
| Bradypus tridactylus | 5137 (55.7%) | [55.2%] | [0.5%] | 6.8% | 37.5 | Marcela/Jacqui's assembly, Illumina short reads | - |
| Myrmecophaga tridactyla | 5532 (59.9%) | [59.4%] | [0.5%] | 6.6% | 33.5% | Illumina HiSeq, 28.9x, Scaff.level | |
| Tamandua tetradactyla | 5116 (55.4%) | [54.4%] | [1%] | 7.1% | 37.5% | Illumina HiSeq, 30x, | |
| Dasypus novemcinctus | 7529 (81.6%) | [80.5%] | [1.1%] | 3% | 15.4% | Broad Inst., Scaff. level 6x, Sanger | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA12594/ |
| Tolypeutes matacus | 3820 (41.4%) | [40.3%] | [1.1%] | [10.5%] | [48.1%] | Scaffold level, 10x reads 22.1x, [GCA_004025125.1](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_004025125.1/ ) | https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR16989941&display=metadata |
obs: gene prediction tools: metaeuk X Augustus
metaeuk - missing .fna files,
https://gitlab.com/ezlab/busco/-/issues/449
https://gitlab.com/ezlab/busco/-/issues/374
augutus --> slower, more missing genes, but generates .fna sequences!
* Tool to make busco plot: scripts/generate_plot.py
https://busco.ezlab.org/busco_userguide.html#plot
## Genome Evaluation Pipeline - GEP
https://git.imp.fu-berlin.de/cmazzoni/GEP
In the future :sleeping:
## Ideas:
Other assemblers:
https://github.com/Nextomics/NextDenovo
## Bradypus tridactylus
- Raw data:
~/spree/raw_data/2017/NextSeq/171128_Cat_3_2017-18-19-CM-seaturtles-sloths/Data/Intensities/BaseCalls/BTR343_S1_R1_001.fastq.gz
65412249 reads
~/spree/raw_data/2017/NextSeq/171206_Cat_3_2017-18-19-CM-seaturtles-sloths-2/Data/Intensities/BaseCalls/BTR343_S1_R1_001.fastq.gz
154130468 reads
~/spree/raw_data/2017/NextSeq/171201_Cat_3_2017-18-19-CM-seaturtles-sloths/Data/Intensities/BaseCalls/BTR343_S1_R1_001.fastq.gz
363161773 reads
~/spree/raw_data/2017/NextSeq/171013_Cat_3_2017-18-CM-slothsGenomes-2017-19-CM-seaturtleGenomes-2017-21-CM-IsraeliSeaturtle/Data/Intensities/BaseCalls/2017_18/sloth_BTR343_S1_R1_001.fastq.gz
905800 reads
65.412.249 + 154.130.468 + 363.161.773 + 905.800 = total 583.610.290
Nile: /srv/public/users/muliano/b.tridactylus:
BTR343_S1a_S1_forwardB2.paired_001.fastq.gz
BTR343_S1a_S1_reverseB2.paired_001.fastq.gz
514568298 reads
- Assembly:
Yangtze: /srv/public/users/johnsoj90/sloth_genome/B_tridactylus/purged_5_34_65.fa - after 2 purg rounds
https://hackmd.io/@mWzudS-MSAyL1bxNCCsTyw/HywvKJ54d
Curta: /scratch/larissasa/bradypus/03_OtherGenomes/BraTri
- Comparison between BraTri and BraTor contigs mapped to ChoDid
* 1st checked region: no reads and no contigs for B. torquatus, but reads for B. tridactylus (BRT343)


* 2nd checked region: low coverage for B. torquatus and a lot of reads for B. tridactylus, but weird region (peak of high coverage before low coverage region)



* Other examples of regions with no or low coverage for B. torquatus and good coverage for B. tridactylus



* Examples of regions with no or low coverage for B. torquatus and weird coverage for B. tridactylus


# Data analysis May 2023
## Complete data

[Complete table here](https://docs.google.com/spreadsheets/d/1nIe2BFliI1l_bwKXNIYcxSrWM2NDdsZ0/edit?usp=drive_link&ouid=113258903582532178037&rtpof=true&sd=true)
## 1) QC
|file | format | type |num_seqs |sum_len |min_len|avg_len |max_len|Q1 |Q2 |Q3 |sum_gap|N50 |Q20(%)|Q30(%)|GC(%)|
|:-------------------------------------------|:--------|:-------------|:------|:-------|:------|:----|:-----|:-----|:------|:-----|:-----|:-----|:----|:-----|:----|
m64046_230503_143043.Final_DC.fq.gz | FASTQ | DNA | 3491020 | 35117215288 | 526 | 10059.3 | 33161 | 8806.0 | 9734.0 | 11055.0 | 0 | 10041 | 99.46 | 84.79 | 39.04
|m64046_230503_143043.Final_DC.trimmed.fq.gz | FASTQ | DNA | 3491019 | 35117206107 | 526 | 10059.3 | 33161 | 8806.0 | 9734.0 | 11055.0 | 0 | 10041 | 99.46 | 84.79 | 39.04
m64046_230503_143043.Final_DC.trimmed.trimmed_primer.fq.gz | FASTQ | DNA | 3491019 | 34849889609 | 1 | 9982.7 | 33135 | 8767.0 | 9697.0| 11021.0 | 0 | 10016 | 99.47 | 84.84 | 39.03
m64046_230504_234210.Final_DC.fq.gz | FASTQ | DNA | 3439428 | 34289895583 | 539 | 9969.7 | 29950 | 8744.0 | 9648.0 |10949.0 | 0 | 9947 | 99.42 | 83.26 | 39.25
|m64046_230504_234210.Final_DC.trimmed.fq.gz | FASTQ | DNA| 3439427 | 34289883072 | 539 | 9969.6 | 29950 | 8744.0 | 9648.0 | 10949.0 |0 | 9947 | 99.42| 83.26 | 39.25
m64046_230504_234210.Final_DC.trimmed.trimmed_primer.fq.gz | FASTQ | DNA | 3439427 | 34031705952 | 1 | 9894.6 | 29924 | 8706.0 | 9612.0 |10915.0 | 0 | 9922 | 99.43 |83.32 | 39.23
FastqC plots:
Lib|Quality|Lenght|GC distribution|bp content
|---|---|---|---|---|
|m64046_230504_234210|||||
|m64046_230503_143043||||
## 2) Genome profiling


Genome profiling suggests data not entirely representative of a diploid genome
## 3) Summary of different assembly tests
Large variety in output from assemblers
Assembler | Size | N50 | #Contigs
|---|---|---|---|
Hifiasm | 6.5Gb |670kb |21,154
Hicanu | 11.7Gb |89kb | 220,262
Flye_hifi | 7.5Gb |105kb| 111,051
Flye_clr | 3.0Gb |304kb |20,057
Flye_hifi_1% |3.4Gb |3.0Mb |12,779
Flye_hifi_3% |3.2Gb| 3.5Mb |9,839
**Genome assembly with Flye**
Pacbio-hifi mode with error-rate 0.03, a minimum overlap of 5kb and then running purge-dups
||Flye|
|---|---|
Size| 3,107,685,920bp
GC%| 39.7
Contigs| 5,088
Longest Contig| 29,115,850bp
Contig N50| 3,644,537
Contig N95| 178,003bp

Minibusco:C:94.6% [S:92.3%, D:2.3%], F:1.3%,M:4.0%,n:11366


**Issue**
At least 3 “haplotype” present after mapping reads to consensus assembly

Occurrences where no read maps 100% to genome – but “consensus” is correct

Presence of at least 2 complete, distinct mitogenomes in the sample

## Best assembly
:::success
Best assembly: Flye HiFi mode, 3% error rate, 5kb minimum overlap, plus purging
Available in curta: `/scratch/brown/asm_mBraTor/assembly/Flye/run_03/purge_dups/merfin/asm_mBraTor.flye.purged.merfin.fa`
:::

||Flye|
|---|---|
Size| 3,108,716,211 bp
#contigs| 5,088
Longest Contig| 29,123,362 bp
Contig N50| 3,648,353
Average contig length| 610,989.82 bp
QV | 47.97
kmer completeness | 81.78%
**Minibusco**
lineage: eutheria_odb10
S:92.36%, 10498
D:2.28%, 259
F:1.32%, 150
I:0.01%, 1
M:4.03%, 458
N:11366
BUSCO running...
## 4) k-mer comparison between libaries
k-mer obtained using Meryl

Each number is the number of unique 31-mers found in the intersections. i.e. there are 2,788 million kmers shared between all 4 libraries.
The one with the most unique is L117834, which is the run with two big SMRTs. Otherwise I would say they are pretty consistent.

:::success
The new lib was adding 62 million kmers that are not rare and which we haven't seen with the other data.
So polymerase C likely reduces PCR bias, as it fills regions that has no read before. But Bernhard showed a few screenshots where entire genes are still not covered by a read, despite >80X coverage
:::
## 5) Mapping to C. didactylus
Running bedtools intersect comparison

## 6) Coverage/Polymerase C tests
How is responsible for the improvement in the assembly?

Results [here](https://docs.google.com/spreadsheets/d/1wNbSZ2lDCU9iasZDQ1leWoUejKq8P1M1/edit?usp=drive_link&ouid=113258903582532178037&rtpof=true&sd=true)
The main finding was that having as much of the library produced with enzyme C as possible made the assembly better.
Test2 vs Test4: adding pol C improved the assembly a lot.
Test1 vs Test2: going deep with polymerase A and B does not help much.
Test3 vs. Test7. Here we have the same amount of coverage, I have just downsampled all libraries to the same amount in Test7 (1/4 enzyme C), whereas Test3 contains 2 SMRTs with polymerase A&B and 1 SMRT with polymerase C (1/3 enzyme C). The Test7 with more enzyme C library gives a more complete (via compleasm & BUSCO) and more contiguous assembly.
Test8 vs Test9: This is even seen in the extreme of taking 1 SMRT with enzyme C and 1 SMRT with enzymes A&B. Even at 9X coverage, the library with enzyme C produced an assembly with 1Mb contig N50, but the BUSCO is quite low.
The test we are missing is whether an assembly with 3 SMRTs using only enzyme C would be best, rather than having a mix of the enzymes, however I appreciate this is overkill for this particular specimen and as Camila suggested, it would make sense to try this on another sample.
To answer Astrid’s question, all 4 libraries look to contain 2 individual (at least according to mito reads and coverage).
## 7) Mitogenomes
Mitogenome (5 mitogenomes [available](https://docs.google.com/spreadsheets/d/1p0pndCobFIN7gbD6W7L7I-eV4QXk4Ikg/edit?usp=sharing&ouid=113258903582532178037&rtpof=true&sd=true))

COI [Schetino et al. 2017](https://doi.org/10.1016/j.gecco.2017.07.002)

CYTB [Schetino et al. 2017](https://doi.org/10.1016/j.gecco.2017.07.002)

Control Region [Schetino et al. 2017](https://doi.org/10.1016/j.gecco.2017.07.002)

HiC data (mapped against mito1) is also contaminated

## 8) TOGA annotation
Goal: to assess assembly completeness, because we have a set of 'ancestral mammalian genes' generated by an intersection of many TOGA runs. Essentially, it is like BUSCO, but more sensitive. If we run TOGA on Bradypus torquatus and the VGP sloth and screen for this ancestral gene set, the assemblies are easier to compare.
Bernhard message:
The TOGA annotation of the latest braTor assembly Larissa and Tom created is done now. I compared it with other Xenarthran assemblies publicly available, using TOGA's ancestral placental mammal gene set (human coding gene that has an intact reading frame in at least one afrotheria and xenarthra assembly) as a benchmark. This gene set contains 18,430 genes and allows to estimate assembly quality and completness in detail, see the attached figures.
What we see is that the Bradypus assembly (assembly ID HLbraTor2 in our database) has the lowest number of intact genes and the highest number of genes with inactivating mutations of all long-read assemblies, but more intact genes than all short-read assemblies. The relatively high number of genes with inactiviating mutations is probably in part due to base errors in the assembly, which is expected with the QV score being 47 or so. The lower number of intact genes is again expected after what we have anecdotally seen in the browser.
Still, the HiFi assembly performs better than all short read assemblies, at least given number of intact genes. Bottom line: We can create decent assemblies from (some) museum specimens that outperform Illumina assemblies in terms of intact annotated genes.


TOGA for the Test 3 assembly (30X coverage):


In terms of ancestral placental mammalian genes, both the complete (55X, HLbraTor2) and the test3 (HLbraTor3) assembly perform almost equally well, see attached table and plot.
It’s nice because we don’t need 55x to improve the assembly, at least when library C has the same coverage in both.
## Scaffolding with HiC data

Bradypus x Choloepus didactylus

:open_mouth::grin:
||Metrics|
---|---
scaffolds| 2617
Total scaffold length| 3108202320
Average scaffold length| 1187696.72
Scaffold N50| 156651210
Scaffold auN| 157468970.31
Scaffold L50| 8
Largest scaffold| 288111789
Smallest scaffold| 473
contigs| 5149
Total contig length| 3107695920
Average contig length| 603553.30
Contig N50| 3573567
Contig auN| 4676617.31
Contig L50| 238
Largest contig| 29115850
Smallest contig:|473
gaps in scaffolds| 2532
Total gap length in scaffolds| 506400
Average gap length in scaffolds| 200.00
Gap N50 in scaffolds| 200
Gap auN in scaffolds| 200.00
Gap L50 in scaffolds| 1266
Largest gap in scaffolds| 200
Smallest gap in scaffolds| 200
Base composition (A:C:G:T)| 937416853:616832292:616849663:936597112
GC content %| 39.70
soft-masked bases| 0
segments| 5149
Total segment length| 3107695920
Average segment length| 603553.30
gaps| 2532
paths| 2617
## Astrid's contamination data
I ran mapping of the reads against the mitogenome and the assembly with MitoHiFi and I recovered the same 2 mitogenomes. By checking the mapping by eye, I had the impression that there were Bahia haplotypes, but MitoHifi did not assembly any mitogenome from Bahia. Check if Bahia haplotype has different clipping pattern.
## To do:
1) **Resequencing** short reads 3 individuals (25x for 2 individuals and 50x for reference genome individual) -> use this data to check if ultra low lib prep is causing some erros (as frame shift - indels and stop codon)
**Status: I requested DNA aliquots to Sylke**
2) Selection analysis with the annotated genes - Comparative genomics (focus on convergent evolution x ancestral - it will be dificult to desintagle) with Choloepus -> check software for ontology (squares) with Elisa