# RESULTS - Yellow Cardinal (Gubernatrix cristata) reference genome assembly (PacBio HiFi reads)
*scripts in my [Github](https://github.com/marisoldominguez)*
Species Name: yellow cardinal
Local species name: Cardenal amarillo
Scientific name: Gubernatrix cristata
Class: Aves
Order: Passeriformes
Family: Thraupidae
Genus: Gubernatrix
Conservation status: **[Endangered](https://www.iucnredlist.org/es/species/22721578/131888081)**
**Genome assembly naming** (according to Rhie et al. 2021, and corroborated in CGP species list - page 23): *bGubCri1.1*
![](https://i.imgur.com/FMXuVyb.jpg)
![](https://i.imgur.com/2k8MGb3.jpg)
*Photos of the individual being sequenced for species reference genome.*
## Background knowledge
-Genome Assemblies of close species:
![](https://i.imgur.com/Nb4Qugb.png)
-Karyotype:
Link to paper with karyotype of the species: [here](https://drive.google.com/file/d/1beMeV1R-88sRI1MDVE1Sfap9kYV5GD7q/view?usp=sharing)
The karyotype of this species contains 78 chromosomes: 12 pairs of macrochromosomes + 27 microchromosomes.
*Gubernatrix cristata* 2n=78
* Best VGP bird genome: **zebra finch** *Taeniopygia guttata* GCA_003957565.4 (PacBio CLR)
* Genome in same family (*Thraupidae*)at chromosome level: small tree finch ***Camarhynchus parvulus*** (Nanopore + HiC) https://www.ncbi.nlm.nih.gov/assembly/GCF_901933205.1
## Sample Information
Sample ID: 5697
Sample type: **blood in 96% ethanol**
Sex: **female**
Development stage: adult
Collectors: Dr. Bettina Mahler and Lic. Melina Atencio in the rescue center Fundacion Temaiken (CRET)
Origin: confiscated, in Rescue Center since 16.10.2019. Assigned to the management unit of the east, where there is no hybridization (based on mtDNA + 10 microsatellites markers).
Location: ADD MAP
Date of collection: **19.04.2021**
Date of sample delivery from Argentina to Germany: **19.04.2021**
Date of sample arrival to Uni Potsdam: **10.05.2021**
## DNA isolation
Date of DNA extraction: **17.05.2021**
Extraction kit for MHW DNA: Circulomics Nanobind CBB Big DNA
Protocol: [here](https://drive.google.com/file/d/1h-WXobb3CII9CpjwkXuXini435InkVqo/view?usp=sharing) but before 2X washes with PBS to remove ethanol:
80 ul of Ethanol/blood mix pipetted in Protein LoBind tubes.
1. Spin 5 min 2200 g 4°C to remove ethanol
2. Add 500 ul of 1x PBS to wash. Spin 5 min 2200 g 4°C and remove supernatant. Repeat for a total of 2 washes.
3. Pellet is directly dissolved in 200 ul of 1x PBS.
I sent 3 extractions from same individual to be considered for sequencing.
## Sequencing
DRESDEN-concept Genome Center (MPI-Dresden)
Contact: Dr. Sylke Winkler (winkler@mpi-cbg.de)
Date of sample delivery to sequencing center: **20.07.2021**
Date of sample reception in the sequencing center: **21.07.2021**
Template Specimen Form: [here](https://drive.google.com/file/d/1UtUPy8uKe2T8J_iR-3ePsoxTK7461igF/view?usp=sharing)
Date of sequencing completion: 20.08.2021
Sequencing machine:
QC in sequencing center (extraction 3 chosen): [here](https://drive.google.com/file/d/1FUG46828MQJY2LbqZ3P4qJbRvu_qaztq/view?usp=sharing)
## PacBio HiFi reads pre-processing
**1.** **bam to fq**: bam2fastx using SMRTtool version 1.3.1 (also tested with bedtools 2.9)
[submit_bam2fastx_newHPC.slurm](https://drive.google.com/file/d/1lSYmctsZnRZmFoMAxNEUQFu0KblVdWWT/view?usp=sharing)
**2. Trimming of PacBio SMRT adapters** from reads (cutadapt) --> 4.5 % reads removed
[submitcutadaptnewHPC.slurm](https://drive.google.com/file/d/1_Ni22w8yN3suxcHC0J6GH7A7IbtARzht/view?usp=sharing)
# Reads Statistics
**1. Fastqc on trimmed reads**
[submit_fastqc_trimmed_reads.slurm](https://drive.google.com/file/d/154yKNeLz8_4fHKZwD1vbQ0DNk19yD4uW/view?usp=sharing)
* link to full fastqc report **before** trimming sequencing adapters from reads [here](https://drive.google.com/file/d/1ppck5pY7ZYKCswjB4ArVzNu7WIARzEJw/view?usp=sharing)
* link to full fastqc report on **trimmed** reads [here](https://drive.google.com/file/d/1kO6I8tKme3-2Zy6u2LoTLcQKX13QKpDj/view?usp=sharing), main results:
![](https://i.imgur.com/gwG3BS9.png)
**2. Plot reads length**
![](https://i.imgur.com/rMCh5LP.png)
**3. Reads statistics**
Count reads on .fasta file (N = 2 489 222)
```
grep ">" /path_to_reads/reads_noAdapt.fasta | wc -l
```
Count reads on .fastq file (N = 2 489 222)
```
expr $(cat /path_to_reads/reads_noAdapt.fastq | wc -l) / 4
```
Using asmstats script:
[submit_trimmed_asmstats.slurm](https://drive.google.com/file/d/1Rjb8n7jA5L60mSb1NNQAwoh7TOv_cSKK/view?usp=sharing)
Outfile reads_noAdapt.fastq.stats [here](https://drive.google.com/file/d/1_jhrUgOdlufE5bDcxJqbYsdVCj0q6FYI/view?usp=sharing).
![](https://i.imgur.com/F74vQEg.png)
**4. Kmer analysis on reads**
With the resulting 21-mer or 31-mer histograms, GenomeScope2 was used to estimate the haploid genome length, repeat content, and heterozygosity:
* Assuming k21:
[submit_Jellyfish_trimmed_k21.slurm](https://drive.google.com/file/d/1id-VkCkCUNTB2nJIh49_3OmBZHUdoHnS/view?usp=sharing)
*I used jellyfish version 2.2.10*
* Assuming k31:
[submit_Jellyfish_trimmed_k31.slurm](https://drive.google.com/file/d/1_GhZVkBATwnBDOngLZN4yhSVSpv1drkh/view?usp=sharing)
Open the histo files in [genomescope2 website](http://qb.cshl.edu/genomescope/genomescope2.0/)
[k21.histo](https://drive.google.com/file/d/1yvNxjOaxRKsiGfDcAfLmUArRstqzcmfZ/view?usp=sharing)
![](https://i.imgur.com/814sXYx.png)
Link to genomescope2 results [here](http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=5d0pXxIZINtIM6igWHcg)
Kmer: 21
Haploid genome length: 1 057 467 553 bp ~ 1.06 Gb
Heterozygosity: 1.1%
Repts (100-uniq): 11.8%
Parameters for purge_dups:
- Estimated genome haploid size: 1057467553 bp
- Transition Parameter (kmercov*1.5): 21.045 -> 21 reads (integer)
- Maximum read depth (Transition Par*3): 63.135 -> 63 reads (integer)
[k31.histo](https://drive.google.com/file/d/19zt3RybTbybSlnzphghbwDwRIS2cB8yV/view?usp=sharing)
![](https://i.imgur.com/zxIeYwI.png)
Link to genomescope2 results [here](http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=VtMACah2eEURIERJjRmW)
Kmer: 31
Haploid genome length: 1 058 655 883 bp ~ **1.06 Gb**
Heterozygosity: 0.99 %
Repts (100-uniq): 9.9 %
Parameters for purge_dups:
- Estimated genome haploid size: 1 058 655 883 bp
- Transition Parameter (kmercov*1.5): 20.7 -> 21 reads (integer)
- Maximum read depth (Transition Par*3): 62.1 -> 62 reads (interger)
**Reads Coverage: 30.3X**
(number of bases sequenced -sum value in .stats file-) / (genome size in genomescope2 according to kmer analysis)
(32036044402/1058655883)
The **estimated genome size** for both kmer sizes is around **1.06 Gb**, this is also close to the **zebra finch** (Taeniopygia guttata, 1.056 Gb), and other birds like **small tree finc**h (Camarynchus parvulus, 1.064 Gb), and **medium ground finch** (Geospiza fortis, 1.065 Gb).
![](https://i.imgur.com/kc8d6Ig.png)
## Genome Assemblies
**1.** **Hifiasm** (https://github.com/chhylp123/hifiasm) with different parameters of internal purging and kmer sizes (l0,l1,l2 and nok, k21, k31)
*I used Hifiasm version 0.16.1-r375*
- scripts for assemblies *(change parameters -l and -k accordingly)*
[submit_hifiasm_trimmed_l2_k31_primary.slurm](https://drive.google.com/file/d/1KnaKGmzBN4CCdo1mr_0udRiJC-3qA-Wa/view?usp=sharing)
- script for statistics for the asm:
Pri:
[submit_asmstats_hifiasm_l2_p_k31.slurm](https://drive.google.com/file/d/1Ja1e-xk-dv-2RNNHFw0iIiyEveU7Lr-d/view?usp=sharing)
![](https://i.imgur.com/7AXOJAr.png)
Alt:
[submit_asmstats_hifiasm_l2_a_k31.slurm](https://drive.google.com/file/d/1mIWaPJxIiveEPmaGx7hOuq3YpkvIsVjX/view?usp=sharing)
![](https://i.imgur.com/7Jqyx9Z.png)
### Assembled contigs viewed by Bandage:
Primmary assembly:
YelCar_hifiasm_trim_l2.k31.PRI.gfa
![](https://i.imgur.com/FecY4fY.png)
*Looks good, most of the contigs do not have 2 branches.*
Zoom-in of all contigs I found with bubbles:
![](https://i.imgur.com/clbmCWd.png)
*(hopefully these will be solved after purging - I can not corroborate since purgedups does not provide a .gfa file)*
Alternative assembly:
YelCar_hifiasm_trimmed_l2_k31.asm.a_ctg.gfa
![](https://i.imgur.com/wR9bb2A.png)
*It is more fragmented, and much more bubbles.*
**PURGING**
purge_dups was used to reduce redundancy purging false duplications which occur by incorrectly retaining some secondary alleles in the primary contig set.
https://github.com/dfguan/purge_dups
*default, and VGP suggested cutoffs were used.*
- scripts for purging - 1st ROUND (purge PRI c1 to obtain **p1**)
[submit_purge_dups_trimmed_c1.l2.k31_new.slurm](https://drive.google.com/file/d/1zj1wnS4tL7I0EhNojV0y5QZotcoQ6JyB/view?usp=sharing)
Pri (**p1**):
![](https://i.imgur.com/9KSV7lR.png)
Alt (**p2**):
![](https://i.imgur.com/RWJ3enp.png)
*job was ran with --cpus-per-task=24, --mem=64G. It took 15 min.*
- scripts for purging - 2nd ROUND (purge ALT c2p2 to obtain **q2**)
[submit_purge_dups_trimmed_c1.l2.k31_default_c2p2_21022022.slurm](https://drive.google.com/file/d/15dT4N7K7exWfPRKBL-OUCb8VMD_pbOto/view?usp=sharing)
Pri:
![](https://i.imgur.com/DtICdw1.png)
Alt (**q2**):
![](https://i.imgur.com/d0R6cyW.png)
*requested --cpus-per-task=24, --mem=64G, and it took 15 min to run*
Values based on **kmercov** from genomescope2:
For k31:
- Transition Parameter (kmercov*1.5): 20.7 -> 21 reads (integer)
- Maximum read depth (Transition Par*3): 62.1 -> **62** reads (interger)
-
For k21:
- Transition Parameter (kmercov*1.5): ): 21.045 -> 21 reads (integer)
- Maximum read depth (Transition Par*3): 63.135 -> **63** reads (interger)
**2. IPA**
https://github.com/PacificBiosciences/pbipa
*I used ipa (wrapper) version=1.3.1*
* script for the assembly of the primmary (PRI) and alternative (ALT) assemblies
[submit_IPA_NoAdapt.slrum](https://drive.google.com/file/d/1KqeyFeEtcs49l1PX0xjlufoaPBz11djr/view?usp=sharing)
*It was ran requesting 50G of mem and lasted 2 days to finish.*
Pri:
![](https://i.imgur.com/cEnWOIH.png)
Alt:
![](https://i.imgur.com/MKcntzu.png)
- - 1st ROUND of PURGING (purge PRI p1 to obtain **c1**)
*I used purgedups v1.2.5*
https://github.com/dfguan/purge_dups
[submit_purge_dups_trimmed_c1.IPA.slurm](https://drive.google.com/file/d/1-NW8Zl-3Aj0ZzjhV5kdmGNeVFGf-6lrn/view?usp=sharing)
Pri (**c1**):
![](https://i.imgur.com/0GvutNJ.png)
Alt:
![](https://i.imgur.com/mR9n0av.png)
- - 2nd ROUND of PURGING (purge ALT c2p2 to obtain **q2**)
[submit_IPA_c2p2_purged_q2.slurm](https://drive.google.com/file/d/1fsRnV4Dq16fcEIkQLXziE-4yMug9Uyd7/view?usp=sharing)
Pri:
![](https://i.imgur.com/kUnPTt6.png)
Alt (**q2**):
![](https://i.imgur.com/m4yZJ22.png)
*It was ran with --cpus-per-task=24, --mem=24G and took 10 min.*
## Genome Evaluation Pipeline (GEP)
https://git.imp.fu-berlin.de/cmazzoni/GEP
Developers: James Sullivan and Dr. Camila Mazzoni.
To evaluate all my hifiasm and IPA asms, before and after purging. Nt=**29**.
STEP 1: **building the meryl kmer database**
[YelCar_hifi_sampleSheet.tsv](https://drive.google.com/file/d/1QHb4yJsxEp64lHNeuvUk8OzmT9r-278m/view?usp=sharing)
STEP 2: **run the genome asm evaluation**
[runEval_YelCar_hifiasm_IPA.tsv](https://drive.google.com/file/d/1LMauVT_heurKLvc9TbiHSQVvqZaJ6usm/view?usp=sharing)
*It took 1 day to ran requestin 80 cores and 5 jobs.*
| Assembly | hifiasm_l0_nok | purgedups_l0_nok_default | purgedups_l0_nok_VGPcutoffs | hifiasm_l0_k21 | purgedups_l0_k21_default | purgedups_l0_k21_VGPcutoffs | hifiasm_l0_k31 | purgedups_l0_k31_default | purgedups_l0_k31_VGPcutoffs | hifiasm_l1_nok | purgedups_l1_nok_default | purgedups_l1_nok_VGPcutoffs | hifiasm_l1_k21 | purgedups_l1_k21_default | purgedups_l1_k21_VGPcutoffs | hifiasm_l1_k31 | purgedups_l1_k31_default | purgedups_l1_k31_VGPcutoffs | hifiasm_l2_nok | purgedups_l2_nok_default | purgedups_l2_nok_VGPcutoffs | hifiasm_l2_k21 | purgedups_l2_k21_default | purgedups_l2_k21_VGPcutoffs | hifiasm_l2_k31 | purgedups_l2_k31_default | purgedups_l2_k31_VGPcutoffs | IPA | IPA_purged |
|:-----------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:--------------|:--------------|
| total_num_bases | 2,224,168,269 | 1,351,576,889 | 1,188,367,148 | 2,222,471,140 | 1,346,836,325 | 1,182,863,508 | 2,223,097,156 | 1,355,784,751 | 1,182,862,056 | 1,332,099,590 | 1,176,726,747 | 1,176,647,996 | 1,331,886,907 | 1,180,543,672 | 1,180,465,568 | 1,330,877,578 | 1,179,270,444 | 1,177,972,303 | 1,245,686,462 | 1,165,793,779 | 1,165,793,779 | 1,244,971,205 | 1,172,156,874 | 1,172,156,874 | 1,243,890,104 | 1,164,967,086 | 1,164,967,086 | 1,080,623,443 | 1,070,029,210 |
| size_estimate | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 |
| max_heterozygosity | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% |
| GC_Content | 43.4626 | 43.1192 | 43.3178 | 43.4481 | 43.1054 | 43.2881 | 43.4551 | 43.0657 | 43.2764 | 43.9681 | 43.2168 | 43.2159 | 43.9665 | 43.2379 | 43.2376 | 43.9685 | 43.2309 | 43.2214 | 43.7381 | 43.1243 | 43.1243 | 43.7162 | 43.1699 | 43.1699 | 43.7021 | 43.1111 | 43.1111 | 42.4761 | 42.3778 |
| total_num_scaffolds | 2,757 | 1,215 | 1,083 | 2,781 | 1,245 | 1,115 | 2,754 | 1,205 | 1,073 | 1,261 | 798 | 794 | 1,311 | 830 | 829 | 1,288 | 793 | 786 | 972 | 616 | 616 | 1,005 | 650 | 650 | 978 | 597 | 597 | 575 | 490 |
| total_num_contigs | 2,757 | 1,215 | 1,083 | 2,781 | 1,245 | 1,115 | 2,754 | 1,205 | 1,073 | 1,261 | 798 | 794 | 1,311 | 830 | 829 | 1,288 | 793 | 786 | 972 | 616 | 616 | 1,005 | 650 | 650 | 978 | 597 | 597 | 590 | 501 |
| number_of_gaps | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 11 |
| Longest_scaffold | 46,214,770 | 46,214,770 | 46,214,770 | 46,214,784 | 46,214,784 | 46,214,784 | 46,214,776 | 46,214,776 | 46,214,776 | 50,262,241 | 50,262,241 | 50,262,241 | 50,372,431 | 50,372,431 | 50,372,431 | 50,428,791 | 50,428,791 | 50,428,791 | 64,130,911 | 64,130,911 | 64,130,911 | 65,868,658 | 65,868,658 | 65,868,658 | 69,566,365 | 69,566,365 | 69,566,365 | 33,552,962 | 33,552,962 |
| Scaffold_N50_length | 2,955,405 | 7,258,968 | 8,379,364 | 2,998,866 | 7,383,908 | 8,477,060 | 2,998,866 | 7,258,958 | 8,379,360 | 10,709,157 | 14,151,982 | 14,151,982 | 11,908,915 | 14,446,381 | 14,446,381 | 12,368,032 | 14,424,108 | 15,226,941 | 17,289,722 | 17,435,227 | 17,435,227 | 17,291,129 | 17,435,228 | 17,435,228 | 17,435,235 | 21,431,074 | 21,431,074 | 8,893,208 | 8,893,208 |
| Scaffold_NG50_length | 10,045,395 | 10,045,395 | 10,045,395 | 10,097,027 | 10,072,371 | 9,722,664 | 10,111,749 | 10,111,749 | 10,045,389 | 15,882,063 | 15,882,063 | 15,882,063 | 16,399,956 | 16,399,956 | 16,399,956 | 16,399,925 | 16,399,925 | 16,399,925 | 21,425,933 | 21,425,933 | 21,425,933 | 21,431,078 | 21,431,078 | 21,431,078 | 22,844,882 | 22,844,882 | 22,844,882 | 8,901,910 | 8,901,910 |
| Scaffold_N95_length | 232,223 | 362,570 | 374,275 | 228,491 | 340,168 | 317,809 | 231,040 | 364,671 | 364,465 | 312,400 | 596,648 | 596,648 | 276,936 | 542,153 | 542,153 | 299,635 | 557,635 | 559,069 | 445,326 | 891,889 | 891,889 | 407,217 | 733,833 | 733,833 | 436,642 | 879,043 | 879,043 | 533,164 | 660,386 |
| Scaffold_NG95_length | 3,485,820 | 2,473,102 | 1,574,909 | 3,496,061 | 2,473,082 | 1,604,414 | 3,554,867 | 2,531,255 | 1,574,891 | 2,730,996 | 2,153,208 | 2,153,208 | 2,698,482 | 2,219,116 | 2,219,116 | 2,698,460 | 2,219,116 | 2,219,116 | 3,399,398 | 2,986,878 | 2,986,878 | 3,320,541 | 3,236,925 | 3,236,925 | 3,386,191 | 3,231,517 | 3,231,517 | 868,503 | 868,503 |
| Longest_contig | 46,214,770 | 46,214,770 | 46,214,770 | 46,214,784 | 46,214,784 | 46,214,784 | 46,214,776 | 46,214,776 | 46,214,776 | 50,262,241 | 50,262,241 | 50,262,241 | 50,372,431 | 50,372,431 | 50,372,431 | 50,428,791 | 50,428,791 | 50,428,791 | 64,130,911 | 64,130,911 | 64,130,911 | 65,868,658 | 65,868,658 | 65,868,658 | 69,566,365 | 69,566,365 | 69,566,365 | 33,552,962 | 33,552,962 |
| Contig_N50_length | 2,955,405 | 7,258,968 | 8,379,364 | 2,998,866 | 7,383,908 | 8,477,060 | 2,998,866 | 7,258,958 | 8,379,360 | 10,709,157 | 14,151,982 | 14,151,982 | 11,908,915 | 14,446,381 | 14,446,381 | 12,368,032 | 14,424,108 | 15,226,941 | 17,289,722 | 17,435,227 | 17,435,227 | 17,291,129 | 17,435,228 | 17,435,228 | 17,435,235 | 21,431,074 | 21,431,074 | 8,893,208 | 8,893,208 |
| Contig_NG50_length | 10,045,395 | 10,045,395 | 10,045,395 | 10,097,027 | 10,072,371 | 9,722,664 | 10,111,749 | 10,111,749 | 10,045,389 | 15,882,063 | 15,882,063 | 15,882,063 | 16,399,956 | 16,399,956 | 16,399,956 | 16,399,925 | 16,399,925 | 16,399,925 | 21,425,933 | 21,425,933 | 21,425,933 | 21,431,078 | 21,431,078 | 21,431,078 | 22,844,882 | 22,844,882 | 22,844,882 | 8,901,910 | 8,901,910 |
| Contig_N95_length | 232,223 | 362,570 | 374,275 | 228,491 | 340,168 | 317,809 | 231,040 | 364,671 | 364,465 | 312,400 | 596,648 | 596,648 | 276,936 | 542,153 | 542,153 | 299,635 | 557,635 | 559,069 | 445,326 | 891,889 | 891,889 | 407,217 | 733,833 | 733,833 | 436,642 | 879,043 | 879,043 | 527,794 | 625,785 |
| Contig_NG95_length | 3,485,820 | 2,473,102 | 1,574,909 | 3,496,061 | 2,473,082 | 1,604,414 | 3,554,867 | 2,531,255 | 1,574,891 | 2,730,996 | 2,153,208 | 2,153,208 | 2,698,482 | 2,219,116 | 2,219,116 | 2,698,460 | 2,219,116 | 2,219,116 | 3,399,398 | 2,986,878 | 2,986,878 | 3,320,541 | 3,236,925 | 3,236,925 | 3,386,191 | 3,231,517 | 3,231,517 | 856,071 | 856,071 |
| qv_score | 61.9987 | 62.0147 | 61.6227 | 61.945 | 62.0304 | 61.624 | 62.0604 | 62.1082 | 61.6803 | 60.8608 | 61.6912 | 61.6909 | 60.8165 | 61.7131 | 61.7129 | 60.8343 | 61.617 | 61.635 | 60.5357 | 61.342 | 61.342 | 60.389 | 61.3572 | 61.3572 | 60.4856 | 61.3088 | 61.3088 | 56.3843 | 57.2844 |
| kmer_completeness | 99.3468 | 85.5233 | 82.6667 | 99.3354 | 85.4438 | 82.561 | 99.3428 | 85.6085 | 82.5495 | 84.2954 | 82.507 | 82.5067 | 84.3931 | 82.6174 | 82.6171 | 84.4225 | 82.6217 | 82.6059 | 83.2772 | 82.4928 | 82.4928 | 83.3569 | 82.6469 | 82.6469 | 83.3613 | 82.5778 | 82.5778 | 81.3829 | 81.2586 |
| Complete_Buscos | 96.2% | 95.8% | 95.8% | 96.3% | 95.8% | 95.9% | 96.3% | 95.8% | 95.7% | 96.0% | 95.8% | 95.8% | 96.0% | 95.9% | 95.9% | 96.0% | 95.8% | 95.8% | 96.0% | 95.9% | 95.9% | 96.0% | 95.9% | 95.9% | 96.1% | 96.0% | 96.0% | 95.6% | 95.7% |
| Complete_single_buscos | 16.8% | 78.6% | 94.8% | 16.9% | 78.7% | 94.7% | 16.8% | 78.4% | 94.7% | 87.1% | 94.7% | 94.7% | 87.5% | 94.8% | 94.8% | 87.2% | 94.8% | 94.8% | 93.6% | 94.9% | 94.9% | 93.7% | 94.9% | 94.9% | 93.8% | 95.0% | 95.0% | 94.6% | 94.7% |
PDF report for each asm evaluated [here](https://drive.google.com/file/d/1emPu1VEXbFs84HnUCY7okc1EiCcOWlEU/view?usp=sharing)
*Achtung: the K-mer Multiplicity PRI (stacked) plot is wrong - the plot is actually showing the flattened pri+alt. The correct plot is in the merquey folder: asmnamemerqOutput.spectra-cn.st.png*
* Notes on [merqury](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02134-9) results:
-**Merqury copy number spectrum plots** (similar to KAT):
tracks the multiplicity of each k-mer found in the reads and colors it by the number of times it is found in a given assembly
->when using hifiasm.l0, the plots for the primmary asm have also a blue peak (denoting kmers appearing twice), which should have not been there since the asm is already phased. For hifiasm.l1 too, but after purging the blue is gone.
->no asms showed a bar at the origin of the plot representing kmers found only in the asm
-> the higher fraction of k-mers missing in the assembly (black peak, read-only) seen in all asm plots probably reflects kmers specific to the other haplotig! This is corroborated when looking at the plot flattened pri+alt, in which the grey peak of read-only kmers disappears.
-**kmer completeness**: first filter out low-copy k-mers that are likely to represent sequencing errors, and then define k-mer completeness as the fraction of reliable k-mers in the read set that are also found in the assembly.
-**qv-score**:(Phred quality score, kmer survival rate, or assembly consensus quality value) assumes that kmers found only in the asm are errors. It depends on the coverage, quality of the read set, and kmer size chosen. Higher QVs indicate a more accurate consensus.
[FullTable_asms_comparisons.xlsx](https://docs.google.com/spreadsheets/d/1RD4zSMCuin1e0ovnP5k5UIenwJLs5cFM/edit?usp=sharing&ouid=108013701707137496295&rtpof=true&sd=true)
* Criteria for "good" assembly:
| Criterion | Best assembly | Best value |
| -------- | -------- | -------- |
| Contig N50>10Mb | **hifiasm.l2.k31.purged** (both with default or VGP suggested values) | 21 Mb |
| Busco completness>95% | all, but best: hifiasm.l0 (before purging), and **hifiasm.l2.k31.purged** (both with default or VGP suggested values) | 96% |
| Busco duplicated rate<3% | **all**, except hifiasm.l0 and l1 before purging, and l0 after purging with default values | 1% |
| < Number of contigs | IPA, IPApurged, **hifiasm.l2.k31.purged** (default and
| > qv-score | hifiasm.l0.k31.purged.default | 62.11 |
| > kmer completness | hifiasm.l0.k31 | 99.34 |
| copy number spectrum plots | all hifiasm.l1.purged, hifiasm.l2 and all **hifiasm.l2.purged** | no blue peak, no bar of kmers found only in the asm |
## Best assembly
All the QC metrics are better for asm **hifiasm.l2.k31.purged** except the *qv-score* and *kmer completness*.
![](https://i.imgur.com/ZhLbjR1.png)
![](https://i.imgur.com/dNhtrng.jpg)
We estimated a very high base accuracy of QV (consensus quality) 61.3 and 62.8 for both haplotypes, indicating one error per ~71.4 million base pairs (=61.3088x1164967086).
*****
### Other assemblers I tried: Hicanu
https://github.com/marbl/canu/releases
```
module purge
module load lang/Anaconda3/2020.11
source activate java11-env
export PATH=/work/dominguez/YelCar_GenomeAssembly_MPI/canu-2.1.1/bin:$PATH
canu -assemble -d /path/hicanu_FatNode/ -p YelCar_hicanu_assemble genomeSize=1.06g -pacbio-hifi /path_ro_reads/reads_noAdapt.fastq useGrid=false
#change genomeSize for what was predicted by genomescope2
#Requested --mem=150G, --cpus-per-task=8, and took xx to finish.
```
*****
### Other assemblers I tried: LJA
https://github.com/AntonBankevich/LJA
```
module purge
module load lib/zlib/1.2.11-GCCcore-10.2.0
module load lang/Anaconda3
source activate cmake-env
cd /path/LJA/
/work/dominguez/LJA/bin/lja -o /path/LJA_fatnode_Cintel/ --reads /path_to_reads/reads_noAdapt.fastq --threads 8 --diploid
```
The job to assemble the hifi reads seems not to use much memory over most of its runtime. It seems that it is only at the end that it performs a huge allocation. And it is killed because of that. So, I then ran it in our "fat" node (where I can request --mem=1500G). In this case, the job failed after 14 days of running due to a time limitation. I, then, tested it with a shorter version of the infile, requesting 24 CPUs (--mem-per-cpu=16G). It took 1h and then also failed in the step of uncompressing homopolymers in contigs. It produced the expected output files except from the final fasta assembly and final gfa.
*when fatnode is free, try: submitLJAtrimmedFatNodeintelv2.slurm*
*****
### Other assemblers: Rust-mdbg
https://github.com/ekimb/rust-mdbg
I did not try it given that it does not resolve both haplotypes.
*****
### MUMMER
https://github.com/mummer4/mummer
https://mummer4.github.io/manual/manual.html
**Compare my best asm with a reference from a related species genome.**
There are no VGP genomes in the same family (Thraupidae).
I could use another family, like:
Zebra finch genome VGP([GCF_003957565.2_bTaeGut1.4.pri](https://www.ncbi.nlm.nih.gov/assembly/GCF_003957565.2/))...
or a species of the same family (but not so good asm), like **small tree finch **([GCA_902806625.1_Camarhynchus_parvulus_V1.1](https://https://www.ncbi.nlm.nih.gov/assembly/GCA_902806625.1/)), and medium ground finch ([GCF_000277835.1_GeoFor_1.0](https://www.ncbi.nlm.nih.gov/assembly/GCF_000277835.1/)).
The genome asm for the small tree finch (Camarhynchus parvulus) is slightly better (genome coverage 31X, less gaps).
all-vs-all comparison of nucleotide sequences contained in multi-Fasta data files. It is best used for similar sequences that may have large rearrangements.
download Camarhynchus parvulus (Campar) asm from ncbi (in terminal):
```
wget -O Campar_genome1.1.fna.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/806/625/GCA_902806625.1_Camarhynchus_parvulus_V1.1/GCA_902806625.1_Camarhynchus_parvulus_V1.1_genomic.fna.gz
```
job script: [submit_mummer_Hifiasm.l2.k31.p.PRI_vs_Campar.slurm](https://drive.google.com/file/d/1ABDOC2OYKV0spl6pCpQ-FVUueXqVAAML/view?usp=sharing) *Ran with --cpus-per-task=24, --mem=64G, and took 3hs.*
*I used NUCmer (NUCleotide MUMmer) version 3.1*
Dotplot between two assemblies (ran delta-filter, show-coords and mummerplot):
[submit_mummer_Hifiasm.l2.k31.p.PRI_vs_Campar_2.slurm
](https://drive.google.com/file/d/1oXEq6G959TbFwNN3hhswXHYqfi5d4JVZ/view?usp=sharing)
![](https://i.imgur.com/Af1UcEx.png)
Change in the .gp file towards the bottom this lines:
```
set linestyle 1 lt 1 lw 0.3 pt 6 ps 0.1
set linestyle 2 lt 3 lw 0.3 pt 6 ps 0.1
set linestyle 3 lt 2 lw 0.3 pt 6 ps 0.1
```
![](https://i.imgur.com/yyB33BV.png)
I can do a circos plot with the mummer .coords file.
---
### . CONTAMINATION CHECK
## Blast best asm against ncbi database (lagoa) and see in MEGAN (locally)
`blastn -db /srv/public/shared/data/blastDB/nt/nt -query /srv/public/users/dominguez/asm/hifiasm.l2.k31.default_p1_purged_default.fa -out /srv/public/users/dominguez/2ndSpikeIn_testAdapters/blast/RefGenome_vs_ncbi/hifiasm.l2.k31.default_p1_purged_NT.xml -max_target_seqs 1 -num_threads 36 -evalue 10e-20 -outfmt 5`
![](https://i.imgur.com/ecXI6Ed.png)
From 597 contigs, 27 with no hits.
Th vast majority of hits within Aves, but a very small number of hits with Boreoeutheria (placental mammals, n=2), and Austrfundulus limnaeus (fish, n1).
---
# Assembly *bGubCri1.2*
### DeepConsensus
Alternative to **pbccs** for generation of ccs reads. It is 20X more computational expensive but reduces reads error, creates **more coverage** of the same PacBio reads, and increases contig N50 and QC.
Paper here: https://www.biorxiv.org/content/10.1101/2021.08.31.458403v1.full.pdf
github here: https://github.com/google/deepconsensus
DeepVariant_DeepConsensus (kate file):
- Installed through apptainer and run example data.
- Run my data on a job array: [script](https://drive.google.com/file/d/1mlIo2aE8pqUpHcrqFt1zrui9URqecXOw/view?usp=sharing).
Check if some jobs in the array failed:
## After it finished: to check failed jobs in array
```
ls -dq *output.fastq | wc -l
#9996
#there are 4 jobs missing
ub=10000
seq "$ub" | while read -r i; do [[ -f "$i.m64046_210816_140027.output.fastq" ]] || echo "$i is missing"; done
#jobs failed that I should re-run:
#710
#1165
#1264
#1463
```
Re run those manual to retrieve the missing pieces.
simply use the same job-script, comment out the array part, set imax to 10000 and i to the index of the missing chunk.
```
#concatenate all fastq outputs: DeepConsensus_cat_fastq.sh
cat /work/dominguez/YelCar_GenomeAssembly_MPI/reads/DeepConsensus/YelCarReads_chunks/results/*.output.fastq > /work/dominguez/YelCar_GenomeAssembly_MPI/reads/DeepConsensus/YelCarReads_chunks/results/total_ccs.fastq
#total.fastq is 75 456 Gb and the old ccs.fasta was 126 Gb.
```
***
# NEW CCS FILE (DC)
Re-run whole pipelne with this new ccs file
# 1. Reads Statistics
1a. **TRIMMING**. Removing adapter contaminated reads:
submit_cutadapt_new_ccs_DC.sh
```
cutadapt -j 10 -o /path/new_ccs_DC_noAdapt.fastq /path/total_ccs.fastq -b ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT -b ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT --discard-trimmed
```
**4.4%** reads were removed because they contained adapters (cutadapt version 3.5)
BUT I found [this paper](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08375-1#MOESM4) comparing performance of cutadapt to a new tool to filter reads with adapters (HiFiAdapterFilt). They are similar but the command they used for cutadapt is different, so I try this one. [See bellow]
1b. **Reads Statistics (asmstats)** before and after trimming
submit_asmstats_new_DC_ccs.sh
```
#BEFORE TRIMMING
asmstats /path/total_ccs.fastq > /path/total_ccs.fastq.stats
#AFTER TRIMMING
asmstats /path/new_ccs_DC_noAdapt.fastq > /path/new_ccs_DC_noAdapt.fastq.stats
```
Before trimming: 2 977 781 ccs reads !
![](https://i.imgur.com/knhrB6i.png)
After trimming: 2 846 116 ccs reads with NO ADAPTERS!
![](https://i.imgur.com/YbdeJwS.png)
1c. **Fastqc** after triming adapters (IGNORE THIS CAUSE IT PERFORMS BETTER FOR SHORT READS - AND I HAVE LONG READS!)
submit_fastqc_new_DC_ccs.sh
[link](https://drive.google.com/file/d/1WAVGnbxmyaVA92VcGwzg3Zhp3QLc3zZL/view?usp=sharing) to report in GoogleDrive
```
fastqc -t 6 /path/new_ccs_DC_noAdapt.fastq -o /path/new_ccs_DC_noAdapt
```
![](https://i.imgur.com/8uOzkW3.png)
**PROBLEM:** Quality scores per base were higher (~70%) before running DeepConsensus than now (~50%) ! Also before quality was decreasing with position in read and now it increases until it reaches a platau.
![](https://i.imgur.com/L6uRlfp.png)
**PROBLEM:** Avergae quality per read was ~90 before and not ~50 !
![](https://i.imgur.com/wcECgWU.png)
![](https://i.imgur.com/OAKed1M.png)
![](https://i.imgur.com/YtAkn0N.png)
![](https://i.imgur.com/3qrEGvT.png)
1d. **Kmer analysis** (jellyfish + genomescope2)
submit_Jellyfish_new_DC_ccs_trimmed_k31.sh
```
jellyfish count -C -m 31 -s 1000000 -t 20 -o /path/new_ccs_DC_noAdapt.k31.jf /path/new_ccs_DC_noAdapt.fastq
jellyfish histo /path/new_ccs_DC_noAdapt.k31.jf > /path/new_ccs_DC_noAdapt.k31.histo
```
![](https://i.imgur.com/Er5ujYn.png)
Haploid genome length: 1 061 015 273 bp ~ 1.06 Gb
Heterozygosity: 0.978 %
Repts (100-uniq): 10 %
Parameters for purge_dups:
Estimated genome haploid size: xx bp
Transition Parameter (kmercov*1.5): 15.7*1.5-> 24 reads (integer)
Maximum read depth (Transition Par*3): 24*3 -> 72 reads (interger)
Reads Coverage: **35x**
(number of bases sequenced -sum value in .stats file-) / (genome size in genomescope2 according to kmer analysis)
(36 870 266 213 / 1 061 441 021)
CONCLUSION: **Reads coverage improved!**
### 1d.bis. Re-do kmer analysis
![](https://i.imgur.com/60Kj1KE.png)
Run again jellyfish histo command increasing parameter -h (because many kmers were grouped in the max default category) because the estimation for genome size can change.
```
jellyfish count -C -m 31 -s 1000000 -t 20 -o /path/new_ccs_DC_noAdapt.k31_h.jf /path/new_ccs_DC_noAdapt.fastq
jellyfish histo -h 100000000 /path/new_ccs_DC_noAdapt.k31_h.jf > /path/new_ccs_DC_noAdapt.k31_h.histo
```
![](https://i.imgur.com/7RIrBUG.png)
Estimated Genome Size increased:
![](https://i.imgur.com/9MfsEa7.png)
Haploid genome length: 1 101 204 876 bp ~ 1.1 Gb
Heterozygosity: 0.978 %
Repts (100-uniq): 13.2 %
Parameters for purge_dups:
Transition Parameter (kmercov*1.5): 15.7*1.5-> 24 reads (integer)
Maximum read depth (Transition Par*3): 24*3 -> 72 reads (interger)
Reads Coverage: **33.5x**
(number of bases sequenced -sum value in .stats file-) / (genome size in genomescope2 according to kmer analysis)
(36 870 266 213 / 1 101 204 876)
CONCLUSION: **Estimated haploid genome size and reads coverage both increased a little!**
---
### Re-do trimming
1a. cutadapt new command.
I found [this paper](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08375-1#MOESM4) comparing performance of cutadapt to a new tool to filter reads with adapters (HiFiAdapterFilt). They are similar but the command they used for cutadapt is different (apart from universal adapter, they trim as well a 2 primer, and they force a 100% overlap), so I try this one.
submit_cutadapt_new_ccs_DC_OtherAdapter.sh
```
cutadapt -j 10 -o /path/new_ccs_DC_noAdapt_Others.fastq /path/total_ccs.fastq -b AAAAAAAAAAAAAAAAAATTAACGGAGGAGGAGGA;min_overlap = 35 -b ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT;min_overlap = 45 --discard-trimmed --revcomp -e 0.1
```
Now only 315 reads (**0.0%**) were removed !!!
To find out which adapters were used in my seq run check subreadset.xml file (it is ok because they are universal, see page 4 [here](https://www.pacb.com/wp-content/uploads/2015/09/Guide-Pacific-Biosciences-Template-Preparation-and-Sequencing.pdf)).
1b. Reads statistics with asmstats on new trimmed reads
```
asmstats /path/new_ccs_DC_noAdapt_Others.fastq > /path/new_ccs_DC_noAdapt_Others.fastq.stats
```
![](https://i.imgur.com/qNsMlId.png)
1c. FASTQC after triming new adapters
<span style="color:red">THIS IS NOT IMPORTANT AS THIS TOOL IS FOR SHORT READS AND NOT LONG HIFI READS!!!!
</span>submit_fastqc_new_DC_ccs_noAdaptOthers.sh
```
fastqc -t 6 /path/new_ccs_DC_noAdapt_Others.fastq -o /path/FASTQC_new_ccs_DC_noAdaptOthers
```
![](https://i.imgur.com/qVTHjRA.png)
![](https://i.imgur.com/LwUSbxB.png)
![](https://i.imgur.com/4XoKbM6.png)
![](https://i.imgur.com/XrKl6xG.png)
![](https://i.imgur.com/hh7hbty.png)
![](https://i.imgur.com/NAdiKz7.png)
Per base quality scores are still worst than before DC.
1d. **Kmer analysis** (jellyfish + genomescope2)
submit_Jellyfish_new_DC_ccs_trimmed_Others_k31_h.sh
```
jellyfish count -C -m 31 -s 1000000 -t 20 -o /path/new_ccs_DC_noAdapt_Others.k31_h.jf /path/new_ccs_DC_noAdapt_Others.fastq
jellyfish histo -h 100000000 /path/new_ccs_DC_noAdapt_Others.k31_h.jf > /path/new_ccs_DC_noAdapt_Others.k31_h.histo
```
![](https://i.imgur.com/B9cD2Tb.png)
![](https://i.imgur.com/voIXcj8.png)
Haploid genome length: 1 100 327 273 bp ~ 1.1 Gb
Heterozygosity: 0.977 %
Repts (100-uniq): 13.2 %
*Parameters for purgedups*:
Transition Parameter (kmercov*1.5): 16.4*1.5-> 25 reads (integer)
Maximum read depth (Transition Par*3): 25*3 -> 75 reads (interger)
Reads Coverage: **35x**
(number of bases sequenced -sum value in .stats file-) / (genome size in genomescope2 according to kmer analysis)
(38570683500 / 1 100 327 273)
CONCLUSION: **Estimated haploid genome size and reads coverage both increased a little!**
1e. Plot reads length
script only works with fasta files.
```
module load lang/Anaconda3/2020.11
source activate biopython-env
export PYTHONPATH=$PYTHONPATH:/work/dominguez/YelCar_GenomeAssembly_MPI
python -m plot_fasta_length_v3 /path/new_ccs_DC_noAdapt_Others.fasta.gz /path/new_ccs_DC_noAdapt_Others.fastq.Plotlength.png
```
![](https://i.imgur.com/i3WjITh.png)
***
# 2. Genome assemblies
All with k31, changing purging parameter l0, l1 and l2
### 2a. Hifiasm
### Hifiasm l0
submit_hifiasm_new_ccs_DC_noAdaptOthers_l0_k31_primary.sh
```
hifiasm -t 38 -l0 --primary -o /path/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l0_k31.asm /path/new_ccs_DC_noAdapt_Others.fastq -k 31
# convert gfa to fa
/work/dominguez/YelCar_GenomeAssembly_MPI/gfa2fa /path/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l0_k31.asm.p_ctg.gfa > /path/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l0_k31.asm.p_ctg.fa
/work/dominguez/YelCar_GenomeAssembly_MPI/gfa2fa /path/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l0_k31.asm.a_ctg.gfa > /path/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l0_k31.asm.a_ctg.fa
# statistics for primary asm c1
module purge
module load lang/Anaconda3/2020.11
source activate seqtk-env
export PATH=$PATH:/work/dominguez/YelCar_GenomeAssembly_MPI/
asmstats /path/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l0_k31.asm.p_ctg.fa > /path/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l0_k31.asm.p_ctg.fa.stats
# statistics for asm alternate c2
asmstats /path/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l0_k31.asm.a_ctg.fa > /path/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l0_k31.asm.a_ctg.fa.stats
```
Statistics for primary assembly hifiasm l0:
![](https://i.imgur.com/00Yqga7.png)
Statistics for the alternate assembly hifiasm l0:
![](https://i.imgur.com/b1yU9rF.png)
### Hifiasm l1
Statistics for primary assembly hifiasm l1:
![](https://i.imgur.com/BLlyYDS.png)
Statistics for alternate assembly hifiasm l1:
![](https://i.imgur.com/3MFozHM.png)
### Hifiasm l2
Statistics for **primary** assembly hifiasm l2:
![](https://i.imgur.com/f3SQj3K.png)
*827 contigs, contig N50: 22.8 Mb*
![](https://hackmd.io/_uploads/rk2z0qoE3.png)
Statistics for **alternate** assembly hifiasm l2:
![](https://i.imgur.com/ZyfcAxP.png)
![](https://hackmd.io/_uploads/HkXbJjiE3.png)
CONCLUSION: Hifiasm primary l2 assmebly improved with DC reads comapred to before (less contigs, greater average read length and N50).
### 2b. Bandage
`rsync -vz --progress dominguez@n-hpc-login1.rz.uni-potsdam.de:/path/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l2_k31.asm.p_ctg.gfa .`
![](https://i.imgur.com/4tQPWDN.png)
some contigs with bubbles:
![](https://i.imgur.com/bT608Fr.png)
### 2c. Purging
Use purge_dups to purge false duplications that occured by incorrectly retaining some secondary alleles in the primary contig set.
**1st ROUND** (purge PRI p1 to obtain **c1**):
[submit_purge_dups_trimmed_c1.l0.k31_new_ccs_DC_trim_reads.sh](https://drive.google.com/file/d/1X-WiJmhxL2Lyu8tg40muE7u6Fv5EpSYn/view?usp=sharing)
![](https://i.imgur.com/lPEry7O.png)
[submit_purge_dups_trimmed_c1.l1.k31_new_ccs_DC_trim_reads.sh](https://drive.google.com/file/d/1yAohxDXsPZ57hjzYNojAJz2KWvVOqtFh/view?usp=sharing)
![](https://i.imgur.com/8BkncC9.png)
[submit_purge_dups_trimmed_c1.l2.k31_new_ccs_DC_trim_reads.sh](https://drive.google.com/file/d/1ZlT2JpcWGPR1Or9Vg__k2dhGw1YxpeAX/view?usp=sharing)
![](https://i.imgur.com/eckfyN3.png)
Hifiasm_**l2**_**purged**_VGP: ***519** contigs, N50-contigs= **26.4Mb***
**2nd ROUND** (purge ALT c2p2 to obtain **q2**):
[submit_purge_dups_l0.k31_default_c2p2_new_ccs_DC_trim.sh](https://drive.google.com/file/d/1c5p2KuZX7-Dwwt3tvwRtR-bIdpGWarDk/view?usp=sharing)
l1 hap_default.fa.stats
![](https://i.imgur.com/wKyt4H5.png)
l0 purged_default.fa (q2)
![](https://i.imgur.com/3A2HXY8.png)
[submit_purge_dups_l1.k31_default_c2p2_new_ccs_DC_trim.sh](https://drive.google.com/file/d/1aLgQPyh5AQleIGGHin66sjpbis1sfbPZ/view?usp=sharing)
l1 hap_default.fa.stats
![](https://i.imgur.com/S9shOFj.png)
l1 purged_default.fa.stats (q2)
![](https://i.imgur.com/sAA9q2q.png)
[submit_purge_dups_l2.k31_default_c2p2_new_ccs_DC_trim.sh](https://drive.google.com/file/d/1wa40lhJHUOFNLo0aLPrhAG8Q5P9bNRSp/view?usp=sharing)
l2 hap_default.fa.stats
![](https://i.imgur.com/zpj8FQP.png)
l2 purged_default.fa.stats (q2)
![](https://i.imgur.com/I37Pgfk.png)
## MuMmer
with C. parvulus:
![](https://hackmd.io/_uploads/BJhXxAiV2.png)
# 3. Genome Evaluation Pipeline (GEP)
[kate file](https://drive.google.com/file/d/19o-rhkO1YQJVQa4kv80ti2VRAOYyCqMy/view?usp=share_link)
New: this time using Busco lineage aves_odb10
To evaluate all my hifiasm asms (n=3), before and after purging (n=6).
**STEP 1**: building the **meryl kmer database**
https://github.com/marbl/merqury/wiki/1.-Prepare-meryl-dbs
This was done manually (slurm script) not using GEP:
script [here](https://drive.google.com/file/d/1Hq452-jNzdkgtT3gXIsnQYqfKvodgA3q/view?usp=share_link)
**STEP 2**. **Merqury**: compare asms before and after purging
purging l0, k31
```
#BEFORE purging PRI (but after hifiasm):
/work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed_DC/hifiasm_l0/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l0_k31.asm.p_ctg.fa
# BEFORE purging ALT:
/work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed_DC/hifiasm_l0/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l0_k31.asm.a_ctg.fa
# AFTER purging PRIM p1:
/work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed_DC/hifiasm_l0/hifiasm_l0_purge_dups/c1.l0.k31/purged_default.fa
# AFTER PURGING ALT q2 (q2 is the purged.fa after 2nd round of purge_dups), default:
/work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed_DC/hifiasm_l0/hifiasm_l0_purge_dups/c1.l0.k31/q2_default/purged_default.fa
```
purging l1, k31
```
#BEFORE purging PRI (but after hifiasm):
/work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed_DC/hifiasm_l1/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l1_k31.asm.p_ctg.fa
# BEFORE purging ALT:
/work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed_DC/hifiasm_l1/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l1_k31.asm.a_ctg.fa
# AFTER purging PRIM p1:
/work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed_DC/hifiasm_l1/hifiasm_l1_purge_dups/c1.l1.k31/purged_default.fa
# AFTER PURGING ALT q2, default:
/work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed_DC/hifiasm_l1/hifiasm_l1_purge_dups/c1.l1.k31/q2_default/purged_default.fa
```
purging l2, k31
```
#BEFORE purging PRI (but after hifiasm):
/work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed_DC/hifiasm_l2/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l2_k31.asm.p_ctg.fa
# BEFORE purging ALT:
/work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed_DC/hifiasm_l2/YelCar_hifiasm_new_ccs_DC_noAdaptOthers_l2_k31.asm.a_ctg.fa
# AFTER purging PRIM p1:
/work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed_DC/hifiasm_l2/hifiasm_l2_purge_dups/c1.l2.k31/purged_default.fa
# AFTER PURGING ALT q2, default:
/work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed_DC/hifiasm_l2/hifiasm_l2_purge_dups/c1.l2.k31/q2_default/purged_default.fa
```
running GEP (only worked in default mode-not slurm)
-first tried only comparing before and after purging of l0
files:
[runEval_Hifiasm_l0.tsv](https://drive.google.com/file/d/14PeWb2Gj1onQDn9PgKbWYXDPnV0P4c25/view?usp=share_link)
[config.yaml](https://drive.google.com/file/d/1c1kVeuJnNcFBcclNx-VnIa7GLL51WzoP/view?usp=share_link)
```
nohup snakemake --profile SUBMIT_CONFIG/default/ --cores 144 --jobs 20 --printshellcmds > output_020523_EvalMode.log 2>&1 &
```
[GEP_FINAL_REPORT_l0.pdf](https://drive.google.com/file/d/1-kD-X5SsTcroy8drJ-jgJff5viwB4oot/view?usp=share_link)
hifiasm_l0_BEFOREpurging:
![](https://i.imgur.com/h5NOmxY.png)
hifiasm_l0_AFTERpurging:
![](https://i.imgur.com/5hNSqjS.png)
![](https://i.imgur.com/Gvfcdjb.png)
![](https://i.imgur.com/e7UpkkH.png)
-now run the other asms: before and after purging of l1,l2
[runEval_Hifiasm_l1_l2.tsv
](https://drive.google.com/file/d/1CYeMoUZNi88fuDh8wZ8Kfh8Zil2ZiVXN/view?usp=share_link)
[config.yaml](https://drive.google.com/file/d/1wpCSDB8pqbGKfOQvpBN4Hb4FQD-yvxUe/view?usp=share_link)
```
module load lang/Anaconda3/2020.11
source activate GEP
nohup snakemake --profile SUBMIT_CONFIG/default/ --cores 144 --jobs 20 --printshellcmds > output_020523_EvalMode_l1_l2.log 2>&1 &
```
[REPORT](https://drive.google.com/file/d/1RS1r3Qut48nYEQniZfwPMRyTZH_B9jUs/view?usp=share_link)
hifiasm_l1_BEFOREpurging:
![](https://i.imgur.com/zg5hGxa.png)
hifiasm_l1_AFTERpurging:
![](https://i.imgur.com/zi5hHv6.png)
hifiasm_l2_BEFOREpurging:
![](https://i.imgur.com/IL9BrGD.png)
hifiasm_l2_AFTERpurging:
![](https://i.imgur.com/aTln0qV.png)
![](https://i.imgur.com/4JltTgp.png)
![](https://i.imgur.com/DIoeslT.png)
BEST ASM: hifiasm_l2_AFTERpurging
***
### NETX STEPS:
1a. assembly mitochondrial genome: MitoHiFi, https://github.com/VGP/vgp-assembly/tree/master/mitoVGP
1b. annotate with MITOS webserver ? https://pubmed.ncbi.nlm.nih.gov/22982435/
2. Repeats annotation: https://www.gensas.org/tools
3. Genes Annotation using the taxonomically closest sp (GEMOMA) or ab initio prediction (MAKER2). Maker might be better as gemoma will ignore divergent genes. Use all protein database for vertebrates (or birds) in NCBI and *find transcriptome data of close species*.
4. Contamination check (megaBLAST, blobtools, KAT): DONE but not with blobtools (no permissions to install it on HPC, try in Alan).
5. Hi-C for scaffolding ($?, MPI-Dreden: 2.393,05 Euro for the pull-downs and library prep. For Illumina sequencing, calculate the desired coverage with 10,8e/Gb)
6. **DeepVariant to polish**: see https://www.science.org/doi/10.1126/sciadv.abm6494. After DV, use VCF outfile to run [bcftools consensus](https://samtools.github.io/bcftools/howtos/consensus-sequence.html) to correct base errors.
7. circulomics DNA isolation of diuca sample
### IDEAS
- Repeat masking + gene annotation can be done automatic if I submit (https://www.ebi.ac.uk/submission/) the genome to ENSEMBL: https://rapid.ensembl.org/index.html (but takes forever), or use MAKER for *ab initio* + evidence- based prediction. I can use evidence (expressed sequence tags (ESTs) and protein databases) from zebra finch for training (like Campagna et al. 2017). It can repeat several times until you get good annotation.
- GEMOMA can annotate proteins if there is a close related species with good annotations (because very divergent regions will not be annotated)
- Instead of checking synteny with MuMMER, do a CIRCOS PLOT? Also for Mummer use Camarhumchus parvulus or zebra finch?
- Use the genome to infer demographic history (PSMC, ROH) and genome-wide heterozygosity (nucleotide diversity in sliding windows)
- Use **SatsumaSynteny** (https://github.com/bioinfologics/satsuma2) to assemble at chromosome level (without HiC) based on another related genome : but available birds genomes differ a lot in chr number!!! (Chromosome number conservation in passerine birds is not as well-conserved as it is in cetaceans). If doing it the closest species with genome assemble at chr level is *C. parvulus* (2n=xx), but zebra finch T. guttata is not so close but has a more similar chr n (2n=80). Also most songbirds have ~ 2n=80
- ![](https://i.imgur.com/b3paGF7.png)
in https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0170997
Campagna et al. 2017 also used this program for *Sporophila hypoxhanta* (closest sequenced species in same family as YelCar) also using zebra finch as model!!!
***