# Yellow Cardinal (Gubernatrix cristata) reference genome assembly (PacBio HiFi reads)
**Genome assembly naming** (according to Rhie et al. 2021, and corroborated in CGP species list - page 23): *bGubCri1.1*


*Photos of the individual being sequenced for species reference genome.*
## Background knowledge
Genome Assemblies of close species:

## Sample Collection
Sample ID: 5697
Sample type: **blood in 96% ethanol**
Sex: **female**
Collectors: Dr. Bettina Mahler and Lic. Melina Atencio in the rescue center Fundacion Temaiken (CRET)
Origin: confiscated, in Rescue Center since 16.10.2019. Assigned to the management unit of the east, where there is no hybridization (markers: mtDNA + 10 microsatellites),
Date of collection: **19.04.2021**
Date of sample delivery from Argentina to Germany: **19.04.2021**
Date of sample arrival to Uni Potsdam: **10.05.2021**
## DNA isolation
Date of DNA extraction: **17.05.2021**
Extraction kit for MHW DNA: Circulomics Nanobind CBB Big DNA
Protocol: [here](https://drive.google.com/file/d/1h-WXobb3CII9CpjwkXuXini435InkVqo/view?usp=sharing) but before 2X washes with PBS to remove ethanol:
80 ul of Ethanol/blood mix pipetted in Protein LoBind tubes.
1. Spin 5 min 2200 g 4°C to remove ethanol
2. Add 500 ul of 1x PBS to wash. Spin 5 min 2200 g 4°C and remove supernatant. Repeat for a total of 2 washes.
3. Pellet is directly dissolved in 200 ul of 1x PBS.
I sent 3 extractions from same individual to be considered for sequencing.
## Sequencing
DRESDEN-concept Genome Center (MPI-Dresden)
Contact: Dr. Sylke Winkler (winkler@mpi-cbg.de)
Date of sample delivery to sequencing center: **20.07.2021**
Date of sample reception in the sequencing center: **21.07.2021**
Template Specimen Form: [here](https://drive.google.com/file/d/1UtUPy8uKe2T8J_iR-3ePsoxTK7461igF/view?usp=sharing)
Date of sequencing completion: 20.08.2021
Link to Data (mail Pippel): https://bds.mpi-cbg.de/pippelshare/bGubCri_qpw6S5CzT2zAUpy11I5kF2HfPAxgHk2s8GDy5YKMbn/
QC in sequencing center (extraction N3 chosen): [here](https://drive.google.com/file/d/1FUG46828MQJY2LbqZ3P4qJbRvu_qaztq/view?usp=sharing)
## PacBio HiFi reads pre-processing
1. **bam to fq**: bam2fastx using SMRTtool (also tested with bedtools 2.9)
[submit_bam2fastx_newHPC.slurm](https://drive.google.com/file/d/1lSYmctsZnRZmFoMAxNEUQFu0KblVdWWT/view?usp=sharing)
```
#1. index
module purge
module load lang/Anaconda3/2020.11
source activate pbbam-env
pbindex /path/m64046_210816_140027.ccs.bam
#2. convert bam to fq
module load lang/Anaconda3/2020.11
source activate bam2fastx-env
bam2fasta -u -o /path/bam2fastx/ /path/m64046_210816_140027.ccs.bam
bam2fastq -u -o /path/bam2fastx/ /path/m64046_210816_140027.ccs.bam
conda deactivate
```
*I used bam2fasta 1.3.1 and bam2fastq 1.3.1*
**2. Trimming of PacBio SMRT adapters** from reads (cutadapt) --> 4.5 % reads removed
[submitcutadaptnewHPC.slurm](https://drive.google.com/file/d/1_Ni22w8yN3suxcHC0J6GH7A7IbtARzht/view?usp=sharing)
```
cutadapt -j 10 -o /path/reads_noAdapt.fastq /path/reads.fastq -b ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT -b ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT --discard-trimmed
```
# Reads Statistics
**1. Fastqc on trimmed reads**
[submit_fastqc_trimmed_reads.slurm](https://drive.google.com/file/d/154yKNeLz8_4fHKZwD1vbQ0DNk19yD4uW/view?usp=sharing)
```
fastqc -t 6 /path/reads_noAdapt.fastq -o /path/fastqc_TrimmedReads
```
* link to full fastqc report **before** trimming sequencing adapters from reads [here](https://drive.google.com/file/d/1ppck5pY7ZYKCswjB4ArVzNu7WIARzEJw/view?usp=sharing)
* link to full fastqc report on **trimmed** reads [here](https://drive.google.com/file/d/1kO6I8tKme3-2Zy6u2LoTLcQKX13QKpDj/view?usp=sharing), main results:

**2. Plot reads length**
```
#not working in HPC, I try in local computer Alan (I limit the transfer to 1Mb/sec so it does not get stalled)
#Transfer from HPC to local (and then from local to Alan):
cd /home/Marisol
scp dominguez@n-hpc-login1.rz.uni-potsdam.de:/work/dominguez/YelCar_GenomeAssembly_MPI/reads/bam2fastx/reads_noAdapt.fasta .
#it was interrupted by lost connection
#resume transfer:
rsync -P --rsh=ssh dominguez@n-hpc-login1.rz.uni-potsk/dominguez/YelCar_GenomeAssembly_MPI/reads/bam2fastx/reads_noAdapt.fasta ./reads_noAdapt.fasta
#From local to Alan:
scp ./reads_noAdapt.fasta alan@141.89.108.18:/home/alL/reads_trimmed/
python plot_fasta_length_v3.py /home/alan/Desktop/MARISOL/reads_trimmed/reads_noAdapt.fasta /home/alan/Desktop/MARISOL/reads_trimmed/reads_noAdapt.fasta.length.png
```

**3. Reads statistics**
Count reads on .fasta file (N = 2 489 222)
```
grep ">" /path_to_reads/reads_noAdapt.fasta | wc -l
```
Count reads on .fastq file (N = 2 489 222)
```
expr $(cat /path_to_reads/reads_noAdapt.fastq | wc -l) / 4
```
Using asmstats script:
[submit_trimmed_asmstats.slurm]([https:/](https://drive.google.com/file/d/1Rjb8n7jA5L60mSb1NNQAwoh7TOv_cSKK/view?usp=sharing)/)
```
module purge
module load lang/Anaconda3/2020.11
source activate seqtk-env
export PATH=$PATH:/work/dominguez/YelCar_GenomeAssembly_MPI/
asmstats /path_to_reads/reads_noAdapt.fastq > /path/reads_statistics/reads_noAdapt.fastq.stats
```
Outfile reads_noAdapt.fastq.stats [here](https://drive.google.com/file/d/1_jhrUgOdlufE5bDcxJqbYsdVCj0q6FYI/view?usp=sharing).

**4. Kmer analysis on reads**
Assuming k21:
[submit_Jellyfish_trimmed_k21.slurm](https://drive.google.com/file/d/1id-VkCkCUNTB2nJIh49_3OmBZHUdoHnS/view?usp=sharing)
*I used jellyfish version 2.2.10*
```
module purge
module load lang/Anaconda3/2020.11
source activate jellyfish-env
#jellyfish count
jellyfish count -C -m 21 -s 1000000 -t 20 -o /path/reads_statistics/YelCar_reads_trim.k21.jf /path/reads/reads_noAdapt.fastq
#jellyfish histo
jellyfish histo /path/reads_statistics/YelCar_reads_trim.k21.jf > /path/reads_statistics/YelCar_reads_trim.k21.histo
```
Assuming k31:
[submit_Jellyfish_trimmed_k31.slurm](https://drive.google.com/file/d/1_GhZVkBATwnBDOngLZN4yhSVSpv1drkh/view?usp=sharing)
```
module purge
module load lang/Anaconda3/2020.11
source activate jellyfish-env
jellyfish count -C -m 31 -s 1000000 -t 20 -o /path/reads_statistics/YelCar_reads_trim.k31.jf /path/reads/reads_noAdapt.fastq
#jellyfish histo
jellyfish histo /path/reads_statistics/YelCar_reads_trim.k31.jf > /path/reads_statistics/YelCar_reads_trim.k31.histo
```
Open the histo files in [genomescope2 website](http://qb.cshl.edu/genomescope/genomescope2.0/)
[k21.histo](https://drive.google.com/file/d/1yvNxjOaxRKsiGfDcAfLmUArRstqzcmfZ/view?usp=sharing)

Link to genomescope2 results [here](http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=5d0pXxIZINtIM6igWHcg)
Kmer = 21
Size: 1 057 467 553 bp ~ 1.06 Gb
Herterozygosity: 1.1%
Repts (100-uniq): 11.8%
Parameters for purge_dups:
- Estimated genome haploid size: 1057467553 bp
- Transition Parameter (kmercov*1.5): 21.045 -> 21 reads (integer)
- Maximum read depth (Transition Par*3): 63.135 -> 63 reads (integer)
[k31.histo](https://drive.google.com/file/d/19zt3RybTbybSlnzphghbwDwRIS2cB8yV/view?usp=sharing)

Link to genomescope2 results [here](http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=VtMACah2eEURIERJjRmW)
Kmer = 31
Size: 1 058 655 883 bp ~ **1.06 Gb**
Herterozygosity: 0.99 %
Repts (100-uniq): 9.9 %
Parameters for purge_dups:
- Estimated genome haploid size: 1 058 655 883 bp
- Transition Parameter (kmercov*1.5): 20.7 -> 21 reads (integer)
- Maximum read depth (Transition Par*3): 62.1 -> 62 reads (interger)
**Reads Coverage: 30.3X**
(number of bases sequenced -sum value in .stats file-) / (genome size in genomescope2 according to kmer analysis)
(32036044402/1058655883)
The **estimated genome size** of both kmers is around **1.06 Gb**, this is also close to the **zebra finch** (Taeniopygia guttata, 1.056 Gb), and other birds like **small tree finc**h (Camarynchus parvulus, 1.064 Gb), and **medium ground finch** (Geospiza fortis, 1.065 Gb).

## Genome Assemblies
1. **Hifiasm** (https://github.com/chhylp123/hifiasm) with different parameters (l0,l1,l2 and nok, k21, k31)
*I used Hifiasm version 0.16.1-r375*
- scripts for assemblies *(change parameters -l and -k accordingly)*
```
module purge
module load lang/Anaconda3/2020.11
source activate hifiasm-env
hifiasm -t 28 -l2 -- primary -o /path/asm/hifiasm.trim.l2.k31.asm /path/reads/reads_noAdapt.fastq -k 31
#took Real time: 27160.206 sec; CPU: 922008.523 sec; Peak RSS: 54.377 GB requesting 38 cpus-per-task with 16G mem-per-cpu.
```
- script for statistics for the asm:
```
/path/gfa2fa /path/hifiasm.trim.l2.k31.asm.bp.p_ctg.gfa > /path/YelCar_hifiasm_trimmed_l2_k31.asm.p_ctg.fa
module purge
module load lang/Anaconda3/2020.11
source activate seqtk-env
export PATH=$PATH:/work/dominguez/YelCar_GenomeAssembly_MPI/
asmstats /path/YelCar_hifiasm_trimmed_l2_k31.asm.p_ctg.fa > /path/hifiasm.trim.l2.k31.asm.p_ctg.fa.stats
```
### Primary contigs viewed by Bandage:
Primmary assembly:
YelCar_hifiasm_trim_l2.k31.PRI.gfa

*Looks good, most of the contigs must do not have 2 branches.*
Zoom-in of all contigs I found with bubbles:

*(hopefully these will be solved after purging - I can not corroborate since purgedups does not provide a .gfa file)*
Alternative assembly:
YelCar_hifiasm_trimmed_l2_k31.asm.a_ctg.gfa

*It is more fragmented, and much more bubbles.*
- scripts for purging - 1st ROUND (purge PRI p1 to obtain **c1**)
```
module purge
module load lang/Anaconda3/2020.11
source activate purge_dups_env
## Change this: ##########################################################
OUT_DIR="/path/hifiasm/purge_dups"
NAME="c1.l2.k31"
ASM="/path/YelCar_hifiasm_trimmed_l2_k31.asm.p_ctg.fa"
ASM_NAME=$(basename $ASM .fa)
HIFI_DIR="/path_to_reads"
TRANS=21
MAX_DEPTH=62
CUTOFFS=$TRANS\_$MAX_DEPTH
###############################################################################
if [ ! -d $OUT_DIR/$NAME ]
then
mkdir -p $OUT_DIR/$NAME
fi
##
# Align HiFi reads and generate BAM files, then calculate read depth histogram and base-level read depth
cd $OUT_DIR/$NAME
for i in $(ls $HIFI_DIR/*_noAdapt.fastq.gz)
do
READ_NAME=$(basename $i .fastq.gz)
minimap2 -xasm20 -I 4G -t 24 $ASM $i | gzip -c - > $OUT_DIR/$NAME/$READ_NAME\.paf.gz
done
## -xmap-hifi is not in latest release. See https://github.com/lh3/minimap2/issues/739
## we will use the Reads mapped to contigs pairwise mapping format (PAF) file for calculating some statistics required in a later stage
## purge_dups initially produces a read-depth histogram from base-level coverages. This information is used for estimating the coverage cutoffs, taking into account that collapsed haplotype contigs will lead to reads from both alleles mapping to those contigs, whereas if the alleles have assembled as separate contigs, then the reads will be split over the two contigs, resulting in half the read-depth (Roach et al. 2018).
pbcstat *.paf.gz
# this produces PB.base.cov and PB.stat files
calcuts PB.stat > cutoffs_default 2>calcults_default.log
python3 /path/hist_plot.py -c cutoffs_default PB.stat PB_default.png
### VGP suggested cutoffs:
calcuts -m$TRANS -u$MAX_DEPTH PB.stat > cutoffs_$CUTOFFS
python3 /path/hist_plot.py -c cutoffs_$CUTOFFS PB.stat PB_$CUTOFFS\.png
##
# Split an assembly and do a self-self alignment
cd $OUT_DIR/$NAME
split_fa $ASM > $ASM_NAME\.split
minimap2 -t 18 -xasm5 -DP $ASM_NAME\.split $ASM_NAME\.split | pigz -p 6 -c - > $ASM_NAME\.split.self.paf.gz
#####
# Purge haplotigs and overlaps:
cd $OUT_DIR/$NAME
purge_dups -2 -T cutoffs_default -c PB.base.cov $ASM_NAME\.split.self.paf.gz > dups_default.bed 2> purge_dups_default.log
purge_dups -2 -T cutoffs_$CUTOFFS -c PB.base.cov $ASM_NAME\.split.self.paf.gz > dups_$CUTOFFS\.bed 2> purge_dups_$CUTOFFS\.log
##
# Get purged primary and haplotig sequences from draft assembly:
cd $OUT_DIR/$NAME
get_seqs -e dups_default.bed $ASM
mv purged.fa purged_default.fa
mv hap.fa hap_default.fa
get_seqs -e dups_$CUTOFFS\.bed $ASM
mv purged.fa purged_$CUTOFFS\.fa
mv hap.fa hap_$CUTOFFS\.fa
############################################################################
module purge
module load lang/Anaconda3/2020.11
source activate seqtk-env
export PATH=$PATH:/work/dominguez/YelCar_GenomeAssembly_MPI/
asmstats /path/hifiasm/purge_dups/c1.l2.k31/hap_21_62.fa > /path/hifiasm/purge_dups/c1.l2.k31/k31_hap_21_62.fa.stats
asmstats /path/hifiasm/purge_dups/c1.l2.k31/purged_21_62.fa > /path/hifiasm/purge_dups/c1.l2.k31/k31_purged_21_62.fa.stats
asmstats /path/hifiasm/purge_dups/c1.l2.k31/hap_default.fa > /path/hifiasm/purge_dups/c1.l2.k31/k31_hap_default.fa.stats
asmstats /path/hifiasm/purge_dups/c1.l2.k31/purged_default.fa > /path/hifiasm/purge_dups/c1.l2.k31/k31_purged_default.fa.stats
########################################################################
# "cutoffs" file has six numbers: the first is low coverage, any contig with an average coverge less than the number is a junk contig, the forth number is diploid (middle) coverage, the last number is high coverage for repeats, contigs with an average coverage over this number are usually repeats
# Next step: merge hap.fa and alt_asm and redo the above steps to get the haplotig set.
#job was ran with --cpus-per-task=24, #SBATCH --mem=64G. It took 15 min.
```
- scripts for purging - 2nd ROUND (purge ALT c2p2 to obtain **q2**)
- - using VGP suggested cutoffs:
```
module purge
module load lang/Anaconda3/2020.11
source activate purge_dups_env
## Change this: ##########################################################
OUT_DIR="/path/hifiasm/purge_dups"
NAME="c1.l2.k31"
ASM="/path/YelCar_hifiasm_trimmed_l2_k31.asm.p_ctg.fa"
ASM_NAME=$(basename $ASM .fa)
ALT="/path/YelCar_hifiasm_trimmed_l2_k31.asm.a_ctg.fa"
HIFI_DIR="/path_to_reads"
TRANS=21
MAX_DEPTH=62
CUTOFFS=$TRANS\_$MAX_DEPTH
###############################################################################
if [ ! -d $OUT_DIR/$NAME/q2_VGPcutoffs ]
then
mkdir -p $OUT_DIR/$NAME/q2_VGPcutoffs
fi
#Merge hap_.fa and ALT
##c2 is before purging so it is in another folder
cd $OUT_DIR/$NAME/q2_VGPcutoffs
cat $ALT >> c2p2.fa; cat /path/hifiasm/purge_dups/c1.l2.k31/hap_21_62.fa >> c2p2.fa
#align HiFi reads to c2p2 and generate BAM files, then calculate read depth histogram and base-level read depth
cd $OUT_DIR/$NAME/q2_VGPcutoffs
for i in $(ls $HIFI_DIR/*_noAdapt.fastq.gz)
do
READ_NAME=$(basename $i .fastq.gz)
minimap2 -xasm20 -I 4G -t 24 c2p2.fa $i | gzip -c - > $OUT_DIR/$NAME/q2_VGPcutoffs/$READ_NAME\.paf.gz
done
pbcstat *.paf.gz
### VGP suggested cutoffs:
calcuts -m$TRANS -u$MAX_DEPTH PB.stat > cutoffs_$CUTOFFS
python3 /path/hist_plot.py -c cutoffs_$CUTOFFS PB.stat PB_$CUTOFFS\.png
##
# Split an assembly and do a self-self alignment
cd $OUT_DIR/$NAME/q2_VGPcutoffs
split_fa c2p2.fa > c2p2.split
minimap2 -t 18 -xasm5 -DP c2p2.split c2p2.split | pigz -p 6 -c - > c2p2.split.self.paf.gz
#############
# Purge haplotigs and overlaps
cd $OUT_DIR/$NAME/q2_VGPcutoffs
purge_dups -2 -T cutoffs_$CUTOFFS -c PB.base.cov c2p2.split.self.paf.gz > dups_$CUTOFFS\.bed 2> purge_dups_$CUTOFFS\.log
# Get purged and haplotig sequences from draft assembly
cd $OUT_DIR/$NAME/q2_VGPcutoffs
get_seqs -e dups_$CUTOFFS\.bed c2p2.fa
mv purged.fa purged_$CUTOFFS\.fa
mv hap.fa hap_$CUTOFFS\.fa
#########################################################################
module purge
module load lang/Anaconda3/2020.11
source activate seqtk-env
export PATH=$PATH:/work/dominguez/YelCar_GenomeAssembly_MPI/
asmstats $OUT_DIR/$NAME/q2_VGPcutoffs/purged_$CUTOFFS\.fa > $OUT_DIR/$NAME/q2_VGPcutoffs/purged_$CUTOFFS\.fa.stats
asmstats $OUT_DIR/$NAME/q2_VGPcutoffs/hap_$CUTOFFS\.fa > $OUT_DIR/$NAME/q2_VGPcutoffs/hap_$CUTOFFS\.fa.stats
## requested --cpus-per-task=24, --mem=64G, and it took 15 min to run
```
Values based on **kmercov** from genomescope2:
For k31:
- Transition Parameter (kmercov*1.5): 20.7 -> 21 reads (integer)
- Maximum read depth (Transition Par*3): 62.1 -> **62** reads (interger)
For k21:
- Transition Parameter (kmercov*1.5): ): 21.045 -> 21 reads (integer)
- Maximum read depth (Transition Par*3): 63.135 -> **63** reads (interger)
- - using default values
```
module purge
module load lang/Anaconda3/2020.11
source activate purge_dups_env
## change this: ##########################################################
OUT_DIR="/path/hifiasm/purge_dups"
NAME="c1.l2.k31"
ASM="/path/YelCar_hifiasm_trimmed_l2_k31.asm.p_ctg.fa"
ASM_NAME=$(basename $ASM .fa)
ALT="/path/YelCar_hifiasm_trimmed_l2_k31.asm.a_ctg.fa"
HIFI_DIR="/path_to_reads"
###############################################################################
if [ ! -d $OUT_DIR/$NAME/q2_default ]
then
mkdir -p $OUT_DIR/$NAME/q2_default
fi
#Merge hap_.fa and ALT
##c2 is before purging so it is in another folder
cd $OUT_DIR/$NAME/q2_default
cat $ALT >> c2p2.fa; cat /hifiasm/purge_dups/c1.l2.k31/hap_default.fa >> c2p2.fa
# align HiFi reads to c2p2 and generate BAM files, then calculate read depth histogram and base-level read depth
cd $OUT_DIR/$NAME/q2_default
for i in $(ls $HIFI_DIR/*_noAdapt.fastq.gz)
do
READ_NAME=$(basename $i .fastq.gz)
minimap2 -xasm20 -I 4G -t 24 c2p2.fa $i | gzip -c - > $OUT_DIR/$NAME/q2_default/$READ_NAME\.paf.gz
done
pbcstat *.paf.gz
calcuts PB.stat > cutoffs_default 2>calcults_default.log
python3 /path/hist_plot.py -c cutoffs_default PB.stat PB_default.png
# Split an assembly and do a self-self alignment
cd $OUT_DIR/$NAME/q2_default
split_fa c2p2.fa > c2p2.split
minimap2 -t 18 -xasm5 -DP c2p2.split c2p2.split | pigz -p 6 -c - > c2p2.split.self.paf.gz
# Purge haplotigs and overlaps
cd $OUT_DIR/$NAME/q2_default
purge_dups -2 -T cutoffs_default -c PB.base.cov c2p2.split.self.paf.gz > dups_default.bed 2> purge_dups_default.log
# Get purged and haplotig sequences from draft assembly
cd $OUT_DIR/$NAME/q2_default
get_seqs -e dups_default.bed c2p2.fa
mv purged.fa purged_default.fa
mv hap.fa hap_default.fa
#########################################################################
module purge
module load lang/Anaconda3/2020.11
source activate seqtk-env
export PATH=$PATH:/work/dominguez/YelCar_GenomeAssembly_MPI/
asmstats $OUT_DIR/$NAME/q2_default/purged_default.fa > $OUT_DIR/$NAME/q2_default/purged_default.fa.stats
asmstats $OUT_DIR/$NAME/q2_default/hap_default.fa > $OUT_DIR/$NAME/q2_default/hap_default.fa.stats
## requested --cpus-per-task=24, --mem=64G, and it took 15 min to run
```
2. IPA
https://github.com/PacificBiosciences/pbipa
*I used ipa (wrapper) version=1.3.1*
* script for the assembly of the primmary (PRI) and alternative (ALT) assemblies
```
module purge
module load lang/Anaconda3/2020.11
source activate snakemake-env2
source activate ipa-env
ipa local --nthreads 60 --njobs 1 -i /path/reads/filename_ccs_reads.fa
# It was ran requesting 50G of mem and lasted 2 days to finish.
```
- - 1st ROUND of PURGING (purge PRI p1 to obtain **c1**)
*I used purgedups v1.2.5*
https://github.com/dfguan/purge_dups
```
module purge
module load lang/Anaconda3/2020.11
source activate purge_dups_env
## Change this: ##########################################################
OUT_DIR="/path/IPA/purge_dups"
NAME="IPA_c1.purged"
ASM="/path/IPA/final.p_ctg.fasta"
ASM_NAME=$(basename $ASM .fa)
HIFI_DIR="/path_to_reads"
###############################################################################
if [ ! -d $OUT_DIR/$NAME ]
then
mkdir -p $OUT_DIR/$NAME
fi
# Align HiFi reads and generate BAM files, then calculate read depth histogram and base-level read depth
cd $OUT_DIR/$NAME
for i in $(ls $HIFI_DIR/*_noAdapt.fastq.gz)
do
READ_NAME=$(basename $i .fastq.gz)
minimap2 -xasm20 -I 4G -t 24 $ASM $i | gzip -c - > $OUT_DIR/$NAME/$READ_NAME\.paf.gz
done
## -xmap-hifi is not in latest release. See https://github.com/lh3/minimap2/issues/739
## we will use the Reads mapped to contigs pairwise mapping format (PAF) file for calculating some statistics required in a later stage
## purge_dups initially produces a read-depth histogram from base-level coverages. This information is used for estimating the coverage cutoffs, taking into account that collapsed haplotype contigs will lead to reads from both alleles mapping to those contigs, whereas if the alleles have assembled as separate contigs, then the reads will be split over the two contigs, resulting in half the read-depth (Roach et al. 2018).
pbcstat *.paf.gz #(produces PB.base.cov and PB.stat files)
calcuts PB.stat > cutoffs_default 2>calcults_default.log
python3 /path/hist_plot.py -c cutoffs_default PB.stat PB_default.png
# Split an assembly and do a self-self alignment:
cd $OUT_DIR/$NAME
split_fa $ASM > $ASM_NAME\.split
minimap2 -t 18 -xasm5 -DP $ASM_NAME\.split $ASM_NAME\.split | pigz -p 6 -c - > $ASM_NAME\.split.self.paf.gz
# Purge haplotigs and overlaps:
cd $OUT_DIR/$NAME
purge_dups -2 -T cutoffs_default -c PB.base.cov $ASM_NAME\.split.self.paf.gz > dups_default.bed 2> purge_dups_default.log
# Get purged primary and haplotig sequences from draft assembly:
cd $OUT_DIR/$NAME
get_seqs -e dups_default.bed $ASM
mv purged.fa purged_default.fa
mv hap.fa hap_default.fa
############################################################################
module purge
module load lang/Anaconda3/2020.11
source activate seqtk-env
export PATH=$PATH:/work/dominguez/YelCar_GenomeAssembly_MPI/
asmstats $OUT_DIR/$NAME/hap_default.fa > $OUT_DIR/$NAME/IPA_hap_default.fa.stats
asmstats $OUT_DIR/$NAME/purged_default.fa > $OUT_DIR/$NAME/IPA_purged_default.fa.stats
#Next step: Merge hap.fa and alt_asm and redo the above steps to get the haplotig set.
# It was ran requesting --cpus-per-task=24, --mem=64G. It took 10 min.
```
- - 2nd ROUND of PURGING (purge ALT c2p2 to obtain **q2**)
```
module purge
module load lang/Anaconda3/2020.11
source activate purge_dups_env
## Change this: ##########################################################
OUT_DIR="/path/reads_trimmed/IPA/purge_dups/"
NAME="IPA_c1.purged"
ASM="/path_to_PRIasm/final.p_ctg.fasta"
ASM_NAME=$(basename $ASM .fa)
ALT="/path_to_ALTasm/final.a_ctg.fasta"
HIFI_DIR="/path_to_reads"
###############################################################################
if [ ! -d $OUT_DIR/$NAME/q2 ]
then
mkdir -p $OUT_DIR/$NAME/q2
fi
# Merge hap_.fa and ALT
##c2 is before purging so it is in another folder
cd $OUT_DIR/$NAME/q2
cat $ALT >> c2p2.fa; cat /path/IPA/purge_dups/IPA_c1.purged/hap_default.fa >> c2p2.fa
# Align HiFi reads to c2p2 and generate BAM files, then calculate read depth histogram and base-level read depth:
cd $OUT_DIR/$NAME/q2
for i in $(ls $HIFI_DIR/*_noAdapt.fastq.gz)
do
READ_NAME=$(basename $i .fastq.gz)
minimap2 -xasm20 -I 4G -t 24 c2p2.fa $i | gzip -c - > $OUT_DIR/$NAME/q2/$READ_NAME\.paf.gz
done
pbcstat *.paf.gz
calcuts PB.stat > cutoffs_default 2>calcults_default.log
python3 /path/hist_plot.py -c cutoffs_default PB.stat PB_default.png
# Split an assembly and do a self-self alignment:
cd $OUT_DIR/$NAME/q2
split_fa c2p2.fa > c2p2.split
minimap2 -t 18 -xasm5 -DP c2p2.split c2p2.split | pigz -p 6 -c - > c2p2.split.self.paf.gz
# Purge haplotigs and overlaps:
cd $OUT_DIR/$NAME/q2
purge_dups -2 -T cutoffs_default -c PB.base.cov c2p2.split.self.paf.gz > dups_default.bed 2> purge_dups_default.log
# Get purged and haplotig sequences from draft assembly:
cd $OUT_DIR/$NAME/q2
get_seqs -e dups_default.bed c2p2.fa
mv purged.fa purged_default.fa
mv hap.fa hap_default.fa
#########################################################################
module purge
module load lang/Anaconda3/2020.11
source activate seqtk-env
export PATH=$PATH:/work/dominguez/YelCar_GenomeAssembly_MPI/
asmstats $OUT_DIR/$NAME/q2/purged_default.fa > $OUT_DIR/$NAME/q2/purged_default.fa.stats
asmstats $OUT_DIR/$NAME/q2/hap_default.fa > $OUT_DIR/$NAME/q2/hap_default.fa.stats
# It was ran with --cpus-per-task=24, #SBATCH --mem=24G and took 10 min.
```
## Genome Evaluation Pipeline (GEP)
https://git.imp.fu-berlin.de/cmazzoni/GEP
Developers: James Sullivan and Dr. Camila Mazzoni.
STEP 1: building the meryl kmer database
```
module load lang/Anaconda3
source activate GEP
nohup snakemake --profile SUBMIT_CONFIG/slurm --cores 20 &
```
STEP 2: run the genome asm evaluation
```
module load lang/Anaconda3
source activate GEP
nohup snakemake --profile SUBMIT_CONFIG/slurm --cores 80 --jobs 5 &
```
| Assembly | hifiasm_l0_nok | purgedups_l0_nok_default | purgedups_l0_nok_VGPcutoffs | hifiasm_l0_k21 | purgedups_l0_k21_default | purgedups_l0_k21_VGPcutoffs | hifiasm_l0_k31 | purgedups_l0_k31_default | purgedups_l0_k31_VGPcutoffs | hifiasm_l1_nok | purgedups_l1_nok_default | purgedups_l1_nok_VGPcutoffs | hifiasm_l1_k21 | purgedups_l1_k21_default | purgedups_l1_k21_VGPcutoffs | hifiasm_l1_k31 | purgedups_l1_k31_default | purgedups_l1_k31_VGPcutoffs | hifiasm_l2_nok | purgedups_l2_nok_default | purgedups_l2_nok_VGPcutoffs | hifiasm_l2_k21 | purgedups_l2_k21_default | purgedups_l2_k21_VGPcutoffs | hifiasm_l2_k31 | purgedups_l2_k31_default | purgedups_l2_k31_VGPcutoffs | IPA | IPA_purged |
|:-----------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:-----------------|:---------------------------|:------------------------------|:--------------|:--------------|
| total_num_bases | 2,224,168,269 | 1,351,576,889 | 1,188,367,148 | 2,222,471,140 | 1,346,836,325 | 1,182,863,508 | 2,223,097,156 | 1,355,784,751 | 1,182,862,056 | 1,332,099,590 | 1,176,726,747 | 1,176,647,996 | 1,331,886,907 | 1,180,543,672 | 1,180,465,568 | 1,330,877,578 | 1,179,270,444 | 1,177,972,303 | 1,245,686,462 | 1,165,793,779 | 1,165,793,779 | 1,244,971,205 | 1,172,156,874 | 1,172,156,874 | 1,243,890,104 | 1,164,967,086 | 1,164,967,086 | 1,080,623,443 | 1,070,029,210 |
| size_estimate | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 | 1,058,734,261 |
| max_heterozygosity | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% | 0.997831% |
| GC_Content | 43.4626 | 43.1192 | 43.3178 | 43.4481 | 43.1054 | 43.2881 | 43.4551 | 43.0657 | 43.2764 | 43.9681 | 43.2168 | 43.2159 | 43.9665 | 43.2379 | 43.2376 | 43.9685 | 43.2309 | 43.2214 | 43.7381 | 43.1243 | 43.1243 | 43.7162 | 43.1699 | 43.1699 | 43.7021 | 43.1111 | 43.1111 | 42.4761 | 42.3778 |
| total_num_scaffolds | 2,757 | 1,215 | 1,083 | 2,781 | 1,245 | 1,115 | 2,754 | 1,205 | 1,073 | 1,261 | 798 | 794 | 1,311 | 830 | 829 | 1,288 | 793 | 786 | 972 | 616 | 616 | 1,005 | 650 | 650 | 978 | 597 | 597 | 575 | 490 |
| total_num_contigs | 2,757 | 1,215 | 1,083 | 2,781 | 1,245 | 1,115 | 2,754 | 1,205 | 1,073 | 1,261 | 798 | 794 | 1,311 | 830 | 829 | 1,288 | 793 | 786 | 972 | 616 | 616 | 1,005 | 650 | 650 | 978 | 597 | 597 | 590 | 501 |
| number_of_gaps | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 11 |
| Longest_scaffold | 46,214,770 | 46,214,770 | 46,214,770 | 46,214,784 | 46,214,784 | 46,214,784 | 46,214,776 | 46,214,776 | 46,214,776 | 50,262,241 | 50,262,241 | 50,262,241 | 50,372,431 | 50,372,431 | 50,372,431 | 50,428,791 | 50,428,791 | 50,428,791 | 64,130,911 | 64,130,911 | 64,130,911 | 65,868,658 | 65,868,658 | 65,868,658 | 69,566,365 | 69,566,365 | 69,566,365 | 33,552,962 | 33,552,962 |
| Scaffold_N50_length | 2,955,405 | 7,258,968 | 8,379,364 | 2,998,866 | 7,383,908 | 8,477,060 | 2,998,866 | 7,258,958 | 8,379,360 | 10,709,157 | 14,151,982 | 14,151,982 | 11,908,915 | 14,446,381 | 14,446,381 | 12,368,032 | 14,424,108 | 15,226,941 | 17,289,722 | 17,435,227 | 17,435,227 | 17,291,129 | 17,435,228 | 17,435,228 | 17,435,235 | 21,431,074 | 21,431,074 | 8,893,208 | 8,893,208 |
| Scaffold_NG50_length | 10,045,395 | 10,045,395 | 10,045,395 | 10,097,027 | 10,072,371 | 9,722,664 | 10,111,749 | 10,111,749 | 10,045,389 | 15,882,063 | 15,882,063 | 15,882,063 | 16,399,956 | 16,399,956 | 16,399,956 | 16,399,925 | 16,399,925 | 16,399,925 | 21,425,933 | 21,425,933 | 21,425,933 | 21,431,078 | 21,431,078 | 21,431,078 | 22,844,882 | 22,844,882 | 22,844,882 | 8,901,910 | 8,901,910 |
| Scaffold_N95_length | 232,223 | 362,570 | 374,275 | 228,491 | 340,168 | 317,809 | 231,040 | 364,671 | 364,465 | 312,400 | 596,648 | 596,648 | 276,936 | 542,153 | 542,153 | 299,635 | 557,635 | 559,069 | 445,326 | 891,889 | 891,889 | 407,217 | 733,833 | 733,833 | 436,642 | 879,043 | 879,043 | 533,164 | 660,386 |
| Scaffold_NG95_length | 3,485,820 | 2,473,102 | 1,574,909 | 3,496,061 | 2,473,082 | 1,604,414 | 3,554,867 | 2,531,255 | 1,574,891 | 2,730,996 | 2,153,208 | 2,153,208 | 2,698,482 | 2,219,116 | 2,219,116 | 2,698,460 | 2,219,116 | 2,219,116 | 3,399,398 | 2,986,878 | 2,986,878 | 3,320,541 | 3,236,925 | 3,236,925 | 3,386,191 | 3,231,517 | 3,231,517 | 868,503 | 868,503 |
| Longest_contig | 46,214,770 | 46,214,770 | 46,214,770 | 46,214,784 | 46,214,784 | 46,214,784 | 46,214,776 | 46,214,776 | 46,214,776 | 50,262,241 | 50,262,241 | 50,262,241 | 50,372,431 | 50,372,431 | 50,372,431 | 50,428,791 | 50,428,791 | 50,428,791 | 64,130,911 | 64,130,911 | 64,130,911 | 65,868,658 | 65,868,658 | 65,868,658 | 69,566,365 | 69,566,365 | 69,566,365 | 33,552,962 | 33,552,962 |
| Contig_N50_length | 2,955,405 | 7,258,968 | 8,379,364 | 2,998,866 | 7,383,908 | 8,477,060 | 2,998,866 | 7,258,958 | 8,379,360 | 10,709,157 | 14,151,982 | 14,151,982 | 11,908,915 | 14,446,381 | 14,446,381 | 12,368,032 | 14,424,108 | 15,226,941 | 17,289,722 | 17,435,227 | 17,435,227 | 17,291,129 | 17,435,228 | 17,435,228 | 17,435,235 | 21,431,074 | 21,431,074 | 8,893,208 | 8,893,208 |
| Contig_NG50_length | 10,045,395 | 10,045,395 | 10,045,395 | 10,097,027 | 10,072,371 | 9,722,664 | 10,111,749 | 10,111,749 | 10,045,389 | 15,882,063 | 15,882,063 | 15,882,063 | 16,399,956 | 16,399,956 | 16,399,956 | 16,399,925 | 16,399,925 | 16,399,925 | 21,425,933 | 21,425,933 | 21,425,933 | 21,431,078 | 21,431,078 | 21,431,078 | 22,844,882 | 22,844,882 | 22,844,882 | 8,901,910 | 8,901,910 |
| Contig_N95_length | 232,223 | 362,570 | 374,275 | 228,491 | 340,168 | 317,809 | 231,040 | 364,671 | 364,465 | 312,400 | 596,648 | 596,648 | 276,936 | 542,153 | 542,153 | 299,635 | 557,635 | 559,069 | 445,326 | 891,889 | 891,889 | 407,217 | 733,833 | 733,833 | 436,642 | 879,043 | 879,043 | 527,794 | 625,785 |
| Contig_NG95_length | 3,485,820 | 2,473,102 | 1,574,909 | 3,496,061 | 2,473,082 | 1,604,414 | 3,554,867 | 2,531,255 | 1,574,891 | 2,730,996 | 2,153,208 | 2,153,208 | 2,698,482 | 2,219,116 | 2,219,116 | 2,698,460 | 2,219,116 | 2,219,116 | 3,399,398 | 2,986,878 | 2,986,878 | 3,320,541 | 3,236,925 | 3,236,925 | 3,386,191 | 3,231,517 | 3,231,517 | 856,071 | 856,071 |
| qv_score | 61.9987 | 62.0147 | 61.6227 | 61.945 | 62.0304 | 61.624 | 62.0604 | 62.1082 | 61.6803 | 60.8608 | 61.6912 | 61.6909 | 60.8165 | 61.7131 | 61.7129 | 60.8343 | 61.617 | 61.635 | 60.5357 | 61.342 | 61.342 | 60.389 | 61.3572 | 61.3572 | 60.4856 | 61.3088 | 61.3088 | 56.3843 | 57.2844 |
| kmer_completeness | 99.3468 | 85.5233 | 82.6667 | 99.3354 | 85.4438 | 82.561 | 99.3428 | 85.6085 | 82.5495 | 84.2954 | 82.507 | 82.5067 | 84.3931 | 82.6174 | 82.6171 | 84.4225 | 82.6217 | 82.6059 | 83.2772 | 82.4928 | 82.4928 | 83.3569 | 82.6469 | 82.6469 | 83.3613 | 82.5778 | 82.5778 | 81.3829 | 81.2586 |
| Complete_Buscos | 96.2% | 95.8% | 95.8% | 96.3% | 95.8% | 95.9% | 96.3% | 95.8% | 95.7% | 96.0% | 95.8% | 95.8% | 96.0% | 95.9% | 95.9% | 96.0% | 95.8% | 95.8% | 96.0% | 95.9% | 95.9% | 96.0% | 95.9% | 95.9% | 96.1% | 96.0% | 96.0% | 95.6% | 95.7% |
| Complete_single_buscos | 16.8% | 78.6% | 94.8% | 16.9% | 78.7% | 94.7% | 16.8% | 78.4% | 94.7% | 87.1% | 94.7% | 94.7% | 87.5% | 94.8% | 94.8% | 87.2% | 94.8% | 94.8% | 93.6% | 94.9% | 94.9% | 93.7% | 94.9% | 94.9% | 93.8% | 95.0% | 95.0% | 94.6% | 94.7% |
PDF report for each asm evaluated [here](https://drive.google.com/file/d/1emPu1VEXbFs84HnUCY7okc1EiCcOWlEU/view?usp=sharing)
*Achtung: the K-mer Multiplicity PRI (stacked) plot is wrong - the plot is actually showing the flattened pri+alt. The correct plot is in the merquey folder: asmnamemerqOutput.spectra-cn.st.png*
* Notes on [merqury](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02134-9) results:
-**Merqury copy number spectrum plots** (similar to KAT):
tracks the multiplicity of each k-mer found in the reads and colors it by the number of times it is found in a given assembly
->when using hifiasm.l0, the plots for the primmary asm have also a blue peak (denoting kmers appearing twice), which should have not been there since the asm is already phased. For hifiasm.l1 too, but after purging the blue is gone.
->no asms showed a bar at the origin of the plot representing kmers found only in the asm
-> the higher fraction of k-mers missing in the assembly (black peak, read-only) seen in all asm plots probably reflects kmers specific to the other haplotig! This is corroborated when looking at the plot flattened pri+alt, in which the grey peak of read-only kmers disappears.
-**kmer completeness**: first filter out low-copy k-mers that are likely to represent sequencing errors, and then define k-mer completeness as the fraction of reliable k-mers in the read set that are also found in the assembly.
-**qv-score**:(Phred quality score or kmer survival rate) assumes that kmers found only in the asm are errors. It depends on the coverage, quality of the read set, and kmer size chosen. Higher QVs indicate a more accurate consensus.
[FullTable_asms_comparisons.xlsx](https://docs.google.com/spreadsheets/d/1RD4zSMCuin1e0ovnP5k5UIenwJLs5cFM/edit?usp=sharing&ouid=108013701707137496295&rtpof=true&sd=true)
* Criteria for "good" assembly:
| Criterion | Best assembly | Best value |
| -------- | -------- | -------- |
| Contig N50>10Mb | **hifiasm.l2.k31.purged** (both with default or VGP suggested values) | 21 Mb |
| Busco completness>95% | all, but best: hifiasm.l0 (before purging), and **hifiasm.l2.k31.purged** (both with default or VGP suggested values) | 96% |
| Busco duplicated rate<3% | **all**, except hifiasm.l0 and l1 before purging, and l0 after purging with default values | 1% |
| < Number of contigs | IPA, IPApurged, **hifiasm.l2.k31.purged** (default and
| > qv-score | hifiasm.l0.k31.purged.default | 62.11 |
| > kmer completness | hifiasm.l0.k31 | 99.34 |
| copy number spectrum plots | all hifiasm.l1.purged, hifiasm.l2 and all **hifiasm.l2.purged** | no blue peak, no bar of kmers found only in the asm |
copy number spectrum plots
## Best assembly
All the QC metrics are better for asm **hifiasm.l2.k31.purged** except the qv-score and kmer completness.


*****
### Other assemblers I tried: Hicanu
https://github.com/marbl/canu/releases
```
module purge
module load lang/Anaconda3/2020.11
source activate java11-env
export PATH=/work/dominguez/YelCar_GenomeAssembly_MPI/canu-2.1.1/bin:$PATH
canu -assemble -d /path/hicanu_FatNode/ -p YelCar_hicanu_assemble genomeSize=1.06g -pacbio-hifi /path_ro_reads/reads_noAdapt.fastq useGrid=false
#change genomeSize for what was predicted by genomescope2
#Requested --mem=150G, --cpus-per-task=8, and took xx to finish.
```
*****
### Other assemblers I tried: LJA
https://github.com/AntonBankevich/LJA
```
module purge
module load lib/zlib/1.2.11-GCCcore-10.2.0
module load lang/Anaconda3
source activate cmake-env
cd /path/LJA/
/work/dominguez/LJA/bin/lja -o /path/LJA_fatnode_Cintel/ --reads /path_to_reads/reads_noAdapt.fastq --threads 8 --diploid
```
The job to assemble the hifi reads seems not to use much memory over most of its runtime. It seems that it is only at the end that it performs a huge allocation. And it is killed because of that. So, I then ran it in our "fat" node (where I can request --mem=1500G). In this case, the job failed after 14 days of running due to a time limitation. I, then, tested it with a shorter version of the infile, requesting 24 CPUs (--mem-per-cpu=16G). It took 1h and then also failed in the step of uncompressing homopolymers in contigs. It produced the expected output files except from the final fasta assembly and final gfa.
*when fatnode is free, try: submitLJAtrimmedFatNodeintelv2.slurm*
*****
### Other assemblers: Rust-mdbg
https://github.com/ekimb/rust-mdbg
I did not try it given that it does not resolve both haplotypes.
*****
### MUMMER
https://github.com/mummer4/mummer
https://mummer4.github.io/manual/manual.html
**Compare my best asm with a reference from a related species genome.**
There are no VGP genomes in the same family (Thraupidae).
I could use another family, like:
Zebra finch genome VGP([GCF_003957565.2_bTaeGut1.4.pri](https://www.ncbi.nlm.nih.gov/assembly/GCF_003957565.2/))...
or a species of the same family (but not so good asm), like **small tree finch **([GCA_902806625.1_Camarhynchus_parvulus_V1.1](https://https://www.ncbi.nlm.nih.gov/assembly/GCA_902806625.1/)), and medium ground finch ([GCF_000277835.1_GeoFor_1.0](https://www.ncbi.nlm.nih.gov/assembly/GCF_000277835.1/)).
The genome asm for the small tree finch (Camarhynchus parvulus) is slightly better (genome coverage 31X, less gaps).
all-vs-all comparison of nucleotide sequences contained in multi-Fasta data files. It is best used for similar sequences that may have large rearrangements.
download Camarhynchus parvulus (Campar) asm from ncbi (in terminal):
```
wget -O Campar_genome1.1.fna.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/806/625/GCA_902806625.1_Camarhynchus_parvulus_V1.1/GCA_902806625.1_Camarhynchus_parvulus_V1.1_genomic.fna.gz
```
job script: [submit_mummer_Hifiasm.l2.k31.p.PRI_vs_Campar.slurm](https://drive.google.com/file/d/1ABDOC2OYKV0spl6pCpQ-FVUueXqVAAML/view?usp=sharing) (Ran with --cpus-per-task=24, --mem=64G, and took 3hs).
```
module purge
module load lang/Anaconda3/2020.11
source activate mummer-env
#unzip it:
gunzip /work/dominguez/YelCar_GenomeAssembly_MPI/Camarhynchus_parvulus_Asm/Campar_genome1.1.fna.gz
#run mummer:
nucmer -p mummer.l2.k31.p.Pri_Campar /work/dominguez/YelCar_GenomeAssembly_MPI/Camarhynchus_parvulus_Asm/Campar_genome1.1.fna /work/dominguez/YelCar_GenomeAssembly_MPI/asm/reads_trimmed/hifiasm/purge_dups/c1.l2.k31/purged_default.fa
```
*I used NUCmer (NUCleotide MUMmer) version 3.1*
Dotplot between two assemblies
PartB
[submit_mummer_Hifiasm.l2.k31.p.PRI_vs_Campar_2.slurm
](https://drive.google.com/file/d/1oXEq6G959TbFwNN3hhswXHYqfi5d4JVZ/view?usp=sharing)
```
#run delta filter:
delta-filter -i 89 -l 10000 /work/dominguez/YelCar_GenomeAssembly_MPI/Camarhynchus_parvulus_Asm/mummer.l2.k31.p.Pri_Campar.delta > /work/dominguez/YelCar_GenomeAssembly_MPI/Camarhynchus_parvulus_Asm/Mummer_filtered10000/mummer2.l2.k31.p.Pri_Campar.filtered.delta
#Run show-coords:
show-coords /work/dominguez/YelCar_GenomeAssembly_MPI/Camarhynchus_parvulus_Asm/Mummer_filtered10000/mummer2.l2.k31.p.Pri_Campar.filtered.delta> /work/dominguez/YelCar_GenomeAssembly_MPI/Camarhynchus_parvulus_Asm/Mummer_filtered10000/mummer2.l2.k31.p.Pri_Campar.filtered.delta.coords
mummerplot --large --filter --fat /work/dominguez/YelCar_GenomeAssembly_MPI/Camarhynchus_parvulus_Asm/Mummer_filtered10000/mummer2.l2.k31.p.Pri_Campar.filtered.delta -p mummer2.l2.k31.p.Pri_CamPar.filtered -t png
```
This produced a .rplot, gp, and a fplot files in /work/dominguez/. I mv them to /work/dominguez/YelCar_GenomeAssembly_MPI/Camarhynchus_parvulus_Asm/Mummer_filtered10000/
It fails producing the png ("Could not find gnuplot, plot will not be rendered").
I installed gnuplot with conda.
```
gnuplot /work/dominguez/YelCar_GenomeAssembly_MPI/Camarhynchus_parvulus_Asm/Mummer_filtered10000/mummer2.l2.k31.p.Pri_CamPar.filtered.gp > mummer2.l2.k31.p.Pri_CamPar.filtered.png
```

Change in the .gp file towards the bottom this lines:
```
set linestyle 1 lt 1 lw 0.3 pt 6 ps 0.1
set linestyle 2 lt 3 lw 0.3 pt 6 ps 0.1
set linestyle 3 lt 2 lw 0.3 pt 6 ps 0.1
```

I can do a circos plot with the mummer .coords file.
### NETX STEPS:
1. assembly mitochondrial genome
2. Repeats annotation
3. Genes Annotation
4. Gene family expansion/constraction analysis considering other species (CAFE)
## Under construction:
## Contamination check ##
#https://kat.readthedocs.io/en/latest/
module purge
module load lang/Anaconda3/2020.11
source activate kat-env2
kat gcp -o /path/reads_statistics/filename_ccs_fasta.kat -t 40 -p png /path/reads/filename_ccs_reads.fa
#Blobtools?
#Bandage (before purging, but not possible after purging because there is no gfa)
## Tutorials / resources
kmer evaluation: https://github.com/marbl/merqury/wiki/2.-Overall-k-mer-evaluation
https://eukaryotic-genome-assembly.github.io/handsOn_1/
https://training.galaxyproject.org/training-material/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html#post-assembly-processing
https://ucdavis-bioinformatics-training.github.io/2020-Genome_Assembly_Workshop/ONT_Assembly/nanopore
https://canu.readthedocs.io/en/latest/tutorial.html#outputs
https://github.com/dfguan/purge_dups