---
title: All Genomes
break: false
tags: Genomes
---
## Server calendar
https://calendar.google.com/calendar/embed?src=626824cfa2287071cb97b45390780f87fe9ac336150bfa9d31018200c2ad6136%40group.calendar.google.com&ctz=America%2FChicago
^^ Reserve time for assembly steps on each server here ^^
## Species with Genome Data
### Amorphophallus titanum
Nora
### Brighamia
Nora (Jackson?)
### Hibiscus
Nora
### Sanseveria/dracaena
Evana
### Amsonia longiflora
Dylan
### Synthyris bullii
Andrew Davies
## Meeting on 7/1/24
* Nora will run QC on all genomes for next meeting with LongQC
* Should have: 20X+ coverage & >1000 bases
* With HifiASM, will kick out total length (rough size of genome), and N50 to give a reasonable sense of assembly
#### General Assembly Pipeline:
1. Run QC with 'SequelTools' or 'LongQC'(expected 20X+ coverage)
2. Run HifiASM - primary and alternate assembly
3. Run BUSCO - uses known genomes to test assembly quality and completeness
4. Create table that contains: total assembly length (bp), N50 (Kbp; 1Mb is a good #), # scaffolds/contigs, BUSCO score (S:D:F:M%)
5. Error Correction - Use pbmm2 and/or GCpp to map raw reads back to assembly
6. Check BUSCO again!!
7. PROFIT(Genome)
## Getting started with genome data
#### Nora's Notes 7/5/24
1) Copy the tar files and md5sum files from the jbod to the directory we're going to work in
2) un-tar the files
```
tar -xvf file.tar.gz
```
3) check that all the files are there and that nothing has been truncated in the transfer
```
md5sum -c PBG-4009.459.PB_000661_Pool_B_2plx_HiFi_BP_AL_020924_GD_280pM-Cell4.content_md5s
```
4) Some of our files might be in .bam format and we want them to be in fastq format for the qc. In the hifi_reads directory, convert the .bam filesto fastq:
```
samtools fastq file1.bam > file1.fastq
```
5) QC check with longqc
First will need to install the docker image for longqc. https://github.com/yfukasawa/LongQC
Suzy might have to add you to the docker list.
Make a directory for the sample you're going to work with. In that directory:
```
mkdir input
mkdir output
```
then copy the file you are going to qc into input. The qc takes a good while, so do it in screen.
```
docker run -it -v ./input:/input -v ./output:/output longqc sampleqc -x pb-sequel -o /output/bc2037.fastq -p 5 /input/bc2037.fastq
```
run the assembly
```
hifiasm -t 30 sample.fastq
```
Moved the fastq over to @funk, and convert .gfa to .fa
```
awk '/^S/{print ">"$2;print $3}' hifiasm.asm.bp.p_ctg.gfa > hifiasm.asm.bp.p_ctg.fa
```
Run busco
```
busco -i ./hifiasm.asm.bp.p_ctg.fa -m genome -c 30 --config ../config.ini --lineage_dataset embryophyta_odb10 -o ./primaryContigs
```
Notes on busco: running with config.ini file helps find the paths busco needs. Had to delete the metaeuk path, it didn't like the path but it is in the usr/local/bin/ so it runs without setting the path in the config file.
Brigamia results from primary contigs:
```
---------------------------------------------------
|Results from dataset embryophyta_odb10 |
---------------------------------------------------
|C:99.7%[S:18.2%,D:81.5%],F:0.1%,M:0.2%,n:1614 |
|1610 Complete BUSCOs (C) |
|294 Complete and single-copy BUSCOs (S) |
|1316 Complete and duplicated BUSCOs (D) |
|2 Fragmented BUSCOs (F) |
|2 Missing BUSCOs (M) |
|1614 Total BUSCO groups searched |
---------------------------------------------------
```
Combine reads:
```
cat hifiasm.asm.bp.hap1.p_ctg.fa hifiasm.asm.bp.hap2.p_ctg.fa > hibiscus_combined.fa
```
meryl database:
```
../meryl-1.4.1/bin/meryl count k=15 output merylDB hibiscus_combined.fa
../meryl-1.4.1/bin/meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt
```
Create k-mer files (takes a while, use screen, can do both simultaneously in different screens to save time, or also can be done concurrently with the steps below up until the nextPolish2 step):
```
../yak/yak count -o k31.yak -k 31 -b 37 -i hibiscus.fastq
../yak/yak count -o k21.yak -k 21 -b 37 -i hibiscus.fastq
```
Followed by winnowmap. Some resources instruct to use winnowmap command and pipe to samtools, but this doesn't work because winnowmap creates a .sam file that does not have the SAM header. So without piping to samtools:
(hibiscus.fastq is the file of raw sequence reads)
```
../Winnowmap/bin/winnowmap -t 40 -W repetitive_k15.txt -ax map-pb hibiscus_combined.fa hibiscus.fastq -o hibiscus213.sam
```
Need to add a SAM header. To do this I just map a random (small) .fastq to the genome file. First make a bowtie index:(This does take awhile, so be sure to do using screen)
```
bowtie-build hibiscus_combined.fa combined_index
```
Now map:
```
bowtie -S -v 2 -p 4 combined_index input.fastq > output.sam
```
Grab the header:
```
samtools view -H output.sam | grep -v "^@PG" > headerfile
```
and add it to the .sam file output by winnowmap (also this takes a long time because the .sam file is huge; do this using screen):
```
cat headerfile hibiscus213.sam > hibiscus_combinedHeader219.sam
```
and sort, then index:
```
samtools sort -o hibiscus219.sort.bam hibiscus_combinedHeader219.sam
samtools index hibiscus219.sort.bam
```
Now polish with the bam and k-mer files:
```
nextPolish2 -t 60 hibiscus219.sort.bam hibiscus_combined.fa k21.yak k31.yak -o asm.np2.hibiscus_combined220.fa
```
## Blasting the assembly against chloroplast reference--how much is cp in our assembly?
```
mkdir blast_cp
# 1. Make database
makeblastdb -in ../asm.np2.hibiscus_combined220.fa -dbtype nucl
# 2. Run BLAST
blastn -query ../hibiscus_cannibanus.fasta -db hicl_db -out blast_results.tsv -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore"
# 3. Filter results
awk '$3 >= 90 && $4 >= 500' blast_results.tsv > filtered_blast_results.tsv
```
$3 = percent identity
$4 = alignment length
This keeps only lines where:
% identity ≥ 90
alignment length ≥ 500 bp
### Grabbing the scaffolds that match the chloroplast:
```
cut -f2 filtered_blast_results.tsv | sort | uniq > chloroplast_hits.txt
/home/srs57/remove/Utilities/miniconda3/bin/seqkit grep -f chloroplast_hits.txt ../asm.np2.hibiscus_combined220.fa > chloroplast_matching_scaffolds.fasta
```
### Then removing the scaffolds that match the chloroplast from the assembly
```
/home/srs57/remove/Utilities/miniconda3/bin/seqkit grep -v -f chloroplast_hits.txt ../asm.np2.hibiscus_combined220.fa > genome_assembly_no_chloroplast.fasta
```
### Now we want to extract the subgenomes from our assembly.
We are going to use Subphaser https://github.com/zhangrengang/SubPhaser
This program requires a config file of homeologous chromosomes. To get an idea of this, we want to do a quick syntony analysis. First, let's try using minimap2:
```
minimap2 -x asm5 -t 8 ../../blast_cp/genome_assembly_no_chloroplast.asm.np2.hibiscus_combined220.fasta ../../blast_cp/genome_assembly_no_chloroplast.asm.np2.hibiscus_combined220.fasta > alignments.paf
```