Genomes Overview & Meeting Notes

--- title: All Genomes break: false tags: Genomes --- ## Server calendar https://calendar.google.com/calendar/embed?src=626824cfa2287071cb97b45390780f87fe9ac336150bfa9d31018200c2ad6136%40group.calendar.google.com&ctz=America%2FChicago ^^ Reserve time for assembly steps on each server here ^^ ## Species with Genome Data ### Amorphophallus titanum Nora ### Brighamia Nora (Jackson?) ### Hibiscus Nora ### Sanseveria/dracaena Evana ### Amsonia longiflora Dylan ### Synthyris bullii Andrew Davies ## Meeting on 7/1/24 * Nora will run QC on all genomes for next meeting with LongQC * Should have: 20X+ coverage & >1000 bases * With HifiASM, will kick out total length (rough size of genome), and N50 to give a reasonable sense of assembly #### General Assembly Pipeline: 1. Run QC with 'SequelTools' or 'LongQC'(expected 20X+ coverage) 2. Run HifiASM - primary and alternate assembly 3. Run BUSCO - uses known genomes to test assembly quality and completeness 4. Create table that contains: total assembly length (bp), N50 (Kbp; 1Mb is a good #), # scaffolds/contigs, BUSCO score (S:D:F:M%) 5. Error Correction - Use pbmm2 and/or GCpp to map raw reads back to assembly 6. Check BUSCO again!! 7. PROFIT(Genome) ## Getting started with genome data #### Nora's Notes 7/5/24 1) Copy the tar files and md5sum files from the jbod to the directory we're going to work in 2) un-tar the files ``` tar -xvf file.tar.gz ``` 3) check that all the files are there and that nothing has been truncated in the transfer ``` md5sum -c PBG-4009.459.PB_000661_Pool_B_2plx_HiFi_BP_AL_020924_GD_280pM-Cell4.content_md5s ``` 4) Some of our files might be in .bam format and we want them to be in fastq format for the qc. In the hifi_reads directory, convert the .bam filesto fastq: ``` samtools fastq file1.bam > file1.fastq ``` 5) QC check with longqc First will need to install the docker image for longqc. https://github.com/yfukasawa/LongQC Suzy might have to add you to the docker list. Make a directory for the sample you're going to work with. In that directory: ``` mkdir input mkdir output ``` then copy the file you are going to qc into input. The qc takes a good while, so do it in screen. ``` docker run -it -v ./input:/input -v ./output:/output longqc sampleqc -x pb-sequel -o /output/bc2037.fastq -p 5 /input/bc2037.fastq ``` run the assembly ``` hifiasm -t 30 sample.fastq ``` Moved the fastq over to @funk, and convert .gfa to .fa ``` awk '/^S/{print ">"$2;print $3}' hifiasm.asm.bp.p_ctg.gfa > hifiasm.asm.bp.p_ctg.fa ``` Run busco ``` busco -i ./hifiasm.asm.bp.p_ctg.fa -m genome -c 30 --config ../config.ini --lineage_dataset embryophyta_odb10 -o ./primaryContigs ``` Notes on busco: running with config.ini file helps find the paths busco needs. Had to delete the metaeuk path, it didn't like the path but it is in the usr/local/bin/ so it runs without setting the path in the config file. Brigamia results from primary contigs: ``` --------------------------------------------------- |Results from dataset embryophyta_odb10 | --------------------------------------------------- |C:99.7%[S:18.2%,D:81.5%],F:0.1%,M:0.2%,n:1614 | |1610 Complete BUSCOs (C) | |294 Complete and single-copy BUSCOs (S) | |1316 Complete and duplicated BUSCOs (D) | |2 Fragmented BUSCOs (F) | |2 Missing BUSCOs (M) | |1614 Total BUSCO groups searched | --------------------------------------------------- ``` Combine reads: ``` cat hifiasm.asm.bp.hap1.p_ctg.fa hifiasm.asm.bp.hap2.p_ctg.fa > hibiscus_combined.fa ``` meryl database: ``` ../meryl-1.4.1/bin/meryl count k=15 output merylDB hibiscus_combined.fa ../meryl-1.4.1/bin/meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt ``` Create k-mer files (takes a while, use screen, can do both simultaneously in different screens to save time, or also can be done concurrently with the steps below up until the nextPolish2 step): ``` ../yak/yak count -o k31.yak -k 31 -b 37 -i hibiscus.fastq ../yak/yak count -o k21.yak -k 21 -b 37 -i hibiscus.fastq ``` Followed by winnowmap. Some resources instruct to use winnowmap command and pipe to samtools, but this doesn't work because winnowmap creates a .sam file that does not have the SAM header. So without piping to samtools: (hibiscus.fastq is the file of raw sequence reads) ``` ../Winnowmap/bin/winnowmap -t 40 -W repetitive_k15.txt -ax map-pb hibiscus_combined.fa hibiscus.fastq -o hibiscus213.sam ``` Need to add a SAM header. To do this I just map a random (small) .fastq to the genome file. First make a bowtie index:(This does take awhile, so be sure to do using screen) ``` bowtie-build hibiscus_combined.fa combined_index ``` Now map: ``` bowtie -S -v 2 -p 4 combined_index input.fastq > output.sam ``` Grab the header: ``` samtools view -H output.sam | grep -v "^@PG" > headerfile ``` and add it to the .sam file output by winnowmap (also this takes a long time because the .sam file is huge; do this using screen): ``` cat headerfile hibiscus213.sam > hibiscus_combinedHeader219.sam ``` and sort, then index: ``` samtools sort -o hibiscus219.sort.bam hibiscus_combinedHeader219.sam samtools index hibiscus219.sort.bam ``` Now polish with the bam and k-mer files: ``` nextPolish2 -t 60 hibiscus219.sort.bam hibiscus_combined.fa k21.yak k31.yak -o asm.np2.hibiscus_combined220.fa ``` ## Blasting the assembly against chloroplast reference--how much is cp in our assembly? ``` mkdir blast_cp # 1. Make database makeblastdb -in ../asm.np2.hibiscus_combined220.fa -dbtype nucl # 2. Run BLAST blastn -query ../hibiscus_cannibanus.fasta -db hicl_db -out blast_results.tsv -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" # 3. Filter results awk '$3 >= 90 && $4 >= 500' blast_results.tsv > filtered_blast_results.tsv ``` $3 = percent identity $4 = alignment length This keeps only lines where: % identity ≥ 90 alignment length ≥ 500 bp ### Grabbing the scaffolds that match the chloroplast: ``` cut -f2 filtered_blast_results.tsv | sort | uniq > chloroplast_hits.txt /home/srs57/remove/Utilities/miniconda3/bin/seqkit grep -f chloroplast_hits.txt ../asm.np2.hibiscus_combined220.fa > chloroplast_matching_scaffolds.fasta ``` ### Then removing the scaffolds that match the chloroplast from the assembly ``` /home/srs57/remove/Utilities/miniconda3/bin/seqkit grep -v -f chloroplast_hits.txt ../asm.np2.hibiscus_combined220.fa > genome_assembly_no_chloroplast.fasta ``` ### Now we want to extract the subgenomes from our assembly. We are going to use Subphaser https://github.com/zhangrengang/SubPhaser This program requires a config file of homeologous chromosomes. To get an idea of this, we want to do a quick syntony analysis. First, let's try using minimap2: ``` minimap2 -x asm5 -t 8 ../../blast_cp/genome_assembly_no_chloroplast.asm.np2.hibiscus_combined220.fasta ../../blast_cp/genome_assembly_no_chloroplast.asm.np2.hibiscus_combined220.fasta > alignments.paf ```