Tunicate Genome Assembly

# Tunicate Genome Assembly ## TL;DR - The *Ascidiella* assembly is decent but covers perhaps 6% of conserved genes. - *Molgula* needs more (and probably better quality) nanopore data. ## Work flow ```mermaid graph LR; A["nanopore data\n(FAST5)"]--guppy-->B["filtered reads\n(FASTQ)"]; B--flye-->C["assembled contigs\n(FASTA)"]; C--BUSCO-->Q1["quality assessment\n(text)"]; C--repeatmodeler-->R["repeat predictions\n(FASTA)"]; C--repeatmasker-->E["masked assembly\n(FASTA)"]; R--repeatmasker-->E E--BRAKER3-->F["gene predictions\n(GFF3)"]; F--BUSCO-->Q2["quality assessment\n(text)"]; Augustus["Augustus/BUSCO Database\n(single-copy orthologs)"]--BRAKER3-->F F--EnTAP-->G["gene annotations\n(TSV)"]; style A fill:#ffe,stroke:#333,stroke-width:3px style B fill:#ffe,stroke:#333,stroke-width:3px style C fill:#ffe,stroke:#333,stroke-width:3px style R fill:#ffe,stroke:#333,stroke-width:3px style E fill:#ffe,stroke:#333,stroke-width:3px style F fill:#ffe,stroke:#333,stroke-width:3px style G fill:#ffe,stroke:#333,stroke-width:3px style Augustus fill:#fff style Q1 fill:#ddd style Q2 fill:#ddd ``` ## Tips - The last guppy log file contains the length of time the program ran in milliseconds. - The number of nanopore "records" can be found from the guppy log files: `wc -l *.log` -24. - To find the distribution of read lengths, run the following in the `pass` and `fail` folders produced by guppy: ```bash zcat *.gz | grep "^@" -A1 | grep -v [@-] >> tmp cat tmp | awk '{print length}' | sort -n | uniq -c | sed "s/^[ ]*//" > passed.read.lengths.txt ``` I wrote an R script to find the N50 and N90 stats, `combine.read.length.distributions.R`, based on the two `read.lengths.txt` files. - I ran the *Ascidiella* flye assembly 7 different ways to optimize the available parameters. The best results came from using the arguments `--nano-raw ` and `--iterations 5` ("polishing"). Don't bother with the `--genome-size`, `--asm-coverage` or `--scaffold` flags. I used 32 threads. *Ascidiella* got away with 128G memory, but for *Molgula* I had to use 248G. - The *Ciona intestinalis* genome is ~1.6Mb, but some many be as large as 874 Mb. So it's hard to know how complete things are beyond a BUSCO analysis. --- ## *Molgula citrina* ### Starting data - FAST5 data: 23 Gb - total number of nanopore records: 966,065 ### Base calling (all reads) - Guppy ran for 5.8 days! - number of reads: 241,521 - total length: 460,191,960 bp - longest read: 949,662 bp - N50: 122,426 bp - N90: 9,790 bp ### Filtering (passing reads) - number of reads: 8,298 (3.44%) - total length: 153,521,883 bp (33.4%) - longest read: 791,705 bp - N50: 192,503 bp - N90: 37,018 bp ### Assembly - Flye ran for 7.5 days - Total length: 290,758 bp - Fragments: 2 - Fragments N50: 146,863 bp - Largest frg: 146,863 bp - Scaffolds: 0 - Mean coverage: 26 - BUSCO: C:0.0%[S:0.0%,D:0.0%],F:0.0%,M:100.0%,n:954 - Anecdotally, the two contigs look entirely like repeats! --- ## *Ascidiella aspera* - Estimated genome size: 309 Mb ### Starting data - FAST5 data: 13 Gb - total number of nanopore records: 734,798 ### Base calling (all reads) - Guppy ran for 2.5 days - number of reads: 367203 - total length: 660754911 bp - longest read: 728650 bp - N50: 4003 bp - N90: 805 bp ### Filtering (passing reads) - number of reads: 286,232 (77.9%) - total length: 555,242,318 bp (84%) - longest read: 437,671 bp - N50: 3,743 bp - N90: 820 bp ### Assembly `~/results/flye/Ascidiella_aspera_run4` - Total length: 27,275,764 bp (estimated 8.8% of the genome) - Fragments: 3814 - Fragments N50: 10,245 bp - Largest frg: 78,586 bp - Scaffolds: 0 - Mean coverage: 6 - BUSCO C:**6.1%**[S:6.1%,D:0.0%],F:2.5%,M:91.4%,n:954 ### Gene Prediction - After repeat modeling and masking, WebAUGUSTUS predicted 3806 protein-coding genes. > While [BRAKER](https://github.com/Gaius-Augustus/BRAKER) has worked for me in the past on [UConn's Xanadu HPC cluster](https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/xanadu/), it would not run with the *Ascidiella* assembly, for reasons that are not yet clear. So I'm trying a different approach using [WebAUGUSTUS](https://bioinf.uni-greifswald.de/webaugustus/index), which is maintained by the same group that created BRAKER. I started by [training](https://bioinf.uni-greifswald.de/webaugustus/training/create) AUGUSTUS on gene models from *[Ciona intestinalis](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/224/145/GCF_000224145.3_KH/)*, using genomic, protein and RNA data. Among the output was `parameters.tar.gz` which is the necessary input for the next step. I ran gene [prediction](https://bioinf.uni-greifswald.de/webaugustus/prediction/create) using the parameters obtained from training on *Ciona* and the *Ascidiella* assembly from flye. - BUSCO: C:**3.7%**[S:3.7%,D:0.0%],F:4.3%,M:92.0%,n:954 - [EnTAP](https://entap.readthedocs.io/en/latest/Getting_Started/introduction.html) provided annotations for 1145 predicted proteins. - Some random proteins you might be interested in: FGF13, FGFR, beta-catenin, exportin-4 - Sadly no claudins! ## Potential Follow-ups If/when we have several complete(-ish) genomes from multiple species, [OrthoFinder](https://github.com/davidemms/OrthoFinder?tab=readme-ov-file#what-does-orthofinder-do) can be used to identify gene families across those species. This can include species' genomes pulled from NCBI. [CAFE5](https://github.com/hahnlab/CAFE5?tab=readme-ov-file#what-cafe5-does) can then be used to test for families with significant expansion or loss across the tree. Combining that information with the annotations of genes in those families can make for some cool biological insights into genome evolution. ```mermaid graph LR; F["gene predictions\n(FASTA or 'aa')"]--OrthoFinder-->H["gene family prediction\n(TSV, Newick)"]; H--CAFE5-->I["evolutionary rates\n(TSV, Newick)"] style F fill:#ffe,stroke:#333,stroke-width:3px style H fill:#ffe,stroke:#333,stroke-width:3px style I fill:#ffe,stroke:#333,stroke-width:3px ``` ---