# DNAZoo Annotation Project

Having genomes is good - chromosome level genomes are better... But it's the annotation of those genomes that is really enabling. Knowing something about the full complement of genes, and their order on chromosomes, is exceptionally powerful, and allows direct comparisons of genomes across the tree of life. To begin to understand this, we set out to create a set of annotations for all the mammals of the DNAZoo.

tl;dr: the entire set, from 67 genomes, can be found here: https://www.dropbox.com/sh/xt300ht42mihjov/AADoENW7RTvR3jTh1a8qUOmRa. Each species' folder contains a set of transcripts, proteins, and a gff3 file.

A few important stats. On average, we identified 16445 genes per species. We do know that we're missing about 10% of genes. We have a clear analytical path to recover most of these (currently missing) genes, so stay tuned!

In total, we recovered 1,101,834 genes, and 98.1% of these are contained in about 22,156 orthogroups, as defined by orthofinder. What does this mean? Although we may be missing some genes (more on this below), with <2% of genes left unassigned to an orthogroup, we are not predicting a bunch of junk. Yay! The false positive rate is very low.

### What did we do:

**Maker-based annotation**: Because we do not have transcriptome data for this large collection of species, we devised a strategy to leverage the power of homology, along with the fact that we have extensive knowledge of gene content in mammals. Specifically, we elected to use the subset of SwissProt that includes just mammals. We believe this is a reasonable first approach, given the broad coverage of this dataset. That file is available here: https://www.dropbox.com/s/nedwdu4w6klq7yx/swissprot_mammals.cdhit.fasta

So, the annotations contained here are based on SwissProt mammals. This does a pretty good job and identifies the vast majority of genes. To reproduce these runs, see https://github.com/macmanes-lab/dnazoo_annotation.
Each annotation took between 18 and 35 hours to run across 48 cores.

**orthofinder**: In addition to the annotations, we aimed to generate orthogroups using orthofinder2 (Emms, 2018) - this is incredibly useful information for comparative biologists. We've even included a tree. This tree was constructed within OrthoFinder using the default settings (e.g., using FastTree). It was made using 1023 orthogroups. Note that there may be some inconsistencies in the topology, as it's our first stab at this, and further refinements are upcoming!

![](https://i.imgur.com/K9msBfH.png)

### What's next?

1. We know there are some genes that we've missed, and we have been developing a better approach, without sacrificing too much time. In addition to this, our current approach is missing ncRNA - stay tuned, because the v2 annotation will contain these critically important elements!
2. Tell us what you want! What would make this even more useful? Let us know!
3. Is your favorite gene missing? Let us know and we can see where it went.
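
If you want to check the per-species gene counts quoted above against a gff3 file you have downloaded, here is a minimal sketch in Python. It is not part of the annotation pipeline. It only relies on the GFF3 convention that the third tab-separated column holds the feature type, and it assumes genes are recorded with the type `gene`. The function names and the example records are made up for illustration.

```python
import gzip
from collections import Counter
from pathlib import Path


def count_gff3_features(lines):
    """Count features by type (GFF3 column 3) in an iterable of lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence section; no feature rows follow it
        if not line or line.startswith("#"):
            continue  # blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # skip rows that are not valid 9-column GFF3 records
        counts[fields[2]] += 1
    return counts


def count_gff3_file(path):
    """Same as above, reading from a plain or gzip-compressed file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return count_gff3_features(handle)


# Tiny made-up example (not real DNAZoo data): two genes, one mRNA, one exon.
EXAMPLE_LINES = [
    "##gff-version 3",
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene1-RA;Parent=gene1",
    "chr1\tmaker\texon\t100\t400\t.\t+\t.\tID=e1;Parent=gene1-RA",
    "chr2\tmaker\tgene\t50\t700\t.\t-\t.\tID=gene2",
    "##FASTA",
    ">chr1",
    "ACGT",
]

example_counts = count_gff3_features(EXAMPLE_LINES)
print(example_counts["gene"])  # 2
```

For a real file, call `count_gff3_file("path/to/species.gff3")` with whatever path you saved the download to, and read off the `gene` entry.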