Annotathon 2024 instruction manual

## Introduction

## Confidentiality and publication plans

We are gathering to look at and discuss a lot of unpublished data and ideas. Let's remember:

- Be respectful of the other participants' existing projects. We have so much to do, so if one of us says “_actually, I am already doing this…_” let's just move to the next task. Remember that what you just found may already be known to others.
- We may sometimes disagree on some conclusions. Let's take the time to reach consensus. If some questions do not have a definitive answer at the end of the hackathon, that is fine!

## Things to do:

Check the list of genes to update/annotate: https://trello.com/b/NwrYLyI7/annotation-hackathon-2024

## Main resources

### JBrowse Genome Browser

private.oikobrowser.jnicolaus.com

### GitHub links

Link for shared annotations on GitHub: https://github.com/oist/Oikopleuradioica_genomeannotation

GitHub Issues, where manual annotations will eventually be pushed: https://github.com/oist/Oikopleuradioica_genomeannotation/issues

### Data sources

#### Dropbox with shared resources

https://www.dropbox.com/scl/fo/wvx9p1pnfqqltpf17vj9g/h?rlkey=pqqacl5xh7ryn5ohk6ipiaz8l&dl=0

Below is a description of the files in the Dropbox link:

1. **eggnog**: gene name and functional annotation based on orthology
2. **gff files**: gene annotation files used as a base

#### Orthogroups inferred by OrthoFinder2

Sets of homologous genes inferred by OrthoFinder2. There are two important subsets of this data:

* *O. dioica*-specific orthogroups: https://docs.google.com/spreadsheets/d/1LZfYO-X2jH7hnSn_7aswIo8bdT2AfiyNCmvvPL6i-AQ/edit?usp=drive_link
    * These are best for inferring orthologues (i.e., same gene, different species) and paralogs (i.e., genes that could be duplicated, etc.) within *O. dioica*.
* Orthogroups shared between all chordates: https://docs.google.com/spreadsheets/d/1baGifDVMb5yYau4O8MY9ReWDrB6O95mRipS1PJ3HWqU/edit?usp=sharing
    * These are best for inferring ancient orthologues (i.e., highly conserved genes retained from the chordate ancestor).

## Everybody can contribute at their own level!

- _Beginners_: assign a name to a gene, or leave a note if the name is not easy to assign for that gene.
- _Advanced_: edit the annotation of a gene, for instance by adding UTRs. Add a missing gene.
- _Experts_: resolve erroneously merged genes.

## Instructions on editing/adding a gene

Gene editing spreadsheet: https://docs.google.com/spreadsheets/d/1ThjjS3HFhshExWN338VpSXok94b0tkl_tHS-qoQNjM8/edit?usp=sharing

### If the gene is already annotated but needs polishing or a gene name

1. Find the gene in the annotation
    - BLAST the gene from *Ciona* against *O. dioica*
    - Check the mapping of UniProt *O. dioica* proteomes to *O. dioica*
    - Check the OrthoFinder results to find the orthologous gene in the current annotation
2. Open the gff3 file in `annotathon 2024/gff files`
    - Using the gene ID, find the gene on the browser
    - Press ctrl+F to find it in a text editor
    - Copy and paste it to the sheet
3. Editing
    - Edit the genes based on the genome browser annotation
    - Keep in mind that coordinates are 1-based
    - Add the 5' UTR based on the CAGE peaks
    - Add the 3' UTR based on the RNA-seq annotation and polyadenylation signals (`TTATTT` on the - strand and `AAATAA` on the + strand).
    - The ID can be removed for exon/intron/CDS/UTR features of new models. However, when adding new gene models, we can add `_1` or `_2`, and `.t1`, `.t2`, etc. for new transcript models. The `ID` and `Parent` attributes have to be assigned properly.
    - Note that when expanding a gene annotation with UTRs, the `gene` and `transcript` features must be updated accordingly.
    - Splice site reminder: `GT|AG` (+) / `CT|AC` (-) is canonical and `GA|AG` (+) / `CT|TC` (-) is non-canonical.
4. Validation
    - Check RNA-seq data
    - Check proteome alignment
    - Check liftoff
    - Check synteny

Tips on editing genes:

3. The 5' UTR and 3' UTR lie on exons; the 5' UTR needs to be labeled as SL or non-SL
4. Double-check that the data works on the genome browser

Once done, let's

### If the gene is non-existent

1. Inspect liftoff and proteome alignment
2. Base the gene model on these data

## Searching the genome assemblies with LAST

There is an experimental server at <http://last.oikobrowser.jnicolaus.com>, written by Charles and ChatGPT. Expect crashes and follow up on the annotathon channel. Feel free to suggest simple enhancements.

## Some notes about corner cases

* Always flag any ambiguities. We can help one another.
* If you encounter a case where a gene is not present in the annotation because of a premature stop codon that is related to a sequencing error in the genome:
    * ***Flag it!***
    * Then:
        * remove the problematic exon containing the stop codon from the transcript annotation, or;
        * split the transcript into 2 halves with the nearest in-frame codon. So AAAAAA becomes exon([AAA])AA exon([AAA])
    * Because a stop codon might mark a real, recently acquired pseudogene, it would be best to ascertain that an uninterrupted, error-free CDS or protein actually exists at that location, rather than inferring its existence by homology. This will require more work. Thus, *flag it!*
    * Later, if we have good reason to believe that a gene really exists and is not interrupted by a stop codon, then the reference genome sequence is *erroneous*: one can never correctly represent that transcript relative to an incorrect reference genome.
        * A solution is to insert a few *N*s as ambiguous nucleotides at those positions within the genome to restore the reading frame of the transcript.
        * Introducing a letter into a reference genome throws off all related genomic data: every other gene needs to change its position, every RNA-seq run needs to be adjusted, all synteny tracks, and so on.
          Changing a reference genome is an expensive operation and is best left for new releases of a genome, when everything can be re-run later. So please *flag these cases.* 😊

## If you are getting sleepy

Computer work in a dark room is not the best recipe to stay awake. We have a long way to go, so do not worry about taking a break. Take a look at the:

- ping-pong table at the ground level of Lab 4 (where the hackathon takes place).
- terrace and vending machines at the end of Level F in Lab 4.
- hammock in the unit's office in Lab 3 (be careful: naps longer than 15 minutes can induce deep sleep instead of refreshing you).
- ice cream vending machine in front of our office in Lab 3.
- Café Tancha and its ocean view in the central building.
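
Going back to the editing instructions above: the splice site reminder and the 1-based coordinates can be checked with a few lines of Python. This is only a minimal sketch, not part of the hackathon tooling. The function names (`reverse_complement`, `intron_boundaries`, `classify_intron`) and the toy sequences are made up for illustration, and it assumes you already have the scaffold sequence as a plain string and the intron coordinates as 1-based, inclusive positions on the + strand, as in a GFF3 file.

```python
# Hypothetical helper names; the motifs come from the splice site reminder above.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def intron_boundaries(genome_seq: str, start: int, end: int, strand: str):
    """Return (donor, acceptor) dinucleotides as read on the transcript strand.

    start and end are 1-based, inclusive coordinates on the + strand,
    so the Python slice is genome_seq[start - 1:end].
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if start < 1 or end > len(genome_seq) or end - start + 1 < 4:
        raise ValueError("intron coordinates out of range or too short")
    intron = genome_seq[start - 1:end].upper()
    if strand == "-":
        # On the - strand the intron reads as CT...AC (or CT...TC) in the
        # reference, so flip it to the transcript orientation first.
        intron = reverse_complement(intron)
    return intron[:2], intron[-2:]


def classify_intron(genome_seq: str, start: int, end: int, strand: str) -> str:
    """Label an intron as canonical (GT|AG), non-canonical (GA|AG) or other."""
    donor, acceptor = intron_boundaries(genome_seq, start, end, strand)
    if (donor, acceptor) == ("GT", "AG"):
        return "canonical"
    if (donor, acceptor) == ("GA", "AG"):
        return "non-canonical"
    return "other"


if __name__ == "__main__":
    # Toy sequences, not real O. dioica data: the intron spans positions 4 to 11.
    print(classify_intron("AAAGTCCCCAGTTT", 4, 11, "+"))
    print(classify_intron("AAACTGGGGACTTT", 4, 11, "-"))
```

Anything reported as "other" is not necessarily wrong, but it is a good candidate to look at again in the genome browser and to flag. The same `reverse_complement` helper also relates the two polyadenylation signals listed above, since `TTATTT` is the reverse complement of `AAATAA`.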
