# TE annotation

###### tags: `TE`, `annotation`, `Repeat Masker`

Once we have a valid custom TE library for our new species, we can take a closer look at how the TEs are distributed in the genome (across contigs/cosmids). Along with the proportions (%), a way to look at this is to generate annotation files such as [.gff](https://m.ensembl.org/info/website/upload/gff3.html):

```
##gff-version 3
ctg123 . mRNA 1300 9000 . + . ID=mrna0001;Name=sonichedgehog
ctg123 . exon 1300 1500 . + . ID=exon00001;Parent=mrna0001
ctg123 . exon 1050 1500 . + . ID=exon00002;Parent=mrna0001
ctg123 . exon 3000 3902 . + . ID=exon00003;Parent=mrna0001
ctg123 . exon 5000 5500 . + . ID=exon00004;Parent=mrna0001
ctg123 . exon 7000 9000 . + . ID=exon00005;Parent=mrna0001
```

Of course there are different versions and other variants (.bed, GenBank, etc.). I like gff3, btw.

RepeatMasker lets you first map your TE library onto your genome and then export the result in .gff format. We need to specify the `-gff` option, like:

`RepeatMasker -dir RM -pa 6 -a -low -no_is -e ncbi -gff -lib /groups/arklab/Repbase/invrep25.12.ref.fasta contig.fa`

`-dir RM` writes the output into the `RM` folder, which keeps things better organized.

I would like to do this for:

1. The custom TE library
2. The RepeatMasker/Repbase library

The custom TE library should be `Dscosmid_denovoLibTEs_filtered_MCL.fa.classified`, because the extra `#TEtype` in each fasta header will be used by RepeatMasker:

```
>RIX-comp_MCL1_Dscosmid-B-G12-Map3#LINE/CR1
AGGCTTTGAGTCGCTGTAGAAGTCCGATGGCTAGGGATGGAGTACATTTT
TCGAAAGAGGGAGCGGTAGCAGTAGGATTGGCTATCATGAAGGAAAATGA
GCCTTTTTTAGGATTGTAGATGGGGGGGCAAGGGATCACAAGCAAGGGAC
AGCCACTACGAGCAGCTTTAGACCGCCGCGTACTGGTGTGCAGGAAAGGC
```

For the Repbase run, just don't use the `-lib` option. RepeatMasker will pick `/groups/arklab/bin/RepeatMasker/Libraries/RepeatMasker.lib`.

Let's do the RM/Repbase library first (no `-lib` option).
Just type:

`RepeatMasker -dir RM -pa 6 -a -low -no_is -e ncbi -gff cosmid.fa`

## Repeat Protein masker

Uses a protein TE library: `/groups/arklab/bin/RepeatMasker/Libraries`

ALWAYS symlink the input fasta file (assembly.fa) to the working dir; otherwise it's going to place the output files where the fasta file is located!

`RepeatProteinMask -engine ncbi -pvalue 0.001 -noLowSimple contig.fa`

This will create two files:

1. `.annot`: table with the TE annotation
2. `.masked`: the input fasta file masked (regions with TEs are N's). Not useful for us.
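
The symlink advice above could look something like the helper below. This is only a sketch: the function name `link_input` and the `RPM` folder in the usage comment are made up for illustration (they are not part of RepeatMasker), and the paths are placeholders for wherever your fasta file actually lives.

```shell
# Hypothetical helper (not part of RepeatMasker): symlink a fasta file into a
# working dir, so a tool that writes output next to its input writes there.
link_input() {
    local fasta="$1" workdir="$2"
    local src_dir work_abs name

    # Refuse to create a dangling link if the input is missing
    [ -f "$fasta" ] || return 1

    # Absolute path of the folder holding the fasta file
    src_dir="$(cd "$(dirname "$fasta")" && pwd)" || return 1
    name="$(basename "$fasta")"

    mkdir -p "$workdir" || return 1
    work_abs="$(cd "$workdir" && pwd)" || return 1

    # Nothing to do if the file already lives in the working dir
    [ "$src_dir" = "$work_abs" ] && return 0

    # -s: symbolic link, -f: replace an old link with the same name
    ln -sf "$src_dir/$name" "$work_abs/$name"
}

# Usage (placeholder paths):
#   link_input /path/to/assembly.fa RPM
#   cd RPM && RepeatProteinMask -engine ncbi -pvalue 0.001 -noLowSimple assembly.fa
```

Linking with an absolute path keeps the symlink valid no matter where the working dir sits relative to the original file.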