Progress Report

---
title: Progress Report
tags: MosquiTE
description:
---

## Materials and Methods

### Discovery and Annotation of TEs

Three distinct TE annotation tools were run on the genome assembly Aalbo_primary.1 (Palatini et al. 2020): EDTA, RepeatModeler2, and MITE-Tracker. The resulting TE sequences were consolidated into a single multi-FASTA file, which was then used to mask the genome, filtering out sequences shorter than 80 bp, with the RepeatMasker command:

`RepeatMasker -pa 64 -s -a -inv -nolow -gff -dir . -lib $WORK_DIR/TE_Aealb.raw.fa -cutoff 250 $ASSEMBLY`

Following this, the OneCodeToFindThemAll pipeline (Bailly-Bechet, Haudry, and Lerat 2014) was applied to reconstruct the Long Terminal Repeat (LTR) of each insertion while retaining only insertions longer than 80 bp, using the following two commands:

`build_dictionary.pl --rm Aalb.out --unknown --fuzzy > dico_fuzzy.txt`

`one_code_to_find_them_all.pl --rm Aalb.out --ltr dico_fuzzy.txt --fasta --flanking 100 --strict --unknown --insert 80`

To reduce redundancy and group copies of the same family, cd-hit (Huang et al. 2010; Li and Godzik 2006) was used to cluster similar sequences and generate families based on the 80-80-80 rule (Flutre et al. 2011):

`cd-hit-est -i copies.fasta -o consensi.fasta -c 0.8 -G 0 -aS 0.8 -M 90000 -T 64 -n 5 -d 0`

Clusters with fewer than 7 sequences were excluded. For each remaining cluster, a consensus was created using Refiner (Flynn et al. 2020).

To improve the annotation, manual curation was performed on each TE family. The TE database was run through RepeatMasker against the genome assembly Aalbo_primary.1 to obtain the position of each insertion. Then, using the in-house script GetMultipleAln.sh, each insertion was extended by 2,000 bp on both flanks and extracted, and the 100 longest insertions were aligned with ClustalO (Sievers and Higgins 2014) to examine the characteristic features of the respective TE family.
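
The flank extension step described above can be illustrated with a short shell sketch. This is not the in-house GetMultipleAln.sh script, whose contents are not shown here; the function name `extend_flanks`, the BED input, and the two-column contig size table are assumptions made for illustration only. The sketch widens each interval by a fixed number of base pairs on both sides and clamps the result to the contig boundaries.

```bash
# Hypothetical helper, NOT the in-house GetMultipleAln.sh script.
# Usage: extend_flanks <sizes.tsv> <insertions.bed> [flank_bp]
#   sizes.tsv      : contig name <TAB> contig length (assumed layout)
#   insertions.bed : tab-separated BED intervals (0-based start, end)
#   flank_bp       : bases added on each side (defaults to 2000)
extend_flanks() {
    awk -v flank="${3:-2000}" 'BEGIN { FS = OFS = "\t" }
        NR == FNR { len[$1] = $2; next }   # first file: load contig lengths
        !($1 in len) { next }              # skip intervals on unknown contigs
        {
            s = $2 - flank; if (s < 0) s = 0             # clamp at contig start
            e = $3 + flank; if (e > len[$1]) e = len[$1]  # clamp at contig end
            $2 = s; $3 = e
            print
        }' "$1" "$2"
}
```

The extended intervals could then be passed to any tool that extracts sequences from BED coordinates. Clamping matters for insertions near contig ends, where a naive extension would produce negative or out-of-range coordinates.
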
Each TE family was examined using a set of manual curation and identification tools: TE-Aid (Goubert et al. 2022), Repbase (Jurka et al. 2005; Kohany et al. 2006; Kapitonov and Jurka 2008), RepeatClassifier (Flynn et al. 2020), CDD protein domains (Lu et al. 2020), and alignment visualisation in Aliview (Larsson 2014), following the annotation recommendations of Goubert et al. (2022). In a nutshell, the TE-Aid output for the consensus of each TE family was analysed to guide TE annotation. Protein domains were then searched for within the consensus using the CDD protein domain database. Finally, each cluster of insertions was manually curated by inspecting its alignment in Aliview, with the aim of identifying Long Terminal Repeats (LTRs), Target Site Duplications (TSDs), and Terminal Inverted Repeats (TIRs).
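
As a rough illustration of what a Terminal Inverted Repeat looks like at the sequence level, the sketch below checks whether the first k bases of a sequence equal the reverse complement of its last k bases. It is only a toy check under stated assumptions: the function name `has_tir` and the default k of 10 are hypothetical, and it requires an exact match, whereas real TIRs are often imperfect and are better judged by eye in the alignment, as described above.

```bash
# Hypothetical helper for illustration; not part of the curation workflow.
# Usage: has_tir <sequence> [k]
# Succeeds (exit status 0) if the first k bases exactly match the reverse
# complement of the last k bases; fails otherwise or if the sequence is
# shorter than 2*k.
has_tir() {
    printf '%s\n' "$1" | awk -v k="${2:-10}" '
        BEGIN { comp["A"] = "T"; comp["T"] = "A"; comp["C"] = "G"; comp["G"] = "C" }
        {
            s = toupper($0); n = length(s)
            if (n < 2 * k) exit 1
            first = substr(s, 1, k)
            last = substr(s, n - k + 1, k)
            rc = ""
            # Build the reverse complement of the last k bases;
            # ambiguous bases map to "x" so they never match.
            for (i = k; i >= 1; i--) {
                b = substr(last, i, 1)
                rc = rc ((b in comp) ? comp[b] : "x")
            }
            exit (first == rc ? 0 : 1)
        }'
}

# Example call: has_tir CAGGGATATATCCCTG 5 && echo "perfect 5 bp TIR"
```
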