
# Beluga/Narwhal TEs

Working with Martin and Manuel to do a quick de novo analysis of TEs in the most recent narwhal assembly (and I think he wants the beluga as well, but I can't tell from the e-mail thread. Doesn't hurt to do it.)

Transfer the assembly and set up directories.

latest beluga - https://www.ncbi.nlm.nih.gov/bioproject/520934

latest narwhal - https://www.ncbi.nlm.nih.gov/assembly/GCA_005190385.3

Set up directory structure
```
interactive -p nocona -c 1
cd /lustre/scratch/mhoyosro
mkdir whale_curations
cd whale_curations/
mkdir assemblies
cd assemblies
## download and transfer assemblies to this directory
## rename fa.gz files to mMon.fa.gz for narwhal and dLeu.fa.gz for beluga
cd ..
mkdir dLeu
mkdir mMon
```

Check that conda is installed
```
. ~/conda/etc/profile.d/conda.sh
conda activate
```

Start process of copying necessary files and setting up analysis runs. This will likely fail due to inaccurate paths that need to be specified for Chris.
```
mkdir /lustre/scratch/chpeery/spider_curations/templates
cp -r /lustre/scratch/daray/curation_templates/* /lustre/scratch/chpeery/spider_curations/templates
```

Check the templates that were transferred for correct paths and ensure they're appropriate for Chris.
```
cd /lustre/scratch/chpeery/spider_curations/templates
for FILE in *; do
  sed -i "s|<PATH_TO_ASSEMBLIES_DIRECTORY>|/lustre/scratch/chpeery/spider_curations/assemblies|g" "$FILE"
  sed -i "s|<PATH_TO_WORKING_DIRECTORY>|/lustre/scratch/chpeery/spider_curations|g" "$FILE"
done
```

Generate the necessary conda environments.
```
cd /lustre/scratch/chpeery/spider_curations
cp /lustre/scratch/daray/curation_templates/*.conda.txt .
. ~/conda/etc/profile.d/conda.sh
conda activate
conda create --name extend_env --file extend_env.conda.txt
conda create --name repeatmodeler --file repeatmodeler.conda.txt
conda install biopython
```

Once templates have been modified, generate genome-specific files using the commands below.
```
### for assemblies in list...
LIST="dPla"
cd /lustre/scratch/chpeery/spider_curations
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" templates/rmodel_template.sh >${NAME}_rmodel.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" templates/extend_align_template.sh >${NAME}_extend_align.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" templates/repeatclassifier_template.sh >${NAME}_repeatclassifier.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" templates/TEcurate_template.sh >${NAME}_TEcurate.sh; done
for NAME in $LIST; do sbatch ${NAME}_rmodel.sh; done
```

Place squeue output here
```
cd /lustre/scratch/chpeery/spider_curations
sbatch dPla_extend_align.sh
```

For example:

`8088376 nocona dPla_ext daray R 0:02 1 cpu-23-15`

Use the job ID to queue the next job.
```
sbatch --dependency=afterok:8114866 dPla_repeatclassifier.sh
```

`8114884 nocona dPla_cla daray PD 0:00 1 (Dependency)`

Use the job ID to queue the next job.
```
sbatch --dependency=afterok:8114884 dPla_TEcurate.sh
```

Result should be a series of queued jobs that will run in succession.

`8114885 nocona dPla_TEc daray PD 0:00 1 (Dependency)`

`8114884 nocona dPla_cla daray PD 0:00 1 (Dependency)`

`8114866 xlquanah dPla_ext daray R 13:14 1 cpu-19-6`

At this point, we will have a set of curated TEs. We will need to examine those curated TEs to determine which ones have already been characterized. It should be MOST, if not all, of them.

To do this, we will need to compare the output of this curation to the known library of mammal TEs using uclust. David can provide instructions on this when the time comes.

If any new TEs are identified, we add them to our existing mammal library and run RepeatMasker. Give the results to Martin and, I hope, be done.

# Notes from other work

Decided to include additional TE curation files from TE-Aid directories when sharing with collaborators. To include those:
```
```

Notes from another collaboration but may be useful here. Sending the following readme with the data.
:::info
Analyses run by D Ray October 19-21, 2022.

The pipeline does the following:
- Performs a de novo RepeatModeler analysis of the genome assembly. --> consensi.fa.classified
- The resulting consensi.fa.classified is processed to remove putative consensus sequences <100 bp and to modify the headers to include the species identifier (tAlp or vEme in these cases). --> tAlp-families.mod.fa and vEme-families.mod.fa
- The families.mod.fa files are subjected to a RepeatAfterMe (RAM) analysis to produce extended versions, hopefully solving problems associated with truncated consensus sequences produced by RepeatModeler. Details on this analysis are available at https://github.com/Dfam-consortium/RepeatAfterMe --> a variety of files detailing extended consensus sequences. See below.
- Extended consensus sequences are subject to classification/categorization by TEcurate.sh. This is a pipeline that I put together using various tools including TE-Aid from https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-021-00259-7 --> a variety of files detailing the automated curation process.

I'm sending .tgz files containing the output of the TEcurate pipeline.
- DNA.tgz
- fewhits.tgz
- LTR.tgz
- check_orientation.tgz
- zerohits.tgz
- LINE.tgz
- NOHIT.tgz
- RC.tgz

Most of the names are self-explanatory. For the rest:
- fewhits - putative consensus sequences that are represented in the assembly by fewer than 10 copies >90% of full length
- check_orientation - putative consensus sequences that lacked a blastp hit and for which the script was unable to determine likely orientation
- zerohits - putative consensus sequences with zero copies >90% of full length
- NOHIT - putative consensus sequences with no hit to a known TE ORF. This group in arthropods is usually full of nonautonomous DNA transposons, solo LTRs, a few SINEs, and various other repetitive DNA fragments that may or may not be actual TEs. This is where you will spend most of your time. I have plans to incorporate additional processes to categorize these when summer rolls around again.

It's generally not worth spending any time with elements from the fewhits and zerohits categories.

Within each archive are the following for each putative consensus sequence. NOT ALL file types will be present for all putative consensus sequences.
- .pdf - output from TE-Aid. Details on the four panels are available in the manuscript. This is your most useful file.
- .png - an image file to quickly identify 'odd' alignments. I generally use this file to identify alignments that are not useful (flanks are not unique sequence) and cull those from the curation.
- MSA_extended.fa - a multiple sequence alignment that contains hits from the assembly aligned to the input consensus from RepeatModeler. A sequence-based version of the .png file. Useful for identifying sequence hallmarks of certain TE families.
- rep.fa - the final consensus from RAM.
- rep.fa.blastp.out - blastp output.
- rep.fa.genome.blastn.out - blastn output fed into TE-Aid to generate one of the plots.
- rep.fa.orfs.fasta - fasta of ORFs identified by blastp.
- rep.fa.orftable.txt - blastp output in table form.
- rep.fa.self-blast.pairs.txt - input for the dotplot used in self-alignment for TE-Aid output.
- rep_mod.fa - final consensus with shortened name and Class/family information.
- rep_RC.fa - fasta file with consensus and reverse complement of the consensus. This is useful for quick checks of the TIRs of DNA transposons.

Once I get these data, I usually will perform several downstream analyses, focusing on the NOHIT folder.
- USEARCH - using a known TE library from previous curations or curations of known TEs from related species, I use this to eliminate duplicate/previously discovered TEs. This can also be run before TEcurate to avoid processing duplicated consensus sequences.
- BLASTN - using a query of known TEs from relatively closely related species, I can identify known SINEs or other short, nonautonomous TEs.
- LTR post-processing - Dfam, the current best TE database, asks that autonomous LTR retrotransposons be processed such that the LTR and internal segments are deposited separately. I have some scripts that will do this.
- Penelopes - If present, these are problematic, as they often occur in tandem arrays. I've got some scripts that will remove the downstream tandems.
- Long consensus sequences - One of the drawbacks of RepeatModeler/RAM is that they will sometimes detect segmental duplications that are NOT actual TEs. These are usually quite long, >20kb, and low in copy number. Using these criteria I usually just get rid of them.

I'm sure there are some other things that I'm forgetting but I'm always available for discussion.
:::
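
The readme's length criterion for long consensus sequences (>20kb) can be checked quickly from the shell. What follows is a minimal sketch, not the scripts mentioned in the readme: the function names `fasta_lengths` and `flag_long_consensi` are hypothetical, the 20000 bp default is just the readme's ">20kb" read as a strict cutoff, and the input is assumed to be a plain FASTA file. It only handles the length half of the criterion, so copy number still has to be checked separately.

```shell
#!/usr/bin/env bash
# Sketch only: list consensus sequences longer than a length cutoff.
# Assumptions (not from the original notes): the function names, the
# default cutoff of 20000 bp, and a plain (possibly line-wrapped) FASTA input.

# Print "name<TAB>length" for every record in a FASTA file.
fasta_lengths() {
  awk '
    /^>/ {
      if (seen) print name "\t" len   # emit the previous record
      name = substr($1, 2)            # header up to first whitespace, minus ">"
      len = 0
      seen = 1
      next
    }
    { len += length($0) }             # sequence may wrap across several lines
    END { if (seen) print name "\t" len }
  ' "$1"
}

# Print the names of records whose length exceeds the cutoff.
# Usage: flag_long_consensi <fasta> [cutoff_bp]
flag_long_consensi() {
  local fasta="$1" cutoff="${2:-20000}"
  fasta_lengths "$fasta" | awk -F '\t' -v c="$cutoff" '$2 > c { print $1 }'
}
```

For instance, `flag_long_consensi some_library.fa > long_candidates.txt` (both file names are placeholders) would give a list of candidates to compare against the blastn output for copy number before deciding what to drop.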