For de novo genome assemblies for which little TE work has been performed.

# For de novo genome assemblies for which little TE work has been performed. ## 1. Check dfam and RepBase for known TEs. Download what is available as a library for later inclusion. Details ## 2. Run RepeatModeler. --> a .consensus file. headers = rnd-X_family-Y#Class/Family Details ## 3. If dealing with multiple assemblies, use sed to replace 'rnd' with 'gSpe-rnd'. gSpe = genusSpecies Assuming .consensus file is 'gSpe_consensus.fa'.... ``` sed "s|rnd|gSpe-rnd|g" gSpe_consensus.fa >gSpe_tagged.fa ``` ## 4. Rename headers to get rid of # and /. For example, iSca-rnd-5_family-1683#DNA/piggyBac --> iSca-rnd-5_family-1683__DNA___piggyBac. Assuming file is 'gSpe_tagged.fa'.... ``` sed "s|#|__|g" gSpe_tagged.fa | sed "s|/|___|g" >gSpe_tagged_plus.fa ``` ## 5. Run extractAlign/RAM pipeline to generate extensions. --> output in likleyTEs directory Edit https://github.com/davidaray/bioinfo_tools/blob/master/template_extend_align.sh to match your paths and nomenclature. Output consists of three files in path/to/images_and_alignments/likely_TEs .png - image file of MSA MSA_extended.fa - multiple sequence alignment of extracts rep.fa - extended consensus file. ## 6. Re-run repeatClassifier from RepeatModeler and rename files as necessary. Collate all rep.fa files into gSpe_rep.fa. ``` cat *rep.fa >gSpe_rep.fa ``` Use this script. ``` #!/bin/bash #SBATCH --job-name=classifier #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=1 . ~/conda/etc/profile.d/conda.sh conda activate repeatmodeler cd $WORKDIR RepeatClassifier -consensi path/to/gSpe_rep.fa ``` Output will be 'classification.fa' file with pattern of "originalheader#Newclass/newfamily" for each entry. Use bash one-liner to rename to new class/family headers. [Need to write.] ## 7. Use bash one-liner and seqkit to generate a new file consisting of the rep.fa sequence and the reverse complement in a single file. Useful for checking for TIRs. ``` . ~/conda/etc/profile.d/conda.sh conda activate seqkit WORKDIR=path/to/working/directory cd $WORKDIR for FILE in *rep.fa do TENAME=$(basename $FILE rep.fa) seqkit seq ${TENAME}_rep.fa -r -p -t DNA >${TENAME}rep_rc.fa cat ${TENAME}-rep* >${TENAME}rep_RC.fa rm ${TENAME}-rep_rc.fa done ``` ## 8a. Examine .png files visually? Not sure if this is necessary any longer. Details ## 8b. Run TE-aid on all rep.fa files. --> output = .pdf Modify and use code below as necessary: ``` #!/bin/bash #SBATCH --job-name=te-aid #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --mem-per-cpu=60G module --ignore-cache load gcc/10.1.0 r/4.0.2 . ~/conda/etc/profile.d/conda.sh conda activate curate WORKDIR=/path/to/working/directory # Should contain at least the rep.fa files cd $WORKDIR AIDPATH=/lustre/work/daray/software/TE-Aid ASSEMBLY=/path/to/assembly/directory # Should contain at assembly in .fasta format, gSpe.fa. for QUERY in *rep.fa do $AIDPATH/TE-Aid -q $QUERY -g $ASSEMBLY/gSpe.fa -o . done ``` *Each output .pdf has four panels.* UL - Alignment plot with #FL, divergence, consensus size, #fragments UR - genomic coverage plot LL - dotplot LR - consensus structure and protein hits ## 9 Remove redundancies with cd-hit-est Modify script below as needed: ``` #!/bin/bash #SBATCH --job-name=cdhit #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=36 #SBATCH --mail-user david.a.ray@ttu.edu WORKDIR=/path/to/working/directory IDVALUE="90" CDHIT=/lustre/work/daray/software/cdhit-4.8.1 cd $WORKDIR $CDHIT/cd-hit-est -i gSpe_4_cdhit.fa -o gSpe_4_cdhit${IDVALUE}_aS09 -c 0.${IDVALUE} -d 70 -aS 0.9 -n 10 -T 36 -M 2200 echo "Creating directories" mkdir -p clusters_${IDVALUE}_aS09 cd clusters_${DIVALUE}_aS09 mkdir -p small_clusters_${IDVALUE}_aS09 mkdir gSpe_lineage mkdir unsorted echo "Creating individual files for each cluster" awk '/>Cluster/ {x="F"++i".txt";}{print >x;}' ../gSpe_4_cdhit${IDVALUE}_aS09.clstr #Renames each cluster file and moves into unsorted directory echo "Renaming cluster files and moving to unsorted directory" for F in *.txt; do CLUSTER=$(grep ">Cluster" $F); N=${CLUSTER:1}; N=${N/ /_}; mv $F unsorted/${N}.txt; done #filters files in unsorted and sends new consensi to lineage-specific directory echo "Filtering all files to find new consensi and send to lineage-specific directory" for FILE in unsorted/*.txt; do grep -q "rnd-" $FILE && mv $FILE gSpe_lineage; done echo "Migrating to lineage-specific directory" cd gSpe_lineage #create a table of cluster sizes echo "Creating table of cluster sizes" for F in *.txt; do COUNT=$(grep -v ">Cluster" $F | wc -l); echo ${F::-4} $COUNT >>cluster_count_table${IDVALUE}_aS09.txt; done #sort table based on size of cluster echo "Sorting the table" sort -k2 -n -r -o cluster_count_table${IDVALUE}_aS09.txt cluster_count_table${IDVALUE}_aS09.txt #Parse cluster files and sort by size of TE echo "Alter content of cluster files and sort based on TE sizes" for FILE in Cluster*; do F=$(basename $FILE .txt); sed "s/nt,//g" $FILE | sed "s/>//g" | sed "s/... at//g" | awk 'NR > 1 {print $2, $3, $4}' | sort -k1 -r -n > ../${F}_sort.txt; done echo "Migrating to main working directory" cd $WORKDIR pwd ``` output --> file with no extension. Will deal with later. Keep all * named files from output directory. Move all un-* files to archive. Save for if needed. ## 10. Use bash one-liner to sort into major categories and subcategories. [Need to write.] Unknown No real way to subcategorize LINE CR1, Penelope, RTE, Tx, L1, L2, Tad, I, unknown, add as needed LTR Gypsy, Copia, Pao, unknown, add as needed DNA piggyBac, PiggyBac, TcMar, hAT, Maverick, Chapaev, P, unknown, add as needed SINE No real way to subcategorize Satellite No real way to subcategorize RC No real way to subcategorize rRNA No real way to subcategorize tRNA No real way to subcategorize Simple_repeat No real way to subcategorize ## 11. Within each category, triage by examining .pdfs Each potential TE has 5 files. .pdf rep.fa MSA_extended.fa rep_RC.fa .png ### Things to look for..... **Tandem repeats** - will need to be examined visually and manipulated in subsequent steps to remove the duplicate copies. Common for penelope elements and some LTR elements, among other types. Examples: ![](https://i.imgur.com/K1t7cay.png) ![](https://i.imgur.com/gvtcBQN.png) [Add another example.] Toss any of these into a 'to_trim' directory for later. **Rainbow satellites** - You'll see. Toss into Satellite folder. ![](https://i.imgur.com/q534kPv.png) **Overextended consensus sequences** - Note the presence of two long hits in the upper left panel while most hits are about 3/4 the length. Will need to be trimmed in subsequent steps. ![](https://i.imgur.com/Mzjffk4.png) ### LINE CR1, Penelope, RTE, Tx, L1, L2, Tad, I, unknown, add as needed #### RTE or RTE_RC This is what you should see with an RTE ![](https://i.imgur.com/ioXrhtP.png) Notes: - the stairstep pattern in the upper right panel that declines right to left - the presence of an RTE ORF in the lower right panel If you see this, move all files to RTE folder. This is an RTE that's in the reverse orientation (RC) All the same as above but the stairstep is reversed. ![](https://i.imgur.com/aDZMvQ0.png) If you see this, move all files to the RTE_RC folder. #### CR1/L2 and CR1_RC/L2_RC Same basic patterns as above but these often (but not always) have satellite repeats at the 5' end. This is a CR1. Note the satellite repeats (parallel lines) in the lower left corner of the lower left panel. ![CR1](https://i.imgur.com/CeIzU4I.png) This is a CR1_RC. Note that this one doesn't have the satellite repeats but that's ok. They don't always have them. ![](https://i.imgur.com/cwEdhin.png) Same pattern for L2 elements. Only showing one here. ![](https://i.imgur.com/4uCUAk9.png) This is an L2 with satellite repeats (not always present.) #### Penelope and Penelope_RC Similar to CR1 and RTE but these elements sometimes have a tandem repeat pattern with the same element showing up once or with 1.5 - 3 copies in tandem. Here's a single copy insertion. I has the expected decreasing right-to-left stairstep pattern (upper right), a clear penelope ORF (lower right). ![](https://i.imgur.com/B2f6rQW.png) Here's one with 1.5 copies. Note that the second copy is identified by the presence of two penelope ORFs, one of them partial (lower right panel), the repeated and nearly identical peaks (upper right panel), and the terminal repeats (short lines parallel to the center line in the lower left panel). ![](https://i.imgur.com/CA0e1Px.png) Here's one with 2.5 copies. One copy (at the left ends of the right panels) is partial. ![](https://i.imgur.com/eyywvTb.png) Just like the previous ones, if the stairstep pattern is reversed, the element is reverse complemented (RC). Please sort accordingly. #### Other LINEs Other LINEs will follow the same patterns as above. Just toss anything else into a generic LINEs folder. I'll sort them later. ### LTR Gypsy, Copia, Pao, unknown, add as needed #### Gypsy/Copia/Pao Presence of a clear Gypsy/Copia/Pao ORF (lower right panel), high 5' and 3' peaks in the upper right panel, the presence of long terminal repeats (short parallel lines in the corners of the lower left panel). ![](https://i.imgur.com/fcFxEZw.png) ### DNA piggyBac, PiggyBac, TcMar, hAT, Maverick, Chapaev, P, TIR, unknown, add as needed #### Autonomous elements At least two of the following: - Total length >1500 bp - ORFs >1000 bp with clearly named ORF in one of these categories. - Presence of a TIR - Short inward-pointing repeats (see the green lines in lower left of this figure) ![](https://i.imgur.com/0ANaSwQ.png) ) - Usually also has high coverage peaks at 5' and 3' ends. - Usually relatively few copies compared to non-autonomous members. #### Possible non-autonomous DNA transposons At least - Presence of TIR - No large ORF - <1500 bp - Usually many hundreds of copies or more (>autonomous) ### SINE - Gotta be short, <500 bp - No ORF - No TIRs ### Unknown If any unknown has clear characteristics of one of the above categories, sort into that directory. Otherwise ## 12. Examine the RC files for SINEs, DNAs, and Unknowns TIR-check Some TIRs are under 10 bp and won't be visible in the dotplot. They can be detected in the RC.fa files. - If TIR is present, not a SINE. Possibly a non-autonomous DNA transposon. Tail-check - SINEs will have some sort of repetitive tail at one end. LTR-termini-check - Check for solo LTRs. Have TG, TGT, or TGTT at the termini. ![](https://i.imgur.com/iGsXA9J.png) ## 13. Examine MSA for target site duplications (TSDs) in SINEs, DNAs, and Unknowns Learn how. ## 14. https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi ## Final naming Using information gleaned from above, name consensus sequences as appropriate. I have a python script to do this. iSca-rnd-5_family-1683__DNA___piggyBac ---> iSca.5.1638#DNA/piggyBac