# For de novo genome assemblies for which little TE work has been performed.
## 1. Check dfam and RepBase for known TEs. Download what is available as a library for later inclusion.
Details
## 2. Run RepeatModeler. --> a .consensus file. headers = rnd-X_family-Y#Class/Family
Details
## 3. If dealing with multiple assemblies, use sed to replace 'rnd' with 'gSpe-rnd'. gSpe = genusSpecies
Assuming .consensus file is 'gSpe_consensus.fa'....
```
sed "s|rnd|gSpe-rnd|g" gSpe_consensus.fa >gSpe_tagged.fa
```
## 4. Rename headers to get rid of # and /. For example, iSca-rnd-5_family-1683#DNA/piggyBac --> iSca-rnd-5_family-1683__DNA___piggyBac.
Assuming file is 'gSpe_tagged.fa'....
```
sed "s|#|__|g" gSpe_tagged.fa | sed "s|/|___|g" >gSpe_tagged_plus.fa
```
## 5. Run extractAlign/RAM pipeline to generate extensions. --> output in likleyTEs directory
Edit https://github.com/davidaray/bioinfo_tools/blob/master/template_extend_align.sh to match your paths and nomenclature.
Output consists of three files in path/to/images_and_alignments/likely_TEs
.png - image file of MSA
MSA_extended.fa - multiple sequence alignment of extracts
rep.fa - extended consensus file.
## 6. Re-run repeatClassifier from RepeatModeler and rename files as necessary.
Collate all rep.fa files into gSpe_rep.fa.
```
cat *rep.fa >gSpe_rep.fa
```
Use this script.
```
#!/bin/bash
#SBATCH --job-name=classifier
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=1
. ~/conda/etc/profile.d/conda.sh
conda activate repeatmodeler
cd $WORKDIR
RepeatClassifier -consensi path/to/gSpe_rep.fa
```
Output will be 'classification.fa' file with pattern of
"originalheader#Newclass/newfamily" for each entry.
Use bash one-liner to rename to new class/family headers.
[Need to write.]
## 7. Use bash one-liner and seqkit to generate a new file consisting of the rep.fa sequence and the reverse complement in a single file. Useful for checking for TIRs.
```
. ~/conda/etc/profile.d/conda.sh
conda activate seqkit
WORKDIR=path/to/working/directory
cd $WORKDIR
for FILE in *rep.fa
do TENAME=$(basename $FILE rep.fa)
seqkit seq ${TENAME}_rep.fa -r -p -t DNA >${TENAME}rep_rc.fa
cat ${TENAME}-rep* >${TENAME}rep_RC.fa
rm ${TENAME}-rep_rc.fa
done
```
## 8a. Examine .png files visually? Not sure if this is necessary any longer.
Details
## 8b. Run TE-aid on all rep.fa files. --> output = .pdf
Modify and use code below as necessary:
```
#!/bin/bash
#SBATCH --job-name=te-aid
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=60G
module --ignore-cache load gcc/10.1.0 r/4.0.2
. ~/conda/etc/profile.d/conda.sh
conda activate curate
WORKDIR=/path/to/working/directory # Should contain at least the rep.fa files
cd $WORKDIR
AIDPATH=/lustre/work/daray/software/TE-Aid
ASSEMBLY=/path/to/assembly/directory # Should contain at assembly in .fasta format, gSpe.fa.
for QUERY in *rep.fa
do $AIDPATH/TE-Aid -q $QUERY -g $ASSEMBLY/gSpe.fa -o .
done
```
*Each output .pdf has four panels.*
UL - Alignment plot with #FL, divergence, consensus size, #fragments
UR - genomic coverage plot
LL - dotplot
LR - consensus structure and protein hits
## 9 Remove redundancies with cd-hit-est
Modify script below as needed:
```
#!/bin/bash
#SBATCH --job-name=cdhit
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=36
#SBATCH --mail-user david.a.ray@ttu.edu
WORKDIR=/path/to/working/directory
IDVALUE="90"
CDHIT=/lustre/work/daray/software/cdhit-4.8.1
cd $WORKDIR
$CDHIT/cd-hit-est -i gSpe_4_cdhit.fa -o gSpe_4_cdhit${IDVALUE}_aS09 -c 0.${IDVALUE} -d 70 -aS 0.9 -n 10 -T 36 -M 2200
echo "Creating directories"
mkdir -p clusters_${IDVALUE}_aS09
cd clusters_${DIVALUE}_aS09
mkdir -p small_clusters_${IDVALUE}_aS09
mkdir gSpe_lineage
mkdir unsorted
echo "Creating individual files for each cluster"
awk '/>Cluster/ {x="F"++i".txt";}{print >x;}' ../gSpe_4_cdhit${IDVALUE}_aS09.clstr
#Renames each cluster file and moves into unsorted directory
echo "Renaming cluster files and moving to unsorted directory"
for F in *.txt; do CLUSTER=$(grep ">Cluster" $F); N=${CLUSTER:1}; N=${N/ /_}; mv $F unsorted/${N}.txt; done
#filters files in unsorted and sends new consensi to lineage-specific directory
echo "Filtering all files to find new consensi and send to lineage-specific directory"
for FILE in unsorted/*.txt; do grep -q "rnd-" $FILE && mv $FILE gSpe_lineage; done
echo "Migrating to lineage-specific directory"
cd gSpe_lineage
#create a table of cluster sizes
echo "Creating table of cluster sizes"
for F in *.txt; do COUNT=$(grep -v ">Cluster" $F | wc -l); echo ${F::-4} $COUNT >>cluster_count_table${IDVALUE}_aS09.txt; done
#sort table based on size of cluster
echo "Sorting the table"
sort -k2 -n -r -o cluster_count_table${IDVALUE}_aS09.txt cluster_count_table${IDVALUE}_aS09.txt
#Parse cluster files and sort by size of TE
echo "Alter content of cluster files and sort based on TE sizes"
for FILE in Cluster*; do F=$(basename $FILE .txt); sed "s/nt,//g" $FILE | sed "s/>//g" | sed "s/... at//g" | awk 'NR > 1 {print $2, $3, $4}' | sort -k1 -r -n > ../${F}_sort.txt; done
echo "Migrating to main working directory"
cd $WORKDIR
pwd
```
output --> file with no extension. Will deal with later.
Keep all * named files from output directory.
Move all un-* files to archive. Save for if needed.
## 10. Use bash one-liner to sort into major categories and subcategories.
[Need to write.]
Unknown
No real way to subcategorize
LINE
CR1, Penelope, RTE, Tx, L1, L2, Tad, I, unknown, add as needed
LTR
Gypsy, Copia, Pao, unknown, add as needed
DNA
piggyBac, PiggyBac, TcMar, hAT, Maverick, Chapaev, P, unknown, add as needed
SINE
No real way to subcategorize
Satellite
No real way to subcategorize
RC
No real way to subcategorize
rRNA
No real way to subcategorize
tRNA
No real way to subcategorize
Simple_repeat
No real way to subcategorize
## 11. Within each category, triage by examining .pdfs
Each potential TE has 5 files.
.pdf
rep.fa
MSA_extended.fa
rep_RC.fa
.png
### Things to look for.....
**Tandem repeats** - will need to be examined visually and manipulated in subsequent steps to remove the duplicate copies. Common for penelope elements and some LTR elements, among other types.
Examples:


[Add another example.]
Toss any of these into a 'to_trim' directory for later.
**Rainbow satellites** - You'll see. Toss into Satellite folder.

**Overextended consensus sequences** - Note the presence of two long hits in the upper left panel while most hits are about 3/4 the length. Will need to be trimmed in subsequent steps.

### LINE
CR1, Penelope, RTE, Tx, L1, L2, Tad, I, unknown, add as needed
#### RTE or RTE_RC
This is what you should see with an RTE

Notes:
- the stairstep pattern in the upper right panel that declines right to left
- the presence of an RTE ORF in the lower right panel
If you see this, move all files to RTE folder.
This is an RTE that's in the reverse orientation (RC)
All the same as above but the stairstep is reversed.

If you see this, move all files to the RTE_RC folder.
#### CR1/L2 and CR1_RC/L2_RC
Same basic patterns as above but these often (but not always) have satellite repeats at the 5' end.
This is a CR1. Note the satellite repeats (parallel lines) in the lower left corner of the lower left panel.

This is a CR1_RC. Note that this one doesn't have the satellite repeats but that's ok. They don't always have them.

Same pattern for L2 elements. Only showing one here.

This is an L2 with satellite repeats (not always present.)
#### Penelope and Penelope_RC
Similar to CR1 and RTE but these elements sometimes have a tandem repeat pattern with the same element showing up once or with 1.5 - 3 copies in tandem.
Here's a single copy insertion. I has the expected decreasing right-to-left stairstep pattern (upper right), a clear penelope ORF (lower right).

Here's one with 1.5 copies. Note that the second copy is identified by the presence of two penelope ORFs, one of them partial (lower right panel), the repeated and nearly identical peaks (upper right panel), and the terminal repeats (short lines parallel to the center line in the lower left panel).

Here's one with 2.5 copies. One copy (at the left ends of the right panels) is partial.

Just like the previous ones, if the stairstep pattern is reversed, the element is reverse complemented (RC). Please sort accordingly.
#### Other LINEs
Other LINEs will follow the same patterns as above. Just toss anything else into a generic LINEs folder. I'll sort them later.
### LTR
Gypsy, Copia, Pao, unknown, add as needed
#### Gypsy/Copia/Pao
Presence of a clear Gypsy/Copia/Pao ORF (lower right panel), high 5' and 3' peaks in the upper right panel, the presence of long terminal repeats (short parallel lines in the corners of the lower left panel).

### DNA
piggyBac, PiggyBac, TcMar, hAT, Maverick, Chapaev, P, TIR, unknown, add as needed
#### Autonomous elements
At least two of the following:
- Total length >1500 bp
- ORFs >1000 bp with clearly named ORF in one of these categories.
- Presence of a TIR - Short inward-pointing repeats (see the green lines in lower left of this figure) 
)
- Usually also has high coverage peaks at 5' and 3' ends.
- Usually relatively few copies compared to non-autonomous members.
#### Possible non-autonomous DNA transposons
At least
- Presence of TIR
- No large ORF
- <1500 bp
- Usually many hundreds of copies or more (>autonomous)
### SINE
- Gotta be short, <500 bp
- No ORF
- No TIRs
### Unknown
If any unknown has clear characteristics of one of the above categories, sort into that directory. Otherwise
## 12. Examine the RC files for SINEs, DNAs, and Unknowns
TIR-check
Some TIRs are under 10 bp and won't be visible in the dotplot. They can be detected in the RC.fa files.
- If TIR is present, not a SINE. Possibly a non-autonomous DNA transposon.
Tail-check
- SINEs will have some sort of repetitive tail at one end.
LTR-termini-check
- Check for solo LTRs. Have TG, TGT, or TGTT at the termini.

## 13. Examine MSA for target site duplications (TSDs) in SINEs, DNAs, and Unknowns
Learn how.
## 14. https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
## Final naming
Using information gleaned from above, name consensus sequences as appropriate.
I have a python script to do this.
iSca-rnd-5_family-1683__DNA___piggyBac ---> iSca.5.1638#DNA/piggyBac