# Bat1k de novo TE curation
RepeatModeler on each assembly, masked using final mammal library - March, 2021 - see /lustre/scratch/daray/bat1k_TE_analyses/rmasker/<folder>_RM_Ns
```
mkdir /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall_Ns
cd /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall_Ns
```
Process RepeatModeler output to eliminate short (<100 bp) consensi. Cat with mammal TE library.
```
mkdir cdhit_analysis
find . -type f -name "*_N_consensi.fa" | xargs cp -t cdhit_analysis/
LIST="aJam aPal aSto eFus gSor hCyc hLar hMon nLep nTum pHas rAff rCfl rSed rTri sBil sHon tBra"
cd cdhit_analysis
SOFTWARE=/lustre/work/daray/software
GITPATH=~/gitrepositories/bioinfo_tools
for NAME in $LIST;
do $SOFTWARE/EMBOSS-6.6.0/emboss/sizeseq -sequences ${NAME}_N_consensi.fa -descending N -outseq ${NAME}_N_consensi_sort.fa;
perl $SOFTWARE/removesmalls.pl 100 ${NAME}_N_consensi_sort.fa >${NAME}_N_consensi_100.fa;
cat ${NAME}_N_consensi_100.fa final_mammal_library.fa >${NAME}_4_cdhit.fa;
done
for NAME in $LIST;
do cut -f1 -d" " ${NAME}_4_cdhit.fa >${NAME}_4_cdhit_cut.fa
done
for NAME in $LIST;
do sed "s/rnd-/${NAME}-rnd-/g" ${NAME}_4_cdhit_cut.fa >${NAME}_4_cdhit.fa
done
```
Run cd-hit-est on all concatenated libraries. 80% similarity over 80% of length with comparison.
```
sbatch cdhit_aS.sh
```
Find clusters existing only in target species.
Count number of lines with 'nt' in them.
Count number of instances of 'family'.
If they're equal, send file to new folder
```
WORKDIR=/lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall-Ns/cdhit_analysis
LIST="aJam aPal aSto eFus gSor hCyc hLar hMon nLep nTum pHas rAff rCfl rSed rTri sBil sHon tBra"
cd $WORKDIR
for NAME in $LIST
do mkdir $WORKDIR/${NAME}_novel
cd $WORKDIR/${NAME}_novel
for FILE in $WORKDIR/$NAME/clusters_80_aS08/${NAME}_lineage/Cluster_*.txt
do LINECOUNT=$(grep 'nt' $FILE | wc -l)
FAMILYCOUNT=$(grep 'family' $FILE | wc -l)
[ $FAMILYCOUNT -eq $LINECOUNT ] && cp $FILE .
done
done
```
Grab the longest hit from each cluster and append name of putative TE to file.
```
WORKDIR=/lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall-Ns/cdhit_analysis
cd $WORKDIR
LIST="aJam aPal aSto eFus gSor hCyc hLar hMon nLep nTum pHas rAff rCfl rSed rTri sBil sHon tBra"
for NAME in $LIST ; do for FILE in ${NAME}_novel/Cluster_*.txt; do fgrep "*" $FILE | cut -d">" -f2 | cut -d"." -f1 >> ${NAME}_novel.txt; done; done
for NAME in $LIST; do python $GITPATH/pull_seqs.py -l ${NAME}_novel.txt -f ${NAME}_4_cdhit.fa -o ${NAME}_novel.fa; done
```
Do I want to cat all the novel files and run a cd-hit analysis? I don't know but I think not.
Moved to /lustre/scratch/daray/bat1k_TE_analyses/te-extensions_N
Perform two extend align runs. One using the novel TEs querying the unmasked genome and another querying the genome masked with Ns.
Note, I began using a text file called list.txt. It's just the same list as before as a text file.
```
WORKDIR=/lustre/scratch/daray/bat1k_TE_analyses/te-extensions_N
cd $WORKDIR
cp /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall-Ns/cdhit_analysis/*_novel.fa .
cat list.txt | while read i; do sed "s/<NAME>/$i/g" template_extend_align.sh >${i}_extend_align.sh; done
cat list.txt | while read i; do sbatch ${i}_extend_align.sh; done
cat list.txt | while read i; do sed "s/<NAME>/$i/g" template_extend_align_masked.sh >${i}_extend_align_masked.sh; done
cat list.txt | while read i; do sbatch ${i}_extend_align_masked.sh; done
cat list.txt | while read i; do for NAME in ls ${i}_N/images_and_alignments/possible_SD/*.png; do FILENAME=$(basename $NAME .png); rm ${i}_N/images_and_alignments/likely_TEs/${FILENAME}*; done; done
cat list.txt | while read i; do for NAME in ls ${i}_N_masked/images_and_alignments/possible_SD/*.png; do FILENAME=$(basename $NAME .png); rm ${i}_N_masked/images_and_alignments/likely_TEs/${FILENAME}*; done; done
```
Create a table of hits from the masked run using the .out file from the extract_align run
```
cat list.txt | while read i; do grep Hit ${i}.masked.*.out | awk '{print $4, $6}' >>${i}_hit_counts_masked.txt; sort -k2 -rn -o ${i}_hit_counts_masked.txt ${i}_hit_counts_masked.txt; done
```
Get list of TEs to be considered. Only TEs with more than 40 hits.
```
cat list.txt | while read i; do awk '$2 >40' ${i}_hit_counts_masked.txt | awk '{print $1}' >${i}_4_examination_masked.txt; sort -n -o ${i}_4_examination_masked.txt ${i}_4_examination_masked.txt; done
```
Pull all of the relevant files and copy them to a folder for later download
```
cat list.txt | while read i; do mkdir -p ${i}_4_examination/masked; cat ${i}_4_examination_masked.txt | while read j; do find ${i}_N_masked/images_and_alignments/likely_TEs -name $j* -exec cp {} ${i}_4_examination/masked \; ; done; done
```
Create a table of hits from the unmasked run using the .out file from the extract_align run
```
cat list.txt | while read i; do grep Hit ${i}.*.out | awk '{print $4, $6}' >>${i}_hit_counts.txt; sort -k2 -rn -o ${i}_hit_counts.txt ${i}_hit_counts.txt; done
```
Get list of TEs to be considered. Only TEs with more than 40 hits.
```
cat list.txt | while read i; do awk '$2 >40' ${i}_hit_counts.txt | awk '{print $1}' >${i}_4_examination.txt; sort -n -o ${i}_4_examination.txt ${i}_4_examination.txt; done
```
Pull all of the relevant files and copy them to a folder for later download
```
cat list.txt | while read i; do mkdir -p ${i}_4_examination/unmasked; cat ${i}_4_examination.txt | while read j; do find ${i}_N/images_and_alignments -name $j* -exec cp {} ${i}_4_examination/unmasked \; ; done; done
```
Create list of potential TEs to be examined from concatenated rep files for each species
```
cat list.txt | while read i; do cat ${i}_4_examination/unmasked/*rep.fa > ${i}_4_examination/unmasked/${i}_cat_rep.fa ; done
cat list.txt | while read i; do cat ${i}_4_examination/masked/*rep.fa > ${i}_4_examination/masked/${i}_cat_rep.fa ; done
cat list.txt | while read i; do cd ${i}_4_examination/unmasked; grep ">" ${i}_cat_rep.fa | sed "s/>//g" > ${i}_querylist.txt; cd ../.. ; done
```
Run RepeatMasker on the concatenated list of consensus sequences
```
RMASKER=/lustre/work/daray/software/RepeatMasker-4.1.2-p1
for i in $LIST; do cd ${i}_4_examination/unmasked; $RMASKER/RepeatMasker ${i}_cat_rep.fa -species mammalia -dir . -pa 12; cd ../../; done
```
Create a new .out file for each putative TE for examination
```
cat list.txt | while read i; do cd ${i}_4_examination/unmasked; cat ${i}_querylist.txt | while read j; do grep "$j" ${i}_cat_rep.fa.out > ${j}.out; done; cd ../.. ; done
```
Add the length of each consensus to the filename
```
cat list.txt | while read i; do cd ${i}_4_examination/unmasked; cat ${i}_querylist.txt | while read j; do LEN=$(seqkit stats ${j}_rep.fa | tail -1 | awk '{print $7}' | sed "s/,//g"); mv ${j}_rep.fa ${j}_rep_${LEN}_bp.fa; done; cd ../.. ; done
```
Zip the folder for later download
```
cat list.txt | while read i; do tar -zcf ${i}_4_examination.tgz ${i}_4_examination/; done
```
Manually examined all TE files in the unmasked directories and categoriezed as 'good', 'long', bad', and 'odd'.
good - a clear example of a TE with obvious flanking sequences.
bad - A likely segmental duplication. Censor (later switched to RMasker) analysis suggests multiple small TEs in the long consensus when rep.fa is submitted. No blast hits to a TE-derived ORF using blastn or blastx.
odd - Possible TE but it's hard to say.
long - >10,000 bp and shows evidence of multiple TE insertions
This strategy is working. See notes in '[TE curation analysis](https://hackmd.io/uYYY39JkQlmypswNRFH8ow)' hackmd page.
## Time to name the TEs
All of the above work was done on local computers. Move everything back to HPCC in a new directory /lustre/scratch/daray/bat1k_TE_analyses/te-naming.
Here's the plan:
1. Concatenate all of the 'good' rep.fa files for each taxon.
2. Run DeepTE to get one initial analysis.
3. Run TEClass as a second opinion.
4. Compare results to get a consensus opinion.
5. Examine all results for ACTAG to find missed Helitrons.
```
cat list.txt | while read i; do cat ${i}_4_examination/unmasked/good/*_rep.fa >${i}_good_rep.fa; done
cat list.txt | while read i; do tar -zcf ${i}_4_examination_curated.tgz ${i}_4_examination; done
```
Move all .tgz files to /lustre/scratch/daray/bat1k_TE_analyses/te-naming. Copy all the 'good' consensus sequences to deepte folder.
```
cat list.txt | while read i; do cp /lustre/scratch/daray/bat1k_TE_analyses/te-naming/${i}_4_examination/unmasked/good/${i}_good_rep.fa /lustre/scratch/daray/bat1k_TE_analyses/deepte; done
```
Run DeepTE
```
cd /lustre/scratch/daray/bat1k_TE_analyses/deepte
sbatch deepte_domain.sh
cat list.txt | while read i; do cp ${i}_output/opt_DeepTE.fasta ${i}_DeepTE.fa; done
```
Run all DeepTE.fa files through TEClass using online interface. Download the results in two forms --> the library (as teclass.fa) and the table. Save the table as Excel csv (${i}_teclass.csv).
Create list of all comparisons between DeepTE and TEClass.
```
cat list.txt | while read i ; do grep ">" ../teclass/${i}_teclass.fa | sed "s/>//g" | sed "s/__/|/g" | sed "s/TEclass result://g"| sed "s/|/ /g"| awk '{print $1 "\t" $2 "\t" $3 "\t" $4}' >> ${i}_names.txt; done
```
Output:
```
aJam-rnd-5_family-1025 ClassI_LTR_Gypsy LINE forward
aJam-rnd-5_family-104 unknown LTR reverse
aJam-rnd-5_family-1334 ClassI_nLTR_LINE_L1 LTR reverse
aJam-rnd-5_family-3077 ClassII_MITE LTR reverse
aJam-rnd-5_family-367 ClassI LTR reverse
aJam-rnd-5_family-754 ClassI_LTR_ERV LTR forward
aJam-rnd-6_family-10581 ClassI_LTR_ERV LTR reverse
aJam-rnd-6_family-10967 ClassI_nLTR_LINE_L1 LTR forward
aJam-rnd-6_family-15619 ClassII_DNA_TcMar_MITE LTR reverse
aJam-rnd-6_family-1713 ClassI_LTR_ERV LTR forward
aJam-rnd-6_family-1873 ClassI_LTR_ERV LTR forward
aJam-rnd-6_family-2619 ClassI_nLTR_LINE_L1 LINE reverse
aJam-rnd-6_family-3796 ClassII_DNA_TcMar_nMITE DNA forward
aJam-rnd-6_family-4117 ClassI_LTR_ERV LTR reverse
aJam-rnd-6_family-4307 ClassI_nLTR LTR reverse
aJam-rnd-6_family-4974 ClassI_LTR_Gypsy LTR reverse
aJam-rnd-6_family-5043 ClassI_LTR_ERV LTR forward
aJam-rnd-6_family-5739 ClassI_LTR_ERV LTR forward
aJam-rnd-6_family-5806 ClassI_LTR_ERV LTR forward
aJam-rnd-6_family-615 ClassI_LTR_ERV LTR forward
aJam-rnd-6_family-7713 ClassII_DNA_TcMar_nMITE Retro forward
```
Manually examine all the files and check for correspondence. Add column indicating agreement or disagreement.
```
aJam-rnd-5_family-1025 ClassI_LTR_Gypsy LINE forward n
aJam-rnd-5_family-104 unknown LTR reverse n
aJam-rnd-5_family-1334 ClassI_nLTR_LINE_L1 LTR reverse n
aJam-rnd-5_family-3077 ClassII_MITE LTR reverse n
aJam-rnd-5_family-367 ClassI LTR reverse n
aJam-rnd-5_family-754 ClassI_LTR_ERV LTR forward y
aJam-rnd-6_family-10581 ClassI_LTR_ERV LTR reverse y
aJam-rnd-6_family-10967 ClassI_nLTR_LINE_L1 LTR forward n
aJam-rnd-6_family-15619 ClassII_DNA_TcMar_MITE LTR reverse n
aJam-rnd-6_family-1713 ClassI_LTR_ERV LTR forward y
aJam-rnd-6_family-1873 ClassI_LTR_ERV LTR forward y
aJam-rnd-6_family-2619 ClassI_nLTR_LINE_L1 LINE reverse y
aJam-rnd-6_family-3796 ClassII_DNA_TcMar_nMITE DNA forward y
aJam-rnd-6_family-4117 ClassI_LTR_ERV LTR reverse y
aJam-rnd-6_family-4307 ClassI_nLTR LTR reverse n
aJam-rnd-6_family-4974 ClassI_LTR_Gypsy LTR reverse y
aJam-rnd-6_family-5043 ClassI_LTR_ERV LTR forward y
aJam-rnd-6_family-5739 ClassI_LTR_ERV LTR forward y
aJam-rnd-6_family-5806 ClassI_LTR_ERV LTR forward y
aJam-rnd-6_family-615 ClassI_LTR_ERV LTR forward y
aJam-rnd-6_family-7713 ClassII_DNA_TcMar_nMITE Retro forward n
```
Pull matches and nonmatches for examination.
```
cat list.txt| while read i; do awk '{ if ($5 == "n") { print } }' ${i}_names.tsv | awk '{print $1}' >>${i}_nonmatches.txt; done
cat list.txt| while read i; do awk '{ if ($5 == "y") { print } }' ${i}_names.tsv | awk '{print $1}' >>${i}_matches.txt; done
```
Pull sequences from teclass.fa file.
```
cp ~/gitrepositories/bioinfo_tools/pull_seqs.py .
```
Modified to split on __ instead of #.
```
cat list.txt| while read i; do python pull_seqs.py -l ${i}_matches.txt -f ../teclass/${i}_teclass.fa -o ${i}_matches.fa; done
cat list.txt| while read i; do python pull_seqs.py -l ${i}_nonmatches.txt -f ../teclass/${i}_teclass.fa -o ${i}_nonmatches.fa; done
```
Also pull MSA_extended.fa files from the extend_align step.
```
mkdir nonmatched_MSA
cat list.txt| while read i; do cat ${i}_nonmatches.txt| while read j; do cp /lustre/scratch/daray/bat1k_TE_analyses/te-extensions_N/${i}_4_examination/unmasked/${j}_MSA_extended.fa nonmatched_MSA; done; done
```
Visually examined the nonmatched and evaluated for possible ID. All results saved as _names.tsv in /lustre/scratch/daray/bat1k_TE_analyses/te-naming
Final resuls for all TEs save in all_names.tsv. Also in that file is the old name/new name equivalency list.
Saved oldname/newname switch file as rename.list.txt.
Copy all rep.fa files and concatenate for renaming.
```
cat ../deepte/*rep.fa >all.rep.fa
```
Problem. sBil-1.1 read the same as sBil-1.100. This causes problems with the replacement step.
Solution. Add a character to the end of all headers in all.rep.fa using fabox and then try again.
Save as all.repx.fa
All headers in all.repx.fa now end with 'x'. Added 'x' to all headers in the rename.list.txt file.
Renamed and output final library.
```
python ~/gitrepositories/bioinfo_tools/rename_fasta.py -r rename.list.txt -f all.rep.fa -o bat1k_denovo_TE_modnames.fa
sed "s|___|/|g" bat1k_denovo_TE_modnames.fa | sed "s/__/#/g" >bat1k_denovo_TE.fa
```
# Four new bat assemblies
mNig
pHas2
tTri
tTri.alt
uBil
Create new directory and list.txt
```
mkdir /lustre/scratch/daray/bat1k_TE_analyses/newbat1k
cd /lustre/scratch/daray/bat1k_TE_analyses/newbat1k
nano list.txt
```
add
mNig
pHas2
tTri
tTri.alt
uBil
RepeatMasker using new library
```
mkdir rmasker
cd rmasker/
cp ../list.txt .
cp ../../denovo_rmask/template_RM.sh .
cp ../../denovo_rmask/template_rm2bed.sh .
```
Modify templates to use the new assemblies, new directory and to mask to Ns.
Run RepeatMasker
```
cat list.txt | while read i; do sed "s/<NAME>/$i/g" template_RM.sh >${i}_RM.sh; sed "s/<NAME>/$i/g" template_rm2bed.sh >${i}_rm2bed.sh; done
cat list.txt | while read i; do sbatch ${i}_RM.sh; done
```
RepeatModeler on each assembly, masked using final mammal library - March, 2021 - see /lustre/scratch/daray/bat1k_TE_analyses/rmasker/<folder>_RM_Ns
```
mkdir /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall_Ns
cd /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall_Ns
```
Process RepeatModeler output to eliminate short (less than 100 bp) consensi. Cat with mammal TE library.
```
mkdir cdhit_analysis
find . -type f -name "*_N_consensi.fa" | xargs cp -t cdhit_analysis/
LIST="mNig pHas2 tTri tTri.alt uBil"
cd cdhit_analysis
SOFTWARE=/lustre/work/daray/software
GITPATH=~/gitrepositories/bioinfo_tools
for NAME in $LIST;
do $SOFTWARE/EMBOSS-6.6.0/emboss/sizeseq -sequences ${NAME}_N_consensi.fa -descending N -outseq ${NAME}_N_consensi_sort.fa;
perl $SOFTWARE/removesmalls.pl 100 ${NAME}_N_consensi_sort.fa >${NAME}_N_consensi_100.fa;
cat ${NAME}_N_consensi_100.fa final_mammal_library.fa >${NAME}_4_cdhit.fa;
done
for NAME in $LIST;
do cut -f1 -d" " ${NAME}_4_cdhit.fa >${NAME}_4_cdhit_cut.fa
done
for NAME in $LIST;
do sed "s/rnd-/${NAME}-rnd-/g" ${NAME}_4_cdhit_cut.fa >${NAME}_4_cdhit.fa
done
```
Run cd-hit-est on all concatenated libraries. 80% similarity over 80% of length with comparison.
###### tags: `TEs`, `David`