# Protist TEs
Got a bunch of genome assemblies for holozoan protists from Alex Tice. Little has been done to examine these groups for TE content and impact. Will have Angela examine them for TEs. I will parallel her as a way of teaching her the basics.
She's reading the necessary papers and has some basic information.
In D:\Dropbox\Angela Ximena Correa Perdomo\
Grau-Bove.2017.elife.pdf
Hehenberger.et.al.2017.pdf
holozoan.metadata.tsv
project ideas.txt
Zipped the files and have renamed them as below. Work will be in /lustre/scratch/daray/protistan_holozoa
Abeoforma_whisleri.fa.gz = aWhi.fa.gz
Capsaspora_owczarzaki.fa.gz = cOwc.fa.gz
Chromosphaera_perkinsii.fa.gz = cPer.fa.gz
Corallochytrium_limacisporum.fa.gz = cLim.fa.gz
Creolimax_fragrantissima.fa.gz = cFra.fa.gz
Ichthyophonus_hoferi.fa.gz = iHof.fa.gz
Ministeria_vibrans.fa.gz = mVib.fa.gz
Monosiga_brevicollis.fa.gz = mBre.fa.gz
Pigoraptor_chileana.fa.gz = pChi.fa.gz
Pigoraptor_vietnamica.fa.gz = pVie.fa.gz
Pirum_gemmata.fa.gz = pGem.fa.gz
Salpingoeca_rosetta.fa.gz = sRos.fa.gz
Sphaeroforma_arctica.fa.gz = sArc.fa.gz
LIST="aWhi cOwc cPer cLim cFra iHof mVib mBre pChi pVie pGem sRos sArc"
edit curation_templates/rmodel_template.sh to:
```
#!/bin/bash
#SBATCH --job-name=<NAME>_rmod
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=64
. ~/conda/etc/profile.d/conda.sh
conda activate
TAXON=<NAME>
cd /lustre/scratch/daray/protistan_holozoa/te_curations
python ../curation_templates/rmodel.py -g /lustre/scratch/daray/protistan_holozoa/assemblies/${TAXON}.fa.gz -p 64 -b 50 -w /lustre/scratch/daray/protistan_holozoa/te_curations/${TAXON}/
```
Do the following to set up.
```
cd /lustre/scratch/daray/protistan_holozoa
LIST="aWhi cOwc cPer cLim cFra iHof mVib mBre pChi pVie pGem sRos sArc"
for NAME in $LIST; do
mkdir -p te_curations/$NAME/te-aid
done
cp -r /lustre/scratch/daray/curation_templates/ .
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/rmodel_template.sh >te_curations/${NAME}_rmodel.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/RM_template_mod.sh >te_curations/${NAME}_RM.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/extend_align_template.sh >te_curations/${NAME}_extend_align.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/repeatclassifier_template.sh >te_curations/${NAME}_repeatclassifier.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/rm2bed_template_mod.sh >te_curations/${NAME}_rm2bed.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/TEcurate_template.sh >te_curations/${NAME}_TEcurate.sh; done
```
Angela will need to update her files to have her username instead of mine.
```
cd /lustre/scratch/daray/protistan_holozoa/te_curations
LIST="aWhi cOwc cPer cLim cFra iHof mVib mBre pChi pVie pGem sRos sArc"
for NAME in $LIST; do sed -i "s/daray/acorreap/g" ${NAME}_rmodel.sh; done
for NAME in $LIST; do sed -i "s/daray/acorreap/g" ${NAME}_RM.sh; done
for NAME in $LIST; do sed -i "s/daray/acorreap/g" ${NAME}_extend_align.sh; done
for NAME in $LIST; do sed -i "s/daray/acorreap/g" ${NAME}_repeatclassifier.sh; done
for NAME in $LIST; do sed -i "s/daray/acorreap/g" ${NAME}_rm2bed.sh; done
for NAME in $LIST; do sed -i "s/daray/acorreap/g" ${NAME}_TEcurate.sh; done
```
Start RepeatModeler
```
cd te_curations
for NAME in $LIST; do sbatch ${NAME}_rmodel.sh; done
```
Four more assemblies have been added. Need to add those as well.
pUst, hCla, nSp, nDam
```
LIST="pUst hCla nSp nDam"
cd /lustre/scratch/daray/protistan_holozoa
for NAME in $LIST; do
mkdir -p te_curations/$NAME/te-aid
done
cp -r /lustre/scratch/daray/curation_templates/ .
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/rmodel_template.sh >te_curations/${NAME}_rmodel.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/RM_template_mod.sh >te_curations/${NAME}_RM.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/extend_align_template.sh >te_curations/${NAME}_extend_align.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/repeatclassifier_template.sh >te_curations/${NAME}_repeatclassifier.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/rm2bed_template_mod.sh >te_curations/${NAME}_rm2bed.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/TEcurate_template.sh >te_curations/${NAME}_TEcurate.sh; done
cd te_curations
for NAME in $LIST; do sbatch ${NAME}_rmodel.sh; done
```
Updated list: LIST="aWhi cOwc cPer cLim cFra hCla iHof mVib mBre nSp nDam pChi pVie pGem pUst sRos sArc"
```
LIST="aWhi cOwc cPer cLim cFra hCla iHof mVib mBre nSp nDam pChi pVie pGem pUst sRos sArc"
```
Track progress using squeue and when all are done, run the extension script. This script runs RAM and generates the putative TE consensus sequences.
** Angela will need to set up her conda environment to match mine. **
Setting up extend_env conda environment:
```
cd ~
cp /home/daray/extend_env.yml .
conda env create -f extend_env.yml
```
Now, the extend_align script should work.
```
cd /lustre/scratch/daray/protistan_holozoa/te_curations
LIST="aWhi cOwc cPer cLim cFra hCla iHof mVib mBre nSp nDam pChi pVie pGem pUst sRos sArc"
for NAME in $LIST; do sbatch ${NAME}_extend_align.sh; done
```
Paste squeue output here.
I forgot to. All finished.
While the extend_align scripts are running, queue the next steps to run after they're completed.
```
cd /lustre/scratch/daray/protistan_holozoa/te_curations
LIST="aWhi cOwc cPer cLim cFra hCla iHof mVib mBre nSp nDam pChi pVie pGem pUst sRos sArc"
for NAME in $LIST; do sbatch ${NAME}_repeatclassifier.sh; done
```
12218469 nocona 654 PD daray aWhi_clas 0:00 1 1 normal (Priority)
12218470 nocona 654 PD daray cOwc_clas 0:00 1 1 normal (Priority)
12218471 nocona 654 PD daray cPer_clas 0:00 1 1 normal (Priority)
12218472 nocona 654 PD daray cLim_clas 0:00 1 1 normal (Priority)
12218473 nocona 654 PD daray cFra_clas 0:00 1 1 normal (Priority)
12218474 nocona 654 PD daray hCla_clas 0:00 1 1 normal (Priority)
12218475 nocona 654 PD daray iHof_clas 0:00 1 1 normal (Priority)
12218476 nocona 654 PD daray mVib_clas 0:00 1 1 normal (Priority)
12218477 nocona 654 PD daray mBre_clas 0:00 1 1 normal (Priority)
12218478 nocona 654 PD daray nSp_class 0:00 1 1 normal (Priority)
12218479 nocona 654 PD daray nDam_clas 0:00 1 1 normal (Priority)
12218480 nocona 654 PD daray pChi_clas 0:00 1 1 normal (Priority)
12218481 nocona 654 PD daray pVie_clas 0:00 1 1 normal (Priority)
12218482 nocona 654 PD daray pGem_clas 0:00 1 1 normal (Priority)
12218483 nocona 654 PD daray pUst_clas 0:00 1 1 normal (Priority)
12218484 nocona 654 PD daray sRos_clas 0:00 1 1 normal (Priority)
12218485 nocona 654 PD daray sArc_clas 0:00 1 1 normal (Priority)
While the repeatclassifier scripts are running, queue the next steps to run after they're completed.
```
sbatch --dependency=afterok:12218469 aWhi_TEcurate.sh
sbatch --dependency=afterok:12218470 cOwc_TEcurate.sh
sbatch --dependency=afterok:12218471 cPer_TEcurate.sh
sbatch --dependency=afterok:12218472 cLim_TEcurate.sh
sbatch --dependency=afterok:12218473 cFra_TEcurate.sh
sbatch --dependency=afterok:12218474 hCla_TEcurate.sh
sbatch --dependency=afterok:12218475 iHof_TEcurate.sh
sbatch --dependency=afterok:12218476 mVib_TEcurate.sh
sbatch --dependency=afterok:12218477 mBre_TEcurate.sh
sbatch --dependency=afterok:12218478 nSp_TEcurate.sh
sbatch --dependency=afterok:12218479 nDam_TEcurate.sh
sbatch --dependency=afterok:12218480 pChi_TEcurate.sh
sbatch --dependency=afterok:12218481 pVie_TEcurate.sh
sbatch --dependency=afterok:12218482 pGem_TEcurate.sh
sbatch --dependency=afterok:12218483 pUst_TEcurate.sh
sbatch --dependency=afterok:12218484 sRos_TEcurate.sh
sbatch --dependency=afterok:12218485 sArc_TEcurate.sh
```
12218528 nocona 633 PD daray aWhi_TEcu 0:00 1 1 normal (Dependency)
12218529 nocona 633 PD daray cOwc_TEcu 0:00 1 1 normal (Dependency)
12218530 nocona 633 PD daray cPer_TEcu 0:00 1 1 normal (Dependency)
12218531 nocona 633 PD daray cLim_TEcu 0:00 1 1 normal (Dependency)
12218532 nocona 633 PD daray cFra_TEcu 0:00 1 1 normal (Dependency)
12218533 nocona 633 PD daray hCla_TEcu 0:00 1 1 normal (Dependency)
12218534 nocona 633 PD daray iHof_TEcu 0:00 1 1 normal (Dependency)
12218535 nocona 633 PD daray mVib_TEcu 0:00 1 1 normal (Dependency)
12218536 nocona 633 PD daray mBre_TEcu 0:00 1 1 normal (Dependency)
12218537 nocona 633 PD daray nSp_TEcur 0:00 1 1 normal (Dependency)
12218538 nocona 633 PD daray nDam_TEcu 0:00 1 1 normal (Dependency)
12218539 nocona 633 PD daray pChi_TEcu 0:00 1 1 normal (Dependency)
12218540 nocona 633 PD daray pVie_TEcu 0:00 1 1 normal (Dependency)
12218541 nocona 633 PD daray pGem_TEcu 0:00 1 1 normal (Dependency)
12218542 nocona 633 PD daray pUst_TEcu 0:00 1 1 normal (Dependency)
12218543 nocona 633 PD daray sArc_TEcu 0:00 1 1 normal (Dependency)
12218544 nocona 633 PD daray sRos_TEcu 0:00 1 1 normal (Dependency)
### Oct 2, 2023
Met with Alex and he provided another set of assemblies. n=15. This time they're much more closely related and that could make this a bit more manageable.
All are from Heterolobosea and most are free-living human parasites.
14 are from genus Naeglaria and one from Acasia
From the metadata file provide:
Taxon Lifstyle Habitat Life Cycle States Present Accession
Neovahlkampfia damarascottae CCAP 1588/7 freeliving freshwater amoeba/cyst GCA_016618085.1 *illumina only
Heteramoeba clara N/A freeliving brackish water amoeba/cyst/flagellate N/A Not public
Paravahlkampfia ustiana N/A freeliving freshwater amoeba/cyst N/A Not public
Acrasis kona ATCC MYA-3509 freeliving dead plant tissue amoeba/cyst/sporocarp **exhibits aggregative multicellularity GCA_026419775.1
Naegleria gruberi NEG-M freeliving freshwater amoeba/cyst/flagellate GCA_000004985.1
Naegleria fowleri ATCC 30863 freeliving/opportunistic human pathogen freshwater amoeba/cyst/flagellate GCA_000499105.1
Naegleria fowleri NF_KFSHRC_1 freeliving/opportunistic human pathogen freshwater amoeba/cyst/flagellate GCA_902703645.1
Naegleria fowleri NF1 freeliving/opportunistic human pathogen freshwater amoeba/cyst/flagellate GCA_027563095.1
Naegleria fowleri AR12 freeliving/opportunistic human pathogen freshwater amoeba/cyst/flagellate GCA_027563075.1
Naegleria fowleri PA34 freeliving/opportunistic human pathogen freshwater amoeba/cyst/flagellate GCA_027563055.1
Naegleria fowleri kurume freeliving/opportunistic human pathogen freshwater amoeba/cyst/flagellate GCA_020891285.1
Naegleria fowleri Ty freeliving/opportunistic human pathogen freshwater amoeba/cyst/flagellate GCA_014843625.1
Naegleria fowleri ATCC 30894 freeliving/opportunistic human pathogen freshwater amoeba/cyst/flagellate GCA_008403515.1
Naegleria lovaniensis Lova6 freeliving freshwater amoeba/cyst/flagellate GCA_027563035.1
Naegleria lovaniensis Lova7 freeliving freshwater amoeba/cyst/flagellate GCA_027563015.1
Naegleria lovaniensis F9 freeliving freshwater amoeba/cyst/flagellate GCA_027562995.1
Naegleria lovaniensis NL_76_15_250 freeliving freshwater amoeba/cyst/flagellate GCA_022530875.1
Naegleria lovaniensis ATCC 30569 freeliving freshwater amoeba/cyst/flagellate GCA_003324165.2
new list
LIST="aKon nFow_AR12 nFow_ATCC30894 nFow_kurume nFow_NF1 nFow_NF_KFSHRC nFow_PA34 nFow_Ty nGru_NEGM nFow_ATCC30863 nLov_ATCC30569 nLov_F9 nLov_Lova6 nLov_Lova7 nLov_NL_76_15_250"
The plan is to concentrate on these for the moment because of the, presumed, reduced diversity in TE content. Will make it easier on Angela.
```
cd /lustre/scratch/daray/protistan_holozoa/
LIST="aKon nFow_AR12 nFow_ATCC30894 nFow_kurume nFow_NF1 nFow_NF_KFSHRC nFow_PA34 nFow_Ty nGru_NEGM nFow_ATCC30863 nLov_ATCC30569 nLov_F9 nLov_Lova6 nLov_Lova7 nLov_NL_76_15_250"
for NAME in $LIST; do
mkdir -p te_curations/$NAME/te-aid
done
cp -r /lustre/scratch/daray/curation_templates/ .
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/rmodel_template.sh >te_curations/${NAME}_rmodel.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/RM_template_mod.sh >te_curations/${NAME}_RM.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/extend_align_template.sh >te_curations/${NAME}_extend_align.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/repeatclassifier_template.sh >te_curations/${NAME}_repeatclassifier.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/rm2bed_template_mod.sh >te_curations/${NAME}_rm2bed.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/TEcurate_template.sh >te_curations/${NAME}_TEcurate.sh; done
cd te_curations
for NAME in $LIST; do sbatch ${NAME}_rmodel.sh; done
```
12249694 nocona 1078 PD daray aKon_rmod 0:00 1 64 normal (Priority)
12249695 nocona 1078 PD daray nFow_AR12 0:00 1 64 normal (Priority)
12249696 nocona 1078 PD daray nFow_ATCC 0:00 1 64 normal (Priority)
12249697 nocona 1078 PD daray nFow_kuru 0:00 1 64 normal (Priority)
12249698 nocona 1078 PD daray nFow_NF1_ 0:00 1 64 normal (Priority)
12249699 nocona 1078 PD daray nFow_NF_K 0:00 1 64 normal (Priority)
12249700 nocona 1078 PD daray nFow_PA34 0:00 1 64 normal (Priority)
12249701 nocona 1078 PD daray nFow_Ty_r 0:00 1 64 normal (Priority)
12249702 nocona 1078 PD daray nGru_NEGM 0:00 1 64 normal (Priority)
12249703 nocona 1078 PD daray nFow_ATCC 0:00 1 64 normal (Priority)
12249704 nocona 1078 PD daray nLov_ATCC 0:00 1 64 normal (Priority)
12249705 nocona 1078 PD daray nLov_F9_r 0:00 1 64 normal (Priority)
12249706 nocona 1078 PD daray nLov_Lova 0:00 1 64 normal (Priority)
12249707 nocona 1078 PD daray nLov_Lova 0:00 1 64 normal (Priority)
12249708 nocona 1078 PD daray nLov_NL_7 0:00 1 64 normal (Priority)
## Thought about it and decided that 10 copies is way too few for these organisms. Reset MINCOPY variable in the TEcurate_template file to MINCOPY=2.
Now need to re-run all of my curations for the species that are already complete.
```
cd /lustre/scratch/daray/protistan_holozoa/te_curations
LIST="aWhi cOwc cPer cLim cFra hCla iHof mVib mBre nSp nDam pChi pVie pGem pUst sRos sArc"
cd /lustre/scratch/daray/protistan_holozoa/
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/TEcurate_template.sh >te_curations/${NAME}_TEcurate.sh; done
cd /lustre/scratch/daray/protistan_holozoa/te_curations
for NAME in $LIST; do rm -rf te-aid/*/; sbatch ${NAME}_TEcurate.sh; done
```
Also need to set up my curation scripts for the new genome assemblies
```
cd /lustre/scratch/daray/protistan_holozoa/
LIST="aKon nFow_AR12 nFow_ATCC30894 nFow_kurume nFow_NF1 nFow_NF_KFSHRC nFow_PA34 nFow_Ty nGru_NEGM nFow_ATCC30863 nLov_ATCC30569 nLov_F9 nLov_Lova6 nLov_Lova7 nLov_NL_76_15_250"
cd /lustre/scratch/daray/protistan_holozoa/
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/TEcurate_template.sh >te_curations/${NAME}_TEcurate.sh; done
```
Those will actually be run later, after all the preparation steps are complete.
Ran into trouble with permissions. Copied everything to a new folder "protists". Make appropriate changes to paths for downstream work.
```
cd /lustre/scratch/daray/protists/te_curations
LIST="aWhi cOwc cPer cLim cFra hCla iHof mVib mBre nSp nDam pChi pVie pGem pUst sRos sArc"
cd /lustre/scratch/daray/protists/
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/TEcurate_template.sh >te_curations/${NAME}_TEcurate.sh; done
cd /lustre/scratch/daray/protists/te_curations
for NAME in $LIST; do rm -rf ${NAME}/te-aid/*/; done
for NAME in $LIST; do sbatch ${NAME}_TEcurate.sh; done
```
Also need to set up my curation scripts for the new genome assemblies
```
cd /lustre/scratch/daray/protists/
LIST="aKon nFow_AR12 nFow_ATCC30894 nFow_kurume nFow_NF1 nFow_NF_KFSHRC nFow_PA34 nFow_Ty nGru_NEGM nFow_ATCC30863 nLov_ATCC30569 nLov_F9 nLov_Lova6 nLov_Lova7 nLov_NL_76_15_250"
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/rmodel_template.sh >te_curations/${NAME}_rmodel.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/RM_template_mod.sh >te_curations/${NAME}_RM.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/extend_align_template.sh >te_curations/${NAME}_extend_align.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/repeatclassifier_template.sh >te_curations/${NAME}_repeatclassifier.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/rm2bed_template_mod.sh >te_curations/${NAME}_rm2bed.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" curation_templates/TEcurate_template.sh >te_curations/${NAME}_TEcurate.sh; done
```
RepeatModeler was already run before copying all the files to the new directory. Start with extend align.
```
cd /lustre/scratch/daray/protists/te_curations
LIST="aKon nFow_AR12 nFow_ATCC30894 nFow_kurume nFow_NF1 nFow_NF_KFSHRC nFow_PA34 nFow_Ty nGru_NEGM nFow_ATCC30863 nLov_ATCC30569 nLov_F9 nLov_Lova6 nLov_Lova7 nLov_NL_76_15_250"
for NAME in $LIST; do sbatch ${NAME}_extend_align.sh; done
```
12251717 nocona nLov_NL_ daray PD 0:00 1 (Priority)
12251716 nocona nLov_Lov daray PD 0:00 1 (Priority)
12251715 nocona nLov_Lov daray PD 0:00 1 (Priority)
12251714 nocona nLov_F9_ daray PD 0:00 1 (Priority)
12251713 nocona nLov_ATC daray PD 0:00 1 (Priority)
12251712 nocona nFow_ATC daray PD 0:00 1 (Priority)
12251711 nocona nGru_NEG daray PD 0:00 1 (Priority)
12251710 nocona nFow_Ty_ daray PD 0:00 1 (Priority)
12251709 nocona nFow_PA3 daray PD 0:00 1 (Priority)
12251708 nocona nFow_NF_ daray PD 0:00 1 (Priority)
12251707 nocona nFow_NF1 daray PD 0:00 1 (Priority)
12251706 nocona nFow_kur daray PD 0:00 1 (Priority)
12251705 nocona nFow_ATC daray PD 0:00 1 (Priority)
12251704 nocona nFow_AR1 daray PD 0:00 1 (Priority)
12251703 nocona aKon_ext daray PD 0:00 1 (Priority)
```
sbatch --dependency=afterok:12251703 aKon_repeatclassifier.sh
sbatch --dependency=afterok:12251704 nFow_AR12_repeatclassifier.sh
sbatch --dependency=afterok:12251705 nFow_ATCC30894_repeatclassifier.sh
sbatch --dependency=afterok:12251706 nFow_kurume_repeatclassifier.sh
sbatch --dependency=afterok:12251707 nFow_NF1_repeatclassifier.sh
sbatch --dependency=afterok:12251708 nFow_NF_KFSHRC_repeatclassifier.sh
sbatch --dependency=afterok:12251709 nFow_PA34_repeatclassifier.sh
sbatch --dependency=afterok:12251710 nFow_Ty_repeatclassifier.sh
sbatch --dependency=afterok:12251711 nGru_NEGM_repeatclassifier.sh
sbatch --dependency=afterok:12251712 nFow_ATCC30863_repeatclassifier.sh
sbatch --dependency=afterok:12251713 nLov_ATCC30569_repeatclassifier.sh
sbatch --dependency=afterok:12251714 nLov_F9_repeatclassifier.sh
sbatch --dependency=afterok:12251715 nLov_Lova6_repeatclassifier.sh
sbatch --dependency=afterok:12251716 nLov_Lova7_repeatclassifier.sh
sbatch --dependency=afterok:12251717 nLov_NL_76_15_250_repeatclassifier.sh
```
12251746 nocona nLov_NL_ daray PD 0:00 1 (Dependency)
12251745 nocona nLov_Lov daray PD 0:00 1 (Dependency)
12251744 nocona nLov_Lov daray PD 0:00 1 (Dependency)
12251743 nocona nLov_F9_ daray PD 0:00 1 (Dependency)
12251742 nocona nLov_ATC daray PD 0:00 1 (Dependency)
12251741 nocona nFow_ATC daray PD 0:00 1 (Dependency)
12251740 nocona nGru_NEG daray PD 0:00 1 (Dependency)
12251739 nocona nFow_Ty_ daray PD 0:00 1 (Dependency)
12251738 nocona nFow_PA3 daray PD 0:00 1 (Dependency)
12251737 nocona nFow_NF_ daray PD 0:00 1 (Dependency)
12251736 nocona nFow_NF1 daray PD 0:00 1 (Dependency)
12251735 nocona nFow_kur daray PD 0:00 1 (Dependency)
12251734 nocona nFow_ATC daray PD 0:00 1 (Dependency)
12251733 nocona nFow_AR1 daray PD 0:00 1 (Dependency)
12251732 nocona aKon_cla daray PD 0:00 1 (Dependency)
LIST="aKon nFow_AR12 nFow_ATCC30894 nFow_kurume nFow_NF1 nFow_NF_KFSHRC nFow_PA34 nFow_Ty nGru_NEGM nFow_ATCC30863 nLov_ATCC30569 nLov_F9 nLov_Lova6 nLov_Lova7 nLov_NL_76_15_250"
```
sbatch --dependency=afterok:12251732 aKon_TEcurate.sh
sbatch --dependency=afterok:12251733 nFow_AR12_TEcurate.sh
sbatch --dependency=afterok:12251734 nFow_ATCC30894_TEcurate.sh
sbatch --dependency=afterok:12251735 nFow_kurume_TEcurate.sh
sbatch --dependency=afterok:12251736 nFow_NF1_TEcurate.sh
sbatch --dependency=afterok:12251737 nFow_NF_KFSHRC_TEcurate.sh
sbatch --dependency=afterok:12251738 nFow_PA34_TEcurate.sh
sbatch --dependency=afterok:12251739 nFow_Ty_TEcurate.sh
sbatch --dependency=afterok:12251740 nGru_NEGM_TEcurate.sh
sbatch --dependency=afterok:12251741 nFow_ATCC30863_TEcurate.sh
sbatch --dependency=afterok:12251742 nLov_ATCC30569_TEcurate.sh
sbatch --dependency=afterok:12251743 nLov_F9_TEcurate.sh
sbatch --dependency=afterok:12251744 nLov_Lova6_TEcurate.sh
sbatch --dependency=afterok:12251745 nLov_Lova7_TEcurate.sh
sbatch --dependency=afterok:12251746 nLov_NL_76_15_250_TEcurate.sh
```
Not a lot of stuff is being recognized. Suggested to Angela that she start more targeted searches. Will begin with an LTR search using MegaLTR.
https://bioinformatics.um6p.ma/MegaLTR/documentation
megaltr_template.sh
```
#!/bin/bash
#SBATCH --job-name=<NAME>_mLTR
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=12
. ~/conda/etc/profile.d/conda.sh
conda activate MegaLTR
TAXON=<NAME>
WORKDIR=/lustre/scratch/daray/protists/te_curations/$TAXON
mkdir -p $WORKDIR/megaltr
cd $WORKDIR/megaltr
SOFTWARE=/lustre/work/daray/software
#Run megaltr
bash $SOFTWARE/MegaLTR/MegaLTR.sh -A 1 -t 12 -P ${TAXON}_results -F $WORKDIR/assemblies_dir/${TAXON}.fa
```
```
cp megaltr_template.sh ../curation_templates/
LIST="aKon nFow_AR12 nFow_ATCC30894 nFow_kurume nFow_NF1 nFow_NF_KFSHRC nFow_PA34 nFow_Ty nGru_NEGM nFow_ATCC30863 nLov_ATCC30569 nLov_F9 nLov_Lova6 nLov_Lova7 nLov_NL_76_15_250"
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" megaltr_template.sh >${NAME}_megaltr.sh; done
for NAME in $LIST; do sbatch ${NAME}_megaltr.sh; done
LIST="aWhi cOwc cPer cLim cFra hCla iHof mVib mBre nSp nDam pChi pVie pGem pUst sRos sArc"
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" megaltr_template.sh >${NAME}_megaltr.sh; done
for NAME in $LIST; do sbatch ${NAME}_megaltr.sh; done
```
Forgot that this software requires a tRNA database. I can't find anything for protists so I'm going to make my own using tRNAscan-SE.
trnascan_template.sh
```
#!/bin/bash
#SBATCH --job-name=<NAME>_tRNA
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=1
. ~/conda/etc/profile.d/conda.sh
conda activate trnascan
TAXON=<NAME>
WORKDIR=/lustre/scratch/daray/protists/te_curations/$TAXON
mkdir -p $WORKDIR/trnascan
cd $WORKDIR/trnascan
SOFTWARE=/lustre/work/daray/software
#Run trnascan
tRNAscan-SE $WORKDIR/assemblies_dir/${TAXON}.fa
```
```
LIST="aKon nFow_AR12 nFow_ATCC30894 nFow_kurume nFow_NF1 nFow_NF_KFSHRC nFow_PA34 nFow_Ty nGru_NEGM nFow_ATCC30863 nLov_ATCC30569 nLov_F9 nLov_Lova6 nLov_Lova7 nLov_NL_76_15_250"
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" trnascan_template.sh >${NAME}_trnascan.sh; done
for NAME in $LIST; do sbatch ${NAME}_trnascan.sh; done
64363 LIST="aWhi cOwc cPer cLim cFra hCla iHof mVib mBre nSp nDam pChi pVie pGem pUst sRos sArc"
64364 for NAME in $LIST; do sed "s/<NAME>/$NAME/g" trnascan_template.sh >${NAME}_trnascan.sh; done
for NAME in $LIST; do sbatch ${NAME}_trnascan.sh; done
```
Forgot LOTS of things about MegaLTR. It needs:
1. to be modified to take a custom directory to the tRNA database generated by trnascan-se
2. to be directed to rexdb-metazoa rather than to a plant database by default.
To solve these problems I modified the MegaLTR script in the github cloned directory. New version is below. Specific changes are on lines
1 - commented out original
56 - Added documentation line
91 - Added option "I"
99 - Added option "I"
New script is:
```
#!/bin/bash
##Last update 31-01-2023 by Morad Mokhtar
if [ $# -eq 0 ]
then
echo "Parameters -A and -F is required, use [bash MegaLTR.sh -help] for more detalis"
exit
fi
#tRNAdb= DRay changed this to an option. See options in text below.
LTR_FINDER=/lustre/work/daray/software/MegaLTR/bin/LTR_FINDER_parallel/LTR_FINDER_parallel
LTR_HARVEST=/lustre/work/daray/software/MegaLTR/bin/LTR_HARVEST_parallel/LTR_HARVEST_parallel
LTRretriever=/lustre/work/daray/software/MegaLTR/bin/LTR_retriever/LTR_retriever
chmod 775 $LTRretriever #Give execulting permissions for LTRretriever
RUN=/lustre/work/daray/software/MegaLTR/bin/RUN
chmod 775 $RUN/usearch11.0.667_i86linux32 #Give execulting permissions
eval "$(conda shell.bash hook)"
############################################################
# Set variables #
############################################################
Analysistype=3
Fastafilepath=""
gffpath=""
trna=Arabidopsis_thaliana_trna.fa
outfileprefix=results
minlenltr=100
maxlenltr=7000
mindisltr=1000
maxdisltr=15000
similar=85
matchpairs=20
TEsorterdb=rexdb
TEsortercov=20
TEsortereval=0.001
TEsorterrule=80-80-80
TEsorterhmm=rexdb
RateOfEvolution=0.000000015
up=5000
down=5000
density1=1000000
numberofchromosom=9
threads=4
############################################################
# Help #
############################################################
Help()
{
# Display Help
echo "MegaLTR is a pipeline designed for high-throughput identification and classification of LTR-retrotransposons, insertion age estimation, detection of LTR-gene chimeras, and determination of nearby genes."
echo
echo " MegaLTR v2.0."
echo
echo " Options:"
echo " -h Print this Help."
echo " -A The analysis type [1 or 2 or 3] 1 (for Intact LTR-RT identification and annotation of internal domains 'This analysis needs FASTA file only') 2 (for Intact LTR-RT Identification and annotation of internal domains plus determination of insertion time 'This analysis needs FASTA file only') 3 (for Intact LTR-RT Identification, annotation of internal domains, determination of insertion time, LTR-RT gene-chimera analysis and visualization of gene density and LTR-RTs across chromosomes 'This analysis needs FASTA and GFF files') , default is 3"
echo " -F Your path to Fasta file."
echo " -G Your path to GFF file."
echo " -I Path to fasta file with tRNA sequences."
echo " -T tRNA sequence file (Locate the filename from the tRNA folder or provide your own tRNA sequence in FASTA format, default is Arabidopsis_thaliana_trna.fa)."
echo " -P Outfileprefix, default is results."
echo " -l Min length of 5'&3'LTR, default is 100."
echo " -L Max length of 5'&3'LTR, default is 7000."
echo " -d Min distance between 5'&3'LTR, default is 1000."
echo " -D Max distance between 5'&3'LTR, default is 15000."
echo " -S Specify similaritythreshold, default is 85."
echo " -M Min length of exact match pair, default is 20."
echo " -B TE Database that TEsorter will use it {gydb,rexdb,rexdb-plant,rexdb-metazoa,sine}, default is rexdb."
echo " -C mininum coverage for protein domains in HMMScan output, default is 20."
echo " -V maxinum E-value for protein domains in HMMScan output, default is 0.001."
echo " -Q classifying rule [identity-coverage-length] based on similarity, default is 80-80-80]."
echo " -E hmm-database that TEsorter will use it {gydb,rexdb,rexdb-plant,rexdb-metazoa,sine}, default is rexdb]."
echo " -R neutral mutation rate of the target species (per bp per ya), e.g., rice: 1.3e-8 [0.000000013] (Ma and Bennetzen 2004); mammal: 2.2e-9 [0.0000000022] (S. Kumar 2002); Drosophila: 1.6e-8 [0.000000016](Bowen and McDonald 2001), default is 0.000000013."
echo " -U the distance upstream LTR retrotransposons, default is 5000."
echo " -X the distance downstream LTR retrotransposons, default is 5000."
echo " -W window size to extract gene density from the GFF file, default is 1000000."
echo " -N Number of chromosomes specified in FASTA file to visualize density of genes and LTRs, default is 12."
echo " -v Print MegaLTR version and exit."
echo " -t Indicate how many CPU/threads you want to run MegaLTR, default is 4."
echo
echo "Default parameters:bash MegaLTR.sh -A $Analysistype -F Your_path_to_FASTA_file -G Your_path_to_GFF_file -T $trna -P $outfileprefix -l $minlenltr -L $maxlenltr -d $mindisltr -D $maxdisltr -S $similar -M $matchpairs -B $TEsorterdb -C $TEsortercov -V $TEsortereval -Q $TEsorterrule -E $TEsorterhmm -R $RateOfEvolution -U $up -X $down -W $density1 -N $numberofchromosom -t $threads"
exit
}
version()
{
echo "MegaLTR v1.06."
echo
exit
}
############################################################
# Process the input options. #
############################################################
check=0
while getopts h:A:F:G:T:P:l:L:d:D:S:M:B:C:V:Q:E:R:U:X:W:N:v:t:I: options
do
case $options in
h) Help;;
v) version;;
A) Analysistype=$OPTARG;((check=check+1));;
F) Fastafilepath=$OPTARG;((check=check+1));;
G) gffpath=$OPTARG;;
I) tRNAdb=$OPTARG;;
T) trna=$OPTARG;;
P) outfileprefix=$OPTARG;;
l) minlenltr=$OPTARG;;
L) maxlenltr=$OPTARG;;
d) mindisltr=$OPTARG;;
D) maxdisltr=$OPTARG;;
S) similar=$OPTARG;;
M) matchpairs=$OPTARG;;
B) TEsorterdb=$OPTARG;;
C) TEsortercov=$OPTARG;;
V) TEsortereval=$OPTARG;;
Q) TEsorterrule=$OPTARG;;
E) TEsorterhmm=$OPTARG;;
R) RateOfEvolution=$OPTARG;;
U) up=$OPTARG;;
X) down=$OPTARG;;
W) density1=$OPTARG;;
N) numberofchromosom=$OPTARG;;
t) threads=$OPTARG;;
*) echo "Error: Invalid option" ;;
esac
done
########################################################
division=100
similarFinder=$(perl -e "print $similar/$division") ###convert the similarity to LTR_FINDER format
process_id=$outfileprefix
userpath= mkdir -p $(pwd)/$process_id
userpath=$(pwd)/$process_id
LAI= mkdir -p $userpath/LAI
LAI=$userpath/LAI
ltrdigest= mkdir -p $userpath/LTRdigest
ltrdigest=$userpath/LTRdigest
mkdir -p $userpath/TEsorter
TEsorter=$userpath/TEsorter
mkdir -p $userpath/Others
Others=$userpath/Others
mkdir -p $userpath/Time
time=$userpath/Time
mkdir -p $userpath/R_plots
Rplots=$userpath/R_plots
mkdir -p $userpath/inside_genes
inside_genes=$userpath/inside_genes
mkdir -p $userpath/near_genes
near_genes=$userpath/near_genes
mkdir -p $userpath/density
densitypath=$userpath/density
process_id=$outfileprefix
mkdir -p $userpath/Collected_Files
Collected_Files=$userpath/Collected_Files
mkdir -p $userpath/FASTA
FASTA=$userpath/FASTA
mkdir -p $userpath/LTR_FINDER
FINDER=$userpath/LTR_FINDER
mkdir -p $userpath/LTR_HARVEST
LTRHARVEST=$userpath/LTR_HARVEST
mkdir -p $userpath/LTRFiles
LTRFiles=$userpath/LTRFiles
USERCH= mkdir -p $userpath/USERCH
USERCH=$userpath/USERCH
###############################################################
#### MegaLTR process the start. #
###############################################################
now="$(date)"
echo
printf "\t#############################################\n\t############## MegaLTR v2.0 ##############\n\t#############################################\n\n\tContributors: Morad M Mokhtar, Achraf El Allali\n\n"
printf "\t$now \t Start time %s\n" ### print current date
echo
printf "\tParameters: -A $Analysistype -F $Fastafilepath -G $gffpath -T $trna -I $tRNAdb-P $outfileprefix -l $minlenltr -L $maxlenltr -d $mindisltr -D $maxdisltr -S $similar -M $matchpairs -B $TEsorterdb -C $TEsortercov -V $TEsortereval -Q $TEsorterrule -E $TEsorterhmm -R $RateOfEvolution -U $up -X $down -W $density1 -N $numberofchromosom -t $threads\n\n"
conda activate MegaLTR
if [ $check -ne 2 ]
then
printf "\tParameters -A and -F is required, use [bash MegaLTR.sh -help] for more detalis\n"
exit
fi
conda activate MegaLTR
if [[ $Fastafilepath =~ \.gz$ ]]; then
cd $userpath
echo
printf "\tCheck the FASTA File format.\n"
gunzip -c "$Fastafilepath" >$userpath/$process_id.fna
perl $RUN/checkfasta.pl $userpath/$process_id.fna ### Check the FASTA File format
elif [[ $Fastafilepath =~ \.zip$ ]]; then
echo
printf "\tCheck the FASTA File format.\n"
gunzip -c "$Fastafilepath" >$userpath/$process_id.fna
perl $RUN/checkfasta.pl $userpath/$process_id.fna ### Check the FASTA File format
elif ([ $(stat -c%s "$Fastafilepath") -gt 500 ]); then
printf "\tCheck the FASTA File format.\n"
perl $RUN/checkfasta.pl $Fastafilepath ### Check the FASTA File format
cp $Fastafilepath $userpath/$process_id.fna #### copy fasta file tRNA files
else
printf "\n\tCheck the FASTA File format.\n"
printf "\n\tPlease use -F to provide a valid FASTA file [-F FASTA_file_path].\n"
exit
fi
if ([ $Analysistype -eq 3 ] && [[ $gffpath == "" ]]); then
printf "\n\tPlease use -G to provide a valid GFF file [-G GFF_file_path].\n"
exit
fi
if ([[ $Analysistype -eq 3 ]] && [ $(stat -c%s "$gffpath") -lt 500 ]); then
printf "\n\tPlease use -G to provide a valid GFF file [-G GFF_file_path].\n"
exit
fi
if ([ $Analysistype -eq 3 ] && [[ $gffpath =~ \.gz$ ]]); ### mode 3
then
gunzip -c "$gffpath" >$userpath/$process_id.gff
elif ([ $Analysistype -eq 3 ] && [[ $gffpath =~ \.zip$ ]]); then
gunzip -c "$gffpath" >$userpath/$process_id.gff
elif ([[ $Analysistype -eq 3 ]] && [ $(stat -c%s "$gffpath") -gt 500 ]); then
cp $gffpath $userpath/$process_id.gff #### copy gff file
fi
if ([ $Analysistype -eq 1 ] || [ $Analysistype -eq 2 ] || [ $Analysistype -eq 3 ]); ### mode 2
then
# : <<'END' END
# printf "\t$now \t Start time %s\n" ### print current date
# echo
############## copy and split fasta #############
cp $tRNAdb/$trna $userpath/$trna #### copy tRNA file
sed -i 's/|/_/' $userpath/$process_id.fna ### remove unnecessary details from Fasta sequence headers
sed -i 's/|/ /' $userpath/$process_id.fna ### remove unnecessary details from Fasta sequence headers
sed -i 's/ .*//' $userpath/$process_id.fna ### remove unnecessary details from Fasta sequence headers
sed -i 's/\t.*//g' $userpath/$trna ### remove unnecessary details from tRNA sequence headers
now2="$(date)"
printf "\n\t$now2 \tLTR_FINDER Started %s\n"
################## LTR_Finder & LTR_HARVEST ########### updated 31-01-2023
cd $FASTA
perl $LTR_FINDER -seq $userpath/$process_id.fna -threads $threads -harvest_out -size 1000000 -time 500 $tRNAdb/$trna $trna $maxdisltr $mindisltr $maxlenltr $minlenltr $matchpairs $similarFinder $FASTA $similar $process_id > /dev/null 2>/dev/null
now51="$(date)"
printf "\n\t$now51 \tLTR_HARVEST Started %s\n"
conda config --show envs_dirs >$densitypath/condapath
sed -i '1d' $densitypath/condapath
sed -i 's/ - //g' $densitypath/condapath
for condapath in `less $densitypath/condapath`
do
perl $LTR_HARVEST -seq $userpath/$process_id.fna -threads $threads -size 1000000 -time 500 -gt $condapath/MegaLTR/bin/gt $minlenltr $maxlenltr $similar > /dev/null 2>/dev/null
done
cat $FASTA/$process_id.fna.harvest.combine.scn $FASTA/$process_id.fna.finder.combine.scn >$FASTA/$process_id.all.harvest.finder.combine
now5="$(date)"
echo
printf "\t$now5 \tLTR_FINDER & LTR_HARVEST Done %s\n"
now50="$(date)"
echo
printf "\t$now50 \tLTR_retriever Started %s\n"
#######LTR_retriever ##########
cd $LAI
$LTRretriever -genome $userpath/$process_id.fna -inharvest $FASTA/$process_id.all.harvest.finder.combine -threads $threads -minlen $minlenltr >>$userpath/screen.txt
if [ -f "$LAI/$process_id.fna.pass.list" ]; then
cp $LAI/$process_id.fna.pass.list $Collected_Files
fi
if [ -f "$LAI/$process_id.fna.mod.pass.list" ]; then
cp $LAI/$process_id.fna.mod.pass.list $Collected_Files/$process_id.fna.pass.list
fi
if [ -f "$LAI/$process_id.fna.nmtf.pass.list" ]; then
cp $LAI/$process_id.fna.nmtf.pass.list $Collected_Files
fi
if [ -f "$LAI/$process_id.fna.mod.nmtf.pass.list" ]; then
cp $LAI/$process_id.fna.mod.nmtf.pass.list $Collected_Files/$process_id.fna.nmtf.pass.list
fi
if [ -f "$LAI/$process_id.fna.pass.list.gff3" ]; then
cp $LAI/$process_id.fna.pass.list.gff3 $Collected_Files
fi
if [ -f "$LAI/$process_id.fna.mod.pass.list.gff3" ]; then
cp $LAI/$process_id.fna.mod.pass.list.gff3 $Collected_Files/$process_id.fna.pass.list.gff3
fi
if [ -f "$LAI/$process_id.fna.out" ]; then
cp $LAI/$process_id.fna.out $Collected_Files
fi
if [ -f "$LAI/$process_id.fna.mod.out" ]; then
cp $LAI/$process_id.fna.out $Collected_Files/$process_id.fna.out
fi
now7="$(date)"
echo
printf "\t$now7 \tLTR_retriever Done %s\n"
#######ltrdigest ###########
conda activate MegaLTR
now8="$(date)"
echo
printf "\t$now8 \tLTRdigest Started %s\n"
bash $RUN/ltrdigest.sh $process_id $userpath $trna $LAI $ltrdigest
now9="$(date)"
echo
printf "\t$now9 \tLTRdigest Done %s\n"
#######TEsorter ###########
now10="$(date)"
echo
printf "\t$now10 \tTEsorter Started %s\n"
cd $TEsorter
bash $RUN/TEsorter.sh $TEsorter $ltrdigest/"$process_id"_complete.fas $TEsorterdb $TEsortercov $TEsortereval $TEsorterrule $TEsorterhmm $RUN $process_id $Collected_Files $userpath $threads
now11="$(date)"
echo
printf "\t$now11 \tTEsorter Done %s\n"
now12="$(date)"
echo
printf "\t$now12 \tFiltering TEsorter results Started %s\n"
awk -F '\t' '$2 == "LTR"' $TEsorter/"$process_id"_complete.fas.$TEsorterhmm.cls.tsv >$Others/$process_id.LTR.tsv ##### search inside a specific column
awk -F '\t' '$2 != "LTR"' $TEsorter/"$process_id"_complete.fas.$TEsorterhmm.cls.tsv >$Others/$process_id.others.tsv ##### search inside a specific column remove any other elements (like nested, LINE, ... any thing eles LTR)
perl $RUN/print.pl $ltrdigest/"$process_id"_tabout.csv >$Others/$process_id.tabout.tsv ###### $Others/$"$process_id".tabout.tsv header ########## element id element start element end element length sequence lLTR start lLTR end lLTR length rLTR start rLTR end rLTR length lTSD start lTSD end lTSD motif rTSD start rTSD end rTSD motif PPT start PPT end PPT motif PPT strand PPT offset PBS start PBS end PBS strand tRNA tRNA motif PBS offset tRNA offset PBS/tRNA edist
now13="$(date)"
echo
printf "\t$now13 \tMergeing of LTR_retriever, LTRdigest, and TEsorter results %s\n"
perl $RUN/TEsorter_Digest.pl $Others/$process_id.tabout.tsv $Others/$process_id.LTR.tsv >$Others/2LTR_Table_TEsorter_Digest.tsv
python3 $RUN/classification_NEW_LTR_2.py $Others/2LTR_Table_TEsorter_Digest.tsv $Others/new_LTR_Table_TEsorter_Digest.tsv
awk -F'\t' '{print $2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15"\t"$16"\t"$17"\t"$18"\t"$19"\t"$20"\t"$21"\t"$22"\t"$23"\t"$24"\t"$25"\t"$26"\t"$27"\t"$28"\t"$29"\t"$30"\t"$31"\t"$32"\t"$33"\t"$1"\t"$35"\t"$36"\t"$37"\t"$38}' $Others/new_LTR_Table_TEsorter_Digest.tsv > $Others/LTR_Table_TEsorter_Digest.tsv
cp $Others/LTR_Table_TEsorter_Digest.tsv $Collected_Files ### LTR_retriever, LTRdigest, and TEsorter results in one file
awk -F "\t" '{ print $2"\t"$33"\t"$3"\t"$4"\t"$5 }' $Others/LTR_Table_TEsorter_Digest.tsv > $Others/$process_id.length.ids.forstat
python3 $RUN/super_familly_stat.py $Others/$process_id.length.ids.forstat $Collected_Files/$process_id.statistics.tsv ##LTR superfamily summary
awk -F "\t" '{ print $2"\t"$3"\t"$4"\t"$5"\t"$32"\t"$33"\t"$34 }' $Others/LTR_Table_TEsorter_Digest.tsv > $Others/$process_id.ids.extract_seq
cd $LTRFiles
split -n l/100 $Others/$process_id.ids.extract_seq
python3 $RUN/LTR_Seq_threads.py $userpath/$process_id.fna $LTRFiles $Collected_Files $threads $RUN/extractseq-id-start-end.pl
cp $ltrdigest/"$process_id"_pbs.fas $Collected_Files/$process_id.PBS.Sequence.fa
cp $ltrdigest/"$process_id"_ppt.fas $Collected_Files/$process_id.PPT.Sequence.fa
sed -i '1i LTR-RT id\tPseudomolecules/scaffolds\tLTR-RT start\tLTR-RT end\tLTR-RT length\tlLTR start\tlLTR end\tlLTR length\trLTR start\trLTR end\trLTR length\tlTSD start\tlTSD end\tlTSD sequence\trTSD start\trTSD end\trTSD sequence\tPPT start\tPPT end\tPPT motif\tStrand\tPPT offset\tPBS start\tPBS end\tStrand\ttRNA id\ttRNA motif\tPBS offset\ttRNA offset\tPBS/tRNA\t\tClass\tSuperfamily\tClade\tComplete\tStrand\tDomains' $Collected_Files/LTR_Table_TEsorter_Digest.tsv
######## for LTR.non-redundant.fa #################
cp $Collected_Files/LTR-RT_Sequence.fa $USERCH/LTR-RT_Sequence.fa
$RUN/usearch11.0.667_i86linux32 -sortbylength $USERCH/LTR-RT_Sequence.fa --fastaout $USERCH/LTR-RT_Sequence_sorted.fa --log $USERCH/usearch.log
$RUN/usearch11.0.667_i86linux32 -cluster_fast $USERCH/LTR-RT_Sequence_sorted.fa --id 0.9 --centroids $USERCH/LTR-RTs_non-redundant.fa --uc $USERCH/result.uc -consout $USERCH/LTR-RTs_conses.fa -msaout $USERCH/aligned.fasta --log $USERCH/usearch2.log
rm $USERCH/aligned.*
cp $USERCH/LTR-RTs_non-redundant.fa $Collected_Files/LTR-RTs_non-redundant_library.fasta
now100="$(date)"
printf "\n\n\t$now101 \tNon-redundant LTR library done%s\n\n"
fi
if ([ $Analysistype -eq 2 ] || [ $Analysistype -eq 3 ]) ### mode 2
then
########LTR insertion time ###########
now14="$(date)"
echo
printf "\t$now14 \tCalculation of the LTR-RT insertion time %s\n"
python3 $RUN/get_region.py $userpath/$process_id.fna $Others/LTR_Table_TEsorter_Digest.tsv $time
echo "" > $Others/$process_id.time.txt
echo "LTRNAME K_Kimura K_Ksd K_TajimaNei K_TNsd timeK timeKsd numComparedSites transitions numComparedSites transversions numComparedSites timeTN transitions_numComparedSites transversions_numComparedSites" >>$Others/$process_id.time.txt
for fst in $time/*.fasta
do
clustalw -infile="$fst" >>$Others/$process_id.clustalw.txt ### Alignment of the LTR using clustalw
fname=$(basename $fst ".fasta")
ltrk=$(perl $RUN/estimate_K.pl $time/$fname".aln" $RateOfEvolution) ## estimate LTR-RT insertion time using Kimura and Tajima&Nei methods (this script is part of REannotate program)
echo -e $fname "\t""$ltrk" >> $Others/$process_id.time.txt
done
perl $RUN/TEsorterandtable_time.pl $Others/LTR_Table_TEsorter_Digest.tsv $Others/$process_id.time.txt >$Others/$process_id.Digest_TEsorter_Time.tsv ############# $Others/$process_id.Digest_TEsorter_Time (header) org_name acc element_start element_end element_length strand lTSD_start lTSD_end lLTR_start lLTR_end rLTR_start rLTR_end rTSD_start rTSD_end EDTA_infoo PPT start PPT end PPT motif PPT offset PBS start PBS end tRNA tRNA motif PBS offset tRNA offset PBS/tRNA edist Order Superfamily Clade Complete Strand Domains LTRNAME K_Kimura K_Ksd K_TajimaNei K_TNsd timeK timeKsd numComparedSites transitions numComparedSites transversions numComparedSites timeTN transitions_numComparedSites transversions_numComparedSites
cp $Others/$process_id.Digest_TEsorter_Time.tsv $Collected_Files ### LTR_retriever, LTRdigest, TEsorter, and insertion time results in one file
sed -i '1i LTR-RT id\tPseudomolecules/scaffolds\tLTR-RT start\tLTR-RT end\tLTR-RT length\tlLTR start\tlLTR end\tlLTR length\trLTR start\trLTR end\trLTR length\tlTSD start\tlTSD end\tlTSD sequence\trTSD start\trTSD end\trTSD sequence\tPPT start\tPPT end\tPPT motif\tStrand\tPPT offset\tPBS start\tPBS end\tStrand\ttRNA id\ttRNA motif\tPBS offset\ttRNA offset\tPBS/tRNA\t\tClass\tSuperfamily\tClade\tComplete\tStrand\tDomains\tK_Kimura\tK_Ksd\tK_TajimaNei\tK_TNsd\ttimeK\ttimeKsd\tnumComparedSites\ttransitions\tnumComparedSites\ttransversions\tnumComparedSites\ttimeTN\ttransitions_numComparedSites\ttransversions_numComparedSites' $Collected_Files/$process_id.Digest_TEsorter_Time.tsv
###### preparing the files for R plots #
now15="$(date)"
echo
printf "\t$now15 \tPreparing R plots files %s\n"
awk -F "\t" '{ print $1"\t"$33"\t"$42 }' $Others/$process_id.Digest_TEsorter_Time.tsv > $Rplots/$process_id.time.ids
awk -F "\t" '{ print $2"\t"$33"\t"$3"\t"$4"\t"$5 }' $Others/$process_id.Digest_TEsorter_Time.tsv > $Rplots/$process_id.length.ids
awk -F "\t" '{ print $1"\t""Autonomous and Nonautonomous""\t"$3 }' $Rplots/$process_id.time.ids > $Rplots/$process_id.time.ids2
awk -F "\t" '{ print $1"\t""Autonomous and Nonautonomous""\t"$2"\t"$3"\t"$4"\t"$5 }' $Rplots/$process_id.length.ids >$Rplots/$process_id.length.ids2
Rscript $RUN/chart_plot.r $Rplots/$process_id.length.ids $Rplots/$process_id.time.ids MegaLTR..$process_id $RUN 2>/dev/null ### Both LTR-RT insertion time and LTR-RT length destributions plots based on the LTR-RT superfamilies
Rscript $RUN/chart_plot.r $Rplots/$process_id.length.ids2 $Rplots/$process_id.time.ids2 MegaLTR..$process_id $RUN 2>/dev/null ### Both LTR-RT insertion time and LTR-RT length destributions plots for the whole genome
cp $Rplots/*.png $Collected_Files ## Both LTR-RT insertion time and LTR-RT length destributions plots (.png)
now16="$(date)"
echo
printf "\t$now16 \tLTR-RT insertion time plots done %s\n"
##exit
fi
if [ $Analysistype -eq 3 ] ### mode 3 # ##### TE-gene chimeras #########
then
now17="$(date)"
echo
printf "\t$now17 \tLTR-RT-gene chimeras started %s\n"
grep -P "\tgene\t" $userpath/$process_id.gff > $userpath/$process_id.grep.gene ## retrieve gene start and end from GFF file
grep -P "\tpseudogene\t" $userpath/$process_id.gff > $userpath/$process_id.grep.pseudogene ## retrieve pseudogene start and end from GFF file
cat $userpath/$process_id.grep.gene $userpath/$process_id.grep.pseudogene > $userpath/$process_id.gene_pseudogene.gff ## combine gene and pseudogene in one file
perl $RUN/get-TE-within-gene.pl $Others/$process_id.Digest_TEsorter_Time.tsv $userpath/$process_id.gene_pseudogene.gff $process_id $inside_genes > $inside_genes/$process_id.LTR_inside_genes.table ## determine the LTR-RT located inside the gene start and end
awk -F "\t" '{ print $1 }' $inside_genes/$process_id.LTR_inside_genes.table >$inside_genes/$process_id.LTR_inside_genes.table.ids
awk -F "\t" '{ print $1 }' $Others/$process_id.Digest_TEsorter_Time.tsv >$inside_genes/$process_id.LTR_Table_Digest_TEsorter_Time.ids
grep -F -x -v -f $inside_genes/$process_id.LTR_inside_genes.table.ids $inside_genes/$process_id.LTR_Table_Digest_TEsorter_Time.ids >$inside_genes/LTR_Table_Digest_TEsorter_Time_process ###### print found in file 2 an not found in file 1
grep -f $inside_genes/LTR_Table_Digest_TEsorter_Time_process $Others/$process_id.Digest_TEsorter_Time.tsv > $inside_genes/LTR_Table_Digest_TEsorter_Time_nongene1
perl $RUN/No.pl $inside_genes/LTR_Table_Digest_TEsorter_Time_nongene1 >$inside_genes/LTR_Table_Digest_TEsorter_Time_nongene ## determine which LTR-RT located outside the gene start and end
cat $inside_genes/$process_id.LTR_inside_genes.table $inside_genes/LTR_Table_Digest_TEsorter_Time_nongene >$inside_genes/LTR_Table_Digest_TEsorter_Time_nongene_and_gene.tsv ## combine the LTR-RT located insid and outside the gene start and end in one file
cp $inside_genes/LTR_Table_Digest_TEsorter_Time_nongene_and_gene.tsv $Collected_Files
sed -i '1i LTR-RT id\tPseudomolecules/scaffolds\tLTR-RT start\tLTR-RT end\tLTR-RT length\tlLTR start\tlLTR end\tlLTR length\trLTR start\trLTR end\trLTR length\tlTSD start\tlTSD end\tlTSD sequence\trTSD start\trTSD end\trTSD sequence\tPPT start\tPPT end\tPPT motif\tStrand\tPPT offset\tPBS start\tPBS end\tStrand\ttRNA id\ttRNA motif\tPBS offset\ttRNA offset\tPBS/tRNA\tClass\tSuperfamily\tClade\tComplete\tStrand\tDomains\tK_Kimura\tK_Ksd\tK_TajimaNei\tK_TNsd\ttimeK\ttimeKsd\tnumComparedSites\ttransitions\tnumComparedSites\ttransversions\tnumComparedSites\ttimeTN\ttransitions_numComparedSites\ttransversions_numComparedSites\tinside gene status\tgene/pseudogene\tGene start\tGene end\tstrand\tgene annotation' $Collected_Files/LTR_Table_Digest_TEsorter_Time_nongene_and_gene.tsv
now18="$(date)"
echo
printf "\t$now18 \tLTR-RT-gene chimeras done %s\n"
#####*** TE near genes***
now19="$(date)"
echo
printf "\t$now19 \tLTR-RT near genes started %s\n"
sed 's/\t?\t/\t+\t/g' $userpath/$process_id.gene_pseudogene.gff >$near_genes/$process_id.gene_pseudogene.gff2
sed 's/\t?\t/\t+\t/g' $Others/$process_id.Digest_TEsorter_Time.tsv >$near_genes/$process_id.Digest_TEsorter_Time2
awk -F '\t' '$36=="+"' $near_genes/$process_id.Digest_TEsorter_Time2 >$near_genes/$process_id.Digest_TEsorter_Time5+
awk -F '\t' '$36=="-"' $near_genes/$process_id.Digest_TEsorter_Time2 >$near_genes/$process_id.Digest_TEsorter_Time5-
perl $RUN/get-TE-near-gene-minuse1k.pl $near_genes/$process_id.Digest_TEsorter_Time5+ $near_genes/$process_id.gene_pseudogene.gff2 $up $down >$near_genes/up_genes+
perl $RUN/get-TE-near-gene-pluse-1k.pl $near_genes/$process_id.Digest_TEsorter_Time5+ $near_genes/$process_id.gene_pseudogene.gff2 $up $down >$near_genes/down_genes+
perl $RUN/get-TE-near-gene-minuse-1k.pl $near_genes/$process_id.Digest_TEsorter_Time5- $near_genes/$process_id.gene_pseudogene.gff2 $up $down >$near_genes/down_genes-
perl $RUN/get-TE-near-gene-pluse+1k.pl $near_genes/$process_id.Digest_TEsorter_Time5- $near_genes/$process_id.gene_pseudogene.gff2 $up $down >$near_genes/up_genes-
cat $near_genes/up_genes+ $near_genes/down_genes+ $near_genes/down_genes- $near_genes/up_genes- >$near_genes/genes_up_and_down_LTR.tsv
cp $near_genes/genes_up_and_down_LTR.tsv $Collected_Files/$process_id.genes_up_and_down_LTR.tsv
sed -i '1i LTR-RT id\tUp/Downstream\tPseudomolecules/scaffolds\tLTR-RT start\tLTR-RT end\tLTR-RT length\tlLTR start\tlLTR end\tlLTR length\trLTR start\trLTR end\trLTR length\tlTSD start\tlTSD end\tlTSD sequence\trTSD start\trTSD end\trTSD sequence\tPPT start\tPPT end\tPPT motif\tStrand\tPPT offset\tPBS start\tPBS end\tStrand\ttRNA id\ttRNA motif\tPBS offset\ttRNA offset\tPBS/tRNA\t\tClass\tSuperfamily\tClade\tComplete\tStrand\tDomains\tK_Kimura\tK_Ksd\tK_TajimaNei\tK_TNsd\ttimeK\ttimeKsd\tnumComparedSites\ttransitions\tnumComparedSites\ttransversions\tnumComparedSites\ttimeTN\ttransitions_numComparedSites\ttransversions_numComparedSites\tgene/pseudogene\tGene start\tGene end\tgene length\tstrand\tgene annotation' $Collected_Files/$process_id.genes_up_and_down_LTR.tsv
now20="$(date)"
echo
printf "\t$now20 \tLTR-RT near genes done %s\n"
################### For chromosome LTR-RT distribution ###########
now20="$(date)"
echo
printf "\t$now20 \tVisualization of gene density and LTR-RTs across chromosomes %s\n"
awk -F "\t" '{ print $33"\t"$33"\t"$2"\t"$3"\t"$4"\t"$33 }' $Others/$process_id.Digest_TEsorter_Time.tsv > $densitypath/$process_id.figure.distrbution1
sed -i -E 's/Autonomous://g' $densitypath/$process_id.figure.distrbution1
sed -i -E 's/Nonautonomous://g' $densitypath/$process_id.figure.distrbution1
awk -F '\t' 'BEGIN { OFS=FS } { gsub("Copia", "triangle", $2); print }' $densitypath/$process_id.figure.distrbution1 > $densitypath/$process_id.figure.distrbution2
awk -F '\t' 'BEGIN { OFS=FS } { gsub("Copia", "6a3d9a", $6); print }' $densitypath/$process_id.figure.distrbution2 > $densitypath/$process_id.figure.distrbution3
awk -F '\t' 'BEGIN { OFS=FS } { gsub("Gypsy", "triangle", $2); print }' $densitypath/$process_id.figure.distrbution3 > $densitypath/$process_id.figure.distrbution4
awk -F '\t' 'BEGIN { OFS=FS } { gsub("Gypsy", "33a02c", $6); print }' $densitypath/$process_id.figure.distrbution4 > $densitypath/$process_id.figure.distrbution5
awk -F '\t' 'BEGIN { OFS=FS } { gsub("BARE-2", "triangle", $2); print }' $densitypath/$process_id.figure.distrbution5 > $densitypath/$process_id.figure.distrbution6
awk -F '\t' 'BEGIN { OFS=FS } { gsub("BARE-2", "ff7f00", $6); print }' $densitypath/$process_id.figure.distrbution6 > $densitypath/$process_id.figure.distrbution7
awk -F '\t' 'BEGIN { OFS=FS } { gsub("TR-GAG", "triangle", $2); print }' $densitypath/$process_id.figure.distrbution7 > $densitypath/$process_id.figure.distrbution8
awk -F '\t' 'BEGIN { OFS=FS } { gsub("TR-GAG", "ff0080", $6); print }' $densitypath/$process_id.figure.distrbution8 > $densitypath/$process_id.figure.distrbution9
awk -F '\t' 'BEGIN { OFS=FS } { gsub("Unknown", "triangle", $2); print }' $densitypath/$process_id.figure.distrbution9 > $densitypath/$process_id.figure.distrbution10
awk -F '\t' 'BEGIN { OFS=FS } { gsub("Unknown", "9d8477", $6); print }' $densitypath/$process_id.figure.distrbution10 > $densitypath/$process_id.figure.distrbution11
grep "##sequence-region" $userpath/$process_id.gff > $densitypath/"$process_id"_karyotype
sed -i 's/##sequence-region //g' $densitypath/"$process_id"_karyotype
sed -i 's/ /\t/g' $densitypath/"$process_id"_karyotype
awk -F "\t" '{print $1"\t"$4"\t"$5"\t""1"}' $userpath/$process_id.gene_pseudogene.gff >$densitypath/gene_anno
python3 $RUN/counter.py $densitypath/gene_anno $density1 $densitypath/gene_anno_counter
ulimit -i >$densitypath/limts
for limt in `less $densitypath/limts`
do
ulimit -s $limt
cd $densitypath
if [ $numberofchromosom -le 8 ] ### the numbers of chromosomes less than or equal 8 to justfay the image
then
sort -k3,3nr $densitypath/"$process_id"_karyotype >$densitypath/"$process_id"_karyotype_sort
echo "Chr Start End" >$densitypath/"$process_id"_karyotype_sort_head
head -8 $densitypath/"$process_id"_karyotype_sort >>$densitypath/"$process_id"_karyotype_sort_head
awk -F "\t" '{print $1}' $densitypath/"$process_id"_karyotype_sort_head >$densitypath/"$process_id"_karyotype_sort_head.ids
echo "Chr Start End Value" >$densitypath/gene_anno_counter_ok
grep -f $densitypath/"$process_id"_karyotype_sort_head.ids $densitypath/gene_anno_counter >>$densitypath/gene_anno_counter_ok
echo "Type Shape Chr Start End color" >$densitypath/figure_distrbution5_ok
grep -f $densitypath/"$process_id"_karyotype_sort_head.ids $densitypath/$process_id.figure.distrbution11 >>$densitypath/figure_distrbution5_ok
python3 $RUN/figure_legend.py $densitypath/figure_distrbution5_ok $densitypath/gene_anno_counter_ok $densitypath/"$process_id"_karyotype_sort_head $Collected_Files/"Map of Gene density and LTR-RTs distribution Figure.tsv"
Rscript $RUN/density_width.r $densitypath/"$process_id"_karyotype_sort_head $densitypath/gene_anno_counter_ok $densitypath/figure_distrbution5_ok
cp $densitypath/chromosome.svg $Collected_Files/"Gene density and LTR-RTs distribution.svg"
cp $densitypath/chromosome.png $Collected_Files/"Gene density and LTR-RTs distribution.png"
fi
if [ $numberofchromosom -ge 8 ] ### the numbers of chromosomes more than or equal 8 to justfay the image
then
sort -k3,3nr $densitypath/"$process_id"_karyotype >$densitypath/"$process_id"_karyotype_sort
echo "Chr Start End" >$densitypath/"$process_id"_karyotype_sort_head
head -20 $densitypath/"$process_id"_karyotype_sort >>$densitypath/"$process_id"_karyotype_sort_head
awk -F "\t" '{print $1}' $densitypath/"$process_id"_karyotype_sort_head >$densitypath/"$process_id"_karyotype_sort_head.ids
echo "Chr Start End Value" >$densitypath/gene_anno_counter_ok
grep -f $densitypath/"$process_id"_karyotype_sort_head.ids $densitypath/gene_anno_counter >>$densitypath/gene_anno_counter_ok
echo "Type Shape Chr Start End color" >$densitypath/figure_distrbution5_ok
grep -f $densitypath/"$process_id"_karyotype_sort_head.ids $densitypath/$process_id.figure.distrbution11 >>$densitypath/figure_distrbution5_ok
python3 $RUN/figure_legend.py $densitypath/figure_distrbution5_ok $densitypath/gene_anno_counter_ok $densitypath/"$process_id"_karyotype_sort_head $Collected_Files/"Map of Gene density and LTR-RTs distribution Figure.tsv"
Rscript $RUN/density.r $densitypath/"$process_id"_karyotype_sort_head $densitypath/gene_anno_counter_ok $densitypath/figure_distrbution5_ok
cp $densitypath/chromosome.svg $Collected_Files/"Gene density and LTR-RTs distribution.svg"
cp $densitypath/chromosome.png $Collected_Files/"Gene density and LTR-RTs distribution.png"
fi
done
fi
rm $userpath/$process_id.fna
rm -rf $FASTA
rm $userpath/$process_id.gff
now21="$(date)"
echo
printf "\t$now21 \tMegaLTR Done, The results saved in ($Collected_Files) %s\n"
```
Also modified the megaltr_template.sh script to accommodate these changes:
```
#!/bin/bash
#SBATCH --job-name=<NAME>_mLTR
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=12
. ~/conda/etc/profile.d/conda.sh
conda activate MegaLTR
TAXON=<NAME>
WORKDIR=/lustre/scratch/daray/protists/te_curations/$TAXON
mkdir -p $WORKDIR/megaltr
cd $WORKDIR/megaltr
SOFTWARE=/lustre/work/daray/software
#Run megaltr
bash $SOFTWARE/MegaLTR/MegaLTR.sh \
-A 1 \
-t 12 \
-P ${TAXON}_results \
-T ${TAXON}.tRNA.fa \
-F $WORKDIR/assemblies_dir/${TAXON}.fa \
-I $WORKDIR/trnascan \
-B rexdb-metazoa \
-E rexdb-metazoa
```
###### tags: `Protist TEs`