# Bat1k de novo TE curation RepeatModeler on each assembly, masked using final mammal library - March, 2021 - see /lustre/scratch/daray/bat1k_TE_analyses/rmasker/<folder>_RM_Ns ``` mkdir /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall_Ns cd /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall_Ns ``` Process RepeatModeler output to eliminate short (<100 bp) consensi. Cat with mammal TE library. ``` mkdir cdhit_analysis find . -type f -name "*_N_consensi.fa" | xargs cp -t cdhit_analysis/ LIST="aJam aPal aSto eFus gSor hCyc hLar hMon nLep nTum pHas rAff rCfl rSed rTri sBil sHon tBra" cd cdhit_analysis SOFTWARE=/lustre/work/daray/software GITPATH=~/gitrepositories/bioinfo_tools for NAME in $LIST; do $SOFTWARE/EMBOSS-6.6.0/emboss/sizeseq -sequences ${NAME}_N_consensi.fa -descending N -outseq ${NAME}_N_consensi_sort.fa; perl $SOFTWARE/removesmalls.pl 100 ${NAME}_N_consensi_sort.fa >${NAME}_N_consensi_100.fa; cat ${NAME}_N_consensi_100.fa final_mammal_library.fa >${NAME}_4_cdhit.fa; done for NAME in $LIST; do cut -f1 -d" " ${NAME}_4_cdhit.fa >${NAME}_4_cdhit_cut.fa done for NAME in $LIST; do sed "s/rnd-/${NAME}-rnd-/g" ${NAME}_4_cdhit_cut.fa >${NAME}_4_cdhit.fa done ``` Run cd-hit-est on all concatenated libraries. 80% similarity over 80% of length with comparison. ``` sbatch cdhit_aS.sh ``` Find clusters existing only in target species. Count number of lines with 'nt' in them. Count number of instances of 'family'. If they're equal, send file to new folder ``` WORKDIR=/lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall-Ns/cdhit_analysis LIST="aJam aPal aSto eFus gSor hCyc hLar hMon nLep nTum pHas rAff rCfl rSed rTri sBil sHon tBra" cd $WORKDIR for NAME in $LIST do mkdir $WORKDIR/${NAME}_novel cd $WORKDIR/${NAME}_novel for FILE in $WORKDIR/$NAME/clusters_80_aS08/${NAME}_lineage/Cluster_*.txt do LINECOUNT=$(grep 'nt' $FILE | wc -l) FAMILYCOUNT=$(grep 'family' $FILE | wc -l) [ $FAMILYCOUNT -eq $LINECOUNT ] && cp $FILE . done done ``` Grab the longest hit from each cluster and append name of putative TE to file. ``` WORKDIR=/lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall-Ns/cdhit_analysis cd $WORKDIR LIST="aJam aPal aSto eFus gSor hCyc hLar hMon nLep nTum pHas rAff rCfl rSed rTri sBil sHon tBra" for NAME in $LIST ; do for FILE in ${NAME}_novel/Cluster_*.txt; do fgrep "*" $FILE | cut -d">" -f2 | cut -d"." -f1 >> ${NAME}_novel.txt; done; done for NAME in $LIST; do python $GITPATH/pull_seqs.py -l ${NAME}_novel.txt -f ${NAME}_4_cdhit.fa -o ${NAME}_novel.fa; done ``` Do I want to cat all the novel files and run a cd-hit analysis? I don't know but I think not. Moved to /lustre/scratch/daray/bat1k_TE_analyses/te-extensions_N Perform two extend align runs. One using the novel TEs querying the unmasked genome and another querying the genome masked with Ns. Note, I began using a text file called list.txt. It's just the same list as before as a text file. ``` WORKDIR=/lustre/scratch/daray/bat1k_TE_analyses/te-extensions_N cd $WORKDIR cp /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall-Ns/cdhit_analysis/*_novel.fa . cat list.txt | while read i; do sed "s/<NAME>/$i/g" template_extend_align.sh >${i}_extend_align.sh; done cat list.txt | while read i; do sbatch ${i}_extend_align.sh; done cat list.txt | while read i; do sed "s/<NAME>/$i/g" template_extend_align_masked.sh >${i}_extend_align_masked.sh; done cat list.txt | while read i; do sbatch ${i}_extend_align_masked.sh; done cat list.txt | while read i; do for NAME in ls ${i}_N/images_and_alignments/possible_SD/*.png; do FILENAME=$(basename $NAME .png); rm ${i}_N/images_and_alignments/likely_TEs/${FILENAME}*; done; done cat list.txt | while read i; do for NAME in ls ${i}_N_masked/images_and_alignments/possible_SD/*.png; do FILENAME=$(basename $NAME .png); rm ${i}_N_masked/images_and_alignments/likely_TEs/${FILENAME}*; done; done ``` Create a table of hits from the masked run using the .out file from the extract_align run ``` cat list.txt | while read i; do grep Hit ${i}.masked.*.out | awk '{print $4, $6}' >>${i}_hit_counts_masked.txt; sort -k2 -rn -o ${i}_hit_counts_masked.txt ${i}_hit_counts_masked.txt; done ``` Get list of TEs to be considered. Only TEs with more than 40 hits. ``` cat list.txt | while read i; do awk '$2 >40' ${i}_hit_counts_masked.txt | awk '{print $1}' >${i}_4_examination_masked.txt; sort -n -o ${i}_4_examination_masked.txt ${i}_4_examination_masked.txt; done ``` Pull all of the relevant files and copy them to a folder for later download ``` cat list.txt | while read i; do mkdir -p ${i}_4_examination/masked; cat ${i}_4_examination_masked.txt | while read j; do find ${i}_N_masked/images_and_alignments/likely_TEs -name $j* -exec cp {} ${i}_4_examination/masked \; ; done; done ``` Create a table of hits from the unmasked run using the .out file from the extract_align run ``` cat list.txt | while read i; do grep Hit ${i}.*.out | awk '{print $4, $6}' >>${i}_hit_counts.txt; sort -k2 -rn -o ${i}_hit_counts.txt ${i}_hit_counts.txt; done ``` Get list of TEs to be considered. Only TEs with more than 40 hits. ``` cat list.txt | while read i; do awk '$2 >40' ${i}_hit_counts.txt | awk '{print $1}' >${i}_4_examination.txt; sort -n -o ${i}_4_examination.txt ${i}_4_examination.txt; done ``` Pull all of the relevant files and copy them to a folder for later download ``` cat list.txt | while read i; do mkdir -p ${i}_4_examination/unmasked; cat ${i}_4_examination.txt | while read j; do find ${i}_N/images_and_alignments -name $j* -exec cp {} ${i}_4_examination/unmasked \; ; done; done ``` Create list of potential TEs to be examined from concatenated rep files for each species ``` cat list.txt | while read i; do cat ${i}_4_examination/unmasked/*rep.fa > ${i}_4_examination/unmasked/${i}_cat_rep.fa ; done cat list.txt | while read i; do cat ${i}_4_examination/masked/*rep.fa > ${i}_4_examination/masked/${i}_cat_rep.fa ; done cat list.txt | while read i; do cd ${i}_4_examination/unmasked; grep ">" ${i}_cat_rep.fa | sed "s/>//g" > ${i}_querylist.txt; cd ../.. ; done ``` Run RepeatMasker on the concatenated list of consensus sequences ``` RMASKER=/lustre/work/daray/software/RepeatMasker-4.1.2-p1 for i in $LIST; do cd ${i}_4_examination/unmasked; $RMASKER/RepeatMasker ${i}_cat_rep.fa -species mammalia -dir . -pa 12; cd ../../; done ``` Create a new .out file for each putative TE for examination ``` cat list.txt | while read i; do cd ${i}_4_examination/unmasked; cat ${i}_querylist.txt | while read j; do grep "$j" ${i}_cat_rep.fa.out > ${j}.out; done; cd ../.. ; done ``` Add the length of each consensus to the filename ``` cat list.txt | while read i; do cd ${i}_4_examination/unmasked; cat ${i}_querylist.txt | while read j; do LEN=$(seqkit stats ${j}_rep.fa | tail -1 | awk '{print $7}' | sed "s/,//g"); mv ${j}_rep.fa ${j}_rep_${LEN}_bp.fa; done; cd ../.. ; done ``` Zip the folder for later download ``` cat list.txt | while read i; do tar -zcf ${i}_4_examination.tgz ${i}_4_examination/; done ``` Manually examined all TE files in the unmasked directories and categoriezed as 'good', 'long', bad', and 'odd'. good - a clear example of a TE with obvious flanking sequences. bad - A likely segmental duplication. Censor (later switched to RMasker) analysis suggests multiple small TEs in the long consensus when rep.fa is submitted. No blast hits to a TE-derived ORF using blastn or blastx. odd - Possible TE but it's hard to say. long - >10,000 bp and shows evidence of multiple TE insertions This strategy is working. See notes in '[TE curation analysis](https://hackmd.io/uYYY39JkQlmypswNRFH8ow)' hackmd page. ## Time to name the TEs All of the above work was done on local computers. Move everything back to HPCC in a new directory /lustre/scratch/daray/bat1k_TE_analyses/te-naming. Here's the plan: 1. Concatenate all of the 'good' rep.fa files for each taxon. 2. Run DeepTE to get one initial analysis. 3. Run TEClass as a second opinion. 4. Compare results to get a consensus opinion. 5. Examine all results for ACTAG to find missed Helitrons. ``` cat list.txt | while read i; do cat ${i}_4_examination/unmasked/good/*_rep.fa >${i}_good_rep.fa; done cat list.txt | while read i; do tar -zcf ${i}_4_examination_curated.tgz ${i}_4_examination; done ``` Move all .tgz files to /lustre/scratch/daray/bat1k_TE_analyses/te-naming. Copy all the 'good' consensus sequences to deepte folder. ``` cat list.txt | while read i; do cp /lustre/scratch/daray/bat1k_TE_analyses/te-naming/${i}_4_examination/unmasked/good/${i}_good_rep.fa /lustre/scratch/daray/bat1k_TE_analyses/deepte; done ``` Run DeepTE ``` cd /lustre/scratch/daray/bat1k_TE_analyses/deepte sbatch deepte_domain.sh cat list.txt | while read i; do cp ${i}_output/opt_DeepTE.fasta ${i}_DeepTE.fa; done ``` Run all DeepTE.fa files through TEClass using online interface. Download the results in two forms --> the library (as teclass.fa) and the table. Save the table as Excel csv (${i}_teclass.csv). Create list of all comparisons between DeepTE and TEClass. ``` cat list.txt | while read i ; do grep ">" ../teclass/${i}_teclass.fa | sed "s/>//g" | sed "s/__/|/g" | sed "s/TEclass result://g"| sed "s/|/ /g"| awk '{print $1 "\t" $2 "\t" $3 "\t" $4}' >> ${i}_names.txt; done ``` Output: ``` aJam-rnd-5_family-1025 ClassI_LTR_Gypsy LINE forward aJam-rnd-5_family-104 unknown LTR reverse aJam-rnd-5_family-1334 ClassI_nLTR_LINE_L1 LTR reverse aJam-rnd-5_family-3077 ClassII_MITE LTR reverse aJam-rnd-5_family-367 ClassI LTR reverse aJam-rnd-5_family-754 ClassI_LTR_ERV LTR forward aJam-rnd-6_family-10581 ClassI_LTR_ERV LTR reverse aJam-rnd-6_family-10967 ClassI_nLTR_LINE_L1 LTR forward aJam-rnd-6_family-15619 ClassII_DNA_TcMar_MITE LTR reverse aJam-rnd-6_family-1713 ClassI_LTR_ERV LTR forward aJam-rnd-6_family-1873 ClassI_LTR_ERV LTR forward aJam-rnd-6_family-2619 ClassI_nLTR_LINE_L1 LINE reverse aJam-rnd-6_family-3796 ClassII_DNA_TcMar_nMITE DNA forward aJam-rnd-6_family-4117 ClassI_LTR_ERV LTR reverse aJam-rnd-6_family-4307 ClassI_nLTR LTR reverse aJam-rnd-6_family-4974 ClassI_LTR_Gypsy LTR reverse aJam-rnd-6_family-5043 ClassI_LTR_ERV LTR forward aJam-rnd-6_family-5739 ClassI_LTR_ERV LTR forward aJam-rnd-6_family-5806 ClassI_LTR_ERV LTR forward aJam-rnd-6_family-615 ClassI_LTR_ERV LTR forward aJam-rnd-6_family-7713 ClassII_DNA_TcMar_nMITE Retro forward ``` Manually examine all the files and check for correspondence. Add column indicating agreement or disagreement. ``` aJam-rnd-5_family-1025 ClassI_LTR_Gypsy LINE forward n aJam-rnd-5_family-104 unknown LTR reverse n aJam-rnd-5_family-1334 ClassI_nLTR_LINE_L1 LTR reverse n aJam-rnd-5_family-3077 ClassII_MITE LTR reverse n aJam-rnd-5_family-367 ClassI LTR reverse n aJam-rnd-5_family-754 ClassI_LTR_ERV LTR forward y aJam-rnd-6_family-10581 ClassI_LTR_ERV LTR reverse y aJam-rnd-6_family-10967 ClassI_nLTR_LINE_L1 LTR forward n aJam-rnd-6_family-15619 ClassII_DNA_TcMar_MITE LTR reverse n aJam-rnd-6_family-1713 ClassI_LTR_ERV LTR forward y aJam-rnd-6_family-1873 ClassI_LTR_ERV LTR forward y aJam-rnd-6_family-2619 ClassI_nLTR_LINE_L1 LINE reverse y aJam-rnd-6_family-3796 ClassII_DNA_TcMar_nMITE DNA forward y aJam-rnd-6_family-4117 ClassI_LTR_ERV LTR reverse y aJam-rnd-6_family-4307 ClassI_nLTR LTR reverse n aJam-rnd-6_family-4974 ClassI_LTR_Gypsy LTR reverse y aJam-rnd-6_family-5043 ClassI_LTR_ERV LTR forward y aJam-rnd-6_family-5739 ClassI_LTR_ERV LTR forward y aJam-rnd-6_family-5806 ClassI_LTR_ERV LTR forward y aJam-rnd-6_family-615 ClassI_LTR_ERV LTR forward y aJam-rnd-6_family-7713 ClassII_DNA_TcMar_nMITE Retro forward n ``` Pull matches and nonmatches for examination. ``` cat list.txt| while read i; do awk '{ if ($5 == "n") { print } }' ${i}_names.tsv | awk '{print $1}' >>${i}_nonmatches.txt; done cat list.txt| while read i; do awk '{ if ($5 == "y") { print } }' ${i}_names.tsv | awk '{print $1}' >>${i}_matches.txt; done ``` Pull sequences from teclass.fa file. ``` cp ~/gitrepositories/bioinfo_tools/pull_seqs.py . ``` Modified to split on __ instead of #. ``` cat list.txt| while read i; do python pull_seqs.py -l ${i}_matches.txt -f ../teclass/${i}_teclass.fa -o ${i}_matches.fa; done cat list.txt| while read i; do python pull_seqs.py -l ${i}_nonmatches.txt -f ../teclass/${i}_teclass.fa -o ${i}_nonmatches.fa; done ``` Also pull MSA_extended.fa files from the extend_align step. ``` mkdir nonmatched_MSA cat list.txt| while read i; do cat ${i}_nonmatches.txt| while read j; do cp /lustre/scratch/daray/bat1k_TE_analyses/te-extensions_N/${i}_4_examination/unmasked/${j}_MSA_extended.fa nonmatched_MSA; done; done ``` Visually examined the nonmatched and evaluated for possible ID. All results saved as _names.tsv in /lustre/scratch/daray/bat1k_TE_analyses/te-naming Final resuls for all TEs save in all_names.tsv. Also in that file is the old name/new name equivalency list. Saved oldname/newname switch file as rename.list.txt. Copy all rep.fa files and concatenate for renaming. ``` cat ../deepte/*rep.fa >all.rep.fa ``` Problem. sBil-1.1 read the same as sBil-1.100. This causes problems with the replacement step. Solution. Add a character to the end of all headers in all.rep.fa using fabox and then try again. Save as all.repx.fa All headers in all.repx.fa now end with 'x'. Added 'x' to all headers in the rename.list.txt file. Renamed and output final library. ``` python ~/gitrepositories/bioinfo_tools/rename_fasta.py -r rename.list.txt -f all.rep.fa -o bat1k_denovo_TE_modnames.fa sed "s|___|/|g" bat1k_denovo_TE_modnames.fa | sed "s/__/#/g" >bat1k_denovo_TE.fa ``` # Four new bat assemblies mNig pHas2 tTri tTri.alt uBil Create new directory and list.txt ``` mkdir /lustre/scratch/daray/bat1k_TE_analyses/newbat1k cd /lustre/scratch/daray/bat1k_TE_analyses/newbat1k nano list.txt ``` add mNig pHas2 tTri tTri.alt uBil RepeatMasker using new library ``` mkdir rmasker cd rmasker/ cp ../list.txt . cp ../../denovo_rmask/template_RM.sh . cp ../../denovo_rmask/template_rm2bed.sh . ``` Modify templates to use the new assemblies, new directory and to mask to Ns. Run RepeatMasker ``` cat list.txt | while read i; do sed "s/<NAME>/$i/g" template_RM.sh >${i}_RM.sh; sed "s/<NAME>/$i/g" template_rm2bed.sh >${i}_rm2bed.sh; done cat list.txt | while read i; do sbatch ${i}_RM.sh; done ``` RepeatModeler on each assembly, masked using final mammal library - March, 2021 - see /lustre/scratch/daray/bat1k_TE_analyses/rmasker/<folder>_RM_Ns ``` mkdir /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall_Ns cd /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall_Ns ``` Process RepeatModeler output to eliminate short (less than 100 bp) consensi. Cat with mammal TE library. ``` mkdir cdhit_analysis find . -type f -name "*_N_consensi.fa" | xargs cp -t cdhit_analysis/ LIST="mNig pHas2 tTri tTri.alt uBil" cd cdhit_analysis SOFTWARE=/lustre/work/daray/software GITPATH=~/gitrepositories/bioinfo_tools for NAME in $LIST; do $SOFTWARE/EMBOSS-6.6.0/emboss/sizeseq -sequences ${NAME}_N_consensi.fa -descending N -outseq ${NAME}_N_consensi_sort.fa; perl $SOFTWARE/removesmalls.pl 100 ${NAME}_N_consensi_sort.fa >${NAME}_N_consensi_100.fa; cat ${NAME}_N_consensi_100.fa final_mammal_library.fa >${NAME}_4_cdhit.fa; done for NAME in $LIST; do cut -f1 -d" " ${NAME}_4_cdhit.fa >${NAME}_4_cdhit_cut.fa done for NAME in $LIST; do sed "s/rnd-/${NAME}-rnd-/g" ${NAME}_4_cdhit_cut.fa >${NAME}_4_cdhit.fa done ``` Run cd-hit-est on all concatenated libraries. 80% similarity over 80% of length with comparison. ###### tags: `TEs`, `David`