Bat1k de novo TE curation

RepeatModeler on each assembly, masked using final mammal library - March, 2021 - see /lustre/scratch/daray/bat1k_TE_analyses/rmasker/<folder>_RM_Ns

mkdir /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall_Ns
cd /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall_Ns

Process RepeatModeler output to eliminate short (<100 bp) consensi. Cat with mammal TE library.

mkdir cdhit_analysis

find . -type f -name "*_N_consensi.fa" | xargs cp -t cdhit_analysis/

LIST="aJam aPal aSto eFus gSor hCyc hLar hMon nLep nTum pHas rAff rCfl rSed rTri sBil sHon tBra"

cd cdhit_analysis

SOFTWARE=/lustre/work/daray/software
GITPATH=~/gitrepositories/bioinfo_tools

for NAME in $LIST;  
 do  $SOFTWARE/EMBOSS-6.6.0/emboss/sizeseq -sequences ${NAME}_N_consensi.fa -descending N -outseq ${NAME}_N_consensi_sort.fa;  
 perl $SOFTWARE/removesmalls.pl 100 ${NAME}_N_consensi_sort.fa >${NAME}_N_consensi_100.fa;    
 cat ${NAME}_N_consensi_100.fa final_mammal_library.fa >${NAME}_4_cdhit.fa;   
done

for NAME in $LIST;  
	do cut  -f1 -d" " ${NAME}_4_cdhit.fa >${NAME}_4_cdhit_cut.fa
done

for NAME in $LIST;  
	do sed "s/rnd-/${NAME}-rnd-/g" ${NAME}_4_cdhit_cut.fa >${NAME}_4_cdhit.fa
done

Run cd-hit-est on all concatenated libraries. 80% similarity over 80% of length with comparison.

sbatch cdhit_aS.sh

Find clusters existing only in target species.
Count number of lines with 'nt' in them.
Count number of instances of 'family'.
If they're equal, send file to new folder

WORKDIR=/lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall-Ns/cdhit_analysis

LIST="aJam aPal aSto eFus gSor hCyc hLar hMon nLep nTum pHas rAff rCfl rSed rTri sBil sHon tBra"

cd $WORKDIR

for NAME in $LIST
 do mkdir $WORKDIR/${NAME}_novel 
 cd $WORKDIR/${NAME}_novel
 for FILE in $WORKDIR/$NAME/clusters_80_aS08/${NAME}_lineage/Cluster_*.txt
  do LINECOUNT=$(grep 'nt' $FILE | wc -l)
  FAMILYCOUNT=$(grep 'family' $FILE | wc -l)
  [ $FAMILYCOUNT -eq $LINECOUNT ] && cp $FILE .
 done
done

Grab the longest hit from each cluster and append name of putative TE to file.

WORKDIR=/lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall-Ns/cdhit_analysis

cd $WORKDIR

LIST="aJam aPal aSto eFus gSor hCyc hLar hMon nLep nTum pHas rAff rCfl rSed rTri sBil sHon tBra"

for NAME in $LIST ; do for FILE in ${NAME}_novel/Cluster_*.txt; do fgrep "*" $FILE | cut -d">" -f2 | cut -d"." -f1 >> ${NAME}_novel.txt; done; done

for NAME in $LIST; do python $GITPATH/pull_seqs.py -l ${NAME}_novel.txt -f ${NAME}_4_cdhit.fa -o ${NAME}_novel.fa; done

Do I want to cat all the novel files and run a cd-hit analysis? I don't know but I think not.

Moved to /lustre/scratch/daray/bat1k_TE_analyses/te-extensions_N
Perform two extend align runs. One using the novel TEs querying the unmasked genome and another querying the genome masked with Ns.
Note, I began using a text file called list.txt. It's just the same list as before as a text file.

WORKDIR=/lustre/scratch/daray/bat1k_TE_analyses/te-extensions_N
cd $WORKDIR
cp /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall-Ns/cdhit_analysis/*_novel.fa .

cat list.txt | while read i; do sed "s/<NAME>/$i/g" template_extend_align.sh >${i}_extend_align.sh; done
cat list.txt | while read i; do sbatch ${i}_extend_align.sh; done

cat list.txt | while read i; do sed "s/<NAME>/$i/g" template_extend_align_masked.sh >${i}_extend_align_masked.sh; done

cat list.txt | while read i; do sbatch ${i}_extend_align_masked.sh; done

cat list.txt | while read i; do for NAME in ls ${i}_N/images_and_alignments/possible_SD/*.png; do FILENAME=$(basename $NAME .png); rm ${i}_N/images_and_alignments/likely_TEs/${FILENAME}*; done; done

cat list.txt | while read i; do for NAME in ls ${i}_N_masked/images_and_alignments/possible_SD/*.png; do FILENAME=$(basename $NAME .png); rm ${i}_N_masked/images_and_alignments/likely_TEs/${FILENAME}*; done; done

Create a table of hits from the masked run using the .out file from the extract_align run

cat list.txt | while read i; do grep Hit ${i}.masked.*.out | awk '{print $4, $6}' >>${i}_hit_counts_masked.txt; sort -k2 -rn -o ${i}_hit_counts_masked.txt ${i}_hit_counts_masked.txt; done

Get list of TEs to be considered. Only TEs with more than 40 hits.

cat list.txt | while read i; do awk '$2 >40' ${i}_hit_counts_masked.txt | awk '{print $1}' >${i}_4_examination_masked.txt; sort -n -o ${i}_4_examination_masked.txt ${i}_4_examination_masked.txt; done

Pull all of the relevant files and copy them to a folder for later download

cat list.txt | while read i; do mkdir -p ${i}_4_examination/masked; cat ${i}_4_examination_masked.txt | while read j; do find ${i}_N_masked/images_and_alignments/likely_TEs -name $j* -exec cp {} ${i}_4_examination/masked \; ; done; done

Create a table of hits from the unmasked run using the .out file from the extract_align run

cat list.txt | while read i; do grep Hit ${i}.*.out | awk '{print $4, $6}' >>${i}_hit_counts.txt; sort -k2 -rn -o ${i}_hit_counts.txt ${i}_hit_counts.txt; done

Get list of TEs to be considered. Only TEs with more than 40 hits.

cat list.txt | while read i; do awk '$2 >40' ${i}_hit_counts.txt | awk '{print $1}' >${i}_4_examination.txt; sort -n -o ${i}_4_examination.txt ${i}_4_examination.txt; done

Pull all of the relevant files and copy them to a folder for later download

cat list.txt | while read i; do mkdir -p ${i}_4_examination/unmasked; cat ${i}_4_examination.txt | while read j; do find ${i}_N/images_and_alignments -name $j* -exec cp {} ${i}_4_examination/unmasked \; ; done; done

Create list of potential TEs to be examined from concatenated rep files for each species

cat list.txt | while read i; do cat ${i}_4_examination/unmasked/*rep.fa > ${i}_4_examination/unmasked/${i}_cat_rep.fa  ; done
cat list.txt | while read i; do cat ${i}_4_examination/masked/*rep.fa > ${i}_4_examination/masked/${i}_cat_rep.fa  ; done
cat list.txt | while read i; do cd ${i}_4_examination/unmasked; grep ">" ${i}_cat_rep.fa | sed "s/>//g" > ${i}_querylist.txt; cd ../.. ; done

Run RepeatMasker on the concatenated list of consensus sequences

RMASKER=/lustre/work/daray/software/RepeatMasker-4.1.2-p1
for i in $LIST; do cd ${i}_4_examination/unmasked; $RMASKER/RepeatMasker ${i}_cat_rep.fa -species mammalia -dir . -pa 12; cd ../../; done

Create a new .out file for each putative TE for examination

cat list.txt | while read i; do cd ${i}_4_examination/unmasked; cat ${i}_querylist.txt | while read j; do grep "$j" ${i}_cat_rep.fa.out > ${j}.out; done; cd ../.. ; done

Add the length of each consensus to the filename

cat list.txt | while read i; do cd ${i}_4_examination/unmasked; cat ${i}_querylist.txt | while read j; do LEN=$(seqkit stats ${j}_rep.fa | tail -1 | awk '{print $7}' | sed "s/,//g"); mv ${j}_rep.fa ${j}_rep_${LEN}_bp.fa; done; cd ../.. ; done

Zip the folder for later download

cat list.txt | while read i; do tar -zcf ${i}_4_examination.tgz ${i}_4_examination/; done

Manually examined all TE files in the unmasked directories and categoriezed as 'good', 'long', bad', and 'odd'.
good - a clear example of a TE with obvious flanking sequences.
bad - A likely segmental duplication. Censor (later switched to RMasker) analysis suggests multiple small TEs in the long consensus when rep.fa is submitted. No blast hits to a TE-derived ORF using blastn or blastx.
odd - Possible TE but it's hard to say.
long - >10,000 bp and shows evidence of multiple TE insertions
This strategy is working. See notes in 'TE curation analysis' hackmd page.

Time to name the TEs

All of the above work was done on local computers. Move everything back to HPCC in a new directory /lustre/scratch/daray/bat1k_TE_analyses/te-naming.

Here's the plan:

Concatenate all of the 'good' rep.fa files for each taxon.
Run DeepTE to get one initial analysis.
Run TEClass as a second opinion.
Compare results to get a consensus opinion.
Examine all results for ACTAG to find missed Helitrons.

cat list.txt | while read i; do cat ${i}_4_examination/unmasked/good/*_rep.fa >${i}_good_rep.fa; done

cat list.txt | while read i; do tar -zcf ${i}_4_examination_curated.tgz ${i}_4_examination; done

Move all .tgz files to /lustre/scratch/daray/bat1k_TE_analyses/te-naming. Copy all the 'good' consensus sequences to deepte folder.

cat list.txt | while read i; do cp /lustre/scratch/daray/bat1k_TE_analyses/te-naming/${i}_4_examination/unmasked/good/${i}_good_rep.fa /lustre/scratch/daray/bat1k_TE_analyses/deepte; done

Run DeepTE

cd /lustre/scratch/daray/bat1k_TE_analyses/deepte
sbatch deepte_domain.sh
cat list.txt | while read i; do cp ${i}_output/opt_DeepTE.fasta ${i}_DeepTE.fa; done

Run all DeepTE.fa files through TEClass using online interface. Download the results in two forms –> the library (as teclass.fa) and the table. Save the table as Excel csv (${i}_teclass.csv).

Create list of all comparisons between DeepTE and TEClass.

cat list.txt | while read i ; do grep ">" ../teclass/${i}_teclass.fa | sed "s/>//g" | sed "s/__/|/g" |  sed "s/TEclass result://g"| sed "s/|/ /g"| awk '{print $1 "\t" $2 "\t" $3 "\t" $4}' >> ${i}_names.txt; done

Output:

aJam-rnd-5_family-1025	ClassI_LTR_Gypsy	LINE	forward
aJam-rnd-5_family-104	unknown	LTR	reverse
aJam-rnd-5_family-1334	ClassI_nLTR_LINE_L1	LTR	reverse
aJam-rnd-5_family-3077	ClassII_MITE	LTR	reverse
aJam-rnd-5_family-367	ClassI	LTR	reverse
aJam-rnd-5_family-754	ClassI_LTR_ERV	LTR	forward
aJam-rnd-6_family-10581	ClassI_LTR_ERV	LTR	reverse
aJam-rnd-6_family-10967	ClassI_nLTR_LINE_L1	LTR	forward
aJam-rnd-6_family-15619	ClassII_DNA_TcMar_MITE	LTR	reverse
aJam-rnd-6_family-1713	ClassI_LTR_ERV	LTR	forward
aJam-rnd-6_family-1873	ClassI_LTR_ERV	LTR	forward
aJam-rnd-6_family-2619	ClassI_nLTR_LINE_L1	LINE	reverse
aJam-rnd-6_family-3796	ClassII_DNA_TcMar_nMITE	DNA	forward
aJam-rnd-6_family-4117	ClassI_LTR_ERV	LTR	reverse
aJam-rnd-6_family-4307	ClassI_nLTR	LTR	reverse
aJam-rnd-6_family-4974	ClassI_LTR_Gypsy	LTR	reverse
aJam-rnd-6_family-5043	ClassI_LTR_ERV	LTR	forward
aJam-rnd-6_family-5739	ClassI_LTR_ERV	LTR	forward
aJam-rnd-6_family-5806	ClassI_LTR_ERV	LTR	forward
aJam-rnd-6_family-615	ClassI_LTR_ERV	LTR	forward
aJam-rnd-6_family-7713	ClassII_DNA_TcMar_nMITE	Retro	forward

Manually examine all the files and check for correspondence. Add column indicating agreement or disagreement.

aJam-rnd-5_family-1025	ClassI_LTR_Gypsy	LINE	forward	n
aJam-rnd-5_family-104	unknown	LTR	reverse	n
aJam-rnd-5_family-1334	ClassI_nLTR_LINE_L1	LTR	reverse	n
aJam-rnd-5_family-3077	ClassII_MITE	LTR	reverse	n
aJam-rnd-5_family-367	ClassI	LTR	reverse	n
aJam-rnd-5_family-754	ClassI_LTR_ERV	LTR	forward	y
aJam-rnd-6_family-10581	ClassI_LTR_ERV	LTR	reverse	y
aJam-rnd-6_family-10967	ClassI_nLTR_LINE_L1	LTR	forward	n
aJam-rnd-6_family-15619	ClassII_DNA_TcMar_MITE	LTR	reverse	n
aJam-rnd-6_family-1713	ClassI_LTR_ERV	LTR	forward	y
aJam-rnd-6_family-1873	ClassI_LTR_ERV	LTR	forward	y
aJam-rnd-6_family-2619	ClassI_nLTR_LINE_L1	LINE	reverse	y
aJam-rnd-6_family-3796	ClassII_DNA_TcMar_nMITE	DNA	forward	y
aJam-rnd-6_family-4117	ClassI_LTR_ERV	LTR	reverse	y
aJam-rnd-6_family-4307	ClassI_nLTR	LTR	reverse	n
aJam-rnd-6_family-4974	ClassI_LTR_Gypsy	LTR	reverse	y
aJam-rnd-6_family-5043	ClassI_LTR_ERV	LTR	forward	y
aJam-rnd-6_family-5739	ClassI_LTR_ERV	LTR	forward	y
aJam-rnd-6_family-5806	ClassI_LTR_ERV	LTR	forward	y
aJam-rnd-6_family-615	ClassI_LTR_ERV	LTR	forward	y
aJam-rnd-6_family-7713	ClassII_DNA_TcMar_nMITE	Retro	forward	n

Pull matches and nonmatches for examination.

cat list.txt| while read i; do awk '{ if ($5 == "n") { print } }' ${i}_names.tsv | awk '{print $1}' >>${i}_nonmatches.txt; done
cat list.txt| while read i; do awk '{ if ($5 == "y") { print } }' ${i}_names.tsv | awk '{print $1}' >>${i}_matches.txt; done

Pull sequences from teclass.fa file.

cp ~/gitrepositories/bioinfo_tools/pull_seqs.py .

Modified to split on __ instead of #.

cat list.txt| while read i; do python pull_seqs.py -l ${i}_matches.txt -f ../teclass/${i}_teclass.fa -o ${i}_matches.fa; done
cat list.txt| while read i; do python pull_seqs.py -l ${i}_nonmatches.txt -f ../teclass/${i}_teclass.fa -o ${i}_nonmatches.fa; done

Also pull MSA_extended.fa files from the extend_align step.

mkdir nonmatched_MSA
cat list.txt| while read i; do cat ${i}_nonmatches.txt| while read j; do cp /lustre/scratch/daray/bat1k_TE_analyses/te-extensions_N/${i}_4_examination/unmasked/${j}_MSA_extended.fa nonmatched_MSA; done; done

Visually examined the nonmatched and evaluated for possible ID. All results saved as _names.tsv in /lustre/scratch/daray/bat1k_TE_analyses/te-naming

Final resuls for all TEs save in all_names.tsv. Also in that file is the old name/new name equivalency list.

Saved oldname/newname switch file as rename.list.txt.

Copy all rep.fa files and concatenate for renaming.

cat ../deepte/*rep.fa >all.rep.fa

Problem. sBil-1.1 read the same as sBil-1.100. This causes problems with the replacement step.

Solution. Add a character to the end of all headers in all.rep.fa using fabox and then try again.

Save as all.repx.fa

All headers in all.repx.fa now end with 'x'. Added 'x' to all headers in the rename.list.txt file.

Renamed and output final library.

python ~/gitrepositories/bioinfo_tools/rename_fasta.py -r rename.list.txt -f all.rep.fa -o bat1k_denovo_TE_modnames.fa
sed "s|___|/|g" bat1k_denovo_TE_modnames.fa | sed "s/__/#/g" >bat1k_denovo_TE.fa

Four new bat assemblies

mNig
pHas2
tTri
tTri.alt
uBil
Create new directory and list.txt

mkdir /lustre/scratch/daray/bat1k_TE_analyses/newbat1k
cd /lustre/scratch/daray/bat1k_TE_analyses/newbat1k
nano list.txt

add
mNig
pHas2
tTri
tTri.alt
uBil

RepeatMasker using new library

mkdir rmasker
cd rmasker/
cp ../list.txt .
cp ../../denovo_rmask/template_RM.sh .
cp ../../denovo_rmask/template_rm2bed.sh .

Modify templates to use the new assemblies, new directory and to mask to Ns.

Run RepeatMasker

cat list.txt | while read i; do sed "s/<NAME>/$i/g" template_RM.sh >${i}_RM.sh; sed "s/<NAME>/$i/g" template_rm2bed.sh >${i}_rm2bed.sh; done
cat list.txt | while read i; do sbatch ${i}_RM.sh; done

RepeatModeler on each assembly, masked using final mammal library - March, 2021 - see /lustre/scratch/daray/bat1k_TE_analyses/rmasker/<folder>_RM_Ns

mkdir /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall_Ns
cd /lustre/scratch/daray/bat1k_TE_analyses/rmodeler-newinstall_Ns

Process RepeatModeler output to eliminate short (less than 100 bp) consensi. Cat with mammal TE library.

mkdir cdhit_analysis

find . -type f -name "*_N_consensi.fa" | xargs cp -t cdhit_analysis/

LIST="mNig pHas2 tTri tTri.alt uBil"

cd cdhit_analysis

SOFTWARE=/lustre/work/daray/software
GITPATH=~/gitrepositories/bioinfo_tools

for NAME in $LIST;  
 do  $SOFTWARE/EMBOSS-6.6.0/emboss/sizeseq -sequences ${NAME}_N_consensi.fa -descending N -outseq ${NAME}_N_consensi_sort.fa;  
 perl $SOFTWARE/removesmalls.pl 100 ${NAME}_N_consensi_sort.fa >${NAME}_N_consensi_100.fa;    
 cat ${NAME}_N_consensi_100.fa final_mammal_library.fa >${NAME}_4_cdhit.fa;   
done

for NAME in $LIST;  
	do cut  -f1 -d" " ${NAME}_4_cdhit.fa >${NAME}_4_cdhit_cut.fa
done

for NAME in $LIST;  
	do sed "s/rnd-/${NAME}-rnd-/g" ${NAME}_4_cdhit_cut.fa >${NAME}_4_cdhit.fa
done

Run cd-hit-est on all concatenated libraries. 80% similarity over 80% of length with comparison.

Bat1k de novo TE curation

Time to name the TEs

Four new bat assemblies

tags: TEs, David

Read more

Multiple Genomes Alignment

TE curation analysis, post RAM

**Genome assembly: Flye software**

README

tags: `TEs`, `David`

Genome assembly: Flye software