RepeatModeler on each assembly, masked using final mammal library - March, 2021 - see /lustre/scratch/daray/bat1k_TE_analyses/rmasker/<folder>_RM_Ns
Process RepeatModeler output to eliminate short (<100 bp) consensi. Cat with mammal TE library.
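The length filter above can be sketched in a few lines of Python. This is a minimal illustration, not the script actually used: the function names, the example headers, and the choice to keep sequences of exactly 100 bp are all assumptions.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_short(records, min_len=100):
    """Keep consensi of at least min_len bp, i.e. drop those shorter than min_len."""
    return [(h, s) for h, s in records if len(s) >= min_len]


def to_fasta(records):
    """Write records back out as FASTA text, one sequence per line."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


# Hypothetical input: one 120 bp consensus and one 60 bp consensus.
example = (
    ">rnd-1_family-1#Unknown\n" + "A" * 120 + "\n"
    ">rnd-1_family-2#Unknown\n" + "C" * 60 + "\n"
)
kept = drop_short(read_fasta(example))
filtered_fasta = to_fasta(kept)  # this text would then be concatenated with the mammal library
```

Only the 120 bp record survives; the filtered text is what would be concatenated with the mammal TE library.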
Run cd-hit-est on all concatenated libraries. 80% similarity over 80% of length with comparison.
Find clusters existing only in target species.
Count number of lines with 'nt' in them.
Count number of instances of 'family'.
If they're equal, send file to new folder.
Grab the longest hit from each cluster and append name of putative TE to file.
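The cluster test described above (number of 'nt' member lines equal to number of 'family' occurrences) can be sketched as follows. This is a hedged illustration that assumes the standard cd-hit .clstr text layout and that only RepeatModeler names contain the string 'family'. The example cluster text and the function names are made up for illustration.

```python
import re

# One member line of a .clstr file, e.g. "0<TAB>1500nt, >name... *"
MEMBER = re.compile(r"^\d+\s+(\d+)nt, >(.+?)\.\.\.")


def parse_clusters(text):
    """Return a list of clusters, each a list of (name, length) tuples."""
    clusters, current = [], None
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            current = []
            clusters.append(current)
        elif current is not None:
            m = MEMBER.match(line)
            if m:
                current.append((m.group(2), int(m.group(1))))
    return clusters


def is_target_only(cluster):
    """True when every member line is a RepeatModeler 'family' sequence."""
    n_members = len(cluster)  # lines containing 'nt'
    n_family = sum(name.count("family") for name, _ in cluster)
    return n_members > 0 and n_members == n_family


def longest_member(cluster):
    """Return the (name, length) tuple of the longest sequence in the cluster."""
    return max(cluster, key=lambda member: member[1])


# Hypothetical .clstr content: cluster 0 is novel, cluster 1 includes a library TE.
CLSTR = "\n".join([
    ">Cluster 0",
    "0\t1500nt, >rnd-1_family-5#LINE/L1... *",
    "1\t1200nt, >rnd-1_family-9#Unknown... at +/85.20%",
    ">Cluster 1",
    "0\t900nt, >L1_Mam#LINE/L1... *",
    "1\t850nt, >rnd-2_family-3#Unknown... at +/91.00%",
])

novel = [c for c in parse_clusters(CLSTR) if is_target_only(c)]
representatives = [longest_member(c) for c in novel]
```

Note that cd-hit truncates sequence names in the .clstr file unless told otherwise, so names recovered this way may need to be matched back to the full headers.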
Do I want to cat all the novel files and run a cd-hit analysis? I don't know but I think not.
Moved to /lustre/scratch/daray/bat1k_TE_analyses/te-extensions_N
Perform two extend align runs: one using the novel TEs to query the unmasked genome and another querying the genome masked with Ns.
Note, I began using a text file called list.txt. It's just the same list as before as a text file.
Create a table of hits from the masked run using the .out file from the extract_align run
Get list of TEs to be considered. Only TEs with more than 40 hits.
Pull all of the relevant files and copy them to a folder for later download
Create a table of hits from the unmasked run using the .out file from the extract_align run
Get list of TEs to be considered. Only TEs with more than 40 hits.
Pull all of the relevant files and copy them to a folder for later download
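The "more than 40 hits" filter used for both the masked and unmasked runs amounts to a simple count per TE. The sketch below assumes the hit table has already been reduced to one TE name per hit. The real .out layout is not described in these notes, so the input here is hypothetical.

```python
from collections import Counter


def tes_to_consider(hit_names, min_hits=40):
    """Return TE names with strictly more than min_hits hits, sorted by name."""
    counts = Counter(hit_names)
    return sorted(name for name, n in counts.items() if n > min_hits)


# Hypothetical hit list: teA has 41 hits, teB exactly 40, teC only 3.
hits = ["teA"] * 41 + ["teB"] * 40 + ["teC"] * 3
selected = tes_to_consider(hits)
```

A TE with exactly 40 hits is excluded here, following the "more than 40" wording.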
Create list of potential TEs to be examined from concatenated rep files for each species
Run RepeatMasker on the concatenated list of consensus sequences
Create a new .out file for each putative TE for examination
Add the length of each consensus to the filename
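Adding the consensus length to a filename could look like the sketch below. The naming pattern (`_<length>bp` before the extension) and the example filename are assumptions, not the convention actually used.

```python
from pathlib import Path


def name_with_length(filename, consensus):
    """Insert the consensus length, in bp, before the file extension."""
    p = Path(filename)
    return f"{p.stem}_{len(consensus)}bp{p.suffix}"


# Hypothetical file and a 1000 bp consensus.
new_name = name_with_length("sBil-1.1_rep.out", "ACGT" * 250)
```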
Zip the folder for later download
Manually examined all TE files in the unmasked directories and categorized as 'good', 'long', 'bad', and 'odd'.
good - a clear example of a TE with obvious flanking sequences.
bad - A likely segmental duplication. Censor (later switched to RMasker) analysis suggests multiple small TEs in the long consensus when rep.fa is submitted. No blast hits to a TE-derived ORF using blastn or blastx.
odd - Possible TE but it's hard to say.
long - >10,000 bp and shows evidence of multiple TE insertions
This strategy is working. See notes in 'TE curation analysis' hackmd page.
All of the above work was done on local computers. Move everything back to HPCC in a new directory /lustre/scratch/daray/bat1k_TE_analyses/te-naming.
Here's the plan:
Move all .tgz files to /lustre/scratch/daray/bat1k_TE_analyses/te-naming. Copy all the 'good' consensus sequences to deepte folder.
Run DeepTE
Run all DeepTE.fa files through TEClass using online interface. Download the results in two forms: the library (as teclass.fa) and the table. Save the table as Excel csv (${i}_teclass.csv).
Create list of all comparisons between DeepTE and TEClass.
Output:
Manually examine all the files and check for correspondence. Add column indicating agreement or disagreement.
Pull matches and nonmatches for examination.
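The agreement column and the match/nonmatch split can be sketched as below. This assumes the DeepTE and TEClass calls have already been reduced to a shared label vocabulary (for example 'LINE', 'DNA', 'LTR'). The two tools report labels differently, so that normalization step is not shown, and the TE names and calls here are hypothetical.

```python
def compare_calls(deepte, teclass):
    """Build rows of (te, deepte_call, teclass_call, agreement) for every TE seen."""
    rows = []
    for te in sorted(deepte.keys() | teclass.keys()):
        d = deepte.get(te, "NA")
        t = teclass.get(te, "NA")
        agreement = "agree" if d == t and d != "NA" else "disagree"
        rows.append((te, d, t, agreement))
    return rows


# Hypothetical, already-normalized calls from each classifier.
deepte_calls = {"sBil-1.1": "DNA", "sBil-1.2": "LINE", "sBil-1.3": "LTR"}
teclass_calls = {"sBil-1.1": "DNA", "sBil-1.2": "SINE"}

table = compare_calls(deepte_calls, teclass_calls)
matches = [row[0] for row in table if row[3] == "agree"]
nonmatches = [row[0] for row in table if row[3] == "disagree"]
```

A TE missing from either tool's output is treated as a disagreement so that it still ends up in the set that gets examined manually.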
Pull sequences from teclass.fa file.
Modified to split on __ instead of #.
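For illustration, splitting a header on `__` rather than `#` might look like this. The example headers are hypothetical and the function name is an assumption.

```python
def split_header(header, sep="__"):
    """Split a FASTA header into (name, classification) at the first separator."""
    name, _, label = header.lstrip(">").partition(sep)
    return name, label


# Hypothetical headers in the two styles.
new_style = split_header(">sBil-1.1__ClassII_DNA_hAT")
old_style = split_header(">rnd-1_family-5#LINE/L1", sep="#")
```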
Also pull MSA_extended.fa files from the extend_align step.
Visually examined the nonmatched and evaluated for possible ID. All results saved as _names.tsv in /lustre/scratch/daray/bat1k_TE_analyses/te-naming
Final results for all TEs saved in all_names.tsv. Also in that file is the old name/new name equivalency list.
Saved oldname/newname switch file as rename.list.txt.
Copy all rep.fa files and concatenate for renaming.
Problem. sBil-1.1 also matches the start of sBil-1.100. This causes problems with the replacement step.
Solution. Add a character to the end of all headers in all.rep.fa using fabox and then try again.
Save as all.repx.fa
All headers in all.repx.fa now end with 'x'. Added 'x' to all headers in the rename.list.txt file.
Renamed and output final library.
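The renaming problem and the trailing-'x' fix can be reproduced in a small sketch. It assumes the replacement step is a plain substring substitution applied one old name at a time. The actual renaming command is not recorded in these notes, and the new names below are hypothetical.

```python
def rename_headers(fasta_text, mapping):
    """Naive sequential substring replacement of old names with new names."""
    for old, new in mapping.items():
        fasta_text = fasta_text.replace(old, new)
    return fasta_text


def add_sentinel_to_headers(fasta_text, sentinel="x"):
    """Append a sentinel character to every FASTA header line."""
    return "".join(
        (line + sentinel if line.startswith(">") else line) + "\n"
        for line in fasta_text.splitlines()
    )


fasta = ">sBil-1.1\nACGT\n>sBil-1.100\nTTAA\n"
# Hypothetical old name -> new name pairs.
rename = {"sBil-1.1": "hAT-1_sBil", "sBil-1.100": "L1-2_sBil"}

# Without the sentinel, 'sBil-1.1' also matches inside 'sBil-1.100'.
broken = rename_headers(fasta, rename)

# With 'x' appended to headers and to the old names, each match is unambiguous.
rename_x = {old + "x": new for old, new in rename.items()}
fixed = rename_headers(add_sentinel_to_headers(fasta), rename_x)
```

In the broken output the second header becomes `hAT-1_sBil00`. With the sentinel, `sBil-1.1x` can no longer match inside `sBil-1.100x`, so both headers are renamed correctly.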
mNig
pHas2
tTri
tTri.alt
uBil
Create new directory and list.txt
add
mNig
pHas2
tTri
tTri.alt
uBil
RepeatMasker using new library
Modify templates to use the new assemblies, new directory and to mask to Ns.
Run RepeatMasker
, David