# I ricinus TE analysis
Beginning by running RepeatModeler on two assembly versions. Both are females, but one is several hundred megabases larger than the other. Haven't received an answer from Rodrigo or Travis as to why this could be the case.
Regardless, I will run RepeatModeler and then collapse the library using cdhit set to 95% similarity.
RepeatModeler completed (3/10/2022) on both assemblies. Both runs yielded a LOT of elements, around the same number as resulted from the iSca run (n=4527): iRic1-families.fa, n=4675; iRic2-families.fa, n=4723.
Not sure exactly what is the best option for moving forward.
1. Run RAM and then use the output? I don't think so. RAM has the potential to generate some odd consensus sequences. We've already been through what will likely be most of those in iSca.
2. Run cdhit on one output at a time. Combine with newlib4cdhit_trimmed.fa (n=2059), which contains all of the stuff we felt confident naming plus all of the stuff that was uninterpretable or 'satellite'. Another option is to use lib4cdhit_trimmmed.fa (n=1534), which contains all of the stuff we felt comfortable naming. Either way, would need to run iRic1(&2)-families.fa with the chosen library and then get rid of the duplicates. Do this for each of the files, iRic1 and iRic2.
3. Combine iRic1 and iRic2 and then run cdhit on the combined version using either newlib4cdhit_trimmed.fa or lib4cdhit_trimmmed.fa.
I'm leaning toward 3 at the moment. Will proceed and see how it goes.
### Option 3
```
cd /lustre/scratch/daray/ixodes/iRic2/rmodeler_dir
cut -d" " -f1 iRic2-families.fa > iRic2-families.mod.fa
sed -i "s/rnd/iRic2-rnd/g" iRic2-families.mod.fa
cd ../../iRic1/rmodeler_dir
cut -d" " -f1 iRic1-families.fa > iRic1-families.mod.fa
sed -i "s/rnd/iRic1-rnd/g" iRic1-families.mod.fa
cd ../..
mkdir iRic_cdhit_work
cd iRic_cdhit_work/
cat ../iRic1/rmodeler_dir/iRic1-families.mod.fa ../iRic2/rmodeler_dir/iRic2-families.mod.fa >iRic1_2-families.mod.fa
cat iRic1_2-families.mod.fa newlib4cdhit_trimmed.fa >iRic1_2_newlib_4cdhit.fa
i12
cd /lustre/scratch/daray/ixodes/iRic_cdhit_work
CDHIT=/lustre/work/daray/software/cdhit-4.8.1
$CDHIT/cd-hit-est -i iRic1_2_newlib_4cdhit.fa -o iRic1_2_newlib.cdhit95_sS09 -c 0.95 -d 70 -aS 0.9 -n 12 -M 2200 -l 100
```
To process output (iRic1_2_newlib.cdhit95_sS09.clstr), will use bash to filter the files as follows:
Pull all hits that are from iRic1 or iRic2 and are NOT the longest hit in the cluster.
In other words, find clusters like this:
```
>Cluster 62
0 201nt, >iRic1-rnd-1_family-140#Unknown... at -/95.52%
1 194nt, >iRic1-rnd-4_family-1394#Unknown... at +/95.36%
2 10545nt, >iSca.1.307#Unknown/Unknown... *
3 193nt, >iSca.1.288#DNA/TcMariner-TIR-GGG... at -/97.93%
```
Create a new list from lines 0 and 1. Target those for removal.
This will leave clusters like this:
```
0 8745nt, >iRic1-rnd-6_family-1316#DNA/MULE-MuDR... *
1 6905nt, >iRic2-rnd-5_family-3428#DNA/MULE-MuDR... at +/98.93%
```
and like this:
```
>Cluster 2665
0 1132nt, >iRic2-rnd-5_family-710#LINE/CR1... *
```
Line 0 will be retained in both cases.
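That filter can be sketched in Python (the helper name is mine; it assumes the standard cd-hit `.clstr` layout shown above, where the cluster representative line ends in `*`):

```python
import re

def redundant_iric_members(clstr_text):
    """From cd-hit .clstr text, collect iRic sequence names that are NOT
    the cluster representative (representative lines end in '*'); these
    are the candidates for removal."""
    to_remove = []
    for line in clstr_text.splitlines():
        if line.startswith(">Cluster") or not line.strip():
            continue
        # Member lines look like: 0  201nt, >name... at -/95.52%
        m = re.search(r">(\S+?)\.\.\.", line)
        if not m:
            continue
        name = m.group(1)
        if name.startswith("iRic") and not line.rstrip().endswith("*"):
            to_remove.append(name)
    return to_remove
```

The returned names can then be dropped from the combined library before the extension step.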
```
cd /lustre/scratch/daray/ixodes/iRic_cdhit_work/cdhit1
cat iRic1_2_newlib.cdhit95_sS09.clstr | fgrep '*' | grep "iRic"
```
OUCH! That still leaves 6018 potentials.
What if I do 90%?
```
i12
cd /lustre/scratch/daray/ixodes/iRic_cdhit_work/cdhit1
CDHIT=/lustre/work/daray/software/cdhit-4.8.1
$CDHIT/cd-hit-est -i iRic1_2_newlib_4cdhit.fa -o iRic1_2_newlib.cdhit90_sS09 -c 0.90 -d 70 -aS 0.9 -n 9 -M 2200 -l 100
```
That takes the total down to 4677. That's still a LOT.
Going to work with that, run it through RAM and then see what happens when I run cdhit again.
```
cat iRic1_2_newlib.cdhit90_sS09.clstr | fgrep '*' | grep "iRic" | cut -d">" -f2 | cut -d"." -f1 > novel-iRic.txt
python ~/gitrepositories/bioinfo_tools/pull_seqs2.py -l novel-iRic.txt -f iRic1_2-families.mod.fa -o novel_iRic.fa
cut -d"#" -f1 novel_iRic.fa >novel_iRic.mod.fa
cd /lustre/scratch/daray/ixodes/iRic1
nano iRic1_extend_align.sh
sbatch iRic1_extend_align.sh
cd ../iRic2
cp ../iRic1/iRic1_extend_align.sh iRic2_extend_align.sh
nano iRic2_extend_align.sh
sbatch iRic2_extend_align.sh
```
Runs finished overnight. Now, collect all of the new consensus sequences and re-run cd-hit to eliminate potential duplicates.
```
i12
cd /lustre/scratch/daray/ixodes/iRic_cdhit_work/cdhit2
cat /lustre/scratch/daray/ixodes/iRic1/extensions/final_consensuses/*.fa /lustre/scratch/daray/ixodes/iRic2/extensions/final_consensuses/*.fa ../cdhit1/newlib4cdhit_trimmed.fa >cdhit2_input.fa
CDHIT=/lustre/work/daray/software/cdhit-4.8.1
$CDHIT/cd-hit-est -i cdhit2_input.fa -o cdhit2.cdhit90_sS09 -c 0.90 -d 70 -aS 0.9 -n 9 -M 2200 -l 100
$CDHIT/cd-hit-est -i cdhit2_input.fa -o cdhit2.cdhit95_sS09 -c 0.95 -d 70 -aS 0.9 -n 11 -T 36 -M 2200 -l 100
```
The very large elements are still a problem. I think one way to make this work is to just eliminate anything over 12kb from our analysis. The rationale can be that these are likely segmental duplications.
Had to consider whether to cull now or go back to before the extend_align run. In other words, how many of the original consensus sequences were greater than 12kb in the initial RepeatModeler output? Checked: there were only two. Post extend_align, there are 369.
Went back to arthropods+denovo+expanded.fa and removed anything from iSca group that was over 12kb and was classified as Unknown/Unknown. Removals are saved as removed_12kbplus.fa. Filtered library is saved as arthropods+denovo+expanded+under12kb.fa. Keeping anything that has a clear TIR. This way, we can say we kept potential mobilizing elements.
Will need to go back to the repeatmasker run for iSca and eliminate those entries from the library.
Getting the extensions.
```
cat /lustre/scratch/daray/ixodes/iRic1/extensions/final_consensuses/*.fa /lustre/scratch/daray/ixodes/iRic2/extensions/final_consensuses/*.fa >iRic1_2_extensions.fa
```
Downloaded and sorted by size. Moved anything greater than 12kb into iRic1_2_extensions_over12kb.fa; kept anything less than 12kb as iRic1_2_extensions_under12kb.fa.
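The size split was done by hand here, but it can be scripted; a minimal sketch (function name and parsing are mine, plain single-line-or-wrapped FASTA assumed):

```python
def split_by_length(fasta_text, cutoff=12000):
    """Partition FASTA records into (under, over) lists of
    (header, sequence) tuples by sequence length."""
    under, over = [], []
    header, seq = None, []

    def flush():
        # Commit the record accumulated so far to the right bucket.
        if header is None:
            return
        record = (header, "".join(seq))
        (under if len(record[1]) < cutoff else over).append(record)

    for line in fasta_text.splitlines():
        if line.startswith(">"):
            flush()
            header, seq = line, []
        else:
            seq.append(line.strip())
    flush()
    return under, over
```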
Now, combine that last file with arthropods+denovo+expanded+under12kb.fa and rerun cdhit.
Moved all of the cdhit results generated earlier to 'originalrun'.
```
i12
cd /lustre/scratch/daray/ixodes/iRic_cdhit_work/cdhit2
cat arthropods+denovo+expanded+under12kb.fa iRic1_2_extensions_under12kb.fa >cdhit2_input.fa
CDHIT=/lustre/work/daray/software/cdhit-4.8.1
$CDHIT/cd-hit-est -i cdhit2_input.fa -o cdhit2.cdhit90_sS09 -c 0.90 -d 70 -aS 0.9 -n 9 -T 12 -M 2200 -l 100
$CDHIT/cd-hit-est -i cdhit2_input.fa -o cdhit2.cdhit95_sS09 -c 0.95 -d 70 -aS 0.9 -n 11 -T 12 -M 2200 -l 100
```
Still a lot of elements to work through.
```
$ cat cdhit2.cdhit95_sS09.clstr | fgrep '*' | grep "iRic" | wc -l
3181
$ cat cdhit2.cdhit90_sS09.clstr | fgrep '*' | grep "iRic" | wc -l
2937
```
Might as well just get to it.
```
actconda
cat cdhit2.cdhit95_sS09.clstr | fgrep '*' | grep "iRic1" | cut -d">" -f2 | cut -d"." -f1 > novel_iRic1.txt
python ~/gitrepositories/bioinfo_tools/pull_seqs2.py -l novel_iRic1.txt -f iRic1_2_extensions.fa -o novel_iRic1.fa
cat cdhit2.cdhit95_sS09.clstr | fgrep '*' | grep "iRic2" | cut -d">" -f2 | cut -d"." -f1 > novel_iRic2.txt
python ~/gitrepositories/bioinfo_tools/pull_seqs2.py -l novel_iRic2.txt -f iRic1_2_extensions.fa -o novel_iRic2.fa
```
Now, let's run these through te-aid.
```
mkdir -p /lustre/scratch/daray/ixodes/iRic1/te-aid/splits
cd /lustre/scratch/daray/ixodes/iRic1/te-aid/splits
WORK=/lustre/work/daray/software
cp novel_iRic1.fa .
$WORK/faSplit byname ../novel_iRic1.fa .
cp /lustre/scratch/daray/ixodes/iSca/round1-naming/unknowns-for-curating/david/te-aid_work/bin/te-aid.sh .
nano te-aid.sh ####Modify as needed
sbatch te-aid.sh
```
Going to try to follow the instructions from Clement's TE curation pipeline, lines 128-
First, need to get all of my potential TEs from each genome together in one place. Then classify them to get the headers that generate_priority_list_from_RM2.sh will retrieve.
```
cd /lustre/scratch/daray/ixodes/iRic1/repeatclassifier
cat ../te-aid/*_rep.fa >iRic1_extended_rep.fa
sbatch repeatclassifier.sh
cd /lustre/scratch/daray/ixodes/iRic2/repeatclassifier
cat ../te-aid/*_rep.fa >iRic2_extended_rep.fa
sbatch repeatclassifier.sh
```
```
cd /lustre/scratch/daray/ixodes/iRic1/
mkdir prioritize
cd prioritize
sbatch prioritize.sh
##### Below was run on local desktop
$BIN/generate_priority_list_from_RM2.sh ../repeatclassifier/iRic1_extended_rep.fa.classified ../../../assemblies/iRic1.fa ~/Pfam_db ~/TE_ManAnnot
```
Spent several days creating TEcurate.sh, a script to process and categorize TEs. Used it on iRic1 and iRic2.
Downloaded results for LTR, DNA, and LINE and easily categorized them using the pdfs.
The NOHIT files are next.
Going to concatenate all of the rep.fa files and run through a less stringent cd-hit analysis to see if I can categorize them a bit more.
```
cd /mnt/d/Dropbox/ixodes/ricinus/iRic1/fordownload/NOHIT
mkdir cdhit
cat *_rep.fa >all_rep.fa
mv all_rep.fa cdhit
cd cdhit
actconda
conda activate curate
cat all_rep.fa newlib4cdhit_trimmed.fa >lowstringency_cdhit_input.fa
cd-hit-est -i lowstringency_cdhit_input.fa -o lowstringency.cdhit80_sS09 -c 0.80 -d 70 -aS 0.9 -n 5 -T 12 -M 2200 -l 100
```
A problem with this approach: all of the crappy consensus sequences larger than 10kb occlude any meaningful assessment of potential subfamily TEs.
Potential solution: get rid of anything from either iSca or iRic that's larger than 10kb before the cdhit run.
Used BioEdit to get rid of 10kb+ entries. Saved as lowstringency_cdhit_input_under10kb.fa.
```
cd-hit-est -i lowstringency_cdhit_input_under10kb.fa -o lowstringency_under10kb.cdhit80_sS09 -c 0.80 -d 70 -aS 0.9 -n 5 -T 12 -M 2200 -l 100
```
That yielded a workable file.
I went through it manually, finding all the spots where an iRic was linked with a known iSca. Most of them were pairs. A few were trios or quartets.
The bash commands below allowed me to work through the pairs and classify them into a new folder within NOHIT, classified.
Renamed and classified the trios and quartets manually. There were only a few.
```
echo "Creating directories"
mkdir -p working/clusterfiles
#cd clusterfiles
cd working
echo "Creating individual files for each cluster"
awk '/>Cluster/ {x="clusterfiles/F"++i".txt";}{print >x;}' lowstringency_under10kb.cdhit80_sS09_trimmed.txt
cd clusterfiles
mkdir two
mkdir other
for FILE in *.txt; do NUMBER=$(basename $FILE .txt); N=${NUMBER:1}; grep -v ">Cluster" $FILE >G${N}.txt; rm $FILE; done
for FILE in G*.txt; do NUMBER=$(basename $FILE .txt); N=${NUMBER:1}; cut -d">" -f 2 $FILE | cut -d" " -f 1 | sed "s/.\{0,3\}$//; /^$/d" > H${N}.txt; rm $FILE; done
for FILE in H*.txt; do HOWMANY=$(wc -l $FILE | cut -d" " -f1)
if (( HOWMANY == 2 ))
then
mv $FILE two
else
mv $FILE other
fi
done
cd two
mkdir ../../../../classified
for FILE in H*.txt; do
	NUMBER=$(basename $FILE .txt)
	N=${NUMBER:1}
	TOBE=$(grep iSca $FILE | cut -d"#" -f2)
	OLD=$(grep iRic $FILE)
	NEWCAT=${OLD}#${TOBE}
	CONSNAMEMOD=${NEWCAT/-rnd-/.}
	CONSNAMEMOD=${CONSNAMEMOD/_family-/.}
	cp ../../../../${OLD}-_rep_mod.fa ../../../../${OLD}-_rep_mod_orig.fa
	sed "s|$OLD|$CONSNAMEMOD|g" ../../../../${OLD}-_rep.fa > ../../../../${OLD}-_rep_mod.fa
	mv ../../../../${OLD}-* ../../../../classified
done
```
That still leaves several hundred to process by hand.
Manually processed those to identify obvious TSDs.
There are plenty of LINE-like elements in there, based on my observation of several repetitive tails and a bunch of associated microsatellite repeats.
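The core of the manual TSD check is asking whether the 5' flank ends with the same short motif the 3' flank starts with. A toy helper capturing that idea (my own sketch; exact matches only, whereas real TSDs may need a mismatch allowance):

```python
def find_tsd(left_flank, right_flank, min_len=4, max_len=12):
    """Return the longest exact direct repeat shared by the end of the
    5' flank and the start of the 3' flank (a candidate TSD), or None."""
    for n in range(max_len, min_len - 1, -1):
        if left_flank[-n:] == right_flank[:n]:
            return left_flank[-n:]
    return None
```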
I'd noted that a LOT of these didn't seem to get picked up by earlier screens for TE ORFs. I need to run this batch through again and see if it works better this time.
Recollect all of the rep.fa files that remain and prep the rerun.
```
cd /mnt/d/Dropbox/ixodes/ricinus/iRic1/fordownload/NOHIT
cat *-_rep.fa >all_rep_4_rerun.fa
grep ">" all_rep_4_rerun.fa | sed "s/>//g" >> rerun_list.txt
#Moved to HPCC for remainder of work
cd /lustre/scratch/daray/ixodes/iRic1_rerun/repeatclassifier
actconda
python ~/gitrepositories/bioinfo_tools/pull_seqs.py -l rerun_list.txt -f iRic1_extended_rep.fa.classified -o iRic1_rerun.fa.classified
```
Altered TEcurate to use the new directory (/lustre/scratch/daray/ixodes/iRic1_rerun) and the newly created classified file.
Uploaded TEcurate_rerun.sh to /lustre/scratch/daray/ixodes/iRic1_rerun and ran.
First run got exactly zero ORF hits. I reduced the minorf value in the script. Maybe it was too high.
New minorf = 500. We'll see if that helps.
The results are very good. I'm constantly improving the output.
Once the results from the 'fordownload' folder are obtained, some post-processing is needed.
The following need to be identified by hand:
- LINEs with upstream satellites
- Tandem Penelope elements
If a list of TE IDs is generated, list.txt, they can be post-processed as follows:
```
for FILE in *_rep_mod.fa; do
CONSNAME=$(basename $FILE -_rep_mod.fa)
echo $CONSNAME
CONSNAMEMOD=${CONSNAME/-rnd-/.}
CONSNAMEMOD=${CONSNAMEMOD/_family-/.}
HEADER=${CONSNAMEMOD}#LINE/LINE-like
echo $HEADER
sed -i "s|${CONSNAME}|${HEADER}|g" $FILE
done
for FILE in *.fa; do
seqkit seq -r -p -t DNA $FILE >${FILE}.tmp
mv ${FILE}.tmp $FILE
done
#Find and copy the appropriate files used to build the dot plots.
cat list.txt | while read i; do
find . -type f -name "${i}_rep.fa.self-blast.pairs.txt" -exec cp {} . \;
done
#Extract and separate upstream satellites from ixodes LINEs
cat list.txt | while read i; do
#echo ${i}_rep.fa.self-blast.pairs.txt
REPEAT=$(sed '3q;d' ${i}_rep.fa.self-blast.pairs.txt | awk '{print $5}')
#echo $REPEAT
HEADER=$(grep ">" ${i}_rep_mod.fa | sed "s/>//g")
PREHEADER=$(echo $HEADER | cut -d"#" -f1)
SHEADER=${PREHEADER::-1}S-/#Satellite/Satellite
LHEADER=${HEADER/'-'/'L-'}
CONSSIZE=$(seqkit fx2tab --length --name ${i}_rep_mod.fa | awk '{print $2}')
#echo $CONSSIZE
cat ${i}_rep_mod.fa | seqkit subseq -r 1:$REPEAT >S.fa.tmp
cat ${i}_rep_mod.fa | seqkit subseq -r $REPEAT:$CONSSIZE >L.fa.tmp
#echo $HEADER
#echo $PREHEADER
#echo $POSTHEADER
#echo $SHEADER
#echo $LHEADER
sed -i "s|$HEADER|$SHEADER|g" S.fa.tmp
sed -i "s|$HEADER|$LHEADER|g" L.fa.tmp
cat S.fa.tmp L.fa.tmp >${i}_rep_mod-trimmed.fa
sed -i "s|/#|#|g" ${i}_rep_mod-trimmed.fa
rm S.fa.tmp L.fa.tmp
done
#Extract single, full-length Penelope from tandem repeats
cat list.txt | while read i; do
#echo ${i}_rep.fa.self-blast.pairs.txt
FULLEND=$(sed '3q;d' ${i}_rep.fa.self-blast.pairs.txt | awk '{print $5}')
FULLSTART=$(sed '3q;d' ${i}_rep.fa.self-blast.pairs.txt | awk '{print $3}')
cat ${i}_rep_mod.fa | seqkit subseq -r $FULLSTART:$FULLEND >${i}_rep_mod-trimmed.fa
done
#Add 'LTR-like' to headers where needed.
for FILE in *_rep_mod.fa; do
HEADER=$(grep ">" $FILE)
REPLACE=${HEADER}#LTR/LTR-like
sed -i "s|$HEADER|$REPLACE|g" $FILE
done
```
Pseudogene search
```
mkdir /lustre/scratch/daray/ixodes/pseudosearch
cd /lustre/scratch/daray/ixodes/pseudosearch
ln -s ../assemblies/iSca_ISE6_cds.fa
cp ../wgs_reads/iSca_lt12kb.fa .
actconda
conda activate extend_env
blastn -query iSca_ISE6_cds.fa -db iSca_lt12kb.fa -out iSca_v_iSca.out -outfmt 6 -evalue 1e-50
grep "LINE/unknown" iSca_v_iSca.out >iSca_possible_pseudogenes.txt
```
These will be considered potential pseudogenes and removed from the library.
Do the same for iRic
```
cp /lustre/scratch/daray/ixodes/iRic1_2_curated/iRic1_2_final_curated_clean.fa .
makeblastdb -in iRic1_2_final_curated_clean.fa -dbtype nucl
blastn -query iSca_ISE6_cds.fa -db iRic1_2_final_curated_clean.fa -out iSca_v_iRic.out -outfmt 6 -evalue 1e-50
grep "LINE/" iSca_v_iRic.out >iRic_possible_pseudogenes.txt
```
Nothing was identified as a pseudogene in the iRic library; all hits were classified as some known LINE.
Removal from library
```
awk '{print $2}' iSca_possible_pseudogenes.txt | sort | uniq >iSca_pseudogene_list.txt
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iSca_pseudogene_list.txt -f iSca_lt12kb.fa -o iSca_lt12kb_nopseudo.fa
```
FINAL LIBRARIES = iSca_lt12kb_nopseudo.fa, iRic1_2_final_curated_clean.fa
```
cp iSca_lt12kb_nopseudo.fa iRic1_2_final_curated_clean.fa /lustre/scratch/daray/ixodes/final_library_work
cd /lustre/scratch/daray/ixodes/final_library_work
cp /lustre/scratch/daray/ixodes/iSca/rmasker_final_under12kb/arthropods+denovo+expanded+under12kb.fa .
```
Manually removed all of the iSca elements from arthropods+denovo+expanded+under12kb.fa so it can be combined with the nopseudo library; saved as arthropods+denovo+expanded.fa.
```
cat arthropods+denovo+expanded.fa iSca_lt12kb_nopseudo.fa iRic1_2_final_curated_clean.fa >ixodes_library_final_050202022.fa
```
RepeatMasker output is in iRic1_RM_original and iRic2_RM_original.
The results of this run were disappointing. At least 50% of the genome should be repetitive and I'm not getting that.

I think all of the stuff >12kb that I left out are causing the problem.
Reincorporated those elements into the library as follows.
Working in /lustre/scratch/daray/ixodes/over.12kb
Renamed the consensus sequences with #Unknown/Unknown.
```
for FILE in /lustre/scratch/daray/ixodes/over.12kb/splits1/*.fa; do
CONSNAME=$(basename $FILE .fa)
HEADER=${CONSNAME}#Unknown/Unknown
echo $HEADER
sed "s|$CONSNAME|$HEADER|g" $FILE >/lustre/scratch/daray/ixodes/over.12kb/${CONSNAME}.fa
done
```
Repeated for iRic2 sequences in splits2.
Concatenated each set into iRic1_over_12kb.fa and iRic2_over_12kb.fa
Moved to /lustre/scratch/daray/ixodes/final_library_work
Concatenated each of those files with ixodes_library_final_050202022.fa to create ixodes_library_final_05232022.fa.
Using that as my library for Repeatmasker.
Also need to find out what all of these huge consensus sequences are.
Running TE-Aid on all of them to see.
I'm concerned that some of these are large overextensions of real TEs and that RMasker may be missing something by leaving them out.
For example, one consensus (TE-Aid plot not reproduced here) is a tandem Penelope. Is this Penelope not being called by RMasker using one of the other Penelope consensus sequences?

Another (plot not reproduced here) has a bunch of LTR/Gypsy components, and it looks like there are hundreds of copies of the right half, which would be about the right size for a regular old Gypsy element.
### Plotting for ricinus paper
Violin plots. First, run catdata:
```
cd /lustre/scratch/daray/ixodes/final_library_work/plots
sbatch catdata_props_PLE.py
```
Select which TE types to examine by counting the major types as follows:
```
cd /lustre/scratch/daray/ixodes/final_library_work/iRic1_RM
awk '{print $8}' iRic1_DNA_rm.bed | sort | uniq >iRic1_DNA_types.txt
cat iRic1_DNA_types.txt | while read i; do COUNT=$(grep $i iRic1_DNA_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT; done
```
Examination suggests these major players:
```
P 36254
PIF-Harbinger 18728
PiggyBac 2418
Sola 48508
Sola-2 19629
TIR-like 1373418
TcMar 590731
TcMariner 576530
Unknown 505711
nSola2 28770
piggyBac 369028
```
Notice the potential for double-counted elements (e.g., `grep TcMar` also matches the TcMariner entries). Need to fix this: use only the TcMar line and combine the Sola variants.
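Part of the double counting comes from `grep $i` matching substrings (`Sola` also hits `Sola-2`; `TcMar` also hits `TcMariner`). Counting the whole column-8 token avoids that; a sketch (function name is mine, field position taken from the bed layout used above):

```python
from collections import Counter

def count_types(bed_text, column=8):
    """Count exact values in the given 1-based whitespace-delimited
    column, so 'Sola' and 'Sola-2' are tallied separately."""
    counts = Counter()
    for line in bed_text.splitlines():
        fields = line.split()
        if len(fields) >= column:
            counts[fields[column - 1]] += 1
    return counts
```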
Used this with modifications for all three:
```
sed 's|hAT-Ac|hAT|g' iRic1_DNA_rm.bed >iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-Blackjack|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-Charlie|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-Pegasus|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-Tag1|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-Tip100?|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-Tip100|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-hAT19|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-hAT5|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-hATm|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-hobo|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|nSola2|Sola|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|PiggyBac|piggyBac|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Sola-2|Sola|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TIR-like|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Fot1?|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Fot1|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-ISRm11|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Mariner|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Pogo|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Tc1?|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Tc1|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Tc2|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-m44|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar?|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-AAG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-AGC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CAC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CAG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CAT|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CCA|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CCC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CCG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CGA|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CGG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CTC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CTG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GAC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GAG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GCA|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GCC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GCG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GGA|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GGC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GGG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GGT|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GTA|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GTC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GTG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-TCA|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-TGC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-TGG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-TGT|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-TTC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-TTG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-AAA|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-AAT|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-ACC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-ATG|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CAA|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CAC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CAG|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CAT|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CCA|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CCC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CCT|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CTA|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CTC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-GAG|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-GCC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-GGC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-GGG|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-GTA|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-GTG|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-TATT|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-TAT|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-TGC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-TGT|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
```
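The sed chain above is order-sensitive (a general pattern run first can mangle a more specific one, e.g. `hAT-Tip100` vs `hAT-Tip100?`). A dictionary-plus-prefix relabel sidesteps ordering; a sketch in Python (names are mine, and the mapping shown is only a subset of the rules above):

```python
# Exact renames, expressed as data rather than ordered sed commands.
CONSOLIDATE = {
    "PiggyBac": "piggyBac",
    "Sola-2": "Sola",
    "nSola2": "Sola",
    "TIR-like": "Unknown-TIR-like",
}

def consolidate_type(te_type):
    """Collapse one whole repeat-type token to its family label:
    exact lookup first, then prefix rules for the variant names."""
    if te_type in CONSOLIDATE:
        return CONSOLIDATE[te_type]
    if te_type.startswith("hAT-"):
        return "hAT"
    if te_type.startswith(("TcMar-", "TcMariner")):
        return "TcMar"
    if te_type.startswith("Unknown-TIR-"):
        return "Unknown-TIR-like"
    return te_type.rstrip("?")
```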
Do the same for LINE, LTR, Unknown, RC, PLE
```
TAXONLIST="iRic1 iRic2 iSca"
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_LINE_rm.bed | sort | uniq >${NAME}_LINE_types.txt
cat ${NAME}_LINE_types.txt | while read i; do COUNT=$(grep $i ${NAME}_LINE_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_LINE_counts.txt; done
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_LTR_rm.bed | sort | uniq >${NAME}_LTR_types.txt
cat ${NAME}_LTR_types.txt | while read i; do COUNT=$(grep $i ${NAME}_LTR_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_LTR_counts.txt; done
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_RC_rm.bed | sort | uniq >${NAME}_RC_types.txt
cat ${NAME}_RC_types.txt | while read i; do COUNT=$(grep $i ${NAME}_RC_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_RC_counts.txt; done
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_Unknown_rm.bed | sort | uniq >${NAME}_Unknown_types.txt
cat ${NAME}_Unknown_types.txt | while read i; do COUNT=$(grep $i ${NAME}_Unknown_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_Unknown_counts.txt; done
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_PLE_rm.bed | sort | uniq >${NAME}_PLE_types.txt
cat ${NAME}_PLE_types.txt | while read i; do COUNT=$(grep $i ${NAME}_PLE_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_PLE_counts.txt; done
done
```
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iSca"
for TAXON in $LIST; do
cd ${TAXON}_RM
sed 's|CR1-Zenon|CR1|g' ${TAXON}_LINE_rm.bed >${TAXON}_LINE_consolidated_rm.bed
sed -i 's|I-Jockey|I|g' ${TAXON}_LINE_consolidated_rm.bed
sed -i 's|L1-Tx1|L1|g' ${TAXON}_LINE_consolidated_rm.bed
sed -i 's|R2-NeSL|R2|g' ${TAXON}_LINE_consolidated_rm.bed
sed -i 's|RTE-BovB|RTE|g' ${TAXON}_LINE_consolidated_rm.bed
sed -i 's|RTE-RTE|RTE|g' ${TAXON}_LINE_consolidated_rm.bed
sed -i 's|RTE-X|RTE|g' ${TAXON}_LINE_consolidated_rm.bed
sed -i 's|Tx1|L1|g' ${TAXON}_LINE_consolidated_rm.bed
sed 's|Copia_internal|Copia|g' ${TAXON}_LTR_rm.bed >${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Copia_ltr|Copia|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Gypsy_internal|Gypsy|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Gypsy_int|Gypsy|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Gypsy_ltr|Gypsy|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTR-like_internal|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTR-like_ltr|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTR-like|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTRlike_internal|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTRlike_int|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTRlike_ltr|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTRlike|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTR_internal|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTR_ltr|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Pao_internal|Pao|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Pao_int|Pao|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Pao_ltr|Pao|g' ${TAXON}_LTR_consolidated_rm.bed
sed 's|unknown|Unknown|g' ${TAXON}_Unknown_rm.bed >${TAXON}_Unknown_consolidated_rm.bed
cd /lustre/scratch/daray/ixodes/final_library_work
done
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iSca"
for TAXON in $LIST; do
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
done
```
For downstream plots, copying these 'consolidated' files to the plotting directory. Will run catdata.
Concatenate them into a new all_rm_bed file.
```
cd /lustre/scratch/daray/ixodes/final_library_work/plots
cat ../iRic1_RM/*consolidated_rm.bed >iRic1_all_consolidated_rm.bed
cat ../iRic2_RM/*consolidated_rm.bed >iRic2_all_consolidated_rm.bed
cat ../iSca_RM/*consolidated_rm.bed >iSca_all_consolidated_rm.bed
cp iRic1_all_consolidated_rm.bed iRic1_all_rm.bed
cp iRic2_all_consolidated_rm.bed iRic2_all_rm.bed
cp iSca_all_consolidated_rm.bed iSca_all_rm.bed
sbatch cat_data.sh
cp iRic1_all_rm.bed iRic1_rm.bed
cp iRic2_all_rm.bed iRic2_rm.bed
cp iSca_all_rm.bed iSca_rm.bed
python filter_beds.py -g sizefile_mrates.txt -d 50
```
```
LIST="hAT Maverick MULE-MuDR P PIF-Harbinger piggyBac Sola TcMar"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Helentron Helitron"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="tRNA-Deu"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Unknown"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Copia Gypsy LTR Pao"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="L1 L2 CR1 I Jockey R1 RTE"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Penelope"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
```
Compile into category dataframes for generating the violin plots
```
LIST="hAT Maverick MULE-MuDR P PIF-Harbinger piggyBac Sola TcMar"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allDNA_violinframe.txt
done
LIST="Helentron Helitron"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allRC_violinframe.txt
done
LIST="tRNA-Deu"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allSINE_violinframe.txt
done
LIST="Unknown"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allUnknown_violinframe.txt
done
LIST="Copia Gypsy LTR Pao"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allLTR_violinframe.txt
done
LIST="L1 L2 CR1 I Jockey R1 RTE"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allLINE_violinframe.txt
done
LIST="Penelope"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allPLE_violinframe.txt
done
```
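The per-category loops above repeat the same three commands for each superfamily group; they can be collapsed into one pass driven by an associative array mapping group to families. A minimal sketch with toy two-line frames standing in for the real iSca_scaled_<TE>_violinframe.txt files (only two groups shown):

```shell
# Sketch only: same sed/cat logic as above, driven by one group->families map.
cd "$(mktemp -d)"
declare -A CATS=(
  [allRC]="Helentron Helitron"
  [allPLE]="Penelope"
)
# Toy stand-ins for the real iSca_scaled_<TE>_violinframe.txt inputs.
for TE in Helentron Helitron Penelope; do
  printf 'Taxon,Div\niSca,5.0\niRic1,7.0\n' > iSca_scaled_${TE}_violinframe.txt
done
for GROUP in "${!CATS[@]}"; do
  for TE in ${CATS[$GROUP]}; do
    sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >> both_${TE}_violinframe.txt
    sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
    cat both_${TE}_violinframe.txt >> both_${GROUP}_violinframe.txt
  done
done
grep -c "iSca_" both_allRC_violinframe.txt   # counts the per-family iSca rows
```

Note the array is named CATS rather than GROUPS, since GROUPS is a read-only special variable in bash.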
```
grep -v "Taxon,Div" both_allDNA_violinframe.txt >both_all_DNA_violinframe.txt
grep -v "Taxon,Div" both_allRC_violinframe.txt >both_all_RC_violinframe.txt
grep -v "Taxon,Div" both_allPLE_violinframe.txt >both_all_PLE_violinframe.txt
grep -v "Taxon,Div" both_allEINE_violinframe.txt >both_all_SINE_violinframe.txt
grep -v "Taxon,Div" both_allSINE_violinframe.txt >both_all_SINE_violinframe.txt
grep -v "Taxon,Div" both_allLINE_violinframe.txt >both_all_LINE_violinframe.txt
grep -v "Taxon,Div" both_allLTR_violinframe.txt >both_all_LTR_violinframe.txt
grep -v "Taxon,Div" both_allUnknown_violinframe.txt >both_all_Unknown_violinframe.txt
echo "Taxon,Div" >header.txt
cat header.txt both_all_Unknown_violinframe.txt >both_allUnknown_violinframe.txt
cat header.txt both_all_LTR_violinframe.txt >both_allLTR_violinframe.txt
cat header.txt both_all_LINE_violinframe.txt >both_allLINE_violinframe.txt
cat header.txt both_all_SINE_violinframe.txt >both_allSINE_violinframe.txt
cat header.txt both_all_PLE_violinframe.txt >both_allPLE_violinframe.txt
cat header.txt both_all_DNA_violinframe.txt >both_allDNA_violinframe.txt
cat header.txt both_all_RC_violinframe.txt >both_allRC_violinframe.txt
```
```
import pandas as pd
import os
import glob
import seaborn as sns
import matplotlib.pyplot as plt
plt.switch_backend('Agg')
from pylab import savefig
LIST = ("DNA", "LINE", "LTR", "PLE", "RC", "SINE", "Unknown")
for TETYPE in LIST:
    VIOLINFRAME = pd.read_csv('both_all' + TETYPE + '_violinframe.txt')
    DIMS = (3, 10)
    FIG, ax = plt.subplots(figsize=DIMS)
    sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
    ax.set_title(TETYPE + 's')
    ax.yaxis.grid(True)
    ax.yaxis.tick_right()
    plt.xticks(rotation=0)
    FIG.savefig(TETYPE + 's_violin' + '.png')
```
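The strip-header/re-add-header round trip above (grep -v, then cat header.txt) can be done in one awk pass that keeps only the first Taxon,Div line. A sketch on a toy frame:

```shell
# One-pass header dedup: print every non-header line, plus the first header only.
printf 'Taxon,Div\niSca,5.1\nTaxon,Div\niRic1,7.3\n' > demo_violinframe.txt
awk '$0 != "Taxon,Div" || !seen++' demo_violinframe.txt > demo_dedup.txt
cat demo_dedup.txt
```

The !seen++ guard is true only the first time a header line is seen, so later duplicate headers are dropped in place.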
```
cat both_all_DNA_violinframe.txt both_all_LINE_violinframe.txt both_all_LTR_violinframe.txt both_all_PLE_violinframe.txt both_all_RC_violinframe.txt both_all_SINE_violinframe.txt both_all_Unknown_violinframe.txt >both_all_TE_violinframe.txt
cat header.txt both_all_TE_violinframe.txt >both_allTE_violinframe.txt
```
```
import pandas as pd
import os
import glob
import seaborn as sns
import matplotlib.pyplot as plt
plt.switch_backend('Agg')
from pylab import savefig
VIOLINFRAME = pd.read_csv('both_allTE_violinframe.txt')
DIMS = (15, 10)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('AllTEs')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('AllTEs_violin' + '.png')
```
```
grep iSca both_allTE_violinframe.txt >iSca_all_TE_violinframe.txt
cat header.txt iSca_all_TE_violinframe.txt >iSca_allTE_violinframe.txt
grep iRic1 both_allTE_violinframe.txt >iRic1_all_TE_violinframe.txt
cat header.txt iRic1_all_TE_violinframe.txt >iRic1_allTE_violinframe.txt
```
```
import pandas as pd
import os
import glob
import seaborn as sns
import matplotlib.pyplot as plt
plt.switch_backend('Agg')
from pylab import savefig
VIOLINFRAME = pd.read_csv('iSca_allTE_violinframe.txt')
DIMS = (15, 5)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('iSca_AllTEs')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('iSca_AllTEs_violin' + '.png')
VIOLINFRAME = pd.read_csv('iRic1_allTE_violinframe.txt')
DIMS = (15, 5)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('iRic1_AllTEs')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('iRic1_AllTEs_violin' + '.png')
```
These look bad. The Unknowns completely overwhelm all other TE types.
Separate the Unknowns from the other TE types.
```
grep -v Unknown iSca_allTE_violinframe.txt >iSca_all_TEminusUnknown_violinframe.txt
cat header.txt iSca_all_TEminusUnknown_violinframe.txt >iSca_allTEminusUnknown_violinframe.txt
grep -v Unknown iRic1_allTE_violinframe.txt >iRic1_all_TEminusUnknown_violinframe.txt
cat header.txt iRic1_all_TEminusUnknown_violinframe.txt >iRic1_allTEminusUnknown_violinframe.txt
```
```
import pandas as pd
import os
import glob
import seaborn as sns
import matplotlib.pyplot as plt
plt.switch_backend('Agg')
from pylab import savefig
VIOLINFRAME = pd.read_csv('iSca_allTEminusUnknown_violinframe.txt')
DIMS = (15, 5)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('iSca_AllTEminusUnknown')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('iSca_AllTEminusUnknown_violin' + '.png')
VIOLINFRAME = pd.read_csv('iRic1_allTEminusUnknown_violinframe.txt')
DIMS = (15, 5)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('iRic1_AllTEminusUnknown')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('iRic1_AllTEminusUnknown_violin' + '.png')
```
Some elements are showing up in one species but not the other.
```
hAT        hAT
Maverick   Maverick
MULE       MULE
P          P
Harbinger  Harbinger
piggyBac   piggyBac
Sola       Sola
TcMar      TcMar
L1         L1
L2         L2
CR1        CR2
I          I
Jockey
R1         R1
RTE        RTE
Copia
Gypsy      Gypsy
LTR        LTR
Pao        Pao
Penelope   Penelope
Helentron  Helentron
Helitron
Deu        Deu
```
This may be because of the 100 bp insertion cutoff.
Changed the scaledviolinplot.py script to a 75 bp cutoff. Reran it; will need to reconcatenate all of the violinframe.txt files and then replot.
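The 75 bp cutoff itself lives inside scaledviolinplot_div.py, but the filtering step amounts to dropping insertions below a minimum length. A toy bed-style sketch (column layout assumed here: contig, start, end, family; length = end - start):

```shell
# Keep only insertions >= 75 bp (toy bed-like input, hypothetical family names).
printf 'c1\t100\t160\tfamA\nc1\t200\t310\tfamB\n' > demo.bed
awk -v min=75 '($3 - $2) >= min' demo.bed > demo_min75.bed
cat demo_min75.bed
```

Only famB (110 bp) survives the cutoff; famA (60 bp) is dropped.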
```
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="hAT Maverick MULE-MuDR P PIF-Harbinger piggyBac Sola TcMar"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Helentron Helitron"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="tRNA-Deu"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Unknown"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Copia Gypsy LTR Pao"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="L1 L2 CR1 I Jockey R1 RTE"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Penelope"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="hAT Maverick MULE-MuDR P PIF-Harbinger piggyBac Sola TcMar"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_min75_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allDNA_min75_violinframe.txt
done
LIST="Helentron Helitron"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_min75_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allRC_min75_violinframe.txt
done
LIST="tRNA-Deu"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_min75_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allSINE_min75_violinframe.txt
done
LIST="Unknown"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_min75_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allUnknown_min75_violinframe.txt
done
LIST="Copia Gypsy LTR Pao"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_min75_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allLTR_min75_violinframe.txt
done
LIST="L1 L2 CR1 I Jockey R1 RTE"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_vmin75_iolinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allLINE_min75_violinframe.txt
done
LIST="Penelope"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_min75_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allPLE_min75_violinframe.txt
done
grep -v "Taxon,Div" both_allDNA_min75_violinframe.txt >both_all_DNA_min75_violinframe.txt
grep -v "Taxon,Div" both_allRC_min75_violinframe.txt >both_all_RC_min75_violinframe.txt
grep -v "Taxon,Div" both_allPLE_min75_violinframe.txt >both_all_PLE_min75_violinframe.txt
grep -v "Taxon,Div" both_allSINE_min75_violinframe.txt >both_all_SINE_min75_violinframe.txt
grep -v "Taxon,Div" both_allLINE_min75_violinframe.txt >both_all_LINE_min75_violinframe.txt
grep -v "Taxon,Div" both_allLTR_min75_violinframe.txt >both_all_LTR_min75_violinframe.txt
grep -v "Taxon,Div" both_allUnknown_min75_violinframe.txt >both_all_Unknown_min75_violinframe.txt
echo "Taxon,Div" >header.txt
cat header.txt both_all_Unknown_min75_violinframe.txt >both_allUnknown_min75_violinframe.txt
cat header.txt both_all_LTR_min75_violinframe.txt >both_allLTR_min75_violinframe.txt
cat header.txt both_all_LINE_min75_violinframe.txt >both_allLINE_min75_violinframe.txt
cat header.txt both_all_SINE_min75_violinframe.txt >both_allSINE_min75_violinframe.txt
cat header.txt both_all_PLE_min75_violinframe.txt >both_allPLE_min75_violinframe.txt
cat header.txt both_all_DNA_min75_violinframe.txt >both_allDNA_min75_violinframe.txt
cat header.txt both_all_RC_min75_violinframe.txt >both_allRC_min75_violinframe.txt
cat both_all_DNA_min75_violinframe.txt both_all_LINE_min75_violinframe.txt both_all_LTR_min75_violinframe.txt both_all_PLE_min75_violinframe.txt both_all_RC_min75_violinframe.txt both_all_SINE_min75_violinframe.txt both_all_Unknown_min75_violinframe.txt >both_all_TE_min75_violinframe.txt
cat header.txt both_all_TE_min75_violinframe.txt >both_allTE_min75_violinframe.txt
grep iSca both_allTE_min75_violinframe.txt >iSca_all_TE_min75_violinframe.txt
cat header.txt iSca_all_TE_min75_violinframe.txt >iSca_allTE_min75_violinframe.txt
grep iRic1 both_allTE_min75_violinframe.txt >iRic1_all_TE_min75_violinframe.txt
cat header.txt iRic1_all_TE_min75_violinframe.txt >iRic1_allTE_min75_violinframe.txt
grep -v Unknown iSca_allTE_min75_violinframe.txt >iSca_all_TEminusUnknown_min75_violinframe.txt
cat header.txt iSca_all_TEminusUnknown_min75_violinframe.txt >iSca_allTEminusUnknown_min75_violinframe.txt
grep -v Unknown iRic1_allTE_min75_violinframe.txt >iRic1_all_TEminusUnknown_min75_violinframe.txt
cat header.txt iRic1_all_TEminusUnknown_min75_violinframe.txt >iRic1_allTEminusUnknown_min75_violinframe.txt
```
```
import pandas as pd
import os
import glob
import seaborn as sns
import matplotlib.pyplot as plt
plt.switch_backend('Agg')
from pylab import savefig
VIOLINFRAME = pd.read_csv('iSca_allTEminusUnknown_min75_violinframe.txt')
DIMS = (15, 5)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('iSca_AllTEminusUnknown_min75')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('iSca_AllTEminusUnknown_violin' + '_min75.png')
VIOLINFRAME = pd.read_csv('iRic1_allTEminusUnknown_min75_violinframe.txt')
DIMS = (15, 5)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('iRic1_AllTEminusUnknown_min75')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('iRic1_AllTEminusUnknown_violin' + '_min75.png')
```
Generate hit counts:
```
LIST="iRic1 iRic2 iSca"
for NAME in $LIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_Unknown_consolidated_rm.bed | sort | uniq >${NAME}_unique_Unknown.txt
cat ${NAME}_unique_Unknown.txt | while read i; do COUNT=$(grep $i ${NAME}_Unknown_consolidated_rm.bed | wc -l); echo $i $COUNT >> ${NAME}_Unknown_unique_hitcounts.txt; done
sort -k2 -n -r ${NAME}_Unknown_unique_hitcounts.txt >${NAME}_Unknown_unique_hitcounts_sorted.txt
done
```
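One caveat with the loop above: grep $i matches substrings, so a family whose name is a prefix of another (e.g. a hypothetical fam-1 vs fam-10) gets inflated counts, and the file is rescanned once per family. A single awk pass keyed on an exact column-4 match avoids both problems (toy input below):

```shell
# Count hits per family name (exact match on column 4) in one scan.
printf 'c1\t1\t2\tfam-1\t10\nc1\t3\t4\tfam-10\t20\nc2\t5\t6\tfam-1\t30\n' > demo_rm.bed
awk '{count[$4]++} END {for (n in count) print n, count[n]}' demo_rm.bed \
  | sort -k2 -n -r > demo_hitcounts_sorted.txt
cat demo_hitcounts_sorted.txt
```

fam-1 is counted twice and fam-10 once; a grep-based count of "fam-1" would have reported three.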
Reworking the line plots to get rid of extra stuff. Altered curated_landscapes_bluePLE_mod.py.
```
LIST="iRic1 iRic2 iSca"
for NAME in $LIST; do
python curated_landscapes_bluePLE_mod.py -d 50 -g sizefile_mrates.txt
done
```
### January 2023:
Noted that there were also satellite/LINEs in the iSca data. Needed to go back and separate those.
Did so. Also created a simplified library that avoids all of the corrections that needed to be made before - ixodes_library_final_01242023_simple.fa.
Re-ran RepeatMasker. Now, what was the downstream processing?
Create 'consolidated' files:
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iSca"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
done
```
For downstream plots, copying these 'consolidated' files to the plotting directory. Will run catdata.
Concatenate them into a new all_rm_bed file.
```
cd /lustre/scratch/daray/ixodes/final_library_work/plots
cat ../iRic1_RM/*consolidated_rm.bed >iRic1_all_consolidated_rm.bed
cat ../iRic2_RM/*consolidated_rm.bed >iRic2_all_consolidated_rm.bed
cat ../iSca_RM/*consolidated_rm.bed >iSca_all_consolidated_rm.bed
cp iRic1_all_consolidated_rm.bed iRic1_all_rm.bed
cp iRic2_all_consolidated_rm.bed iRic2_all_rm.bed
cp iSca_all_consolidated_rm.bed iSca_all_rm.bed
sbatch cat_data.sh
cp iRic1_all_rm.bed iRic1_rm.bed
cp iRic2_all_rm.bed iRic2_rm.bed
cp iSca_all_rm.bed iSca_rm.bed
python filter_beds.py -g sizefile_mrates.txt -d 50
```
Use the altered curated_landscapes_bluePLE_mod.py.
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes.txt
done
```
How much of the satellite portion of the genome is derived from LINE/satellites?
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do grep "S""$(printf '\t')" ../${NAME}_RM/${NAME}_Satellite_rm.bed > ../${NAME}_RM/${NAME}_LINE-Satellite_rm.bed
awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_LINE-Satellite_rm.bed; echo $NAME
done
```
18833628
iRic1
27812943
iRic2
28653645
iSca
Divide each by total genome size
18833628/2961411907 = 0.0063596786233898
iRic1
27812943/3667593454 = 0.0075834313014345
iRic2
28653645/2226883318 = 0.0128671514885361
iSca
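The sum-then-divide arithmetic above can be folded into the awk call by passing the assembly length in as a variable. A sketch with toy numbers (GENOME_SIZE here is a placeholder, not a real assembly length):

```shell
# Sum column 5 and report it as a fraction of a supplied genome size (toy values).
printf 'c1\t1\t2\tsatA\t100\nc1\t3\t4\tsatB\t150\n' > demo_sat_rm.bed
GENOME_SIZE=100000
awk -v g=$GENOME_SIZE '{sum += $5} END {printf "%d bp, fraction %.6f\n", sum, sum / g}' demo_sat_rm.bed
```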
How much from satellites more generally?
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_Satellite_rm.bed; echo $NAME
done
```
40142831/2961411907 = 0.0135553014104903
iRic1
52725396/3667593454 = 0.0143760197691748
iRic2
42767966/2226883318 = 0.0192053017121753
iSca
====================================
How much of each genome is made up by simple repeats?
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_Simple_repeat_rm.bed; echo $NAME
done
```
23064767/2961411907 = 0.0077884359637648
iRic1
30046243/3667593454 = 0.0081923592068883
iRic2
32837827/2226883318 = 0.0147460923231003
iSca
========================
Do all categories.
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do for TENAME in DNA LINE LTR PLE RC SINE Satellite Simple_repeat Unknown
do awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_${TENAME}_rm.bed; echo $NAME $TENAME
done
done
```
The unknowns are the largest category. Let's see if we can figure out what some of the largest of those groups might be.
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do echo "Name Hits Total_bp" > ${NAME}_RM/${NAME}_Unknowns_table.txt
awk '{print $4}' ${NAME}_RM/${NAME}_Unknown_consolidated_rm.bed | sort | uniq >${NAME}_RM/${NAME}_Unknown_types.txt
cat ${NAME}_RM/${NAME}_Unknown_types.txt | while read i;
do COUNT=$(grep $i ${NAME}_RM/${NAME}_Unknown_consolidated_rm.bed | wc -l | awk '{print $1}')
MASS=$(awk -v TEID="$i" '$4 == TEID {sum += $5} END {print sum} ' ${NAME}_RM/${NAME}_Unknown_consolidated_rm.bed)
echo $i $COUNT $MASS >> ${NAME}_RM/${NAME}_Unknowns_table.txt
done
(tail -n +2 ${NAME}_RM/${NAME}_Unknowns_table.txt | sort -k2 -n -r) >${NAME}_RM/${NAME}_Unknowns_table_hits.txt
(tail -n +2 ${NAME}_RM/${NAME}_Unknowns_table.txt | sort -k3 -n -r) >${NAME}_RM/${NAME}_Unknowns_table_mass.txt
done
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do tail -n +2 ${NAME}_RM/${NAME}_Unknowns_table.txt | sort -k2 -n -r > ${NAME}_RM/${NAME}_Unknowns_table_hits.txt
tail -n +2 ${NAME}_RM/${NAME}_Unknowns_table.txt | sort -k3 -n -r > ${NAME}_RM/${NAME}_Unknowns_table_mass.txt
done
```
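The count-and-mass table above rescans the bed file once per family; both columns can be accumulated in a single awk pass. A sketch with toy input using the same column layout as the rm.bed files:

```shell
# Build the Name/Hits/Total_bp table in one scan, sorted by mass (column 3).
printf 'c1\t1\t2\tunkA\t100\nc1\t3\t4\tunkB\t50\nc2\t5\t6\tunkA\t70\n' > demo_unknown_rm.bed
{ echo "Name Hits Total_bp"
  awk '{hits[$4]++; mass[$4] += $5} END {for (n in hits) print n, hits[n], mass[n]}' \
    demo_unknown_rm.bed | sort -k3 -n -r
} > demo_unknowns_table_mass.txt
cat demo_unknowns_table_mass.txt
```

Sorting before prepending the header keeps the "Name Hits Total_bp" line at the top without a tail dance.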
Started to rehash these using the old TEAid data.
Sorted into categories:
- lt5_FL_copies - just remove from library as nothing
- partial_TE - some missed Penelopes and LINEs; separate into components and return to library
- Satellite - relabel as satellites and return to library
- to_split_satellite - split out the satellite component and decide what to do with the rest
- Unknown-SD - >5 copies, possible segmental duplications
- zero_pdf - toss
Changes made, in order:
Replaced **Satellite** --> ixodes_library_final_01282023_simple.fa
Removed **lt5_FL_copies** --> ixodes_library_final_01282023_simple_unknowns_deleted.fa
Renamed **Unknown-SD** to #Unknown/ultra --> ixodes_library_final_01282023_simple_ultra.fa
Incorporated **to_split_satellite** --> ixodes_library_final_01282023_simple_tosplit.fa
Incorporated the LINEs (penelopes) from **partial_TE** --> ixodes_library_final_01282023_simple_LINEs.fa
Incorporated the unknowns from **partial_TE** --> ixodes_library_final_01282023_simple_partial_TEs_unknown.fa
Need to split the LTRs from partial_TE before adding to the library
Did so --> ixodes_library_final_01292023_simple.fa
Create 'consolidated' files:
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iSca"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
done
```
For downstream plots, copying these 'consolidated' files to the plotting directory. Will run catdata.
Concatenate them into a new all_rm_bed file.
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
cat ../iRic1_RM/*consolidated_rm.bed >iRic1_all_consolidated_rm.bed
cat ../iRic2_RM/*consolidated_rm.bed >iRic2_all_consolidated_rm.bed
cat ../iSca_RM/*consolidated_rm.bed >iSca_all_consolidated_rm.bed
cp iRic1_all_consolidated_rm.bed iRic1_all_rm.bed
cp iRic2_all_consolidated_rm.bed iRic2_all_rm.bed
cp iSca_all_consolidated_rm.bed iSca_all_rm.bed
sbatch cat_data.sh
cp iRic1_all_rm.bed iRic1_rm.bed
cp iRic2_all_rm.bed iRic2_rm.bed
cp iSca_all_rm.bed iSca_rm.bed
python filter_beds.py -g sizefile_mrates.txt -d 50
```
Use the altered curated_landscapes_bluePLE_mod.py.
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes.txt
done
```
How much of the satellite portion of the genome is derived from LINE/satellites?
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do grep "S""$(printf '\t')" ../${NAME}_RM/${NAME}_Satellite_rm.bed > ../${NAME}_RM/${NAME}_LINE-Satellite_rm.bed
awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_LINE-Satellite_rm.bed; echo $NAME
done
```
20346537
iRic1
28557933
iRic2
30223691
iSca
Divide each by total genome size
20346537/2961411907 = 0.0068705528440357
iRic1
28557933/3667593454 = 0.007786559049737
iRic2
30223691/2226883318 = 0.0135721933680586
iSca
How much from satellites more generally?
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_Satellite_rm.bed; echo $NAME
done
```
89228558
iRic1
120300373
iRic2
97170331
iSca
89228558/2961411907 = 0.0301304110343742
iRic1
120300373/3667593454 = 0.0328009018744421
iRic2
97170331/2226883318 = 0.0436351245772815
iSca
Do all categories.
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
TELIST="DNA LINE LTR PLE RC SINE Satellite Simple_repeat Unknown"
for NAME in $LIST
do for TENAME in $TELIST
do BP=$(awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_${TENAME}_rm.bed)
echo $NAME $TENAME $BP >>${NAME}_masses.txt
done
done
```
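Given the per-taxon mass files produced above, the genome fractions can be computed by joining against a taxon-to-size table. The two-column sizefile layout below (taxon, length) is an assumption for illustration, not necessarily the real sizefile_mrates.txt format:

```shell
# Join per-category masses with genome sizes to get genome fractions (toy data).
printf 'iRic1 2961411907\n' > demo_sizes.txt
printf 'iRic1 DNA 1000000\niRic1 LINE 2500000\n' > demo_masses.txt
awk 'NR == FNR { size[$1] = $2; next }
     { printf "%s %s %.6f\n", $1, $2, $3 / size[$1] }' demo_sizes.txt demo_masses.txt
```

The NR == FNR idiom loads the size table from the first file, then the second file is streamed through the printf.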
Received a new and improved I. ricinus genome assembly. Much higher quality: only 7,514 contigs vs. 20k+ in the previous versions.
Ixodes_ricinus_v1.0.fasta.gz
Calling it iRic3 for now, for ease of use.
Uploaded to /lustre/scratch/daray/ixodes/assemblies
Start analysis from the beginning to ensure nothing was missed due to the low quality of the previous assemblies:
```
cd /lustre/scratch/daray/ixodes
LIST="iRic3"
for NAME in $LIST; do mkdir $NAME; done
cd $NAME
cp /lustre/scratch/daray/curation_templates/* .
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" rmodel_template.sh >${NAME}_rmodel.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" extend_align_template.sh >${NAME}_extend_align.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" repeatclassifier_template.sh >${NAME}_repeatclassifier.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" TEcurate_template.sh >${NAME}_TEcurate.sh; done
```
Modify all the scripts to redirect properly.
```
for NAME in $LIST; do mkdir $NAME; sbatch ${NAME}_rmodel.sh; done
```
N=4222 in the iRic3-families.mod.fa file.
**Before doing anything else, reduce the complexity of the RepeatModeler output using usearch.**
Modified procedure from https://hackmd.io/n1FOvqtnTRaoKr_9aK1EPg
```
cd /lustre/scratch/daray/ixodes/iRic3
mkdir usearch_work
cd usearch_work
cp /lustre/scratch/daray/ixodes/iRic3/rmodeler_dir/iRic3-families.mod.fa .
cp /lustre/scratch/daray/ixodes/final_library_work/ixodes_library_final_01292023_simple.fa .
/lustre/work/daray/software/usearch11.0.667_i86linux32 \
-usearch_global iRic3-families.mod.fa \
-db ixodes_library_final_01292023_simple.fa \
-strand both \
-id 0.85 \
-minsl 0.95 \
-maxsl 1.05 \
-maxaccepts 1 \
-maxrejects 128 \
-userfields query+target+id+ql+tl \
-userout iRic3_hits.tsv
awk '{print $1}' iRic3_hits.tsv >iRic3_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_remove.txt -f iRic3-families.mod.fa -o iRic3_retained.fa
```
Eliminated 539. Not as many as I was hoping. Need to figure this out. Why are so many new things being found?
I'm thinking that the raw RepeatModeler output is going to be significantly shorter than the extended/finalized versions. So.....
Change lower limit to 50%
```
/lustre/work/daray/software/usearch11.0.667_i86linux32 -usearch_global iRic3-families.mod.fa -db ixodes_library_final_01292023_simple.fa -strand both -id 0.85 -minsl 0.50 -maxsl 1.05 -maxaccepts 1 -maxrejects 128 -userfields query+target+id+ql+tl -userout iRic3_hits.tsv
```
Higher hit rate, 24.2% vs. 10.5% for the previous run.
```
awk '{print $1}' iRic3_hits.tsv >iRic3_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_remove.txt -f iRic3-families.mod.fa -o iRic3_retained.fa
```
That eliminated 1244. The result is now called iRic3_retained.fa.
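The -minsl/-maxsl options bound the query/target length ratio usearch will accept (assuming, as the paired 0.95/1.05 bounds suggest, the ratio is query length over target length), so loosening -minsl to 0.50 admits queries down to half the target length. The effect can be sketched directly on the userfields columns (query, target, id, ql, tl):

```shell
# Toy hits table in the userfields layout: query target id ql tl.
printf 'q1 t1 0.90 500 1000\nq2 t2 0.90 980 1000\n' > demo_hits.tsv
echo "minsl 0.95 keeps:"; awk '($4 / $5) >= 0.95 {print $1}' demo_hits.tsv
echo "minsl 0.50 keeps:"; awk '($4 / $5) >= 0.50 {print $1}' demo_hits.tsv
```

The half-length query q1 only passes under the looser bound, which is why the hit (and removal) rate goes up.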
Go ahead and do the extension pipeline and then re-run this procedure.
**Redirect the extend_align script to the iRic3_retained.fa file.**
Changed the appropriate line in the script to:
"CONSENSUSFILE=$WORKDIR/usearch_work/iRic3_retained.fa"
```
cd /lustre/scratch/daray/ixodes/iRic3
sbatch iRic3_extend_align.sh
```
Ran into trouble with line lengths in the original assembly download.
Fixed with:
```
actconda
conda activate gatk4
gatk NormalizeFasta I=iRic3.fa O=iRic3_normal.fa
```
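NormalizeFasta fixes the ragged line lengths by rewrapping each record's sequence to a uniform width. A minimal awk sketch of the same idea (not the gatk/Picard tool itself; width 8 is just for the demo):

```shell
# Rewrap each FASTA record's sequence lines to a fixed width.
printf '>seq1\nACGTAC\nGT\nACGTACGT\n' > demo.fa
awk -v w=8 '
  function flush() { for (i = 1; i <= length(seq); i += w) print substr(seq, i, w); seq = "" }
  /^>/ { flush(); print; next }   # emit any buffered sequence, then the header
  { seq = seq $0 }                # accumulate ragged sequence lines
  END { flush() }' demo.fa > demo_normal.fa
cat demo_normal.fa
```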
Ran extension script and got output.
Concatenated the likely TEs and possible SDs:
```
cd /lustre/scratch/daray/ixodes/iRic3/extensions/images_and_alignments
cat likely_TEs/*rep.fa >../likely_TEs.fa
cat possible_SD/*rep.fa >../possible_SDs.fa
```
Now, to eliminate possible duplicates again. First with the likely TEs.
```
cd /lustre/scratch/daray/ixodes/iRic3/usearch_work
cp ../extensions/likely_TEs.fa iRic3_likely_TEs.fa
cp ../extensions/possible_SDs.fa iRic3_possible_SDs.fa
/lustre/work/daray/software/usearch11.0.667_i86linux32 \
-usearch_global iRic3_likely_TEs.fa \
-db ixodes_library_final_01292023_simple.fa \
-strand both \
-id 0.85 \
-minsl 0.95 \
-maxsl 1.05 \
-maxaccepts 1 \
-maxrejects 128 \
-userfields query+target+id+ql+tl \
-userout iRic3_likely_TEs_hits.tsv
awk '{print $1}' iRic3_likely_TEs_hits.tsv >iRic3_likely_TEs_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_likely_TEs_remove.txt -f iRic3_likely_TEs.fa -o iRic3_likely_TEs_retained.fa
```
This eliminates 1159.
Now with the possible SDs. Note change to -minsl 0.50.
```
cd /lustre/scratch/daray/ixodes/iRic3/usearch_work
/lustre/work/daray/software/usearch11.0.667_i86linux32 \
-usearch_global iRic3_possible_SDs.fa \
-db ixodes_library_final_01292023_simple.fa \
-strand both \
-id 0.85 \
-minsl 0.50 \
-maxsl 1.05 \
-maxaccepts 1 \
-maxrejects 128 \
-userfields query+target+id+ql+tl \
-userout iRic3_possible_SDs_hits.tsv
awk '{print $1}' iRic3_possible_SDs_hits.tsv >iRic3_possible_SDs_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_possible_SDs_remove.txt -f iRic3_possible_SDs.fa -o iRic3_possible_SDs_retained.fa
```
That removed 39 of the 154 possible SDs.
For the likely TEs, it might be worth going down to 80% length for the cutoff.
```
cd /lustre/scratch/daray/ixodes/iRic3/usearch_work
/lustre/work/daray/software/usearch11.0.667_i86linux32 \
-usearch_global iRic3_likely_TEs.fa \
-db ixodes_library_final_01292023_simple.fa \
-strand both \
-id 0.85 \
-minsl 0.81 \
-maxsl 1.19 \
-maxaccepts 1 \
-maxrejects 128 \
-userfields query+target+id+ql+tl \
-userout iRic3_likely_TEs_hits.tsv
awk '{print $1}' iRic3_likely_TEs_hits.tsv >iRic3_likely_TEs_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_likely_TEs_remove.txt -f iRic3_likely_TEs.fa -o iRic3_likely_TEs_retained.fa
```
Eliminated a few hundred more, 1157. I'll take it.
Collect these to use for the TEcurate run.
```
cat iRic3_likely_TEs_retained.fa iRic3_possible_SDs_retained.fa >iRic3_secondUsearch_retained.fa
```
Use iRic3_secondUsearch_retained.fa as the start for the TEcurate run.
To prepare for this, need to modify the repeatclassifier script to deal with the new file name.
Modified script.
```
#!/bin/bash
#SBATCH --job-name=iRic3_classify
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=60G
module --ignore-cache load gcc/10.1.0 r/4.0.2
. ~/conda/etc/profile.d/conda.sh
conda activate repeatmodeler
TAXON=iRic3
WORKDIR=/lustre/scratch/daray/ixodes/${TAXON}
#mkdir -p $WORKDIR/repeatclassifier
#cd $WORKDIR/extensions/final_consensuses
#Removed for specialized iRic3 run.
#cat ${TAXON}*.fa >$WORKDIR/repeatclassifier/${TAXON}_extended_rep.fa
#Renamed script from Usearch work
cd $WORKDIR/repeatclassifier
cp iRic3_secondUsearch_retained.fa ${TAXON}_extended_rep.fa
#Run RepeatClassifier
RepeatClassifier -consensi ${TAXON}_extended_rep.fa
```
```
cd /lustre/scratch/daray/ixodes/iRic3/usearch_work
mkdir ../repeatclassifier
cp iRic3_secondUsearch_retained.fa ../repeatclassifier
cd ../repeatclassifier
cd ..
sbatch iRic3_repeatclassifier.sh
```
8722584 nocona iRic3_cl daray PD 0:00 1 (Priority)
```
sbatch --dependency=afterok:8722584 iRic3_TEcurate.sh
```
Started working through output in the fordownload folder and noticed a lot of potentially previously identified Penelope elements.
Need to re-run usearch with modified options. I think that using -maxsl 1.05 caused me to miss a lot of tandem Penelope elements.
Yep. Matched ~1/3 of the elements.
```
#On local
cd ixodes/ricinus/iRic3/LINE/
cat *rep.fa >iRic3_LINEs.fa
#On HPCC
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/usearch
/lustre/work/daray/software/usearch11.0.667_i86linux32 -usearch_global iRic3_LINEs.fa -db ixodes_library_final_01292023_simple.fa -strand both -id 0.90 -minsl 0.50 -maxsl 1.50 -maxaccepts 1 -maxrejects 128 -userfields query+target+id+ql+tl -userout iRic3_LINE_hits.tsv
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only
00:00 71Mb 100.0% Reading ixodes_library_final_01292023_simple.fa
00:01 38Mb 100.0% Masking (fastnucleo)
00:02 39Mb 100.0% Word stats
00:02 39Mb 100.0% Alloc rows
00:03 138Mb 100.0% Build index
00:03 171Mb CPU has 128 cores, defaulting to 10 threads
WARNING: Max OMP threads 1
05:24 267Mb 100.0% Searching, 30.4% matched
awk '{print $1}' iRic3_LINE_hits.tsv >iRic3_LINEs_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_LINEs_remove.txt -f iRic3_LINEs.fa -o iRic3_LINEs_retained.fa
LINEDIR=/lustre/scratch/daray/ixodes/iRic3/fordownload/LINE
mkdir $LINEDIR/usearch_hits
cat iRic3_LINEs_remove.txt | while read i; do
mv $LINEDIR/${i}* $LINEDIR/usearch_hits
done
```
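One thing to watch in the move loop above: the ${i}* glob sweeps up every file whose name merely starts with the removed ID, so an ID that is a prefix of another would move too much. Anchoring the glob on the separator that follows the ID avoids that; the file names and the "." separator below are hypothetical, for illustration only:

```shell
# Hypothetical family file names; "." after the ID is an assumed separator.
cd "$(mktemp -d)"
mkdir -p LINE/usearch_hits
touch LINE/fam-1.rep.fa LINE/fam-10.rep.fa
i=fam-1
mv LINE/${i}.* LINE/usearch_hits/   # ${i}* alone would also grab fam-10.rep.fa
ls LINE
```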
Do the same with LTRs and NOHITs. Could be helpful.
```
#On HPCC
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/LTR
cat *rep.fa >iRic3_LTRs.fa
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/usearch
cp /lustre/scratch/daray/ixodes/iRic3/fordownload/LTR/iRic3_LTRs.fa .
/lustre/work/daray/software/usearch11.0.667_i86linux32 -usearch_global iRic3_LTRs.fa -db ixodes_library_final_01292023_simple.fa -strand both -id 0.80 -minsl 0.50 -maxsl 1.50 -maxaccepts 1 -maxrejects 128 -userfields query+target+id+ql+tl -userout iRic3_LTR_hits.tsv
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only
00:01 71Mb 100.0% Reading ixodes_library_final_01292023_simple.fa
00:01 38Mb 100.0% Masking (fastnucleo)
00:02 39Mb 100.0% Word stats
00:02 39Mb 100.0% Alloc rows
00:04 138Mb 100.0% Build index
00:04 171Mb CPU has 128 cores, defaulting to 10 threads
WARNING: Max OMP threads 1
06:03 279Mb 100.0% Searching, 19.6% matched
WARNING: Input has lower-case masked sequences
awk '{print $1}' iRic3_LTR_hits.tsv >iRic3_LTRs_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_LTRs_remove.txt -f iRic3_LTRs.fa -o iRic3_LTRs_retained.fa
LTRDIR=/lustre/scratch/daray/ixodes/iRic3/fordownload/LTR
mkdir $LTRDIR/usearch_hits
cat iRic3_LTRs_remove.txt | while read i; do
mv $LTRDIR/${i}* $LTRDIR/usearch_hits
done
```
Only 6, but ok.
NOHITs
```
#On HPCC
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/NOHIT
cat *rep.fa >iRic3_NOHITs.fa
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/usearch
cp /lustre/scratch/daray/ixodes/iRic3/fordownload/NOHIT/iRic3_NOHITs.fa .
/lustre/work/daray/software/usearch11.0.667_i86linux32 -usearch_global iRic3_NOHITs.fa -db ixodes_library_final_01292023_simple.fa -strand both -id 0.80 -minsl 0.50 -maxsl 1.50 -maxaccepts 1 -maxrejects 128 -userfields query+target+id+ql+tl -userout iRic3_NOHIT_hits.tsv
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only
00:00 71Mb 100.0% Reading ixodes_library_final_01292023_simple.fa
00:00 38Mb 100.0% Masking (fastnucleo)
00:01 39Mb 100.0% Word stats
00:01 39Mb 100.0% Alloc rows
00:02 138Mb 100.0% Build index
00:02 171Mb CPU has 128 cores, defaulting to 10 threads
WARNING: Max OMP threads 1
03:43 344Mb 100.0% Searching, 20.3% matched
WARNING: Input has lower-case masked sequences
awk '{print $1}' iRic3_NOHIT_hits.tsv >iRic3_NOHITs_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_NOHITs_remove.txt -f iRic3_NOHITs.fa -o iRic3_NOHITs_retained.fa
NOHITDIR=/lustre/scratch/daray/ixodes/iRic3/fordownload/NOHIT
mkdir $NOHITDIR/usearch_hits
cat iRic3_NOHITs_remove.txt | while read i; do
mv $NOHITDIR/${i}* $NOHITDIR/usearch_hits
done
```
Got rid of 83.
Have been triaging and finalizing the results on a local computer.
Need to work through the LTR retrotransposons and split off the LTR and internal fragments. Use the code from Jessica Storer.
Transfer the files to HPCC and work on them there.
```
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/from_local_work/LTR/good
cp /lustre/scratch/daray/ixodes/iSca/round1-naming/combined_naming/LTR/ltr_split/crossMatchLTRParse.mod.py .
cp /lustre/scratch/daray/ixodes/iSca/round1-naming/combined_naming/LTR/ltr_split/GrepCrossmatch .
mkdir crossmatchoutput
for FILE in *mod.fa
do TENAME=$(basename $FILE rep_mod.fa)
/lustre/work/daray/software/cross_match/cross_match $FILE >crossmatchoutput/${TENAME}.crossmatch.out
done
for FILE in *mod.fa
do TENAME=$(basename $FILE rep_mod.fa)
perl GrepCrossmatch crossmatchoutput/${TENAME}.crossmatch.out > crossmatchoutput/${TENAME}.crossmatch.linesOnly.out
done
actconda
for FILE in *mod.fa
do TENAME=$(basename $FILE rep_mod.fa)
python crossMatchLTRParse.mod.py crossmatchoutput/${TENAME}.crossmatch.linesOnly.out $FILE ${TENAME}.LTR_int_split.fa
done
```
11 of them didn't work. Will need to handle manually.
Did so.
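For the record, listing which inputs failed to split can be scripted (a sketch; assumes the `*rep_mod.fa` to `*.LTR_int_split.fa` naming from the block above, with toy file names):

```shell
# Toy demo in a sandbox; in practice run the loop in the LTR 'good' directory.
cd "$(mktemp -d)"
touch TE1_rep_mod.fa TE2_rep_mod.fa        # two hypothetical inputs
echo ">TE1" >TE1_.LTR_int_split.fa         # only TE1 produced split output
for FILE in *rep_mod.fa; do
    TENAME=$(basename $FILE rep_mod.fa)    # note: keeps the trailing underscore
    [ -s "${TENAME}.LTR_int_split.fa" ] || echo "no split output: $FILE"
done
```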
Moving on to NOHITs.
Reclassified the presumed nonautonomous DNA transposons.
```
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/from_local_work/NOHIT/nonautonomous_DNA
for FILE in *rep_mod.fa; do
TENAME=$(basename $FILE _rep_mod.fa)
sed "s|Unknown/Unknown|DNA/TIR-like|g" $FILE >${TENAME}_rep_mod_rename.fa
done
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/from_local_work/NOHIT/satellite
for FILE in *rep_mod.fa; do
TENAME=$(basename $FILE _rep_mod.fa)
sed "s|Unknown/Unknown|Satellite/Satellite|g" $FILE >${TENAME}_rep_mod_rename.fa
done
```
Also realized I need to reclassify the penelopes.
```
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/from_local_work/LINE/good-complete
for FILE in *rep_mod.fa; do
TENAME=$(basename $FILE _rep_mod.fa)
sed -i "s|LINE/Penelope|PLE/Penelope|g" $FILE
done
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/from_local_work/LINE/to_trim_penelope-complete
for FILE in *rep_mod_trim.fa; do
TENAME=$(basename $FILE _rep_mod_trim.fa)
sed -i "s|LINE/Penelope|PLE/Penelope|g" $FILE
done
```
Finished everything up as follows:
```
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/from_local_work
cd DNA-complete/
ls
cd good-complete/
cat *_rep_mod.fa >../good-DNA.fa
cd ../
cat unknown-complete/*_rep_mod.fa >good-unknown.fa
cd ..
cd LINE-complete/
cat good-complete/*_rep_mod.fa >good-LINE.fa
cat to_trim_penelope-complete/*trim.fa >good-penelopetrim.fa
cat to_trim_satellite-complete/*trim.fa >good-satellitetrim.fa
cd ../LTR-complete/
cat good-complete/*split.fa >good-ltr.fa
cat to_trim_tandem-complete/*trim.fa >good-ltrtrim.fa
cd ../NOHIT-complete/
cat nonautonomous_DNA-complete/*rename.fa >good-nonautonomous.fa
cat satellite-complete/*rename.fa >good-satellite.fa
cat unknown-complete/*mod.fa >good-unknown.fa
cd ../RC-complete/
cat *mod.fa >good-RC.fa
cd ..
cat DNA-complete/good*.fa LINE-complete/good*.fa LTR-complete/good*.fa NOHIT-complete/good*.fa RC-complete/good*.fa >iRic3_complete.fa
cd ../../../
cd final_library_work/
cat ixodes_library_final_01292023_simple.fa ../iRic3/fordownload/from_local_work/iRic3_complete-simple.fa >ixodes_library_final_03142023_simple.fa
```
New final library --> ixodes_library_final_03142023_simple.fa
Run RepeatMasker on all three genome assemblies.
Appropriate edits and...
```
sbatch iRic1_RM.sh
sbatch iRic2_RM.sh
sbatch iRic3_RM.sh
sbatch iSca_RM.sh
```
### March 2023:
Now that all the RepeatMasker runs are done, get back to analyzing the output.
Create 'consolidated' files:
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iRic3 iSca"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
done
```
For downstream plots, copying these 'consolidated' files to the plotting directory. Will run catdata.
Concatenate them into a new all_rm_bed file.
Need to add iRic3 to sizefile.txt, sizefile_mrates_4_curated_landscapes.txt, and sizefile_mrates.txt.
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
cat ../iRic1_RM/*consolidated_rm.bed >iRic1_all_consolidated_rm.bed
cat ../iRic2_RM/*consolidated_rm.bed >iRic2_all_consolidated_rm.bed
cat ../iRic3_RM/*consolidated_rm.bed >iRic3_all_consolidated_rm.bed
cat ../iSca_RM/*consolidated_rm.bed >iSca_all_consolidated_rm.bed
cp iRic1_all_consolidated_rm.bed iRic1_all_rm.bed
cp iRic2_all_consolidated_rm.bed iRic2_all_rm.bed
cp iRic3_all_consolidated_rm.bed iRic3_all_rm.bed
cp iSca_all_consolidated_rm.bed iSca_all_rm.bed
sbatch cat_data.sh
cp iRic1_all_rm.bed iRic1_rm.bed
cp iRic2_all_rm.bed iRic2_rm.bed
cp iRic3_all_rm.bed iRic3_rm.bed
cp iSca_all_rm.bed iSca_rm.bed
### Wait for cat_data.sh to be finished.
python filter_beds.py -g sizefile_mrates.txt -d 50
```
Use altered curated_landscapes_bluePLE_mod.py
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes.txt
```
STOP. Something is messed up with the unknowns in iRic3


Need to do some investigating.
What is forming that huge peak?
```
TAXONLIST="iRic1 iRic2 iRic3 iSca"
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_Unknown_rm.bed | sort | uniq >${NAME}_Unknown_IDs.txt
cat ${NAME}_Unknown_IDs.txt | while read i; do COUNT=$(grep $i ${NAME}_Unknown_rm.bed | wc -l | awk '{print $1}')
SUM=$(awk -v i="$i" '$4 == i {sum += $5} END {print sum}' ${NAME}_Unknown_rm.bed)
echo $i $COUNT $SUM >>${NAME}_Unknown_IDs_counts.txt
done
sort -k3 -n -r ${NAME}_Unknown_IDs_counts.txt >${NAME}_Unknown_IDs_sort_counts.txt
done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_Unknown_rm.bed | sort | uniq >${NAME}_Unknown_types.txt
cat ${NAME}_Unknown_types.txt | while read i; do COUNT=$(grep $i ${NAME}_Unknown_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_Unknown_counts.txt; done; done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_LINE_rm.bed | sort | uniq >${NAME}_LINE_IDs.txt
cat ${NAME}_LINE_IDs.txt | while read i; do COUNT=$(grep $i ${NAME}_LINE_rm.bed | wc -l | awk '{print $1}')
SUM=$(awk -v i="$i" '$4 == i {sum += $5} END {print sum}' ${NAME}_LINE_rm.bed)
echo $i $COUNT $SUM >>${NAME}_LINE_IDs_counts.txt
done
sort -k3 -n -r ${NAME}_LINE_IDs_counts.txt >${NAME}_LINE_IDs_sort_counts.txt
done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_LINE_rm.bed | sort | uniq >${NAME}_LINE_types.txt
cat ${NAME}_LINE_types.txt | while read i; do COUNT=$(grep $i ${NAME}_LINE_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_LINE_counts.txt; done; done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_LTR_rm.bed | sort | uniq >${NAME}_LTR_IDs.txt
cat ${NAME}_LTR_IDs.txt | while read i; do COUNT=$(grep $i ${NAME}_LTR_rm.bed | wc -l | awk '{print $1}')
SUM=$(awk -v i="$i" '$4 == i {sum += $5} END {print sum}' ${NAME}_LTR_rm.bed)
echo $i $COUNT $SUM >>${NAME}_LTR_IDs_counts.txt
done
sort -k3 -n -r ${NAME}_LTR_IDs_counts.txt >${NAME}_LTR_IDs_sort_counts.txt
done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_LTR_rm.bed | sort | uniq >${NAME}_LTR_types.txt
cat ${NAME}_LTR_types.txt | while read i; do COUNT=$(grep $i ${NAME}_LTR_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_LTR_counts.txt; done; done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_RC_rm.bed | sort | uniq >${NAME}_RC_IDs.txt
cat ${NAME}_RC_IDs.txt | while read i; do COUNT=$(grep $i ${NAME}_RC_rm.bed | wc -l | awk '{print $1}')
SUM=$(awk -v i="$i" '$4 == i {sum += $5} END {print sum}' ${NAME}_RC_rm.bed)
echo $i $COUNT $SUM >>${NAME}_RC_IDs_counts.txt
done
sort -k3 -n -r ${NAME}_RC_IDs_counts.txt >${NAME}_RC_IDs_sort_counts.txt
done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_RC_rm.bed | sort | uniq >${NAME}_RC_types.txt
cat ${NAME}_RC_types.txt | while read i; do COUNT=$(grep $i ${NAME}_RC_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_RC_counts.txt; done; done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_DNA_rm.bed | sort | uniq >${NAME}_DNA_IDs.txt
cat ${NAME}_DNA_IDs.txt | while read i; do COUNT=$(grep $i ${NAME}_DNA_rm.bed | wc -l | awk '{print $1}')
SUM=$(awk -v i="$i" '$4 == i {sum += $5} END {print sum}' ${NAME}_DNA_rm.bed)
echo $i $COUNT $SUM >>${NAME}_DNA_IDs_counts.txt
done
sort -k3 -n -r ${NAME}_DNA_IDs_counts.txt >${NAME}_DNA_IDs_sort_counts.txt
done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_DNA_rm.bed | sort | uniq >${NAME}_DNA_types.txt
cat ${NAME}_DNA_types.txt | while read i; do COUNT=$(grep $i ${NAME}_DNA_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_DNA_counts.txt; done; done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_PLE_rm.bed | sort | uniq >${NAME}_PLE_IDs.txt
cat ${NAME}_PLE_IDs.txt | while read i; do COUNT=$(grep $i ${NAME}_PLE_rm.bed | wc -l | awk '{print $1}')
SUM=$(awk -v i="$i" '$4 == i {sum += $5} END {print sum}' ${NAME}_PLE_rm.bed)
echo $i $COUNT $SUM >>${NAME}_PLE_IDs_counts.txt
done
sort -k3 -n -r ${NAME}_PLE_IDs_counts.txt >${NAME}_PLE_IDs_sort_counts.txt
done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_PLE_rm.bed | sort | uniq >${NAME}_PLE_types.txt
cat ${NAME}_PLE_types.txt | while read i; do COUNT=$(grep $i ${NAME}_PLE_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_PLE_counts.txt; done
done
```
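A caveat on the loops above: `grep $i` also counts IDs that merely contain `$i` as a substring (e.g. family-1 matches family-10), and running grep once per ID is slow. The same per-family counts and summed lengths can be computed in one awk pass per file, with exact matching on column 4. A sketch on toy data (column 4 = family ID, column 5 = element length, as above):

```shell
cd "$(mktemp -d)"
printf 'c1\t0\t100\tfamA\t100\nc1\t200\t250\tfamA\t50\nc2\t0\t70\tfamB\t70\n' >demo_Unknown_rm.bed
# one pass: ID, hit count, summed length; then sort by summed bp
awk '{n[$4]++; s[$4]+=$5} END {for (id in n) print id, n[id], s[id]}' demo_Unknown_rm.bed |
    sort -k3,3nr >demo_Unknown_IDs_sort_counts.txt
cat demo_Unknown_IDs_sort_counts.txt    # famA 2 150, then famB 1 70
```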
### April 2023:
Got a new, presumably filtered, version of the better I. ricinus genome assembly: Ixodes_ricinus_v1.2.fasta.gz, which I copied to iRic4.fa.gz.
Ran RepeatMasker using iRic4_RM.sh
Get back to analyzing the output.
Create 'consolidated' files:
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iRic4 iSca"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
done
```
For downstream plots, copying these 'consolidated' files to the plotting directory. Will run catdata.
Concatenate them into a new all_rm_bed file.
Need to add iRic4 to sizefile.txt, sizefile_mrates_4_curated_landscapes.txt, and sizefile_mrates.txt.
```
nano sizefile_mrates_4_curated_landscapes.txt
Ixodes_ricinus1 2961411907 3.0e-9 iRic1
Ixodes_ricinus2 3667593454 3.0e-9 iRic2
Ixodes_ricinus3 2288563899 3.0e-9 iRic4
Ixodes_scapularis 2226883318 3.0e-9 iSca
nano sizefile_mrates.txt
iRic1 2961411907 3.0e-9
iRic2 3667593454 3.0e-9
iRic4 2288563899 3.0e-9
iSca 2226883318 3.0e-9
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
cat ../iRic1_RM/*consolidated_rm.bed >iRic1_all_consolidated_rm.bed
cat ../iRic2_RM/*consolidated_rm.bed >iRic2_all_consolidated_rm.bed
cat ../iRic4_RM/*consolidated_rm.bed >iRic4_all_consolidated_rm.bed
cat ../iSca_RM/*consolidated_rm.bed >iSca_all_consolidated_rm.bed
cp iRic1_all_consolidated_rm.bed iRic1_all_rm.bed
cp iRic2_all_consolidated_rm.bed iRic2_all_rm.bed
cp iRic4_all_consolidated_rm.bed iRic4_all_rm.bed
cp iSca_all_consolidated_rm.bed iSca_all_rm.bed
sbatch cat_data.sh
cp iRic1_all_rm.bed iRic1_rm.bed
cp iRic2_all_rm.bed iRic2_rm.bed
cp iRic4_all_rm.bed iRic4_rm.bed
cp iSca_all_rm.bed iSca_rm.bed
### Wait for cat_data.sh to be finished.
python filter_beds.py -g sizefile_mrates.txt -d 50
```
Use altered curated_landscapes_bluePLE_mod.py
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes.txt
```
Is the huge peak still there?
Yes.
Investigated and the peak appears to be caused by a mislabeled satellite repeat.
It was called iRic2.6.4895#Unknown/Unknown. I've since renamed it to iRic2.6.4895#Satellite/Satellite.
My first thought was that this repeat was an artifact of misassembled contigs/scaffolds, so I queried the raw reads to see whether it exists in the wild.
Yes, it does. See output in /lustre/scratch/daray/ixodes/odd_repeat_search/iRic2.6.4895_v_ont.out.
Now the question is: why does this repeat not show up in the other individuals?
Need to run the same analysis with the raw data from those samples.
Will do that.
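For reference, the rename itself is a single substitution over the library fasta (and would need to be mirrored in any derived bed files). A minimal sketch on a toy record:

```shell
cd "$(mktemp -d)"
printf '>iRic2.6.4895#Unknown/Unknown\nACGTACGT\n' >demo_lib.fa
sed -i 's|iRic2.6.4895#Unknown/Unknown|iRic2.6.4895#Satellite/Satellite|' demo_lib.fa
head -1 demo_lib.fa    # -> >iRic2.6.4895#Satellite/Satellite
```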
For now, re-run the graph to see if the relabeling got rid of the weird peak.
Create 'consolidated' files:
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iRic4 iSca"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_Satellite_rm.bed ${TAXON}_RM/${TAXON}_Satellite_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_Simple_repeat_rm.bed ${TAXON}_RM/${TAXON}_Simple_repeat_consolidated_rm.bed
done
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
cat ../iRic1_RM/*consolidated_rm.bed >iRic1_all_consolidated_rm.bed
cat ../iRic2_RM/*consolidated_rm.bed >iRic2_all_consolidated_rm.bed
cat ../iRic4_RM/*consolidated_rm.bed >iRic4_all_consolidated_rm.bed
cat ../iSca_RM/*consolidated_rm.bed >iSca_all_consolidated_rm.bed
cp iRic1_all_consolidated_rm.bed iRic1_all_rm.bed
cp iRic2_all_consolidated_rm.bed iRic2_all_rm.bed
cp iRic4_all_consolidated_rm.bed iRic4_all_rm.bed
cp iSca_all_consolidated_rm.bed iSca_all_rm.bed
sbatch cat_data.sh
cp iRic1_all_rm.bed iRic1_rm.bed
cp iRic2_all_rm.bed iRic2_rm.bed
cp iRic4_all_rm.bed iRic4_rm.bed
cp iSca_all_rm.bed iSca_rm.bed
### Wait for cat_data.sh to be finished.
python filter_beds.py -g sizefile_mrates.txt -d 50
```
Use altered curated_landscapes_bluePLE_mod.py
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes.txt
```
Still looking a little weird. Rodrigo reassembled using Flye and provided new assemblies. They're not as contiguous but worth examining.
Maya_Flye_raw.fasta
Murphy_Flye_raw.fasta
Compress and rename for RM analysis.
```
cd /lustre/scratch/daray/ixodes/assemblies
gzip -c Maya_Flye_raw.fasta >maya_flye.fa.gz
gzip -c Murphy_Flye_raw.fasta >murphy_flye.fa.gz
```
Got new assemblies from Katie (in Travis' lab). All are from the same assembly method and (we hope) will resolve this problem.
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="maya_flye_TCG murphy_flye_TCG kees_flye_TCG"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_Satellite_rm.bed ${TAXON}_RM/${TAXON}_Satellite_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_Simple_repeat_rm.bed ${TAXON}_RM/${TAXON}_Simple_repeat_consolidated_rm.bed
done
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="maya_flye_TCG murphy_flye_TCG kees_flye_TCG"
for TAXON in $LIST; do
cat ../${TAXON}_RM/*consolidated_rm.bed >${TAXON}_all_consolidated_rm.bed
cp ${TAXON}_all_consolidated_rm.bed ${TAXON}_all_rm.bed
done
sbatch cat_data.sh
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="maya_flye_TCG murphy_flye_TCG kees_flye_TCG"
for TAXON in $LIST; do
cp ${TAXON}_all_rm.bed ${TAXON}_rm.bed
done
### Wait for cat_data.sh to be finished.
python filter_beds.py -g sizefile_mrates_TCG.txt -d 50
```
Use altered curated_landscapes_bluePLE_mod.py
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes_TCG.txt
cd /lustre/scratch/daray/ixodes/final_library_work/plots
rm proportions_table.tsv
echo "Species Genome_size TE_proportion LINE_proportion SINE_proportion LTR_proportion DNA_proportion RC_proportion Unknown_proportion Satellite_proportion Simple_repeat_proportion" >proportions_table.tsv
TAXONLIST="maya_flye_TCG murphy_flye_TCG kees_flye_TCG iSca"
for TAXON in $TAXONLIST; do
STATS=$(tail -1 ../${TAXON}_RM/repeat_table_total.tsv)
echo $STATS >>proportions_table.tsv
done
```
# Meeting 5/25/2023 to decide which assemblies to use for the final analysis.
From Travis' e-mail:
[versions in brackets exist at least in theory; not sure if we actually have them]
Kees-1.2 - first version we got from Ron (Hein), which looks good except for the satellite artifacts (and has already been annotated and used for orthology, etc.)
[Kees-1.1 - from Ron, going to the version that has the haplotigs removed, but before polishing with Pilon --> not sure we have this version]
Kees-1.0 - from Ron, going back to the version prior to haplotig removal & polishing
Kees-2.0.1 - flye assembly by Rodrigo (which includes all haplotigs)
Kees-2.0.2 - flye assembly by Katie (which includes all haplotigs)
Kees-2.0.3 - flye asm 40 assembly by Katie (which includes all haplotigs); asm 40 uses the longest reads (excludes shortest reads after 40x coverage is obtained)
Kees-2.1.1 - prior version (2.0.1) with haplotigs removed {Rodrigo's Flye assembly}
Kees-2.2.1 - prior version (2.1.1) polished with NextPolish {Rodrigo's polished Flye assembly}
Kees-2.3.1 - prior version (2.2.1) with contaminants removed
Maya-2.0.2 -Katie's Flye assembly
Murphy-2.0.2 - Katie's Flye assembly
Maya-2.0.1 - Rodrigo's Flye assembly
Murphy-2.0.1 - Rodrigo's Flye assembly
Maya-2.1.1 - haplotigs removed {Rodrigo's Flye assembly}
Murphy-2.1.1 - haplotigs removed {Rodrigo's Flye assembly}
I have:
murphy_purged_2.1.1.fa.gz = Murphy-2.1.1
murphy_flye_TCG_2.0.2.fa.gz = Murphy-2.0.2
murphy_flye_raw_2.0.1.fa.gz = Murphy-2.0.1
maya_purged_2.1.1.fa.gz = Maya-2.1.1
maya_flye_TCG_2.0.2.fa.gz = Maya-2.0.2
maya_flye_raw_2.0.1.fa.gz = Maya-2.0.1
kees_purged_2.1.1.fa.gz = kees-2.1.1
kees_flye_TCG_2.0.2.fa.gz = kees-2.0.2
kees_flye_raw_2.0.1.fa.gz = kees-2.0.1
iRic3.fa.gz = kees_1.2.0 = kees_ron_1.2.0.fa.gz
kees_40_2.0.3.fa.gz = kees1202_cov40
kees_ron_1.0.0.fa.gz = Kees-1.0
kees_purged_2.2.1.fa.gz = Kees_purged_nextpolish_2.2.1.fasta
murphy_purged_2.2.1.fa.gz = Murphy_purged_nextpolish_2.2.1.fasta
maya_purged_2.2.1.fa.gz = Maya_purged_nextpolish_2.2.1.fasta
Restarting all of the repeatmasker runs with this list:
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for NAME in $LIST; do
sed "s/<NAME>/${NAME}/g" template_RM.sh >${NAME}_RM.sh
rm -rf ${NAME}_RM
sed "s/<NAME>/${NAME}/g" template_rm2bed.sh >${NAME}_rm2bed.sh
sbatch ${NAME}_RM.sh
done
```
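The `<NAME>` templating above is a plain placeholder substitution; a toy check (the template contents here are hypothetical):

```shell
cd "$(mktemp -d)"
printf '#SBATCH --job-name=<NAME>_RM\nGENOME=<NAME>.fa.gz\n' >template_demo.sh
sed "s/<NAME>/kees_purged_2.1.1/g" template_demo.sh >kees_purged_2.1.1_demo.sh
cat kees_purged_2.1.1_demo.sh
```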
Just got kees_40_2.0.3.fa.gz (the 40x version); added it to the list. Also got kees_ron_1.0.0.
Submit all of the doLifts if necessary. Sometimes they don't go automatically.
```
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
cd /lustre/scratch/daray/ixodes/final_library_work
for NAME in $LIST; do
cd ${NAME}_RM/
rm *.err
rm *.out
sbatch doLift.sh
cd ..
done
```
Submit all of the rm2beds and then clean up.
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for NAME in $LIST; do
rm ${NAME}_RM/*.err
rm ${NAME}_RM/*.out
sbatch generic_rm2bed.sh $NAME
done
cd /lustre/scratch/daray/ixodes/final_library_work
for NAME in $LIST; do
rm -rf ${NAME}_RM/RMPart
done
```
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_Satellite_rm.bed ${TAXON}_RM/${TAXON}_Satellite_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_Simple_repeat_rm.bed ${TAXON}_RM/${TAXON}_Simple_repeat_consolidated_rm.bed
done
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for TAXON in $LIST; do
cat ../${TAXON}_RM/*consolidated_rm.bed >${TAXON}_all_consolidated_rm.bed
cp ${TAXON}_all_consolidated_rm.bed ${TAXON}_all_rm.bed
done
sbatch cat_data.sh
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for TAXON in $LIST; do
cp ${TAXON}_all_rm.bed ${TAXON}_rm.bed
done
### Wait for cat_data.sh to be finished.
### fix the sizefiles
conda activate seqkit
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for NAME in $LIST; do
STATS=$(seqkit stats ../../assemblies/${NAME}.fa.gz -T | tail -1)
echo $STATS >>genomestats.txt
done
###Edit sizefiles as needed
python filter_beds.py -g sizefile_mrates_all.txt -d 50
```
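The "edit sizefiles as needed" step can mostly be scripted: seqkit's tab-separated stats put the total length in column 5, so sizefile_mrates-style lines (name, size, rate) can be derived from genomestats.txt. A sketch on a fake stats line (the file name and 3.0e-9 rate follow the sizefiles above; the other columns are placeholders):

```shell
cd "$(mktemp -d)"
# fake 'seqkit stats -T | tail -1' line as stored by the unquoted echo above
echo '../../assemblies/iSca.fa.gz FASTA DNA 1000 2226883318 150 2226883.3 30000000' >genomestats_demo.txt
# column 5 = sum_len; strip the path and .fa.gz suffix to recover the taxon name
awk '{name=$1; sub(/^.*\//,"",name); sub(/\.fa\.gz$/,"",name); print name, $5, "3.0e-9"}' \
    genomestats_demo.txt >sizefile_mrates_demo.txt
cat sizefile_mrates_demo.txt    # -> iSca 2226883318 3.0e-9
```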
Use altered curated_landscapes_bluePLE_mod.py
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes_all.txt
cd /lustre/scratch/daray/ixodes/final_library_work/plots
rm proportions_table.tsv
echo "Species Genome_size TE_proportion LINE_proportion SINE_proportion LTR_proportion DNA_proportion RC_proportion Unknown_proportion Satellite_proportion Simple_repeat_proportion" >proportions_table.tsv
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for TAXON in $LIST; do
STATS=$(tail -1 ../${TAXON}_RM/repeat_table_total.tsv)
echo $STATS >>proportions_table.tsv
done
```
Another new set of assemblies, cleaned of contaminants:
kees_PDC_2.3.1.fa.gz = Kees_PDC_2.3.1.fasta
murphy_PDC_2.3.1.fa.gz = Murphy_PDC_2.3.1.fasta
maya_PDC_2.3.1.fa.gz = Maya_PDC_2.3.1.fasta
Restarting all of the repeatmasker runs with this list:
LIST="kees_PDC_2.3.1 murphy_PDC_2.3.1 maya_PDC_2.3.1"
## Out of space. Moved to I. ricinus analysis part 2