# I ricinus TE analysis
Beginning by running RepeatModeler on two assembly versions. Both are females, but one is several hundred megabases larger than the other. Haven't received an answer from Rodrigo or Travis as to why this could be the case.
Regardless, I will run RepeatModeler and then collapse the library using cdhit set to 95% similarity.
RepeatModeler completed (3/10/2022) on both assemblies. Both runs yielded a LOT of elements, around the same number as resulted from the iSca run (n=4527): iRic1-families.fa, n=4675; iRic2-families.fa, n=4723.
Not sure exactly what is the best option for moving forward.
1. Run RAM and then use the output? I don't think so. RAM has the potential to generate some odd consensus sequences. We've already been through what will likely be most of those in iSca.
2. Run cdhit on one output at a time. Combine with newlib4cdhit_trimmed.fa (n=2059), which contains all of the stuff we felt confident naming plus all of the stuff that was uninterpretable or 'satellite'. Another option is to use lib4cdhit_trimmmed.fa (n=1534), which contains all of the stuff we felt comfortable naming. Either way, would need to run iRic1(&2)-families.fa with the chosen library and then get rid of the duplicates. Do this for each of the files, iRic1 and iRic2.
3. Combine iRic1 and iRic2 and then run cdhit on the combined version using either newlib4cdhit_trimmed.fa or lib4cdhit_trimmmed.fa.
I'm leaning toward 3 at the moment. Will proceed and see how it goes.
### Option 3
```
cd /lustre/scratch/daray/ixodes/iRic2/rmodeler_dir
cut -d" " -f1 iRic2-families.fa > iRic2-families.mod.fa
sed -i "s/rnd/iRic2-rnd/g" iRic2-families.mod.fa
cd ../../iRic1/rmodeler_dir
cut -d" " -f1 iRic1-families.fa > iRic1-families.mod.fa
sed -i "s/rnd/iRic1-rnd/g" iRic1-families.mod.fa
cd ../..
mkdir iRic_cdhit_work
cd iRic_cdhit_work/
cat ../iRic1/rmodeler_dir/iRic1-families.mod.fa ../iRic2/rmodeler_dir/iRic2-families.mod.fa >iRic1_2-families.mod.fa
cat iRic1_2-families.mod.fa newlib4cdhit_trimmed.fa >iRic1_2_newlib_4cdhit.fa
i12
cd /lustre/scratch/daray/ixodes/iRic_cdhit_work
CDHIT=/lustre/work/daray/software/cdhit-4.8.1
$CDHIT/cd-hit-est -i iRic1_2_newlib_4cdhit.fa -o iRic1_2_newlib.cdhit95_sS09 -c 0.95 -d 70 -aS 0.9 -n 12 -M 2200 -l 100
```
To process output (iRic1_2_newlib.cdhit95_sS09.clstr), will use bash to filter the files as follows:
Pull all hits that are from iRic1 or iRic2 and are NOT the longest hit in the cluster.
In other words, find clusters like this:
```
>Cluster 62
0 201nt, >iRic1-rnd-1_family-140#Unknown... at -/95.52%
1 194nt, >iRic1-rnd-4_family-1394#Unknown... at +/95.36%
2 10545nt, >iSca.1.307#Unknown/Unknown... *
3 193nt, >iSca.1.288#DNA/TcMariner-TIR-GGG... at -/97.93%
```
Create a new list from lines 0 and 1. Target those for removal.
This will leave clusters like this:
```
0 8745nt, >iRic1-rnd-6_family-1316#DNA/MULE-MuDR... *
1 6905nt, >iRic2-rnd-5_family-3428#DNA/MULE-MuDR... at +/98.93%
```
and like this:
```
>Cluster 2665
0 1132nt, >iRic2-rnd-5_family-710#LINE/CR1... *
```
Line 0 will be retained in both cases.
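That filter can be sketched in Python (the helper name is mine; it assumes the standard cd-hit `.clstr` layout shown above, where the cluster representative line ends in `*`):

```python
import re

def redundant_iric_members(clstr_text):
    """From cd-hit .clstr text, collect iRic sequence names that are NOT
    the cluster representative (representative lines end in '*'); these
    are the candidates for removal."""
    to_remove = []
    for line in clstr_text.splitlines():
        if line.startswith(">Cluster") or not line.strip():
            continue
        # Member lines look like: 0  201nt, >name... at -/95.52%
        m = re.search(r">(\S+?)\.\.\.", line)
        if not m:
            continue
        name = m.group(1)
        if name.startswith("iRic") and not line.rstrip().endswith("*"):
            to_remove.append(name)
    return to_remove
```

The returned names can then be dropped from the combined library before the extension step.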
```
cd /lustre/scratch/daray/ixodes/iRic_cdhit_work/cdhit1
cat iRic1_2_newlib.cdhit95_sS09.clstr | fgrep '*' | grep "iRic"
```
OUCH! That still leaves 6018 potentials.
What if I do 90%?
```
i12
cd /lustre/scratch/daray/ixodes/iRic_cdhit_work/cdhit1
CDHIT=/lustre/work/daray/software/cdhit-4.8.1
$CDHIT/cd-hit-est -i iRic1_2_newlib_4cdhit.fa -o iRic1_2_newlib.cdhit90_sS09 -c 0.90 -d 70 -aS 0.9 -n 9 -M 2200 -l 100
```
That takes the total down to 4677. That's still a LOT.
Going to work with that, run it through RAM and then see what happens when I run cdhit again.
```
cat iRic1_2_newlib.cdhit90_sS09.clstr | fgrep '*' | grep "iRic" | cut -d">" -f2 | cut -d"." -f1 > novel-iRic.txt
python ~/gitrepositories/bioinfo_tools/pull_seqs2.py -l novel-iRic.txt -f iRic1_2-families.mod.fa -o novel_iRic.fa
cut -d"#" -f1 novel_iRic.fa >novel_iRic.mod.fa
cd /lustre/scratch/daray/ixodes/iRic1
nano iRic1_extend_align.sh
sbatch iRic1_extend_align.sh
cd ../iRic2
cp ../iRic1/iRic1_extend_align.sh iRic2_extend_align.sh
nano iRic2_extend_align.sh
sbatch iRic2_extend_align.sh
```
Runs finished overnight. Now, collect all of the new consensus sequences and re-run cd-hit to eliminate potential duplicates.
```
i12
cd /lustre/scratch/daray/ixodes/iRic_cdhit_work/cdhit2
cat /lustre/scratch/daray/ixodes/iRic1/extensions/final_consensuses/*.fa /lustre/scratch/daray/ixodes/iRic2/extensions/final_consensuses/*.fa ../cdhit1/newlib4cdhit_trimmed.fa >cdhit2_input.fa
CDHIT=/lustre/work/daray/software/cdhit-4.8.1
$CDHIT/cd-hit-est -i cdhit2_input.fa -o cdhit2.cdhit90_sS09 -c 0.90 -d 70 -aS 0.9 -n 9 -M 2200 -l 100
$CDHIT/cd-hit-est -i cdhit2_input.fa -o cdhit2.cdhit95_sS09 -c 0.95 -d 70 -aS 0.9 -n 11 -T 36 -M 2200 -l 100
```
The very large elements are still a problem. I think one way to make this work is to just eliminate anything over 12kb from our analysis. The rationale can be that these are likely segmental duplications.
Had to consider whether to cull now or go back to before the extend_align run. In other words, how many of the original consensus sequences were greater than 12kb in the initial RepeatModeler output? Checked: there were only two. Post extend_align, there are 369.
Went back to arthropods+denovo+expanded.fa and removed anything from iSca group that was over 12kb and was classified as Unknown/Unknown. Removals are saved as removed_12kbplus.fa. Filtered library is saved as arthropods+denovo+expanded+under12kb.fa. Keeping anything that has a clear TIR. This way, we can say we kept potential mobilizing elements.
Will need to go back to the repeatmasker run for iSca and eliminate those entries from the library.
Getting the extensions.
```
cat /lustre/scratch/daray/ixodes/iRic1/extensions/final_consensuses/*.fa /lustre/scratch/daray/ixodes/iRic2/extensions/final_consensuses/*.fa >iRic1_2_extensions.fa
```
Downloaded and sorted by size. Moved anything greater than 12kb into iRic1_2_extensions_over12kb.fa; kept anything less than 12kb as iRic1_2_extensions_under12kb.fa.
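The size split was done by hand here, but it can be scripted; a minimal sketch (function name and parsing are mine, plain single-line-or-wrapped FASTA assumed):

```python
def split_by_length(fasta_text, cutoff=12000):
    """Partition FASTA records into (under, over) lists of
    (header, sequence) tuples by sequence length."""
    under, over = [], []
    header, seq = None, []

    def flush():
        # Commit the record accumulated so far to the right bucket.
        if header is None:
            return
        record = (header, "".join(seq))
        (under if len(record[1]) < cutoff else over).append(record)

    for line in fasta_text.splitlines():
        if line.startswith(">"):
            flush()
            header, seq = line, []
        else:
            seq.append(line.strip())
    flush()
    return under, over
```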
Now, combine that last file with arthropods+denovo+expanded+under12kb.fa and rerun cdhit.
Moved all of the cdhit results generated earlier to 'originalrun'.
```
i12
cd /lustre/scratch/daray/ixodes/iRic_cdhit_work/cdhit2
cat arthropods+denovo+expanded+under12kb.fa iRic1_2_extensions_under12kb.fa >cdhit2_input.fa
CDHIT=/lustre/work/daray/software/cdhit-4.8.1
$CDHIT/cd-hit-est -i cdhit2_input.fa -o cdhit2.cdhit90_sS09 -c 0.90 -d 70 -aS 0.9 -n 9 -T 12 -M 2200 -l 100
$CDHIT/cd-hit-est -i cdhit2_input.fa -o cdhit2.cdhit95_sS09 -c 0.95 -d 70 -aS 0.9 -n 11 -T 12 -M 2200 -l 100
```
Still a lot of elements to work through.
```
$ cat cdhit2.cdhit95_sS09.clstr | fgrep '*' | grep "iRic" | wc -l
3181
$ cat cdhit2.cdhit90_sS09.clstr | fgrep '*' | grep "iRic" | wc -l
2937
```
Might as well just get to it.
```
actconda
cat cdhit2.cdhit95_sS09.clstr | fgrep '*' | grep "iRic1" | cut -d">" -f2 | cut -d"." -f1 > novel_iRic1.txt
python ~/gitrepositories/bioinfo_tools/pull_seqs2.py -l novel_iRic1.txt -f iRic1_2_extensions.fa -o novel_iRic1.fa
cat cdhit2.cdhit95_sS09.clstr | fgrep '*' | grep "iRic2" | cut -d">" -f2 | cut -d"." -f1 > novel_iRic2.txt
python ~/gitrepositories/bioinfo_tools/pull_seqs2.py -l novel_iRic2.txt -f iRic1_2_extensions.fa -o novel_iRic2.fa
```
Now, let's run these through te-aid.
```
mkdir -p /lustre/scratch/daray/ixodes/iRic1/te-aid/splits
cd /lustre/scratch/daray/ixodes/iRic1/te-aid/splits
WORK=/lustre/work/daray/software
cp novel_iRic1.fa .
$WORK/faSplit byname ../novel_iRic1.fa .
cp /lustre/scratch/daray/ixodes/iSca/round1-naming/unknowns-for-curating/david/te-aid_work/bin/te-aid.sh .
nano te-aid.sh ####Modify as needed
sbatch te-aid.sh
```
Going to try to follow the instructions from Clement's TE curation pipeline, lines 128-
First, need to get all of my potential TEs from each genome together in one place. Then classify them to get the headers that generate_priority_list_from_RM2.sh will retrieve.
```
cd /lustre/scratch/daray/ixodes/iRic1/repeatclassifier
cat ../te-aid/*_rep.fa >iRic1_extended_rep.fa
sbatch repeatclassifier.sh
cd /lustre/scratch/daray/ixodes/iRic2/repeatclassifier
cat ../te-aid/*_rep.fa >iRic2_extended_rep.fa
sbatch repeatclassifier.sh
```
```
cd /lustre/scratch/daray/ixodes/iRic1/
mkdir prioritize
cd prioritize
sbatch prioritize.sh
##### Below was run on local desktop
$BIN/generate_priority_list_from_RM2.sh ../repeatclassifier/iRic1_extended_rep.fa.classified ../../../assemblies/iRic1.fa ~/Pfam_db ~/TE_ManAnnot
```
Spent several days creating TEcurate.sh, a script to process and categorize TEs. Used it on iRic1 and iRic2.
Downloaded results for LTR, DNA, and LINE and easily categorized them using the pdfs.
The NOHIT files are next.
Going to concatenate all of the rep.fa files and run through a less stringent cd-hit analysis to see if I can categorize them a bit more.
```
cd /mnt/d/Dropbox/ixodes/ricinus/iRic1/fordownload/NOHIT
mkdir cdhit
cat *_rep.fa >all_rep.fa
mv all_rep.fa cdhit
cd cdhit
actconda
conda activate curate
cat all_rep.fa newlib4cdhit_trimmed.fa >lowstringency_cdhit_input.fa
cd-hit-est -i lowstringency_cdhit_input.fa -o lowstringency.cdhit80_sS09 -c 0.80 -d 70 -aS 0.9 -n 5 -T 12 -M 2200 -l 100
```
A problem with this approach: all of the crappy consensus sequences larger than 10kb occlude any meaningful assessment of potential subfamily TEs.
Potential solution: get rid of anything from either iSca or iRic that's larger than 10kb before the cdhit run.
Used BioEdit to get rid of 10kb+ entries. Saved as lowstringency_cdhit_input_under10kb.fa.
```
cd-hit-est -i lowstringency_cdhit_input_under10kb.fa -o lowstringency_under10kb.cdhit80_sS09 -c 0.80 -d 70 -aS 0.9 -n 5 -T 12 -M 2200 -l 100
```
That yielded a workable file.
I went through it manually, finding all the spots where an iRic was linked with a known iSca. Most of them were pairs. A few were trios or quartets.
The bash commands below allowed me to work through the pairs and classify them into a new folder within NOHIT, classified.
Renamed and classified the trios and quartets manually. There were only a few.
```
echo "Creating directories"
mkdir -p working/clusterfiles
#cd clusterfiles
cd working
echo "Creating individual files for each cluster"
awk '/>Cluster/ {x="clusterfiles/F"++i".txt";}{print >x;}' lowstringency_under10kb.cdhit80_sS09_trimmed.txt
cd clusterfiles
mkdir two
mkdir other
for FILE in *.txt; do NUMBER=$(basename $FILE .txt); N=${NUMBER:1}; grep -v ">Cluster" $FILE >G${N}.txt; rm $FILE; done
for FILE in G*.txt; do NUMBER=$(basename $FILE .txt); N=${NUMBER:1}; cut -d">" -f 2 $FILE | cut -d" " -f 1 | sed "s/.\{0,3\}$//; /^$/d" > H${N}.txt; rm $FILE; done
for FILE in H*.txt; do HOWMANY=$(wc -l $FILE | cut -d" " -f1)
if (( HOWMANY == 2 ))
then
mv $FILE two
else
mv $FILE other
fi
done
cd two
mkdir ../../../../classified
for FILE in H*.txt; do
	NUMBER=$(basename $FILE .txt)
	N=${NUMBER:1}
	TOBE=$(grep iSca $FILE | cut -d"#" -f2)
	OLD=$(grep iRic $FILE)
	NEWCAT=${OLD}#${TOBE}
	CONSNAMEMOD=${NEWCAT/-rnd-/.}
	CONSNAMEMOD=${CONSNAMEMOD/_family-/.}
	cp ../../../../${OLD}-_rep_mod.fa ../../../../${OLD}-_rep_mod_orig.fa
	sed "s|$OLD|$CONSNAMEMOD|g" ../../../../${OLD}-_rep.fa > ../../../../${OLD}-_rep_mod.fa
	mv ../../../../${OLD}-* ../../../../classified
done
```
That still leaves several hundred to process by hand.
Manually processed those to identify obvious TSDs.
There are plenty of LINE-like elements in there, based on my observation of several repetitive tails and a bunch of associated microsatellite repeats.
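The core of the manual TSD check is asking whether the 5' flank ends with the same short motif the 3' flank starts with. A toy helper capturing that idea (my own sketch; exact matches only, whereas real TSDs may need a mismatch allowance):

```python
def find_tsd(left_flank, right_flank, min_len=4, max_len=12):
    """Return the longest exact direct repeat shared by the end of the
    5' flank and the start of the 3' flank (a candidate TSD), or None."""
    for n in range(max_len, min_len - 1, -1):
        if left_flank[-n:] == right_flank[:n]:
            return left_flank[-n:]
    return None
```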
I'd noted that a LOT of these didn't seem to get picked up by earlier screens for TE ORFs. I need to run this batch through again and see if it works better this time.
Recollect all of the rep.fa files that remain and prep the rerun.
```
cd /mnt/d/Dropbox/ixodes/ricinus/iRic1/fordownload/NOHIT
cat *-_rep.fa >all_rep_4_rerun.fa
grep ">" all_rep_4_rerun.fa | sed "s/>//g" >> rerun_list.txt
#Moved to HPCC for remainder of work
cd /lustre/scratch/daray/ixodes/iRic1_rerun/repeatclassifier
actconda
python ~/gitrepositories/bioinfo_tools/pull_seqs.py -l rerun_list.txt -f iRic1_extended_rep.fa.classified -o iRic1_rerun.fa.classified
```
Altered TEcurate to use the new directory (/lustre/scratch/daray/ixodes/iRic1_rerun) and the newly created classified file.
Uploaded TEcurate_rerun.sh to /lustre/scratch/daray/ixodes/iRic1_rerun and ran.
First run got exactly zero ORF hits. I reduced the minorf value in the script. Maybe it was too high.
New minorf = 500. We'll see if that helps.
The results are very good. I'm constantly improving the output.
Once the results from the 'fordownload' folder are obtained, some post-processing is needed.
The following need to be identified by hand:
- LINEs with upstream satellites
- Tandem Penelope elements
If a list of TE IDs is generated, list.txt, they can be post-processed as follows:
```
for FILE in *_rep_mod.fa; do
CONSNAME=$(basename $FILE -_rep_mod.fa)
echo $CONSNAME
CONSNAMEMOD=${CONSNAME/-rnd-/.}
CONSNAMEMOD=${CONSNAMEMOD/_family-/.}
HEADER=${CONSNAMEMOD}#LINE/LINE-like
echo $HEADER
sed -i "s|${CONSNAME}|${HEADER}|g" $FILE
done
for FILE in *.fa; do
seqkit seq -r -p -t DNA $FILE >${FILE}.tmp
mv ${FILE}.tmp $FILE
done
#Find and copy the appropriate files used to build the dot plots.
cat list.txt | while read i; do
find . -type f -name "${i}_rep.fa.self-blast.pairs.txt" -exec cp {} . \;
done
#Extract and separate upstream satellites from ixodes LINEs
cat list.txt | while read i; do
#echo ${i}_rep.fa.self-blast.pairs.txt
REPEAT=$(sed '3q;d' ${i}_rep.fa.self-blast.pairs.txt | awk '{print $5}')
#echo $REPEAT
HEADER=$(grep ">" ${i}_rep_mod.fa | sed "s/>//g")
PREHEADER=$(echo $HEADER | cut -d"#" -f1)
SHEADER=${PREHEADER::-1}S-/#Satellite/Satellite
LHEADER=${HEADER/'-'/'L-'}
CONSSIZE=$(seqkit fx2tab --length --name ${i}_rep_mod.fa | awk '{print $2}')
#echo $CONSSIZE
cat ${i}_rep_mod.fa | seqkit subseq -r 1:$REPEAT >S.fa.tmp
cat ${i}_rep_mod.fa | seqkit subseq -r $REPEAT:$CONSSIZE >L.fa.tmp
#echo $HEADER
#echo $PREHEADER
#echo $POSTHEADER
#echo $SHEADER
#echo $LHEADER
sed -i "s|$HEADER|$SHEADER|g" S.fa.tmp
sed -i "s|$HEADER|$LHEADER|g" L.fa.tmp
cat S.fa.tmp L.fa.tmp >${i}_rep_mod-trimmed.fa
sed -i "s|/#|#|g" ${i}_rep_mod-trimmed.fa
rm S.fa.tmp L.fa.tmp
done
#Extract single, full-length Penelope from tandem repeats
cat list.txt | while read i; do
#echo ${i}_rep.fa.self-blast.pairs.txt
FULLEND=$(sed '3q;d' ${i}_rep.fa.self-blast.pairs.txt | awk '{print $5}')
FULLSTART=$(sed '3q;d' ${i}_rep.fa.self-blast.pairs.txt | awk '{print $3}')
cat ${i}_rep_mod.fa | seqkit subseq -r $FULLSTART:$FULLEND >${i}_rep_mod-trimmed.fa
done
#Add 'LTR-like' to headers where needed.
for FILE in *_rep_mod.fa; do
HEADER=$(grep ">" $FILE)
REPLACE=${HEADER}#LTR/LTR-like
sed -i "s|$HEADER|$REPLACE|g" $FILE
done
```
Pseudogene search
```
mkdir /lustre/scratch/daray/ixodes/pseudosearch
cd /lustre/scratch/daray/ixodes/pseudosearch
ln -s ../assemblies/iSca_ISE6_cds.fa
cp ../wgs_reads/iSca_lt12kb.fa .
actconda
conda activate extend_env
blastn -query iSca_ISE6_cds.fa -db iSca_lt12kb.fa -out iSca_v_iSca.out -outfmt 6 -evalue 1e-50
grep "LINE/unknown" iSca_v_iSca.out >iSca_possible_pseudogenes.txt
```
These will be considered potential pseudogenes and removed from the library.
Do the same for iRic
```
cp /lustre/scratch/daray/ixodes/iRic1_2_curated/iRic1_2_final_curated_clean.fa .
makeblastdb -in iRic1_2_final_curated_clean.fa -dbtype nucl
blastn -query iSca_ISE6_cds.fa -db iRic1_2_final_curated_clean.fa -out iSca_v_iRic.out -outfmt 6 -evalue 1e-50
grep "LINE/" iSca_v_iRic.out >iRic_possible_pseudogenes.txt
```
Nothing was identified as a pseudogene in the iRic library; all hits were classified as some known LINE.
Removal from library
```
awk '{print $2}' iSca_possible_pseudogenes.txt | sort | uniq >iSca_pseudogene_list.txt
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iSca_pseudogene_list.txt -f iSca_lt12kb.fa -o iSca_lt12kb_nopseudo.fa
```
FINAL LIBRARIES = iSca_lt12kb_nopseudo.fa, iRic1_2_final_curated_clean.fa
```
cp iSca_lt12kb_nopseudo.fa iRic1_2_final_curated_clean.fa /lustre/scratch/daray/ixodes/final_library_work
cd /lustre/scratch/daray/ixodes/final_library_work
cp /lustre/scratch/daray/ixodes/iSca/rmasker_final_under12kb/arthropods+denovo+expanded+under12kb.fa .
```
Manually removed all of the iSca elements from arthropods+denovo+expanded+under12kb.fa so it can be combined with the nopseudo library; saved as arthropods+denovo+expanded.fa.
```
cat arthropods+denovo+expanded.fa iSca_lt12kb_nopseudo.fa iRic1_2_final_curated_clean.fa >ixodes_library_final_050202022.fa
```
RepeatMasker output is in iRic1_RM_original and iRic2_RM_original.
The results of this run were disappointing. At least 50% of the genome should be repetitive and I'm not getting that.

I think all of the stuff >12kb that I left out are causing the problem.
Reincorporated those elements into the library as follows.
Working in /lustre/scratch/daray/ixodes/over.12kb
Renamed the consensus sequences with #Unknown/Unknown.
```
for FILE in /lustre/scratch/daray/ixodes/over.12kb/splits1/*.fa; do
CONSNAME=$(basename $FILE .fa)
HEADER=${CONSNAME}#Unknown/Unknown
echo $HEADER
sed "s|$CONSNAME|$HEADER|g" $FILE >/lustre/scratch/daray/ixodes/over.12kb/${CONSNAME}.fa
done
```
Repeated for iRic2 sequences in splits2.
Concatenated each set into iRic1_over_12kb.fa and iRic2_over_12kb.fa
Moved to /lustre/scratch/daray/ixodes/final_library_work
Concatenated each of those files with ixodes_library_final_050202022.fa to create ixodes_library_final_05232022.fa.
Using that as my library for Repeatmasker.
Also need to find out what all of these huge consensus sequences are.
Running TE-Aid on all of them to see.
I'm concerned that some of these are large overextensions of real TEs and that RMasker may be missing something by leaving them out.
For example, one consensus (TE-Aid plot not reproduced here) is a tandem Penelope. Is this Penelope not being called by RMasker using one of the other Penelope consensus sequences?

Another (plot not reproduced here) has a bunch of LTR/Gypsy components, and it looks like there are hundreds of copies of the right half, which would be about the right size for a regular old Gypsy element.
### Plotting for ricinus paper
Violin plots. First, run catdata:
```
cd /lustre/scratch/daray/ixodes/final_library_work/plots
sbatch catdata_props_PLE.py
```
Select which TE types to examine by counting the major types as follows:
```
cd /lustre/scratch/daray/ixodes/final_library_work/iRic1_RM
awk '{print $8}' iRic1_DNA_rm.bed | sort | uniq >iRic1_DNA_types.txt
cat iRic1_DNA_types.txt | while read i; do COUNT=$(grep $i iRic1_DNA_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT; done
```
Examination suggests these major players:
```
P 36254
PIF-Harbinger 18728
PiggyBac 2418
Sola 48508
Sola-2 19629
TIR-like 1373418
TcMar 590731
TcMariner 576530
Unknown 505711
nSola2 28770
piggyBac 369028
```
Notice the potential for double-counted elements (e.g., `grep TcMar` also matches the TcMariner entries). Need to fix this: use only the TcMar line and combine the Sola variants.
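Part of the double counting comes from `grep $i` matching substrings (`Sola` also hits `Sola-2`; `TcMar` also hits `TcMariner`). Counting the whole column-8 token avoids that; a sketch (function name is mine, field position taken from the bed layout used above):

```python
from collections import Counter

def count_types(bed_text, column=8):
    """Count exact values in the given 1-based whitespace-delimited
    column, so 'Sola' and 'Sola-2' are tallied separately."""
    counts = Counter()
    for line in bed_text.splitlines():
        fields = line.split()
        if len(fields) >= column:
            counts[fields[column - 1]] += 1
    return counts
```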
Used this with modifications for all three:
```
sed 's|hAT-Ac|hAT|g' iRic1_DNA_rm.bed >iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-Blackjack|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-Charlie|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-Pegasus|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-Tag1|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-Tip100?|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-Tip100|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-hAT19|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-hAT5|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-hATm|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|hAT-hobo|hAT|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|nSola2|Sola|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|PiggyBac|piggyBac|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Sola-2|Sola|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TIR-like|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Fot1?|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Fot1|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-ISRm11|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Mariner|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Pogo|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Tc1?|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Tc1|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-Tc2|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar-m44|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMar?|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-AAG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-AGC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CAC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CAG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CAT|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CCA|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CCC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CCG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CGA|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CGG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CTC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-CTG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GAC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GAG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GCA|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GCC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GCG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GGA|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GGC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GGG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GGT|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GTA|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GTC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-GTG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-TCA|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-TGC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-TGG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-TGT|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-TTC|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner-TIR-TTG|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|TcMariner|TcMar|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-AAA|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-AAT|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-ACC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-ATG|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CAA|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CAC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CAG|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CAT|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CCA|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CCC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CCT|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CTA|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-CTC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-GAG|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-GCC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-GGC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-GGG|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-GTA|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-GTG|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-TATT|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-TAT|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-TGC|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
sed -i 's|Unknown-TIR-TGT|Unknown-TIR-like|g' iRic1_DNA_consolidated_rm.bed
```
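The sed chain above is order-sensitive (a general pattern run first can mangle a more specific one, e.g. `hAT-Tip100` vs `hAT-Tip100?`). A dictionary-plus-prefix relabel sidesteps ordering; a sketch in Python (names are mine, and the mapping shown is only a subset of the rules above):

```python
# Exact renames, expressed as data rather than ordered sed commands.
CONSOLIDATE = {
    "PiggyBac": "piggyBac",
    "Sola-2": "Sola",
    "nSola2": "Sola",
    "TIR-like": "Unknown-TIR-like",
}

def consolidate_type(te_type):
    """Collapse one whole repeat-type token to its family label:
    exact lookup first, then prefix rules for the variant names."""
    if te_type in CONSOLIDATE:
        return CONSOLIDATE[te_type]
    if te_type.startswith("hAT-"):
        return "hAT"
    if te_type.startswith(("TcMar-", "TcMariner")):
        return "TcMar"
    if te_type.startswith("Unknown-TIR-"):
        return "Unknown-TIR-like"
    return te_type.rstrip("?")
```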
Do the same for LINE, LTR, Unknown, RC, PLE
```
TAXONLIST="iRic1 iRic2 iSca"
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_LINE_rm.bed | sort | uniq >${NAME}_LINE_types.txt
cat ${NAME}_LINE_types.txt | while read i; do COUNT=$(grep $i ${NAME}_LINE_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_LINE_counts.txt; done
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_LTR_rm.bed | sort | uniq >${NAME}_LTR_types.txt
cat ${NAME}_LTR_types.txt | while read i; do COUNT=$(grep $i ${NAME}_LTR_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_LTR_counts.txt; done
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_RC_rm.bed | sort | uniq >${NAME}_RC_types.txt
cat ${NAME}_RC_types.txt | while read i; do COUNT=$(grep $i ${NAME}_RC_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_RC_counts.txt; done
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_Unknown_rm.bed | sort | uniq >${NAME}_Unknown_types.txt
cat ${NAME}_Unknown_types.txt | while read i; do COUNT=$(grep $i ${NAME}_Unknown_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_Unknown_counts.txt; done
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_PLE_rm.bed | sort | uniq >${NAME}_PLE_types.txt
cat ${NAME}_PLE_types.txt | while read i; do COUNT=$(grep $i ${NAME}_PLE_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_PLE_counts.txt; done
done
```
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iSca"
for TAXON in $LIST; do
cd ${TAXON}_RM
sed 's|CR1-Zenon|CR1|g' ${TAXON}_LINE_rm.bed >${TAXON}_LINE_consolidated_rm.bed
sed -i 's|I-Jockey|I|g' ${TAXON}_LINE_consolidated_rm.bed
sed -i 's|L1-Tx1|L1|g' ${TAXON}_LINE_consolidated_rm.bed
sed -i 's|R2-NeSL|R2|g' ${TAXON}_LINE_consolidated_rm.bed
sed -i 's|RTE-BovB|RTE|g' ${TAXON}_LINE_consolidated_rm.bed
sed -i 's|RTE-RTE|RTE|g' ${TAXON}_LINE_consolidated_rm.bed
sed -i 's|RTE-X|RTE|g' ${TAXON}_LINE_consolidated_rm.bed
sed -i 's|Tx1|L1|g' ${TAXON}_LINE_consolidated_rm.bed
sed 's|Copia_internal|Copia|g' ${TAXON}_LTR_rm.bed >${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Copia_ltr|Copia|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Gypsy_internal|Gypsy|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Gypsy_int|Gypsy|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Gypsy_ltr|Gypsy|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTR-like_internal|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTR-like_ltr|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTR-like|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTRlike_internal|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTRlike_int|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTRlike_ltr|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTRlike|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTR_internal|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|LTR_ltr|LTR|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Pao_internal|Pao|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Pao_int|Pao|g' ${TAXON}_LTR_consolidated_rm.bed
sed -i 's|Pao_ltr|Pao|g' ${TAXON}_LTR_consolidated_rm.bed
sed 's|unknown|Unknown|g' ${TAXON}_Unknown_rm.bed >${TAXON}_Unknown_consolidated_rm.bed
cd /lustre/scratch/daray/ixodes/final_library_work
done
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iSca"
for TAXON in $LIST; do
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
done
```
For downstream plots, copying these 'consolidated' files to the plotting directory. Will run catdata.
Concatenate them into a new all_rm_bed file.
```
cd /lustre/scratch/daray/ixodes/final_library_work/plots
cat ../iRic1_RM/*consolidated_rm.bed >iRic1_all_consolidated_rm.bed
cat ../iRic2_RM/*consolidated_rm.bed >iRic2_all_consolidated_rm.bed
cat ../iSca_RM/*consolidated_rm.bed >iSca_all_consolidated_rm.bed
cp iRic1_all_consolidated_rm.bed iRic1_all_rm.bed
cp iRic2_all_consolidated_rm.bed iRic2_all_rm.bed
cp iSca_all_consolidated_rm.bed iSca_all_rm.bed
sbatch cat_data.sh
cp iRic1_all_rm.bed iRic1_rm.bed
cp iRic2_all_rm.bed iRic2_rm.bed
cp iSca_all_rm.bed iSca_rm.bed
python filter_beds.py -g sizefile_mrates.txt -d 50
```
```
LIST="hAT Maverick MULE-MuDR P PIF-Harbinger piggyBac Sola TcMar"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Helentron Helitron"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="tRNA-Deu"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Unknown"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Copia Gypsy LTR Pao"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="L1 L2 CR1 I Jockey R1 RTE"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Penelope"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
```
Compile into category dataframes for generating the violin plots
```
LIST="hAT Maverick MULE-MuDR P PIF-Harbinger piggyBac Sola TcMar"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allDNA_violinframe.txt
done
LIST="Helentron Helitron"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allRC_violinframe.txt
done
LIST="tRNA-Deu"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allSINE_violinframe.txt
done
LIST="Unknown"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allUnknown_violinframe.txt
done
LIST="Copia Gypsy LTR Pao"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allLTR_violinframe.txt
done
LIST="L1 L2 CR1 I Jockey R1 RTE"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allLINE_violinframe.txt
done
LIST="Penelope"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >>both_${TE}_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_violinframe.txt >>both_allPLE_violinframe.txt
done
```
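The per-category loops above repeat the same three commands for each superfamily group; they can be collapsed into one pass driven by an associative array mapping group to families. A minimal sketch with toy two-line frames standing in for the real iSca_scaled_<TE>_violinframe.txt files (only two groups shown):

```shell
# Sketch only: same sed/cat logic as above, driven by one group->families map.
cd "$(mktemp -d)"
declare -A CATS=(
  [allRC]="Helentron Helitron"
  [allPLE]="Penelope"
)
# Toy stand-ins for the real iSca_scaled_<TE>_violinframe.txt inputs.
for TE in Helentron Helitron Penelope; do
  printf 'Taxon,Div\niSca,5.0\niRic1,7.0\n' > iSca_scaled_${TE}_violinframe.txt
done
for GROUP in "${!CATS[@]}"; do
  for TE in ${CATS[$GROUP]}; do
    sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_violinframe.txt >> both_${TE}_violinframe.txt
    sed -i "s/iSca/iSca_${TE}/g" both_${TE}_violinframe.txt
    cat both_${TE}_violinframe.txt >> both_${GROUP}_violinframe.txt
  done
done
grep -c "iSca_" both_allRC_violinframe.txt   # counts the per-family iSca rows
```

Note the array is named CATS rather than GROUPS, since GROUPS is a read-only special variable in bash.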
```
grep -v "Taxon,Div" both_allDNA_violinframe.txt >both_all_DNA_violinframe.txt
grep -v "Taxon,Div" both_allRC_violinframe.txt >both_all_RC_violinframe.txt
grep -v "Taxon,Div" both_allPLE_violinframe.txt >both_all_PLE_violinframe.txt
grep -v "Taxon,Div" both_allEINE_violinframe.txt >both_all_SINE_violinframe.txt
grep -v "Taxon,Div" both_allSINE_violinframe.txt >both_all_SINE_violinframe.txt
grep -v "Taxon,Div" both_allLINE_violinframe.txt >both_all_LINE_violinframe.txt
grep -v "Taxon,Div" both_allLTR_violinframe.txt >both_all_LTR_violinframe.txt
grep -v "Taxon,Div" both_allUnknown_violinframe.txt >both_all_Unknown_violinframe.txt
echo "Taxon,Div" >header.txt
cat header.txt both_all_Unknown_violinframe.txt >both_allUnknown_violinframe.txt
cat header.txt both_all_LTR_violinframe.txt >both_allLTR_violinframe.txt
cat header.txt both_all_LINE_violinframe.txt >both_allLINE_violinframe.txt
cat header.txt both_all_SINE_violinframe.txt >both_allSINE_violinframe.txt
cat header.txt both_all_PLE_violinframe.txt >both_allPLE_violinframe.txt
cat header.txt both_all_DNA_violinframe.txt >both_allDNA_violinframe.txt
cat header.txt both_all_RC_violinframe.txt >both_allRC_violinframe.txt
```
```
import pandas as pd
import os
import glob
import seaborn as sns
import matplotlib.pyplot as plt
plt.switch_backend('Agg')
from pylab import savefig
LIST = ("DNA", "LINE", "LTR", "PLE", "RC", "SINE", "Unknown")
for TETYPE in LIST:
    VIOLINFRAME = pd.read_csv('both_all' + TETYPE + '_violinframe.txt')
    DIMS = (3, 10)
    FIG, ax = plt.subplots(figsize=DIMS)
    sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
    ax.set_title(TETYPE + 's')
    ax.yaxis.grid(True)
    ax.yaxis.tick_right()
    plt.xticks(rotation=0)
    FIG.savefig(TETYPE + 's_violin' + '.png')
```
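The strip-header/re-add-header round trip above (grep -v, then cat header.txt) can be done in one awk pass that keeps only the first Taxon,Div line. A sketch on a toy frame:

```shell
# One-pass header dedup: print every non-header line, plus the first header only.
printf 'Taxon,Div\niSca,5.1\nTaxon,Div\niRic1,7.3\n' > demo_violinframe.txt
awk '$0 != "Taxon,Div" || !seen++' demo_violinframe.txt > demo_dedup.txt
cat demo_dedup.txt
```

The !seen++ guard is true only the first time a header line is seen, so later duplicate headers are dropped in place.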
```
cat both_all_DNA_violinframe.txt both_all_LINE_violinframe.txt both_all_LTR_violinframe.txt both_all_PLE_violinframe.txt both_all_RC_violinframe.txt both_all_SINE_violinframe.txt both_all_Unknown_violinframe.txt >both_all_TE_violinframe.txt
cat header.txt both_all_TE_violinframe.txt >both_allTE_violinframe.txt
```
```
import pandas as pd
import os
import glob
import seaborn as sns
import matplotlib.pyplot as plt
plt.switch_backend('Agg')
from pylab import savefig
VIOLINFRAME = pd.read_csv('both_allTE_violinframe.txt')
DIMS = (15, 10)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('AllTEs')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('AllTEs_violin' + '.png')
```
```
grep iSca both_allTE_violinframe.txt >iSca_all_TE_violinframe.txt
cat header.txt iSca_all_TE_violinframe.txt >iSca_allTE_violinframe.txt
grep iRic1 both_allTE_violinframe.txt >iRic1_all_TE_violinframe.txt
cat header.txt iRic1_all_TE_violinframe.txt >iRic1_allTE_violinframe.txt
```
```
import pandas as pd
import os
import glob
import seaborn as sns
import matplotlib.pyplot as plt
plt.switch_backend('Agg')
from pylab import savefig
VIOLINFRAME = pd.read_csv('iSca_allTE_violinframe.txt')
DIMS = (15, 5)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('iSca_AllTEs')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('iSca_AllTEs_violin' + '.png')
VIOLINFRAME = pd.read_csv('iRic1_allTE_violinframe.txt')
DIMS = (15, 5)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('iRic1_AllTEs')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('iRic1_AllTEs_violin' + '.png')
```
These look bad. The Unknowns completely overwhelm all other TE types.
Separate the Unknowns from the other TE types.
```
grep -v Unknown iSca_allTE_violinframe.txt >iSca_all_TEminusUnknown_violinframe.txt
cat header.txt iSca_all_TEminusUnknown_violinframe.txt >iSca_allTEminusUnknown_violinframe.txt
grep -v Unknown iRic1_allTE_violinframe.txt >iRic1_all_TEminusUnknown_violinframe.txt
cat header.txt iRic1_all_TEminusUnknown_violinframe.txt >iRic1_allTEminusUnknown_violinframe.txt
```
```
import pandas as pd
import os
import glob
import seaborn as sns
import matplotlib.pyplot as plt
plt.switch_backend('Agg')
from pylab import savefig
VIOLINFRAME = pd.read_csv('iSca_allTEminusUnknown_violinframe.txt')
DIMS = (15, 5)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('iSca_AllTEminusUnknown')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('iSca_AllTEminusUnknown_violin' + '.png')
VIOLINFRAME = pd.read_csv('iRic1_allTEminusUnknown_violinframe.txt')
DIMS = (15, 5)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('iRic1_AllTEminusUnknown')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('iRic1_AllTEminusUnknown_violin' + '.png')
```
Some elements are showing up in one species but not the other.
```
hAT        hAT
Maverick   Maverick
MULE       MULE
P          P
Harbinger  Harbinger
piggyBac   piggyBac
Sola       Sola
TcMar      TcMar
L1         L1
L2         L2
CR1        CR2
I          I
Jockey
R1         R1
RTE        RTE
Copia
Gypsy      Gypsy
LTR        LTR
Pao        Pao
Penelope   Penelope
Helentron  Helentron
Helitron
Deu        Deu
```
This may be because of the 100 bp insertion cutoff.
Changed the scaledviolinplot.py script to a 75 bp cutoff. Reran it; will need to reconcatenate all of the violinframe.txt files and then replot.
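The 75 bp cutoff itself lives inside scaledviolinplot_div.py, but the filtering step amounts to dropping insertions below a minimum length. A toy bed-style sketch (column layout assumed here: contig, start, end, family; length = end - start):

```shell
# Keep only insertions >= 75 bp (toy bed-like input, hypothetical family names).
printf 'c1\t100\t160\tfamA\nc1\t200\t310\tfamB\n' > demo.bed
awk -v min=75 '($3 - $2) >= min' demo.bed > demo_min75.bed
cat demo_min75.bed
```

Only famB (110 bp) survives the cutoff; famA (60 bp) is dropped.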
```
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="hAT Maverick MULE-MuDR P PIF-Harbinger piggyBac Sola TcMar"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Helentron Helitron"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="tRNA-Deu"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Unknown"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Copia Gypsy LTR Pao"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="L1 L2 CR1 I Jockey R1 RTE"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="Penelope"
for NAME in $LIST; do python scaledviolinplot_div.py -c family -g twotaxa_sizefile_mrates.txt -t $NAME -w 3 -y 10; done
LIST="hAT Maverick MULE-MuDR P PIF-Harbinger piggyBac Sola TcMar"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_min75_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allDNA_min75_violinframe.txt
done
LIST="Helentron Helitron"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_min75_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allRC_min75_violinframe.txt
done
LIST="tRNA-Deu"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_min75_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allSINE_min75_violinframe.txt
done
LIST="Unknown"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_min75_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allUnknown_min75_violinframe.txt
done
LIST="Copia Gypsy LTR Pao"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_min75_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allLTR_min75_violinframe.txt
done
LIST="L1 L2 CR1 I Jockey R1 RTE"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_vmin75_iolinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allLINE_min75_violinframe.txt
done
LIST="Penelope"
for TE in $LIST; do
sed "s/iRic1/iRic1_${TE}/g" iSca_scaled_${TE}_min75_violinframe.txt >>both_${TE}_min75_violinframe.txt
sed -i "s/iSca/iSca_${TE}/g" both_${TE}_min75_violinframe.txt
done
for TE in $LIST; do
cat both_${TE}_min75_violinframe.txt >>both_allPLE_min75_violinframe.txt
done
grep -v "Taxon,Div" both_allDNA_min75_violinframe.txt >both_all_DNA_min75_violinframe.txt
grep -v "Taxon,Div" both_allRC_min75_violinframe.txt >both_all_RC_min75_violinframe.txt
grep -v "Taxon,Div" both_allPLE_min75_violinframe.txt >both_all_PLE_min75_violinframe.txt
grep -v "Taxon,Div" both_allSINE_min75_violinframe.txt >both_all_SINE_min75_violinframe.txt
grep -v "Taxon,Div" both_allLINE_min75_violinframe.txt >both_all_LINE_min75_violinframe.txt
grep -v "Taxon,Div" both_allLTR_min75_violinframe.txt >both_all_LTR_min75_violinframe.txt
grep -v "Taxon,Div" both_allUnknown_min75_violinframe.txt >both_all_Unknown_min75_violinframe.txt
echo "Taxon,Div" >header.txt
cat header.txt both_all_Unknown_min75_violinframe.txt >both_allUnknown_min75_violinframe.txt
cat header.txt both_all_LTR_min75_violinframe.txt >both_allLTR_min75_violinframe.txt
cat header.txt both_all_LINE_min75_violinframe.txt >both_allLINE_min75_violinframe.txt
cat header.txt both_all_SINE_min75_violinframe.txt >both_allSINE_min75_violinframe.txt
cat header.txt both_all_PLE_min75_violinframe.txt >both_allPLE_min75_violinframe.txt
cat header.txt both_all_DNA_min75_violinframe.txt >both_allDNA_min75_violinframe.txt
cat header.txt both_all_RC_min75_violinframe.txt >both_allRC_min75_violinframe.txt
cat both_all_DNA_min75_violinframe.txt both_all_LINE_min75_violinframe.txt both_all_LTR_min75_violinframe.txt both_all_PLE_min75_violinframe.txt both_all_RC_min75_violinframe.txt both_all_SINE_min75_violinframe.txt both_all_Unknown_min75_violinframe.txt >both_all_TE_min75_violinframe.txt
cat header.txt both_all_TE_min75_violinframe.txt >both_allTE_min75_violinframe.txt
grep iSca both_allTE_min75_violinframe.txt >iSca_all_TE_min75_violinframe.txt
cat header.txt iSca_all_TE_min75_violinframe.txt >iSca_allTE_min75_violinframe.txt
grep iRic1 both_allTE_min75_violinframe.txt >iRic1_all_TE_min75_violinframe.txt
cat header.txt iRic1_all_TE_min75_violinframe.txt >iRic1_allTE_min75_violinframe.txt
grep -v Unknown iSca_allTE_min75_violinframe.txt >iSca_all_TEminusUnknown_min75_violinframe.txt
cat header.txt iSca_all_TEminusUnknown_min75_violinframe.txt >iSca_allTEminusUnknown_min75_violinframe.txt
grep -v Unknown iRic1_allTE_min75_violinframe.txt >iRic1_all_TEminusUnknown_min75_violinframe.txt
cat header.txt iRic1_all_TEminusUnknown_min75_violinframe.txt >iRic1_allTEminusUnknown_min75_violinframe.txt
```
```
import pandas as pd
import os
import glob
import seaborn as sns
import matplotlib.pyplot as plt
plt.switch_backend('Agg')
from pylab import savefig
VIOLINFRAME = pd.read_csv('iSca_allTEminusUnknown_min75_violinframe.txt')
DIMS = (15, 5)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('iSca_AllTEminusUnknown_min75')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('iSca_AllTEminusUnknown_violin' + '_min75.png')
VIOLINFRAME = pd.read_csv('iRic1_allTEminusUnknown_min75_violinframe.txt')
DIMS = (15, 5)
FIG, ax = plt.subplots(figsize=DIMS)
sns.violinplot('Taxon', 'Div', data=VIOLINFRAME, scale="count", ax=ax, cut = 0)
ax.set_title('iRic1_AllTEminusUnknown_min75')
ax.yaxis.grid(True)
ax.yaxis.tick_right()
plt.xticks(rotation=90)
FIG.savefig('iRic1_AllTEminusUnknown_violin' + '_min75.png')
```
Generate hit counts:
```
LIST="iRic1 iRic2 iSca"
for NAME in $LIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_Unknown_consolidated_rm.bed | sort | uniq >${NAME}_unique_Unknown.txt
cat ${NAME}_unique_Unknown.txt | while read i; do COUNT=$(grep $i ${NAME}_Unknown_consolidated_rm.bed | wc -l); echo $i $COUNT >> ${NAME}_Unknown_unique_hitcounts.txt; done
sort -k2 -n -r ${NAME}_Unknown_unique_hitcounts.txt >${NAME}_Unknown_unique_hitcounts_sorted.txt
done
```
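One caveat with the loop above: grep $i matches substrings, so a family whose name is a prefix of another (e.g. a hypothetical fam-1 vs fam-10) gets inflated counts, and the file is rescanned once per family. A single awk pass keyed on an exact column-4 match avoids both problems (toy input below):

```shell
# Count hits per family name (exact match on column 4) in one scan.
printf 'c1\t1\t2\tfam-1\t10\nc1\t3\t4\tfam-10\t20\nc2\t5\t6\tfam-1\t30\n' > demo_rm.bed
awk '{count[$4]++} END {for (n in count) print n, count[n]}' demo_rm.bed \
  | sort -k2 -n -r > demo_hitcounts_sorted.txt
cat demo_hitcounts_sorted.txt
```

fam-1 is counted twice and fam-10 once; a grep-based count of "fam-1" would have reported three.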
Reworking the line plots to get rid of extra stuff. Altered curated_landscapes_bluePLE_mod.py.
```
LIST="iRic1 iRic2 iSca"
for NAME in $LIST; do
python curated_landscapes_bluePLE_mod.py -d 50 -g sizefile_mrates.txt
done
```
### January 2023:
Noted that there were also satellite/LINEs in the iSca data. Needed to go back and separate those.
Did so. Also created a simplified library that avoids all of the corrections that needed to be made before - ixodes_library_final_01242023_simple.fa.
Re-ran RepeatMasker. Now, what was the downstream processing?
Create 'consolidated' files:
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iSca"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
done
```
For downstream plots, copying these 'consolidated' files to the plotting directory. Will run catdata.
Concatenate them into a new all_rm_bed file.
```
cd /lustre/scratch/daray/ixodes/final_library_work/plots
cat ../iRic1_RM/*consolidated_rm.bed >iRic1_all_consolidated_rm.bed
cat ../iRic2_RM/*consolidated_rm.bed >iRic2_all_consolidated_rm.bed
cat ../iSca_RM/*consolidated_rm.bed >iSca_all_consolidated_rm.bed
cp iRic1_all_consolidated_rm.bed iRic1_all_rm.bed
cp iRic2_all_consolidated_rm.bed iRic2_all_rm.bed
cp iSca_all_consolidated_rm.bed iSca_all_rm.bed
sbatch cat_data.sh
cp iRic1_all_rm.bed iRic1_rm.bed
cp iRic2_all_rm.bed iRic2_rm.bed
cp iSca_all_rm.bed iSca_rm.bed
python filter_beds.py -g sizefile_mrates.txt -d 50
```
Use the altered curated_landscapes_bluePLE_mod.py.
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes.txt
done
```
How much of the satellite portion of the genome is derived from LINE/satellites?
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do grep "S""$(printf '\t')" ../${NAME}_RM/${NAME}_Satellite_rm.bed > ../${NAME}_RM/${NAME}_LINE-Satellite_rm.bed
awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_LINE-Satellite_rm.bed; echo $NAME
done
```
18833628
iRic1
27812943
iRic2
28653645
iSca
Divide each by total genome size
18833628/2961411907 = 0.0063596786233898
iRic1
27812943/3667593454 = 0.0075834313014345
iRic2
28653645/2226883318 = 0.0128671514885361
iSca
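The sum-then-divide arithmetic above can be folded into the awk call by passing the assembly length in as a variable. A sketch with toy numbers (GENOME_SIZE here is a placeholder, not a real assembly length):

```shell
# Sum column 5 and report it as a fraction of a supplied genome size (toy values).
printf 'c1\t1\t2\tsatA\t100\nc1\t3\t4\tsatB\t150\n' > demo_sat_rm.bed
GENOME_SIZE=100000
awk -v g=$GENOME_SIZE '{sum += $5} END {printf "%d bp, fraction %.6f\n", sum, sum / g}' demo_sat_rm.bed
```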
How much from satellites more generally?
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_Satellite_rm.bed; echo $NAME
done
```
40142831/2961411907 = 0.0135553014104903
iRic1
52725396/3667593454 = 0.0143760197691748
iRic2
42767966/2226883318 = 0.0192053017121753
iSca
====================================
How much of each genome is made up by simple repeats?
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_Simple_repeat_rm.bed; echo $NAME
done
```
23064767/2961411907 = 0.0077884359637648
iRic1
30046243/3667593454 = 0.0081923592068883
iRic2
32837827/2226883318 = 0.0147460923231003
iSca
========================
Do all categories.
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do for TENAME in DNA LINE LTR PLE RC SINE Satellite Simple_repeat Unknown
do awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_${TENAME}_rm.bed; echo $NAME $TENAME
done
done
```
The unknowns are the largest category. Let's see if we can figure out what some of the largest of those groups might be.
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do echo "Name Hits Total_bp" > ${NAME}_RM/${NAME}_Unknowns_table.txt
awk '{print $4}' ${NAME}_RM/${NAME}_Unknown_consolidated_rm.bed | sort | uniq >${NAME}_RM/${NAME}_Unknown_types.txt
cat ${NAME}_RM/${NAME}_Unknown_types.txt | while read i;
do COUNT=$(grep $i ${NAME}_RM/${NAME}_Unknown_consolidated_rm.bed | wc -l | awk '{print $1}')
MASS=$(awk -v TEID="$i" '$4 == TEID {sum += $5} END {print sum} ' ${NAME}_RM/${NAME}_Unknown_consolidated_rm.bed)
echo $i $COUNT $MASS >> ${NAME}_RM/${NAME}_Unknowns_table.txt
done
(tail -n +2 ${NAME}_RM/${NAME}_Unknowns_table.txt | sort -k2 -n -r) >${NAME}_RM/${NAME}_Unknowns_table_hits.txt
(tail -n +2 ${NAME}_RM/${NAME}_Unknowns_table.txt | sort -k3 -n -r) >${NAME}_RM/${NAME}_Unknowns_table_mass.txt
done
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do tail -n +2 ${NAME}_RM/${NAME}_Unknowns_table.txt | sort -k2 -n -r > ${NAME}_RM/${NAME}_Unknowns_table_hits.txt
tail -n +2 ${NAME}_RM/${NAME}_Unknowns_table.txt | sort -k3 -n -r > ${NAME}_RM/${NAME}_Unknowns_table_mass.txt
done
```
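The count-and-mass table above rescans the bed file once per family; both columns can be accumulated in a single awk pass. A sketch with toy input using the same column layout as the rm.bed files:

```shell
# Build the Name/Hits/Total_bp table in one scan, sorted by mass (column 3).
printf 'c1\t1\t2\tunkA\t100\nc1\t3\t4\tunkB\t50\nc2\t5\t6\tunkA\t70\n' > demo_unknown_rm.bed
{ echo "Name Hits Total_bp"
  awk '{hits[$4]++; mass[$4] += $5} END {for (n in hits) print n, hits[n], mass[n]}' \
    demo_unknown_rm.bed | sort -k3 -n -r
} > demo_unknowns_table_mass.txt
cat demo_unknowns_table_mass.txt
```

Sorting before prepending the header keeps the "Name Hits Total_bp" line at the top without a tail dance.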
Started to rehash these using the old TEAid data.
Sorted into categories:
- lt5_FL_copies - just remove from library as nothing
- partial_TE - some missed Penelopes and LINEs; separate into components and return to library
- Satellite - relabel as satellites and return to library
- to_split_satellite - split out the satellite component and decide what to do with the rest
- Unknown-SD - >5 copies, possible segmental duplications
- zero_pdf - toss
Changes made, in order:
Replaced **Satellite** --> ixodes_library_final_01282023_simple.fa
Removed **lt5_FL_copies** --> ixodes_library_final_01282023_simple_unknowns_deleted.fa
Renamed **Unknown-SD** to #Unknown/ultra --> ixodes_library_final_01282023_simple_ultra.fa
Incorporated **to_split_satellite** --> ixodes_library_final_01282023_simple_tosplit.fa
Incorporated the LINEs (penelopes) from **partial_TE** --> ixodes_library_final_01282023_simple_LINEs.fa
Incorporated the unknowns from **partial_TE** --> ixodes_library_final_01282023_simple_partial_TEs_unknown.fa
Need to split the LTRs from partial_TE before adding to the library
Did so --> ixodes_library_final_01292023_simple.fa
Create 'consolidated' files:
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iSca"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
done
```
For downstream plots, copying these 'consolidated' files to the plotting directory. Will run catdata.
Concatenate them into a new all_rm_bed file.
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
cat ../iRic1_RM/*consolidated_rm.bed >iRic1_all_consolidated_rm.bed
cat ../iRic2_RM/*consolidated_rm.bed >iRic2_all_consolidated_rm.bed
cat ../iSca_RM/*consolidated_rm.bed >iSca_all_consolidated_rm.bed
cp iRic1_all_consolidated_rm.bed iRic1_all_rm.bed
cp iRic2_all_consolidated_rm.bed iRic2_all_rm.bed
cp iSca_all_consolidated_rm.bed iSca_all_rm.bed
sbatch cat_data.sh
cp iRic1_all_rm.bed iRic1_rm.bed
cp iRic2_all_rm.bed iRic2_rm.bed
cp iSca_all_rm.bed iSca_rm.bed
python filter_beds.py -g sizefile_mrates.txt -d 50
```
Use the altered curated_landscapes_bluePLE_mod.py.
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes.txt
done
```
How much of the satellite portion of the genome is derived from LINE/satellites?
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do grep "S""$(printf '\t')" ../${NAME}_RM/${NAME}_Satellite_rm.bed > ../${NAME}_RM/${NAME}_LINE-Satellite_rm.bed
awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_LINE-Satellite_rm.bed; echo $NAME
done
```
20346537
iRic1
28557933
iRic2
30223691
iSca
Divide each by total genome size
20346537/2961411907 = 0.0068705528440357
iRic1
28557933/3667593454 = 0.007786559049737
iRic2
30223691/2226883318 = 0.0135721933680586
iSca
How much from satellites more generally?
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
for NAME in $LIST
do awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_Satellite_rm.bed; echo $NAME
done
```
89228558
iRic1
120300373
iRic2
97170331
iSca
89228558/2961411907 = 0.0301304110343742
iRic1
120300373/3667593454 = 0.0328009018744421
iRic2
97170331/2226883318 = 0.0436351245772815
iSca
Do all categories.
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="iRic1 iRic2 iSca"
TELIST="DNA LINE LTR PLE RC SINE Satellite Simple_repeat Unknown"
for NAME in $LIST
do for TENAME in $TELIST
do BP=$(awk -F' ' '{sum+=$5;} END{print sum;}' ../${NAME}_RM/${NAME}_${TENAME}_rm.bed)
echo $NAME $TENAME $BP >>${NAME}_masses.txt
done
done
```
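Given the per-taxon mass files produced above, the genome fractions can be computed by joining against a taxon-to-size table. The two-column sizefile layout below (taxon, length) is an assumption for illustration, not necessarily the real sizefile_mrates.txt format:

```shell
# Join per-category masses with genome sizes to get genome fractions (toy data).
printf 'iRic1 2961411907\n' > demo_sizes.txt
printf 'iRic1 DNA 1000000\niRic1 LINE 2500000\n' > demo_masses.txt
awk 'NR == FNR { size[$1] = $2; next }
     { printf "%s %s %.6f\n", $1, $2, $3 / size[$1] }' demo_sizes.txt demo_masses.txt
```

The NR == FNR idiom loads the size table from the first file, then the second file is streamed through the printf.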
Received a new and improved I. ricinus genome assembly. Much higher quality: only 7,514 contigs vs. 20k+ in the previous versions.
Ixodes_ricinus_v1.0.fasta.gz
Calling it iRic3 for now, for ease of use.
Uploaded to /lustre/scratch/daray/ixodes/assemblies
Start analysis from the beginning to ensure nothing was missed due to the low quality of the previous assemblies:
```
cd /lustre/scratch/daray/ixodes
LIST="iRic3"
for NAME in $LIST; do mkdir $NAME; done
cd $NAME
cp /lustre/scratch/daray/curation_templates/* .
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" rmodel_template.sh >${NAME}_rmodel.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" extend_align_template.sh >${NAME}_extend_align.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" repeatclassifier_template.sh >${NAME}_repeatclassifier.sh; done
for NAME in $LIST; do sed "s/<NAME>/$NAME/g" TEcurate_template.sh >${NAME}_TEcurate.sh; done
```
Modify all the scripts to redirect properly.
```
for NAME in $LIST; do mkdir $NAME; sbatch ${NAME}_rmodel.sh; done
```
N=4222 in the iRic3-families.mod.fa file.
**Before doing anything else, reduce the complexity of the RepeatModeler output using usearch.**
Modified procedure from https://hackmd.io/n1FOvqtnTRaoKr_9aK1EPg
```
cd /lustre/scratch/daray/ixodes/iRic3
mkdir usearch_work
cd usearch_work
cp /lustre/scratch/daray/ixodes/iRic3/rmodeler_dir/iRic3-families.mod.fa .
cp /lustre/scratch/daray/ixodes/final_library_work/ixodes_library_final_01292023_simple.fa .
/lustre/work/daray/software/usearch11.0.667_i86linux32 \
-usearch_global iRic3-families.mod.fa \
-db ixodes_library_final_01292023_simple.fa \
-strand both \
-id 0.85 \
-minsl 0.95 \
-maxsl 1.05 \
-maxaccepts 1 \
-maxrejects 128 \
-userfields query+target+id+ql+tl \
-userout iRic3_hits.tsv
awk '{print $1}' iRic3_hits.tsv >iRic3_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_remove.txt -f iRic3-families.mod.fa -o iRic3_retained.fa
```
Eliminated 539. Not as many as I was hoping. Need to figure this out. Why are so many new things being found?
I'm thinking that the raw RepeatModeler output is going to be significantly shorter than the extended/finalized versions. So.....
Change lower limit to 50%
```
/lustre/work/daray/software/usearch11.0.667_i86linux32 -usearch_global iRic3-families.mod.fa -db ixodes_library_final_01292023_simple.fa -strand both -id 0.85 -minsl 0.50 -maxsl 1.05 -maxaccepts 1 -maxrejects 128 -userfields query+target+id+ql+tl -userout iRic3_hits.tsv
```
Higher hit rate, 24.2% vs. 10.5% for the previous run.
```
awk '{print $1}' iRic3_hits.tsv >iRic3_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_remove.txt -f iRic3-families.mod.fa -o iRic3_retained.fa
```
That eliminated 1244. The result is now called iRic3_retained.fa.
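The -minsl/-maxsl options bound the query/target length ratio usearch will accept (assuming, as the paired 0.95/1.05 bounds suggest, the ratio is query length over target length), so loosening -minsl to 0.50 admits queries down to half the target length. The effect can be sketched directly on the userfields columns (query, target, id, ql, tl):

```shell
# Toy hits table in the userfields layout: query target id ql tl.
printf 'q1 t1 0.90 500 1000\nq2 t2 0.90 980 1000\n' > demo_hits.tsv
echo "minsl 0.95 keeps:"; awk '($4 / $5) >= 0.95 {print $1}' demo_hits.tsv
echo "minsl 0.50 keeps:"; awk '($4 / $5) >= 0.50 {print $1}' demo_hits.tsv
```

The half-length query q1 only passes under the looser bound, which is why the hit (and removal) rate goes up.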
Go ahead and do the extension pipeline and then re-run this procedure.
**Redirect the extend_align script to the iRic3_retained.fa file.**
Changed the appropriate line in the script to:
"CONSENSUSFILE=$WORKDIR/usearch_work/iRic3_retained.fa"
```
cd /lustre/scratch/daray/ixodes/iRic3
sbatch iRic3_extend_align.sh
```
Ran into trouble with line lengths in the original assembly download.
Fixed with:
```
actconda
conda activate gatk4
gatk NormalizeFasta I=iRic3.fa O=iRic3_normal.fa
```
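NormalizeFasta fixes the ragged line lengths by rewrapping each record's sequence to a uniform width. A minimal awk sketch of the same idea (not the gatk/Picard tool itself; width 8 is just for the demo):

```shell
# Rewrap each FASTA record's sequence lines to a fixed width.
printf '>seq1\nACGTAC\nGT\nACGTACGT\n' > demo.fa
awk -v w=8 '
  function flush() { for (i = 1; i <= length(seq); i += w) print substr(seq, i, w); seq = "" }
  /^>/ { flush(); print; next }   # emit any buffered sequence, then the header
  { seq = seq $0 }                # accumulate ragged sequence lines
  END { flush() }' demo.fa > demo_normal.fa
cat demo_normal.fa
```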
Ran extension script and got output.
Concatenated the likely TEs and possible SDs:
```
cd /lustre/scratch/daray/ixodes/iRic3/extensions/images_and_alignments
cat likely_TEs/*rep.fa >../likely_TEs.fa
cat possible_SD/*rep.fa >../possible_SDs.fa
```
Now, to eliminate possible duplicates again. First with the likely TEs.
```
cd /lustre/scratch/daray/ixodes/iRic3/usearch_work
cp ../extensions/likely_TEs.fa iRic3_likely_TEs.fa
cp ../extensions/possible_SDs.fa iRic3_possible_SDs.fa
/lustre/work/daray/software/usearch11.0.667_i86linux32 \
-usearch_global iRic3_likely_TEs.fa \
-db ixodes_library_final_01292023_simple.fa \
-strand both \
-id 0.85 \
-minsl 0.95 \
-maxsl 1.05 \
-maxaccepts 1 \
-maxrejects 128 \
-userfields query+target+id+ql+tl \
-userout iRic3_likely_TEs_hits.tsv
awk '{print $1}' iRic3_likely_TEs_hits.tsv >iRic3_likely_TEs_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_likely_TEs_remove.txt -f iRic3_likely_TEs.fa -o iRic3_likely_TEs_retained.fa
```
This eliminates 1159.
Now with the possible SDs. Note change to -minsl 0.50.
```
cd /lustre/scratch/daray/ixodes/iRic3/usearch_work
/lustre/work/daray/software/usearch11.0.667_i86linux32 \
-usearch_global iRic3_possible_SDs.fa \
-db ixodes_library_final_01292023_simple.fa \
-strand both \
-id 0.85 \
-minsl 0.50 \
-maxsl 1.05 \
-maxaccepts 1 \
-maxrejects 128 \
-userfields query+target+id+ql+tl \
-userout iRic3_possible_SDs_hits.tsv
awk '{print $1}' iRic3_possible_SDs_hits.tsv >iRic3_possible_SDs_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_possible_SDs_remove.txt -f iRic3_possible_SDs.fa -o iRic3_possible_SDs_retained.fa
```
That removed 39 of the 154 possible SDs.
For the likely TEs, it might be worth going down to 80% length for the cutoff.
```
cd /lustre/scratch/daray/ixodes/iRic3/usearch_work
/lustre/work/daray/software/usearch11.0.667_i86linux32 \
-usearch_global iRic3_likely_TEs.fa \
-db ixodes_library_final_01292023_simple.fa \
-strand both \
-id 0.85 \
-minsl 0.81 \
-maxsl 1.19 \
-maxaccepts 1 \
-maxrejects 128 \
-userfields query+target+id+ql+tl \
-userout iRic3_likely_TEs_hits.tsv
awk '{print $1}' iRic3_likely_TEs_hits.tsv >iRic3_likely_TEs_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_likely_TEs_remove.txt -f iRic3_likely_TEs.fa -o iRic3_likely_TEs_retained.fa
```
Eliminated a few hundred more, 1157. I'll take it.
Collect these to use for the TEcurate run.
```
cat iRic3_likely_TEs_retained.fa iRic3_possible_SDs_retained.fa >iRic3_secondUsearch_retained.fa
```
Use iRic3_secondUsearch_retained.fa as the start for the TEcurate run.
To prepare for this, need to modify the repeatclassifier script to deal with the new file name.
Modified script.
```
#!/bin/bash
#SBATCH --job-name=iRic3_classify
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=60G
module --ignore-cache load gcc/10.1.0 r/4.0.2
. ~/conda/etc/profile.d/conda.sh
conda activate repeatmodeler
TAXON=iRic3
WORKDIR=/lustre/scratch/daray/ixodes/${TAXON}
#mkdir -p $WORKDIR/repeatclassifier
#cd $WORKDIR/extensions/final_consensuses
#Removed for specialized iRic3 run.
#cat ${TAXON}*.fa >$WORKDIR/repeatclassifier/${TAXON}_extended_rep.fa
#Renamed script from Usearch work
cd $WORKDIR/repeatclassifier
cp iRic3_secondUsearch_retained.fa ${TAXON}_extended_rep.fa
#Run RepeatClassifier
RepeatClassifier -consensi ${TAXON}_extended_rep.fa
```
```
cd /lustre/scratch/daray/ixodes/iRic3/usearch_work
mkdir ../repeatclassifier
cp iRic3_secondUsearch_retained.fa ../repeatclassifier
cd ../repeatclassifier
cd ..
sbatch iRic3_repeatclassifier.sh
```
8722584 nocona iRic3_cl daray PD 0:00 1 (Priority)
```
sbatch --dependency=afterok:8722584 iRic3_TEcurate.sh
```
Started working through output in the fordownload folder and noticed a lot of potentially previously identified Penelope elements.
Need to re-run usearch with modified options. I think that using -maxsl 1.05 caused me to miss a lot of tandem Penelope elements.
Yep. Matched ~1/3 of the elements.
```
#On local
cd ixodes/ricinus/iRic3/LINE/
cat *rep.fa >iRic3_LINEs.fa
#On HPCC
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/usearch
/lustre/work/daray/software/usearch11.0.667_i86linux32 -usearch_global iRic3_LINEs.fa -db ixodes_library_final_01292023_simple.fa -strand both -id 0.90 -minsl 0.50 -maxsl 1.50 -maxaccepts 1 -maxrejects 128 -userfields query+target+id+ql+tl -userout iRic3_LINE_hits.tsv
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only
00:00 71Mb 100.0% Reading ixodes_library_final_01292023_simple.fa
00:01 38Mb 100.0% Masking (fastnucleo)
00:02 39Mb 100.0% Word stats
00:02 39Mb 100.0% Alloc rows
00:03 138Mb 100.0% Build index
00:03 171Mb CPU has 128 cores, defaulting to 10 threads
WARNING: Max OMP threads 1
05:24 267Mb 100.0% Searching, 30.4% matched
awk '{print $1}' iRic3_LINE_hits.tsv >iRic3_LINEs_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_LINEs_remove.txt -f iRic3_LINEs.fa -o iRic3_LINEs_retained.fa
LINEDIR=/lustre/scratch/daray/ixodes/iRic3/fordownload/LINE
mkdir $LINEDIR/usearch_hits
cat iRic3_LINEs_remove.txt | while read i; do
mv $LINEDIR/${i}* $LINEDIR/usearch_hits
done
```
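One thing to watch in the move loop above: the ${i}* glob sweeps up every file whose name merely starts with the removed ID, so an ID that is a prefix of another would move too much. Anchoring the glob on the separator that follows the ID avoids that; the file names and the "." separator below are hypothetical, for illustration only:

```shell
# Hypothetical family file names; "." after the ID is an assumed separator.
cd "$(mktemp -d)"
mkdir -p LINE/usearch_hits
touch LINE/fam-1.rep.fa LINE/fam-10.rep.fa
i=fam-1
mv LINE/${i}.* LINE/usearch_hits/   # ${i}* alone would also grab fam-10.rep.fa
ls LINE
```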
Do the same with LTRs and NOHITs. Could be helpful.
```
#On HPCC
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/LTR
cat *rep.fa >iRic3_LTRs.fa
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/usearch
cp /lustre/scratch/daray/ixodes/iRic3/fordownload/LTR/iRic3_LTRs.fa .
/lustre/work/daray/software/usearch11.0.667_i86linux32 -usearch_global iRic3_LTRs.fa -db ixodes_library_final_01292023_simple.fa -strand both -id 0.80 -minsl 0.50 -maxsl 1.50 -maxaccepts 1 -maxrejects 128 -userfields query+target+id+ql+tl -userout iRic3_LTR_hits.tsv
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only
00:01 71Mb 100.0% Reading ixodes_library_final_01292023_simple.fa
00:01 38Mb 100.0% Masking (fastnucleo)
00:02 39Mb 100.0% Word stats
00:02 39Mb 100.0% Alloc rows
00:04 138Mb 100.0% Build index
00:04 171Mb CPU has 128 cores, defaulting to 10 threads
WARNING: Max OMP threads 1
06:03 279Mb 100.0% Searching, 19.6% matched
WARNING: Input has lower-case masked sequences
awk '{print $1}' iRic3_LTR_hits.tsv >iRic3_LTRs_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_LTRs_remove.txt -f iRic3_LTRs.fa -o iRic3_LTRs_retained.fa
LTRDIR=/lustre/scratch/daray/ixodes/iRic3/fordownload/LTR
mkdir $LTRDIR/usearch_hits
cat iRic3_LTRs_remove.txt | while read i; do
mv $LTRDIR/${i}* $LTRDIR/usearch_hits
done
```
Only 6, but ok.
NOHITs
```
#On HPCC
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/NOHIT
cat *rep.fa >iRic3_NOHITs.fa
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/usearch
cp /lustre/scratch/daray/ixodes/iRic3/fordownload/NOHIT/iRic3_NOHITs.fa .
/lustre/work/daray/software/usearch11.0.667_i86linux32 -usearch_global iRic3_NOHITs.fa -db ixodes_library_final_01292023_simple.fa -strand both -id 0.80 -minsl 0.50 -maxsl 1.50 -maxaccepts 1 -maxrejects 128 -userfields query+target+id+ql+tl -userout iRic3_NOHIT_hits.tsv
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only
00:00 71Mb 100.0% Reading ixodes_library_final_01292023_simple.fa
00:00 38Mb 100.0% Masking (fastnucleo)
00:01 39Mb 100.0% Word stats
00:01 39Mb 100.0% Alloc rows
00:02 138Mb 100.0% Build index
00:02 171Mb CPU has 128 cores, defaulting to 10 threads
WARNING: Max OMP threads 1
03:43 344Mb 100.0% Searching, 20.3% matched
WARNING: Input has lower-case masked sequences
awk '{print $1}' iRic3_NOHIT_hits.tsv >iRic3_NOHITs_remove.txt
actconda
python ~/gitrepositories/bioinfo_tools/remove_seqs2.py -l iRic3_NOHITs_remove.txt -f iRic3_NOHITs.fa -o iRic3_NOHITs_retained.fa
NOHITDIR=/lustre/scratch/daray/ixodes/iRic3/fordownload/NOHIT
mkdir $NOHITDIR/usearch_hits
cat iRic3_NOHITs_remove.txt | while read i; do
mv $NOHITDIR/${i}* $NOHITDIR/usearch_hits
done
```
Got rid of 83.
Have been triaging and finalizing the results on a local computer.
Need to work through the LTR retrotransposons and split off the LTR and internal fragments. Use the code from Jessica Storer.
Transfer the files to HPCC and work on them there.
```
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/from_local_work/LTR/good
cp /lustre/scratch/daray/ixodes/iSca/round1-naming/combined_naming/LTR/ltr_split/crossMatchLTRParse.mod.py .
cp /lustre/scratch/daray/ixodes/iSca/round1-naming/combined_naming/LTR/ltr_split/GrepCrossmatch .
mkdir crossmatchoutput
for FILE in *mod.fa
do TENAME=$(basename $FILE rep_mod.fa)
/lustre/work/daray/software/cross_match/cross_match $FILE >crossmatchoutput/${TENAME}.crossmatch.out
done
for FILE in *mod.fa
do TENAME=$(basename $FILE rep_mod.fa)
perl GrepCrossmatch crossmatchoutput/${TENAME}.crossmatch.out > crossmatchoutput/${TENAME}.crossmatch.linesOnly.out
done
actconda
for FILE in *mod.fa
do TENAME=$(basename $FILE rep_mod.fa)
python crossMatchLTRParse.mod.py crossmatchoutput/${TENAME}.crossmatch.linesOnly.out $FILE ${TENAME}.LTR_int_split.fa
done
```
11 of them didn't work. Will need to handle manually.
Did so.
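For the record, listing which inputs failed to split can be scripted (a sketch; assumes the `*rep_mod.fa` to `*.LTR_int_split.fa` naming from the block above, with toy file names):

```shell
# Toy demo in a sandbox; in practice run the loop in the LTR 'good' directory.
cd "$(mktemp -d)"
touch TE1_rep_mod.fa TE2_rep_mod.fa        # two hypothetical inputs
echo ">TE1" >TE1_.LTR_int_split.fa         # only TE1 produced split output
for FILE in *rep_mod.fa; do
    TENAME=$(basename $FILE rep_mod.fa)    # note: keeps the trailing underscore
    [ -s "${TENAME}.LTR_int_split.fa" ] || echo "no split output: $FILE"
done
```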
Moving on to NOHITs.
Reclassified the presumed nonautonomous DNA transposons.
```
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/from_local_work/NOHIT/nonautonomous_DNA
for FILE in *rep_mod.fa; do
TENAME=$(basename $FILE _rep_mod.fa)
sed "s|Unknown/Unknown|DNA/TIR-like|g" $FILE >${TENAME}_rep_mod_rename.fa
done
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/from_local_work/NOHIT/satellite
for FILE in *rep_mod.fa; do
TENAME=$(basename $FILE _rep_mod.fa)
sed "s|Unknown/Unknown|Satellite/Satellite|g" $FILE >${TENAME}_rep_mod_rename.fa
done
```
Also realized I need to reclassify the penelopes.
```
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/from_local_work/LINE/good-complete
for FILE in *rep_mod.fa; do
TENAME=$(basename $FILE _rep_mod.fa)
sed -i "s|LINE/Penelope|PLE/Penelope|g" $FILE
done
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/from_local_work/LINE/to_trim_penelope-complete
for FILE in *rep_mod_trim.fa; do
TENAME=$(basename $FILE _rep_mod_trim.fa)
sed -i "s|LINE/Penelope|PLE/Penelope|g" $FILE
done
```
Finished everything up as follows:
```
cd /lustre/scratch/daray/ixodes/iRic3/fordownload/from_local_work
cd DNA-complete/
ls
cd good-complete/
cat *_rep_mod.fa >../good-DNA.fa
cd ../
cat unknown-complete/*_rep_mod.fa >good-unknown.fa
cd ..
cd LINE-complete/
cat good-complete/*_rep_mod.fa >good-LINE.fa
cat to_trim_penelope-complete/*trim.fa >good-penelopetrim.fa
cat to_trim_satellite-complete/*trim.fa >good-satellitetrim.fa
cd ../LTR-complete/
cat good-complete/*split.fa >good-ltr.fa
cat to_trim_tandem-complete/*trim.fa >good-ltrtrim.fa
cd ../NOHIT-complete/
cat nonautonomous_DNA-complete/*rename.fa >good-nonautonomous.fa
cat satellite-complete/*rename.fa >good-satellite.fa
cat unknown-complete/*mod.fa >good-unknown.fa
cd ../RC-complete/
cat *mod.fa >good-RC.fa
cd ..
cat DNA-complete/good*.fa LINE-complete/good*.fa LTR-complete/good*.fa NOHIT-complete/good*.fa RC-complete/good*.fa >iRic3_complete.fa
cd ../../../
cd final_library_work/
cat ixodes_library_final_01292023_simple.fa ../iRic3/fordownload/from_local_work/iRic3_complete-simple.fa >ixodes_library_final_03142023_simple.fa
```
New final library --> ixodes_library_final_03142023_simple.fa
Run RepeatMasker on all three genome assemblies.
Appropriate edits and...
```
sbatch iRic1_RM.sh
sbatch iRic2_RM.sh
sbatch iRic3_RM.sh
sbatch iSca_RM.sh
```
### March 2023:
Now that all the RepeatMasker runs are done, get back to analyzing the output.
Create 'consolidated' files:
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iRic3 iSca"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
done
```
For downstream plots, copying these 'consolidated' files to the plotting directory. Will run catdata.
Concatenate them into a new all_rm_bed file.
Need to add iRic3 to sizefile.txt, sizefile_mrates_4_curated_landscapes.txt, and sizefile_mrates.txt.
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
cat ../iRic1_RM/*consolidated_rm.bed >iRic1_all_consolidated_rm.bed
cat ../iRic2_RM/*consolidated_rm.bed >iRic2_all_consolidated_rm.bed
cat ../iRic3_RM/*consolidated_rm.bed >iRic3_all_consolidated_rm.bed
cat ../iSca_RM/*consolidated_rm.bed >iSca_all_consolidated_rm.bed
cp iRic1_all_consolidated_rm.bed iRic1_all_rm.bed
cp iRic2_all_consolidated_rm.bed iRic2_all_rm.bed
cp iRic3_all_consolidated_rm.bed iRic3_all_rm.bed
cp iSca_all_consolidated_rm.bed iSca_all_rm.bed
sbatch cat_data.sh
cp iRic1_all_rm.bed iRic1_rm.bed
cp iRic2_all_rm.bed iRic2_rm.bed
cp iRic3_all_rm.bed iRic3_rm.bed
cp iSca_all_rm.bed iSca_rm.bed
### Wait for cat_data.sh to be finished.
python filter_beds.py -g sizefile_mrates.txt -d 50
```
Use altered curated_landscapes_bluePLE_mod.py
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes.txt
```
STOP. Something is messed up with the unknowns in iRic3


Need to do some investigating.
What is forming that huge peak?
```
TAXONLIST="iRic1 iRic2 iRic3 iSca"
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_Unknown_rm.bed | sort | uniq >${NAME}_Unknown_IDs.txt
cat ${NAME}_Unknown_IDs.txt | while read i; do COUNT=$(grep $i ${NAME}_Unknown_rm.bed | wc -l | awk '{print $1}')
SUM=$(awk -v i="$i" '$4 == i {sum += $5} END {print sum}' ${NAME}_Unknown_rm.bed)
echo $i $COUNT $SUM >>${NAME}_Unknown_IDs_counts.txt
done
sort -k3 -n -r ${NAME}_Unknown_IDs_counts.txt >${NAME}_Unknown_IDs_sort_counts.txt
done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_Unknown_rm.bed | sort | uniq >${NAME}_Unknown_types.txt
cat ${NAME}_Unknown_types.txt | while read i; do COUNT=$(grep $i ${NAME}_Unknown_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_Unknown_counts.txt; done; done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_LINE_rm.bed | sort | uniq >${NAME}_LINE_IDs.txt
cat ${NAME}_LINE_IDs.txt | while read i; do COUNT=$(grep $i ${NAME}_LINE_rm.bed | wc -l | awk '{print $1}')
SUM=$(awk -v i="$i" '$4 == i {sum += $5} END {print sum}' ${NAME}_LINE_rm.bed)
echo $i $COUNT $SUM >>${NAME}_LINE_IDs_counts.txt
done
sort -k3 -n -r ${NAME}_LINE_IDs_counts.txt >${NAME}_LINE_IDs_sort_counts.txt
done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_LINE_rm.bed | sort | uniq >${NAME}_LINE_types.txt
cat ${NAME}_LINE_types.txt | while read i; do COUNT=$(grep $i ${NAME}_LINE_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_LINE_counts.txt; done; done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_LTR_rm.bed | sort | uniq >${NAME}_LTR_IDs.txt
cat ${NAME}_LTR_IDs.txt | while read i; do COUNT=$(grep $i ${NAME}_LTR_rm.bed | wc -l | awk '{print $1}')
SUM=$(awk -v i="$i" '$4 == i {sum += $5} END {print sum}' ${NAME}_LTR_rm.bed)
echo $i $COUNT $SUM >>${NAME}_LTR_IDs_counts.txt
done
sort -k3 -n -r ${NAME}_LTR_IDs_counts.txt >${NAME}_LTR_IDs_sort_counts.txt
done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_LTR_rm.bed | sort | uniq >${NAME}_LTR_types.txt
cat ${NAME}_LTR_types.txt | while read i; do COUNT=$(grep $i ${NAME}_LTR_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_LTR_counts.txt; done; done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_RC_rm.bed | sort | uniq >${NAME}_RC_IDs.txt
cat ${NAME}_RC_IDs.txt | while read i; do COUNT=$(grep $i ${NAME}_RC_rm.bed | wc -l | awk '{print $1}')
SUM=$(awk -v i="$i" '$4 == i {sum += $5} END {print sum}' ${NAME}_RC_rm.bed)
echo $i $COUNT $SUM >>${NAME}_RC_IDs_counts.txt
done
sort -k3 -n -r ${NAME}_RC_IDs_counts.txt >${NAME}_RC_IDs_sort_counts.txt
done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_RC_rm.bed | sort | uniq >${NAME}_RC_types.txt
cat ${NAME}_RC_types.txt | while read i; do COUNT=$(grep $i ${NAME}_RC_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_RC_counts.txt; done; done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_DNA_rm.bed | sort | uniq >${NAME}_DNA_IDs.txt
cat ${NAME}_DNA_IDs.txt | while read i; do COUNT=$(grep $i ${NAME}_DNA_rm.bed | wc -l | awk '{print $1}')
SUM=$(awk -v i="$i" '$4 == i {sum += $5} END {print sum}' ${NAME}_DNA_rm.bed)
echo $i $COUNT $SUM >>${NAME}_DNA_IDs_counts.txt
done
sort -k3 -n -r ${NAME}_DNA_IDs_counts.txt >${NAME}_DNA_IDs_sort_counts.txt
done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_DNA_rm.bed | sort | uniq >${NAME}_DNA_types.txt
cat ${NAME}_DNA_types.txt | while read i; do COUNT=$(grep $i ${NAME}_DNA_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_DNA_counts.txt; done; done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $4}' ${NAME}_PLE_rm.bed | sort | uniq >${NAME}_PLE_IDs.txt
cat ${NAME}_PLE_IDs.txt | while read i; do COUNT=$(grep $i ${NAME}_PLE_rm.bed | wc -l | awk '{print $1}')
SUM=$(awk -v i="$i" '$4 == i {sum += $5} END {print sum}' ${NAME}_PLE_rm.bed)
echo $i $COUNT $SUM >>${NAME}_PLE_IDs_counts.txt
done
sort -k3 -n -r ${NAME}_PLE_IDs_counts.txt >${NAME}_PLE_IDs_sort_counts.txt
done
for NAME in $TAXONLIST; do
cd /lustre/scratch/daray/ixodes/final_library_work/${NAME}_RM
awk '{print $8}' ${NAME}_PLE_rm.bed | sort | uniq >${NAME}_PLE_types.txt
cat ${NAME}_PLE_types.txt | while read i; do COUNT=$(grep $i ${NAME}_PLE_rm.bed | wc -l | awk '{print $1}'); echo $i $COUNT >>${NAME}_PLE_counts.txt; done
done
```
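A caveat on the loops above: `grep $i` also counts IDs that merely contain `$i` as a substring (e.g. family-1 matches family-10), and running grep once per ID is slow. The same per-family counts and summed lengths can be computed in one awk pass per file, with exact matching on column 4. A sketch on toy data (column 4 = family ID, column 5 = element length, as above):

```shell
cd "$(mktemp -d)"
printf 'c1\t0\t100\tfamA\t100\nc1\t200\t250\tfamA\t50\nc2\t0\t70\tfamB\t70\n' >demo_Unknown_rm.bed
# one pass: ID, hit count, summed length; then sort by summed bp
awk '{n[$4]++; s[$4]+=$5} END {for (id in n) print id, n[id], s[id]}' demo_Unknown_rm.bed |
    sort -k3,3nr >demo_Unknown_IDs_sort_counts.txt
cat demo_Unknown_IDs_sort_counts.txt    # famA 2 150, then famB 1 70
```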
### April 2023:
Got a new, presumably filtered, version of the better I. ricinus genome assembly: Ixodes_ricinus_v1.2.fasta.gz, which I copied to iRic4.fa.gz.
Ran RepeatMasker using iRic4_RM.sh
Get back to analyzing the output.
Create 'consolidated' files:
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iRic4 iSca"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
done
```
For downstream plots, copying these 'consolidated' files to the plotting directory. Will run catdata.
Concatenate them into a new all_rm_bed file.
Need to add iRic4 to sizefile.txt, sizefile_mrates_4_curated_landscapes.txt, and sizefile_mrates.txt.
```
nano sizefile_mrates_4_curated_landscapes.txt
Ixodes_ricinus1 2961411907 3.0e-9 iRic1
Ixodes_ricinus2 3667593454 3.0e-9 iRic2
Ixodes_ricinus3 2288563899 3.0e-9 iRic4
Ixodes_scapularis 2226883318 3.0e-9 iSca
nano sizefile_mrates.txt
iRic1 2961411907 3.0e-9
iRic2 3667593454 3.0e-9
iRic4 2288563899 3.0e-9
iSca 2226883318 3.0e-9
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
cat ../iRic1_RM/*consolidated_rm.bed >iRic1_all_consolidated_rm.bed
cat ../iRic2_RM/*consolidated_rm.bed >iRic2_all_consolidated_rm.bed
cat ../iRic4_RM/*consolidated_rm.bed >iRic4_all_consolidated_rm.bed
cat ../iSca_RM/*consolidated_rm.bed >iSca_all_consolidated_rm.bed
cp iRic1_all_consolidated_rm.bed iRic1_all_rm.bed
cp iRic2_all_consolidated_rm.bed iRic2_all_rm.bed
cp iRic4_all_consolidated_rm.bed iRic4_all_rm.bed
cp iSca_all_consolidated_rm.bed iSca_all_rm.bed
sbatch cat_data.sh
cp iRic1_all_rm.bed iRic1_rm.bed
cp iRic2_all_rm.bed iRic2_rm.bed
cp iRic4_all_rm.bed iRic4_rm.bed
cp iSca_all_rm.bed iSca_rm.bed
### Wait for cat_data.sh to be finished.
python filter_beds.py -g sizefile_mrates.txt -d 50
```
Use altered curated_landscapes_bluePLE_mod.py
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes.txt
```
Is the huge peak still there?
Yes.
Investigated and the peak appears to be caused by a mislabeled satellite repeat.
It was called iRic2.6.4895#Unknown/Unknown. I've since renamed it to iRic2.6.4895#Satellite/Satellite.
My first thought was that this repeat was an artifact of misassembled contigs/scaffolds, so I queried the raw reads to see whether it exists in the wild.
Yes, it does. See output in /lustre/scratch/daray/ixodes/odd_repeat_search/iRic2.6.4895_v_ont.out.
Now the question is: why does this repeat not show up in the other individuals?
Need to run the same analysis with the raw data from those samples.
Will do that.
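For reference, the rename itself is a single substitution over the library fasta (and would need to be mirrored in any derived bed files). A minimal sketch on a toy record:

```shell
cd "$(mktemp -d)"
printf '>iRic2.6.4895#Unknown/Unknown\nACGTACGT\n' >demo_lib.fa
sed -i 's|iRic2.6.4895#Unknown/Unknown|iRic2.6.4895#Satellite/Satellite|' demo_lib.fa
head -1 demo_lib.fa    # -> >iRic2.6.4895#Satellite/Satellite
```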
For now, re-run the graph to see if the relabeling got rid of the weird peak.
Create 'consolidated' files:
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="iRic1 iRic2 iRic4 iSca"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_Satellite_rm.bed ${TAXON}_RM/${TAXON}_Satellite_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_Simple_repeat_rm.bed ${TAXON}_RM/${TAXON}_Simple_repeat_consolidated_rm.bed
done
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
cat ../iRic1_RM/*consolidated_rm.bed >iRic1_all_consolidated_rm.bed
cat ../iRic2_RM/*consolidated_rm.bed >iRic2_all_consolidated_rm.bed
cat ../iRic4_RM/*consolidated_rm.bed >iRic4_all_consolidated_rm.bed
cat ../iSca_RM/*consolidated_rm.bed >iSca_all_consolidated_rm.bed
cp iRic1_all_consolidated_rm.bed iRic1_all_rm.bed
cp iRic2_all_consolidated_rm.bed iRic2_all_rm.bed
cp iRic4_all_consolidated_rm.bed iRic4_all_rm.bed
cp iSca_all_consolidated_rm.bed iSca_all_rm.bed
sbatch cat_data.sh
cp iRic1_all_rm.bed iRic1_rm.bed
cp iRic2_all_rm.bed iRic2_rm.bed
cp iRic4_all_rm.bed iRic4_rm.bed
cp iSca_all_rm.bed iSca_rm.bed
### Wait for cat_data.sh to be finished.
python filter_beds.py -g sizefile_mrates.txt -d 50
```
Use altered curated_landscapes_bluePLE_mod.py
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes.txt
```
Still looking a little weird. Rodrigo reassembled using Flye and provided new assemblies. They're not as contiguous but worth examining.
Maya_Flye_raw.fasta
Murphy_Flye_raw.fasta
Compress and rename for RM analysis.
```
cd /lustre/scratch/daray/ixodes/assemblies
gzip -c Maya_Flye_raw.fasta >maya_flye.fa.gz
gzip -c Murphy_Flye_raw.fasta >murphy_flye.fa.gz
```
Got new assemblies from Katie (in Travis' lab). All are from the same assembly method and (we hope) will resolve this problem.
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="maya_flye_TCG murphy_flye_TCG kees_flye_TCG"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_Satellite_rm.bed ${TAXON}_RM/${TAXON}_Satellite_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_Simple_repeat_rm.bed ${TAXON}_RM/${TAXON}_Simple_repeat_consolidated_rm.bed
done
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="maya_flye_TCG murphy_flye_TCG kees_flye_TCG"
for TAXON in $LIST; do
cat ../${TAXON}_RM/*consolidated_rm.bed >${TAXON}_all_consolidated_rm.bed
cp ${TAXON}_all_consolidated_rm.bed ${TAXON}_all_rm.bed
done
sbatch cat_data.sh
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="maya_flye_TCG murphy_flye_TCG kees_flye_TCG"
for TAXON in $LIST; do
cp ${TAXON}_all_rm.bed ${TAXON}_rm.bed
done
### Wait for cat_data.sh to be finished.
python filter_beds.py -g sizefile_mrates_TCG.txt -d 50
```
Use altered curated_landscapes_bluePLE_mod.py
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes_TCG.txt
cd /lustre/scratch/daray/ixodes/final_library_work/plots
rm proportions_table.tsv
echo "Species Genome_size TE_proportion LINE_proportion SINE_proportion LTR_proportion DNA_proportion RC_proportion Unknown_proportion Satellite_proportion Simple_repeat_proportion" >proportions_table.tsv
TAXONLIST="maya_flye_TCG murphy_flye_TCG kees_flye_TCG iSca"
for TAXON in $TAXONLIST; do
STATS=$(tail -1 ../${TAXON}_RM/repeat_table_total.tsv)
echo $STATS >>proportions_table.tsv
done
```
# Meeting 5/25/2023 to decide which assemblies to use for the final analysis.
From Travis' e-mail:
[versions in brackets exist at least in theory; not sure if we actually have them]
Kees-1.2 - first version we got from Ron (Hein), which looks good except for the satellite artifacts (and has already been annotated and used for orthology, etc.)
[Kees-1.1 - from Ron, going to the version that has the haplotigs removed, but before polishing with Pilon --> not sure we have this version]
Kees-1.0 - from Ron, going back to the version prior to haplotig removal & polishing
Kees-2.0.1 - flye assembly by Rodrigo (which includes all haplotigs)
Kees-2.0.2 - flye assembly by Katie (which includes all haplotigs)
Kees-2.0.3 - flye asm 40 assembly by Katie (which includes all haplotigs); asm 40 uses the longest reads (excludes shortest reads after 40x coverage is obtained)
Kees-2.1.1 - prior version (2.0.1) with haplotigs removed {Rodrigo's Flye assembly}
Kees-2.2.1 - prior version (2.1.1) polished with NextPolish {Rodrigo's polished Flye assembly}
Kees-2.3.1 - prior version (2.2.1) with contaminants removed
Maya-2.0.2 -Katie's Flye assembly
Murphy-2.0.2 - Katie's Flye assembly
Maya-2.0.1 - Rodrigo's Flye assembly
Murphy-2.0.1 - Rodrigo's Flye assembly
Maya-2.1.1 - haplotigs removed {Rodrigo's Flye assembly}
Murphy-2.1.1 - haplotigs removed {Rodrigo's Flye assembly}
I have:
murphy_purged_2.1.1.fa.gz = Murphy-2.1.1
murphy_flye_TCG_2.0.2.fa.gz = Murphy-2.0.2
murphy_flye_raw_2.0.1.fa.gz = Murphy-2.0.1
maya_purged_2.1.1.fa.gz = Maya-2.1.1
maya_flye_TCG_2.0.2.fa.gz = Maya-2.0.2
maya_flye_raw_2.0.1.fa.gz = Maya-2.0.1
kees_purged_2.1.1.fa.gz = kees-2.1.1
kees_flye_TCG_2.0.2.fa.gz = kees-2.0.2
kees_flye_raw_2.0.1.fa.gz = kees-2.0.1
iRic3.fa.gz = kees_1.2.0 = kees_ron_1.2.0.fa.gz
kees_40_2.0.3.fa.gz = kees1202_cov40
kees_ron_1.0.0.fa.gz = Kees-1.0
kees_purged_2.2.1.fa.gz = Kees_purged_nextpolish_2.2.1.fasta
murphy_purged_2.2.1.fa.gz = Murphy_purged_nextpolish_2.2.1.fasta
maya_purged_2.2.1.fa.gz = Maya_purged_nextpolish_2.2.1.fasta
Restarting all of the repeatmasker runs with this list:
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for NAME in $LIST; do
sed "s/<NAME>/${NAME}/g" template_RM.sh >${NAME}_RM.sh
rm -rf ${NAME}_RM
sed "s/<NAME>/${NAME}/g" template_rm2bed.sh >${NAME}_rm2bed.sh
sbatch ${NAME}_RM.sh
done
```
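The `<NAME>` templating above is a plain placeholder substitution; a toy check (the template contents here are hypothetical):

```shell
cd "$(mktemp -d)"
printf '#SBATCH --job-name=<NAME>_RM\nGENOME=<NAME>.fa.gz\n' >template_demo.sh
sed "s/<NAME>/kees_purged_2.1.1/g" template_demo.sh >kees_purged_2.1.1_demo.sh
cat kees_purged_2.1.1_demo.sh
```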
Just got kees_40_2.0.3.fa.gz (the 40x version); added it to the list. Also got kees_ron_1.0.0.
Submit all of the doLifts if necessary. Sometimes they don't go automatically.
```
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
cd /lustre/scratch/daray/ixodes/final_library_work
for NAME in $LIST; do
cd ${NAME}_RM/
rm *.err
rm *.out
sbatch doLift.sh
cd ..
done
```
Submit all of the rm2beds and then clean up.
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for NAME in $LIST; do
rm ${NAME}_RM/*.err
rm ${NAME}_RM/*.out
sbatch generic_rm2bed.sh $NAME
done
cd /lustre/scratch/daray/ixodes/final_library_work
for NAME in $LIST; do
rm -rf ${NAME}_RM/RMPart
done
```
```
cd /lustre/scratch/daray/ixodes/final_library_work
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for TAXON in $LIST; do
sed 's|unknown|Unknown|g' ${TAXON}_RM/${TAXON}_Unknown_rm.bed >${TAXON}_RM/${TAXON}_Unknown_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_RC_rm.bed ${TAXON}_RM/${TAXON}_RC_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_SINE_rm.bed ${TAXON}_RM/${TAXON}_SINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_Satellite_rm.bed ${TAXON}_RM/${TAXON}_Satellite_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_PLE_rm.bed ${TAXON}_RM/${TAXON}_PLE_consolidated_rm.bed
sed 's|RTE-X|RTE|g' ${TAXON}_RM/${TAXON}_LINE_rm.bed >${TAXON}_RM/${TAXON}_LINE_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_DNA_rm.bed ${TAXON}_RM/${TAXON}_DNA_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_LTR_rm.bed ${TAXON}_RM/${TAXON}_LTR_consolidated_rm.bed
cp ${TAXON}_RM/${TAXON}_Simple_repeat_rm.bed ${TAXON}_RM/${TAXON}_Simple_repeat_consolidated_rm.bed
done
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for TAXON in $LIST; do
cat ../${TAXON}_RM/*consolidated_rm.bed >${TAXON}_all_consolidated_rm.bed
cp ${TAXON}_all_consolidated_rm.bed ${TAXON}_all_rm.bed
done
sbatch cat_data.sh
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for TAXON in $LIST; do
cp ${TAXON}_all_rm.bed ${TAXON}_rm.bed
done
### Wait for cat_data.sh to be finished.
### fix the sizefiles
conda activate seqkit
cd /lustre/scratch/daray/ixodes/final_library_work/plots
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for NAME in $LIST; do
STATS=$(seqkit stats ../../assemblies/${NAME}.fa.gz -T | tail -1)
echo $STATS >>genomestats.txt
done
###Edit sizefiles as needed
python filter_beds.py -g sizefile_mrates_all.txt -d 50
```
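The "edit sizefiles as needed" step can mostly be scripted: seqkit's tab-separated stats put the total length in column 5, so sizefile_mrates-style lines (name, size, rate) can be derived from genomestats.txt. A sketch on a fake stats line (the file name and 3.0e-9 rate follow the sizefiles above; the other columns are placeholders):

```shell
cd "$(mktemp -d)"
# fake 'seqkit stats -T | tail -1' line as stored by the unquoted echo above
echo '../../assemblies/iSca.fa.gz FASTA DNA 1000 2226883318 150 2226883.3 30000000' >genomestats_demo.txt
# column 5 = sum_len; strip the path and .fa.gz suffix to recover the taxon name
awk '{name=$1; sub(/^.*\//,"",name); sub(/\.fa\.gz$/,"",name); print name, $5, "3.0e-9"}' \
    genomestats_demo.txt >sizefile_mrates_demo.txt
cat sizefile_mrates_demo.txt    # -> iSca 2226883318 3.0e-9
```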
Use altered curated_landscapes_bluePLE_mod.py
```
actconda
cd /lustre/scratch/daray/ixodes/final_library_work/plots
python curated_landscapes_bluePLE_mod_filter.py -d 50 -g sizefile_mrates_4_curated_landscapes_all.txt
cd /lustre/scratch/daray/ixodes/final_library_work/plots
rm proportions_table.tsv
echo "Species Genome_size TE_proportion LINE_proportion SINE_proportion LTR_proportion DNA_proportion RC_proportion Unknown_proportion Satellite_proportion Simple_repeat_proportion" >proportions_table.tsv
LIST="murphy_purged_2.1.1 murphy_flye_TCG_2.0.2 murphy_flye_raw_2.0.1 maya_purged_2.1.1 maya_flye_raw_2.0.1 kees_purged_2.1.1 maya_flye_TCG_2.0.2 kees_flye_raw_2.0.1 kees_ron_1.2.0 kees_40_2.0.3 kees_ron_1.0.0 kees_flye_TCG_2.0.2 murphy_purged_2.2.1 kees_purged_2.2.1 maya_purged_2.2.1 iSca"
for TAXON in $LIST; do
STATS=$(tail -1 ../${TAXON}_RM/repeat_table_total.tsv)
echo $STATS >>proportions_table.tsv
done
```
Another new set of assemblies, cleaned of contaminants:
kees_PDC_2.3.1.fa.gz = Kees_PDC_2.3.1.fasta
murphy_PDC_2.3.1.fa.gz = Murphy_PDC_2.3.1.fasta
maya_PDC_2.3.1.fa.gz = Maya_PDC_2.3.1.fasta
Restarting all of the repeatmasker runs with this list:
LIST="kees_PDC_2.3.1 murphy_PDC_2.3.1 maya_PDC_2.3.1"
## Out of space. Moved to I. ricinus analysis part 2