longevity bootstrap

# longevity bootstrap Collect isoform 1 transcripts from TOGA data in /lustre/scratch/daray/tedensity/6bats_TOGA ``` cd /lustre/scratch/daray/tedensity/6bats_TOGA gtf2bed < HLmyoMyo6.Bat1Kannotation.intact.gtf > mMyo_bat1k.bed awk -F "\t" '{ if ($8 = "transcript") {print}} ' mMyo_bat1k.bed | grep -v "exon" >mMyo_bat1k_transcripts_only.bed sed -i "s/ /\t/g" mMyo_bat1k_transcripts_only.bed head mMyo_bat1k_transcripts_only.bed sort -k4 mMyo_bat1k_transcripts_only.bed >mMyo_bat1k_transcripts_sort.bed awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 }' mMyo_bat1k_transcripts_sort.bed >mMyo_bat1k_transcripts_trim.bed sed "s/_x/\t_x/g" mMyo_bat1k_transcripts_trim.bed | head sed "s/_x/\t_x/g" mMyo_bat1k_transcripts_trim.bed | awk '{if ($5 == "_x1") {print $1 "\t" $2 "\t" $3 "\t" $4 $5 "\t" $6 "\t" $7}}' >mMyo_bat1k_transcripts_x1.bed ``` Per Liliana, longevity gene data can be found at https://www.dropbox.com/sh/985gsici6bge1o9/AACZ7GjMIykwZ610MkHp3Hiya?dl=0 "the Navarro lists require collaboration for wilkinson, pick outputs/longevity" Genes in Wilkinson Momo and MYMY file are divvied in to Pos, Neg, and Neither. Assuming these correspond to positively influence, negatively influence, and no influence, respectively. Use bedtools intersect to identify overlaps for ALL genes in transcripts_x1 and then for subset of 'influencer' genes. Will use an overlap of within 1000 bp of entire transcript. ``` cd /lustre/scratch/daray/intersect/longevity_work cp /lustre/scratch/daray/tedensity/6bats_TOGA/*_x1.bed . cp ../mMol_te.bed ../mMyo_te.bed . ``` Created mMyo_influential_list.txt and mMol_influential_list.txt to extract those genes from full list. Sorted and uniq'd. No change ``` cat mMyo_influential_list.txt | while read i; do grep -w $i mMyo_bat1k_transcripts_x1.bed >>mMyo_influential.bed; done cat mMol_influential_list.txt | while read i; do grep -w $i mMol_bat1k_transcripts_x1.bed >>mMol_influential.bed; done ``` ``` bedtools window -a mMol_influential.bed -b mMol_te.bed -l 1000 -r 0 >mMol_window_influential_v_te_1000up.bed cat mMol_influential_list.txt | while read i; do COUNT=$(grep -w $i mMol_window_influential_v_te_1000up.bed | wc -l | awk '{print $1}'); echo $i $COUNT >> mMol_influential_genes_counts.txt; done bedtools window -a mMyo_influential.bed -b mMyo_te.bed -l 1000 -r 0 >mMyo_window_influential_v_te_1000up.bed cat mMyo_influential_list.txt | while read i; do COUNT=$(grep -w $i mMyo_window_influential_v_te_1000up.bed | wc -l | awk '{print $1}'); echo $i $COUNT >> mMyo_influential_genes_counts.txt; done awk '{print $4}' mMol_bat1k_transcripts_x1.bed | sort > mMol_transcripts_list.txt awk '{print $4}' mMyo_bat1k_transcripts_x1.bed | sort > mMyo_transcripts_list.txt bedtools window -a mMol_bat1k_transcripts_x1.bed -b mMol_te.bed -l 1000 -r 0 >mMol_window_transcripts_v_te_1000up.bed cat mMol_transcripts_list.txt| while read i; do COUNT=$(grep -w $i mMol_window_transcripts_v_te_1000up.bed | wc -l | awk '{print $1}'); echo $i $COUNT >> mMol_all_genes_counts.txt; done bedtools window -a mMyo_bat1k_transcripts_x1.bed -b mMyo_te.bed -l 1000 -r 0 >mMyo_window_transcripts_v_te_1000up.bed cat mMyo_transcripts_list.txt | while read i; do COUNT=$(grep -w $i mMyo_window_transcripts_v_te_1000up.bed | wc -l | awk '{print $1}'); echo $i $COUNT >> mMyo_all_genes_counts.txt; done ``` ### New work for longevity paper that Tanya is leading. Received files from Tanya and deposited in /lustre/scratch/daray/intersect/longevity_all_bats Unzipped and renamed all .gtf files in toga_annotations_longevity_bats/annotation_bed_gtf by removing "Copy of ". Copied those files to /lustre/scratch/daray/intersect/longevity_all_bats/processed Now working within that directory. ``` actconda #conda activate bedops cd /lustre/scratch/daray/intersect/longevity_all_bats/processed for FILE in *.gtf; do NAME=$(basename $FILE .gtf | awk -F'[_]' '{print $1}') echo $NAME sed "s|/projects/hillerlab/genome/gbdb-HL/hg38/TOGA/vs_${NAME}/||g" $FILE >${NAME} awk '{print $10}' ${NAME} >${NAME}_9.tmp sed -i 's|"||g' ${NAME}_9.tmp sed -i 's|;||g' ${NAME}_9.tmp paste ${NAME} ${NAME}_9.tmp >${NAME}.tmp rm ${NAME} ${NAME}_9.tmp awk '{print $1 "\t" $4 "\t" $5 "\t" $13 "\t" $6 "\t" $7 "\t" $2 "\t" $3 "\t" $8 "\t" $9 "\t" $10 "\t" $11 "\t" $12}' ${NAME}.tmp >${NAME}_bat1k.bed rm ${NAME}.tmp awk -F "\t" '{ if ($8 = "transcript") {print}} ' ${NAME}_bat1k.bed | grep -v "exon" >${NAME}_bat1k_transcripts_only.bed awk -i inplace '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6}' ${NAME}_bat1k_transcripts_only.bed sed -i "s/ /\t/g" ${NAME}_bat1k_transcripts_only.bed done ``` Renamed all the files to match our IDs. Grabbed the appropriate RepeatMasker files from /lustre/scratch/daray/covid_bats2 or from /lustre/scratch/daray/toarchive/zoonomia, as appropriate. During, check to make sure the assembly used to generate the RepeatMasker output is the same as the assembly used for gene annotation. To be safe, I just re-ran repeatmasker for all of the genome assemblies. Work performed in /lustre/scratch/daray/intersect/longevity_all_bats/repeatmasker A few species are missing due to bad assemblies. Will deal with it later. Get the influential genes from each species using the list from mMol. mMol_influential_list_uniq.txt was derived from longevity_geneset_masterlist.xlsx Output is a .bed of longevity-influential transcripts ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/beds cat ../list.txt | while read i; do cat ../mMol_influential_list_uniq.txt | while read j; do grep -w $j ${i}_bat1k_transcripts_only.bed >>${i}_influential.bed done done ``` Get the .bed files for the TE annotations ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/beds cat ../list.txt | while read i; do cat ../repeatmasker/${i}_RM/${i}_LTR_rm.bed ../repeatmasker/${i}_RM/${i}_LINE_rm.bed ../repeatmasker/${i}_RM/${i}_DNA_rm.bed ../repeatmasker/${i}_RM/${i}_SINE_rm.bed ../repeatmasker/${i}_RM/${i}_RC_rm.bed >${i}_te.bed done ``` ## Started working with Scott Teresi to run TEDensity. See threads in e-mail. To reduce the complexity of my TE annotation files, ran: ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/te_annotations cat ../../list.txt | while read i; do echo $i sed "s|5S-Core-RTE|5S|g" ${i}_TEs.tsv >${i}_TEs.tsv.tmp sed "s|5S-Deu-L2|5S|g; \ s|5S-RTE|5S|g; \ s|5S-Sauria-RTE|5S|g; \ s|5S|Other-SINE|g; \ s|7SL|Other-SINE|g; \ s|Alu|Other-SINE|g; \ s|U|Other-SINE|g; \ s|IS3EOther-SINE|IS3EU|g; \ s|MOther-SINELE-MuDR|MULE-MuDR|g; \ s|MOther-SINELE-NOF|MULE-NOF|g; \ s|Other-SINEnknown|Unknown|g; \ s|tRNA-7SL|tRNA-SINE|g; \ s|tRNA-CR1|tRNA-SINE|g; \ s|tRNA-Core|tRNA-SINE|g; \ s|tRNA-Core-L2|tRNA-SINE|g; \ s|tRNA-Core-RTE|tRNA-SINE|g; \ s|tRNA-Deu|tRNA-SINE|g; \ s|tRNA-Deu-L2|tRNA-SINE|g; \ s|tRNA-L1|tRNA-SINE|g; \ s|tRNA-L2|tRNA-SINE|g; \ s|tRNA-RTE|tRNA-SINE|g; \ s|tRNA-V|tRNA-SINE|g; \ s|tRNA-V-CR1|tRNA-SINE|g; \ s|Core-RTE|tRNA-SINE|g; \ s|tRNA-SINE-CR1|tRNA-SINE|g; \ s|tRNA-SINE-RTE|tRNA-SINE|g; \ s|tRNA-SINE-L2|tRNA-SINE|g; \ s|tRNA-V-Core-L2|tRNA-SINE|g; \ s|tRNA|tRNA-SINE|g; \ s|tRNA-SINE-SINE|tRNA-SINE|g; \ s|Rhin-1|Rhin|g; \ s|SINE2|Other-SINE|g; \ s|tRNA-SINE-Sauria|tRNA-SINE|g; \ s|unknown|Unknown|g; \ s|CMC-Chapaev-3|CMC|g; \ s|CMC-EnSpm|CMC|g; \ s|Crypton-A|Crypton|g; \ s|Crypton-H|Crypton|g; \ s|Crypton-V|Crypton|g; \ s|Kolobok-T2|Kolobok|g; \ s|MEG|Meg|g; \ s|MULE-MuDR|MULE|g; \ s|MULE-NOF|MULE|g; \ s|PiggyBac|piggyBac|g; \ s|Sola-1|Sola|g; \ s|Sola-2|Sola|g; \ s|TcMar-Fot1|TcMariner|g; \ s|TcMar-ISRm11|TcMariner|g; \ s|TcMar-Mariner|TcMariner|g; \ s|TcMar-Stowaway|TcMariner|g; \ s|TcMar-Tc1|TcMariner|g; \ s|TcMar-Tc2|TcMariner|g; \ s|TcMar-Tigger|TcMariner|g; \ s|TcMar|TcMariner|g; \ s|TcMariner-Mariner|TcMariner|g; \ s|TcMariner-ISRm11|TcMariner|g; \ s|TcMariner-Tigger|TcMariner|g; \ s|TcMariner-Mariner|TcMariner|g; \ s|TcMariner-ISRm11|TcMariner|g; \ s|TcMarineriner|TcMariner|g; \ s|hAT-Blackjack|hAT|g; \ s|hAT-Charlie|hAT|g; \ s|hAT-Tip100|hAT|g; \ s|Academ-1|Other-DNA|g; \ s|CACTA|Other-DNA|g; \ s|CMC|Other-DNA|g; \ s|Crypton|Other-DNA|g; \ s|Dada|Other-DNA|g; \ s|Ginger|Other-DNA|g; \ s|PIF-Harbinger|Harbinger|g; \ s|Harbinger|Other-DNA|g; \ s|IS3EU|Other-DNA|g; \ s|Kolobok|Other-DNA|g; \ s|Maverick|Other-DNA|g; \ s|MULE|Other-DNA|g; \ s|Mutator|Other-DNA|g; \ s|Merlin|Other-DNA|g; \ s|Novosib|Other-DNA|g; \ s|PIF-ISL2EU|Other-DNA|g; \ s|P|Other-DNA|g; \ s|Other-DNAao|Pao|g; \ s|Other-DNAenelope|P|g; \ s|Sola|Other-DNA|g; \ s|Zator|Other-DNA|g; \ s|Zisupton|Other-DNA|g; \ s|Ngaro|Other-Retrotransposon|g; \ s|DIRS|Other-Retrotransposon|g; \ s|I-Jockey|Other-LINE|g; \ s|Jockey|Other-LINE|g; \ s|R2-Hero|R2|g; \ s|R2-NeSL|R2|g; \ s|RTE-BovB|RTE|g; \ s|RTE-RTE|RTE|g; \ s|RTE-X|RTE|g; \ s|BovB|RTE|g; \ s|L1-Tx1|L1|g; \ s|B2|Other-LINE|g; \ s|B4|Other-LINE|g; \ s|Dong-R4|Other-LINE|g; \ s|L2|Other-LINE|g; \ s|LINE-like|Other-LINE|g; \ s|R1|Other-LINE|g; \ s|COther-LINE|CR1|g; \ s|R2|Other-LINE|g; \ s|Rex-Babar|Other-LINE|g; \ s|ERV-Lenti|ERV|g; \ s|ERV-Foamy|ERV|g; \ s|ERV1|ERV|g; \ s|ERV1_internal|ERV|g; \ s|ERV1_ltr|ERV|g; \ s|ERV4|ERV|g; \ s|ERVK|ERV|g; \ s|ERVK_internal|ERV|g; \ s|ERV_ltr|ERV|g; \ s|ERVK_ltr|ERV|g; \ s|ERVL|ERV|g; \ s|ERVL-MaLR|ERV|g; \ s|ERV-MaLR|ERV|g; \ s|ERVL_internal|ERV|g; \ s|ERVL_ltr|ERV|g; \ s|ERV_internal|ERV|g; \ s|ERV_ltr|ERV|g; \ s|Bhikhari|Other-LTR|g; \ s|Other-DNAIF-ISOther-LINEEOther-SINE|Other-DNA|g; \ #s|tRNA-SINE-Core-Other-LINE|tRNA-SINE|g; \ s|Pao|Other-LTR|g" ${i}_TEs.tsv.tmp >${i}_TEs_binned.tsv awk '{print $6}' ${i}_TEs_binned.tsv | sort | uniq >${i}_uniq_superfamily.txt awk '{print $5}' ${i}_TEs_binned.tsv | sort | uniq >${i}_uniq_order.txt rm ${i}_TEs.tsv.tmp #gzip -c ${i}_TEs_binned.tsv >${i}_TEs_binned.tsv.gz done ``` Determine which scaffolds to use: ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity sbatch scaff_filter.sh ``` Had to re-run the six bats because the scaffold names were different on the NCBI versions and the TOGA-run versions. Managed to find the assemblies with the old TOGA names and re-ran RepeatMasker. Output is in: /lustre/scratch/daray/intersect/longevity_all_bats/repeatmasker ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/beds LIST="mMol mMyo pDis pKuh rAeg rFer" for i in $LIST; do cat ../repeatmasker/${i}_RM/${i}_LTR_rm.bed ../repeatmasker/${i}_RM/${i}_LINE_rm.bed ../repeatmasker/${i}_RM/${i}_DNA_rm.bed ../repeatmasker/${i}_RM/${i}_SINE_rm.bed ../repeatmasker/${i}_RM/${i}_RC_rm.bed >${i}_te.bed done ``` Convert to tsv ![](https://i.imgur.com/aA1wYuB.png) ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/beds LIST="mMol mMyo pDis pKuh rAeg rFer" for i in $LIST; do awk '{print $1 "\t" $2 "\t" $3 "\t" $6 "\t" $7 "\t" $8 "\t" $3-$2+1}' ${i}_te.bed >../tedensity/te_annotations/${i}_TEs.tmp cat ../tedensity/te_annotations/te_header.txt ../tedensity/te_annotations/${i}_TEs.tmp >../tedensity/te_annotations/${i}_TEs.tsv rm ../tedensity/te_annotations/${i}_TEs.tmp done ``` Clean up ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/te_annotations LIST="mMol mMyo pDis pKuh rAeg rFer" for i in $LIST; do echo $i sed "s|5S-Core-RTE|5S|g" ${i}_TEs.tsv >${i}_TEs.tsv.tmp sed "s|5S-Deu-L2|5S|g; \ s|5S-RTE|5S|g; \ s|5S-Sauria-RTE|5S|g; \ s|5S|Other-SINE|g; \ s|7SL|Other-SINE|g; \ s|Alu|Other-SINE|g; \ s|U|Other-SINE|g; \ s|IS3EOther-SINE|IS3EU|g; \ s|MOther-SINELE-MuDR|MULE-MuDR|g; \ s|MOther-SINELE-NOF|MULE-NOF|g; \ s|Other-SINEnknown|Unknown|g; \ s|tRNA-7SL|tRNA-SINE|g; \ s|tRNA-CR1|tRNA-SINE|g; \ s|tRNA-Core|tRNA-SINE|g; \ s|tRNA-Core-L2|tRNA-SINE|g; \ s|tRNA-Core-RTE|tRNA-SINE|g; \ s|tRNA-Deu|tRNA-SINE|g; \ s|tRNA-Deu-L2|tRNA-SINE|g; \ s|tRNA-L1|tRNA-SINE|g; \ s|tRNA-L2|tRNA-SINE|g; \ s|tRNA-RTE|tRNA-SINE|g; \ s|tRNA-V|tRNA-SINE|g; \ s|tRNA-V-CR1|tRNA-SINE|g; \ s|Core-RTE|tRNA-SINE|g; \ s|tRNA-SINE-CR1|tRNA-SINE|g; \ s|tRNA-SINE-RTE|tRNA-SINE|g; \ s|tRNA-SINE-L2|tRNA-SINE|g; \ s|tRNA-V-Core-L2|tRNA-SINE|g; \ s|tRNA|tRNA-SINE|g; \ s|tRNA-SINE-SINE|tRNA-SINE|g; \ s|Rhin-1|Rhin|g; \ s|SINE2|Other-SINE|g; \ s|tRNA-SINE-Sauria|tRNA-SINE|g; \ s|tRNA-SINE-Ceph-RTE|tRNA-SINE|g; \ s|Ceph|tRNA-SINE|g; \ s|tRNA-SINE-Mermaid|tRNA-SINE|g; \ s|tRNA-SINE-Meta|tRNA-SINE|g; \ s|Core|Other-SINE|g; \ s|unknown|Unknown|g; \ s|CMC-Chapaev-3|CMC|g; \ s|CMC-EnSpm|CMC|g; \ s|Crypton-A|Crypton|g; \ s|Crypton-H|Crypton|g; \ s|Crypton-V|Crypton|g; \ s|Kolobok-T2|Kolobok|g; \ s|MEG|Meg|g; \ s|MULE-MuDR|MULE|g; \ s|MULE-NOF|MULE|g; \ s|PiggyBac|piggyBac|g; \ s|Sola-1|Sola|g; \ s|Sola-2|Sola|g; \ s|TcMar-Fot1|TcMariner|g; \ s|TcMar-ISRm11|TcMariner|g; \ s|TcMar-Mariner|TcMariner|g; \ s|TcMar-Stowaway|TcMariner|g; \ s|TcMar-Tc1|TcMariner|g; \ s|TcMar-Tc2|TcMariner|g; \ s|TcMar-Tigger|TcMariner|g; \ s|TcMar|TcMariner|g; \ s|TcMariner-Mariner|TcMariner|g; \ s|TcMariner-ISRm11|TcMariner|g; \ s|TcMariner-Tigger|TcMariner|g; \ s|TcMariner-Mariner|TcMariner|g; \ s|TcMariner-ISRm11|TcMariner|g; \ s|TcMarineriner|TcMariner|g; \ s|hAT-Blackjack|hAT|g; \ s|hAT-Charlie|hAT|g; \ s|hAT-Tip100|hAT|g; \ s|Academ-1|Other-DNA|g; \ s|CACTA|Other-DNA|g; \ s|CMC|Other-DNA|g; \ s|Crypton|Other-DNA|g; \ s|Dada|Other-DNA|g; \ s|Ginger|Other-DNA|g; \ s|PIF-Harbinger|Harbinger|g; \ s|Harbinger|Other-DNA|g; \ s|IS3EU|Other-DNA|g; \ s|Kolobok|Other-DNA|g; \ s|Maverick|Other-DNA|g; \ s|MULE|Other-DNA|g; \ s|Mutator|Other-DNA|g; \ s|Merlin|Other-DNA|g; \ s|Novosib|Other-DNA|g; \ s|PIF-ISL2EU|Other-DNA|g; \ s|P|Other-DNA|g; \ s|Other-DNAao|Pao|g; \ s|Other-DNAenelope|P|g; \ s|Sola|Other-DNA|g; \ s|Zator|Other-DNA|g; \ s|Zisupton|Other-DNA|g; \ s|Ngaro|Other-Retrotransposon|g; \ s|DIRS|Other-Retrotransposon|g; \ s|I-Jockey|Other-LINE|g; \ s|Jockey|Other-LINE|g; \ s|R2-Hero|R2|g; \ s|R2-NeSL|R2|g; \ s|RTE-BovB|RTE|g; \ s|RTE-RTE|RTE|g; \ s|RTE-X|RTE|g; \ s|BovB|RTE|g; \ s|L1-Tx1|L1|g; \ s|B2|Other-LINE|g; \ s|B4|Other-LINE|g; \ s|Dong-R4|Other-LINE|g; \ s|L2|Other-LINE|g; \ s|LINE-like|Other-LINE|g; \ s|R1|Other-LINE|g; \ s|COther-LINE|CR1|g; \ s|R2|Other-LINE|g; \ s|Rex-Babar|Other-LINE|g; \ s|CR1-Zenon|CR1|g; \ s|Tad1|Other-LINE|g; \ s|R0|Other-LINE|g; \ s|Other-DNAroto2|Other-LINE|g; \ s|ERV-Lenti|ERV|g; \ s|ERV-Foamy|ERV|g; \ s|ERV1|ERV|g; \ s|ERV1_internal|ERV|g; \ s|ERV1_ltr|ERV|g; \ s|ERV4|ERV|g; \ s|ERVK|ERV|g; \ s|ERVK_internal|ERV|g; \ s|ERV_ltr|ERV|g; \ s|ERVK_ltr|ERV|g; \ s|ERVL|ERV|g; \ s|ERVL-MaLR|ERV|g; \ s|ERV-MaLR|ERV|g; \ s|ERVL_internal|ERV|g; \ s|ERVL_ltr|ERV|g; \ s|ERV_internal|ERV|g; \ s|ERV_ltr|ERV|g; \ s|Bhikhari|Other-LTR|g; \ s|Other-DNAIF-ISOther-LINEEOther-SINE|Other-DNA|g; \ s|tRNA-SINE-Core-Other-LINE|tRNA-SINE|g; \ s|tRNA-SINE-Core-Other-LINE|tRNA-SINE|g; \ s|tRNA-SINE-Other-SINE|tRNA-SINE|g; \ s|tRNA-SINE-Other-LINE|tRNA-SINE|g; \ s|Other-DNA-Transib|Other-DNA|g; \ s|tRNA-SINE-RTE|tRNA-SINE|g; \ s|Pao|Other-LTR|g" ${i}_TEs.tsv.tmp >${i}_TEs_binned.tsv cp ${i}_TEs_binned.tsv /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/clean_files/cleaned/Cleaned_${i}_TEs_binned_DAR.tsv awk '{print $6}' ${i}_TEs_binned.tsv | sort | uniq >${i}_uniq_superfamily.txt awk '{print $5}' ${i}_TEs_binned.tsv | sort | uniq >${i}_uniq_order.txt rm ${i}_TEs.tsv.tmp gzip -c ${i}_TEs_binned.tsv >${i}_TEs_binned.tsv.gz done LIST="mMol mMyo pDis pKuh rAeg rFer" for i in $LIST; do cp ${i}_TEs_binned.tsv /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/clean_files/Cleaned_${i}_TEs_binned_DAR.tsv done ``` Clean up scaffolds and run TEDensity ``` sbatch scaff_filter_6bats.sh #Wait to finish cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts LIST="mMol mMyo pDis pKuh rAeg rFer" for i in $LIST; do sbatch ${i}_tedensity.sh done ``` aJam is messed up. Seems to be at the step of creating the Cleaned TE file. ``` sbatch scaff_filter_aJam.sh ``` The TE_binned file is still not right. Go back farther. Found the problem. The assembly used for TOGA and for RepeatMasker are not the same. Got what I think is the right assembly from Tanya but I need to check. Start by checking to see if the same scaffolds are represented in the TOGA and RM annotations. Run RepeatMasker on the 'correct' assembly ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/assemblies head aJam.masked.fa awk 'BEGIN{FS=" "}{if(!/>/){print toupper($0)}else{print $1}}' aJam.masked.fa > aJam.unmasked.fa gzip aJam.unmasked.fa >aJam.fa.gz cd /lustre/scratch/daray/intersect/longevity_all_bats/repeatmasker sbatch aJam_RM.sh cd /lustre/scratch/daray/intersect/longevity_all_bats/beds LIST="aJam" for i in $LIST; do cat ../repeatmasker/${i}_RM/${i}_LTR_rm.bed ../repeatmasker/${i}_RM/${i}_LINE_rm.bed ../repeatmasker/${i}_RM/${i}_DNA_rm.bed ../repeatmasker/${i}_RM/${i}_SINE_rm.bed ../repeatmasker/${i}_RM/${i}_RC_rm.bed >${i}_te.bed done awk '{print $1}' aJam_te.bed | sort |uniq |wc -l ---> 830 awk '{print $1}' aJam_bat1k_transcripts_only.bed | sort |uniq |wc -l ---> 400 ``` The numbers don't match but that may not mean anything. It's unlikely that a gene was found on every scaffold of the assembly. Keep going and cross your fingers. ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/beds LIST="aJam" for i in $LIST; do awk '{print $1 "\t" $2 "\t" $3 "\t" $6 "\t" $7 "\t" $8 "\t" $3-$2+1}' ${i}_te.bed >../tedensity/te_annotations/${i}_TEs.tmp cat ../tedensity/te_annotations/te_header.txt ../tedensity/te_annotations/${i}_TEs.tmp >../tedensity/te_annotations/${i}_TEs.tsv rm ../tedensity/te_annotations/${i}_TEs.tmp done cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/te_annotations LIST="aJam" for i in $LIST; do echo $i sed "s|5S-Core-RTE|5S|g" ${i}_TEs.tsv >${i}_TEs.tsv.tmp sed "s|5S-Deu-L2|5S|g; \ s|5S-RTE|5S|g; \ s|5S-Sauria-RTE|5S|g; \ s|5S|Other-SINE|g; \ s|7SL|Other-SINE|g; \ s|Alu|Other-SINE|g; \ s|U|Other-SINE|g; \ s|IS3EOther-SINE|IS3EU|g; \ s|MOther-SINELE-MuDR|MULE-MuDR|g; \ s|MOther-SINELE-NOF|MULE-NOF|g; \ s|Other-SINEnknown|Unknown|g; \ s|tRNA-7SL|tRNA-SINE|g; \ s|tRNA-CR1|tRNA-SINE|g; \ s|tRNA-Core|tRNA-SINE|g; \ s|tRNA-Core-L2|tRNA-SINE|g; \ s|tRNA-Core-RTE|tRNA-SINE|g; \ s|tRNA-Deu|tRNA-SINE|g; \ s|tRNA-Deu-L2|tRNA-SINE|g; \ s|tRNA-L1|tRNA-SINE|g; \ s|tRNA-L2|tRNA-SINE|g; \ s|tRNA-RTE|tRNA-SINE|g; \ s|tRNA-V|tRNA-SINE|g; \ s|tRNA-V-CR1|tRNA-SINE|g; \ s|Core-RTE|tRNA-SINE|g; \ s|tRNA-SINE-CR1|tRNA-SINE|g; \ s|tRNA-SINE-RTE|tRNA-SINE|g; \ s|tRNA-SINE-L2|tRNA-SINE|g; \ s|tRNA-V-Core-L2|tRNA-SINE|g; \ s|tRNA|tRNA-SINE|g; \ s|tRNA-SINE-SINE|tRNA-SINE|g; \ s|Rhin-1|Rhin|g; \ s|SINE2|Other-SINE|g; \ s|tRNA-SINE-Sauria|tRNA-SINE|g; \ s|tRNA-SINE-Ceph-RTE|tRNA-SINE|g; \ s|Ceph|tRNA-SINE|g; \ s|tRNA-SINE-Mermaid|tRNA-SINE|g; \ s|tRNA-SINE-Meta|tRNA-SINE|g; \ s|Core|Other-SINE|g; \ s|unknown|Unknown|g; \ s|CMC-Chapaev-3|CMC|g; \ s|CMC-EnSpm|CMC|g; \ s|Crypton-A|Crypton|g; \ s|Crypton-H|Crypton|g; \ s|Crypton-V|Crypton|g; \ s|Kolobok-T2|Kolobok|g; \ s|MEG|Meg|g; \ s|MULE-MuDR|MULE|g; \ s|MULE-NOF|MULE|g; \ s|PiggyBac|piggyBac|g; \ s|Sola-1|Sola|g; \ s|Sola-2|Sola|g; \ s|TcMar-Fot1|TcMariner|g; \ s|TcMar-ISRm11|TcMariner|g; \ s|TcMar-Mariner|TcMariner|g; \ s|TcMar-Stowaway|TcMariner|g; \ s|TcMar-Tc1|TcMariner|g; \ s|TcMar-Tc2|TcMariner|g; \ s|TcMar-Tigger|TcMariner|g; \ s|TcMar|TcMariner|g; \ s|TcMariner-Mariner|TcMariner|g; \ s|TcMariner-ISRm11|TcMariner|g; \ s|TcMariner-Tigger|TcMariner|g; \ s|TcMariner-Mariner|TcMariner|g; \ s|TcMariner-ISRm11|TcMariner|g; \ s|TcMarineriner|TcMariner|g; \ s|hAT-Blackjack|hAT|g; \ s|hAT-Charlie|hAT|g; \ s|hAT-Tip100|hAT|g; \ s|Academ-1|Other-DNA|g; \ s|CACTA|Other-DNA|g; \ s|CMC|Other-DNA|g; \ s|Crypton|Other-DNA|g; \ s|Dada|Other-DNA|g; \ s|Ginger|Other-DNA|g; \ s|PIF-Harbinger|Harbinger|g; \ s|Harbinger|Other-DNA|g; \ s|IS3EU|Other-DNA|g; \ s|Kolobok|Other-DNA|g; \ s|Maverick|Other-DNA|g; \ s|MULE|Other-DNA|g; \ s|Mutator|Other-DNA|g; \ s|Merlin|Other-DNA|g; \ s|Novosib|Other-DNA|g; \ s|PIF-ISL2EU|Other-DNA|g; \ s|P|Other-DNA|g; \ s|Other-DNAao|Pao|g; \ s|Other-DNAenelope|P|g; \ s|Sola|Other-DNA|g; \ s|Zator|Other-DNA|g; \ s|Zisupton|Other-DNA|g; \ s|Ngaro|Other-Retrotransposon|g; \ s|DIRS|Other-Retrotransposon|g; \ s|I-Jockey|Other-LINE|g; \ s|Jockey|Other-LINE|g; \ s|R2-Hero|R2|g; \ s|R2-NeSL|R2|g; \ s|RTE-BovB|RTE|g; \ s|RTE-RTE|RTE|g; \ s|RTE-X|RTE|g; \ s|BovB|RTE|g; \ s|L1-Tx1|L1|g; \ s|B2|Other-LINE|g; \ s|B4|Other-LINE|g; \ s|Dong-R4|Other-LINE|g; \ s|L2|Other-LINE|g; \ s|LINE-like|Other-LINE|g; \ s|R1|Other-LINE|g; \ s|COther-LINE|CR1|g; \ s|R2|Other-LINE|g; \ s|Rex-Babar|Other-LINE|g; \ s|CR1-Zenon|CR1|g; \ s|Tad1|Other-LINE|g; \ s|R0|Other-LINE|g; \ s|Other-DNAroto2|Other-LINE|g; \ s|ERV-Lenti|ERV|g; \ s|ERV-Foamy|ERV|g; \ s|ERV1|ERV|g; \ s|ERV1_internal|ERV|g; \ s|ERV1_ltr|ERV|g; \ s|ERV4|ERV|g; \ s|ERVK|ERV|g; \ s|ERVK_internal|ERV|g; \ s|ERV_ltr|ERV|g; \ s|ERVK_ltr|ERV|g; \ s|ERVL|ERV|g; \ s|ERVL-MaLR|ERV|g; \ s|ERV-MaLR|ERV|g; \ s|ERVL_internal|ERV|g; \ s|ERVL_ltr|ERV|g; \ s|ERV_internal|ERV|g; \ s|ERV_ltr|ERV|g; \ s|Bhikhari|Other-LTR|g; \ s|Other-DNAIF-ISOther-LINEEOther-SINE|Other-DNA|g; \ s|tRNA-SINE-Core-Other-LINE|tRNA-SINE|g; \ s|tRNA-SINE-Core-Other-LINE|tRNA-SINE|g; \ s|tRNA-SINE-Other-SINE|tRNA-SINE|g; \ s|tRNA-SINE-Other-LINE|tRNA-SINE|g; \ s|Other-DNA-Transib|Other-DNA|g; \ s|tRNA-SINE-RTE|tRNA-SINE|g; \ s|Pao|Other-LTR|g" ${i}_TEs.tsv.tmp >${i}_TEs_binned.tsv cp ${i}_TEs_binned.tsv /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/clean_files/Cleaned_${i}_TEs_binned_DAR.tsv awk '{print $6}' ${i}_TEs_binned.tsv | sort | uniq >${i}_uniq_superfamily.txt awk '{print $5}' ${i}_TEs_binned.tsv | sort | uniq >${i}_uniq_order.txt rm ${i}_TEs.tsv.tmp gzip -c ${i}_TEs_binned.tsv >${i}_TEs_binned.tsv.gz done cd .. sbatch scaff_filter_aJam.sh #Wait to finish cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts LIST="aJam" for i in $LIST; do sbatch ${i}_tedensity.sh done ``` While talking with Scott, we discovered that the downstream analysis crashes if not all TEs are present in the .hd5 files. We decided to filter the input data for only scaffolds with ALL TEs. I did so with this: ``` for FILE in *_TEs_binned_DAR.tsv; do NAME=$(printf '%s\n' "$FILE" | cut -d_ -f 2) rm Cleaned_${NAME}_allTEs_DAR.tsv echo "TAXON=$NAME" awk '{ print $1 } ' $FILE | sort | uniq > ${NAME}.scaff sed -i '1d' ${NAME}.scaff cat ${NAME}.scaff | while read SCAFF; do #echo "scaffold=$SCAFF" grep -w "$SCAFF" $FILE >${SCAFF}.extract.tsv #TELIST=$(grep "$SCAFF" ${SCAFF}.extract.tsv | awk '{ print $5 } ' | sort | uniq ) #echo $TELIST TECOUNT=$(grep "$SCAFF" ${SCAFF}.extract.tsv | awk '{ print $5 } ' | sort | uniq | wc -l) #echo $TECOUNT if (( TECOUNT > 4 )); then cat ${SCAFF}.extract.tsv >> Cleaned_${NAME}_allTEs_DAR.tsv fi rm ${SCAFF}.extract.tsv done rm ${NAME}.scaff done ``` Now I need to re-run TEDensity but with a new config file. First, link the new TE files. ``` /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/data$ for FILE in ls /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/clean_files/*.tsv; do ln -s $FILE; done ``` Change production_config file to 5kb windows ``` [density_parameters] first_window_size = 5000 window_delta = 2500 last_window_size = 5000 ``` Change template to use production_config file and to use the new TE files. New template file: ``` #!/bin/bash #SBATCH --job-name=<NAME>-density #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=xlquanah #SBATCH --nodes=1 #SBATCH --ntasks=8 #SBATCH --time=300:00:00 #SBATCH --mem-per-cpu=31G #SBATCH --account=xlquanah #-------------------------------------------------------- CODE_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/TE_Density DATA_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/data OUTPUT_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/results GENOME=<NAME> RUNTYPE=${GENOME}_tedensity ## This script uses a python program. You need to make sure python is loaded and ready to go by activating conda. ## and the tedensity virtual environment . ~/conda/etc/profile.d/conda.sh conda activate tedensity #ST Load Python #ST module purge #ST module load GCCcore/11.3.0 Python/3.10.4 #ST Source the Python packages that are version controlled #ST source /mnt/research/edgerpat_lab/Scotty/venvs/Bat_TE/bin/activate #ST Go to project directory cd $CODE_DIR/scripts # NOTE, running again after it timed out after making the revised annotation, it took a day to revise, that part is single threaded though # Run the code python $CODE_DIR/process_genome.py \ $DATA_DIR/Cleaned_${GENOME}_transcripts_DAR.tsv \ $DATA_DIR/Cleaned_${GENOME}_allTEs_DAR.tsv \ $GENOME -c \ $CODE_DIR/config/production_run_config_edit.ini \ -n 8 \ -o $OUTPUT_DIR/$RUNTYPE ``` ``` cat ../../../list.txt | while read i; do sed "s/<NAME>/$i/g" tedensity_template.sh >${i}_tedensity.sh; done cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts cat ../../../list.txt | while read i; do sbatch ${i}_tedensity.sh done ``` Aargh. Ran into the problem with unequal numbers of scaffolds between TE files and gene files again. Need to re-run the equalization script again. Redo the TE counting again, yeilding a file that will be equalized. ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/te_annotations for FILE in *_TEs_binned.tsv; do FILENAME=$(basename $FILE) NAME=$(printf '%s\n' "$FILENAME" | cut -d_ -f 1) rm ${NAME}_allTEs_binned_DAR.tsv rm ${NAME}_allTEs_binned.tsv echo "TAXON=$NAME" awk '{ print $1 } ' $FILE | sort | uniq > ${NAME}.scaff sed -i '1d' ${NAME}.scaff cat ${NAME}.scaff | while read SCAFF; do #echo "scaffold=$SCAFF" grep -w "$SCAFF" $FILE >${SCAFF}.extract.tsv #TELIST=$(grep "$SCAFF" ${SCAFF}.extract.tsv | awk '{ print $5 } ' | sort | uniq ) #echo $TELIST TECOUNT=$(grep "$SCAFF" ${SCAFF}.extract.tsv | awk '{ print $5 } ' | sort | uniq | wc -l) #echo $TECOUNT if (( TECOUNT > 4 )); then cat ${SCAFF}.extract.tsv >> ${NAME}_allTEs_binned.tsv fi rm ${SCAFF}.extract.tsv done rm ${NAME}.scaff done ``` Now, run the scaff_filter_v2.sh script. ``` #!/bin/bash #SBATCH --job-name=filter #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=1 cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/te_annotations #LIST="aPal" cat ../../list.txt | while read i; do echo "TAXON=$i" #for i in $LIST; do rm ${i}_no_te_scaffolds.txt rm ${i}_no_gene_scaffolds.txt touch ${i}_no_te_scaffolds.txt touch ${i}_no_gene_scaffolds.txt cat ${i}_scaffolds.txt | while read j; do TE_SCAFFOLD_COUNT=$(grep -w "$j" ${i}_allTEs_binned.tsv | wc -l) if [ "$TE_SCAFFOLD_COUNT" -eq "0" ]; then echo "$j" >> ${i}_no_te_scaffolds.txt fi GENE_SCAFFOLD_COUNT=$(grep -w "$j" ../genes/${i}_transcripts.tsv | wc -l) if [ "$GENE_SCAFFOLD_COUNT" -eq "0" ]; then echo "$j" >> ${i}_no_gene_scaffolds.txt fi done cat ${i}_no_gene_scaffolds.txt ${i}_no_te_scaffolds.txt | sort | uniq >${i}_remove_scaffolds.txt grep -w -v -f ${i}_remove_scaffolds.txt ../genes/${i}_transcripts.tsv > ../clean_files/Cleaned_${i}_transcripts_DAR.tsv grep -w -v -f ${i}_remove_scaffolds.txt ${i}_allTEs_binned.tsv >${i}_allTEs_binned_DAR.tsv cat te_header.txt ${i}_allTEs_binned_DAR.tsv >../clean_files/Cleaned_${i}_allTEs_binned_DAR.tsv # sed -i '1s/^/Chromosome Start Stop Strand Order SuperFamily Length \n/' ../clean_files/Cleaned_${i}_allTEs_binned_DAR.tsv done ``` ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity sbatch scaff_filter_v2.sh ``` Link new files to data directory. ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/data for FILE in /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/clean_files/*.tsv; do ln -s $FILE; done ``` These files should work. New template file: ``` #!/bin/bash #SBATCH --job-name=<NAME>-density #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=xlquanah #SBATCH --nodes=1 #SBATCH --ntasks=8 #SBATCH --time=300:00:00 #SBATCH --mem-per-cpu=31G #SBATCH --account=xlquanah #-------------------------------------------------------- CODE_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/TE_Density DATA_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/data OUTPUT_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/results GENOME=<NAME> RUNTYPE=${GENOME}_tedensity ## This script uses a python program. You need to make sure python is loaded and ready to go by activating conda. ## and the tedensity virtual environment . ~/conda/etc/profile.d/conda.sh conda activate tedensity #ST Load Python #ST module purge #ST module load GCCcore/11.3.0 Python/3.10.4 #ST Source the Python packages that are version controlled #ST source /mnt/research/edgerpat_lab/Scotty/venvs/Bat_TE/bin/activate #ST Go to project directory cd $CODE_DIR/scripts # NOTE, running again after it timed out after making the revised annotation, it took a day to revise, that part is single threaded though # Run the code python $CODE_DIR/process_genome.py \ $DATA_DIR/Cleaned_${GENOME}_transcripts_DAR.tsv \ $DATA_DIR/Cleaned_${GENOME}_allTEs_binned_DAR.tsv \ $GENOME -c \ $CODE_DIR/config/production_run_config_edit.ini \ -n 8 \ -o $OUTPUT_DIR/$RUNTYPE ``` ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts cat ../../../list.txt | while read i; do sed "s/<NAME>/$i/g" tedensity_template.sh >${i}_tedensity.sh; done cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts cat ../../../list.txt | while read i; do sbatch ${i}_tedensity.sh done cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts cat ../../../list.txt | while read i; do sbatch ${i}_tedensity_nocona.sh done ``` ## August 2023 Have been attempting to get the downstream steps to work but was running into an error. See messages with Scott. Use /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/general_read_density_data.py This can be invoked for aPal via: `python src/general_read_density_data.py data/Cleaned_aPal_transcripts_DAR.tsv results /aPal_tedensity/ "aPal_(.*?).h5"` Based on an e-mail from 8/14/2023, the problem may be related to a permissions issue. =================================== I've been debugging the code for a while now and have been unable to replicate that error. I suspect it was something to do with the permissions of the files due to the weird nature of the error message and the fact that everything works fine for me. My steps in order of operations: * cd to /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences * Activate my TE Density python package environment. I don't know if yours is different, but it shouldn't be. If you continue to have an error I suggest we start from a fresh install here. I used miniconda for mine in order to pursue a minimum working install. * python src/general_read_density_data.py data/Cleaned_aPal_transcripts_DAR.tsv results/aPal_tedensity/ "aPal_(.*?).h5" The code worked. I noticed that upon editing any files, it reset the permissions to some default settings that would keep you locked out. So before I log off today I will run chmod -R 774. I suggest you do the same from now on to avoid any more permission errors. Also it may be worth checking your Python path so that you are pointing to /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/TE_Density At this point you should be able to subset the result arrays you want for your analysis within general_read_denstiy_data.py Finally got it working. Had to reinstall from the beginning. ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences conda create -n tedensity2 conda activate tedensity2 pip install -r TE_Density/requirements/requirements.txt python --version conda install python=3.10.11 pip install -r TE_Density/requirements/requirements.txt python src/general_read_density_data.py data/Cleaned_aPal_transcripts_DAR.tsv results/aPal_tedensity/ "aPal_(.*?).h5" ``` This generates a dataframe with the TE densities in the final column associated with the genes. Saved as the taxon prefix.tsv. Parts that I need to work on: Introduce new argument for the various TE types. Right now, that's hardcoded into line 134. Possibilities are: Total_TE_Density LINE SINE RC DNA LTR Then, loop through all of the taxa and all of the TE types to generate the dataframes. ``` Gene_Name Chromosome Feature Start Stop Strand Length Index_Val Total_TE_Density_5000_Upstream 0 ENST00000302277.ZNF804A.14 manual_scaffold_1 gene 607756.0 718519.0 + 110764.0 0 0.326935 1 ENST00000424728.FSIP2.14 manual_scaffold_1 gene 953827.0 1006527.0 + 52701.0 1 0.188162 2 ENST00000358470.STAT4.8 manual_scaffold_1 gene 1113315.0 1194013.0 + 80699.0 2 0.439912 3 ENST00000392323.STAT1.8 manual_scaffold_1 gene 1209441.0 1234443.0 + 25003.0 3 0.382723 4 ENST00000540176.STAT1.8 manual_scaffold_1 gene 1209442.0 1239630.0 + 30189.0 4 0.382723 .. ... ... ... ... ... ... ... ... ... 30 ENST00000611781.ANKRD30A.6543 manual_scaffold_92 gene 15289.0 15738.0 - 450.0 30 0.619676 31 ENST00000361713.ANKRD30A.6543 manual_scaffold_92 gene 15289.0 15738.0 - 450.0 31 0.619676 32 ENST00000602533.ANKRD30A.6543 manual_scaffold_92 gene 15289.0 15738.0 - 450.0 32 0.619676 33 ENST00000399273.CCDC144A.8399 manual_scaffold_92 gene 20378.0 20595.0 - 218.0 33 0.690262 34 ENST00000360524.CCDC144A.8399 manual_scaffold_92 gene 20378.0 20595.0 - 218.0 34 0.690262 ``` Need to talk with Liliana about how to best analyze the output. Would like to generate a figure like Figure 7 in https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4 Do do so, will need to identify the taxon pairs, generate the same dataframes, identify the shared genes or the longevity-related genes (not sure the best way forward), and then determine the differences between those gene sets. Once that's done, I have a function on my desktop that is the skeleton to generate the plot. graph_function.py ----- Per an e-mail from Scott: "Now that we know that the Python version and package versions all changed in a big way (if I recall correctly it looked like there was a major version change in Pandas and perhaps a few other packages), I would recommend that the density data is recalculated. I know that's probably not what you would like to do, but I can't speak to the accuracy of results produced under different circumstances, so I have no idea if the values produced were also affected." I'm going to re-run everything from the beginning. I revised the template script to use nocona. It seemed to run fast enough last time. New script -- tedensity_nocona_template.sh ``` #!/bin/bash #SBATCH --job-name=<NAME>-density #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=8 #SBATCH --mem-per-cpu=31G #-------------------------------------------------------- CODE_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/TE_Density DATA_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/data OUTPUT_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/results GENOME=<NAME> RUNTYPE=${GENOME}_tedensity ## This script uses a python program. You need to make sure python is loaded and ready to go by activating conda. ## and the tedensity virtual environment . ~/conda/etc/profile.d/conda.sh conda activate tedensity2 #ST Load Python #ST module purge #ST module load GCCcore/11.3.0 Python/3.10.4 #ST Source the Python packages that are version controlled #ST source /mnt/research/edgerpat_lab/Scotty/venvs/Bat_TE/bin/activate #ST Go to project directory cd $CODE_DIR/scripts # NOTE, running again after it timed out after making the revised annotation, it took a day to revise, that part is single threaded though # Run the code python $CODE_DIR/process_genome.py \ $DATA_DIR/Cleaned_${GENOME}_transcripts_DAR.tsv \ $DATA_DIR/Cleaned_${GENOME}_allTEs_binned_DAR.tsv \ $GENOME -c \ $CODE_DIR/config/production_run_config_edit.ini \ -n 8 \ -o $OUTPUT_DIR/$RUNTYPE ``` Get them running: ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts cat ../../../list.txt | while read i; do sed "s/<NAME>/$i/g" tedensity_nocona_template.sh >${i}_tedensity_nocona.sh; sbatch ${i}_tedensity_nocona.sh; done ``` Looks like the queue is a bit full. Going to switch a bunch of these back to xlquanah. ``` LIST="rAff rFer rCfl rMic rSed rSin rTri rAeg sBil sLep tBra aJam" scancel {12037966..12037977} cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts for i in $LIST; do sed "s/<NAME>/$i/g" tedensity_template.sh >${i}_tedensity.sh; sbatch ${i}_tedensity.sh; done ``` Those are all running. When done, generate all of the downstream dataframes using: ``` cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences TELIST="TOTAL LINE SINE DNA LTR RC" TAXONLIST="aPal aSto dRot dEca hCyc hLar mMol mMyo mNig pDis pHas pKuh rAff rFer rCfl rMic rSed rSin rTri rAeg sBil sLep tBra aJam" for TE in $TELIST; do for TAXON in $TAXONLIST; do python src/grd_${TE}.py data/Cleaned_${TAXON}_transcripts_DAR.tsv results/${TAXON}_tedensity/ "${TAXON}_(.*?).h5" done done ``` Ran this as a script, post_processing.sh, using individual grd files for each TE type. ``` #!/bin/bash #SBATCH --job-name=post_density #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=1 #-------------------------------------------------------- ``` #### obsolete Now intersect the TEs with the influential genes. ``` actconda conda activate bedtools cd /lustre/scratch/daray/intersect/longevity_all_bats/beds cat ../list.txt | while read i; do bedtools window -a ${i}_influential.bed -b ${i}_te.bed -l 1000 -r 0 >${i}_window_influential_v_te_1000up.bed cat ../mMol_influential_list_uniq.txt | while read j; do COUNT=$(grep -w $j ${i}_window_influential_v_te_1000up.bed | wc -l | awk '{print $1}'); echo $j $COUNT >> ${i}influential_genes_counts.txt awk '{print $4}' ${i}_bat1k_transcripts_only.bed | sort > ${i}_transcripts_list.txt done done bedtools window -a mMyo_influential.bed -b mMyo_te.bed -l 1000 -r 0 >mMyo_window_influential_v_te_1000up.bed cat mMyo_influential_list.txt | while read i; do COUNT=$(grep -w $i mMyo_window_influential_v_te_1000up.bed | wc -l | awk '{print $1}'); echo $i $COUNT >> mMyo_influential_genes_counts.txt; done awk '{print $4}' mMol_bat1k_transcripts_x1.bed | sort > mMol_transcripts_list.txt awk '{print $4}' mMyo_bat1k_transcripts_x1.bed | sort > mMyo_transcripts_list.txt