# longevity bootstrap
Collect isoform 1 transcripts from TOGA data in /lustre/scratch/daray/tedensity/6bats_TOGA
```
cd /lustre/scratch/daray/tedensity/6bats_TOGA
gtf2bed < HLmyoMyo6.Bat1Kannotation.intact.gtf > mMyo_bat1k.bed
awk -F "\t" '{ if ($8 = "transcript") {print}} ' mMyo_bat1k.bed | grep -v "exon" >mMyo_bat1k_transcripts_only.bed
sed -i "s/ /\t/g" mMyo_bat1k_transcripts_only.bed
head mMyo_bat1k_transcripts_only.bed
sort -k4 mMyo_bat1k_transcripts_only.bed >mMyo_bat1k_transcripts_sort.bed
awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 }' mMyo_bat1k_transcripts_sort.bed >mMyo_bat1k_transcripts_trim.bed
sed "s/_x/\t_x/g" mMyo_bat1k_transcripts_trim.bed | head
sed "s/_x/\t_x/g" mMyo_bat1k_transcripts_trim.bed | awk '{if ($5 == "_x1") {print $1 "\t" $2 "\t" $3 "\t" $4 $5 "\t" $6 "\t" $7}}' >mMyo_bat1k_transcripts_x1.bed
```
Per Liliana, longevity gene data can be found at https://www.dropbox.com/sh/985gsici6bge1o9/AACZ7GjMIykwZ610MkHp3Hiya?dl=0
"the Navarro lists require collaboration
for wilkinson, pick outputs/longevity"
Genes in Wilkinson Momo and MYMY file are divvied in to Pos, Neg, and Neither. Assuming these correspond to positively influence, negatively influence, and no influence, respectively.
Use bedtools intersect to identify overlaps for ALL genes in transcripts_x1 and then for subset of 'influencer' genes.
Will use an overlap of within 1000 bp of entire transcript.
```
cd /lustre/scratch/daray/intersect/longevity_work
cp /lustre/scratch/daray/tedensity/6bats_TOGA/*_x1.bed .
cp ../mMol_te.bed ../mMyo_te.bed .
```
Created mMyo_influential_list.txt and mMol_influential_list.txt to extract those genes from full list. Sorted and uniq'd. No change
```
cat mMyo_influential_list.txt | while read i; do grep -w $i mMyo_bat1k_transcripts_x1.bed >>mMyo_influential.bed; done
cat mMol_influential_list.txt | while read i; do grep -w $i mMol_bat1k_transcripts_x1.bed >>mMol_influential.bed; done
```
```
bedtools window -a mMol_influential.bed -b mMol_te.bed -l 1000 -r 0 >mMol_window_influential_v_te_1000up.bed
cat mMol_influential_list.txt | while read i; do COUNT=$(grep -w $i mMol_window_influential_v_te_1000up.bed | wc -l | awk '{print $1}'); echo $i $COUNT >> mMol_influential_genes_counts.txt; done
bedtools window -a mMyo_influential.bed -b mMyo_te.bed -l 1000 -r 0 >mMyo_window_influential_v_te_1000up.bed
cat mMyo_influential_list.txt | while read i; do COUNT=$(grep -w $i mMyo_window_influential_v_te_1000up.bed | wc -l | awk '{print $1}'); echo $i $COUNT >> mMyo_influential_genes_counts.txt; done
awk '{print $4}' mMol_bat1k_transcripts_x1.bed | sort > mMol_transcripts_list.txt
awk '{print $4}' mMyo_bat1k_transcripts_x1.bed | sort > mMyo_transcripts_list.txt
bedtools window -a mMol_bat1k_transcripts_x1.bed -b mMol_te.bed -l 1000 -r 0 >mMol_window_transcripts_v_te_1000up.bed
cat mMol_transcripts_list.txt| while read i; do COUNT=$(grep -w $i mMol_window_transcripts_v_te_1000up.bed | wc -l | awk '{print $1}'); echo $i $COUNT >> mMol_all_genes_counts.txt; done
bedtools window -a mMyo_bat1k_transcripts_x1.bed -b mMyo_te.bed -l 1000 -r 0 >mMyo_window_transcripts_v_te_1000up.bed
cat mMyo_transcripts_list.txt | while read i; do COUNT=$(grep -w $i mMyo_window_transcripts_v_te_1000up.bed | wc -l | awk '{print $1}'); echo $i $COUNT >> mMyo_all_genes_counts.txt; done
```
### New work for longevity paper that Tanya is leading.
Received files from Tanya and deposited in /lustre/scratch/daray/intersect/longevity_all_bats
Unzipped and renamed all .gtf files in toga_annotations_longevity_bats/annotation_bed_gtf by removing "Copy of ". Copied those files to /lustre/scratch/daray/intersect/longevity_all_bats/processed
Now working within that directory.
```
actconda
#conda activate bedops
cd /lustre/scratch/daray/intersect/longevity_all_bats/processed
for FILE in *.gtf; do
NAME=$(basename $FILE .gtf | awk -F'[_]' '{print $1}')
echo $NAME
sed "s|/projects/hillerlab/genome/gbdb-HL/hg38/TOGA/vs_${NAME}/||g" $FILE >${NAME}
awk '{print $10}' ${NAME} >${NAME}_9.tmp
sed -i 's|"||g' ${NAME}_9.tmp
sed -i 's|;||g' ${NAME}_9.tmp
paste ${NAME} ${NAME}_9.tmp >${NAME}.tmp
rm ${NAME} ${NAME}_9.tmp
awk '{print $1 "\t" $4 "\t" $5 "\t" $13 "\t" $6 "\t" $7 "\t" $2 "\t" $3 "\t" $8 "\t" $9 "\t" $10 "\t" $11 "\t" $12}' ${NAME}.tmp >${NAME}_bat1k.bed
rm ${NAME}.tmp
awk -F "\t" '{ if ($8 = "transcript") {print}} ' ${NAME}_bat1k.bed | grep -v "exon" >${NAME}_bat1k_transcripts_only.bed
awk -i inplace '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6}' ${NAME}_bat1k_transcripts_only.bed
sed -i "s/ /\t/g" ${NAME}_bat1k_transcripts_only.bed
done
```
Renamed all the files to match our IDs.
Grabbed the appropriate RepeatMasker files from /lustre/scratch/daray/covid_bats2 or from /lustre/scratch/daray/toarchive/zoonomia, as appropriate.
During, check to make sure the assembly used to generate the RepeatMasker output is the same as the assembly used for gene annotation.
To be safe, I just re-ran repeatmasker for all of the genome assemblies. Work performed in /lustre/scratch/daray/intersect/longevity_all_bats/repeatmasker
A few species are missing due to bad assemblies. Will deal with it later.
Get the influential genes from each species using the list from mMol. mMol_influential_list_uniq.txt was derived from longevity_geneset_masterlist.xlsx
Output is a .bed of longevity-influential transcripts
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/beds
cat ../list.txt | while read i; do
cat ../mMol_influential_list_uniq.txt | while read j; do grep -w $j ${i}_bat1k_transcripts_only.bed >>${i}_influential.bed
done
done
```
Get the .bed files for the TE annotations
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/beds
cat ../list.txt | while read i; do
cat ../repeatmasker/${i}_RM/${i}_LTR_rm.bed ../repeatmasker/${i}_RM/${i}_LINE_rm.bed ../repeatmasker/${i}_RM/${i}_DNA_rm.bed ../repeatmasker/${i}_RM/${i}_SINE_rm.bed ../repeatmasker/${i}_RM/${i}_RC_rm.bed >${i}_te.bed
done
```
## Started working with Scott Teresi to run TEDensity.
See threads in e-mail.
To reduce the complexity of my TE annotation files, ran:
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/te_annotations
cat ../../list.txt | while read i; do
echo $i
sed "s|5S-Core-RTE|5S|g" ${i}_TEs.tsv >${i}_TEs.tsv.tmp
sed "s|5S-Deu-L2|5S|g; \
s|5S-RTE|5S|g; \
s|5S-Sauria-RTE|5S|g; \
s|5S|Other-SINE|g; \
s|7SL|Other-SINE|g; \
s|Alu|Other-SINE|g; \
s|U|Other-SINE|g; \
s|IS3EOther-SINE|IS3EU|g; \
s|MOther-SINELE-MuDR|MULE-MuDR|g; \
s|MOther-SINELE-NOF|MULE-NOF|g; \
s|Other-SINEnknown|Unknown|g; \
s|tRNA-7SL|tRNA-SINE|g; \
s|tRNA-CR1|tRNA-SINE|g; \
s|tRNA-Core|tRNA-SINE|g; \
s|tRNA-Core-L2|tRNA-SINE|g; \
s|tRNA-Core-RTE|tRNA-SINE|g; \
s|tRNA-Deu|tRNA-SINE|g; \
s|tRNA-Deu-L2|tRNA-SINE|g; \
s|tRNA-L1|tRNA-SINE|g; \
s|tRNA-L2|tRNA-SINE|g; \
s|tRNA-RTE|tRNA-SINE|g; \
s|tRNA-V|tRNA-SINE|g; \
s|tRNA-V-CR1|tRNA-SINE|g; \
s|Core-RTE|tRNA-SINE|g; \
s|tRNA-SINE-CR1|tRNA-SINE|g; \
s|tRNA-SINE-RTE|tRNA-SINE|g; \
s|tRNA-SINE-L2|tRNA-SINE|g; \
s|tRNA-V-Core-L2|tRNA-SINE|g; \
s|tRNA|tRNA-SINE|g; \
s|tRNA-SINE-SINE|tRNA-SINE|g; \
s|Rhin-1|Rhin|g; \
s|SINE2|Other-SINE|g; \
s|tRNA-SINE-Sauria|tRNA-SINE|g; \
s|unknown|Unknown|g; \
s|CMC-Chapaev-3|CMC|g; \
s|CMC-EnSpm|CMC|g; \
s|Crypton-A|Crypton|g; \
s|Crypton-H|Crypton|g; \
s|Crypton-V|Crypton|g; \
s|Kolobok-T2|Kolobok|g; \
s|MEG|Meg|g; \
s|MULE-MuDR|MULE|g; \
s|MULE-NOF|MULE|g; \
s|PiggyBac|piggyBac|g; \
s|Sola-1|Sola|g; \
s|Sola-2|Sola|g; \
s|TcMar-Fot1|TcMariner|g; \
s|TcMar-ISRm11|TcMariner|g; \
s|TcMar-Mariner|TcMariner|g; \
s|TcMar-Stowaway|TcMariner|g; \
s|TcMar-Tc1|TcMariner|g; \
s|TcMar-Tc2|TcMariner|g; \
s|TcMar-Tigger|TcMariner|g; \
s|TcMar|TcMariner|g; \
s|TcMariner-Mariner|TcMariner|g; \
s|TcMariner-ISRm11|TcMariner|g; \
s|TcMariner-Tigger|TcMariner|g; \
s|TcMariner-Mariner|TcMariner|g; \
s|TcMariner-ISRm11|TcMariner|g; \
s|TcMarineriner|TcMariner|g; \
s|hAT-Blackjack|hAT|g; \
s|hAT-Charlie|hAT|g; \
s|hAT-Tip100|hAT|g; \
s|Academ-1|Other-DNA|g; \
s|CACTA|Other-DNA|g; \
s|CMC|Other-DNA|g; \
s|Crypton|Other-DNA|g; \
s|Dada|Other-DNA|g; \
s|Ginger|Other-DNA|g; \
s|PIF-Harbinger|Harbinger|g; \
s|Harbinger|Other-DNA|g; \
s|IS3EU|Other-DNA|g; \
s|Kolobok|Other-DNA|g; \
s|Maverick|Other-DNA|g; \
s|MULE|Other-DNA|g; \
s|Mutator|Other-DNA|g; \
s|Merlin|Other-DNA|g; \
s|Novosib|Other-DNA|g; \
s|PIF-ISL2EU|Other-DNA|g; \
s|P|Other-DNA|g; \
s|Other-DNAao|Pao|g; \
s|Other-DNAenelope|P|g; \
s|Sola|Other-DNA|g; \
s|Zator|Other-DNA|g; \
s|Zisupton|Other-DNA|g; \
s|Ngaro|Other-Retrotransposon|g; \
s|DIRS|Other-Retrotransposon|g; \
s|I-Jockey|Other-LINE|g; \
s|Jockey|Other-LINE|g; \
s|R2-Hero|R2|g; \
s|R2-NeSL|R2|g; \
s|RTE-BovB|RTE|g; \
s|RTE-RTE|RTE|g; \
s|RTE-X|RTE|g; \
s|BovB|RTE|g; \
s|L1-Tx1|L1|g; \
s|B2|Other-LINE|g; \
s|B4|Other-LINE|g; \
s|Dong-R4|Other-LINE|g; \
s|L2|Other-LINE|g; \
s|LINE-like|Other-LINE|g; \
s|R1|Other-LINE|g; \
s|COther-LINE|CR1|g; \
s|R2|Other-LINE|g; \
s|Rex-Babar|Other-LINE|g; \
s|ERV-Lenti|ERV|g; \
s|ERV-Foamy|ERV|g; \
s|ERV1|ERV|g; \
s|ERV1_internal|ERV|g; \
s|ERV1_ltr|ERV|g; \
s|ERV4|ERV|g; \
s|ERVK|ERV|g; \
s|ERVK_internal|ERV|g; \
s|ERV_ltr|ERV|g; \
s|ERVK_ltr|ERV|g; \
s|ERVL|ERV|g; \
s|ERVL-MaLR|ERV|g; \
s|ERV-MaLR|ERV|g; \
s|ERVL_internal|ERV|g; \
s|ERVL_ltr|ERV|g; \
s|ERV_internal|ERV|g; \
s|ERV_ltr|ERV|g; \
s|Bhikhari|Other-LTR|g; \
s|Other-DNAIF-ISOther-LINEEOther-SINE|Other-DNA|g; \
#s|tRNA-SINE-Core-Other-LINE|tRNA-SINE|g; \
s|Pao|Other-LTR|g" ${i}_TEs.tsv.tmp >${i}_TEs_binned.tsv
awk '{print $6}' ${i}_TEs_binned.tsv | sort | uniq >${i}_uniq_superfamily.txt
awk '{print $5}' ${i}_TEs_binned.tsv | sort | uniq >${i}_uniq_order.txt
rm ${i}_TEs.tsv.tmp
#gzip -c ${i}_TEs_binned.tsv >${i}_TEs_binned.tsv.gz
done
```
Determine which scaffolds to use:
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity
sbatch scaff_filter.sh
```
Had to re-run the six bats because the scaffold names were different on the NCBI versions and the TOGA-run versions.
Managed to find the assemblies with the old TOGA names and re-ran RepeatMasker. Output is in: /lustre/scratch/daray/intersect/longevity_all_bats/repeatmasker
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/beds
LIST="mMol mMyo pDis pKuh rAeg rFer"
for i in $LIST; do
cat ../repeatmasker/${i}_RM/${i}_LTR_rm.bed ../repeatmasker/${i}_RM/${i}_LINE_rm.bed ../repeatmasker/${i}_RM/${i}_DNA_rm.bed ../repeatmasker/${i}_RM/${i}_SINE_rm.bed ../repeatmasker/${i}_RM/${i}_RC_rm.bed >${i}_te.bed
done
```
Convert to tsv

```
cd /lustre/scratch/daray/intersect/longevity_all_bats/beds
LIST="mMol mMyo pDis pKuh rAeg rFer"
for i in $LIST; do
awk '{print $1 "\t" $2 "\t" $3 "\t" $6 "\t" $7 "\t" $8 "\t" $3-$2+1}' ${i}_te.bed >../tedensity/te_annotations/${i}_TEs.tmp
cat ../tedensity/te_annotations/te_header.txt ../tedensity/te_annotations/${i}_TEs.tmp >../tedensity/te_annotations/${i}_TEs.tsv
rm ../tedensity/te_annotations/${i}_TEs.tmp
done
```
Clean up
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/te_annotations
LIST="mMol mMyo pDis pKuh rAeg rFer"
for i in $LIST; do
echo $i
sed "s|5S-Core-RTE|5S|g" ${i}_TEs.tsv >${i}_TEs.tsv.tmp
sed "s|5S-Deu-L2|5S|g; \
s|5S-RTE|5S|g; \
s|5S-Sauria-RTE|5S|g; \
s|5S|Other-SINE|g; \
s|7SL|Other-SINE|g; \
s|Alu|Other-SINE|g; \
s|U|Other-SINE|g; \
s|IS3EOther-SINE|IS3EU|g; \
s|MOther-SINELE-MuDR|MULE-MuDR|g; \
s|MOther-SINELE-NOF|MULE-NOF|g; \
s|Other-SINEnknown|Unknown|g; \
s|tRNA-7SL|tRNA-SINE|g; \
s|tRNA-CR1|tRNA-SINE|g; \
s|tRNA-Core|tRNA-SINE|g; \
s|tRNA-Core-L2|tRNA-SINE|g; \
s|tRNA-Core-RTE|tRNA-SINE|g; \
s|tRNA-Deu|tRNA-SINE|g; \
s|tRNA-Deu-L2|tRNA-SINE|g; \
s|tRNA-L1|tRNA-SINE|g; \
s|tRNA-L2|tRNA-SINE|g; \
s|tRNA-RTE|tRNA-SINE|g; \
s|tRNA-V|tRNA-SINE|g; \
s|tRNA-V-CR1|tRNA-SINE|g; \
s|Core-RTE|tRNA-SINE|g; \
s|tRNA-SINE-CR1|tRNA-SINE|g; \
s|tRNA-SINE-RTE|tRNA-SINE|g; \
s|tRNA-SINE-L2|tRNA-SINE|g; \
s|tRNA-V-Core-L2|tRNA-SINE|g; \
s|tRNA|tRNA-SINE|g; \
s|tRNA-SINE-SINE|tRNA-SINE|g; \
s|Rhin-1|Rhin|g; \
s|SINE2|Other-SINE|g; \
s|tRNA-SINE-Sauria|tRNA-SINE|g; \
s|tRNA-SINE-Ceph-RTE|tRNA-SINE|g; \
s|Ceph|tRNA-SINE|g; \
s|tRNA-SINE-Mermaid|tRNA-SINE|g; \
s|tRNA-SINE-Meta|tRNA-SINE|g; \
s|Core|Other-SINE|g; \
s|unknown|Unknown|g; \
s|CMC-Chapaev-3|CMC|g; \
s|CMC-EnSpm|CMC|g; \
s|Crypton-A|Crypton|g; \
s|Crypton-H|Crypton|g; \
s|Crypton-V|Crypton|g; \
s|Kolobok-T2|Kolobok|g; \
s|MEG|Meg|g; \
s|MULE-MuDR|MULE|g; \
s|MULE-NOF|MULE|g; \
s|PiggyBac|piggyBac|g; \
s|Sola-1|Sola|g; \
s|Sola-2|Sola|g; \
s|TcMar-Fot1|TcMariner|g; \
s|TcMar-ISRm11|TcMariner|g; \
s|TcMar-Mariner|TcMariner|g; \
s|TcMar-Stowaway|TcMariner|g; \
s|TcMar-Tc1|TcMariner|g; \
s|TcMar-Tc2|TcMariner|g; \
s|TcMar-Tigger|TcMariner|g; \
s|TcMar|TcMariner|g; \
s|TcMariner-Mariner|TcMariner|g; \
s|TcMariner-ISRm11|TcMariner|g; \
s|TcMariner-Tigger|TcMariner|g; \
s|TcMariner-Mariner|TcMariner|g; \
s|TcMariner-ISRm11|TcMariner|g; \
s|TcMarineriner|TcMariner|g; \
s|hAT-Blackjack|hAT|g; \
s|hAT-Charlie|hAT|g; \
s|hAT-Tip100|hAT|g; \
s|Academ-1|Other-DNA|g; \
s|CACTA|Other-DNA|g; \
s|CMC|Other-DNA|g; \
s|Crypton|Other-DNA|g; \
s|Dada|Other-DNA|g; \
s|Ginger|Other-DNA|g; \
s|PIF-Harbinger|Harbinger|g; \
s|Harbinger|Other-DNA|g; \
s|IS3EU|Other-DNA|g; \
s|Kolobok|Other-DNA|g; \
s|Maverick|Other-DNA|g; \
s|MULE|Other-DNA|g; \
s|Mutator|Other-DNA|g; \
s|Merlin|Other-DNA|g; \
s|Novosib|Other-DNA|g; \
s|PIF-ISL2EU|Other-DNA|g; \
s|P|Other-DNA|g; \
s|Other-DNAao|Pao|g; \
s|Other-DNAenelope|P|g; \
s|Sola|Other-DNA|g; \
s|Zator|Other-DNA|g; \
s|Zisupton|Other-DNA|g; \
s|Ngaro|Other-Retrotransposon|g; \
s|DIRS|Other-Retrotransposon|g; \
s|I-Jockey|Other-LINE|g; \
s|Jockey|Other-LINE|g; \
s|R2-Hero|R2|g; \
s|R2-NeSL|R2|g; \
s|RTE-BovB|RTE|g; \
s|RTE-RTE|RTE|g; \
s|RTE-X|RTE|g; \
s|BovB|RTE|g; \
s|L1-Tx1|L1|g; \
s|B2|Other-LINE|g; \
s|B4|Other-LINE|g; \
s|Dong-R4|Other-LINE|g; \
s|L2|Other-LINE|g; \
s|LINE-like|Other-LINE|g; \
s|R1|Other-LINE|g; \
s|COther-LINE|CR1|g; \
s|R2|Other-LINE|g; \
s|Rex-Babar|Other-LINE|g; \
s|CR1-Zenon|CR1|g; \
s|Tad1|Other-LINE|g; \
s|R0|Other-LINE|g; \
s|Other-DNAroto2|Other-LINE|g; \
s|ERV-Lenti|ERV|g; \
s|ERV-Foamy|ERV|g; \
s|ERV1|ERV|g; \
s|ERV1_internal|ERV|g; \
s|ERV1_ltr|ERV|g; \
s|ERV4|ERV|g; \
s|ERVK|ERV|g; \
s|ERVK_internal|ERV|g; \
s|ERV_ltr|ERV|g; \
s|ERVK_ltr|ERV|g; \
s|ERVL|ERV|g; \
s|ERVL-MaLR|ERV|g; \
s|ERV-MaLR|ERV|g; \
s|ERVL_internal|ERV|g; \
s|ERVL_ltr|ERV|g; \
s|ERV_internal|ERV|g; \
s|ERV_ltr|ERV|g; \
s|Bhikhari|Other-LTR|g; \
s|Other-DNAIF-ISOther-LINEEOther-SINE|Other-DNA|g; \
s|tRNA-SINE-Core-Other-LINE|tRNA-SINE|g; \
s|tRNA-SINE-Core-Other-LINE|tRNA-SINE|g; \
s|tRNA-SINE-Other-SINE|tRNA-SINE|g; \
s|tRNA-SINE-Other-LINE|tRNA-SINE|g; \
s|Other-DNA-Transib|Other-DNA|g; \
s|tRNA-SINE-RTE|tRNA-SINE|g; \
s|Pao|Other-LTR|g" ${i}_TEs.tsv.tmp >${i}_TEs_binned.tsv
cp ${i}_TEs_binned.tsv /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/clean_files/cleaned/Cleaned_${i}_TEs_binned_DAR.tsv
awk '{print $6}' ${i}_TEs_binned.tsv | sort | uniq >${i}_uniq_superfamily.txt
awk '{print $5}' ${i}_TEs_binned.tsv | sort | uniq >${i}_uniq_order.txt
rm ${i}_TEs.tsv.tmp
gzip -c ${i}_TEs_binned.tsv >${i}_TEs_binned.tsv.gz
done
LIST="mMol mMyo pDis pKuh rAeg rFer"
for i in $LIST; do
cp ${i}_TEs_binned.tsv /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/clean_files/Cleaned_${i}_TEs_binned_DAR.tsv
done
```
Clean up scaffolds and run TEDensity
```
sbatch scaff_filter_6bats.sh
#Wait to finish
cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts
LIST="mMol mMyo pDis pKuh rAeg rFer"
for i in $LIST; do
sbatch ${i}_tedensity.sh
done
```
aJam is messed up. Seems to be at the step of creating the Cleaned TE file.
```
sbatch scaff_filter_aJam.sh
```
The TE_binned file is still not right. Go back farther.
Found the problem. The assembly used for TOGA and for RepeatMasker are not the same.
Got what I think is the right assembly from Tanya but I need to check.
Start by checking to see if the same scaffolds are represented in the TOGA and RM annotations.
Run RepeatMasker on the 'correct' assembly
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/assemblies
head aJam.masked.fa
awk 'BEGIN{FS=" "}{if(!/>/){print toupper($0)}else{print $1}}' aJam.masked.fa > aJam.unmasked.fa
gzip aJam.unmasked.fa >aJam.fa.gz
cd /lustre/scratch/daray/intersect/longevity_all_bats/repeatmasker
sbatch aJam_RM.sh
cd /lustre/scratch/daray/intersect/longevity_all_bats/beds
LIST="aJam"
for i in $LIST; do
cat ../repeatmasker/${i}_RM/${i}_LTR_rm.bed ../repeatmasker/${i}_RM/${i}_LINE_rm.bed ../repeatmasker/${i}_RM/${i}_DNA_rm.bed ../repeatmasker/${i}_RM/${i}_SINE_rm.bed ../repeatmasker/${i}_RM/${i}_RC_rm.bed >${i}_te.bed
done
awk '{print $1}' aJam_te.bed | sort |uniq |wc -l
---> 830
awk '{print $1}' aJam_bat1k_transcripts_only.bed | sort |uniq |wc -l
---> 400
```
The numbers don't match but that may not mean anything. It's unlikely that a gene was found on every scaffold of the assembly.
Keep going and cross your fingers.
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/beds
LIST="aJam"
for i in $LIST; do
awk '{print $1 "\t" $2 "\t" $3 "\t" $6 "\t" $7 "\t" $8 "\t" $3-$2+1}' ${i}_te.bed >../tedensity/te_annotations/${i}_TEs.tmp
cat ../tedensity/te_annotations/te_header.txt ../tedensity/te_annotations/${i}_TEs.tmp >../tedensity/te_annotations/${i}_TEs.tsv
rm ../tedensity/te_annotations/${i}_TEs.tmp
done
cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/te_annotations
LIST="aJam"
for i in $LIST; do
echo $i
sed "s|5S-Core-RTE|5S|g" ${i}_TEs.tsv >${i}_TEs.tsv.tmp
sed "s|5S-Deu-L2|5S|g; \
s|5S-RTE|5S|g; \
s|5S-Sauria-RTE|5S|g; \
s|5S|Other-SINE|g; \
s|7SL|Other-SINE|g; \
s|Alu|Other-SINE|g; \
s|U|Other-SINE|g; \
s|IS3EOther-SINE|IS3EU|g; \
s|MOther-SINELE-MuDR|MULE-MuDR|g; \
s|MOther-SINELE-NOF|MULE-NOF|g; \
s|Other-SINEnknown|Unknown|g; \
s|tRNA-7SL|tRNA-SINE|g; \
s|tRNA-CR1|tRNA-SINE|g; \
s|tRNA-Core|tRNA-SINE|g; \
s|tRNA-Core-L2|tRNA-SINE|g; \
s|tRNA-Core-RTE|tRNA-SINE|g; \
s|tRNA-Deu|tRNA-SINE|g; \
s|tRNA-Deu-L2|tRNA-SINE|g; \
s|tRNA-L1|tRNA-SINE|g; \
s|tRNA-L2|tRNA-SINE|g; \
s|tRNA-RTE|tRNA-SINE|g; \
s|tRNA-V|tRNA-SINE|g; \
s|tRNA-V-CR1|tRNA-SINE|g; \
s|Core-RTE|tRNA-SINE|g; \
s|tRNA-SINE-CR1|tRNA-SINE|g; \
s|tRNA-SINE-RTE|tRNA-SINE|g; \
s|tRNA-SINE-L2|tRNA-SINE|g; \
s|tRNA-V-Core-L2|tRNA-SINE|g; \
s|tRNA|tRNA-SINE|g; \
s|tRNA-SINE-SINE|tRNA-SINE|g; \
s|Rhin-1|Rhin|g; \
s|SINE2|Other-SINE|g; \
s|tRNA-SINE-Sauria|tRNA-SINE|g; \
s|tRNA-SINE-Ceph-RTE|tRNA-SINE|g; \
s|Ceph|tRNA-SINE|g; \
s|tRNA-SINE-Mermaid|tRNA-SINE|g; \
s|tRNA-SINE-Meta|tRNA-SINE|g; \
s|Core|Other-SINE|g; \
s|unknown|Unknown|g; \
s|CMC-Chapaev-3|CMC|g; \
s|CMC-EnSpm|CMC|g; \
s|Crypton-A|Crypton|g; \
s|Crypton-H|Crypton|g; \
s|Crypton-V|Crypton|g; \
s|Kolobok-T2|Kolobok|g; \
s|MEG|Meg|g; \
s|MULE-MuDR|MULE|g; \
s|MULE-NOF|MULE|g; \
s|PiggyBac|piggyBac|g; \
s|Sola-1|Sola|g; \
s|Sola-2|Sola|g; \
s|TcMar-Fot1|TcMariner|g; \
s|TcMar-ISRm11|TcMariner|g; \
s|TcMar-Mariner|TcMariner|g; \
s|TcMar-Stowaway|TcMariner|g; \
s|TcMar-Tc1|TcMariner|g; \
s|TcMar-Tc2|TcMariner|g; \
s|TcMar-Tigger|TcMariner|g; \
s|TcMar|TcMariner|g; \
s|TcMariner-Mariner|TcMariner|g; \
s|TcMariner-ISRm11|TcMariner|g; \
s|TcMariner-Tigger|TcMariner|g; \
s|TcMariner-Mariner|TcMariner|g; \
s|TcMariner-ISRm11|TcMariner|g; \
s|TcMarineriner|TcMariner|g; \
s|hAT-Blackjack|hAT|g; \
s|hAT-Charlie|hAT|g; \
s|hAT-Tip100|hAT|g; \
s|Academ-1|Other-DNA|g; \
s|CACTA|Other-DNA|g; \
s|CMC|Other-DNA|g; \
s|Crypton|Other-DNA|g; \
s|Dada|Other-DNA|g; \
s|Ginger|Other-DNA|g; \
s|PIF-Harbinger|Harbinger|g; \
s|Harbinger|Other-DNA|g; \
s|IS3EU|Other-DNA|g; \
s|Kolobok|Other-DNA|g; \
s|Maverick|Other-DNA|g; \
s|MULE|Other-DNA|g; \
s|Mutator|Other-DNA|g; \
s|Merlin|Other-DNA|g; \
s|Novosib|Other-DNA|g; \
s|PIF-ISL2EU|Other-DNA|g; \
s|P|Other-DNA|g; \
s|Other-DNAao|Pao|g; \
s|Other-DNAenelope|P|g; \
s|Sola|Other-DNA|g; \
s|Zator|Other-DNA|g; \
s|Zisupton|Other-DNA|g; \
s|Ngaro|Other-Retrotransposon|g; \
s|DIRS|Other-Retrotransposon|g; \
s|I-Jockey|Other-LINE|g; \
s|Jockey|Other-LINE|g; \
s|R2-Hero|R2|g; \
s|R2-NeSL|R2|g; \
s|RTE-BovB|RTE|g; \
s|RTE-RTE|RTE|g; \
s|RTE-X|RTE|g; \
s|BovB|RTE|g; \
s|L1-Tx1|L1|g; \
s|B2|Other-LINE|g; \
s|B4|Other-LINE|g; \
s|Dong-R4|Other-LINE|g; \
s|L2|Other-LINE|g; \
s|LINE-like|Other-LINE|g; \
s|R1|Other-LINE|g; \
s|COther-LINE|CR1|g; \
s|R2|Other-LINE|g; \
s|Rex-Babar|Other-LINE|g; \
s|CR1-Zenon|CR1|g; \
s|Tad1|Other-LINE|g; \
s|R0|Other-LINE|g; \
s|Other-DNAroto2|Other-LINE|g; \
s|ERV-Lenti|ERV|g; \
s|ERV-Foamy|ERV|g; \
s|ERV1|ERV|g; \
s|ERV1_internal|ERV|g; \
s|ERV1_ltr|ERV|g; \
s|ERV4|ERV|g; \
s|ERVK|ERV|g; \
s|ERVK_internal|ERV|g; \
s|ERV_ltr|ERV|g; \
s|ERVK_ltr|ERV|g; \
s|ERVL|ERV|g; \
s|ERVL-MaLR|ERV|g; \
s|ERV-MaLR|ERV|g; \
s|ERVL_internal|ERV|g; \
s|ERVL_ltr|ERV|g; \
s|ERV_internal|ERV|g; \
s|ERV_ltr|ERV|g; \
s|Bhikhari|Other-LTR|g; \
s|Other-DNAIF-ISOther-LINEEOther-SINE|Other-DNA|g; \
s|tRNA-SINE-Core-Other-LINE|tRNA-SINE|g; \
s|tRNA-SINE-Core-Other-LINE|tRNA-SINE|g; \
s|tRNA-SINE-Other-SINE|tRNA-SINE|g; \
s|tRNA-SINE-Other-LINE|tRNA-SINE|g; \
s|Other-DNA-Transib|Other-DNA|g; \
s|tRNA-SINE-RTE|tRNA-SINE|g; \
s|Pao|Other-LTR|g" ${i}_TEs.tsv.tmp >${i}_TEs_binned.tsv
cp ${i}_TEs_binned.tsv /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/clean_files/Cleaned_${i}_TEs_binned_DAR.tsv
awk '{print $6}' ${i}_TEs_binned.tsv | sort | uniq >${i}_uniq_superfamily.txt
awk '{print $5}' ${i}_TEs_binned.tsv | sort | uniq >${i}_uniq_order.txt
rm ${i}_TEs.tsv.tmp
gzip -c ${i}_TEs_binned.tsv >${i}_TEs_binned.tsv.gz
done
cd ..
sbatch scaff_filter_aJam.sh
#Wait to finish
cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts
LIST="aJam"
for i in $LIST; do
sbatch ${i}_tedensity.sh
done
```
While talking with Scott, we discovered that the downstream analysis crashes if not all TEs are present in the .hd5 files.
We decided to filter the input data for only scaffolds with ALL TEs. I did so with this:
```
for FILE in *_TEs_binned_DAR.tsv; do
NAME=$(printf '%s\n' "$FILE" | cut -d_ -f 2)
rm Cleaned_${NAME}_allTEs_DAR.tsv
echo "TAXON=$NAME"
awk '{ print $1 } ' $FILE | sort | uniq > ${NAME}.scaff
sed -i '1d' ${NAME}.scaff
cat ${NAME}.scaff | while read SCAFF; do
#echo "scaffold=$SCAFF"
grep -w "$SCAFF" $FILE >${SCAFF}.extract.tsv
#TELIST=$(grep "$SCAFF" ${SCAFF}.extract.tsv | awk '{ print $5 } ' | sort | uniq )
#echo $TELIST
TECOUNT=$(grep "$SCAFF" ${SCAFF}.extract.tsv | awk '{ print $5 } ' | sort | uniq | wc -l)
#echo $TECOUNT
if (( TECOUNT > 4 )); then
cat ${SCAFF}.extract.tsv >> Cleaned_${NAME}_allTEs_DAR.tsv
fi
rm ${SCAFF}.extract.tsv
done
rm ${NAME}.scaff
done
```
Now I need to re-run TEDensity but with a new config file.
First, link the new TE files.
```
/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/data$ for FILE in ls /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/clean_files/*.tsv; do ln -s $FILE; done
```
Change production_config file to 5kb windows
```
[density_parameters]
first_window_size = 5000
window_delta = 2500
last_window_size = 5000
```
Change template to use production_config file and to use the new TE files.
New template file:
```
#!/bin/bash
#SBATCH --job-name=<NAME>-density
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=xlquanah
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=300:00:00
#SBATCH --mem-per-cpu=31G
#SBATCH --account=xlquanah
#--------------------------------------------------------
CODE_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/TE_Density
DATA_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/data
OUTPUT_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/results
GENOME=<NAME>
RUNTYPE=${GENOME}_tedensity
## This script uses a python program. You need to make sure python is loaded and ready to go by activating conda.
## and the tedensity virtual environment
. ~/conda/etc/profile.d/conda.sh
conda activate tedensity
#ST Load Python
#ST module purge
#ST module load GCCcore/11.3.0 Python/3.10.4
#ST Source the Python packages that are version controlled
#ST source /mnt/research/edgerpat_lab/Scotty/venvs/Bat_TE/bin/activate
#ST Go to project directory
cd $CODE_DIR/scripts
# NOTE, running again after it timed out after making the revised annotation, it took a day to revise, that part is single threaded though
# Run the code
python $CODE_DIR/process_genome.py \
$DATA_DIR/Cleaned_${GENOME}_transcripts_DAR.tsv \
$DATA_DIR/Cleaned_${GENOME}_allTEs_DAR.tsv \
$GENOME -c \
$CODE_DIR/config/production_run_config_edit.ini \
-n 8 \
-o $OUTPUT_DIR/$RUNTYPE
```
```
cat ../../../list.txt | while read i; do sed "s/<NAME>/$i/g" tedensity_template.sh >${i}_tedensity.sh; done
cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts
cat ../../../list.txt | while read i; do
sbatch ${i}_tedensity.sh
done
```
Aargh. Ran into the problem with unequal numbers of scaffolds between TE files and gene files again. Need to re-run the equalization script again.
Redo the TE counting again, yeilding a file that will be equalized.
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/te_annotations
for FILE in *_TEs_binned.tsv; do
FILENAME=$(basename $FILE)
NAME=$(printf '%s\n' "$FILENAME" | cut -d_ -f 1)
rm ${NAME}_allTEs_binned_DAR.tsv
rm ${NAME}_allTEs_binned.tsv
echo "TAXON=$NAME"
awk '{ print $1 } ' $FILE | sort | uniq > ${NAME}.scaff
sed -i '1d' ${NAME}.scaff
cat ${NAME}.scaff | while read SCAFF; do
#echo "scaffold=$SCAFF"
grep -w "$SCAFF" $FILE >${SCAFF}.extract.tsv
#TELIST=$(grep "$SCAFF" ${SCAFF}.extract.tsv | awk '{ print $5 } ' | sort | uniq )
#echo $TELIST
TECOUNT=$(grep "$SCAFF" ${SCAFF}.extract.tsv | awk '{ print $5 } ' | sort | uniq | wc -l)
#echo $TECOUNT
if (( TECOUNT > 4 )); then
cat ${SCAFF}.extract.tsv >> ${NAME}_allTEs_binned.tsv
fi
rm ${SCAFF}.extract.tsv
done
rm ${NAME}.scaff
done
```
Now, run the scaff_filter_v2.sh script.
```
#!/bin/bash
#SBATCH --job-name=filter
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=1
cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/te_annotations
#LIST="aPal"
cat ../../list.txt | while read i; do
echo "TAXON=$i"
#for i in $LIST; do
rm ${i}_no_te_scaffolds.txt
rm ${i}_no_gene_scaffolds.txt
touch ${i}_no_te_scaffolds.txt
touch ${i}_no_gene_scaffolds.txt
cat ${i}_scaffolds.txt | while read j; do
TE_SCAFFOLD_COUNT=$(grep -w "$j" ${i}_allTEs_binned.tsv | wc -l)
if [ "$TE_SCAFFOLD_COUNT" -eq "0" ]; then
echo "$j" >> ${i}_no_te_scaffolds.txt
fi
GENE_SCAFFOLD_COUNT=$(grep -w "$j" ../genes/${i}_transcripts.tsv | wc -l)
if [ "$GENE_SCAFFOLD_COUNT" -eq "0" ]; then
echo "$j" >> ${i}_no_gene_scaffolds.txt
fi
done
cat ${i}_no_gene_scaffolds.txt ${i}_no_te_scaffolds.txt | sort | uniq >${i}_remove_scaffolds.txt
grep -w -v -f ${i}_remove_scaffolds.txt ../genes/${i}_transcripts.tsv > ../clean_files/Cleaned_${i}_transcripts_DAR.tsv
grep -w -v -f ${i}_remove_scaffolds.txt ${i}_allTEs_binned.tsv >${i}_allTEs_binned_DAR.tsv
cat te_header.txt ${i}_allTEs_binned_DAR.tsv >../clean_files/Cleaned_${i}_allTEs_binned_DAR.tsv
# sed -i '1s/^/Chromosome Start Stop Strand Order SuperFamily Length \n/' ../clean_files/Cleaned_${i}_allTEs_binned_DAR.tsv
done
```
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/tedensity
sbatch scaff_filter_v2.sh
```
Link new files to data directory.
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/data
for FILE in /lustre/scratch/daray/intersect/longevity_all_bats/tedensity/clean_files/*.tsv; do ln -s $FILE; done
```
These files should work.
New template file:
```
#!/bin/bash
#SBATCH --job-name=<NAME>-density
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=xlquanah
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=300:00:00
#SBATCH --mem-per-cpu=31G
#SBATCH --account=xlquanah
#--------------------------------------------------------
CODE_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/TE_Density
DATA_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/data
OUTPUT_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/results
GENOME=<NAME>
RUNTYPE=${GENOME}_tedensity
## This script uses a python program. You need to make sure python is loaded and ready to go by activating conda.
## and the tedensity virtual environment
. ~/conda/etc/profile.d/conda.sh
conda activate tedensity
#ST Load Python
#ST module purge
#ST module load GCCcore/11.3.0 Python/3.10.4
#ST Source the Python packages that are version controlled
#ST source /mnt/research/edgerpat_lab/Scotty/venvs/Bat_TE/bin/activate
#ST Go to project directory
cd $CODE_DIR/scripts
# NOTE, running again after it timed out after making the revised annotation, it took a day to revise, that part is single threaded though
# Run the code
python $CODE_DIR/process_genome.py \
$DATA_DIR/Cleaned_${GENOME}_transcripts_DAR.tsv \
$DATA_DIR/Cleaned_${GENOME}_allTEs_binned_DAR.tsv \
$GENOME -c \
$CODE_DIR/config/production_run_config_edit.ini \
-n 8 \
-o $OUTPUT_DIR/$RUNTYPE
```
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts
cat ../../../list.txt | while read i; do sed "s/<NAME>/$i/g" tedensity_template.sh >${i}_tedensity.sh; done
cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts
cat ../../../list.txt | while read i; do
sbatch ${i}_tedensity.sh
done
cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts
cat ../../../list.txt | while read i; do
sbatch ${i}_tedensity_nocona.sh
done
```
## August 2023
Have been attempting to get the downstream steps to work but was running into an error. See messages with Scott.
Use /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/general_read_density_data.py
This can be invoked for aPal via: `python src/general_read_density_data.py data/Cleaned_aPal_transcripts_DAR.tsv results
/aPal_tedensity/ "aPal_(.*?).h5"`
Based on an e-mail from 8/14/2023, the problem may be related to a permissions issue.
===================================
I've been debugging the code for a while now and have been unable to replicate that error. I suspect it was something to do with the permissions of the files due to the weird nature of the error message and the fact that everything works fine for me.
My steps in order of operations:
* cd to /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences
* Activate my TE Density python package environment. I don't know if yours is different, but it shouldn't be. If you continue to have an error I suggest we start from a fresh install here. I used miniconda for mine in order to pursue a minimum working install.
* python src/general_read_density_data.py data/Cleaned_aPal_transcripts_DAR.tsv results/aPal_tedensity/ "aPal_(.*?).h5"
The code worked.
I noticed that upon editing any files, it reset the permissions to some default settings that would keep you locked out. So before I log off today I will run chmod -R 774. I suggest you do the same from now on to avoid any more permission errors. Also it may be worth checking your Python path so that you are pointing to /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/TE_Density
At this point you should be able to subset the result arrays you want for your analysis within general_read_denstiy_data.py
Finally got it working. Had to reinstall from the beginning.
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences
conda create -n tedensity2
conda activate tedensity2
pip install -r TE_Density/requirements/requirements.txt
python --version
conda install python=3.10.11
pip install -r TE_Density/requirements/requirements.txt
python src/general_read_density_data.py data/Cleaned_aPal_transcripts_DAR.tsv results/aPal_tedensity/ "aPal_(.*?).h5"
```
This generates a dataframe with the TE densities in the final column associated with the genes. Saved as the taxon prefix.tsv.
Parts that I need to work on:
Introduce new argument for the various TE types. Right now, that's hardcoded into line 134.
Possibilities are:
Total_TE_Density
LINE
SINE
RC
DNA
LTR
Then, loop through all of the taxa and all of the TE types to generate the dataframes.
```
Gene_Name Chromosome Feature Start Stop Strand Length Index_Val Total_TE_Density_5000_Upstream
0 ENST00000302277.ZNF804A.14 manual_scaffold_1 gene 607756.0 718519.0 + 110764.0 0 0.326935
1 ENST00000424728.FSIP2.14 manual_scaffold_1 gene 953827.0 1006527.0 + 52701.0 1 0.188162
2 ENST00000358470.STAT4.8 manual_scaffold_1 gene 1113315.0 1194013.0 + 80699.0 2 0.439912
3 ENST00000392323.STAT1.8 manual_scaffold_1 gene 1209441.0 1234443.0 + 25003.0 3 0.382723
4 ENST00000540176.STAT1.8 manual_scaffold_1 gene 1209442.0 1239630.0 + 30189.0 4 0.382723
.. ... ... ... ... ... ... ... ... ...
30 ENST00000611781.ANKRD30A.6543 manual_scaffold_92 gene 15289.0 15738.0 - 450.0 30 0.619676
31 ENST00000361713.ANKRD30A.6543 manual_scaffold_92 gene 15289.0 15738.0 - 450.0 31 0.619676
32 ENST00000602533.ANKRD30A.6543 manual_scaffold_92 gene 15289.0 15738.0 - 450.0 32 0.619676
33 ENST00000399273.CCDC144A.8399 manual_scaffold_92 gene 20378.0 20595.0 - 218.0 33 0.690262
34 ENST00000360524.CCDC144A.8399 manual_scaffold_92 gene 20378.0 20595.0 - 218.0 34 0.690262
```
Need to talk with Liliana about how to best analyze the output.
Would like to generate a figure like Figure 7 in https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
Do do so, will need to identify the taxon pairs, generate the same dataframes, identify the shared genes or the longevity-related genes (not sure the best way forward), and then determine the differences between those gene sets.
Once that's done, I have a function on my desktop that is the skeleton to generate the plot. graph_function.py
-----
Per an e-mail from Scott:
"Now that we know that the Python version and package versions all changed in a big way (if I recall correctly it looked like there was a major version change in Pandas and perhaps a few other packages), I would recommend that the density data is recalculated.
I know that's probably not what you would like to do, but I can't speak to the accuracy of results produced under different circumstances, so I have no idea if the values produced were also affected."
I'm going to re-run everything from the beginning.
I revised the template script to use nocona. It seemed to run fast enough last time.
New script -- tedensity_nocona_template.sh
```
#!/bin/bash
#SBATCH --job-name=<NAME>-density
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=31G
#--------------------------------------------------------
CODE_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/TE_Density
DATA_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/data
OUTPUT_DIR=/lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/results
GENOME=<NAME>
RUNTYPE=${GENOME}_tedensity
## This script uses a python program. You need to make sure python is loaded and ready to go by activating conda.
## and the tedensity virtual environment
. ~/conda/etc/profile.d/conda.sh
conda activate tedensity2
#ST Load Python
#ST module purge
#ST module load GCCcore/11.3.0 Python/3.10.4
#ST Source the Python packages that are version controlled
#ST source /mnt/research/edgerpat_lab/Scotty/venvs/Bat_TE/bin/activate
#ST Go to project directory
cd $CODE_DIR/scripts
# NOTE, running again after it timed out after making the revised annotation, it took a day to revise, that part is single threaded though
# Run the code
python $CODE_DIR/process_genome.py \
$DATA_DIR/Cleaned_${GENOME}_transcripts_DAR.tsv \
$DATA_DIR/Cleaned_${GENOME}_allTEs_binned_DAR.tsv \
$GENOME -c \
$CODE_DIR/config/production_run_config_edit.ini \
-n 8 \
-o $OUTPUT_DIR/$RUNTYPE
```
Get them running:
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts
cat ../../../list.txt | while read i; do sed "s/<NAME>/$i/g" tedensity_nocona_template.sh >${i}_tedensity_nocona.sh; sbatch ${i}_tedensity_nocona.sh; done
```
Looks like the queue is a bit full. Going to switch a bunch of these back to xlquanah.
```
LIST="rAff rFer rCfl rMic rSed rSin rTri rAeg sBil sLep tBra aJam"
scancel {12037966..12037977}
cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences/src/scripts
for i in $LIST; do sed "s/<NAME>/$i/g" tedensity_template.sh >${i}_tedensity.sh; sbatch ${i}_tedensity.sh; done
```
Those are all running.
When done, generate all of the downstream dataframes using:
```
cd /lustre/scratch/daray/intersect/longevity_all_bats/Bat_TE_Differences
TELIST="TOTAL LINE SINE DNA LTR RC"
TAXONLIST="aPal aSto dRot dEca hCyc hLar mMol mMyo mNig pDis pHas pKuh rAff rFer rCfl rMic rSed rSin rTri rAeg sBil sLep tBra aJam"
for TE in $TELIST; do
for TAXON in $TAXONLIST; do
python src/grd_${TE}.py data/Cleaned_${TAXON}_transcripts_DAR.tsv results/${TAXON}_tedensity/ "${TAXON}_(.*?).h5"
done
done
```
Ran this as a script, post_processing.sh, using individual grd files for each TE type.
```
#!/bin/bash
#SBATCH --job-name=post_density
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=1
#--------------------------------------------------------
```
#### obsolete
Now intersect the TEs with the influential genes.
```
actconda
conda activate bedtools
cd /lustre/scratch/daray/intersect/longevity_all_bats/beds
cat ../list.txt | while read i; do
bedtools window -a ${i}_influential.bed -b ${i}_te.bed -l 1000 -r 0 >${i}_window_influential_v_te_1000up.bed
cat ../mMol_influential_list_uniq.txt | while read j; do COUNT=$(grep -w $j ${i}_window_influential_v_te_1000up.bed | wc -l | awk '{print $1}'); echo $j $COUNT >> ${i}influential_genes_counts.txt
awk '{print $4}' ${i}_bat1k_transcripts_only.bed | sort > ${i}_transcripts_list.txt
done
done
bedtools window -a mMyo_influential.bed -b mMyo_te.bed -l 1000 -r 0 >mMyo_window_influential_v_te_1000up.bed
cat mMyo_influential_list.txt | while read i; do COUNT=$(grep -w $i mMyo_window_influential_v_te_1000up.bed | wc -l | awk '{print $1}'); echo $i $COUNT >> mMyo_influential_genes_counts.txt; done
awk '{print $4}' mMol_bat1k_transcripts_x1.bed | sort > mMol_transcripts_list.txt
awk '{print $4}' mMyo_bat1k_transcripts_x1.bed | sort > mMyo_transcripts_list.txt