
# SalmonTE*: a modified version for locus-specific expression analysis of transposable elements

This is inspired by the paper by Schwarz et al. (2021). SalmonTE, which was designed to quantify TE expression at the family level, can easily be repurposed to count TEs at the locus level instead. According to the paper, it works fairly well, even better than tools actually designed for locus-level quantification. A possible caveat is that the comparison was done using simulated data: since real data are noisier, the SalmonTE* method might not work as well on real reads.

The main idea is to run SalmonTE normally, but instead of feeding it a family index, you give it a locus-specific index built from a repeat reference library. This library is created from the reference genome and the .align file generated by RepeatMasker.

After installing RepeatMasker, I ran:

```
RepeatMasker -a -species "Drosophila melanogaster" /home/fgaudilliere/beegfs-data/data/assemblages/dmgoth101_assembl.fasta
```

The -a option tells RepeatMasker to output a .align file. I then used the alignToFasta.sh helper script written by Schwarz et al. to create a fasta file.
To get this to work, I had to clone the repository that hosts the scripts, install bedtools and generate the .bed file the script needs (it tries to generate this file itself, but apparently fails to do so correctly):

```
sudo apt-get install bedtools

awk 'BEGIN{OFS="\t"}{if(NR>3) {if($9=="C"){strand="-"}else{strand="+"};print $5,$6-1,$7,$10,".",strand}}' \
    dmgoth101_assembl.fasta.out > dmgoth101_assembl.repeatAnnotation.bed

git clone https://github.com/Hoffmann-Lab/TEdetectionEvaluation
cd /home/fgaudilliere/beegfs-data/TEdetectionEvaluation/helper-scripts
./alignToFasta.sh \
    /home/fgaudilliere/beegfs-data/data/assemblages/dmgoth101_assembl.fasta.align \
    /home/fgaudilliere/beegfs-data/data/assemblages/dmgoth101_assembl.fasta
```

I then used that file to create the SalmonTE index (using the basic salmon command, as indicated in the paper). Be careful to remove any dots from fasta file names (apart from the extension), otherwise SalmonTE does not recognize them as fasta files.

```
salmon index --transcripts dmgoth101_assembl_repeats_formatted.fa -i dm_101_locus_specific --type quasi --kmerLen 31
```

To get an index where each copy of a transposon has a distinct identifier, I first reformatted the assembl_repeats.fa fasta file using the following Python function. Without this step, the resulting reference does not work in a copy-specific way (I tried). Apparently, if the different copies do not have distinct identifiers (and they do not in the assembl_repeats.fa file generated by default by the Schwarz et al. scripts), salmon pools them all when generating the reference. In this formatting step, I also remove short repeats of the (NNN)n format and N-rich repeated sequences, because those are not TEs but are still listed in the assembl_repeats.fa file.
```
import os

os.chdir('/home/fgaudilliere/beegfs-data/data/assemblages/')


def format_fasta_file(fasta_file_name, output_name='formatted_fasta_file.txt'):
    # Read the whole file; each sequence is expected on the single line after its header
    with open(fasta_file_name, 'r') as fasta_file:
        lines = fasta_file.read().splitlines()

    dic_copy_indexing = {}  # repeat name -> number of the next copy to write
    with open(output_name, 'w') as output_file:
        for j, line in enumerate(lines):
            # Process header lines only, skipping simple repeats '(NNN)n' and 'N-rich' regions
            if line.startswith('>') and ')n' not in line and '-rich' not in line:
                name = line[1:]
                copy_nb = dic_copy_indexing.get(name, 1)
                # Append a zero-padded copy number so every copy gets a distinct identifier
                output_file.write('>' + name + '_' + str(copy_nb).zfill(5) + '\n')
                output_file.write(lines[j + 1] + '\n')
                dic_copy_indexing[name] = copy_nb + 1


format_fasta_file('dmgoth101_assembl_repeats.fa', output_name='dmgoth101_assembl_repeats_formatted.fa')
```

Finally, you can run SalmonTE using this new index:

```
SalmonTE.py quant --reference=dm_101_locus_specific --outpath=test_output_locus_specific --num_threads=2 --exprtype=count test_data
```

The files fed to SalmonTE should not contain any dots in their names (apart from the extension), otherwise it will not recognize them as fasta.gz files.
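The two pitfalls mentioned above (copies sharing an identifier, and dots in file names) can be checked before building the index or launching a run. The following is only a minimal sketch, not part of SalmonTE or of the Schwarz et al. scripts: the function names `duplicated_headers` and `has_extra_dots` are hypothetical, and the list of extensions is an assumption that should be adjusted to what your SalmonTE version accepts.

```python
from collections import Counter
from pathlib import Path

# Assumed list of accepted extensions (not taken from SalmonTE itself); adjust as needed.
KNOWN_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fasta.gz", ".fa.gz",
                    ".fastq", ".fq", ".fasta", ".fa")


def duplicated_headers(fasta_path):
    """Return, sorted, the FASTA identifiers that appear more than once."""
    counts = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                counts[line[1:].strip()] += 1
    return sorted(name for name, n in counts.items() if n > 1)


def has_extra_dots(file_name):
    """Return True if the file name contains a dot outside its extension."""
    name = Path(file_name).name
    for ext in KNOWN_EXTENSIONS:
        if name.endswith(ext):
            # Look for dots in what remains once the known extension is removed
            return "." in name[: -len(ext)]
    # Unknown extension: treat anything with more than one dot as suspicious
    return name.count(".") > 1
```

For example, `duplicated_headers('dmgoth101_assembl_repeats_formatted.fa')` should return an empty list if the formatting step worked, and `has_extra_dots` can be applied to every file in the input directory before calling `SalmonTE.py quant`.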