Running simpleaf using an inde

# Running simpleaf using a decoy index with extra sequences Building an index with decoy is supported from piscem v0.7.0 onward [`piscem`](https://github.com/COMBINE-lab/piscem/releases/tag/v0.7.0). However, due to a build system runner timeout currently imposed by bioconda, this release is not available via bioconda yet. However, we have pre-compiled executables for linux, OSX (intel) and OSX (arm) that you can get from the release page. First we need to download the latest `piscem` version and provide it to `simpleaf`. In the following we downloaded the version for OS `x86_64` (linux). If you use an apple computer or you want to compile by yourself, try other versions from [here](https://github.com/COMBINE-lab/piscem/releases/tag/v0.7.0). ## Fetch the latest piscem version ```bash= WORKDIR="$PWD/simpleaf_with_decoy" mkdir -p $WORKDIR cd $WORKDIR wget https://github.com/COMBINE-lab/piscem/releases/download/v0.7.0/piscem-x86_64-unknown-linux-gnu.tar.gz tar xzvf piscem-x86_64-unknown-linux-gnu.tar.gz PISCEM="$WORKDIR/piscem-x86_64-unknown-linux-gnu.tar.gz/piscem" $PISCEM --help ``` ## Set up simpleaf Then, we build a conda environment for `simpleaf` and specify that we want to use the `piscem` executable we downloaded. If you are using an existing conda env, please make sure that your version of `simpleaf` is $\ge 1.5.0$. ```bash= conda create -n simpleaf -y -c bioconda -c conda-forge simpleaf export ALEVIN_FRY_HOME="$WORKDIR/af_home" simpleaf set-paths --piscem $PISCEM ``` ## Prepare extra sequences to be included in the index `simpleaf index` accepts extra sequences and includes them in the index. There are two related arguments, `--spliced` and `--unspliced`, which take the extra sequences to be included in the index with an spliced and unspliced splicing status, respectively. One example of the `extra_spliced.fa` file is the following, where the MT-RNR1 sequence was obtrained from [UCSC](https://genome.cse.ucsc.edu/cgi-bin/hgGene?hgsid=1830735418_ASgmnXDjUw2EVXaFAVLesbxE8Gw8&hgg_do_getMrnaSeq=1&hgg_gene=ENST00000389680.2). Please make sure to name the gene accordingly: If you use the `gene_id_to_name.tsv` file generated by `simpleaf index` for converting gene IDs to names, then the name used in the following example, MT-RNR1, will work, because in that file, both the gene ID and name will be `MT-RNR1`. If you have your customized code for doing the conversion, you might want to use its gene ID instead of name because usually genes are named by their gene ID in simpleaf's results. ``` $ cat extra_spliced.fa >MT-RNR1 AATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCC CGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGC AGCAATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAA CCTTTAGCAATAAACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCG TGCCAGCCACCGCGGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTT TTAGATCACCCCCTCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTG ACACAAAATAGACTACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCA AACTGGGATTAGATACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAA CTGCTCGCCAGAACACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCAT ATCCCTCTAGAGGAGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTG CTCAGCCTATATACCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAG TACCCACGTAAAGACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACA TTTTCTACCCCAGAAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTA GCAGTAAACTAAGAGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCC CGTCACCCTCCTCAAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATAT AGAGGAGACAAGTCGTAACATGGTAAGTGTACTGGAAAGTGCACTTGGACGAAC ``` ## Build the index with decoy Here, we assume that you have your reference set in hands, including the genome FASTA file and the gene annotation GTF file. As for the extra sequences, - If you do not want to add extra seuquences, you just need to provide the reference set. - If you want to include extra spliced sequences in the index, you should have a FASTA file containing all extra spliced sequences. - If you want to include extra unspliced sequences, you can put them into another FASTA file named `extra_unspliced.fa`, just follows the same pattern as `extra_spliced.fa`. We recommend include all mitochondrial rNRAs in `extra_spliced.fa`. ```bash= GENOME="/path/to/genome.fa" GENE_ANN="/path/to/genes.gtf" EXTRA_S="/path/to/extra_spliced.fa" # delete if not applicable ``` ## Build decoy index with extra sequences After we set up the `piscem` executable and prepared reference files, we can call `simpleaf` to build the decoy index with the extra spliced sequences (MT-RNR1 in this case). The FASTA file containing extra spliced sequences should be passed to the `--spliced` argument. If you have an `extra_unspliced.fa` for some reason, you can pass it to the `--unspliced` argument. Usually, the decoy is generated from the genome, so we provide the genome FASTA file to `--decoy-paths`. If you have mulitple FASTA files to be used as the decoy, you can pass comma separated paths to `--decoy-paths` as well. ```bash= simpleaf index \ --output $WORKDIR/simpleaf_decoy_index_with_extra_spliced_seqs \ --fasta $GENOME \ --gtf $GENE_ANN \ --spliced $EXTRA_S \ --decoy-paths $GENOME \ --use-piscem \ --threads 16 INDEX="$WORKDIR/simpleaf_decoy_index_with_extra_spliced_seqs/index" REF="$WORKDIR/simpleaf_decoy_index_with_extra_spliced_seqs/ref" ``` The actual index can be found in `$INDEX`. We can find a `gene_id_to_name.tsv` file in `$REF`, which can be used for gene ID to name conversion. ## Process our data! Now we have the decoy index with extra sequences in hands. We can just use it to process our data. In the following we show an example to run `simpleaf quant`. Feel free to modify the command as your wish, because the decoy related mappings are handled in `piscem`'s mapping stage and do not affect other arguments in`simpleaf quant`. ```bash= FASTQ_DIR="\path\to\fastq\dir" # Define filename pattern READS1_PAT="_R1_" READS2_PAT="_R2_" # Obtain sorted filenames READS1="$(find -L $FASTQ_DIR -name "*$READS1_PAT*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)" READS2="$(find -L $FASTQ_DIR -name "*$READS2_PAT*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)" simpleaf quant \ --chemistry 10xv3 \ --output simpleaf_quant \ --threads 16 \ --index $INDEX \ --reads1 $READS1 \ --reads2 $READS2 \ --unfiltered-pl \ --resolution cr-like ```