Removing rRNA example

# Removing rRNA example This is using [sortmerna](https://github.com/biocore/sortmerna), the user manual is [here](https://github.com/biocore/sortmerna/wiki/2.-User-manual-(todo)). This is done on a server (so linux), where conda can install the recent 4.2.0-0 version. On Mac, the latest version avaiable through conda is 2.1b, which won't work the same way. I'm not sure what would be the case if relegated to a windows machine. Let me know if doing it on a server/linux is not an option for you, and I'll try to help out more if needed 🙂 ## Installing sortmerna ```bash conda create -n sortmerna -c conda-forge -c bioconda -c defaults sortmerna conda activate sortmerna ``` Can see all options with `sortmerna -h`. ## Downloading and indexing the rRNA database They have several [here](https://github.com/biocore/sortmerna/tree/master/data/rRNA_databases). This code downloads the silva-bac-16s-id90.fasta one, which I think is appropriate for the Crocosphaera case: ``` curl -LO https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/silva-bac-16s-id90.fasta ``` ## Example use Downloading test reads and making a reverse to be a paired-end example: ```bash curl -L -o test-R1.fq https://ndownloader.figshare.com/files/26646107 cp test-R1.fq test-R2.fq ``` If we look at these the first read is rRNA, the second is not: ```bash cat test-R1.fq | sed 's/^/# /' # @NR_024570.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence # AGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGA # + # ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? # @not_rRNA NC_000913.3:c4394067-4392928 Escherichia coli str. K-12 substr. MG1655, complete genome # ATGTCAGAGCCCCTCGATCTCAATCAGTTAGCGCAAAAAATTAAACAGTGGGGGCTGGAACTGGGCTTTCAGCAGGTAGGTATTACCGATACCGATCTCAGCGAGTCCGAGCCCAAACTGCAAGCATGGCTGGACAAACAATACCACGGCG # + # ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? ``` We're doing this with unzipped ones to be able to look at them, but works the same with gzipped. Running: ```bash sortmerna --ref silva-bac-16s-id90.fasta \ --reads test-R1.fq --reads test-R2.fq \ --fastx --paired_out --out2 \ --aligned rRNA-reads --other non-rRNA-reads ``` * `--ref` – points to the ref fasta * `--reads` – specify the input reads, two of them if paired reads in separate files like above * `--fastx` – says to output as fasta or fastq (depending on input) * `--paired_out` – says to keep pairs together (if one matches, then count both as matching db) * `--out2` – says to write output read pairs in separate files for forward and reverse reads * `--aligned` – output prefix for those reads that match the database * `--other` – output prefix for those reads that don't match the database, our wanted non-rRNA reads And if we look at our test outputs, the non-rRNA read files have the correct one: ```bash cat non-rRNA-reads_fwd.fq | sed 's/^/# /' # @not_rRNA NC_000913.3:c4394067-4392928 Escherichia coli str. K-12 substr. MG1655, complete genome # ATGTCAGAGCCCCTCGATCTCAATCAGTTAGCGCAAAAAATTAAACAGTGGGGGCTGGAACTGGGCTTTCAGCAGGTAGGTATTACCGATACCGATCTCAGCGAGTCCGAGCCCAAACTGCAAGCATGGCTGGACAAACAATACCACGGCG # + # ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? ``` ```bash cat non-rRNA-reads_rev.fq | sed 's/^/# /' # @not_rRNA NC_000913.3:c4394067-4392928 Escherichia coli str. K-12 substr. MG1655, complete genome # ATGTCAGAGCCCCTCGATCTCAATCAGTTAGCGCAAAAAATTAAACAGTGGGGGCTGGAACTGGGCTTTCAGCAGGTAGGTATTACCGATACCGATCTCAGCGAGTCCGAGCCCAAACTGCAAGCATGGCTGGACAAACAATACCACGGCG # + # ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? ``` ---