# Removing rRNA example
This is using [sortmerna](https://github.com/biocore/sortmerna), the user manual is [here](https://github.com/biocore/sortmerna/wiki/2.-User-manual-(todo)). This is done on a server (so linux), where conda can install the recent 4.2.0-0 version. On Mac, the latest version avaiable through conda is 2.1b, which won't work the same way. I'm not sure what would be the case if relegated to a windows machine. Let me know if doing it on a server/linux is not an option for you, and I'll try to help out more if needed 🙂
## Installing sortmerna
```bash
conda create -n sortmerna -c conda-forge -c bioconda -c defaults sortmerna
conda activate sortmerna
```
Can see all options with `sortmerna -h`.
## Downloading and indexing the rRNA database
They have several [here](https://github.com/biocore/sortmerna/tree/master/data/rRNA_databases). This code downloads the silva-bac-16s-id90.fasta one, which I think is appropriate for the Crocosphaera case:
```
curl -LO https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/silva-bac-16s-id90.fasta
```
## Example use
Downloading test reads and making a reverse to be a paired-end example:
```bash
curl -L -o test-R1.fq https://ndownloader.figshare.com/files/26646107
cp test-R1.fq test-R2.fq
```
If we look at these the first read is rRNA, the second is not:
```bash
cat test-R1.fq | sed 's/^/# /'
# @NR_024570.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence
# AGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGA
# +
# ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
# @not_rRNA NC_000913.3:c4394067-4392928 Escherichia coli str. K-12 substr. MG1655, complete genome
# ATGTCAGAGCCCCTCGATCTCAATCAGTTAGCGCAAAAAATTAAACAGTGGGGGCTGGAACTGGGCTTTCAGCAGGTAGGTATTACCGATACCGATCTCAGCGAGTCCGAGCCCAAACTGCAAGCATGGCTGGACAAACAATACCACGGCG
# +
# ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
```
We're doing this with unzipped ones to be able to look at them, but works the same with gzipped.
Running:
```bash
sortmerna --ref silva-bac-16s-id90.fasta \
--reads test-R1.fq --reads test-R2.fq \
--fastx --paired_out --out2 \
--aligned rRNA-reads --other non-rRNA-reads
```
* `--ref` – points to the ref fasta
* `--reads` – specify the input reads, two of them if paired reads in separate files like above
* `--fastx` – says to output as fasta or fastq (depending on input)
* `--paired_out` – says to keep pairs together (if one matches, then count both as matching db)
* `--out2` – says to write output read pairs in separate files for forward and reverse reads
* `--aligned` – output prefix for those reads that match the database
* `--other` – output prefix for those reads that don't match the database, our wanted non-rRNA reads
And if we look at our test outputs, the non-rRNA read files have the correct one:
```bash
cat non-rRNA-reads_fwd.fq | sed 's/^/# /'
# @not_rRNA NC_000913.3:c4394067-4392928 Escherichia coli str. K-12 substr. MG1655, complete genome
# ATGTCAGAGCCCCTCGATCTCAATCAGTTAGCGCAAAAAATTAAACAGTGGGGGCTGGAACTGGGCTTTCAGCAGGTAGGTATTACCGATACCGATCTCAGCGAGTCCGAGCCCAAACTGCAAGCATGGCTGGACAAACAATACCACGGCG
# +
# ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
```
```bash
cat non-rRNA-reads_rev.fq | sed 's/^/# /'
# @not_rRNA NC_000913.3:c4394067-4392928 Escherichia coli str. K-12 substr. MG1655, complete genome
# ATGTCAGAGCCCCTCGATCTCAATCAGTTAGCGCAAAAAATTAAACAGTGGGGGCTGGAACTGGGCTTTCAGCAGGTAGGTATTACCGATACCGATCTCAGCGAGTCCGAGCCCAAACTGCAAGCATGGCTGGACAAACAATACCACGGCG
# +
# ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
```
---