---
tags: Dorado paper
title: Getting sequence data for Lee et al. 2015 Dorado paper
---
# Getting sequence data for Lee et al. 2015 Dorado paper
[toc]
# Problem with data in SRA
I'm terribly sorry for the mess this dataset is in at NCBI ([SRP063681](https://www.ncbi.nlm.nih.gov/sra/?term=SRP063681)). It was my first submission, and I had a lot of trouble getting it in there at the time. I couldn't figure out how to put the reads in still multiplexed (in an effort to keep them closer to "raw"), and it ended up as it is :/
The result is that each individual sample's file there holds all samples, still multiplexed together. This page is an example of getting the data and demultiplexing it.
# Conda environments
We will use conda to install the programs needed to download and demultiplex the data (see [here](https://astrobiomike.github.io/unix/conda-intro) if unfamiliar with conda):
```bash
# what we'll use to download the data
conda create -y -n sratools -c conda-forge -c bioconda -c defaults sra-tools=2.11.0
# what we'll use to demultiplex the data
conda create -y -n sabre -c conda-forge -c bioconda -c defaults sabre=1.000
```
# Getting the data
We only need to download one sample's read files, since each one holds all samples, as mentioned above. Here is a link to one sample's entry, [SRX1242977](https://www.ncbi.nlm.nih.gov/sra/SRX1242977), with the run accession [SRR2398601](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR2398601).
We will use that run accession with [sra-tools](https://github.com/ncbi/sra-tools) to download the data (if needed):
```bash
conda activate sratools
fasterq-dump --split-spot --split-files --progress SRR2398601
conda deactivate
```
After that's done, we have these two files, `SRR2398601_1.fastq` and `SRR2398601_2.fastq`, which again currently hold all samples together.
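Since these are paired reads, the forward and reverse files should hold the same number of records. Here is a minimal sketch of a quick check, assuming standard 4-line FASTQ records (`check_fastq_pairing` is just a helper name made up for this page, not part of sra-tools):

```bash
# hypothetical helper: prints the record count of each file and
# returns non-zero if the two counts differ (assumes 4 lines per record)
check_fastq_pairing() {
    fwd_n=$(awk 'END { print NR / 4 }' "$1")
    rev_n=$(awk 'END { print NR / 4 }' "$2")
    printf "forward: %s records\nreverse: %s records\n" "$fwd_n" "$rev_n"
    [ "$fwd_n" = "$rev_n" ]
}

# only runs if both downloaded files are in the current directory
if [ -f SRR2398601_1.fastq ] && [ -f SRR2398601_2.fastq ]; then
    check_fastq_pairing SRR2398601_1.fastq SRR2398601_2.fastq || echo "record counts differ"
fi
```

Going by the `sabre` summary shown further down this page (5887653 pairs), both counts should come out to 5887653 if the download is complete.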
# Demultiplexing the data
> There is an explanation of what demultiplexing is, and a slightly more detailed example, [here](https://astrobiomike.github.io/amplicon/demultiplexing) if wanted.

We can download a mapping file with some info on each sample with the following:
```bash
curl -L -o Lee-et-al-Dorado-barcode-info.tsv https://figshare.com/ndownloader/files/34391834
```
We can then use that to make the input format expected by the [sabre](https://github.com/najoshi/sabre) program, which we are going to use to demultiplex the data. The program wants a file with 3 columns: barcode; forward read output file name; reverse read output file name. We can make that from the information file we just downloaded with the following:
```bash
awk -F $'\t' 'BEGIN { OFS = FS } NR > 1 { print $2, $1"_R1.fastq", $1"_R2.fastq" }' Lee-et-al-Dorado-barcode-info.tsv > barcodes-info-for-sabre.tsv
```
Which looks like this:
```bash
column -ts $'\t' barcodes-info-for-sabre.tsv
```
```
GAGTTGAG  Blank-4_R1.fastq  Blank-4_R2.fastq
GAGTTCTG  Blank-3_R1.fastq  Blank-3_R2.fastq
GAGTTCAC  Blank-2_R1.fastq  Blank-2_R2.fastq
GAGTGTGA  Blank-1_R1.fastq  Blank-1_R2.fastq
GAGTGTCT  BW-2_R1.fastq     BW-2_R2.fastq
GAGTGAGT  BW-1_R1.fastq     BW-1_R2.fastq
GAGTAGAC  R11-BF_R1.fastq   R11-BF_R2.fastq
GAGATGTG  R12_R1.fastq      R12_R2.fastq
GAGTACTC  R11_R1.fastq      R11_R2.fastq
GAGTCTGT  R10_R1.fastq      R10_R2.fastq
GAGTACAG  R9_R1.fastq       R9_R2.fastq
GAGTAGTG  R8_R1.fastq       R8_R2.fastq
GAGTGACA  R7_R1.fastq       R7_R2.fastq
GAGATCAG  R6_R1.fastq       R6_R2.fastq
GAGAGTGT  R5_R1.fastq       R5_R2.fastq
GAGTCTCA  R4_R1.fastq       R4_R2.fastq
GAGATGAC  R3_R1.fastq       R3_R2.fastq
GAGATCTC  R2_R1.fastq       R2_R2.fastq
GAGTCAGA  R1A_R1.fastq      R1A_R2.fastq
GAGTCACT  R1_R1.fastq       R1_R2.fastq
```
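Before handing that file to `sabre`, it can be worth confirming that every line has exactly 3 columns and that no barcode appears twice. Below is a rough sketch of such a check, assuming the file is tab-delimited as the `awk` command above writes it (`check_sabre_barcodes` is a made-up helper name, not something `sabre` provides):

```bash
# hypothetical helper: reports lines without exactly 3 tab-separated fields
# and any repeated barcodes; returns non-zero if either problem is found
check_sabre_barcodes() {
    awk -F '\t' '
        NF != 3    { printf "line %d has %d fields\n", NR, NF; bad = 1 }
        seen[$1]++ { printf "barcode %s is repeated\n", $1; bad = 1 }
        END        { if (!bad) print "looks ok: " NR " barcodes"; exit bad + 0 }
    ' "$1"
}

# only runs if the barcode file made above is in the current directory
if [ -f barcodes-info-for-sabre.tsv ]; then
    check_sabre_barcodes barcodes-info-for-sabre.tsv
fi
```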
> I realize having samples called "R1" and "R2" is super-confusing when "R1" and "R2" are also the suffixes that signify forward and reverse reads 😬

And running `sabre`:
```bash
conda activate sabre
sabre pe -f SRR2398601_1.fastq -r SRR2398601_2.fastq -b barcodes-info-for-sabre.tsv -u no-bc-match_R1.fastq -w no-bc-match_R2.fastq
```
After less than a minute, the output from that will say something like this:
```
Total FastQ records: 11775306 (5887653 pairs)
FastQ records for barcode GAGTCACT: 259602 (129801 pairs)
FastQ records for barcode GAGTCAGA: 342418 (171209 pairs)
FastQ records for barcode GAGATCTC: 365866 (182933 pairs)
FastQ records for barcode GAGATGAC: 372970 (186485 pairs)
FastQ records for barcode GAGTCTCA: 397152 (198576 pairs)
FastQ records for barcode GAGAGTGT: 388638 (194319 pairs)
FastQ records for barcode GAGATCAG: 315056 (157528 pairs)
FastQ records for barcode GAGTGACA: 169528 (84764 pairs)
FastQ records for barcode GAGTAGTG: 262056 (131028 pairs)
FastQ records for barcode GAGTACAG: 186218 (93109 pairs)
FastQ records for barcode GAGTCTGT: 243564 (121782 pairs)
FastQ records for barcode GAGTACTC: 191306 (95653 pairs)
FastQ records for barcode GAGATGTG: 332526 (166263 pairs)
FastQ records for barcode GAGTAGAC: 190388 (95194 pairs)
FastQ records for barcode GAGTGAGT: 47606 (23803 pairs)
FastQ records for barcode GAGTGTCT: 127212 (63606 pairs)
FastQ records for barcode GAGTGTGA: 33436 (16718 pairs)
FastQ records for barcode GAGTTCAC: 12258 (6129 pairs)
FastQ records for barcode GAGTTCTG: 10546 (5273 pairs)
FastQ records for barcode GAGTTGAG: 10488 (5244 pairs)
FastQ records with no barcode match: 7516472 (3758236 pairs)
```
This says that of about 5.9 million initial read-pairs, about 3.8 million had no barcode match. That's totally okay, as there were other samples mixed in with this run that were not part of this dataset. The numbers recovered above for each sample are about right.
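To make that arithmetic explicit, here is a small sketch that pulls the total and unmatched pair counts out of a `sabre` summary and reports how many pairs matched a barcode. It assumes the summary lines look like the ones printed above; `summarize_sabre_log` is a hypothetical helper name, and the two lines fed to it here are copied from that output:

```bash
# hypothetical helper: reads a sabre summary on stdin and prints
# the number and percent of read-pairs that matched a barcode
summarize_sabre_log() {
    awk '
        # pull the number of pairs out of the "(N pairs)" part of a line
        function pairs(line) {
            sub(/.*\(/, "", line)
            sub(/ pairs\).*/, "", line)
            return line + 0
        }
        /^Total FastQ records/  { total = pairs($0) }
        /with no barcode match/ { unmatched = pairs($0) }
        END {
            if (total > 0) {
                matched = total - unmatched
                printf "%d of %d pairs matched a barcode (%.1f%%)\n", matched, total, 100 * matched / total
            }
        }
    '
}

# the total and no-match lines from the output above
printf '%s\n' \
    "Total FastQ records: 11775306 (5887653 pairs)" \
    "FastQ records with no barcode match: 7516472 (3758236 pairs)" |
    summarize_sabre_log
# prints: 2129417 of 5887653 pairs matched a barcode (36.2%)
```

That 2129417 is also what the 20 per-barcode pair counts above add up to, so nothing went missing between the input and the outputs.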
Now all samples are demultiplexed, and we can get rid of the unmatched-reads files:
```bash
rm no-bc-match_R?.fastq
```
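If we want to double-check the demultiplexed files against the counts `sabre` reported, something like the following sketch would list the number of read-pairs per sample. It assumes 4-line FASTQ records and the `_R1.fastq` naming used in the barcode file above (`count_pairs_per_sample` is a made-up helper name):

```bash
# hypothetical helper: prints sample name and read-pair count (tab-separated)
# for every *_R1.fastq file in the given directory
count_pairs_per_sample() {
    for fwd in "$1"/*_R1.fastq; do
        [ -f "$fwd" ] || continue    # skip if the glob matched nothing
        sample=$(basename "$fwd" _R1.fastq)
        printf "%s\t%s\n" "$sample" "$(awk 'END { print NR / 4 }' "$fwd")"
    done
}

# run on the current directory, where sabre wrote its output files
count_pairs_per_sample .
```

Only the forward files are counted here, on the assumption that `sabre` writes the forward and reverse reads of a pair together.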
# More sample info
This dataset (a subset version) is used in my dada2 amplicon tutorial [here](https://astrobiomike.github.io/amplicon/dada2_workflow_ex#the-data), so that might be of interest or help to someone working with this dataset again.
Here is how we can quickly download a table with the only other information I had on the samples:
```bash
curl -L -o Lee-et-al-Dorado-sample-info.tsv https://figshare.com/ndownloader/files/34391960
```
Which looks like this:
```bash
column -ts $'\t' Lee-et-al-Dorado-sample-info.tsv
```
```
sample  temp  type     char       color
BW1     2.0   water    water      blue
BW2     2.0   water    water      blue
R10     13.7  rock     glassy     black
R11BF   7.3   biofilm  biofilm    darkgreen
R11     7.3   rock     glassy     black
R12     NA    rock     altered    chocolate4
R1A     8.6   rock     altered    chocolate4
R1      8.6   rock     altered    chocolate4
R2      8.6   rock     altered    chocolate4
R3      12.7  rock     altered    chocolate4
R4      12.7  rock     altered    chocolate4
R5      12.7  rock     altered    chocolate4
R6      12.7  rock     altered    chocolate4
R7      NA    rock     carbonate  darkkhaki
R8      13.5  rock     glassy     black
R9      13.7  rock     glassy     black
```
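As a quick example of working with that table, here is a sketch that tallies how many samples there are of each type. It assumes the file is tab-delimited with a header line and that "type" is the 3rd column, as in the view above (`count_samples_by_type` is a made-up helper name):

```bash
# hypothetical helper: counts samples per value in the 3rd column ("type"),
# skipping the header line, and prints them sorted by type
count_samples_by_type() {
    awk -F '\t' 'NR > 1 { n[$3]++ } END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# only runs if the sample-info table is in the current directory
if [ -f Lee-et-al-Dorado-sample-info.tsv ]; then
    count_samples_by_type Lee-et-al-Dorado-sample-info.tsv
fi
```

Going by the table above, that should report 1 biofilm, 13 rock, and 2 water samples.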