Try   HackMD

Getting sequence data for Lee et al. 2015 Dorado paper

Problem with data in SRA

I'm terribly sorry for the mess on this dataset in NCBI (SRP063681). I had a lot of trouble getting this dataset in there at the time, it was my first one. I couldn't figure out how to put them in still multiplexed (in an effort to keep them closer to "raw"), and it ended up as it is :/

Which is that each individual sample file there holds all samples still multiplexed together. This page is an example of getting the data and demultiplexing them.

Conda environments

We will use conda to install what we'll use here to download the data and demultiplex it (see here if unfamiliar with conda):

# what we'll use to download the data
conda create -n sratools -c conda-forge -c bioconda -c defaults sra-tools=2.11.0

# what we'll use to demultiplex the data
conda create -y -n sabre -c conda-forge -c bioconda -c defaults sabre=1.000

Getting the data

We can download just one sample's reads files, as each one holds all as mentioned above. Here is a link to one's entry, SRX1242977, with the run accession SRR2398601.

We will use that run accession with sratools to download the data (if needed):

conda activate sratools

fasterq-dump --split-spot --split-files --progress SRR2398601

conda deactivate

After that's done, we have these two files, SRR2398601_1.fastq and SRR2398601_1.fastq, which again hold all samples together currently.

Demultiplexing the data

There is an explanation of what demultiplexing is and a slightly more detailed example here if wanted.

We can download a mapping file with some info on each sample with the following:

curl -L -o Lee-et-al-Dorado-barcode-info.tsv https://figshare.com/ndownloader/files/34391834

And use that to make the format wanted for the sabre program we are going to use to demultiplex the data. The program wants a file with 3 columns: barcode; forward read output file name; reverse read output file name. We can make that from the information file we just downloaded with the following:

awk -F $'\t' ' { OFS=FS } NR > 1 { print $2, $1"_R1.fastq", $1"_R2.fastq" } ' Lee-et-al-Dorado-barcode-info.tsv > barcodes-info-for-sabre.tsv

Which looks like this:

column -ts $'\t' barcodes-info-for-sabre.tsv
GAGTTGAG  Blank-4_R1.fastq  Blank-4_R2.fastq
GAGTTCTG  Blank-3_R1.fastq  Blank-3_R2.fastq
GAGTTCAC  Blank-2_R1.fastq  Blank-2_R2.fastq
GAGTGTGA  Blank-1_R1.fastq  Blank-1_R2.fastq
GAGTGTCT  BW-2_R1.fastq     BW-2_R2.fastq
GAGTGAGT  BW-1_R1.fastq     BW-1_R2.fastq
GAGTAGAC  R11-BF_R1.fastq   R11-BF_R2.fastq
GAGATGTG  R12_R1.fastq      R12_R2.fastq
GAGTACTC  R11_R1.fastq      R11_R2.fastq
GAGTCTGT  R10_R1.fastq      R10_R2.fastq
GAGTACAG  R9_R1.fastq       R9_R2.fastq
GAGTAGTG  R8_R1.fastq       R8_R2.fastq
GAGTGACA  R7_R1.fastq       R7_R2.fastq
GAGATCAG  R6_R1.fastq       R6_R2.fastq
GAGAGTGT  R5_R1.fastq       R5_R2.fastq
GAGTCTCA  R4_R1.fastq       R4_R2.fastq
GAGATGAC  R3_R1.fastq       R3_R2.fastq
GAGATCTC  R2_R1.fastq       R2_R2.fastq
GAGTCAGA  R1A_R1.fastq      R1A_R2.fastq
GAGTCACT  R1_R1.fastq       R1_R2.fastq

I realize having a sample called "R2" and "R1" is super-confusing with also having "R1" and "R2" as suffixes to signify forward and reverse reads 😬

And running sabre:

conda activate sabre

sabre pe -f SRR2398601_1.fastq -r SRR2398601_2.fastq -b barcodes-info-for-sabre.tsv -u no-bc-match_R1.fastq -w no-bc-match_R2.fastq

After less than a minute, the output from that will say something like this:

Total FastQ records: 11775306 (5887653 pairs)

FastQ records for barcode GAGTCACT: 259602 (129801 pairs)
FastQ records for barcode GAGTCAGA: 342418 (171209 pairs)
FastQ records for barcode GAGATCTC: 365866 (182933 pairs)
FastQ records for barcode GAGATGAC: 372970 (186485 pairs)
FastQ records for barcode GAGTCTCA: 397152 (198576 pairs)
FastQ records for barcode GAGAGTGT: 388638 (194319 pairs)
FastQ records for barcode GAGATCAG: 315056 (157528 pairs)
FastQ records for barcode GAGTGACA: 169528 (84764 pairs)
FastQ records for barcode GAGTAGTG: 262056 (131028 pairs)
FastQ records for barcode GAGTACAG: 186218 (93109 pairs)
FastQ records for barcode GAGTCTGT: 243564 (121782 pairs)
FastQ records for barcode GAGTACTC: 191306 (95653 pairs)
FastQ records for barcode GAGATGTG: 332526 (166263 pairs)
FastQ records for barcode GAGTAGAC: 190388 (95194 pairs)
FastQ records for barcode GAGTGAGT: 47606 (23803 pairs)
FastQ records for barcode GAGTGTCT: 127212 (63606 pairs)
FastQ records for barcode GAGTGTGA: 33436 (16718 pairs)
FastQ records for barcode GAGTTCAC: 12258 (6129 pairs)
FastQ records for barcode GAGTTCTG: 10546 (5273 pairs)
FastQ records for barcode GAGTTGAG: 10488 (5244 pairs)

FastQ records with no barcode match: 7516472 (3758236 pairs)

Which says of about 6 million initial read-pairs, about 3.7 million had no barcode match. That's totally okay, as there were other samples mixed in with this run that were not part of this dataset. The numbers recovered above for each sample are about right.

Now all samples are demlutiplexed, and we could get rid of the unmatched-reads files:

rm no-bc-match_R?.fastq

More sample info

This dataset (a subset verison) is used in my dada2 amplicon tutorial here, so that might be of interest or help to someone working with this dataset again.

Here is how we can quickly download a table with the only other information I had on the samples:

curl -L -o Lee-et-al-Dorado-sample-info.tsv https://figshare.com/ndownloader/files/34391960

Which looks like this:

column -ts $'\t' Lee-et-al-Dorado-sample-info.tsv
sample  temp  type     char       color
BW1     2.0   water    water      blue
BW2     2.0   water    water      blue
R10     13.7  rock     glassy     black
R11BF   7.3   biofilm  biofilm    darkgreen
R11     7.3   rock     glassy     black
R12     NA    rock     altered    chocolate4
R1A     8.6   rock     altered    chocolate4
R1      8.6   rock     altered    chocolate4
R2      8.6   rock     altered    chocolate4
R3      12.7  rock     altered    chocolate4
R4      12.7  rock     altered    chocolate4
R5      12.7  rock     altered    chocolate4
R6      12.7  rock     altered    chocolate4
R7      NA    rock     carbonate  darkkhaki
R8      13.5  rock     glassy     black
R9      13.7  rock     glassy     black