I'm terribly sorry for the mess on this dataset in NCBI (SRP063681). I had a lot of trouble getting this dataset in there at the time, as it was my first one. I couldn't figure out how to put them in still multiplexed (in an effort to keep them closer to "raw"), and it ended up as it is :/
Which is that each individual sample file there holds all samples still multiplexed together. This page is an example of getting the data and demultiplexing them.
We will use conda to install what we'll use here to download the data and demultiplex it (see here if unfamiliar with conda):
We can download just one sample's read files, as each one holds all of the samples, as mentioned above. Here is a link to one's entry, SRX1242977, with the run accession SRR2398601.
We will use that run accession with sratools to download the data (if needed):
After that's done, we have these two files, SRR2398601_1.fastq and SRR2398601_2.fastq, which again currently hold all samples together.
There is an explanation of what demultiplexing is and a slightly more detailed example here if wanted.
We can download a mapping file with some info on each sample with the following:
And use that to make the format wanted for the sabre program we are going to use to demultiplex the data. The program wants a file with 3 columns: barcode; forward read output file name; reverse read output file name. We can make that from the information file we just downloaded with the following:
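As a minimal sketch of that reshaping step, here is one way to do it with awk. The layout of the mapping file is an assumption here: a tab-delimited table with a header line, the sample name in column 1, and the barcode in column 2. The file names `toy_map.tsv` and `toy_barcodes.txt`, and the barcodes themselves, are made up so the example runs on its own.

```shell
# Toy stand-in for the mapping file (hypothetical names and made-up barcodes).
printf 'sample\tbarcode\nB1\tAACCGGTT\nR1\tTTGGCCAA\n' > toy_map.tsv

# Skip the header line, then write the 3 columns sabre expects:
# barcode, forward-read output file name, reverse-read output file name.
awk 'BEGIN { FS = OFS = "\t" }
     NR > 1 { print $2, $1 "_R1.fastq", $1 "_R2.fastq" }' toy_map.tsv > toy_barcodes.txt

cat toy_barcodes.txt
```

If the real table has its columns in a different order, only the `$1` and `$2` field references would need to change.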
Which looks like this:
I realize having samples called "R1" and "R2" is super-confusing when "R1" and "R2" are also the suffixes that signify forward and reverse reads 😬
And running sabre:
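As a rough sketch of what a paired-end sabre call can look like: the barcode file name `barcodes.txt` and the unmatched-read output names `unmatched_R1.fastq` and `unmatched_R2.fastq` are placeholders chosen for this example, not necessarily the names used originally. The checks around the call are only there so the snippet does nothing harmful if sabre or the input files are missing.

```shell
# Input read files from the download step, plus an assumed barcode file name.
fwd="SRR2398601_1.fastq"
rev="SRR2398601_2.fastq"
barcodes="barcodes.txt"   # placeholder name for the 3-column barcode file

if ! command -v sabre >/dev/null 2>&1; then
    echo "sabre not found on PATH; install it first" >&2
elif [ ! -s "$fwd" ] || [ ! -s "$rev" ] || [ ! -s "$barcodes" ]; then
    echo "one or more input files are missing or empty" >&2
else
    # pe = paired-end mode; -f/-r are the forward/reverse inputs,
    # -b the barcode file, -u/-w where reads with no barcode match go.
    sabre pe -f "$fwd" -r "$rev" -b "$barcodes" \
          -u unmatched_R1.fastq -w unmatched_R2.fastq
fi
```

The per-sample output files are named by the second and third columns of the barcode file, so they do not appear on the command line.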
After less than a minute, the output from that will say something like this:
Which says of about 6 million initial read-pairs, about 3.7 million had no barcode match. That's totally okay, as there were other samples mixed in with this run that were not part of this dataset. The numbers recovered above for each sample are about right.
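If we want to double-check the per-sample numbers ourselves, one option is to count the records in each output file. This is a small sketch that assumes standard 4-line FASTQ records (no wrapped sequence lines); the toy file and the `count_reads` helper are made up for illustration.

```shell
# Toy FASTQ with two records so the sketch runs on its own (made-up reads).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > toy_R1.fastq

# Number of reads = number of lines / 4 (assumes unwrapped 4-line records).
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Print file name and read count, tab-separated.
for f in toy_R1.fastq; do
    printf '%s\t%s\n' "$f" "$(count_reads "$f")"
done
```

On the real data, the loop could run over the forward-read files (for example `*_R1.fastq`, if the outputs were named that way), since each forward file holds one read per pair.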
Now all samples are demultiplexed, and we could get rid of the unmatched-reads files:
This dataset (a subset version) is used in my dada2 amplicon tutorial here, so that might be of interest or help to someone working with this dataset again.
Here is how we can quickly download a table with the only other information I had on the samples:
Which looks like this: