---
tags: Dorado paper
title: Getting sequence data for Lee et al. 2015 Dorado paper
---

# Getting sequence data for Lee et al. 2015 Dorado paper

[toc]

# Problem with data in SRA
I'm terribly sorry for the mess on this dataset in NCBI ([SRP063681](https://www.ncbi.nlm.nih.gov/sra/?term=SRP063681)). I had a lot of trouble getting this dataset in there at the time; it was my first one. I couldn't figure out how to submit the reads still multiplexed (in an effort to keep them closer to "raw"), and it ended up as it is :/

That is, each individual sample file there holds all samples still multiplexed together. This page is an example of getting the data and demultiplexing it.

# Conda environments
We will use conda to install the tools needed here to download the data and demultiplex it (see [here](https://astrobiomike.github.io/unix/conda-intro) if unfamiliar with conda):

```bash
# what we'll use to download the data
conda create -n sratools -c conda-forge -c bioconda -c defaults sra-tools=2.11.0

# what we'll use to demultiplex the data
conda create -y -n sabre -c conda-forge -c bioconda -c defaults sabre=1.000
```

# Getting the data
We only need to download one sample's read files, since each one holds all samples, as mentioned above. Here is a link to one such entry, [SRX1242977](https://www.ncbi.nlm.nih.gov/sra/SRX1242977), with the run accession [SRR2398601](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR2398601). We will use that run accession with [sratools](https://github.com/ncbi/sra-tools) to download the data (if needed):

```bash
conda activate sratools

fasterq-dump --split-spot --split-files --progress SRR2398601

conda deactivate
```

After that's done, we have these two files, `SRR2398601_1.fastq` and `SRR2398601_2.fastq`, which again currently hold all samples together.

# Demultiplexing the data

> There is an explanation of what demultiplexing is and a slightly more detailed example [here](https://astrobiomike.github.io/amplicon/demultiplexing) if wanted.

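
Before demultiplexing, it can be worth confirming that the forward and reverse files hold the same number of reads. The following is only a minimal sketch, not part of the original workflow: it assumes standard 4-line fastq records, and it runs on two tiny made-up files (`toy_1.fastq` and `toy_2.fastq`, hypothetical stand-ins for `SRR2398601_1.fastq` and `SRR2398601_2.fastq`) so that it works without the download. The `count_reads` helper is a name introduced here for illustration.

```bash
# Toy stand-ins for the two downloaded files (made-up reads), written to a
# temporary directory so nothing real is overwritten
tmpdir=$(mktemp -d)
printf '@read1\nGAGTCACTACGT\n+\nIIIIIIIIIIII\n@read2\nGAGTCAGAACGT\n+\nIIIIIIIIIIII\n' > "$tmpdir/toy_1.fastq"
printf '@read1\nTTGCAAGT\n+\nIIIIIIII\n@read2\nCCGTAAGT\n+\nIIIIIIII\n' > "$tmpdir/toy_2.fastq"

# Assumption: each record takes exactly 4 lines, so reads = lines / 4
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

fwd_reads=$(count_reads "$tmpdir/toy_1.fastq")
rev_reads=$(count_reads "$tmpdir/toy_2.fastq")

# Paired files should hold the same number of reads
if [ "$fwd_reads" -eq "$rev_reads" ]; then
    echo "OK: ${fwd_reads} read pairs"
else
    echo "Mismatch: ${fwd_reads} forward vs ${rev_reads} reverse reads" >&2
fi

# Clean up the toy files
rm -r "$tmpdir"
```

To run this on the real data, point `count_reads` at `SRR2398601_1.fastq` and `SRR2398601_2.fastq` instead. Both counts should agree, and the `sabre` report further down this page lists 5887653 pairs in total.
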
We can download a mapping file with some info on each sample with the following:

```bash
curl -L -o Lee-et-al-Dorado-barcode-info.tsv https://figshare.com/ndownloader/files/34391834
```

We can use that to make the input file that the [sabre](https://github.com/najoshi/sabre) program, which we are going to use to demultiplex the data, expects. The program wants a file with 3 columns: barcode; forward read output file name; reverse read output file name. We can make that from the information file we just downloaded with the following:

```bash
awk -F $'\t' ' { OFS=FS } NR > 1 { print $2, $1"_R1.fastq", $1"_R2.fastq" } ' \
    Lee-et-al-Dorado-barcode-info.tsv > barcodes-info-for-sabre.tsv
```

Which looks like this:

```bash
column -ts $'\t' barcodes-info-for-sabre.tsv
```

```
GAGTTGAG  Blank-4_R1.fastq  Blank-4_R2.fastq
GAGTTCTG  Blank-3_R1.fastq  Blank-3_R2.fastq
GAGTTCAC  Blank-2_R1.fastq  Blank-2_R2.fastq
GAGTGTGA  Blank-1_R1.fastq  Blank-1_R2.fastq
GAGTGTCT  BW-2_R1.fastq     BW-2_R2.fastq
GAGTGAGT  BW-1_R1.fastq     BW-1_R2.fastq
GAGTAGAC  R11-BF_R1.fastq   R11-BF_R2.fastq
GAGATGTG  R12_R1.fastq      R12_R2.fastq
GAGTACTC  R11_R1.fastq      R11_R2.fastq
GAGTCTGT  R10_R1.fastq      R10_R2.fastq
GAGTACAG  R9_R1.fastq       R9_R2.fastq
GAGTAGTG  R8_R1.fastq       R8_R2.fastq
GAGTGACA  R7_R1.fastq       R7_R2.fastq
GAGATCAG  R6_R1.fastq       R6_R2.fastq
GAGAGTGT  R5_R1.fastq       R5_R2.fastq
GAGTCTCA  R4_R1.fastq       R4_R2.fastq
GAGATGAC  R3_R1.fastq       R3_R2.fastq
GAGATCTC  R2_R1.fastq       R2_R2.fastq
GAGTCAGA  R1A_R1.fastq      R1A_R2.fastq
GAGTCACT  R1_R1.fastq       R1_R2.fastq
```

> I realize having samples called "R1" and "R2" is super-confusing when "R1" and "R2" are also the suffixes that signify forward and reverse reads 😬

And running `sabre`:

```bash
conda activate sabre

sabre pe -f SRR2398601_1.fastq -r SRR2398601_2.fastq -b barcodes-info-for-sabre.tsv \
         -u no-bc-match_R1.fastq -w no-bc-match_R2.fastq
```

After less than a minute, the output from that will say something like this:

```
Total FastQ records: 11775306 (5887653 pairs)

FastQ records for barcode GAGTCACT: 259602 (129801 pairs)
FastQ records for barcode GAGTCAGA: 342418 (171209 pairs)
FastQ records for barcode GAGATCTC: 365866 (182933 pairs)
FastQ records for barcode GAGATGAC: 372970 (186485 pairs)
FastQ records for barcode GAGTCTCA: 397152 (198576 pairs)
FastQ records for barcode GAGAGTGT: 388638 (194319 pairs)
FastQ records for barcode GAGATCAG: 315056 (157528 pairs)
FastQ records for barcode GAGTGACA: 169528 (84764 pairs)
FastQ records for barcode GAGTAGTG: 262056 (131028 pairs)
FastQ records for barcode GAGTACAG: 186218 (93109 pairs)
FastQ records for barcode GAGTCTGT: 243564 (121782 pairs)
FastQ records for barcode GAGTACTC: 191306 (95653 pairs)
FastQ records for barcode GAGATGTG: 332526 (166263 pairs)
FastQ records for barcode GAGTAGAC: 190388 (95194 pairs)
FastQ records for barcode GAGTGAGT: 47606 (23803 pairs)
FastQ records for barcode GAGTGTCT: 127212 (63606 pairs)
FastQ records for barcode GAGTGTGA: 33436 (16718 pairs)
FastQ records for barcode GAGTTCAC: 12258 (6129 pairs)
FastQ records for barcode GAGTTCTG: 10546 (5273 pairs)
FastQ records for barcode GAGTTGAG: 10488 (5244 pairs)

FastQ records with no barcode match: 7516472 (3758236 pairs)
```

Which says that of about 5.9 million initial read-pairs, about 3.8 million (roughly 64%) had no barcode match. That's totally okay, as there were other samples mixed in with this run that were not part of this dataset. The numbers recovered above for each sample are about right.

Now all samples are demultiplexed, and we could get rid of the unmatched-reads files:

```bash
rm no-bc-match_R?.fastq
```

# More sample info
A subset of this dataset is used in my dada2 amplicon tutorial [here](https://astrobiomike.github.io/amplicon/dada2_workflow_ex#the-data), so that might be of interest or help to someone working with this dataset again.

Here is how we can quickly download a table with the only other information I had on the samples:

```bash
curl -L -o Lee-et-al-Dorado-sample-info.tsv https://figshare.com/ndownloader/files/34391960
```

Which looks like this:

```bash
column -ts $'\t' Lee-et-al-Dorado-sample-info.tsv
```

```
sample  temp  type     char       color
BW1     2.0   water    water      blue
BW2     2.0   water    water      blue
R10     13.7  rock     glassy     black
R11BF   7.3   biofilm  biofilm    darkgreen
R11     7.3   rock     glassy     black
R12     NA    rock     altered    chocolate4
R1A     8.6   rock     altered    chocolate4
R1      8.6   rock     altered    chocolate4
R2      8.6   rock     altered    chocolate4
R3      12.7  rock     altered    chocolate4
R4      12.7  rock     altered    chocolate4
R5      12.7  rock     altered    chocolate4
R6      12.7  rock     altered    chocolate4
R7      NA    rock     carbonate  darkkhaki
R8      13.5  rock     glassy     black
R9      13.7  rock     glassy     black
```
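
If a quick summary of that table is wanted, `awk` can tally the samples per `type`. This is just a sketch under a couple of assumptions: the file is tab-delimited with a single header line, and the columns are in the order shown above. To stay runnable without the figshare download, it first recreates the table in a temporary file using the same values as the printout above.

```bash
# Recreate the table shown above in a temporary file (tab-delimited);
# printf reuses the 5-field format for each group of 5 arguments
info=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    sample temp type char color \
    BW1 2.0 water water blue \
    BW2 2.0 water water blue \
    R10 13.7 rock glassy black \
    R11BF 7.3 biofilm biofilm darkgreen \
    R11 7.3 rock glassy black \
    R12 NA rock altered chocolate4 \
    R1A 8.6 rock altered chocolate4 \
    R1 8.6 rock altered chocolate4 \
    R2 8.6 rock altered chocolate4 \
    R3 12.7 rock altered chocolate4 \
    R4 12.7 rock altered chocolate4 \
    R5 12.7 rock altered chocolate4 \
    R6 12.7 rock altered chocolate4 \
    R7 NA rock carbonate darkkhaki \
    R8 13.5 rock glassy black \
    R9 13.7 rock glassy black \
    > "$info"

# Count samples per "type" (column 3), skipping the header line
type_counts=$(awk -F '\t' 'NR > 1 { n[$3]++ } END { for (t in n) print t "\t" n[t] }' "$info" | sort)
echo "$type_counts"

# Clean up the temporary copy
rm "$info"
```

For the table above this reports 1 biofilm, 13 rock, and 2 water samples. With the real download, `"$info"` would simply be replaced by `Lee-et-al-Dorado-sample-info.tsv`.
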