---
tags: happy belly
title: Examples downloading fastq data from NCBI's SRA
---

> **Starter page for a happy belly tutorial on downloading fastq data from SRA.**

# Example downloading fastq data from NCBI's SRA

[toc]

---

# Environment

```bash
conda create -y -n ncbi-tools -c conda-forge -c bioconda -c defaults sra-tools=2.10.1 entrez-direct=13.9
conda create -y -n pysradb -c conda-forge -c bioconda -c defaults python=3.7 pysradb

# the sra-tools/entrez-direct examples below use the first environment
# (the pysradb environment comes into play in the GeneLab section further down)
conda activate ncbi-tools
```

---

# Examples

## Downloading one sample's fastq files

### Starting with "run accession" (SRR...)

To download things with `fasterq-dump` as shown here, we want a "run" accession, e.g., SRR6853380:

```bash
fasterq-dump --split-files --progress SRR6853380
```

Not detailing the programs here – see `fasterq-dump -h` for more info on the options available and those set above.

When that's done, it will print out something like:

```
spots read      : 6,900,587
reads read      : 13,801,174
reads written   : 13,801,174
```

And we have two files in our working directory that are each about 1.7GB:

```
1.7G SRR6853380_1.fastq
1.7G SRR6853380_2.fastq
```

---

### Starting with "experiment accession" (SRX...)

We first need to get the run accession. We can find that using `entrez-direct`. This toolset is powerful and complicated – more examples can be found on the [EDirect page](https://astrobiomike.github.io/unix/ncbi_eutils).

We can see the xml format of some summary information with this alone:

```bash
esearch -db sra -query SRX3808505 | esummary
```

```
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DocumentSummarySet PUBLIC "-//NLM//DTD esummary sra 20130524//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130524/esummary_sra.dtd">
<DocumentSummarySet status="OK">
<DocumentSummary>
  <Id>5245643</Id>
  <ExpXml>
    <Summary><Title>Metagenome</Title><Platform instrument_model="Illumina HiSeq 4000">ILLUMINA</Platform><Statistics total_runs="1" total_spots="6900587" total_bases="1264594030" total_size="707674318" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA667369" center_name="Jet Propulsion Laboratory, California Institute of" contact_name="Kasthuri Venkateswaran" lab_name="Biotechnology and Planetary Protection Group"/><Experiment acc="SRX3808505" ver="1" status="public" name="Metagenome"/><Study acc="SRP135937" name="ISS Metagenomes"/><Organism taxid="1256227" ScientificName="indoor metagenome"/><Sample acc="SRS2065862" name=""/><Instrument ILLUMINA="Illumina HiSeq 4000"/><Library_descriptor><LIBRARY_NAME>IIIF1SWP</LIBRARY_NAME><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>METAGENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION><LIBRARY_LAYOUT> <PAIRED/> </LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA438545</Bioproject><Biosample>SAMN05581714</Biosample>
  </ExpXml>
  <Runs>
    <Run acc="SRR6853380" total_spots="6900587" total_bases="1264594030" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>
  </Runs>
  <ExtLinks></ExtLinks>
  <CreateDate>2018/03/18</CreateDate>
  <UpdateDate>2018/03/18</UpdateDate>
</DocumentSummary>
</DocumentSummarySet>
```

And we can get some specific info, like here returning the input experiment accession, run accession, and the platform it was sequenced on, like so:

```bash
esearch -db sra -query SRX3808505 | esummary | xtract -pattern DocumentSummary -element Experiment@acc,Run@acc,Platform@instrument_model
```

```
SRX3808505    SRR6853380    Illumina HiSeq 4000
```

Then we could download with that run accession (SRR6853380 in this case) the same way we did above with `fasterq-dump`.
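Just to make that concrete, it would be the same command as in the first section above, shown again here:

```bash
fasterq-dump --split-files --progress SRR6853380
```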
---

## Downloading fastq files based on a bioproject (PRJNA...)

### Getting info on all runs in the bioproject

We still need to get to the run accessions; here we'll start by searching for the bioproject. This alone tells us there are 42 hits (in the "Count" field):

```bash
esearch -db sra -query PRJNA438545
```

```
<ENTREZ_DIRECT>
  <Db>sra</Db>
  <WebEnv>MCID_60e24c20c687287612105432</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>42</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
```

We can look at it by letting it print some of the `esummary` output to the terminal:

```bash
esearch -db sra -query PRJNA438545 | esummary | head -n 20
```

```
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DocumentSummarySet PUBLIC "-//NLM//DTD esummary sra 20130524//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130524/esummary_sra.dtd">
<DocumentSummarySet status="OK">
<DocumentSummary>
  <Id>5245681</Id>
  <ExpXml>
    <Summary><Title>Metagenome</Title><Platform instrument_model="Illumina HiSeq 4000">ILLUMINA</Platform><Statistics total_runs="1" total_spots="13999788" total_bases="2827957176" total_size="909033108" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA667369" center_name="Jet Propulsion Laboratory, California Institute of" contact_name="Kasthuri Venkateswaran" lab_name="Biotechnology and Planetary Protection Group"/><Experiment acc="SRX3808543" ver="1" status="public" name="Metagenome"/><Study acc="SRP135937" name="ISS Metagenomes"/><Organism taxid="1256227" ScientificName="indoor metagenome"/><Sample acc="SRS1630136" name=""/><Instrument ILLUMINA="Illumina HiSeq 4000"/><Library_descriptor><LIBRARY_NAME>IIF4SWP</LIBRARY_NAME><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>METAGENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION><LIBRARY_LAYOUT> <PAIRED/> </LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA438545</Bioproject><Biosample>SAMN05581708</Biosample>
  </ExpXml>
  <Runs>
    <Run acc="SRR6853342" total_spots="13999788" total_bases="2827957176" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>
  </Runs>
  <ExtLinks></ExtLinks>
  <CreateDate>2018/03/18</CreateDate>
  <UpdateDate>2018/03/18</UpdateDate>
</DocumentSummary>
<DocumentSummary>
  <Id>5245680</Id>
  <ExpXml>
    <Summary><Title>Metagenome</Title><Platform instrument_model="Illumina HiSeq 4000">ILLUMINA</Platform><Statistics total_runs="1" total_spots="14128994" total_bases="2854056788" total_size="937349516" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA667369" center_name="Jet Propulsion Laboratory, California Institute of" contact_name="Kasthuri Venkateswaran" lab_name="Biotechnology and Planetary Protection Group"/><Experiment acc="SRX3808542" ver="1" status="public" name="Metagenome"/><Study acc="SRP135937" name="ISS Metagenomes"/><Organism taxid="1256227" ScientificName="indoor metagenome"/><Sample acc="SRS1630137" name=""/><Instrument ILLUMINA="Illumina HiSeq 4000"/><Library_descriptor><LIBRARY_NAME>IIF5SWP</LIBRARY_NAME><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>METAGENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION><LIBRARY_LAYOUT> <PAIRED/> </LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA438545</Bioproject><Biosample>SAMN05581709</Biosample>
  </ExpXml>
  <Runs>
    <Run acc="SRR6853343" total_spots="14128994" total_bases="2854056788" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>
  </Runs>
  <ExtLinks></ExtLinks>
  <CreateDate>2018/03/18</CreateDate>
  <UpdateDate>2018/03/18</UpdateDate>
</DocumentSummary>
```
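As a quick sanity check, we could also pull out just the run accessions and count them – this should match the "Count" field above (42 here), since each of these experiments holds a single run:

```bash
esearch -db sra -query PRJNA438545 | esummary | xtract -pattern DocumentSummary -element Run@acc | wc -l
    # should print 42 here
```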
And from there we can pull out all the run accessions associated with the BioProject we are searching for (along with the same additional information we pulled above), and write it to a file:

```bash
esearch -db sra -query PRJNA438545 | esummary | xtract -pattern DocumentSummary -element Experiment@acc,Run@acc,Platform@instrument_model > PRJNA438545-info.tsv

head PRJNA438545-info.tsv
```

```
SRX3808543    SRR6853342    Illumina HiSeq 4000
SRX3808542    SRR6853343    Illumina HiSeq 4000
SRX3808541    SRR6853344    Illumina HiSeq 4000
SRX3808540    SRR6853345    Illumina HiSeq 4000
SRX3808539    SRR6853346    Illumina HiSeq 4000
SRX3808538    SRR6853347    Illumina HiSeq 4000
SRX3808537    SRR6853348    Illumina HiSeq 4000
SRX3808536    SRR6853349    Illumina HiSeq 4000
SRX3808535    SRR6853350    Illumina HiSeq 4000
SRX3808534    SRR6853351    Illumina HiSeq 4000
```

---

### Downloading with a loop

If we wanted to download them all, one way we could do that is with a loop like the one below. First, shortening the table to just 2 entries for the example here:

```bash
head -n 2 PRJNA438545-info.tsv > PRJNA438545-info-subset.tsv
```

```bash
for accession in $(cut -f 2 PRJNA438545-info-subset.tsv)
do

    printf "\n  Working on: ${accession}\n\n"

    fasterq-dump --split-files ${accession}

done
```

---

# Note on `prefetch`

If doing dozens to hundreds or more, or if samples are just very large in size, in my experience running `prefetch` first, then `fasterq-dump`, tends to be faster – though this way may also take up a lot more storage along the way. Just noting it here as something to consider if interested. An example modifying the above could be like so:

```bash
for accession in $(cut -f 2 PRJNA438545-info-subset.tsv)
do

    printf "\n  Working on: ${accession}\n\n"

    # downloading SRA object
    prefetch ${accession}

    # pulling fastq files out
    fasterq-dump --split-files ${accession}/${accession}.sra

    # removing SRA object
    rm -rf ${accession}

done
```

Testing that vs the above, for me the `prefetch` route finished about a minute faster ¯\\\_(ツ)\_/¯

For very large numbers of samples, building in some checks for things like whether the SRA object downloads successfully can be helpful. E.g., see [this script](https://github.com/AstrobioMike/Thaum-mapping/blob/main/metagenome-dl-and-mapping-workflow/scripts/dl-and-map-sample.sh) for some minor efforts at doing that which helped me.

---
---

# For GeneLab

## General process

- to download fastq files from SRA, we need what are called "run" accessions – these are the ones that start with "SRR", e.g., SRR7695235
- we can use the helpful program [`pysradb`](https://github.com/saketkc/pysradb) to search for run accessions based on whatever identifiers we have
- once we have run accessions, we can download them using [`sra-tools`](https://github.com/ncbi/sra-tools), first with `prefetch`, then `fasterq-dump`

## Conda env

```bash
# this is already set up on N288
# conda create -n sra-tools -c conda-forge -c bioconda -c defaults pysradb=1.2.0 sra-tools=2.11.0 python=3.7 pigz

conda activate sra-tools
```

## Example getting data for GLDS-427

From this [jira page](https://genelab-tools.arc.nasa.gov/jira/browse/GENELABDAT-602), [GSE118502](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118502) was indicated.

### Getting SRA project ID from the GSE ID we have

```bash
pysradb gse-to-srp GSE118502
```

```bash
# study_alias	study_accession
# GSE118502	SRP158306
```

The "study_accession" is the SRA project ID that we can use to get all run accessions for that project.
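(As an aside, `pysradb` also has direct conversion subcommands, so something like the following should list just the run accessions for the project – though the `metadata` route used next gives more context for each run.)

```bash
pysradb srp-to-srr SRP158306
```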
### Getting info based on the SRA project ID

```bash
pysradb metadata --detailed SRP158306 --saveto SRP158306-info.tsv
```

And slimming that down to fewer columns:

```bash
cut -f 1,2,3,22,24 SRP158306-info.tsv > SRP158306-info-slimmed.tsv
```

This table looks like this:

```bash
head -n 20 SRP158306-info-slimmed.tsv | sed 's/^/# /' | column -ts $'\t'
```

```
# run_accession  study_accession  experiment_accession  run_alias       experiment_alias
# SRR7695235     SRP158306        SRX4553616            GSM3331118_r1   GSM3331118
# SRR7695236     SRP158306        SRX4553616            GSM3331118_r10  GSM3331118
# SRR7695237     SRP158306        SRX4553616            GSM3331118_r11  GSM3331118
# SRR7695238     SRP158306        SRX4553616            GSM3331118_r12  GSM3331118
# SRR7695239     SRP158306        SRX4553616            GSM3331118_r2   GSM3331118
# SRR7695240     SRP158306        SRX4553616            GSM3331118_r3   GSM3331118
# SRR7695241     SRP158306        SRX4553616            GSM3331118_r4   GSM3331118
# SRR7695242     SRP158306        SRX4553616            GSM3331118_r5   GSM3331118
# SRR7695243     SRP158306        SRX4553616            GSM3331118_r6   GSM3331118
# SRR7695244     SRP158306        SRX4553616            GSM3331118_r7   GSM3331118
# SRR7695245     SRP158306        SRX4553616            GSM3331118_r8   GSM3331118
# SRR7695246     SRP158306        SRX4553616            GSM3331118_r9   GSM3331118
# SRR7695223     SRP158306        SRX4553615            GSM3331117_r1   GSM3331117
# SRR7695224     SRP158306        SRX4553615            GSM3331117_r10  GSM3331117
# SRR7695225     SRP158306        SRX4553615            GSM3331117_r11  GSM3331117
# SRR7695226     SRP158306        SRX4553615            GSM3331117_r12  GSM3331117
# SRR7695227     SRP158306        SRX4553615            GSM3331117_r2   GSM3331117
# SRR7695228     SRP158306        SRX4553615            GSM3331117_r3   GSM3331117
# SRR7695229     SRP158306        SRX4553615            GSM3331117_r4   GSM3331117
```

In this case, multiple "runs" (first column IDs) belong to a single sample – these are delineated by the "experiment_accession" column, and the "experiment_alias" column matches up to the sample names in the [GSE118502](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118502) page linked above. About midway through that page, it says "Samples (48)", and if we count the number of unique entries in the "experiment_alias" column, we get 48:

```bash
tail -n +2 SRP158306-info-slimmed.tsv | cut -f 5 | sort -u | wc -l
    # 48
```

### Getting fastq data for a single SRR

The process for downloading is quickest when done in two steps:

1. Using the `prefetch` command to download the SRA object
2. Splitting out the fastq info from that object

#### Downloading SRA object

Here is that process with one of the SRR accessions from above, with some general options, and the accession we want provided as a positional argument at the end:

```bash
prefetch --max-size 500G --progress -O SRR7695235-tmp/ SRR7695235
```

```bash
# 2022-01-21T19:58:28 prefetch.2.11.0: 1) Downloading 'SRR7695235'...
# 2022-01-21T19:58:28 prefetch.2.11.0:  Downloading via HTTPS...
# |-------------------------------------------------- 100%
# 2022-01-21T19:58:49 prefetch.2.11.0:  HTTPS download succeed
# 2022-01-21T19:58:53 prefetch.2.11.0:  'SRR7695235' is valid
# 2022-01-21T19:58:53 prefetch.2.11.0: 1) 'SRR7695235' was downloaded successfully
# 2022-01-21T19:58:53 prefetch.2.11.0: 'SRR7695235' has 0 unresolved dependencies
```

```bash
ls SRR7695235-tmp/SRR7695235/
# SRR7695235.sra
```

#### Pulling out fastq files

Here we are pulling the fastq files out of the SRA object we downloaded, which is provided as a positional argument at the end:

```bash
fasterq-dump --progress SRR7695235-tmp/SRR7695235/SRR7695235.sra
```

```bash
# join   :|-------------------------------------------------- 100%
# concat :|-------------------------------------------------- 100%
# spots read      : 4,388,472
# reads read      : 8,776,944
# reads written   : 8,776,944
```

They come uncompressed:

```bash
ls *.fastq
# SRR7695235_1.fastq  SRR7695235_2.fastq
```

We can delete the SRA object once we are done pulling the reads from it:

```bash
rm -rf SRR7695235-tmp/
```

### Doing for all of SRP158306, including combining multiple "runs" that belong to the same sample

That example above was just downloading the fastq data for a single SRR. But as noted above, multiple SRRs belong to a single sample in this case (GLDS-427). So we want to download all the SRRs for a given sample, then gzip them and cat them together so we have one forward and one reverse read file for each sample.

**Exactly how this is done will depend on what the target dataset and its stored info look like; this example is just how it was done for GLDS-427.**

To do that for this dataset, a script was written that handles one sample at a time – pulling all runs for that sample, compressing, combining, and renaming the output to the typical GeneLab format (ending in "_R?_raw.fastq.gz"). That script was then run on each sample individually. This could be done linearly with a for loop, or in parallel with `xargs`, as shown below.
First we want a list of all unique sample IDs as they are in our info table:

```bash
cut -f 5 SRP158306-info-slimmed.tsv | tail -n +2 | sort -u > unique-sample-IDs.txt

wc -l unique-sample-IDs.txt
    # 48 unique-sample-IDs.txt
```

#### Script for doing each individual sample (multiple runs) for GLDS-427

```bash
cat GLDS-427-dl-and-combine-runs-for-a-sample.sh
```

```bash=
#!/usr/bin/env bash
set -e

## takes one positional argument, the GSM unique sample ID in this case
## e.g.: bash GLDS-427-dl-and-combine-runs-for-a-sample.sh GSM3331071

sample_ID=${1}

# making tmp directory for work
mkdir -p ${sample_ID}-tmp

# getting all the run accessions (SRR) stored in a variable
curr_run_accs=$(grep -w "${sample_ID}" SRP158306-info-slimmed.tsv | cut -f 1 | tr "\n" " ")

printf "\n\n  Downloading run sra objects...\n\n"

# downloading all of their sra objects with prefetch
prefetch --max-size 500G --progress -O ${sample_ID}-tmp/ ${curr_run_accs}

# moving them all up a level to be easier to work with (if they were put in nested directories by prefetch)
mv ${sample_ID}-tmp/*/*.sra ${sample_ID}-tmp/ 2> /dev/null || true

# getting object locations in a variable
curr_obj_paths=$(echo $curr_run_accs | sed 's/ /.sra /g' | sed 's/$/.sra/' | sed "s/SRR/${sample_ID}-tmp\/SRR/g")

printf "\n\n  Splitting sra objects into fastq files...\n\n"

# splitting into forward and reverse reads
fasterq-dump --progress -O ${sample_ID}-tmp/ ${curr_obj_paths}

printf "\n\n  Compressing...\n\n"

# compressing all
pigz -p 8 ${sample_ID}-tmp/*.fastq

printf "\n\n  Concatenating...\n\n"

# concatenating
cat ${sample_ID}-tmp/*_1.fastq.gz > ${sample_ID}_R1_raw.fastq.gz
cat ${sample_ID}-tmp/*_2.fastq.gz > ${sample_ID}_R2_raw.fastq.gz

# removing temp files and objects
rm -rf ${sample_ID}-tmp/
```

#### Example of how script could be run linearly in a for loop

```bash
for sample in $(cat unique-sample-IDs.txt)
do

    printf "\n  Doing sample: ${sample}\n\n"

    bash GLDS-427-dl-and-combine-runs-for-a-sample.sh ${sample}

done
```

#### Example of how script could be run in parallel using `xargs`

```bash
cat unique-sample-IDs.txt | xargs -n 1 -P 4 bash GLDS-427-dl-and-combine-runs-for-a-sample.sh
```

- `-n 1` is telling `xargs` to read only one line at a time from the standard input it is getting from our `cat` command, so each command gets one sample ID. The ID it gets is placed as a positional argument at the end of the command we provided
- `-P 4` is telling `xargs` to run at most 4 processes at a time, so this will be roughly 4 times faster than running linearly with the for loop above

Run the parallel way with `xargs`, this dataset – 576 runs spread over 48 samples – took about 20 hours. Compressed file sizes at the end were roughly in the 3–4 GB range.
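Since there are 48 unique sample IDs going in, a quick sanity check when everything finishes (just a minimal sketch) is to count the final forward-read files:

```bash
ls *_R1_raw.fastq.gz | wc -l
    # should be 48 if all samples completed
```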
---
---

# Random notes from GL figuring

## From bioproject PRJNA486575

### getting run accessions associated with the bioproject

```bash
prefetch
```

## From GSE GSE118502

```bash
pysradb gse-to-srp GSE118502

# study_alias	study_accession
# GSE118502	SRP158306
```

```bash
pysradb metadata --detailed SRP158306 --saveto SRP158306-info.tsv
```

Slimming it down to something smaller:

```bash
cut -f 1,2,3,22,24 SRP158306-info.tsv > SRP158306-info-slimmed.tsv

head SRP158306-info-slimmed.tsv | column -ts $'\t' | sed 's/^/# /'

# run_accession  study_accession  experiment_accession  run_alias       experiment_alias
# SRR7695235     SRP158306        SRX4553616            GSM3331118_r1   GSM3331118
# SRR7695236     SRP158306        SRX4553616            GSM3331118_r10  GSM3331118
# SRR7695237     SRP158306        SRX4553616            GSM3331118_r11  GSM3331118
# SRR7695238     SRP158306        SRX4553616            GSM3331118_r12  GSM3331118
# SRR7695239     SRP158306        SRX4553616            GSM3331118_r2   GSM3331118
# SRR7695240     SRP158306        SRX4553616            GSM3331118_r3   GSM3331118
# SRR7695241     SRP158306        SRX4553616            GSM3331118_r4   GSM3331118
# SRR7695242     SRP158306        SRX4553616            GSM3331118_r5   GSM3331118
# SRR7695243     SRP158306        SRX4553616            GSM3331118_r6   GSM3331118
```

Getting a list of all unique "experiment_alias" IDs, as that's what we have linked on the jira page:

```bash
cut -f 5 SRP158306-info-slimmed.tsv | tail -n +2 | sort -u > unique-sample-IDs.txt

wc -l unique-sample-IDs.txt
# 48

head unique-sample-IDs.txt | sed 's/^/# /'
# GSM3331071
# GSM3331072
# GSM3331073
# GSM3331074
# GSM3331075
# GSM3331076
# GSM3331077
# GSM3331078
# GSM3331079
# GSM3331080
```

Doing some time tests on a single sample's run accessions to figure out the quickest way:

```bash
# making tmp directory for work
mkdir -p GSM3331071-tmp

# getting all the run accessions (SRR) stored in a variable
curr_run_accs=$(grep -w "GSM3331071" SRP158306-info-slimmed.tsv | cut -f 1 | tr "\n" " ")

echo ${curr_run_accs}
# SRR7694671 SRR7694672 SRR7694673 SRR7694674 SRR7694675 SRR7694676 SRR7694677 SRR7694678 SRR7694679 SRR7694680 SRR7694681 SRR7694682

# downloading all of their sra objects with prefetch
time prefetch --max-size 500G --progress -O GSM3331071-tmp/ ${curr_run_accs}
    # that was 12 and took 7 minutes

# getting object locations in a variable
curr_obj_paths=$(echo $curr_run_accs | sed 's/ /.sra /g' | sed 's/$/.sra/' | sed 's/SRR/GSM3331071-tmp\/SRR/g')

# splitting into forward and reverse reads
time fasterq-dump --split-files --progress --threads 8 -O GSM3331071-tmp/ ${curr_obj_paths}
    # took 90 seconds for these 12

# concatenating forward and reverse reads and gzipping
time cat GSM3331071-tmp/*_1.fastq | gzip > GSM3331071_R1_raw.fastq.gz

real    23m30.603s
user    22m41.688s
sys     0m48.308s

time cat GSM3331071-tmp/*_2.fastq | gzip > GSM3331071_R2_raw.fastq.gz

real    29m58.691s
user    29m6.388s
sys     0m46.200s

### so almost an hour to cat | gzip both

### trying pigz way
time pigz -p 8 GSM3331071-tmp/*.fastq

real    10m23.171s
user    81m19.204s
sys     0m38.792s

time cat GSM3331071-tmp/*_1.fastq.gz > GSM3331071_R1_raw-pigz.fastq.gz

real    0m4.344s
user    0m0.012s
sys     0m4.264s

time cat GSM3331071-tmp/*_2.fastq.gz > GSM3331071_R2_raw-pigz.fastq.gz

real    0m5.973s
user    0m0.028s
sys     0m5.916s

### so like 10 minutes with the pigz way, let's look at the sizes
ls -lt GSM*.fastq.gz
# -rw-rw-r-- 1 mlee mlee 3694225587 Jan 19 19:54 GSM3331071_R2_raw-pigz.fastq.gz
# -rw-rw-r-- 1 mlee mlee 2899172500 Jan 19 19:54 GSM3331071_R1_raw-pigz.fastq.gz
# -rw-rw-r-- 1 mlee mlee 3689460183 Jan 19 19:39 GSM3331071_R2_raw.fastq.gz
# -rw-rw-r-- 1 mlee mlee 2896960207 Jan 19 19:09 GSM3331071_R1_raw.fastq.gz

### virtually identical, pigz it is!
# removing temp files and objects
rm -rf GSM3331071-tmp/
```

So for this dataset, that one sample with 12 separate runs took like 90 minutes with the cat | gzip way, and about 20 minutes with the pigz -> cat way. If we extrapolate that to the 48 total samples, we're looking at maybe 16 hours to do them all in a linear fashion as this is currently written. We could parallelize it though, so I'll write the script so it can be run that way.

And here's all that in a script for doing it on a single sample, so we can parallelize it with `xargs` if we want:

```bash
cat GLDS-427-dl-and-combine-runs-for-a-sample.sh
```

```bash=
#!/usr/bin/env bash
set -e

## takes one positional argument, the GSM unique sample ID in this case
## e.g.: bash GLDS-427-dl-and-combine-runs-for-a-sample.sh GSM3331071

sample_ID=${1}

# making tmp directory for work
mkdir -p ${sample_ID}-tmp

# getting all the run accessions (SRR) stored in a variable
curr_run_accs=$(grep -w "${sample_ID}" SRP158306-info-slimmed.tsv | cut -f 1 | tr "\n" " ")

printf "\n\n  Downloading run sra objects...\n\n"

# downloading all of their sra objects with prefetch
prefetch --max-size 500G --progress -O ${sample_ID}-tmp/ ${curr_run_accs}

# moving them all up a level to be easier to work with (if they were put in nested directories by prefetch)
mv ${sample_ID}-tmp/*/*.sra ${sample_ID}-tmp/ 2> /dev/null || true

# getting object locations in a variable
curr_obj_paths=$(echo $curr_run_accs | sed 's/ /.sra /g' | sed 's/$/.sra/' | sed "s/SRR/${sample_ID}-tmp\/SRR/g")

printf "\n\n  Splitting sra objects into fastq files...\n\n"

# splitting into forward and reverse reads
fasterq-dump --split-files --progress --threads 8 -O ${sample_ID}-tmp/ ${curr_obj_paths}

printf "\n\n  Compressing...\n\n"

# compressing all
pigz -p 8 ${sample_ID}-tmp/*.fastq

printf "\n\n  Concatenating...\n\n"

# concatenating
cat ${sample_ID}-tmp/*_1.fastq.gz > ${sample_ID}_R1_raw.fastq.gz
cat ${sample_ID}-tmp/*_2.fastq.gz > ${sample_ID}_R2_raw.fastq.gz

# removing temp files and objects
rm -rf ${sample_ID}-tmp/
```

---

Here's what the process looks like:

- for sample in...
    - prefetch all sra objects for a given sample
    - split them into reads
    - cat all reads together and gzip them
    - remove sra objects and individual read files
- done

Running one on an SRX accession to see what it does, maybe it grabs all SRRs under that SRX:

```bash
time fasterq-dump --split-files --progress --threads 8 SRX4553616

SRX4553616 is not a run accession. For more information, see
  https://www.ncbi.nlm.nih.gov/sra/?term=SRX4553616
Automatic expansion of container accessions is not currently available.
See the above link(s) for information about the accessions.

## no such luck, maybe prefetch on an SRX?
prefetch --type fastq --max-size 100G --progress SRX4553616

SRX4553616 is not a run accession. For more information, see
  https://www.ncbi.nlm.nih.gov/sra/?term=SRX4553616
Automatic expansion of container accessions is not currently available.
See the above link(s) for information about the accessions.

## nope
```

Running on one SRR:

```bash
time fasterq-dump --split-files --progress --threads 8 SRR7695235

join   :|-------------------------------------------------- 100%
concat :|-------------------------------------------------- 100%
spots read      : 4,388,472
reads read      : 8,776,944
reads written   : 8,776,944

real    5m31.933s
user    0m50.580s
sys     0m14.864s

# prefetch
time prefetch --max-size 100G --progress SRR7695235 -O ./

2022-01-20T01:47:26 prefetch.2.11.0: 1) Downloading 'SRR7695235'...
2022-01-20T01:47:26 prefetch.2.11.0:  Downloading via HTTPS...
|-------------------------------------------------- 100%
2022-01-20T01:48:00 prefetch.2.11.0:  HTTPS download succeed
2022-01-20T01:48:01 prefetch.2.11.0:  'SRR7695235' is valid
2022-01-20T01:48:01 prefetch.2.11.0: 1) 'SRR7695235' was downloaded successfully
2022-01-20T01:48:01 prefetch.2.11.0: 'SRR7695235' has 0 unresolved dependencies

real    0m35.910s
user    0m5.112s
sys     0m2.392s

# pulling out fastq
time fasterq-dump --split-files --progress --threads 8 --gz SRR7695235.sra

join   :|-------------------------------------------------- 100%
concat :|-------------------------------------------------- 100%
spots read      : 4,388,472
reads read      : 8,776,944
reads written   : 8,776,944

real    0m9.779s
user    0m42.632s
sys     0m11.168s
```

So much faster doing prefetch first, then splitting that.

Running on multiple and seeing if we can get it to cat them together for us:

```bash
time prefetch --max-size 100G --progress -O ./ SRR7695235 SRR7695236

2022-01-20T01:55:25 prefetch.2.11.0: 1) Downloading 'SRR7695235'...
2022-01-20T01:55:25 prefetch.2.11.0:  Downloading via HTTPS...
|-------------------------------------------------- 100%
2022-01-20T01:56:00 prefetch.2.11.0:  HTTPS download succeed
2022-01-20T01:56:01 prefetch.2.11.0:  'SRR7695235' is valid
2022-01-20T01:56:01 prefetch.2.11.0: 1) 'SRR7695235' was downloaded successfully
2022-01-20T01:56:01 prefetch.2.11.0: 'SRR7695235' has 0 unresolved dependencies
2022-01-20T01:56:02 prefetch.2.11.0: 2) Downloading 'SRR7695236'...
2022-01-20T01:56:02 prefetch.2.11.0:  Downloading via HTTPS...
|-------------------------------------------------- 100%
2022-01-20T01:56:39 prefetch.2.11.0:  HTTPS download succeed
2022-01-20T01:56:41 prefetch.2.11.0:  'SRR7695236' is valid
2022-01-20T01:56:41 prefetch.2.11.0: 2) 'SRR7695236' was downloaded successfully
2022-01-20T01:56:41 prefetch.2.11.0: 'SRR7695236' has 0 unresolved dependencies

real    1m16.476s
user    0m10.612s
sys     0m4.900s
```

That grabs them independently. Trying fasterq-dump on multiple local sra objects:

```bash
time fasterq-dump --split-files --progress --threads 8 SRR7695235.sra SRR7695236.sra

join   :|-------------------------------------------------- 100%
concat :|-------------------------------------------------- 100%
spots read      : 4,388,472
reads read      : 8,776,944
reads written   : 8,776,944
join   :|-------------------------------------------------- 100%
concat :|-------------------------------------------------- 100%
spots read      : 4,427,070
reads read      : 8,854,140
reads written   : 8,854,140

real    0m19.240s
user    1m25.100s
sys     0m20.948s
```

That did them separately; seeing if the concatenate option does what we want:

```bash
time fasterq-dump --split-files --progress --threads 8 --concatenate-reads SRR7695235.sra SRR7695236.sra
```

We can run fasterq-dump on multiple SRRs provided as a space-delimited list, I believe. So that may be the most straightforward way to go.
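E.g., presumably something like this would process each run in turn (not tested here; these are the same two accessions used above):

```bash
fasterq-dump --split-files --progress --threads 8 SRR7695235 SRR7695236
```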
Testing `cat | gzip` vs gzip-then-cat:

```bash
time cat SRR7695235_1.fastq SRR7695235_2.fastq | gzip > t1.fastq.gz

real    4m53.752s
user    4m47.360s
sys     0m7.392s
```

```bash
cat test.sh
```

```
gzip SRR7695235_1.fastq > p1.fastq.gz
gzip SRR7695235_2.fastq > p2.fastq.gz
cat p1.fastq.gz p2.fastq.gz > both.fastq.gz
```

(Note: as written, `gzip file > out.gz` compresses in place and leaves the redirected file empty – the later `test-gzip.sh` version is the cleaner form.)

```bash
time bash test.sh

real    5m28.951s
user    5m23.072s
sys     0m4.836s
```

Trying with pigz:

```bash
cat test-pigz.sh
```

```
pigz SRR7695235_1.fastq SRR7695235_2.fastq
cat SRR7695235_1.fastq.gz SRR7695235_2.fastq.gz > pigz-both.fq.gz
```

```bash
time bash test-pigz.sh

real    0m40.200s
user    7m33.280s
sys     0m5.284s
```

Trying with gzip instead of pigz, just to be sure the difference is due to pigz and not the other changes:

```bash
cat test-gzip.sh
```

```
gzip SRR7695235_1.fastq SRR7695235_2.fastq
cat SRR7695235_1.fastq.gz SRR7695235_2.fastq.gz > gzip-both.fq.gz
```

```bash
time bash test-gzip.sh

real    5m14.677s
user    5m7.064s
sys     0m4.660s
```

```bash
pigz SRR7695235_1.fastq SRR7695235_2.fastq
```

Should try pigz on a directory of files, then cat the gzipped ones together.
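A minimal sketch of what that test might look like, following the same pattern as the per-sample script above (the `SRR7695235-tmp/` directory and output names here are just placeholders):

```bash
# compress all fastq files sitting in the sample's directory, in parallel
pigz -p 8 SRR7695235-tmp/*.fastq

# then concatenating the already-gzipped forward and reverse files is nearly instant
cat SRR7695235-tmp/*_1.fastq.gz > sample_R1_raw.fastq.gz
cat SRR7695235-tmp/*_2.fastq.gz > sample_R2_raw.fastq.gz
```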