---
tags: happy belly
title: Examples downloading fastq data from NCBI's SRA
---
> **Starter page for a happy belly tutorial on downloading fastq data from SRA.**
# Examples downloading fastq data from NCBI's SRA
[toc]
---
# Environment
```bash
conda create -y -n ncbi-tools -c conda-forge -c bioconda -c defaults sra-tools=2.10.1 entrez-direct=13.9
conda create -y -n pysradb -c conda-forge -c bioconda -c defaults python=3.7 pysradb
conda activate ncbi-tools
```
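A quick optional check that the tools are available in the active environment (just standard version/location checks):
```bash
fasterq-dump --version    # should report 2.10.1 given the env created above
which esearch esummary xtract
```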
---
# Examples
## Downloading one sample's fastq files
### Starting with "run accession" (SRR...)
To download things with `fasterq-dump` as shown here, we want a "run" accession, e.g., SRR6853380:
```bash
fasterq-dump --split-files --progress SRR6853380
```
I'm not detailing the programs in depth here, so see `fasterq-dump -h` for more info on the options available and the ones set above.
When that's done it will print out something like:
```
spots read : 6,900,587
reads read : 13,801,174
reads written : 13,801,174
```
And we have two files in our working directory that are each about 1.7GB:
```
1.7G SRR6853380_1.fastq
1.7G SRR6853380_2.fastq
```
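As a quick optional sanity check, each fastq record is 4 lines, so dividing a file's line count by 4 should match the "spots read" value reported above:
```bash
# should print 6900587 for this run, matching the "spots read" value above
echo $(( $(wc -l < SRR6853380_1.fastq) / 4 ))
```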
---
### Starting with "experiment accession" (SRX...)
We first need to get the run accession. We can find that using `entrez-direct`. This toolset is powerful and complicated – more examples can be found on the [EDirect page](https://astrobiomike.github.io/unix/ncbi_eutils).
We can view some summary information in XML format with this alone:
```bash
esearch -db sra -query SRX3808505 | esummary
```
```
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DocumentSummarySet PUBLIC "-//NLM//DTD esummary sra 20130524//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130524/esummary_sra.dtd">
<DocumentSummarySet status="OK">
<DocumentSummary>
<Id>5245643</Id>
<ExpXml> <Summary><Title>Metagenome</Title><Platform instrument_model="Illumina HiSeq 4000">ILLUMINA</Platform><Statistics total_runs="1" total_spots="6900587" total_bases="1264594030" total_size="707674318" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA667369" center_name="Jet Propulsion Laboratory, California Institute of" contact_name="Kasthuri Venkateswaran" lab_name="Biotechnology and Planetary Protection Group"/><Experiment acc="SRX3808505" ver="1" status="public" name="Metagenome"/><Study acc="SRP135937" name="ISS Metagenomes"/><Organism taxid="1256227" ScientificName="indoor metagenome"/><Sample acc="SRS2065862" name=""/><Instrument ILLUMINA="Illumina HiSeq 4000"/><Library_descriptor><LIBRARY_NAME>IIIF1SWP</LIBRARY_NAME><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>METAGENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION><LIBRARY_LAYOUT> <PAIRED/> </LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA438545</Bioproject><Biosample>SAMN05581714</Biosample> </ExpXml>
<Runs> <Run acc="SRR6853380" total_spots="6900587" total_bases="1264594030" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/> </Runs>
<ExtLinks></ExtLinks>
<CreateDate>2018/03/18</CreateDate>
<UpdateDate>2018/03/18</UpdateDate>
</DocumentSummary>
</DocumentSummarySet>
```
And we can pull out specific pieces of info – here returning the input experiment accession, the run accession, and the platform it was sequenced on:
```bash
esearch -db sra -query SRX3808505 | esummary | xtract -pattern DocumentSummary -element Experiment@acc,Run@acc,Platform@instrument_model
```
```
SRX3808505 SRR6853380 Illumina HiSeq 4000
```
Then we could download with the run accession (SRR6853380 in this case), the same way we did above with `fasterq-dump`.
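If helpful, those two steps can be chained together – a minimal sketch using the same commands as above, just capturing the run accession in a variable first:
```bash
# get the run accession for the experiment accession, then download its fastq files
run_acc=$(esearch -db sra -query SRX3808505 | esummary | xtract -pattern DocumentSummary -element Run@acc)
fasterq-dump --split-files --progress ${run_acc}
```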
---
## Downloading fastq files based on a bioproject (PRJNA...)
### Getting info on all runs in the bioproject
We still need to get to the run accessions; here we'll start by searching for the bioproject.
This alone tells us there are 42 hits (in the "Count" field):
```bash
esearch -db sra -query PRJNA438545
```
```
<ENTREZ_DIRECT>
<Db>sra</Db>
<WebEnv>MCID_60e24c20c687287612105432</WebEnv>
<QueryKey>1</QueryKey>
<Count>42</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
```
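(As an aside, if we just want that count by itself, something like this should pull it out with `xtract` from the same output:)
```bash
esearch -db sra -query PRJNA438545 | xtract -pattern ENTREZ_DIRECT -element Count
# should print 42, matching the Count field above
```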
We can take a look by printing some of the `esummary` output to the terminal:
```bash
esearch -db sra -query PRJNA438545 | esummary | head -n 20
```
```
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DocumentSummarySet PUBLIC "-//NLM//DTD esummary sra 20130524//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130524/esummary_sra.dtd">
<DocumentSummarySet status="OK">
<DocumentSummary>
<Id>5245681</Id>
<ExpXml> <Summary><Title>Metagenome</Title><Platform instrument_model="Illumina HiSeq 4000">ILLUMINA</Platform><Statistics total_runs="1" total_spots="13999788" total_bases="2827957176" total_size="909033108" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA667369" center_name="Jet Propulsion Laboratory, California Institute of" contact_name="Kasthuri Venkateswaran" lab_name="Biotechnology and Planetary Protection Group"/><Experiment acc="SRX3808543" ver="1" status="public" name="Metagenome"/><Study acc="SRP135937" name="ISS Metagenomes"/><Organism taxid="1256227" ScientificName="indoor metagenome"/><Sample acc="SRS1630136" name=""/><Instrument ILLUMINA="Illumina HiSeq 4000"/><Library_descriptor><LIBRARY_NAME>IIF4SWP</LIBRARY_NAME><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>METAGENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION><LIBRARY_LAYOUT> <PAIRED/> </LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA438545</Bioproject><Biosample>SAMN05581708</Biosample> </ExpXml>
<Runs> <Run acc="SRR6853342" total_spots="13999788" total_bases="2827957176" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/> </Runs>
<ExtLinks></ExtLinks>
<CreateDate>2018/03/18</CreateDate>
<UpdateDate>2018/03/18</UpdateDate>
</DocumentSummary>
<DocumentSummary>
<Id>5245680</Id>
<ExpXml> <Summary><Title>Metagenome</Title><Platform instrument_model="Illumina HiSeq 4000">ILLUMINA</Platform><Statistics total_runs="1" total_spots="14128994" total_bases="2854056788" total_size="937349516" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA667369" center_name="Jet Propulsion Laboratory, California Institute of" contact_name="Kasthuri Venkateswaran" lab_name="Biotechnology and Planetary Protection Group"/><Experiment acc="SRX3808542" ver="1" status="public" name="Metagenome"/><Study acc="SRP135937" name="ISS Metagenomes"/><Organism taxid="1256227" ScientificName="indoor metagenome"/><Sample acc="SRS1630137" name=""/><Instrument ILLUMINA="Illumina HiSeq 4000"/><Library_descriptor><LIBRARY_NAME>IIF5SWP</LIBRARY_NAME><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>METAGENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION><LIBRARY_LAYOUT> <PAIRED/> </LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA438545</Bioproject><Biosample>SAMN05581709</Biosample> </ExpXml>
<Runs> <Run acc="SRR6853343" total_spots="14128994" total_bases="2854056788" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/> </Runs>
<ExtLinks></ExtLinks>
<CreateDate>2018/03/18</CreateDate>
<UpdateDate>2018/03/18</UpdateDate>
</DocumentSummary>
```
And from there we can pull out all the run accessions associated with this BioProject we are searching for (along with the same additional information we pulled above), and write it to a file:
```bash
esearch -db sra -query PRJNA438545 | esummary | xtract -pattern DocumentSummary -element Experiment@acc,Run@acc,Platform@instrument_model > PRJNA438545-info.tsv
head PRJNA438545-info.tsv
```
```
SRX3808543 SRR6853342 Illumina HiSeq 4000
SRX3808542 SRR6853343 Illumina HiSeq 4000
SRX3808541 SRR6853344 Illumina HiSeq 4000
SRX3808540 SRR6853345 Illumina HiSeq 4000
SRX3808539 SRR6853346 Illumina HiSeq 4000
SRX3808538 SRR6853347 Illumina HiSeq 4000
SRX3808537 SRR6853348 Illumina HiSeq 4000
SRX3808536 SRR6853349 Illumina HiSeq 4000
SRX3808535 SRR6853350 Illumina HiSeq 4000
SRX3808534 SRR6853351 Illumina HiSeq 4000
```
---
### Downloading with a loop
If we wanted to download them all, one way we could do it is with a loop like the one below.
First, shortening the table to just 2 samples for the example here:
```bash
head -n 2 PRJNA438545-info.tsv > PRJNA438545-info-subset.tsv
```
```bash
for accession in $(cut -f 2 PRJNA438545-info-subset.tsv)
do
printf "\n Working on: ${accession}\n\n"
fasterq-dump --split-files ${accession}
done
```
---
# Note on `prefetch`
If doing dozens to hundreds of samples or more, or if samples are just very large, in my experience running `prefetch` first and then `fasterq-dump` tends to be faster – though this way may also take up a lot more storage. Just noting it here as something to consider in case it's of interest.
An example modifying the above could be like so:
```bash
for accession in $(cut -f 2 PRJNA438545-info-subset.tsv)
do
printf "\n Working on: ${accession}\n\n"
# downloading SRA object
prefetch ${accession}
# pulling fastq files out
fasterq-dump --split-files ${accession}/${accession}.sra
# removing SRA object
rm -rf ${accession}
done
```
Testing that against the loop above, the `prefetch` route finished about a minute faster for me ¯\\\_(ツ)\_\/¯
For very large numbers of samples, building in some checks – like whether the SRA object downloaded successfully – can be helpful. E.g., see [this script](https://github.com/AstrobioMike/Thaum-mapping/blob/main/metagenome-dl-and-mapping-workflow/scripts/dl-and-map-sample.sh) for some minor efforts at that which helped me.
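As a rough illustration only (separate from that script), here's a minimal sketch of what a basic check might look like in the loop from above – it just relies on `prefetch`'s exit status, retries once, and logs failures to a made-up `failed-prefetch.txt` file before moving on:
```bash
for accession in $(cut -f 2 PRJNA438545-info-subset.tsv)
do
    printf "\n  Working on: ${accession}\n\n"

    # try the download, retrying once if prefetch exits non-zero
    if ! prefetch ${accession}; then
        printf "    prefetch failed for ${accession}, retrying once...\n"
        if ! prefetch ${accession}; then
            # note the failure and move on to the next accession
            echo ${accession} >> failed-prefetch.txt
            continue
        fi
    fi

    # pulling fastq files out and removing the SRA object, as above
    fasterq-dump --split-files ${accession}/${accession}.sra
    rm -rf ${accession}
done
```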
---
---
# For Genelab
## General process
- to download fastq files from SRA, we need what are called "run" accessions – these are the ones that start with "SRR", e.g. SRR7695235
- we can use the helpful program [`pysradb`](https://github.com/saketkc/pysradb) to search for run accessions based on whatever identifiers we have
- once we have run accessions, we can download them using [`sra-tools`](https://github.com/ncbi/sra-tools), first with `prefetch`, then `fasterq-dump`
## Conda env
```bash
# this is already set up on N288
# conda create -n sra-tools -c conda-forge -c bioconda -c defaults pysradb=1.2.0 sra-tools=2.11.0 python=3.7 pigz
conda activate sra-tools
```
## Example getting data for GLDS-427
From this [jira page](https://genelab-tools.arc.nasa.gov/jira/browse/GENELABDAT-602), [GSE118502](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118502) was indicated.
### Getting SRA project ID from the GSE ID we have
```bash
pysradb gse-to-srp GSE118502
```
```bash
# study_alias study_accession
# GSE118502 SRP158306
```
The "study_accession" is the SRA project ID that we can use to get all run accessions for that project.
### Getting info based on the SRA project ID
```bash
pysradb metadata --detailed SRP158306 --saveto SRP158306-info.tsv
```
And slimming that down to fewer columns:
```bash
cut -f 1,2,3,22,24 SRP158306-info.tsv > SRP158306-info-slimmed.tsv
```
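The column numbers given to `cut` there were picked by looking at the full table's header; a generic way to list the header fields next to their column numbers (handy when adapting this to other tables) is something like:
```bash
# print each header field on its own numbered line
head -n 1 SRP158306-info.tsv | tr '\t' '\n' | cat -n | head -n 30
```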
This table looks like this:
```bash
head -n 20 SRP158306-info-slimmed.tsv | sed 's/^/# /' | column -ts $'\t'
```
```
# run_accession study_accession experiment_accession run_alias experiment_alias
# SRR7695235 SRP158306 SRX4553616 GSM3331118_r1 GSM3331118
# SRR7695236 SRP158306 SRX4553616 GSM3331118_r10 GSM3331118
# SRR7695237 SRP158306 SRX4553616 GSM3331118_r11 GSM3331118
# SRR7695238 SRP158306 SRX4553616 GSM3331118_r12 GSM3331118
# SRR7695239 SRP158306 SRX4553616 GSM3331118_r2 GSM3331118
# SRR7695240 SRP158306 SRX4553616 GSM3331118_r3 GSM3331118
# SRR7695241 SRP158306 SRX4553616 GSM3331118_r4 GSM3331118
# SRR7695242 SRP158306 SRX4553616 GSM3331118_r5 GSM3331118
# SRR7695243 SRP158306 SRX4553616 GSM3331118_r6 GSM3331118
# SRR7695244 SRP158306 SRX4553616 GSM3331118_r7 GSM3331118
# SRR7695245 SRP158306 SRX4553616 GSM3331118_r8 GSM3331118
# SRR7695246 SRP158306 SRX4553616 GSM3331118_r9 GSM3331118
# SRR7695223 SRP158306 SRX4553615 GSM3331117_r1 GSM3331117
# SRR7695224 SRP158306 SRX4553615 GSM3331117_r10 GSM3331117
# SRR7695225 SRP158306 SRX4553615 GSM3331117_r11 GSM3331117
# SRR7695226 SRP158306 SRX4553615 GSM3331117_r12 GSM3331117
# SRR7695227 SRP158306 SRX4553615 GSM3331117_r2 GSM3331117
# SRR7695228 SRP158306 SRX4553615 GSM3331117_r3 GSM3331117
# SRR7695229 SRP158306 SRX4553615 GSM3331117_r4 GSM3331117
```
In this case, multiple "runs" (first column IDs) belong to a single sample – these are delineated by the "experiment_accession" column, and the "experiment_alias" column matches up to the sample names in the [GSE118502](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118502) page linked above. About midway through the page, it says "Samples (48)", and if we count the number of unique entries in the "experiment_alias" column, we get 48:
```bash
tail -n +2 SRP158306-info-slimmed.tsv | cut -f 5 | sort -u | wc -l
# 48
```
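Relatedly, a quick way to see how many runs fall under each sample (each should be 12 here, given the 576 runs across 48 samples noted further down):
```bash
# counts of runs per "experiment_alias" (first few shown)
tail -n +2 SRP158306-info-slimmed.tsv | cut -f 5 | sort | uniq -c | head -n 5
```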
### Getting fastq data for a single SRR
The process for downloading is quickest when done in two steps:
1. Using the `prefetch` command to download the SRA object
2. Splitting out the fastq info from that object
#### Downloading SRA object
Here is that process with one of the SRR accessions from above, using some general options; the accession we want is provided as a positional argument at the end:
```bash
prefetch --max-size 500G --progress -O SRR7695235-tmp/ SRR7695235
```
```bash
# 2022-01-21T19:58:28 prefetch.2.11.0: 1) Downloading 'SRR7695235'...
# 2022-01-21T19:58:28 prefetch.2.11.0: Downloading via HTTPS...
# |-------------------------------------------------- 100%
# 2022-01-21T19:58:49 prefetch.2.11.0: HTTPS download succeed
# 2022-01-21T19:58:53 prefetch.2.11.0: 'SRR7695235' is valid
# 2022-01-21T19:58:53 prefetch.2.11.0: 1) 'SRR7695235' was downloaded successfully
# 2022-01-21T19:58:53 prefetch.2.11.0: 'SRR7695235' has 0 unresolved dependencies
```
```bash
ls SRR7695235-tmp/SRR7695235/
# SRR7695235.sra
```
#### Pulling out fastq files
Here we are pulling the fastq files out of the SRA object we downloaded, which is provided as a positional argument at the end:
```bash
fasterq-dump --progress SRR7695235-tmp/SRR7695235/SRR7695235.sra
```
```bash
# join :|-------------------------------------------------- 100%
# concat :|-------------------------------------------------- 100%
# spots read : 4,388,472
# reads read : 8,776,944
# reads written : 8,776,944
```
They come uncompressed:
```bash
ls *.fastq
# SRR7695235_1.fastq SRR7695235_2.fastq
```
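If compressed files are wanted at this point, the fastq files can be gzipped before moving on – e.g. with `pigz` (which is in the conda env noted above), though plain `gzip` works too, just slower:
```bash
pigz -p 8 SRR7695235_1.fastq SRR7695235_2.fastq
```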
We can delete the SRA object once we are done pulling the reads from it:
```bash
rm -rf SRR7695235-tmp/
```
### Doing for all of SRP158306, including combining multiple "runs" that belong to the same sample
That example above was just downloading the fastq data for a single SRR. But as noted above, multiple SRRs belong to a single sample in this case (of GLDS-427). So we want to download all the SRRs for a given sample, then gzip them and cat them together so we have one forward and one reverse read file for each sample. **Exactly how this is done will depend on what the target dataset and its stored info look like; this example is just how it was done for GLDS-427.**
To do that for this dataset, a script was written that handles one sample at a time – pulling all runs for that sample, then compressing, combining, and renaming so the files are in typical GeneLab format (ending in "_R?_raw.fastq.gz"). That script was then run on each sample individually, which can be done linearly with a for loop or in parallel with `xargs`, as shown below.
First we want a list of all unique sample IDs as they are in our info table:
```bash
cut -f 5 SRP158306-info-slimmed.tsv | tail -n +2 | sort -u > unique-sample-IDs.txt
wc -l unique-sample-IDs.txt
# 48 unique-sample-IDs.txt
```
#### Script for doing each individual sample (multiple runs) for GLDS-427
```bash
cat GLDS-427-dl-and-combine-runs-for-a-sample.sh
```
```bash=
#!/usr/bin/env bash
set -e
## takes one positional argument, the GSM unique sample ID in this case
## e.g: bash GLDS-427-dl-and-combine-runs-for-a-sample.sh GSM3331071
sample_ID=${1}
# making tmp directory for work
mkdir -p ${sample_ID}-tmp
# getting all the run accessions (SRR) stored in a variable
curr_run_accs=$(grep -w "${sample_ID}" SRP158306-info-slimmed.tsv | cut -f 1 | tr "\n" " ")
printf "\n\n Downloading run sra objects...\n\n"
# downloading all of their sra objects with prefetch
prefetch --max-size 500G --progress -O ${sample_ID}-tmp/ ${curr_run_accs}
# moving them all up a level to be easier to work with (if they were put in nested directories by prefetch)
mv ${sample_ID}-tmp/*/*.sra ${sample_ID}-tmp/ 2> /dev/null || true
# getting object locations in a variable
curr_obj_paths=$(echo $curr_run_accs | sed 's/ /.sra /g' | sed 's/$/.sra/' | sed "s/SRR/${sample_ID}-tmp\/SRR/g")
printf "\n\n Splitting sra objects into fastq files...\n\n"
# splitting into forward and reverse reads
fasterq-dump --progress -O ${sample_ID}-tmp/ ${curr_obj_paths}
printf "\n\n Compressing...\n\n"
# compressing all
pigz -p 8 ${sample_ID}-tmp/*.fastq
printf "\n\n Concatenating...\n\n"
# concatenating
cat ${sample_ID}-tmp/*_1.fastq.gz > ${sample_ID}_R1_raw.fastq.gz
cat ${sample_ID}-tmp/*_2.fastq.gz > ${sample_ID}_R2_raw.fastq.gz
# removing temp files and objects
rm -rf ${sample_ID}-tmp/
```
#### Example of how script could be run linearly in a for loop
```bash
for sample in $(cat unique-sample-IDs.txt)
do
printf "\n Doing sample: ${sample}\n\n"
bash GLDS-427-dl-and-combine-runs-for-a-sample.sh ${sample}
done
```
#### Example of how script could be run in parallel using `xargs`
```bash
cat unique-sample-IDs.txt | xargs -n 1 -P 4 bash GLDS-427-dl-and-combine-runs-for-a-sample.sh
```
- `-n 1` is telling `xargs` to read only one line at a time from the standard input it is getting from our `cat` command, so each command is getting one sample ID. The ID it gets is placed as a positional argument at the end of the command we provided
- `-P 4` is telling `xargs` to run at most 4 processes at a time, so this will be roughly 4 times faster than running linearly with the for loop above
It took about 20 hours running the parallel way with `xargs` here on this dataset, which had 576 runs spread over 48 samples. Compressed file sizes at the end were roughly in the 3-4 GB range.
---
---
# Random notes from GL figuring
## From bioproject
PRJNA486575
### getting run accessions associated with the bioproject
```bash
esearch -db sra -query PRJNA486575 | esummary | xtract -pattern DocumentSummary -element Experiment@acc,Run@acc
```
## From GSE
GSE118502
```bash
pysradb gse-to-srp GSE118502
# study_alias study_accession
# GSE118502 SRP158306
```
```bash
pysradb metadata --detailed SRP158306 --saveto SRP158306-info.tsv
```
Slimming it down to something smaller:
```bash
cut -f 1,2,3,22,24 SRP158306-info.tsv > SRP158306-info-slimmed.tsv
head SRP158306-info-slimmed.tsv | column -ts $'\t' | sed 's/^/# /'
# run_accession study_accession experiment_accession run_alias experiment_alias
# SRR7695235 SRP158306 SRX4553616 GSM3331118_r1 GSM3331118
# SRR7695236 SRP158306 SRX4553616 GSM3331118_r10 GSM3331118
# SRR7695237 SRP158306 SRX4553616 GSM3331118_r11 GSM3331118
# SRR7695238 SRP158306 SRX4553616 GSM3331118_r12 GSM3331118
# SRR7695239 SRP158306 SRX4553616 GSM3331118_r2 GSM3331118
# SRR7695240 SRP158306 SRX4553616 GSM3331118_r3 GSM3331118
# SRR7695241 SRP158306 SRX4553616 GSM3331118_r4 GSM3331118
# SRR7695242 SRP158306 SRX4553616 GSM3331118_r5 GSM3331118
# SRR7695243 SRP158306 SRX4553616 GSM3331118_r6 GSM3331118
```
Getting list of all unique "experiment_alias" IDs, as that's what we have linked on the jira page:
```bash
cut -f 5 SRP158306-info-slimmed.tsv | tail -n +2 | sort -u > unique-sample-IDs.txt
wc -l unique-sample-IDs.txt
# 48
head unique-sample-IDs.txt | sed 's/^/# /'
# GSM3331071
# GSM3331072
# GSM3331073
# GSM3331074
# GSM3331075
# GSM3331076
# GSM3331077
# GSM3331078
# GSM3331079
# GSM3331080
```
Doing some time tests on a single sample (12 run accessions) to figure out a good approach:
```bash
# making tmp directory for work
mkdir -p GSM3331071-tmp
# getting all the run accessions (SRR) stored in a variable
curr_run_accs=$(grep -w "GSM3331071" SRP158306-info-slimmed.tsv | cut -f 1 | tr "\n" " ")
echo ${curr_run_accs}
# SRR7694671 SRR7694672 SRR7694673 SRR7694674 SRR7694675 SRR7694676 SRR7694677 SRR7694678 SRR7694679 SRR7694680 SRR7694681 SRR7694682
# downloading all of their sra objects with prefetch
time prefetch --max-size 500G --progress -O GSM3331071-tmp/ ${curr_run_accs}
# that was 12 and took 7 minutes
# getting object locations in a variable
curr_obj_paths=$(echo $curr_run_accs | sed 's/ /.sra /g' | sed 's/$/.sra/' | sed 's/SRR/GSM3331071-tmp\/SRR/g')
# splitting into forward and reverse reads
time fasterq-dump --split-files --progress --threads 8 -O GSM3331071-tmp/ ${curr_obj_paths}
# took 90 seconds for these 12
# concatenating forward and reverse reads and gzipping
time cat GSM3331071-tmp/*_1.fastq | gzip > GSM3331071_R1_raw.fastq.gz
# real 23m30.603s
# user 22m41.688s
# sys 0m48.308s
time cat GSM3331071-tmp/*_2.fastq | gzip > GSM3331071_R2_raw.fastq.gz
# real 29m58.691s
# user 29m6.388s
# sys 0m46.200s
### so almost an hour to cat | gzip both
### trying pigz way
time pigz -p 8 GSM3331071-tmp/*.fastq
# real 10m23.171s
# user 81m19.204s
# sys 0m38.792s
time cat GSM3331071-tmp/*_1.fastq.gz > GSM3331071_R1_raw-pigz.fastq.gz
# real 0m4.344s
# user 0m0.012s
# sys 0m4.264s
time cat GSM3331071-tmp/*_2.fastq.gz > GSM3331071_R2_raw-pigz.fastq.gz
# real 0m5.973s
# user 0m0.028s
# sys 0m5.916s
### so like 10 minutes with the pigz way, let's look at the sizes
ls -lt GSM*.fastq.gz
# -rw-rw-r-- 1 mlee mlee 3694225587 Jan 19 19:54 GSM3331071_R2_raw-pigz.fastq.gz
# -rw-rw-r-- 1 mlee mlee 2899172500 Jan 19 19:54 GSM3331071_R1_raw-pigz.fastq.gz
# -rw-rw-r-- 1 mlee mlee 3689460183 Jan 19 19:39 GSM3331071_R2_raw.fastq.gz
# -rw-rw-r-- 1 mlee mlee 2896960207 Jan 19 19:09 GSM3331071_R1_raw.fastq.gz
### virtually identical, pigz it is!
# removing temp files and objects
rm -rf GSM3331071-tmp/
```
So for this dataset, that one sample with 12 separate runs took something like 90 minutes with the cat | gzip way, and about 20 minutes with the pigz -> cat way. If we extrapolate that to the 48 total samples, we're looking at maybe 16 hours to do them all in a linear fashion as this is currently written. We could parallelize it though; I'll write the script so it can be run that way.
And here's all that in a script for doing it on a single sample, so we can parallelize it with `xargs` if we want:
```bash
cat GLDS-427-dl-and-combine-runs-for-a-sample.sh
```
```bash=
#!/usr/bin/env bash
set -e
## takes one positional argument, the GSM unique sample ID in this case
## e.g: bash GLDS-427-dl-and-combine-runs-for-a-sample.sh GSM3331071
sample_ID=${1}
# making tmp directory for work
mkdir -p ${sample_ID}-tmp
# getting all the run accessions (SRR) stored in a variable
curr_run_accs=$(grep -w "${sample_ID}" SRP158306-info-slimmed.tsv | cut -f 1 | tr "\n" " ")
printf "\n\n Downloading run sra objects...\n\n"
# downloading all of their sra objects with prefetch
prefetch --max-size 500G --progress -O ${sample_ID}-tmp/ ${curr_run_accs}
# moving them all up a level to be easier to work with (if they were put in nested directories by prefetch)
mv ${sample_ID}-tmp/*/*.sra ${sample_ID}-tmp/ 2> /dev/null || true
# getting object locations in a variable
curr_obj_paths=$(echo $curr_run_accs | sed 's/ /.sra /g' | sed 's/$/.sra/' | sed "s/SRR/${sample_ID}-tmp\/SRR/g")
printf "\n\n Splitting sra objects into fastq files...\n\n"
# splitting into forward and reverse reads
fasterq-dump --split-files --progress --threads 8 -O ${sample_ID}-tmp/ ${curr_obj_paths}
printf "\n\n Compressing...\n\n"
# compressing all
pigz -p 8 ${sample_ID}-tmp/*.fastq
printf "\n\n Concatenating...\n\n"
# concatenating
cat ${sample_ID}-tmp/*_1.fastq.gz > ${sample_ID}_R1_raw.fastq.gz
cat ${sample_ID}-tmp/*_2.fastq.gz > ${sample_ID}_R2_raw.fastq.gz
# removing temp files and objects
rm -rf ${sample_ID}-tmp/
```
---
Here's what the process looks like:
- for each sample:
    - prefetch all sra objects for the given sample
    - split them into reads
    - cat all reads together and gzip them
    - remove sra objects and individual read files
Running one on an SRX accession to see what it does – maybe it grabs all SRRs under that SRX:
```bash
time fasterq-dump --split-files --progress --threads 8 SRX4553616
SRX4553616 is not a run accession. For more information, see https://www.ncbi.nlm.nih.gov/sra/?term=SRX4553616
Automatic expansion of container accessions is not currently available. See the above link(s) for information about the accessions.
## no such luck, maybe prefetch on an SRX?
prefetch --type fastq --max-size 100G --progress SRX4553616
SRX4553616 is not a run accession. For more information, see https://www.ncbi.nlm.nih.gov/sra/?term=SRX4553616
Automatic expansion of container accessions is not currently available. See the above link(s) for information about the accessions.
## nope
```
Running on one SRR:
```bash
time fasterq-dump --split-files --progress --threads 8 SRR7695235
join :|-------------------------------------------------- 100%
concat :|-------------------------------------------------- 100%
spots read : 4,388,472
reads read : 8,776,944
reads written : 8,776,944
real 5m31.933s
user 0m50.580s
sys 0m14.864s
# prefetch
time prefetch --max-size 100G --progress SRR7695235 -O ./
2022-01-20T01:47:26 prefetch.2.11.0: 1) Downloading 'SRR7695235'...
2022-01-20T01:47:26 prefetch.2.11.0: Downloading via HTTPS...
|-------------------------------------------------- 100%
2022-01-20T01:48:00 prefetch.2.11.0: HTTPS download succeed
2022-01-20T01:48:01 prefetch.2.11.0: 'SRR7695235' is valid
2022-01-20T01:48:01 prefetch.2.11.0: 1) 'SRR7695235' was downloaded successfully
2022-01-20T01:48:01 prefetch.2.11.0: 'SRR7695235' has 0 unresolved dependencies
real 0m35.910s
user 0m5.112s
sys 0m2.392s
# pulling out fastq
time fasterq-dump --split-files --progress --threads 8 --gz SRR7695235.sra
join :|-------------------------------------------------- 100%
concat :|-------------------------------------------------- 100%
spots read : 4,388,472
reads read : 8,776,944
reads written : 8,776,944
real 0m9.779s
user 0m42.632s
sys 0m11.168s
```
So much faster doing prefetch first, then splitting that.
Running on multiple and seeing if we can get it to cat them together for us:
```bash
time prefetch --max-size 100G --progress -O ./ SRR7695235 SRR7695236
2022-01-20T01:55:25 prefetch.2.11.0: 1) Downloading 'SRR7695235'...
2022-01-20T01:55:25 prefetch.2.11.0: Downloading via HTTPS...
|-------------------------------------------------- 100%
2022-01-20T01:56:00 prefetch.2.11.0: HTTPS download succeed
2022-01-20T01:56:01 prefetch.2.11.0: 'SRR7695235' is valid
2022-01-20T01:56:01 prefetch.2.11.0: 1) 'SRR7695235' was downloaded successfully
2022-01-20T01:56:01 prefetch.2.11.0: 'SRR7695235' has 0 unresolved dependencies
2022-01-20T01:56:02 prefetch.2.11.0: 2) Downloading 'SRR7695236'...
2022-01-20T01:56:02 prefetch.2.11.0: Downloading via HTTPS...
|-------------------------------------------------- 100%
2022-01-20T01:56:39 prefetch.2.11.0: HTTPS download succeed
2022-01-20T01:56:41 prefetch.2.11.0: 'SRR7695236' is valid
2022-01-20T01:56:41 prefetch.2.11.0: 2) 'SRR7695236' was downloaded successfully
2022-01-20T01:56:41 prefetch.2.11.0: 'SRR7695236' has 0 unresolved dependencies
real 1m16.476s
user 0m10.612s
sys 0m4.900s
```
That grabs them independently. Trying fasterq-dump on multiple local sra objects:
```bash
time fasterq-dump --split-files --progress --threads 8 SRR7695235.sra SRR7695236.sra
join :|-------------------------------------------------- 100%
concat :|-------------------------------------------------- 100%
spots read : 4,388,472
reads read : 8,776,944
reads written : 8,776,944
join :|-------------------------------------------------- 100%
concat :|-------------------------------------------------- 100%
spots read : 4,427,070
reads read : 8,854,140
reads written : 8,854,140
real 0m19.240s
user 1m25.100s
sys 0m20.948s
```
That did them separately – seeing if the concatenate option does what we want:
```bash
time fasterq-dump --split-files --progress --threads 8 --concatenate-reads SRR7695235.sra SRR7695236.sra
```
Can run fasterq-dump on multiple SRRs provided as a space-delimited list, I believe. So that may be the most straightforward way to go.
Testing `cat | gzip` vs gzip-then-cat:
```bash
time cat SRR7695235_1.fastq SRR7695235_2.fastq | gzip > t1.fastq.gz
real 4m53.752s
user 4m47.360s
sys 0m7.392s
```
```bash=
cat test.sh
```
```
gzip -c SRR7695235_1.fastq > p1.fastq.gz
gzip -c SRR7695235_2.fastq > p2.fastq.gz
cat p1.fastq.gz p2.fastq.gz > both.fastq.gz
```
```bash
time bash test.sh
real 5m28.951s
user 5m23.072s
sys 0m4.836s
```
Trying with pigz:
```bash=
cat test-pigz.sh
```
```
pigz SRR7695235_1.fastq SRR7695235_2.fastq
cat SRR7695235_1.fastq.gz SRR7695235_2.fastq.gz > pigz-both.fq.gz
```
```bash
time bash test-pigz.sh
real 0m40.200s
user 7m33.280s
sys 0m5.284s
```
Trying with gzip instead of pigz just to be sure it's due to pigz and not the other changes:
```bash
cat test-gzip.sh
```
```
gzip SRR7695235_1.fastq SRR7695235_2.fastq
cat SRR7695235_1.fastq.gz SRR7695235_2.fastq.gz > gzip-both.fq.gz
```
```bash=
time bash test-gzip.sh
real 5m14.677s
user 5m7.064s
sys 0m4.660s
```
```bash
pigz SRR7695235_1.fastq SRR7695235_2.fastq
```
Should try pigz on directory of files, then cat the gzipped ones together
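That's essentially what the per-sample script above ended up doing – here's a rough sketch of the idea, with `SAMPLE-tmp/` and the output names just as placeholders:
```bash
# compress everything in the sample's temp directory, then concatenate the gzipped files
# (gzip streams concatenated together still make a valid gzip file)
pigz -p 8 SAMPLE-tmp/*.fastq
cat SAMPLE-tmp/*_1.fastq.gz > SAMPLE_R1_raw.fastq.gz
cat SAMPLE-tmp/*_2.fastq.gz > SAMPLE_R2_raw.fastq.gz
```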