---
tags: GeneLab
title: Example downloading from SRA for GLDS-427
---

# Example downloading from SRA for GLDS-427
[toc]

---

## General process
- to download fastq files from SRA, we need what are called "run" accessions; these are the ones that start with "SRR", e.g.: SRR7695235
- we can use the helpful program [`pysradb`](https://github.com/saketkc/pysradb) to search for run accessions based on whatever identifiers we have
- once we have run accessions, we can download them using [`sra-tools`](https://github.com/ncbi/sra-tools), first with `prefetch`, then `fasterq-dump`

---

## Conda env

```bash
# this is already set up on N288
# conda create -n sra-tools -c conda-forge -c bioconda -c defaults pysradb=1.2.0 sra-tools=2.11.0 python=3.7 pigz

conda activate sra-tools
```

---

## Example getting data for GLDS-427

From this [jira page](https://genelab-tools.arc.nasa.gov/jira/browse/GENELABDAT-602), [GSE118502](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118502) was indicated.

### Getting SRA project ID from the GSE ID we have

```bash
pysradb gse-to-srp GSE118502
```

```bash
# study_alias	study_accession
# GSE118502	SRP158306
```

The "study_accession" is the SRA project ID that we can use to get all run accessions for that project.

### Getting info based on the SRA project ID

```bash
pysradb metadata --detailed SRP158306 --saveto SRP158306-info.tsv
```

And slimming that down to fewer columns:

```bash
cut -f 1,2,3,22,24 SRP158306-info.tsv > SRP158306-info-slimmed.tsv
```

This table looks like this:

```bash
head -n 20 SRP158306-info-slimmed.tsv | sed 's/^/# /' | column -ts $'\t'
```

```
# run_accession  study_accession  experiment_accession  run_alias       experiment_alias
# SRR7695235     SRP158306        SRX4553616            GSM3331118_r1   GSM3331118
# SRR7695236     SRP158306        SRX4553616            GSM3331118_r10  GSM3331118
# SRR7695237     SRP158306        SRX4553616            GSM3331118_r11  GSM3331118
# SRR7695238     SRP158306        SRX4553616            GSM3331118_r12  GSM3331118
# SRR7695239     SRP158306        SRX4553616            GSM3331118_r2   GSM3331118
# SRR7695240     SRP158306        SRX4553616            GSM3331118_r3   GSM3331118
# SRR7695241     SRP158306        SRX4553616            GSM3331118_r4   GSM3331118
# SRR7695242     SRP158306        SRX4553616            GSM3331118_r5   GSM3331118
# SRR7695243     SRP158306        SRX4553616            GSM3331118_r6   GSM3331118
# SRR7695244     SRP158306        SRX4553616            GSM3331118_r7   GSM3331118
# SRR7695245     SRP158306        SRX4553616            GSM3331118_r8   GSM3331118
# SRR7695246     SRP158306        SRX4553616            GSM3331118_r9   GSM3331118
# SRR7695223     SRP158306        SRX4553615            GSM3331117_r1   GSM3331117
# SRR7695224     SRP158306        SRX4553615            GSM3331117_r10  GSM3331117
# SRR7695225     SRP158306        SRX4553615            GSM3331117_r11  GSM3331117
# SRR7695226     SRP158306        SRX4553615            GSM3331117_r12  GSM3331117
# SRR7695227     SRP158306        SRX4553615            GSM3331117_r2   GSM3331117
# SRR7695228     SRP158306        SRX4553615            GSM3331117_r3   GSM3331117
# SRR7695229     SRP158306        SRX4553615            GSM3331117_r4   GSM3331117
```

In this case, multiple "runs" (first-column IDs) belong to a single sample. These are delineated by the "experiment_accession" column, and the "experiment_alias" column matches up with the sample names on the [GSE118502](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118502) page linked above. About midway down that page, it says "Samples (48)", and if we count the number of unique entries in the "experiment_alias" column, we also get 48:

```bash
tail -n +2 SRP158306-info-slimmed.tsv | cut -f 5 | sort -u | wc -l
# 48
```
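If it's useful to also see how the runs are distributed across samples before building the download logic around that, a quick tally on the same slimmed table works (just a sketch, not part of the original workflow):

```bash
# counting how many runs belong to each unique sample ID
tail -n +2 SRP158306-info-slimmed.tsv | cut -f 5 | sort | uniq -c | sort -rn | head
# for this dataset, each sample should show 12 runs (576 runs spread over 48 samples)
```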
### Getting fastq data for a single SRR

The process for downloading is quickest when done in two steps:

1. Using the `prefetch` command to download the SRA object
2. Splitting the fastq files out of that object with `fasterq-dump`

#### Downloading SRA object

Here is that process with one of the SRR accessions from above, with some general options set, and the accession we want provided as a positional argument at the end:

```bash
prefetch --max-size 500G --progress -O SRR7695235-tmp/ SRR7695235
```

```bash
# 2022-01-21T19:58:28 prefetch.2.11.0: 1) Downloading 'SRR7695235'...
# 2022-01-21T19:58:28 prefetch.2.11.0:  Downloading via HTTPS...
# |-------------------------------------------------- 100%
# 2022-01-21T19:58:49 prefetch.2.11.0: HTTPS download succeed
# 2022-01-21T19:58:53 prefetch.2.11.0: 'SRR7695235' is valid
# 2022-01-21T19:58:53 prefetch.2.11.0: 1) 'SRR7695235' was downloaded successfully
# 2022-01-21T19:58:53 prefetch.2.11.0: 'SRR7695235' has 0 unresolved dependencies
```

```bash
ls SRR7695235-tmp/SRR7695235/
# SRR7695235.sra
```

#### Pulling out fastq files

Here we are pulling the fastq files out of the SRA object we downloaded, which is provided as a positional argument at the end:

```bash
fasterq-dump --progress SRR7695235-tmp/SRR7695235/SRR7695235.sra
```

```bash
# join   :|-------------------------------------------------- 100%
# concat :|-------------------------------------------------- 100%
# spots read      : 4,388,472
# reads read      : 8,776,944
# reads written   : 8,776,944
```

The fastq files come out uncompressed:

```bash
ls *.fastq
# SRR7695235_1.fastq  SRR7695235_2.fastq
```

We can delete the SRA object once we are done pulling the reads from it:

```bash
rm -rf SRR7695235-tmp/
```
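If this one run were all we needed, we could compress its read files right away with `pigz` (already in the conda env set up above); a minimal sketch, since in the full workflow below compression happens per sample instead:

```bash
# compressing both read files in place with 8 threads, producing .fastq.gz files
pigz -p 8 SRR7695235_1.fastq SRR7695235_2.fastq
```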
### Doing it for all of SRP158306, including combining multiple "runs" that belong to the same sample

That example above was just downloading the fastq data for a single SRR. But as noted above, in the case of GLDS-427 multiple SRRs belong to a single sample. So we want to download all the SRRs for a given sample, then gzip them and cat them together so we have one forward and one reverse read file for each sample. **Exactly how this is done will depend on what the target dataset and its stored info look like; this example is just how it was done for GLDS-427.**

To do that for this dataset, a script was written that handles one sample at a time: pulling all runs for that sample, compressing, combining, and renaming so the outputs are in typical GeneLab format (ending in "_R?_raw.fastq.gz"). That script was then run on each sample individually. This can be done linearly with a for loop, or in parallel with `xargs`, as shown below.

First we want a list of all unique sample IDs as they are in our info table:

```bash
cut -f 5 SRP158306-info-slimmed.tsv | tail -n +2 | sort -u > unique-sample-IDs.txt

wc -l unique-sample-IDs.txt
# 48 unique-sample-IDs.txt
```

#### Script for doing each individual sample (multiple runs) for GLDS-427

```bash
cat GLDS-427-dl-and-combine-runs-for-a-sample.sh
```

```bash=
#!/usr/bin/env bash
set -e

## takes one positional argument, the GSM unique sample ID in this case
## e.g: bash GLDS-427-dl-and-combine-runs-for-a-sample.sh GSM3331071

sample_ID=${1}

# making tmp directory for work
mkdir -p ${sample_ID}-tmp

# getting all the run accessions (SRR) stored in a variable
curr_run_accs=$(grep -w "${sample_ID}" SRP158306-info-slimmed.tsv | cut -f 1 | tr "\n" " ")

printf "\n\n    Downloading run sra objects...\n\n"

# downloading all of their sra objects with prefetch
prefetch --max-size 500G --progress -O ${sample_ID}-tmp/ ${curr_run_accs}

# moving them all up a level to be easier to work with (if they were put in nested directories by prefetch)
mv ${sample_ID}-tmp/*/*.sra ${sample_ID}-tmp/ 2> /dev/null || true

# getting object locations in a variable
curr_obj_paths=$(echo $curr_run_accs | sed 's/ /.sra /g' | sed 's/$/.sra/' | sed "s/SRR/${sample_ID}-tmp\/SRR/g")

printf "\n\n    Splitting sra objects into fastq files...\n\n"

# splitting into forward and reverse reads
fasterq-dump --progress -O ${sample_ID}-tmp/ ${curr_obj_paths}

printf "\n\n    Compressing...\n\n"

# compressing all
pigz -p 8 ${sample_ID}-tmp/*.fastq

printf "\n\n    Concatenating...\n\n"

# concatenating
cat ${sample_ID}-tmp/*_1.fastq.gz > ${sample_ID}_R1_raw.fastq.gz
cat ${sample_ID}-tmp/*_2.fastq.gz > ${sample_ID}_R2_raw.fastq.gz

# removing temp files and objects
rm -rf ${sample_ID}-tmp/
```

#### Example of how the script could be run linearly in a for loop

```bash
for sample in $(cat unique-sample-IDs.txt)
do
    printf "\n    Doing sample: ${sample}\n\n"
    bash GLDS-427-dl-and-combine-runs-for-a-sample.sh ${sample}
done
```

#### Example of how the script could be run in parallel using `xargs`

```bash
cat unique-sample-IDs.txt | xargs -n 1 -P 4 bash GLDS-427-dl-and-combine-runs-for-a-sample.sh
```

- `-n 1` tells `xargs` to read only one line at a time from the standard input it is getting from our `cat` command, so each command gets one sample ID. The ID it gets is placed as a positional argument at the end of the command we provided
- `-P 4` tells `xargs` to run at most 4 processes at a time, so this should run up to roughly 4 times faster than the linear for loop above (assuming network and disk aren't the bottleneck)

It took about 20 hours running the parallel way with `xargs` here on this dataset, which had 576 runs spread over 48 samples. Compressed file sizes at the end were roughly in the 3-4 GB range.
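After everything finishes, a quick integrity check can be worthwhile before calling it done. This wasn't part of the original workflow, just a suggested sketch:

```bash
# we expect 96 files in total (48 samples x forward/reverse)
ls *_raw.fastq.gz | wc -l

# gzip -t test-reads each file and flags any that are corrupt or truncated
for file in *_raw.fastq.gz
do
    gzip -t ${file} || echo "problem with: ${file}"
done
```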
---

# Random notes from GL figuring

## From bioproject PRJNA486575

### getting run accessions associated with the bioproject

```bash
# (didn't get further with this route; went through the GSE below instead)
prefetch
```

## From GSE GSE118502

```bash
pysradb gse-to-srp GSE118502

# study_alias	study_accession
# GSE118502	SRP158306
```

```bash
pysradb metadata --detailed SRP158306 --saveto SRP158306-info.tsv
```

Slimming it down to something smaller:

```bash
cut -f 1,2,3,22,24 SRP158306-info.tsv > SRP158306-info-slimmed.tsv

head SRP158306-info-slimmed.tsv | column -ts $'\t' | sed 's/^/# /'
# run_accession  study_accession  experiment_accession  run_alias       experiment_alias
# SRR7695235     SRP158306        SRX4553616            GSM3331118_r1   GSM3331118
# SRR7695236     SRP158306        SRX4553616            GSM3331118_r10  GSM3331118
# SRR7695237     SRP158306        SRX4553616            GSM3331118_r11  GSM3331118
# SRR7695238     SRP158306        SRX4553616            GSM3331118_r12  GSM3331118
# SRR7695239     SRP158306        SRX4553616            GSM3331118_r2   GSM3331118
# SRR7695240     SRP158306        SRX4553616            GSM3331118_r3   GSM3331118
# SRR7695241     SRP158306        SRX4553616            GSM3331118_r4   GSM3331118
# SRR7695242     SRP158306        SRX4553616            GSM3331118_r5   GSM3331118
# SRR7695243     SRP158306        SRX4553616            GSM3331118_r6   GSM3331118
```

Getting a list of all unique "experiment_alias" IDs, as that's what we have linked on the jira page:

```bash
cut -f 5 SRP158306-info-slimmed.tsv | tail -n +2 | sort -u > unique-sample-IDs.txt

wc -l unique-sample-IDs.txt
# 48

head unique-sample-IDs.txt | sed 's/^/# /'
# GSM3331071
# GSM3331072
# GSM3331073
# GSM3331074
# GSM3331075
# GSM3331076
# GSM3331077
# GSM3331078
# GSM3331079
# GSM3331080
```

Doing some time tests on a single sample's run accessions to figure out the quickest approach:

```bash
# making tmp directory for work
mkdir -p GSM3331071-tmp

# getting all the run accessions (SRR) stored in a variable
curr_run_accs=$(grep -w "GSM3331071" SRP158306-info-slimmed.tsv | cut -f 1 | tr "\n" " ")

echo ${curr_run_accs}
# SRR7694671 SRR7694672 SRR7694673 SRR7694674 SRR7694675 SRR7694676 SRR7694677 SRR7694678 SRR7694679 SRR7694680 SRR7694681 SRR7694682

# downloading all of their sra objects with prefetch
time prefetch --max-size 500G --progress -O GSM3331071-tmp/ ${curr_run_accs}
# that was 12 runs and took about 7 minutes

# getting object locations in a variable
curr_obj_paths=$(echo $curr_run_accs | sed 's/ /.sra /g' | sed 's/$/.sra/' | sed 's/SRR/GSM3331071-tmp\/SRR/g')

# splitting into forward and reverse reads
time fasterq-dump --split-spot --split-files --progress --threads 8 -O GSM3331071-tmp/ ${curr_obj_paths}
# took 90 seconds for these 12

# concatenating forward and reverse reads and gzipping
time cat GSM3331071-tmp/*_1.fastq | gzip > GSM3331071_R1_raw.fastq.gz
# real    23m30.603s
# user    22m41.688s
# sys     0m48.308s

time cat GSM3331071-tmp/*_2.fastq | gzip > GSM3331071_R2_raw.fastq.gz
# real    29m58.691s
# user    29m6.388s
# sys     0m46.200s

### so almost an hour to cat | gzip both

### trying pigz way
time pigz -p 8 GSM3331071-tmp/*.fastq
# real    10m23.171s
# user    81m19.204s
# sys     0m38.792s

time cat GSM3331071-tmp/*_1.fastq.gz > GSM3331071_R1_raw-pigz.fastq.gz
# real    0m4.344s
# user    0m0.012s
# sys     0m4.264s

time cat GSM3331071-tmp/*_2.fastq.gz > GSM3331071_R2_raw-pigz.fastq.gz
# real    0m5.973s
# user    0m0.028s
# sys     0m5.916s

### so like 10 minutes with the pigz way, let's look at the sizes

ls -lt GSM*.fastq.gz
# -rw-rw-r-- 1 mlee mlee 3694225587 Jan 19 19:54 GSM3331071_R2_raw-pigz.fastq.gz
# -rw-rw-r-- 1 mlee mlee 2899172500 Jan 19 19:54 GSM3331071_R1_raw-pigz.fastq.gz
# -rw-rw-r-- 1 mlee mlee 3689460183 Jan 19 19:39 GSM3331071_R2_raw.fastq.gz
# -rw-rw-r-- 1 mlee mlee 2896960207 Jan 19 19:09 GSM3331071_R1_raw.fastq.gz

### virtually identical, pigz it is!
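### side note (added here, not from the original run): cat'ing the .gz files together
### works because the gzip format allows multiple compressed members to sit
### back-to-back in one file, and zcat/gunzip stream through all of them, e.g.:
# zcat GSM3331071_R1_raw-pigz.fastq.gz | tail -n 4    # reads from the last member come through fine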
# removing temp files and objects
rm -rf GSM3331071-tmp/
```

So for this dataset, that one sample with 12 separate runs took like 90 minutes the cat | gzip way, and about 20 minutes the pigz-then-cat way. If we extrapolate that to the 48 total samples, we're looking at maybe 16 hours to do them all in a linear fashion as this is currently written. We could parallelize it though; I'll write the script so it can be run that way.

And here's all that in a script for doing it on a single sample, so we can parallelize it with `xargs` if we want:

```bash
cat GLDS-427-dl-and-combine-runs-for-a-sample.sh
```

```bash=
#!/usr/bin/env bash
set -e

## takes one positional argument, the GSM unique sample ID in this case
## e.g: bash GLDS-427-dl-and-combine-runs-for-a-sample.sh GSM3331071

sample_ID=${1}

# making tmp directory for work
mkdir -p ${sample_ID}-tmp

# getting all the run accessions (SRR) stored in a variable
curr_run_accs=$(grep -w "${sample_ID}" SRP158306-info-slimmed.tsv | cut -f 1 | tr "\n" " ")

printf "\n\n    Downloading run sra objects...\n\n"

# downloading all of their sra objects with prefetch
prefetch --max-size 500G --progress -O ${sample_ID}-tmp/ ${curr_run_accs}

# moving them all up a level to be easier to work with (if they were put in nested directories by prefetch)
mv ${sample_ID}-tmp/*/*.sra ${sample_ID}-tmp/ 2> /dev/null || true

# getting object locations in a variable
curr_obj_paths=$(echo $curr_run_accs | sed 's/ /.sra /g' | sed 's/$/.sra/' | sed "s/SRR/${sample_ID}-tmp\/SRR/g")

printf "\n\n    Splitting sra objects into fastq files...\n\n"

# splitting into forward and reverse reads
fasterq-dump --split-spot --split-files --progress --threads 8 -O ${sample_ID}-tmp/ ${curr_obj_paths}

printf "\n\n    Compressing...\n\n"

# compressing all
pigz -p 8 ${sample_ID}-tmp/*.fastq

printf "\n\n    Concatenating...\n\n"

# concatenating
cat ${sample_ID}-tmp/*_1.fastq.gz > ${sample_ID}_R1_raw.fastq.gz
cat ${sample_ID}-tmp/*_2.fastq.gz > ${sample_ID}_R2_raw.fastq.gz

# removing temp files and objects
rm -rf ${sample_ID}-tmp/
```

---

Here's what the process looks like:

- for each sample:
    - prefetch all sra objects for the given sample
    - split them into reads
    - cat all reads together and gzip them
    - remove sra objects and individual read files

Running one on an SRX accession to see what it does; maybe it grabs all SRRs under that SRX:

```bash
time fasterq-dump --split-spot --split-files --progress --threads 8 SRX4553616

# SRX4553616 is not a run accession. For more information, see https://www.ncbi.nlm.nih.gov/sra/?term=SRX4553616
# Automatic expansion of container accessions is not currently available. See the above link(s) for information about the accessions.

## no such luck, maybe prefetch on an SRX?
prefetch --type fastq --max-size 100G --progress SRX4553616

# SRX4553616 is not a run accession. For more information, see https://www.ncbi.nlm.nih.gov/sra/?term=SRX4553616
# Automatic expansion of container accessions is not currently available. See the above link(s) for information about the accessions.

## nope
```
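Since the tools won't expand container accessions for us, one way to get the runs under a given SRX is to pull them out of the metadata table we already made. This lookup is an added sketch rather than something from the original notes (pysradb also has conversion commands along the lines of `srx-to-srr` that should do the same kind of mapping):

```bash
# printing the run accessions (column 1) whose experiment_accession (column 3) matches
awk -F'\t' ' $3 == "SRX4553616" { print $1 } ' SRP158306-info-slimmed.tsv
```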
Running on one SRR:

```bash
time fasterq-dump --split-spot --split-files --progress --threads 8 SRR7695235

# join   :|-------------------------------------------------- 100%
# concat :|-------------------------------------------------- 100%
# spots read      : 4,388,472
# reads read      : 8,776,944
# reads written   : 8,776,944

# real    5m31.933s
# user    0m50.580s
# sys     0m14.864s

# prefetch
time prefetch --max-size 100G --progress SRR7695235 -O ./

# 2022-01-20T01:47:26 prefetch.2.11.0: 1) Downloading 'SRR7695235'...
# 2022-01-20T01:47:26 prefetch.2.11.0:  Downloading via HTTPS...
# |-------------------------------------------------- 100%
# 2022-01-20T01:48:00 prefetch.2.11.0: HTTPS download succeed
# 2022-01-20T01:48:01 prefetch.2.11.0: 'SRR7695235' is valid
# 2022-01-20T01:48:01 prefetch.2.11.0: 1) 'SRR7695235' was downloaded successfully
# 2022-01-20T01:48:01 prefetch.2.11.0: 'SRR7695235' has 0 unresolved dependencies

# real    0m35.910s
# user    0m5.112s
# sys     0m2.392s

# pulling out fastq
time fasterq-dump --split-spot --split-files --progress --threads 8 --gz SRR7695235.sra

# join   :|-------------------------------------------------- 100%
# concat :|-------------------------------------------------- 100%
# spots read      : 4,388,472
# reads read      : 8,776,944
# reads written   : 8,776,944

# real    0m9.779s
# user    0m42.632s
# sys     0m11.168s
```

So much faster doing prefetch first, then splitting that.

Running on multiple and seeing if we can get it to cat them together for us:

```bash
time prefetch --max-size 100G --progress -O ./ SRR7695235 SRR7695236

# 2022-01-20T01:55:25 prefetch.2.11.0: 1) Downloading 'SRR7695235'...
# 2022-01-20T01:55:25 prefetch.2.11.0:  Downloading via HTTPS...
# |-------------------------------------------------- 100%
# 2022-01-20T01:56:00 prefetch.2.11.0: HTTPS download succeed
# 2022-01-20T01:56:01 prefetch.2.11.0: 'SRR7695235' is valid
# 2022-01-20T01:56:01 prefetch.2.11.0: 1) 'SRR7695235' was downloaded successfully
# 2022-01-20T01:56:01 prefetch.2.11.0: 'SRR7695235' has 0 unresolved dependencies
# 2022-01-20T01:56:02 prefetch.2.11.0: 2) Downloading 'SRR7695236'...
# 2022-01-20T01:56:02 prefetch.2.11.0:  Downloading via HTTPS...
# |-------------------------------------------------- 100%
# 2022-01-20T01:56:39 prefetch.2.11.0: HTTPS download succeed
# 2022-01-20T01:56:41 prefetch.2.11.0: 'SRR7695236' is valid
# 2022-01-20T01:56:41 prefetch.2.11.0: 2) 'SRR7695236' was downloaded successfully
# 2022-01-20T01:56:41 prefetch.2.11.0: 'SRR7695236' has 0 unresolved dependencies

# real    1m16.476s
# user    0m10.612s
# sys     0m4.900s
```

That grabs them independently. Trying fasterq-dump on multiple local sra objects:

```bash
time fasterq-dump --split-spot --split-files --progress --threads 8 SRR7695235.sra SRR7695236.sra

# join   :|-------------------------------------------------- 100%
# concat :|-------------------------------------------------- 100%
# spots read      : 4,388,472
# reads read      : 8,776,944
# reads written   : 8,776,944
# join   :|-------------------------------------------------- 100%
# concat :|-------------------------------------------------- 100%
# spots read      : 4,427,070
# reads read      : 8,854,140
# reads written   : 8,854,140

# real    0m19.240s
# user    1m25.100s
# sys     0m20.948s
```

That did them separately; seeing if the concatenate option does what we want:

```bash
time fasterq-dump --split-spot --split-files --progress --threads 8 --concatenate-reads SRR7695235.sra SRR7695236.sra
```

Can run fasterq-dump on multiple SRRs provided as a space-delimited list, I believe. So that may be the most straightforward way to go.
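One more added sketch (not from the original notes): the counts fasterq-dump reports can be double-checked against the fastq files themselves, since fastq uses 4 lines per read:

```bash
# SRR7695235 reported 4,388,472 spots read, so each split file should hold that many reads
echo $(( $(wc -l < SRR7695235_1.fastq) / 4 ))
# 4388472
```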
## testing cat | gzip vs gzip then cat

```bash
time cat SRR7695235_1.fastq SRR7695235_2.fastq | gzip > t1.fastq.gz

# real    4m53.752s
# user    4m47.360s
# sys     0m7.392s
```

```bash
cat test.sh
```

```
# using -c so gzip writes to stdout (without it, gzip compresses in place and the redirect targets would be empty)
gzip -c SRR7695235_1.fastq > p1.fastq.gz
gzip -c SRR7695235_2.fastq > p2.fastq.gz
cat p1.fastq.gz p2.fastq.gz > both.fastq.gz
```

```bash
time bash test.sh

# real    5m28.951s
# user    5m23.072s
# sys     0m4.836s
```

Trying with pigz:

```bash
cat test-pigz.sh
```

```
pigz SRR7695235_1.fastq SRR7695235_2.fastq
cat SRR7695235_1.fastq.gz SRR7695235_2.fastq.gz > pigz-both.fq.gz
```

```bash
time bash test-pigz.sh

# real    0m40.200s
# user    7m33.280s
# sys     0m5.284s
```

Trying with gzip instead of pigz, just to be sure the speedup is due to pigz and not the other changes:

```bash
cat test-gzip.sh
```

```
gzip SRR7695235_1.fastq SRR7695235_2.fastq
cat SRR7695235_1.fastq.gz SRR7695235_2.fastq.gz > gzip-both.fq.gz
```

```bash
time bash test-gzip.sh

# real    5m14.677s
# user    5m7.064s
# sys     0m4.660s
```
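A final check that could be added here (not in the original notes): the cat | gzip result and the pigz-then-cat result should hold identical data even though the .gz files differ at the byte level (different gzip headers and member boundaries), so comparing the decompressed streams is the way to verify:

```bash
# both should decompress to the same reads in the same order (_1 then _2)
cmp <(zcat t1.fastq.gz) <(zcat pigz-both.fq.gz) && echo "contents match"
```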