---
tags: GeneLab
title: Using genelab-utils to download GLDS data
---

# Using genelab-utils to download GLDS data

> This page demonstrates using programs in my [genelab-utils](https://github.com/AstrobioMike/genelab-utils#genelab-utils) package to programmatically download specific files from a specific GLDS.

[toc]

---

## UPDATE 10-Aug-2022

> The only program needed now is `GL-download-GLDS-data`, see the help menu there. Not sure when I'll get to updating this page, but it works basically the same way, except all in one program now. So the example below in the tl;dr section would just become this:
> ```bash
> GL-download-GLDS-data -g GLDS-104 -p tar.gz,epigenomics -j 12
> ```

---

## tl;dr

Everything is detailed below, but here is the short version of how to, for example, download the bisulfite-sequencing read files for [GLDS-104](https://genelab-data.ndc.nasa.gov/genelab/accession/GLDS-104/):

```bash
conda activate genelab-utils

GL-get-GLDS-files-info -g GLDS-104

GL-download-GLDS-data -i GLDS-104-files-and-links.tsv -p tar.gz,epigenomics -j 12
```

- There will be no print out while the parallel downloads are happening, but the files will be growing in size in the current directory if we check from another terminal. We will get a finished notification when it's done and our normal prompt returns.
- If downloading a lot of data, like the case in this example, I recommend running the download command in a fashion that wouldn't be interrupted if/when we disconnect: such as in a [screen](https://astrobiomike.github.io/unix/screen-intro) or through a job-scheduler.
- By default, the program requires us to enter `y` to confirm beginning the download. If running through a job-scheduler, or anyway wanting to do it non-interactively, add the `-f` flag to the command to avoid needing this confirmation.

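
To make the comma-separated `-p` list a bit more concrete: the help text describes it as keeping only files whose names contain every listed pattern, case-sensitively. Below is a minimal sketch of that idea using `awk` on a table with a header line and tab-separated `filename` and `url` columns. It assumes plain substring matching, which I have not verified against the program's own code. The `filter_table` function, the small demo table, the `GLDS-104_metadata.zip` entry, and the `example.org` URLs are all made up for illustration and are not part of genelab-utils.

```bash
# Hypothetical helper (not part of genelab-utils): print the filenames in a
# files-and-links table that contain ALL of the comma-separated patterns.
# Assumes case-sensitive, plain-substring matching.
filter_table() {
    awk -F '\t' -v pats="$2" '
        BEGIN { n = split(pats, p, ",") }        # split the pattern list on commas
        NR == 1 { next }                         # skip the header line
        {
            for (i = 1; i <= n; i++)
                if (index($1, p[i]) == 0) next   # a pattern is missing, skip this file
            print $1
        }
    ' "$1"
}

# Small made-up table in the same two-column layout, for demonstration only
demo_table="$(mktemp)"
printf 'filename\turl\n' > "$demo_table"
printf '%s\thttps://example.org/%s\n' \
    GLDS-104_transcriptomics_M23-SLS.tar.gz GLDS-104_transcriptomics_M23-SLS.tar.gz \
    GLDS-104_epigenomics_M23-SLS.tar.gz GLDS-104_epigenomics_M23-SLS.tar.gz \
    GLDS-104_metadata.zip GLDS-104_metadata.zip >> "$demo_table"

filter_table "$demo_table" tar.gz               # both .tar.gz files
filter_table "$demo_table" tar.gz,epigenomics   # only the epigenomics .tar.gz file
```

With a real table, the same function could be pointed at `GLDS-104-files-and-links.tsv`, though the program's own `--print-only` flag remains the authoritative way to see what would be downloaded.
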

---

## Installation if needed

This already exists on our GeneLab cluster, but if wanting to install it elsewhere, it can be done like so:

```bash
conda create -n genelab-utils -c conda-forge -c bioconda -c defaults \
             -c astrobiomike genelab-utils
```

---

## Activating conda environment

```bash
conda activate genelab-utils
```

---

## Downloading data

This is done in two steps:

1. `GL-get-GLDS-files-info`
    - this needs to be run first, specifying which GLDS we want
    - it builds a table of all available files and their download links
    - the table is also helpful for seeing what files are available, and for figuring out what patterns we want to provide to the actual download program
2. `GL-download-GLDS-data`
    - this takes the output table from the previous step, and this is also where we specify what patterns to look for in deciding which files to download

> For the below examples, we are pretending we want all the bisulfite sequencing read files for [GLDS-104](https://genelab-data.ndc.nasa.gov/genelab/accession/GLDS-104/). For this particular dataset, and these data, there are a ton of reads per sample, and each sample's reads are bundled together in files that end with ".tar.gz":

<a href="https://i.imgur.com/C3imfPN.png"><img src="https://i.imgur.com/C3imfPN.png"></a>

### Generating the table of all available files and links for a dataset

This utilizes `GL-get-GLDS-files-info`. Here's a look at its help menu:

```bash
GL-get-GLDS-files-info -h

# usage: GL-get-GLDS-files-info [-h] -g GLDS_ID
#
# This is a program for getting the files and links available for a given GLDS
# dataset. The output table can be given to the `GL-download-GLDS-data` program
# to download the files. For version info, run `GL-version`.
#
# options:
#   -h, --help            show this help message and exit
#
# required arguments:
#   -g GLDS_ID, --GLDS-ID GLDS_ID
#                         GLDS ID (e.g. "GLDS-276")
```

Generating the table:

```bash
GL-get-GLDS-files-info -g GLDS-104

# Retrieving GLDS-104 study data from:
# https://genelab-data.ndc.nasa.gov/genelab/data/study/data/GLDS-104
#
#
# Retrieving GLDS-104 files data from:
# https://genelab-data.ndc.nasa.gov/genelab/data/study/filelistings/5f2c4b5e9119fdcfdec2143d
#
#
# Files and urls written to GLDS-104-files-and-links.tsv
#
# Pass that file onto the `GL-download-GLDS-data` program for downloading the data :)
```

Peeking at that table:

```bash
column -ts $'\t' GLDS-104-files-and-links.tsv | head -n 5 | sed 's/^/# /'

# filename                                 url
# GLDS-104_transcriptomics_M23-SLS.tar.gz  https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-104_transcriptomics_M23-SLS.tar.gz?version=1
# GLDS-104_transcriptomics_M24-SLS.tar.gz  https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-104_transcriptomics_M24-SLS.tar.gz?version=1
# GLDS-104_transcriptomics_M25-SLS.tar.gz  https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-104_transcriptomics_M25-SLS.tar.gz?version=1
# GLDS-104_transcriptomics_M26-SLS.tar.gz  https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-104_transcriptomics_M26-SLS.tar.gz?version=1
```

### Downloading the files we want

This utilizes `GL-download-GLDS-data`. Here's a look at that help menu:

```bash
GL-download-GLDS-data -h

# usage: GL-download-GLDS-data [-h] -i INPUT_TABLE [-p PATTERN] [-j JOBS]
#                              [--print-only]
#
# This is a program for downloading GLDS data files using the output table
# generated by the `GL-get-GLDS-files-info` program. So see that first if you
# haven't yet. For version info, run `GL-version`.
#
# options:
#   -h, --help            show this help message and exit
#   -j JOBS, --jobs JOBS  Number of downloads to run in parallel (default: 10)
#   --print-only          Just print out the files that would be downloaded,
#                         rather than downloading them (useful to check we are
#                         getting what we want first)
#
# required arguments:
#   -i INPUT_TABLE, --input-table INPUT_TABLE
#                         This should be the tsv table produced by the `GL-get-
#                         GLDS-files-info` program
#   -p PATTERN, --pattern PATTERN
#                         If we do not want to download all files (which we
#                         often won't), we can specify a pattern here to subset
#                         the total files. For example, if we know we want to
#                         download just the fastq.gz files, we can say '-p
#                         fastq.gz'. We can also provide multiple patterns as a
#                         comma-separated list. For example, If we want to
#                         download the fastq.gz files that also have 'NxtaFlex',
#                         'metagenomics', and 'raw' in their filenames, we can
#                         provide '-p fastq.gz,NxtaFlex,metagenomics,raw'.
#                         Looking through the table produced by the `GL-get-
#                         GLDS-files-info` program can help figure this out if
#                         needed. (Note that this is case-sensitive.)
```

We need to give it the table we made in the above step (passed to the `-i` argument), and we need to give it a pattern in the filenames to specify the files we want.

Since downloading can take a while, it is good to add the `--print-only` flag while we are figuring out and specifying the files we want to download. This won't run the download, but will show all the files that would be downloaded if we ran it without that flag.

From looking at the GLDS page, in this case we know the files we want end in ".tar.gz", so here is what specifying that pattern looks like:

```bash
GL-download-GLDS-data -i GLDS-104-files-and-links.tsv -p tar.gz --print-only

#
# The input table holds a total of 276 files.
#
#
# 24 files were found matching the provided pattern(s).
#
#
# As requested, here are the 24 files that would be downloaded by this command
# if run without the '--print-only' flag:
#
# GLDS-104_transcriptomics_M23-SLS.tar.gz
# GLDS-104_transcriptomics_M24-SLS.tar.gz
# GLDS-104_transcriptomics_M25-SLS.tar.gz
# GLDS-104_transcriptomics_M26-SLS.tar.gz
# GLDS-104_transcriptomics_M27-SLS.tar.gz
# GLDS-104_transcriptomics_M28-SLS.tar.gz
# GLDS-104_transcriptomics_M33-SLS.tar.gz
# GLDS-104_transcriptomics_M34-SLS.tar.gz
# GLDS-104_transcriptomics_M35-SLS.tar.gz
# GLDS-104_transcriptomics_M36-SLS.tar.gz
# GLDS-104_transcriptomics_M37-SLS.tar.gz
# GLDS-104_transcriptomics_M38-SLS.tar.gz
# GLDS-104_epigenomics_M23-SLS.tar.gz
# GLDS-104_epigenomics_M24-SLS.tar.gz
# GLDS-104_epigenomics_M25-SLS.tar.gz
# GLDS-104_epigenomics_M26-SLS.tar.gz
# GLDS-104_epigenomics_M27-SLS.tar.gz
# GLDS-104_epigenomics_M28-SLS.tar.gz
# GLDS-104_epigenomics_M33-SLS.tar.gz
# GLDS-104_epigenomics_M34-SLS.tar.gz
# GLDS-104_epigenomics_M35-SLS.tar.gz
# GLDS-104_epigenomics_M36-SLS.tar.gz
# GLDS-104_epigenomics_M37-SLS.tar.gz
# GLDS-104_epigenomics_M38-SLS.tar.gz
#
```

That tells us it found 24 files matching this pattern, and we can see from what was printed that this includes RNA-Seq files we don't want, like this one: `GLDS-104_transcriptomics_M23-SLS.tar.gz`

The pattern (`-p`) argument can accept a comma-delimited list, so we can add another pattern to filter further. Here we will add in "epigenomics" too:

```bash
GL-download-GLDS-data -i GLDS-104-files-and-links.tsv -p tar.gz,epigenomics --print-only

#
# The input table holds a total of 276 files.
#
#
# 12 files were found matching the provided pattern(s).
#
#
# As requested, here are the 12 files that would be downloaded by this command
# if run without the '--print-only' flag:
#
# GLDS-104_epigenomics_M23-SLS.tar.gz
# GLDS-104_epigenomics_M24-SLS.tar.gz
# GLDS-104_epigenomics_M25-SLS.tar.gz
# GLDS-104_epigenomics_M26-SLS.tar.gz
# GLDS-104_epigenomics_M27-SLS.tar.gz
# GLDS-104_epigenomics_M28-SLS.tar.gz
# GLDS-104_epigenomics_M33-SLS.tar.gz
# GLDS-104_epigenomics_M34-SLS.tar.gz
# GLDS-104_epigenomics_M35-SLS.tar.gz
# GLDS-104_epigenomics_M36-SLS.tar.gz
# GLDS-104_epigenomics_M37-SLS.tar.gz
# GLDS-104_epigenomics_M38-SLS.tar.gz
#
```

And that finds solely the 12 files we want, so we could then run the download by just removing the `--print-only` flag from the previous command. Since these files are all very large, in this case I would also change the default of 10 parallel downloads to 12 (to do all samples at once) by setting the `-j` parameter. So I'd run this download like this:

```bash
GL-download-GLDS-data -i GLDS-104-files-and-links.tsv -p tar.gz,epigenomics -j 12
```

- There will be no print out while the parallel downloads are happening, but the files will be growing in size in the current directory if we check from another terminal. We will get a finished notification when it's done and our normal prompt returns.
- If downloading a lot of data, like the case in this example, I recommend running the download command in a fashion that wouldn't be interrupted if/when we disconnect: such as in a [screen](https://astrobiomike.github.io/unix/screen-intro) or through a job-scheduler.
- By default, the program requires us to enter `y` to confirm beginning the download. If running through a job-scheduler, or anyway wanting to do it non-interactively, add the `-f` flag to the command to avoid needing this confirmation.

---
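
Since nothing is printed while the downloads run, it can be handy to double-check afterwards that every expected file actually landed. The sketch below is one possible way to do that with standard shell tools. `report_missing` is a hypothetical helper, not something provided by genelab-utils, and the temporary directory and its contents are made up for the demonstration. It only flags files that are absent or zero bytes in size, so a partially downloaded file would still pass.

```bash
# Hypothetical helper (not part of genelab-utils): read expected filenames on
# stdin (one per line) and print those that are missing or empty in a directory.
report_missing() {
    dir="$1"
    while IFS= read -r name; do
        # -s is true only when the file exists and is larger than zero bytes
        [ -s "$dir/$name" ] || printf '%s\n' "$name"
    done
}

# Made-up download directory for demonstration: one file with content,
# one empty file, and one file that is absent altogether
demo_dir="$(mktemp -d)"
printf 'data' > "$demo_dir/GLDS-104_epigenomics_M23-SLS.tar.gz"
: > "$demo_dir/GLDS-104_epigenomics_M24-SLS.tar.gz"

printf '%s\n' \
    GLDS-104_epigenomics_M23-SLS.tar.gz \
    GLDS-104_epigenomics_M24-SLS.tar.gz \
    GLDS-104_epigenomics_M25-SLS.tar.gz | report_missing "$demo_dir"
```

In the demonstration this reports the M24 and M25 files. With real data, the filename list could come from the first column of `GLDS-104-files-and-links.tsv`, narrowed with `grep -F` to the same patterns given to `-p`.
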