---
tags: GeneLab
title: Using genelab-utils to download GLDS data
---

# Using `genelab-utils` to download GLDS data

> This page demonstrates using programs in the [genelab-utils](https://github.com/AstrobioMike/genelab-utils#genelab-utils) package to programmatically download specific files from a specific OSD or GLDS ID.
>
> Contact Mike.Lee@nasa.gov if having trouble.

[toc]

---

## tl;dr example usage

```bash
conda activate genelab-utils

# get raw fastq files from OSD-170
GL-download-GLDS-data -g OSD-170 -p raw.fastq.gz

# get a file that holds all files and their links for OSD-170
GL-download-GLDS-data -g OSD-170 --just-get-file-info-table
```

> **NOTE**
> The "OSD-170-file-info.tsv" file produced by the above holds URLs to all files in the dataset.

---

## Installing `genelab-utils` if needed

The genelab-utils package should be installed with conda/mamba. If you are not familiar with conda, there is an introduction [here](https://astrobiomike.github.io/unix/conda-intro), and if you are not familiar with mamba, there is a super-short introduction on that same page [here](https://astrobiomike.github.io/unix/conda-intro#bonus-mamba-no-5) – it's definitely worth using mamba if you use conda at all :+1:

```bash
# if needing mamba
conda install -c conda-forge -n base mamba

mamba create -n genelab-utils -c conda-forge -c bioconda -c defaults \
             -c astrobiomike 'genelab-utils>=1.3'
```

---

## Activating conda environment

```bash
conda activate genelab-utils
```

---

## Downloading data

**If we wanted the raw fastq files from OSD-170, for example.**

First, here we add the `--print-only` flag to list the files that would be downloaded:

```bash
GL-download-GLDS-data -g OSD-170 -p raw.fastq.gz --print-only

#
# Attempting to retrieve 'OSD-170' file data from:
# https://genelab-data.ndc.nasa.gov/genelab/data/glds/files/170
#
#
# A table with available filenames and URLs has been written to:
# OSD-170-file-info.tsv
#
#
# The downloaded table holds a total of 194 files.
#
#
# 60 file(s) found matching the provided pattern(s).
#
#
# As requested, here are the 60 files that would be downloaded by this command
# if run without the '--print-only' flag:
#
# GLDS-170_GAmplicon_L18002388_S25_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002373_S10_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002392_S29_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002386_S23_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002376_S13_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002375_S12_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002377_S14_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002383_S20_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002393_S30_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002372_S9_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002364_S1_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002379_S16_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002383_S20_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002385_S22_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002388_S25_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002375_S12_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002371_S8_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002374_S11_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002382_S19_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002385_S22_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002370_S7_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002376_S13_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002367_S4_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002371_S8_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002372_S9_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002366_S3_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002366_S3_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002364_S1_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002382_S19_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002365_S2_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002373_S10_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002386_S23_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002392_S29_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002389_S26_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002393_S30_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002380_S17_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002390_S27_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002370_S7_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002369_S6_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002380_S17_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002377_S14_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002369_S6_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002374_S11_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002367_S4_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002368_S5_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002365_S2_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002381_S18_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002378_S15_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002378_S15_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002389_S26_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002381_S18_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002391_S28_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002379_S16_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002384_S21_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002387_S24_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002390_S27_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002391_S28_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002387_S24_L001_R2_raw.fastq.gz
# GLDS-170_GAmplicon_L18002368_S5_L001_R1_raw.fastq.gz
# GLDS-170_GAmplicon_L18002384_S21_L001_R1_raw.fastq.gz
#
```

And running it without the `--print-only` flag like this would ask us if we want to download them:

```bash
GL-download-GLDS-data -g OSD-170 -p raw.fastq.gz
```

We can add the `-f` flag to skip the confirmation prompt, which is helpful when running non-interactively.

Multiple patterns can be given to the `-p` argument, separated by commas. See the help menu for a note on the 'additive' vs 'exclusive' modes for this argument.

---

## Help menu

See the help menu with `GL-download-GLDS-data -h` for information on things like controlling how many downloads to run in parallel and whether to use additive or exclusive filtering when providing multiple patterns to search for in filenames.

This is the help menu as of genelab-utils version 1.2.11:

```bash
GL-download-GLDS-data -h

# usage: GL-download-GLDS-data [-h] -g OSD_OR_GLDS_ID [-p PATTERN]
#                              [-a ASSAY_TABLE] [-o OUTPUT_DIR] [-j JOBS] [-P]
#                              [-f] [--just-get-file-info-table]
#                              [-m {additive,exclusive}]
#
# This is a program for downloading GLDS data files. See options below for
# usage. For version info, run `GL-version`.
#
# options:
#   -h, --help            show this help message and exit
#   -p PATTERN, --pattern PATTERN
#                         If we do not want to download all files (which we
#                         often won't), we can specify a pattern here to subset
#                         the total files. For example, if we know we want to
#                         download just the fastq.gz files, we can say '-p
#                         fastq.gz'. We can also provide multiple patterns as a
#                         comma-separated list. For example, If we want to
#                         download the fastq.gz files that also have 'NxtaFlex',
#                         'metagenomics', and 'raw' in their filenames, we can
#                         provide '-p fastq.gz,NxtaFlex,metagenomics,raw'.
#                         Looking through the *-file-info.tsv table produced by
#                         this program when run with the '--just-get-file-info-
#                         table' flag can help figure this out if needed. (Note
#                         that this is case-sensitive.)
#   -a ASSAY_TABLE, --assay-table ASSAY_TABLE
#                         Assay table from a given OSD ISA - if providing an
#                         assay table, all files from the 'Raw Data File' column
#                         will be downloaded (incompatible with --pattern and
#                         --mode)
#   -o OUTPUT_DIR, --output-dir OUTPUT_DIR
#                         Directory to put downloaded files (default is current
#                         working directory)
#   -j JOBS, --jobs JOBS  Number of downloads to run in parallel (default: 10)
#   -P, --print-only      Just print out the files that would be downloaded,
#                         rather than downloading them (useful to check we are
#                         getting what we want first)
#   -f, --force           Don't ask for confirmation to begin download (helpful
#                         if wanting to run non-interactively)
#   --just-get-file-info-table
#                         Just download a table of all available files and their
#                         URLs (doesn't incorporate pattern searching)
#   -m {additive,exclusive}, --mode {additive,exclusive}
#                         If providing multiple patterns to the `-p` argument,
#                         this option determines how to handle them. For
#                         example, if we provide `-p raw,fastq`, in the default
#                         mode of 'exclusive', we will only grab files that have
#                         *both* 'raw' and 'fastq' in their filenames. If we set
#                         `-m additive`, we would get all files that have
#                         *either* 'raw' or ' fastq' in their filenames.
#                         (default: exclusive)
#
# required arguments:
#   -g OSD_OR_GLDS_ID, --OSD-or-GLDS-ID OSD_OR_GLDS_ID
#                         OSD ID (e.g., "ODS-276") or GLDS ID (e.g. "GLDS-276");
#                         be sure to read the NOTICE at the bottom of the help
#                         menu
#
# NOTICE: Some confusion may arise due to recent changes. It is possible
# that a GLDS ID and an OSD ID may not match up, e.g., 'OSD-561'
# (https://osdr.nasa.gov/bio/repo/data/studies/OSD-561) holds 'GLDS-556' (which
# we can see at the very top, just under the image next to the title). Moving
# forward, IT IS RECOMMENDED to search for the OSD ID (which you can search for
# based on a given GLDS ID here: https://osdr.nasa.gov/bio/repo/search) - as
# that will find all the associated GLDS files no matter what their GLDS ID's
# are.
# E.g., 'GL-download-GLDS-data -g OSD-561 --print-only'. Contact Mike Lee
# at Mike.Lee@nasa.gov if having trouble.
```

---

## A note on using GLDS IDs

As noted at the end of the help menu above, some confusion may arise due to recent changes in the [OSD/GeneLab repository](https://osdr.nasa.gov/bio/repo/). It is possible that a GLDS ID and an OSD ID may not match up, e.g., 'OSD-561' (https://osdr.nasa.gov/bio/repo/data/studies/OSD-561) holds 'GLDS-556' (which we can see at the very top, just under the image next to the title):

![](https://i.imgur.com/zDTmXrp.png)

Moving forward, **it is recommended** to search for the OSD ID (which you can look up from a given GLDS ID here: https://osdr.nasa.gov/bio/repo/search) – as searching by OSD ID will find all the associated GLDS files no matter what their GLDS IDs are.

E.g., `GL-download-GLDS-data -g OSD-561 --print-only`

Contact Mike Lee at Mike.Lee@nasa.gov if having trouble.

---
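
To make the 'additive' vs 'exclusive' distinction from the help menu a bit more concrete, here is a small sketch that mimics those semantics with plain bash string matching. It is only an illustration of the described behavior, not the actual `genelab-utils` implementation; the `filter_names` function and the example filenames are made up for this page.

```bash
#!/usr/bin/env bash
# Illustration only: mimics the filtering semantics described in the help menu.
# This is NOT the genelab-utils code; filter_names is a hypothetical helper.

# filter_names MODE PATTERNS
#   MODE     : "exclusive" (keep names containing ALL patterns) or
#              "additive"  (keep names containing ANY pattern)
#   PATTERNS : comma-separated, case-sensitive substrings
# Reads filenames from stdin, one per line.
filter_names() {
    local mode="$1" patterns="$2"
    local -a pats
    IFS=',' read -r -a pats <<< "$patterns"

    local name pat hits
    while IFS= read -r name; do
        hits=0
        for pat in "${pats[@]}"; do
            # quoted "$pat" is matched literally as a substring
            if [[ "$name" == *"$pat"* ]]; then
                hits=$((hits + 1))
            fi
        done
        if [[ "$mode" == "additive" && $hits -gt 0 ]] ||
           [[ "$mode" == "exclusive" && $hits -eq ${#pats[@]} ]]; then
            printf '%s\n' "$name"
        fi
    done
}

# hypothetical filenames, made up for this example
names='sample_R1_raw.fastq.gz
sample_R1_trimmed.fastq.gz
sample_raw_multiqc.zip'

echo "exclusive (both 'raw' and 'fastq'):"
printf '%s\n' "$names" | filter_names exclusive raw,fastq

echo "additive (either 'raw' or 'fastq'):"
printf '%s\n' "$names" | filter_names additive raw,fastq
```

With these made-up names, the exclusive call keeps only the file that has both 'raw' and 'fastq' in its name, while the additive call keeps all three. When working with a real dataset, running with `--print-only` first remains the simplest way to check what a given set of patterns will match.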