---
tags: Random
title: RefSeq all complete genomes download
---

# RefSeq all complete genomes download

[toc]

The root RefSeq release location is here: https://ftp.ncbi.nih.gov/refseq/release/

This page covers downloading the genomic fasta files from RefSeq "complete".

## Getting download script

The [script used](https://figshare.com/articles/software/refseq-complete-genome-dl_sh/19736044) can be downloaded with the following command (its contents are also shown below for quick reference):

```bash
curl -L -o refseq-complete-genome-dl.sh https://figshare.com/ndownloader/files/35063551
```

## Running script

The script collects the filenames ending in "genomic.fna.gz" listed on this html page: https://ftp.ncbi.nih.gov/refseq/release/complete/

It then downloads them in parallel with `xargs`. The generated files have today's date in their names, as does the directory that holds the genomes at the end.

```bash
bash refseq-complete-genome-dl.sh
```

## Script contents

```bash
#!/usr/bin/env bash

# Contact: Mike Lee (Mike.Lee@nasa.gov; github.com/AstrobioMike)

# the script takes no arguments; if any are given, print the help text and exit
if [ "$#" != 0 ]; then
    printf "\n  Helper script to download all refseq complete genomes as of whatever today is.\n"
    printf "  See script for details. There are currently no guardrails or safety nets if a\n"
    printf "  download fails. So check the starting file count vs the total downloaded at end.\n\n"
    printf "    \tUsage:\n\t    bash refseq-complete-genome-dl.sh\n\n"
    printf "    \tContact:\n\t    Mike Lee (Mike.Lee@nasa.gov; github.com/AstrobioMike)\n\n"
    exit
fi

# we can pull through https or ftp
# protocol="ftp"
protocol="https"

refseq_complete_base_link="${protocol}://ftp.ncbi.nlm.nih.gov/refseq/release/complete"

curr_date_marker=$(date +%d-%B-%Y)

refseq_html_file="refseq-${curr_date_marker}.html"
refseq_filenames_file="refseq-${curr_date_marker}-genome-files.txt"
genomes_dir="refseq-${curr_date_marker}-complete-genomes"

mkdir -p "${genomes_dir}"

# downloading html page (using this to get all the files we want to download)
curl -L -s -o "${refseq_html_file}" "${refseq_complete_base_link}"

# parsing out genomic.fna.gz filenames (which are also their link suffixes)
grep "genomic.fna.gz" "${refseq_html_file}" | cut -f 2 -d '"' > "${refseq_filenames_file}"

# this is messy so that it works on darwin (mac) too
num_files=$(wc -l "${refseq_filenames_file}" | sed 's/^ *//' | tr -s " " "\t" | cut -f 1)

printf "\n  We are beginning the download of ${num_files} files now...\n"
printf "  See you in a bit :)\n\n"

# downloading in parallel with xargs (num run in parallel is set with -P option)
xargs -I % -P 10 curl -L -s -O "${refseq_complete_base_link}/%" < "${refseq_filenames_file}"

# moving the downloaded files into the dated genomes directory
mv *genomic.fna.gz "${genomes_dir}"
```
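
## Checking the download

The script's help text notes there are no guardrails if a download fails, and suggests comparing the starting file count with the number downloaded. Below is a minimal sketch of one way to do that check. The function name `list_missing_genomes` is hypothetical and not part of the original script. It assumes the filenames file and the genomes directory that the script produces, and it only tests that each listed file exists and is non-empty.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of the original script).
# Prints every filename listed in the filenames file that is missing
# or empty in the genomes directory.
# Returns 0 if all listed files are present, 1 otherwise.
#
# Usage (substitute the date-stamped names the script generated):
#   list_missing_genomes <filenames-file> <genomes-dir>
list_missing_genomes() {
    local filenames_file="$1"
    local genomes_dir="$2"
    local missing=0
    local name

    # the "|| [ -n ... ]" part handles a final line with no trailing newline
    while IFS= read -r name || [ -n "${name}" ]; do
        # skip blank lines
        [ -z "${name}" ] && continue
        # -s is true only if the file exists and is larger than zero bytes
        if [ ! -s "${genomes_dir}/${name}" ]; then
            printf "%s\n" "${name}"
            missing=1
        fi
    done < "${filenames_file}"

    return "${missing}"
}
```

Any names it prints could be fed back to `curl` to retry only those files. A non-empty file can still be a truncated download, so running `gzip -t` on each file is a stricter follow-up check.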