RefSeq all complete genomes download

RefSeq all complete genomes download

Root refseq location is here: https://ftp.ncbi.nih.gov/refseq/release/

This page is downloading the genomic fasta files from RefSeq "complete".

Getting download script

The script used can be downloaded with the following (and is presented below for quick reference):

curl -L -o refseq-complete-genome-dl.sh https://figshare.com/ndownloader/files/35063551

Running script

The script gets the filenames ending in "genomic.fna.gz" available at this html page: https://ftp.ncbi.nih.gov/refseq/release/complete/

Then downloads them in parallel with xargs. Generated files will have today's date in them, as well as the directory holding the genomes at the end.

bash refseq-complete-genome-dl.sh

Script contents

#!/usr/bin/env bash

# Contact: Mike Lee (Mike.Lee@nasa.gov; github.com/AstrobioMike)

if [ "$#" != 0 ]; then
    printf "\n  Helper script to download all refseq complete genomes as of whatever today is.\n"
    printf "  See script for details. There are currently no guardrails or safety nets if a\n"
    printf "  download fails. So check the starting file count vs the total downloaded at end.\n\n"
    printf "  \tUsage:\n\t  bash refseq-complete-genome-dl.sh\n\n"

    printf "  \tContact:\n\t  Mike Lee (Mike.Lee@nasa.gov; github.com/AstrobioMike)\n\n"
    exit
fi

# we can pull through https or ftp
# protocol="ftp"
protocol="https"

refseq_complete_base_link="${protocol}://ftp.ncbi.nlm.nih.gov/refseq/release/complete"
curr_date_marker=$(date +%d-%B-%Y)
refseq_html_file="refseq-${curr_date_marker}.html"
refseq_filenames_file="refseq-${curr_date_marker}-genome-files.txt"
genomes_dir="refseq-${curr_date_marker}-complete-genomes"

mkdir -p ${genomes_dir}

# downloading html page (using this to get all the files we want to download)
curl -L -s -o ${refseq_html_file} ${refseq_complete_base_link}

# parsing out genomic.fna.gz filenames (which are also their link suffixes)
grep "genomic.fna.gz" ${refseq_html_file} | cut -f 2 -d '"' > ${refseq_filenames_file}

# this is messy so that it works on darwin (mac) too
num_files=$(wc -l ${refseq_filenames_file} | sed 's/^ *//' | tr -s " " "\t" | cut -f 1)

printf "\n  We are beginning the download of ${num_files} files now...\n"
printf "  See you in a bit :)\n\n"

# downloading in parallel with xargs (num run in parallel is set with -P option)
xargs -I % -P 10 curl -L -s -O "${refseq_complete_base_link}/%" < ${refseq_filenames_file}

mv *genomic.fna.gz ${genomes_dir}

RefSeq all complete genomes download

Getting download script

Running script

Script contents

Read more

BRAILLE update (31-Mar-2021)

STAMPS 2024 MAG concepts

STAMPS 2023 GToTree tutorial

Creating STAMPS 2023 JetStream2 image