# Create genome reference database

## Download genomes from NCBI

Open a terminal (or Windows PowerShell) and install the NCBI genome download tool. (This step is not needed if you are working on a server on which the program is already installed.)

```
pip install ncbi-genome-download
```

Find more info on the genome download tool [here](https://pypi.org/project/ncbi-genome-download/).

You can select which organisms to download using the options of the command:

```
ncbi-genome-download --genus Desulfo bacteria
```

The term "Desulfo" in the above command selects only genomes of organisms whose NCBI organism name starts with "Desulfo". The trailing `bacteria` is the taxonomic group to download from, which the tool expects as an argument.

Note: The string match is not case-sensitive.

You can make the string match fuzzy using the `--fuzzy-genus` option. This can be handy if you need to match a value in the middle of the NCBI organism name, like so:

```
ncbi-genome-download --genus Desulfo --fuzzy-genus bacteria
```

Note that for archaeal clades the group in the above command needs to be changed to `archaea`:

```
ncbi-genome-download --genus whatever --fuzzy-genus archaea --format fasta
```

For our purpose we need the genomes in fasta format, but other formats such as GenBank are available as well.

```
ncbi-genome-download --genus Desulfo --fuzzy-genus bacteria --format fasta
```

This tool should work like a charm, and the download of hundreds of genomes takes only a few minutes :timer_clock:

The working directory now contains a directory named "refseq". Inside it, under a sub-directory for the taxonomic group (e.g. `bacteria`), there is one sub-directory for each downloaded genome. The sub-directories are named after the organism's RefSeq assembly accession. The format for RefSeq (NCBI-derived) assembly accessions is:

`[ GCF ][ _ ][nine digits][.][version number]`

An example of the directories:

![](https://i.imgur.com/27wNboG.png)

## Concatenate files

Now we concatenate all fasta files into one large file containing all genome sequences.
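
Concatenating the compressed files directly, without decompressing them first, works because the gzip format allows several compressed members to follow one another in a single file. A minimal sketch of the idea, using two made-up FASTA records (the file names `a.fna.gz` and `b.fna.gz` and the sequences are purely illustrative and not part of the download):

```shell
# Work in a throwaway directory so no real data is touched.
tmpdir=$(mktemp -d)
cd "$tmpdir"

# Two tiny, made-up FASTA files, each gzipped separately.
printf '>seq_a\nACGT\n' | gzip > a.fna.gz
printf '>seq_b\nTTGA\n' | gzip > b.fna.gz

# Byte-wise concatenation of the two gzip files.
cat a.fna.gz b.fna.gz > combined.fna.gz

# Decompressing the combined file yields both records, one after the other.
combined=$(gzip -dc combined.fna.gz)
printf '%s\n' "$combined"
```

The same principle applies when hundreds of `*.fna.gz` files are joined with `cat`.
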
The command below searches all directories under the working directory for gzipped fasta files (`*.fna.gz`) and concatenates them into a new file named `desulfo_genomes_DB.fna.gz` in the parent directory.

```
find ./ -type f -name '*.fna.gz' -exec cat {} + > ../desulfo_genomes_DB.fna.gz
```

To count the number of genomes that were downloaded, run the following inside the directory that holds the per-genome sub-directories:

```
ls -1 | wc -l
```

We now repeat this for as many clades/guilds as we are interested in.
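
Before repeating the procedure for another clade, it can be useful to sanity-check the result. The sketch below is one possible way to do that, not part of the download tool: `count_fasta_records` and `count_genome_dirs` are hypothetical helper names, and the paths in the usage part assume the file and directory names used above. A single genome can contain several sequences (chromosomes, plasmids, contigs), so the number of FASTA records is expected to be at least as large as the number of genome directories.

```shell
# Count FASTA records (header lines starting with ">") in a gzipped FASTA file.
# Hypothetical helper, defined here only for illustration.
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>'
}

# Count genome sub-directories (named after GCF_ accessions) below a directory.
# Hypothetical helper, defined here only for illustration.
count_genome_dirs() {
    find "$1" -type d -name 'GCF_*' | wc -l
}

# Example usage, assuming the database file name from the concatenation step
# and that the current directory is the one searched by that step.
db=../desulfo_genomes_DB.fna.gz
if [ -f "$db" ]; then
    echo "FASTA records in database: $(count_fasta_records "$db")"
    echo "Genome directories found:  $(count_genome_dirs .)"
fi
```

If the database holds fewer records than there are genome directories, some files were probably missed during concatenation.
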