Running GTDB-Tk On a Google Cloud Virtual Machine

A quick Markdown note on how to run the GTDB-Tk pipeline on a Google Cloud machine. GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB).

This is useful for estimating which taxon a bin from a metagenomic assembly belongs to.

I started a Google E2 standard machine with 16 CPUs, 64 GB RAM, and a 200 GB HDD to run the pipeline.
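For reference, a VM matching this description could be created from the Google Cloud CLI roughly as follows. This is only a sketch, not the exact command used for this note: the instance name, the zone, and the Ubuntu image family are assumptions. `e2-standard-16` is the E2 standard machine type with 16 vCPUs and 64 GB of memory, and `pd-standard` is the HDD-backed disk type.

```shell
# Sketch: create a VM like the one described above.
# VM_NAME, ZONE and the image family are hypothetical; adjust to your project.
VM_NAME="${VM_NAME:-gtdbtk-vm}"
ZONE="${ZONE:-us-central1-a}"

create_gtdbtk_vm() {
    gcloud compute instances create "$VM_NAME" \
        --zone="$ZONE" \
        --machine-type=e2-standard-16 \
        --boot-disk-size=200GB \
        --boot-disk-type=pd-standard \
        --image-family=ubuntu-2204-lts \
        --image-project=ubuntu-os-cloud
}
```

Once gcloud is authenticated, calling create_gtdbtk_vm starts the machine, and gcloud compute ssh "$VM_NAME" --zone="$ZONE" opens a shell on it.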

Google CLI

The Google CLI is a very convenient way to interact with the Google VMs.

More details TBD

Extensive documentation is already available here

Installing the basics

Nothing like a fresh Ubuntu install to mess up!

Correct python

Python3 is already there; just symlink it

sudo ln -s /usr/bin/python3 /usr/bin/python

Same for pip

sudo apt install -y python3-pip
sudo ln -s /usr/bin/pip3 /usr/bin/pip

And a quick check to see if it worked

python --version

Useful accessories

wget

sudo apt-get install wget

Mamba

# Miniforge3 ships with mamba; the separate Mambaforge installer has been retired
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh

Install GTDB-Tk

Keep it simple and install the mamba (conda) package

mamba create -n gtdbtk-2.2.6 -c conda-forge -c bioconda gtdbtk=2.2.6

And activate it

mamba activate gtdbtk-2.2.6

GTDB-Tk database

Download the beast

wget https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz

This step will take a while :/
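Because the archive is large and has to be extracted afterwards, it can be worth confirming that the disk has room before starting. The sketch below checks the free space of the current directory; `free_gb` and `has_free_gb` are hypothetical helpers, and the 150 GB threshold is an assumed safety margin, not a documented GTDB-Tk requirement.

```shell
# Sketch: check free disk space before starting the large download.
# MIN_FREE_GB is an assumed safety margin, not a documented requirement.
MIN_FREE_GB="${MIN_FREE_GB:-150}"

# Print the free space (in whole GB) on the filesystem holding directory $1.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed only if directory $1 has at least $2 GB free.
has_free_gb() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

if has_free_gb . "$MIN_FREE_GB"; then
    echo "Enough space; safe to start the download."
else
    echo "Less than ${MIN_FREE_GB} GB free; consider a larger disk."
fi
```

If the transfer gets interrupted, running wget -c with the same URL resumes the partial file instead of starting over.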

Extract the beast

tar -xvzf gtdbtk_r207_v2_data.tar.gz
export GTDBTK_DATA_PATH=/path/to/release/package/

If you use the pipeline often, it may be faster to save a disk image of the VM with the database already extracted.

Set the Environment

mamba env config vars set GTDBTK_DATA_PATH=/home/wormbiome/release207_v2/
# reload
mamba deactivate
mamba activate gtdbtk-2.2.6

Check everything is working

gtdbtk check_install
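If the install check complains about missing reference data, the variable is the first thing to look at. The helper below is a minimal sketch that only verifies that GTDBTK_DATA_PATH is set and points at a non-empty directory; `check_data_path` is a hypothetical name and not part of GTDB-Tk, and it does not validate the contents of the database.

```shell
# Sketch: confirm GTDBTK_DATA_PATH points at a non-empty directory.
# check_data_path is a hypothetical helper, not part of GTDB-Tk.
check_data_path() {
    if [ -z "${GTDBTK_DATA_PATH:-}" ]; then
        echo "GTDBTK_DATA_PATH is not set"
        return 1
    fi
    if [ ! -d "$GTDBTK_DATA_PATH" ]; then
        echo "Not a directory: $GTDBTK_DATA_PATH"
        return 1
    fi
    if [ -z "$(ls -A "$GTDBTK_DATA_PATH")" ]; then
        echo "Directory is empty: $GTDBTK_DATA_PATH"
        return 1
    fi
    echo "Reference data found in $GTDBTK_DATA_PATH"
}
```

A typical use is check_data_path && gtdbtk check_install, so the longer check only runs when the path looks sane.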

Running GTDB-Tk

You have done the hardest part. From here on there is nothing unusual; just follow the manual.

Setting up the bin/genome to analyze

We want to place our genome in a reference folder.

On the gcloud VM side:

mkdir Assembly1

If you are using the Google Cloud CLI on your local machine, you can run:

gcloud compute scp /path/to/your/Fasta/bins/* USER@SERVER:/home/USER/Assembly1/

If you are using the web SSH window you can use the interactive upload option
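After the upload, it helps to confirm that the files landed in the folder and that their suffix matches the one passed to GTDB-Tk with --extension (GTDB-Tk looks for fna files unless told otherwise). A small sketch, where `count_genomes` is a hypothetical helper taking a directory and an extension:

```shell
# Sketch: count files directly inside a directory that end with a given extension.
# count_genomes is a hypothetical helper; arguments are DIR and EXT.
count_genomes() {
    find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d ' '
}
```

On the VM, count_genomes Assembly1 fasta should print the number of bins you uploaded. If it prints 0, the bins probably use another suffix such as fa or fna, and the --extension value needs to match it.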

Running the pipeline

gtdbtk classify_wf --genome_dir Assembly1 --out_dir Assembly1/gtdbtk/ --extension fasta --mash_db mash.db  --cpus 16

The --mash_db option gives the path of the Mash reference sketch database. GTDB-Tk builds it the first time you run the pipeline and reuses it afterwards, so later runs should be faster.

There are more details on how the pipeline works on this page
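Once the run finishes, the classifications can be read straight from the summary table. The sketch below assumes a tab-separated summary file with a header row whose first two columns are the genome name and the classification string; `show_classification` is a hypothetical helper, and the file name in the usage note is an assumption based on GTDB-Tk v2 output naming.

```shell
# Sketch: print genome name and GTDB classification from a summary table.
# Assumes a tab-separated file with a header row whose first two columns
# are the genome name and the classification string.
show_classification() {
    tail -n +2 "$1" | cut -f 1,2
}
```

For the run above, this would be called as show_classification Assembly1/gtdbtk/gtdbtk.bac120.summary.tsv for bacterial bins; archaeal bins, if any, are reported in a separate summary file.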

tags: tutorials Metagenomic taxonomy Mini pipeline