# Running GTDB-Tk On a Google Cloud Virtual Machine

Quick MD note on how to run the [GTDB-Tk](https://ecogenomics.github.io/GTDBTk/index.html) pipeline on a Google Cloud VM. GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Taxonomy Database [GTDB](https://gtdb.ecogenomic.org/). This is useful for estimating which taxonomic unit a bin from a metagenomic assembly belongs to.

I started a Google E2 standard machine with 16 CPUs, 64 GB RAM, and a 200 GB HDD to run the pipeline.

# Google CLI

The Google Cloud CLI is a very convenient way to interact with Google VMs. More details TBD; there is already extensive documentation [here](https://cloud.google.com/sdk/gcloud/reference).

# Installing the basics

Nothing like a fresh Ubuntu install to mess up!

## Correct Python

Python 3 is already there; just symlink it:

```bash
sudo ln -s /usr/bin/python3 /usr/bin/python
```

Same for pip:

```bash
sudo apt install -y python3-pip
sudo ln -s /usr/bin/pip3 /usr/bin/pip
```

And a quick check to see that it worked:

```bash
python --version
```

## Useful accessories

### wget

```bash
sudo apt-get install wget
```

### Mamba

```bash
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
```

## Install GTDB-Tk

Keep it simple: get the mamba version.

```bash
mamba create -n gtdbtk-2.2.6 -c conda-forge -c bioconda gtdbtk=2.2.6
```

And activate it:

```bash
mamba activate gtdbtk-2.2.6
```

## GTDB-Tk database

### Download the beast

```bash
wget https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz
```

This step will take a while :/

### Extract the beast

```bash
tar -xzvf gtdbtk_r207_v2_data.tar.gz
export GTDBTK_DATA_PATH=/path/to/release/package/
```

If you are using the pipeline often, it might be faster to make a disk image with the database already in place.
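Since the archive is tens of GB and the download can fail silently, a quick sanity check before extracting can save a lot of time. This is just a suggested check, not part of the official GTDB-Tk instructions:

```bash
# List the archive contents without extracting anything.
# A truncated or corrupt download will make gzip/tar error out here,
# which is much cheaper to discover than a failed multi-hour extraction.
tar -tzf gtdbtk_r207_v2_data.tar.gz | head
```

If this prints a few paths and exits cleanly, the archive is intact and safe to extract.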
## Set the Environment

```bash
mamba env config vars set GTDBTK_DATA_PATH=/home/wormbiome/release207_v2/
# reload
mamba deactivate
mamba activate gtdbtk-2.2.6
```

## Check everything is working

```bash
gtdbtk check_install
```

# Running GTDB-Tk

You did the most difficult part. From now on there is nothing extraordinary; just follow the [manual](https://ecogenomics.github.io/GTDBTk/commands/classify_wf.html).

## Setting up the bin/genome to analyze

We want to place our genomes in a reference folder. On the gcloud VM side:

```bash
mkdir Assembly1
```

If you are using the Google Cloud CLI on your local machine, you can run:

```bash
gcloud compute scp /path/to/your/Fasta/bins/* USER@SERVER:/home/USER/Assembly1/
```

If you are using the web SSH window, you can use the interactive upload option.

## Running the pipeline

```bash
gtdbtk classify_wf --genome_dir Assembly1 --out_dir Assembly1/gtdbtk/ --extension fasta --mash_db mash.db --cpus 16
```

The `--mash_db` option will build the Mash database the first time you run it; subsequent runs should be faster. There are more details on how the pipeline works on this [page](https://ecogenomics.github.io/GTDBTk/examples/classify_wf.html).

###### tags: `tutorials` `Metagenomic` `taxonomy` `Mini` `pipeline`
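Once `classify_wf` finishes, the assignments end up in per-domain summary TSVs in the output directory (for release 207 the bacterial and archaeal marker sets are `bac120` and `ar53`, so the files are named `gtdbtk.bac120.summary.tsv` and `gtdbtk.ar53.summary.tsv`). A quick way to eyeball the results, assuming the standard column layout where the first two columns are `user_genome` and `classification`:

```bash
# Print genome -> GTDB classification for the bacterial bins;
# swap in gtdbtk.ar53.summary.tsv to see the archaeal ones.
cut -f1,2 Assembly1/gtdbtk/gtdbtk.bac120.summary.tsv | column -t -s $'\t'
```

Each classification is a semicolon-separated lineage string (`d__...;p__...;...;s__...`), so this is also a convenient format to grep for a phylum or species of interest.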