# Running GTDB-Tk On a Google Cloud Virtual Machine
Quick MD note on how to run the [GTDB-Tk](https://ecogenomics.github.io/GTDBTk/index.html) pipeline on a Google Cloud VM. GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Taxonomy Database [GTDB](https://gtdb.ecogenomic.org/).
This is useful for estimating which taxonomic unit the bins you get from a metagenomic assembly belong to.
I started a Google E2 standard machine with 16 CPUs, 64 GB RAM, and a 200 GB HDD to run the pipeline.
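For reference, a comparable machine can be created from the command line. This is only a sketch: the instance name, zone, and image family below are placeholders I made up, not the original setup.

```shell
# Create an E2 VM similar to the one used here:
# e2-standard-16 = 16 vCPUs and 64 GB RAM, with a 200 GB standard (HDD) boot disk.
# Instance name, zone, and image are placeholders; adjust to your project.
gcloud compute instances create gtdbtk-vm \
  --machine-type=e2-standard-16 \
  --zone=us-central1-a \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=200GB \
  --boot-disk-type=pd-standard
```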
# Google CLI
The Google Cloud CLI (`gcloud`) is a very convenient way to interact with the Google VMs.
More details TBD
There is already extensive documentation [here](https://cloud.google.com/sdk/gcloud/reference)
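As a sketch of day-to-day use (the instance name and zone are placeholder values, not from this setup):

```shell
# List the VMs in the current project
gcloud compute instances list

# Open an SSH session on a VM
gcloud compute ssh gtdbtk-vm --zone=us-central1-a

# Copy files to the VM
gcloud compute scp local_file.fasta gtdbtk-vm:~/ --zone=us-central1-a

# Stop the VM when done (compute billing stops; disk is still billed)
gcloud compute instances stop gtdbtk-vm --zone=us-central1-a
```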
# Installing the basics
Nothing like a fresh Ubuntu install to mess with!
## Correct python
Python 3 is already there; just symlink it:
```bash
sudo ln -s /usr/bin/python3 /usr/bin/python
```
Same for pip:
```bash
sudo apt install -y python3-pip
sudo ln -s /usr/bin/pip3 /usr/bin/pip
```
And a quick check to see if it worked:
```bash
python --version
```
## Useful accessories
### wget
```bash
sudo apt-get install wget
```
### Mamba
```bash
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
```
## Install GTDB-Tk
To keep it simple, get the conda package via mamba:
```bash
mamba create -n gtdbtk-2.2.6 -c conda-forge -c bioconda gtdbtk=2.2.6
```
And activate it
```bash
mamba activate gtdbtk-2.2.6
```
## GTDB-Tk database
### Download the beast
```bash
wget https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz
```
This step will take a while :/
### Extract the beast
```bash
tar -xvzf gtdbtk_r207_v2_data.tar.gz
export GTDBTK_DATA_PATH=/path/to/release/package/
```
If you are using the pipeline often, it might be faster to make an image of it.
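Something along these lines could turn the disk into a reusable image (the disk and image names here are made up, not from this setup):

```shell
# Create an image from the VM's boot disk so the extracted database
# does not have to be re-downloaded on every new machine.
# Source disk, zone, and image name are placeholders.
gcloud compute images create gtdbtk-r207-image \
  --source-disk=gtdbtk-vm \
  --source-disk-zone=us-central1-a
```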
## Set the Environment
```bash
mamba env config vars set GTDBTK_DATA_PATH=/home/wormbiome/release207_v2/
#reload
mamba deactivate
mamba activate gtdbtk-2.2.6
```
## Check everything is working
```bash
gtdbtk check_install
```
# Running GTDB-Tk
You did the most difficult part. From now on, there is nothing extraordinary; just follow the [manual](https://ecogenomics.github.io/GTDBTk/commands/classify_wf.html)
## Setting up the bin/genome to analyze
We want to place our genome in a reference folder.
On the Google Cloud VM side:
```bash
mkdir Assembly1
```
If you are using the google cloud CLI on your local machine, you can run:
```bash
gcloud compute scp /path/to/your/Fasta/bins/* USER@SERVER:/home/USER/Assembly1/
```
If you are using the web SSH window, you can use the interactive upload option.
## Running the pipeline
```bash
gtdbtk classify_wf --genome_dir Assembly1 --out_dir Assembly1/gtdbtk/ --extension fasta --mash_db mash.db --cpus 16
```
The `--mash_db` option builds the Mash reference database the first time you run it; subsequent runs should be faster.
There are more details on how the pipeline works on this [page](https://ecogenomics.github.io/GTDBTk/examples/classify_wf.html)
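The main result is a tab-separated summary file (`gtdbtk.bac120.summary.tsv` for bacterial bins), whose first two columns are the genome name and the full taxonomy string. A quick way to pull those out, sketched here on a mocked-up file with the same layout:

```shell
# Mock up a tiny summary file with the same first two columns as GTDB-Tk's
# output (real files have many more columns and rows).
printf 'user_genome\tclassification\nbin1\td__Bacteria;p__Proteobacteria\n' \
  > gtdbtk.bac120.summary.tsv

# Keep only the genome name and its taxonomy string
cut -f1,2 gtdbtk.bac120.summary.tsv
```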
###### tags: `tutorials` `Metagenomic` `taxonomy` `Mini` `pipeline`