GL genome intake processing

---
tags: GeneLab
---

# GL genome intake processing

[toc]

## Overview of what we're generating for each submitted assembly (genome)

* assembly summary statistics (with [bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit))
* estimated completion/redundancy (with [CheckM](https://github.com/Ecogenomics/CheckM/wiki))
* taxonomy (with [GTDB-Tk](https://ecogenomics.github.io/GTDBTk/))

## Setting up environment

**All of the following only needs to be done once.**

### Conda

See more details on getting the correct version for your system and responses to prompts during installation [here (a conda intro tutorial)](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda):

Downloading (this one is for linux systems, need a different link if working on a Mac):

```bash
curl -LO https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
```

```bash
bash Miniconda3-latest-Linux-x86_64.sh
```

> **NOTE**
> Depending on where we are working, we might need to put the conda environment somewhere other than the default location (it asks this at a prompt during the install). But using the default location seems to be fine on MMOC.

Sourcing the environment so changes take effect:

```bash
source ~/.bashrc
```

Installing snakemake:

```bash
conda install -y -c conda-forge -c bioconda -c defaults snakemake=5.19.3
```

#### Setting up snakemake "profile"

Copy and paste the entire codeblock to set up our [profile](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles):

```bash
mkdir -p ~/.config/snakemake/genome/

echo "jobs: 8
use-conda: true
printshellcmds: true
conda-prefix: \"${CONDA_PREFIX}/envs\"
latency-wait: 60" > ~/.config/snakemake/genome/config.yaml
```

### Checkm reference databases setup

---

> **NOTE**
> This takes < 5 minutes, and they are about 1.5 GB when decompressed.
If working on MMOC, this section (Checkm reference databases setup) can be skipped, but **if skipping this section, this code block needs to be run in the conda environment**:

```bash
mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d/
echo 'export checkm_ref_db='\"/netapp/disk1/home/mdlee4/checkm-ref-dbs\" >> ${CONDA_PREFIX}/etc/conda/activate.d/vars.sh
source ~/.bashrc
```

---

If setting up somewhere new, go somewhere we can store the checkm reference databases and then create a new directory for them, e.g.:

```bash
mkdir checkm-ref-dbs
cd checkm-ref-dbs/
```

Downloading checkm references:

```bash
curl -LO https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar -xzvf checkm_data_2015_01_16.tar.gz
rm checkm_data_2015_01_16.tar.gz
```

And adding a variable to our conda environment that stores where this is set up:

```bash
mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d/
echo 'export checkm_ref_db='\"$(pwd)\" >> ${CONDA_PREFIX}/etc/conda/activate.d/vars.sh
```

Can check it's there:

```bash
source ~/.bashrc
echo $checkm_ref_db
```

This variable will be used to set the checkm directory while running through snakemake.

### GTDB-Tk reference databases setup

---

> **NOTE**
> This takes maybe 90 minutes (on MMOC), and they are about 27 GB when decompressed.
If working on MMOC, this section (GTDB-Tk reference databases setup) can be skipped, but **if skipping this section, this needs to be run in the conda environment**:

```bash
mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d/
echo 'export gtdb_ref_db='\"/netapp/disk1/home/mdlee4/gtdb-tk-ref-dbs/release89\" >> ${CONDA_PREFIX}/etc/conda/activate.d/vars.sh
source ~/.bashrc
```

---

If setting up somewhere new, go somewhere we can store the gtdb-tk reference databases and then create a new directory for them, e.g.:

```bash
mkdir gtdb-tk-ref-dbs
cd gtdb-tk-ref-dbs/
```

Downloading gtdb-tk references:

```bash
curl -LO https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/gtdbtk_r89_data.tar.gz
tar -xzvf gtdbtk_r89_data.tar.gz
rm gtdbtk_r89_data.tar.gz
cd release89/
```

And adding a variable to our conda environment that stores where this is set up:

```bash
mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d/
echo 'export gtdb_ref_db='\"$(pwd)\" >> ${CONDA_PREFIX}/etc/conda/activate.d/vars.sh
```

Can check it's there:

```bash
source ~/.bashrc
echo $gtdb_ref_db
```

### Getting the pre-formatted Snakemake files

Here we are pulling in the snakemake directory structure and files we're going to use and storing them in our conda environment shared location:

```bash
curl -L -o GL-snakemake-genomes-intake-processing.tar.gz https://ndownloader.figshare.com/files/23706011
tar -xzvf GL-snakemake-genomes-intake-processing.tar.gz
rm GL-snakemake-genomes-intake-processing.tar.gz
mv GL-snakemake-genomes-intake-processing/ ${CONDA_PREFIX}/share/
```

Getting the script that will generate the directory structure and starting files wherever we are, and adding it to the conda environment:

```bash
curl -L -o GL-setup-intake-genomes-processing-dir.sh https://ndownloader.figshare.com/files/23596601
chmod +x GL-setup-intake-genomes-processing-dir.sh
mv GL-setup-intake-genomes-processing-dir.sh ${CONDA_PREFIX}/bin/
```

We can now test that like so:

```bash
GL-setup-intake-genomes-processing-dir.sh
```

Which should
create the `genome-intake-processing/` directory, which we can delete for now:

```bash
rm -rf genome-intake-processing/
```

## Example usage

Here we will demonstrate how we can use all this to generate some of the associated info we're going to include with newly submitted genomes. Even operating on just 2 input genomes in the example, the full process will take about 30 minutes. But it only takes a minute to kick off the whole process.

### Making an example directory to work in

We'll make a directory to work in, doesn't matter where we are:

```bash
mkdir new-genomes-submission-ex
cd new-genomes-submission-ex/
```

### Downloading example genomes

Making a directory to hold our genomes:

```bash
mkdir example-genomes
cd example-genomes/
```

Downloading and unzipping 2 example genomes:

```bash
curl -LO https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/345/GCF_000007345.1_ASM734v1/GCF_000007345.1_ASM734v1_genomic.fna.gz
gunzip GCF_000007345.1_ASM734v1_genomic.fna.gz

curl -LO https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/865/GCF_000008865.2_ASM886v2/GCF_000008865.2_ASM886v2_genomic.fna.gz
gunzip GCF_000008865.2_ASM886v2_genomic.fna.gz

# renaming them to have ".fasta" extensions more in alignment with other things in GeneLab (longer extensions rather than shorter)
mv GCF_000007345.1_ASM734v1_genomic.fna GCF_000007345.1_ASM734v1_genomic.fasta
mv GCF_000008865.2_ASM886v2_genomic.fna GCF_000008865.2_ASM886v2_genomic.fasta
```

We need to copy the [path](https://astrobiomike.github.io/unix/getting-started#absolute-vs-relative-path) of where these input genomes are, so we are going to run this next command, and be sure to copy the output (we'll see in the next section where we will be pasting it in):

```bash
pwd
```

Then let's go back up a directory to where we want to work:

```bash
cd ../
```

### Pulling in the Snakemake setup

We can set up the Snakemake files with the script we grabbed above.
To run it we just need to do this:

```bash
GL-setup-intake-genomes-processing-dir.sh
```

And now let's change into that directory:

```bash
cd genome-intake-processing/
ls
```

## What we need to change

The `config.yaml` file holds the things we will change. Right now, that just involves changing the GLDS ID we are working on, and the location of the input genomes. The template comes set up for when I was working on GLDS-302 on Oberyn and looks like this:

```bash
cat config.yaml
```

```
glds_ID:
    "GLDS-302"

genomes_dir:
    "/data1/Data_Processing/Genome_Datasets/GLDS-302/genomes"

threads:
    "8"
```

On the server, it'll be easiest to do this with a command-line text editor, e.g. [`nano`](https://astrobiomike.github.io/unix/working-with-files-and-dirs#a-terminal-text-editor).

We want to change the "GLDS-302" to whichever GLDS number we are working on. This will be appended to the output file name. Here it's just an example, so not putting in a real GLDS number. And we want to paste in the location we copied above within the quotes of the `genomes_dir` entry. So after changing, it may look like this (holding the location I copied above):

```
glds_ID:
    "GLDS-302"

genomes_dir:
    "/data1/Data_Processing/mlee/Genomes-area/example-genomes"

threads:
    "8"
```

> **NOTE**
> As currently written, the input genomes need to have the extension ".fasta". E.g. `GLDS-262_wgs_NHNT01.fasta`.

## Running the snakemake workflow

### Regular (no job manager program)

We are using Snakemake to handle this for us. After pointing to where the genomes are in that `config.yaml` file, all we need to do to run it is the following:

```bash
snakemake --profile genome
```

The taxonomy assignment takes the longest, and depending on the number of genomes and whether both bacteria and archaea are included, it may take 30+ minutes.
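Since a full run takes a while, a quick pre-flight check can catch a mistyped `genomes_dir` or inputs missing the ".fasta" extension before anything is launched. This is only a minimal sketch, not part of the workflow itself: the function name `count_fasta_inputs` is hypothetical, and it assumes the `config.yaml` holds the `genomes_dir` key and its quoted value on a single line (`genomes_dir: "/path/to/genomes"`), which may not match how the template file is laid out.

```bash
# Hypothetical helper (not part of the GeneLab workflow): reports how many
# ".fasta" files sit in the genomes_dir named in a config.yaml.
# Assumes a single-line entry of the form: genomes_dir: "/some/path"
count_fasta_inputs() {
    local config="${1:-config.yaml}"
    local genomes_dir

    # Pull the quoted value out of the genomes_dir line
    genomes_dir=$(sed -n 's/^genomes_dir:[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p' "${config}")

    # Stop early if the entry is missing or points somewhere that doesn't exist
    if [ ! -d "${genomes_dir}" ]; then
        echo "genomes_dir not found: ${genomes_dir}" >&2
        return 1
    fi

    # Count only files ending in ".fasta", the extension the workflow expects
    find "${genomes_dir}" -maxdepth 1 -type f -name '*.fasta' | wc -l | tr -d ' '
}

# Example, run from within genome-intake-processing/:
#   count_fasta_inputs config.yaml
```

If the number printed doesn't match the number of genomes submitted, the path or the file extensions are worth a second look before starting.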
To run the workflow in the background and avoid needing to stay connected, it can be done in `screen` or called like this: `nohup snakemake --profile genome &`

When done, we will have a file like `GLDS-XXX-genome-summaries.tsv` that looks like this:

```
Assembly	Total contigs	Total length	Ambiguous characters	GC content	Maximum contig length	Minimum contig length	N50	L50	Est. Completeness (%)	Est. Redundancy (%)	Domain	Phylum	Class	Order	Family	Genus	Species
contigs_JC-0091_S1021_L001	106	5,028,784	0	55.91	569,761	204	133,612	11	100.0	0.22	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Enterobacteriaceae	Pantoea	Pantoea brenneri
contigs_JC-0031_S959_L001	35	4,894,675	0	55.91	1,801,920	203	808,304	2	99.97	0.33	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Enterobacteriaceae	Enterobacter	Enterobacter bugandensis
contigs_JC-0048_S976_L001	106	5,028,470	0	55.91	569,761	204	133,612	10	100.0	0.22	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Enterobacteriaceae	Pantoea	Pantoea brenneri
contigs_JC-0090_S1020_L001	106	5,028,395	0	55.91	569,761	204	129,551	11	100.0	0.22	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Enterobacteriaceae	Pantoea	Pantoea brenneri
```

### With Slurm (needed on MMOC)

---

>Worked on this on MMOC here: /netapp/disk1/home/mdlee4/new-genomes-submission-ex/genome-intake-processing
>
>**I'm currently messing with the slurm profile at `~/.config/snakemake/slurm/config.yaml` and figuring things out learning from [this very helpful page](https://www.sichong.site/2020/02/25/snakemake-and-slurm-how-to-manage-workflow-with-resource-constraint-on-hpc/)**
>
>**There is no `screen`. Currently running in `nohup` like so: `nohup snakemake --profile slurm -p &`**
>
>**This all works, but the computational resources on MMOC are very limited. So not doing any on there. Leaving this for me to look at if needed in the future though.**

---

# Slurm-only way (no snakemake)

Here's a regular slurm avenue.
## Conda installations

Do the base conda install as noted above (can skip the snakemake install), then we need to create 3 environments (they go much faster with [mamba](https://github.com/TheSnakePit/mamba#mamba-an-experiment-to-make-conda-faster), so adding that in first):

```bash
conda install -y -c conda-forge mamba
```

```bash
mamba create -y -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit
```

```bash
mamba create -y -n checkm -c conda-forge -c bioconda -c defaults checkm-genome
```

```bash
mamba create -y -n gtdb-tk -c conda-forge -c bioconda -c defaults gtdbtk
```

### Setting some variables

Setting the checkm references location needs to be done within the checkm conda environment (using the one I set up above on MMOC):

```bash
conda activate checkm
checkm data setRoot /netapp/disk1/home/mdlee4/checkm-ref-dbs
conda deactivate
```

Setting the gtdb-tk references location needs to be done within that environment:

```bash
conda activate gtdb-tk
mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d/
echo 'export GTDBTK_DATA_PATH='\"/netapp/disk1/home/mdlee4/gtdb-tk-ref-dbs/release89\" >> ${CONDA_PREFIX}/etc/conda/activate.d/vars.sh
conda deactivate
```
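As with the `echo $checkm_ref_db` checks earlier, it can be worth confirming the variable is actually picked up when the environment is activated. The following is a small sketch of one way to do that; `check_ref_dir` is a hypothetical helper name (not from bit, CheckM, or GTDB-Tk), and it only verifies that the exported variable is set and points to an existing directory, not that the database contents are complete.

```bash
# Hypothetical helper: confirms an exported variable is set and that it
# points to a directory that exists. It does not inspect the contents.
check_ref_dir() {
    local var_name="$1"
    local value

    # printenv exits non-zero when the variable is not in the environment
    value=$(printenv "${var_name}") || {
        echo "${var_name} is not set" >&2
        return 1
    }

    if [ -d "${value}" ]; then
        echo "${var_name} OK: ${value}"
    else
        echo "${var_name} points to a missing directory: ${value}" >&2
        return 1
    fi
}

# Example:
#   conda activate gtdb-tk
#   check_ref_dir GTDBTK_DATA_PATH
#   conda deactivate
```

The variable only appears after a fresh `conda activate gtdb-tk`, since the `activate.d/vars.sh` file is read at activation time.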