---
tags: GeneLab
title: GeneLab metagenomics workflow example run
---

# GeneLab metagenomics workflow example run

---

> General workflow usage info can be found in the [NASA GeneLab repo here](https://github.com/nasa/GeneLab_Data_Processing/tree/master/Metagenomics/Illumina/Workflow_Documentation/SW_MGIllumina). **This page streamlines that a little and includes explicitly grabbing and running a small example dataset.** This workflow as currently written relies on conda and snakemake.
>
> The metagenomics workflow relies on large reference databases. They will be installed and set up automatically by the workflow the first time it is run. Altogether, once downloaded and unpacked, they take up about 240 GB of storage, but they may require up to 500 GB during installation and initial unpacking.
>
> Because downloading and setting up the reference databases takes a long time, running the example data here can take 12 hours or longer the first time. The example data are roughly 800 MB (relatively large for example data, so that MAGs are still recovered).

---

[toc]

---

## Installing conda, mamba, and genelab-utils

### conda

If conda is not already present, we recommend miniconda. Installers can be found from conda [here](https://conda.io/en/latest/miniconda.html), and if helpful, command-line installation is walked through [here](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda).
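
Before installing anything, it can help to confirm whether conda is already on your `PATH`. The following is only a minimal sketch; `have_cmd` is a hypothetical helper name introduced here for illustration, not part of conda or genelab-utils.

```bash
# Hypothetical helper: succeeds if the given command name is found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd conda; then
    # conda is available; report which version
    conda --version
else
    echo "conda not found on PATH; install miniconda first"
fi
```
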
### mamba

Once conda is installed on your system, we recommend installing [mamba](https://github.com/mamba-org/mamba#mamba), as it generally allows for much faster conda installations:

```bash
conda install -y -n base -c conda-forge mamba
```

### genelab-utils

The workflows are retrieved with, and meant to be run within, the environment for this package, created here:

```bash
mamba create -y -n genelab-utils -c conda-forge -c bioconda -c defaults -c astrobiomike 'genelab-utils>=1.2.19'
```

The rest below is expected to be done in the genelab-utils environment, activated here:

```bash
conda activate genelab-utils
```

---

## Getting workflow and example data

```bash
# this downloads the workflow
GL-get-workflow MG-Illumina
```

```bash
# this grabs example data for 2 samples (about 800 MB)
GL-get-Illumina-metagenomics-test-data
```

---

## Modifying config.yaml and creating input samples file

```bash
# changing into the workflow directory
cd SW_MGIllumina*/

# setting the input reads directory variable
# (the replacement path must match the reads directory created by the download above; check with `ls ..`)
sed -i 's|../Raw_Sequence_Data/|../example-metagenomic-reads/|' config.yaml

# or if on mac/Darwin system and the above errored
# sed -i "" 's|../Raw_Sequence_Data/|../example-metagenomics-reads/|' config.yaml

# creating input file with unique sample IDs
printf "Sample-1\nSample-2\n" > unique-sample-IDs.txt
```

You must also set the primary location that will hold the reference databases in the config.yaml file, using the `REF_DB_ROOT_DIR` variable.

## Running the workflow

As mentioned above, this workflow relies on some large reference databases. They will be installed and set up automatically by the workflow the first time it is run. Altogether, once downloaded and unpacked, they take up about 240 GB of storage, but they may require up to 500 GB during installation and initial unpacking. Largely because of this database download and setup, the first run of the workflow takes a while, likely more than 12 hours.
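
Given the roughly 500 GB that may be needed during database installation, a quick free-space check on the reference-database location before the first run can save a failed overnight job. This is only a sketch: `has_free_gb` and `REF_DB_DIR_TO_CHECK` are hypothetical names introduced here (not part of the workflow), and it assumes `df -Pk` reports available 1024-byte blocks in the fourth column, as POSIX `df` does.

```bash
# Hypothetical helper: succeeds only if the filesystem holding directory $1
# has at least $2 GB available (as reported by POSIX-format df).
has_free_gb() {
    local dir="$1" need_gb="$2" avail_kb
    # -P gives one line per filesystem; -k reports sizes in 1024-byte blocks
    avail_kb=$(df -Pk "$dir" 2>/dev/null | awk 'NR == 2 {print $4}')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}

# Assumption: point this at the same location you set for REF_DB_ROOT_DIR
# in config.yaml; it defaults to the current directory here.
ref_db_dir="${REF_DB_DIR_TO_CHECK:-.}"

if has_free_gb "$ref_db_dir" 500; then
    echo "OK: at least 500 GB available under ${ref_db_dir}"
else
    echo "Warning: less than 500 GB available under ${ref_db_dir}"
fi
```
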
Running the workflow the first time will also create all needed environments, and then if run as shown below (specifying the `--conda-prefix` as shown), they will be re-used in future runs. This is expected to be executed within the genelab-utils environment.

### Standard execution

The below is one example command to execute the workflow:

```bash
snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p
```

### Telling snakemake to use slurm

There is info about [snakemake/slurm here](https://snakemake.readthedocs.io/en/stable/executing/cluster.html#executing-on-slurm-clusters), and a default location it checks for a slurm configuration file is `~/.config/snakemake/slurm/config.yaml`. Here is a template of what one of mine looks like:

```
cluster:
  mkdir -p slurm-logs &&
  if [ -z {wildcards} ]; then log_wildcard=""; else log_wildcard=$(echo "-{wildcards}" | sed 's/ID=//'); fi &&
  sbatch
    --mem={resources.mem_mb}
    -c {resources.cpus}
    -J {rule}
    -o slurm-logs/{rule}${{log_wildcard}}-%j.log
    -e slurm-logs/{rule}${{log_wildcard}}-%j.log
use-conda: True
cores: 50
jobs: 10
printshellcmds: True
reason: True
rerun-incomplete: True
scheduler: greedy
latency-wait: 60
default-resources:
  - cpus=1
  - mem_mb=2000
```

And here is an example execution telling snakemake to manage things with slurm:

```bash
snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs --profile slurm
```

---

## Peeking at some outputs

From the location where the workflow was run:

**Gene-level function coverages**

```bash
head ../Assembly-based_Processing/combined-outputs/Combined-gene-level-KO-function-coverages.tsv | column -t | sed 's/^/# /'

# KO_ID  KO_function  Sample-1  Sample-2
# K00001  alcohol dehydrogenase [EC:1.1.1.1]  0.0  105.02940000000001
# K00003  homoserine dehydrogenase [EC:1.1.1.3]  0.0  102.3683
# K00005  glycerol dehydrogenase [EC:1.1.1.6]  76.3986  0.0
# K00009  mannitol-1-phosphate 5-dehydrogenase [EC:1.1.1.17]  55.6388  5.1954
# K00010  myo-inositol 2-dehydrogenase / D-chiro-inositol 1-dehydrogenase [EC:1.1.1.18 1.1.1.369]  93.0  0.0
# K00012  UDPglucose 6-dehydrogenase [EC:1.1.1.22]  88.2442  103.9793
# K00013  histidinol dehydrogenase [EC:1.1.1.23]  82.0575  80.1397
# K00014  shikimate dehydrogenase [EC:1.1.1.25]  173.8462  356.61289999999997
# K00015  glyoxylate reductase [EC:1.1.1.26]  0.0  106.5199
```

**Recovered MAGs**

```bash
head ../Assembly-based_Processing/MAGs/MAGs-overview.tsv | column -t | sed 's/^/# /'

# Assembly  Total contigs  Total length  GC content  Maximum contig length  N50  L50  Num. contigs >= 10000  Num. contigs >= 50000  Num. contigs >= 100000  est. completeness  est. redundancy  est. strain heterogeneity  domain  phylum  class  order  family  genus  species
# Sample-2-MAG-2  132  5932921  71.61  305214  69375  27  115  40  12  98.58  1.57  41.67  Bacteria  Proteobacteria  Alphaproteobacteria  Rhizobiales  Beijerinckiaceae  Methylobacterium  Methylobacterium aquaticum_B
# Sample-2-MAG-3  118  6399692  69.54  410159  100140  19  96  45  20  98.65  0.94  25.00  Bacteria  Proteobacteria  Alphaproteobacteria  Rhizobiales  Beijerinckiaceae  Methylobacterium  NA
```
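
If the MAGs overview grows beyond a few rows, it can be handy to keep only the higher-quality bins. The sketch below is one possible way to do that, not part of the workflow itself: `filter_mags` is a hypothetical helper, the 90/5 cutoffs are arbitrary illustration values, and it assumes the file is tab-delimited with header names `est. completeness` and `est. redundancy` exactly as shown above.

```bash
# Hypothetical helper: print the header plus rows whose estimated completeness
# is >= $2 (default 90) and estimated redundancy is <= $3 (default 5).
# Columns are located by header name, so column order does not matter.
filter_mags() {
    local overview_tsv="$1" min_comp="${2:-90}" max_redund="${3:-5}"
    awk -F'\t' -v min_comp="$min_comp" -v max_redund="$max_redund" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "est. completeness") c = i
                if ($i == "est. redundancy")   r = i
            }
            print
            next
        }
        # only filter if both columns were found in the header
        c && r && ($c + 0 >= min_comp + 0) && ($r + 0 <= max_redund + 0)
    ' "$overview_tsv"
}

# Path as used above, relative to where the workflow was run
overview="../Assembly-based_Processing/MAGs/MAGs-overview.tsv"
if [ -f "$overview" ]; then
    filter_mags "$overview" 90 5
fi
```

Looking the columns up by name rather than by position keeps the filter working if the column order ever changes, at the cost of printing only the header when the expected names are absent.
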