---
tags: GeneLab
title: GeneLab amplicon workflow example run
---
# GeneLab amplicon workflow example run
---
> General workflow usage info can be found in the [NASA GeneLab repo here](https://github.com/nasa/GeneLab_Data_Processing/tree/master/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-A). **This page streamlines that a little and explicitly includes grabbing and running a tiny example dataset.** This workflow as currently written relies on conda and snakemake. There are no large reference requirements, and the small example dataset used below takes about 5 minutes to run on a standard laptop.
---
[toc]
---
## Installing conda, mamba, and genelab-utils
### conda
If conda is not already present, we recommend miniconda. Installers are available from conda [here](https://conda.io/en/latest/miniconda.html), and if helpful, command-line installation is walked through [here](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda).
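As a minimal sketch of that, assuming a Linux x86_64 machine (the installer filename differs by OS and architecture), a command-line install can look something like this:
```bash
# downloading the latest miniconda installer (this one is for Linux x86_64; swap in the one for your system)
curl -LO https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# running the installer and following its prompts
bash Miniconda3-latest-Linux-x86_64.sh
```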
### mamba
Once conda is installed on your system, we recommend installing [mamba](https://github.com/mamba-org/mamba#mamba), as it generally allows for much faster conda installations:
```bash
conda install -y -n base -c conda-forge mamba
```
### genelab-utils
The workflows are retrieved with, and meant to be run within, this package's environment, which is created here:
```bash
mamba create -y -n genelab-utils -c conda-forge -c bioconda -c defaults -c astrobiomike 'genelab-utils>=1.2.19'
```
Everything below is expected to be done within the genelab-utils environment, which is activated here:
```bash
conda activate genelab-utils
```
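If wanted, we can double-check that it's active (the current environment is marked with an asterisk):
```bash
# optional sanity check that genelab-utils is the active environment
conda env list
```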
---
## Getting workflow and example data
```bash
# this downloads the workflow
GL-get-workflow Amplicon-Illumina
```
```bash
# this grabs tiny example data for 2 samples (about 300 KB)
GL-get-Illumina-amplicon-test-data
```
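If curious, we can glance at what those two commands pulled down (the exact workflow directory name varies with version; the example-reads directory name here is the one used in the config edit below):
```bash
# peeking at what was downloaded
ls SW_AmpIllumina-*/
ls example-amplicon-reads/
```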
---
## Modifying config.yaml and creating input samples file
```bash
# changing into the workflow directory
cd SW_AmpIllumina-*/
# setting the input reads directory variable
sed -i 's|../Raw_Sequence_Data/|../example-amplicon-reads/|' config.yaml
# or, if on a Mac/Darwin system and the above errored:
# sed -i "" 's|../Raw_Sequence_Data/|../example-amplicon-reads/|' config.yaml
# creating input file with unique sample IDs
printf "Sample-1\nSample-2\n" > unique-sample-IDs.txt
```
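Then an optional quick check that the substitution took effect and the sample-ID file holds what we expect:
```bash
# confirming the example-reads directory is now set in the config file
grep "example-amplicon-reads" config.yaml

# and checking our input sample IDs file
cat unique-sample-IDs.txt
```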
## Running the workflow
The test-data run should only take about 5 minutes. Running the workflow for the first time will create all needed conda environments, and if executed as shown below (specifying the `--conda-prefix`), those environments will be re-used in future runs. As mentioned above, this is expected to be executed within the genelab-utils environment.
### Standard execution
Below is one example command for executing the workflow:
```bash
snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p
```
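If we want to see what snakemake plans to do before actually running anything, the `-n` (dry-run) flag can be added:
```bash
# the -n (--dry-run) flag prints what would be done without running it
snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p -n
```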
### Telling snakemake to use slurm
There is info about [snakemake/slurm here](https://snakemake.readthedocs.io/en/stable/executing/cluster.html#executing-on-slurm-clusters), and a default location it checks for a slurm configuration file is `~/.config/snakemake/slurm/config.yaml`. Here is a template of what one of mine looks like:
```yaml
cluster:
  mkdir -p slurm-logs &&
  if [ -z {wildcards} ]; then log_wildcard=""; else log_wildcard=$(echo "-{wildcards}" | sed 's/ID=//'); fi &&
  sbatch --mem={resources.mem_mb} -c {resources.cpus} -J {rule} -o slurm-logs/{rule}${{log_wildcard}}-%j.log -e slurm-logs/{rule}${{log_wildcard}}-%j.log
use-conda: True
cores: 50
jobs: 10
printshellcmds: True
reason: True
rerun-incomplete: True
scheduler: greedy
latency-wait: 60
default-resources:
  - cpus=1
  - mem_mb=2000
```
And here is an example execution telling snakemake to manage things with slurm:
```bash
snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs --profile slurm
```
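Once that's going, the jobs snakemake submits on our behalf can be watched with standard slurm commands, e.g. from another terminal:
```bash
# checking on our currently queued/running slurm jobs
squeue -u $USER
```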
---
## Peeking at some outputs
From the location where the workflow was run:
**Count table**
```bash
head ../Final_Outputs/counts.tsv | column -t | sed 's/^/# /'
# ASV_ID Sample-1 Sample-2
# ASV_1 1943 0
# ASV_2 1437 0
# ASV_3 0 823
# ASV_4 803 0
# ASV_5 79 316
# ASV_6 298 0
# ASV_7 0 209
# ASV_8 0 181
# ASV_9 164 0
```
**Taxonomy table**
```bash
head ../Final_Outputs/taxonomy.tsv | column -t | sed 's/^/# /'
# ASV_ID domain phylum class order family genus species
# ASV_1 Bacteria Cyanobacteria Cyanobacteriia Chloroplast NA NA NA
# ASV_2 Bacteria Firmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus NA
# ASV_3 Bacteria Proteobacteria Alphaproteobacteria Sphingomonadales Sphingomonadaceae NA NA
# ASV_4 Bacteria Firmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus NA
# ASV_5 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Moraxellaceae NA NA
# ASV_6 Bacteria Actinobacteriota Actinobacteria Corynebacteriales Corynebacteriaceae Corynebacterium NA
# ASV_7 Bacteria Proteobacteria Alphaproteobacteria Rhizobiales Rhizobiaceae NA NA
# ASV_8 Bacteria Proteobacteria Alphaproteobacteria Sphingomonadales Sphingomonadaceae NA NA
# ASV_9 Bacteria Actinobacteriota Actinobacteria Corynebacteriales Corynebacteriaceae Corynebacterium NA
```
**Recovered sequences**
```bash
head ../Final_Outputs/ASVs.fasta | sed 's/^/# /'
# >ASV_1
# TACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTTAAGTCCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGGTGGAAACTACCAAGCTTGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCGTAGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGAATGGGA
# >ASV_2
# TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACAGGA
# >ASV_3
# TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGCTGCTCAAGTCAGAGGTGAAAGCCCGGGGCTCAACCCCGGAACTGCCTTTGAAACTAGGTAGCTAGAATCTTGGAGAGGTCAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGACTGACTGGACAAGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGA
# >ASV_4
# TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACAGGA
# >ASV_5
# TACAGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGAGTGTAGGTGGCTCATTAAGTCACATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATGTGATACTGGTGGTGCTAGAATATGTGAGAGGGAAGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCTTCCTGGCATAATATTGACACTGAGATTCGAAAGCGTGGGTAGCAAACAGGA
```
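And if curious how many ASVs were recovered in total, we can just count the fasta headers:
```bash
# counting total recovered ASVs
grep -c ">" ../Final_Outputs/ASVs.fasta
```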
**Read counts throughout**
```bash
column -t ../Final_Outputs/read-count-tracking.tsv | sed 's/^/# /'
# sample raw_reads cutadapt_trimmed dada2_filtered dada2_denoised_F dada2_denoised_R dada2_merged dada2_chimera_removed final_perc_reads_retained
# Sample-1 10000 9960 6841 6648 6665 6291 6291 62.9
# Sample-2 10000 9680 2634 2545 2580 2442 2233 22.3
```