---
tags: GeneLab
title: GeneLab amplicon workflow example run
---

# GeneLab amplicon workflow example run

---

> General workflow usage info can be found in the [NASA GeneLab repo here](https://github.com/nasa/GeneLab_Data_Processing/tree/master/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-A). **This page streamlines that a little and explicitly includes grabbing and running a tiny example dataset.**

This workflow, as currently written, relies on conda and snakemake. There are no large reference requirements, and running the small example dataset used below takes about 5 minutes on a standard laptop.

---

[toc]

---

## Installing conda, mamba, and genelab-utils

### conda

If conda is not already installed, we recommend miniconda. Installers are available from conda [here](https://conda.io/en/latest/miniconda.html), and, if helpful, command-line installation is walked through [here](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda).

### mamba

Once conda is installed on your system, we recommend installing [mamba](https://github.com/mamba-org/mamba#mamba), as it generally allows for much faster conda installations:

```bash
conda install -y -n base -c conda-forge mamba
```

### genelab-utils

The workflows are retrieved with, and meant to be run within, this package's environment, which is created here:

```bash
mamba create -y -n genelab-utils -c conda-forge -c bioconda -c defaults -c astrobiomike 'genelab-utils>=1.2.19'
```

Everything below is expected to be done within the genelab-utils environment, which is activated here:

```bash
conda activate genelab-utils
```

---

## Getting workflow and example data

```bash
# this downloads the workflow
GL-get-workflow Amplicon-Illumina
```

```bash
# this grabs tiny example data for 2 samples (about 300 KB)
GL-get-Illumina-amplicon-test-data
```

---

## Modifying config.yaml and creating the input samples file

```bash
# changing into the workflow directory
cd SW_AmpIllumina-*/

# setting the input reads directory variable
sed -i 's|../Raw_Sequence_Data/|../example-amplicon-reads/|' config.yaml

# or if on a Mac/Darwin system and the above errored
# sed -i "" 's|../Raw_Sequence_Data/|../example-amplicon-reads/|' config.yaml

# creating input file with unique sample IDs
printf "Sample-1\nSample-2\n" > unique-sample-IDs.txt
```

## Running the workflow

The test-data run should only take about 5 minutes. Running the workflow for the first time will create all needed conda environments; if executed as shown below (specifying `--conda-prefix`), those environments will be re-used in future runs. As mentioned above, this is expected to be executed within the genelab-utils environment.

### Standard execution

Here is one example command for executing the workflow:

```bash
snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p
```

### Telling snakemake to use slurm

There is info on using [snakemake with slurm here](https://snakemake.readthedocs.io/en/stable/executing/cluster.html#executing-on-slurm-clusters), and a default location it checks for a slurm profile configuration file is `~/.config/snakemake/slurm/config.yaml`.
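If that default profile directory doesn't exist yet, it can be created first (a minimal setup step, just using the default location mentioned above; the template below would then be saved there as `config.yaml`):

```bash
# creating the default slurm profile location that snakemake checks
mkdir -p ~/.config/snakemake/slurm
```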
Here is a template of what one of mine looks like:

```
cluster:
    mkdir -p slurm-logs &&
    if [ -z {wildcards} ]; then log_wildcard=""; else log_wildcard=$(echo "-{wildcards}" | sed 's/ID=//'); fi &&
    sbatch --mem={resources.mem_mb} -c {resources.cpus} -J {rule} -o slurm-logs/{rule}${{log_wildcard}}-%j.log -e slurm-logs/{rule}${{log_wildcard}}-%j.log
use-conda: True
cores: 50
jobs: 10
printshellcmds: True
reason: True
rerun-incomplete: True
scheduler: greedy
latency-wait: 60
default-resources:
  - cpus=1
  - mem_mb=2000
```

And here is an example execution telling snakemake to manage things with slurm:

```bash
snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs --profile slurm
```
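Not part of the workflow documentation itself, but whether executing locally or through slurm, snakemake's standard dry-run flag can be used first to preview which jobs would be run without actually executing anything:

```bash
# dry run: -n lists the planned jobs without running them, -p prints their shell commands
snakemake -n -p
```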
---

## Peeking at some outputs

From the location where the workflow was run:

**Count table**

```bash
head ../Final_Outputs/counts.tsv | column -t | sed 's/^/# /'

# ASV_ID  Sample-1  Sample-2
# ASV_1   1943      0
# ASV_2   1437      0
# ASV_3   0         823
# ASV_4   803       0
# ASV_5   79        316
# ASV_6   298       0
# ASV_7   0         209
# ASV_8   0         181
# ASV_9   164       0
```

**Taxonomy table**

```bash
head ../Final_Outputs/taxonomy.tsv | column -t | sed 's/^/# /'

# ASV_ID  domain    phylum            class                order              family              genus            species
# ASV_1   Bacteria  Cyanobacteria     Cyanobacteriia       Chloroplast        NA                  NA               NA
# ASV_2   Bacteria  Firmicutes        Bacilli              Staphylococcales   Staphylococcaceae   Staphylococcus   NA
# ASV_3   Bacteria  Proteobacteria    Alphaproteobacteria  Sphingomonadales   Sphingomonadaceae   NA               NA
# ASV_4   Bacteria  Firmicutes        Bacilli              Staphylococcales   Staphylococcaceae   Staphylococcus   NA
# ASV_5   Bacteria  Proteobacteria    Gammaproteobacteria  Pseudomonadales    Moraxellaceae       NA               NA
# ASV_6   Bacteria  Actinobacteriota  Actinobacteria       Corynebacteriales  Corynebacteriaceae  Corynebacterium  NA
# ASV_7   Bacteria  Proteobacteria    Alphaproteobacteria  Rhizobiales        Rhizobiaceae        NA               NA
# ASV_8   Bacteria  Proteobacteria    Alphaproteobacteria  Sphingomonadales   Sphingomonadaceae   NA               NA
# ASV_9   Bacteria  Actinobacteriota  Actinobacteria       Corynebacteriales  Corynebacteriaceae  Corynebacterium  NA
```

**Recovered sequences**

```bash
head ../Final_Outputs/ASVs.fasta | sed 's/^/# /'

# >ASV_1
# TACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTTAAGTCCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGGTGGAAACTACCAAGCTTGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCGTAGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGAATGGGA
# >ASV_2
# TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACAGGA
# >ASV_3
# TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGCTGCTCAAGTCAGAGGTGAAAGCCCGGGGCTCAACCCCGGAACTGCCTTTGAAACTAGGTAGCTAGAATCTTGGAGAGGTCAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGACTGACTGGACAAGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGA
# >ASV_4
# TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACAGGA
# >ASV_5
# TACAGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGAGTGTAGGTGGCTCATTAAGTCACATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATGTGATACTGGTGGTGCTAGAATATGTGAGAGGGAAGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCTTCCTGGCATAATATTGACACTGAGATTCGAAAGCGTGGGTAGCAAACAGGA
```

**Read counts throughout**

```bash
column -t ../Final_Outputs/read-count-tracking.tsv | sed 's/^/# /'

# sample    raw_reads  cutadapt_trimmed  dada2_filtered  dada2_denoised_F  dada2_denoised_R  dada2_merged  dada2_chimera_removed  final_perc_reads_retained
# Sample-1  10000      9960              6841            6648              6665              6291          6291                   62.9
# Sample-2  10000      9680              2634            2545              2580              2442          2233                   22.3
```
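These aren't generated by the workflow, but here are a couple of optional one-liners for quickly summarizing those outputs from the same location (paths match the examples above; the column positions assume this 2-sample test dataset):

```bash
# total number of ASVs recovered
grep -c ">" ../Final_Outputs/ASVs.fasta

# summing ASV counts per sample from the tab-delimited count table
# (columns 2 and 3 hold Sample-1 and Sample-2 in this tiny example)
awk -F'\t' 'NR > 1 { s1 += $2; s2 += $3 } END { print "Sample-1:", s1; print "Sample-2:", s2 }' ../Final_Outputs/counts.tsv
```

The per-sample totals can be compared against the read-count tracking table above.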