---
tags: STAMPS 2022
title: STAMPS 2022 microbial diversity metagenomes
---
# STAMPS 2022 microbial diversity metagenomes
Processing metagenomes from [this paper](https://environmentalmicrobiome.biomedcentral.com/articles/10.1186/s40793-019-0348-0), which is based on earlier MBL data from the Microbial Diversity course :+1:
---
[toc]
---
# Download outputs/files produced by the code below
## Atlas WF stuff
It's ~2.3GB compressed, and like 3GB uncompressed:
```bash
curl -L -o all-atlas-files-and-outputs.tar.gz https://figshare.com/ndownloader/files/36353529
tar -xzvf all-atlas-files-and-outputs.tar.gz
```
## GeneLab WF stuff
It's ~1.4GB compressed, and like 1.5GB uncompressed:
```bash
curl -L -o all-GL-files-and-outputs.tar.gz https://figshare.com/ndownloader/files/36370509
tar -xzvf all-GL-files-and-outputs.tar.gz
```
---
# Processing
## Getting raw data
There are 4 metagenomes:
- 3 that are from 3 different holes
- [SRR8859675](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8859675)
- [SRR8859676](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8859676)
- [SRR8859678](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8859678)
- and 1 from an enrichment culture
- [SRR8859677](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8859677)
Using SRA tools to download the raw reads as follows:
### Env creation
> Using conda/mamba throughout, introduction page [here](https://astrobiomike.github.io/unix/conda-intro) if wanted.
```bash
mamba create -y -n sra-tools -c conda-forge -c bioconda -c defaults sra-tools=2.11.0
conda activate sra-tools
```
### Downloading
> There are faster ways to do this if doing a ton of them (namely using `prefetch` first before `fasterq-dump`), but this is fine for just a few like here.
The following was run in a [screen](https://astrobiomike.github.io/unix/screen-intro):
```bash
fasterq-dump --split-spot --split-files --progress SRR8859675 SRR8859676 SRR8859677 SRR8859678
```
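The faster route mentioned in the note above would look something like this (a sketch; `download_runs` is just a made-up wrapper name):

```bash
# prefetch grabs the .sra files first (resumable, generally faster for
# many accessions), then fasterq-dump converts them locally
download_runs() {
    for acc in "$@"; do
        prefetch "$acc" && \
            fasterq-dump --split-spot --split-files --progress "$acc"
    done
}

# e.g.:
# download_runs SRR8859675 SRR8859676 SRR8859677 SRR8859678
```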
For some reason, 2 of them came through as single-read files:
```bash
ls -l *.fastq | sed 's/^/# /'
# -rw-rw-r-- 1 mike mike 1000128906 Jul 18 16:44 SRR8859675_1.fastq
# -rw-rw-r-- 1 mike mike 1000128906 Jul 18 16:44 SRR8859675_2.fastq
# -rw-rw-r-- 1 mike mike 108698164 Jul 18 16:44 SRR8859676.fastq
# -rw-rw-r-- 1 mike mike 102900824 Jul 18 16:44 SRR8859677.fastq
# -rw-rw-r-- 1 mike mike 913532544 Jul 18 16:49 SRR8859678_1.fastq
# -rw-rw-r-- 1 mike mike 913532544 Jul 18 16:49 SRR8859678_2.fastq
```
Though their experiment pages (e.g. for [76 here](https://www.ncbi.nlm.nih.gov/sra/SRX5647178[accn])) say they are paired like the others, their run pages (e.g. for [76 here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR8859676&display=metadata)) say there is one read per spot ¯\\\_(ツ)\_/¯
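One quick sanity check on the paired files is confirming R1 and R2 hold the same number of reads (a fastq record is 4 lines). A minimal sketch, with a made-up helper name:

```bash
# count reads in an uncompressed fastq (4 lines per record)
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# e.g., these two should report the same number:
# count_reads SRR8859675_1.fastq
# count_reads SRR8859675_2.fastq
```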
---
## Processing with atlas workflow
Using the [atlas](https://github.com/metagenome-atlas/atlas/#metagenome-atlas) workflow to process. Documentation [here](https://metagenome-atlas.readthedocs.io/en/latest/).
### Env creation
```bash
mamba create -y -n atlas -c conda-forge -c bioconda -c defaults metagenome-atlas=2.9.1
conda activate atlas
```
### Running
This will set up the required reference databases on the first run. The following was also run in a [screen](https://astrobiomike.github.io/unix/screen-intro):
```bash
# making directory for atlas refs
mkdir ~/atlas-refs
# running init
atlas init --db-dir ~/atlas-refs/ ./
# [Atlas] INFO: Configuration file written to /home/mike/STAMPS/2022/MG-with-Taylor/config.yaml
# You may want to edit it using any text editor.
# [Atlas] INFO: I inferred that _1 and _2 distinguish paired end reads.
# [Atlas] ERROR: Did't find '_1' or '_2' in fastq SRR8859677 : /home/mike/STAMPS/2022/MG-with-Taylor/SRR8859677.fastqIgnore file.
# [Atlas] ERROR: Did't find '_1' or '_2' in fastq SRR8859676 : /home/mike/STAMPS/2022/MG-with-Taylor/SRR8859676.fastqIgnore file.
# [Atlas] INFO: Found 2 samples
```
Don't know if Atlas can be configured to work with paired-end and single-end data together, but for now just moving forward with the paired ones, SRR8859675 and SRR8859678 (the single-end files are tiny anyway, like 180,000 reads).
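Atlas just warned and skipped the single-end files here, but a tidier option next time would be to stash them out of the working directory before running `atlas init`. A sketch – the helper and directory names are arbitrary:

```bash
# move any single-end fastqs aside so only the paired files
# are seen during sample detection
stash_single_end() {
    mkdir -p single-end-reads
    for f in "$@"; do
        [ -f "$f" ] && mv "$f" single-end-reads/
    done
}

# e.g.:
# stash_single_end SRR8859676.fastq SRR8859677.fastq
```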
```bash
time atlas run all
conda deactivate
```
### Downloading atlas WF results
All files (not including the initial reads) were copied into `all-atlas-files-and-outputs/`, tar'd/gzipped, and put on figshare. (My [bit](https://github.com/AstrobioMike/bit#conda-install) package has a script, `bit-figshare-upload`, that can do the upload from the command line, if that's ever helpful – it was here because downloading to local on MBLGuest was going to take forever...) It's ~2.3GB compressed and ~3GB uncompressed, and can be downloaded and unpacked with, e.g.:
```bash
curl -L -o all-atlas-files-and-outputs.tar.gz https://figshare.com/ndownloader/files/36353529
tar -xzvf all-atlas-files-and-outputs.tar.gz
```
---
## Processing with GeneLab workflow
My [GeneLab metagenomics workflow](https://github.com/nasa/GeneLab_Data_Processing/tree/master/Metagenomics/Illumina#genelab-bioinformatics-processing-protocol-for-illumina-metagenomics-data) (most up-to-date version [here](https://github.com/asaravia-butler/GeneLab_Data_Processing/tree/mikes-branch/Metagenomics/Illumina), which is what is used below) is just GeneLab's "standardized" way of processing all its metagenomics data. It's not built to be an expansive/flexible workflow like atlas above, which is why we're primarily using atlas – but running the GeneLab workflow too so we have those outputs also.
### Env creation
```bash
mamba create -y -n genelab-utils -c conda-forge -c bioconda -c defaults -c astrobiomike genelab-utils=1.0.42
conda activate genelab-utils
```
### Running
> **Note on reference databases**
> Many reference databases are relied upon throughout this workflow. They will be installed and set up automatically the first time the workflow is run. All together, after being installed and unpacked, they will take up about 240 GB of storage, but they may require up to 500 GB during installation and initial unpacking, so be sure there is enough room on your system before running the workflow.
The required reference databases will be set up on the first run if they don't already exist on the system. The following was also run in a [screen](https://astrobiomike.github.io/unix/screen-intro):
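Given the storage note above, it's worth checking available space first, e.g.:

```bash
# check free space on the filesystem holding the home directory
# (where the reference databases will land in this setup)
df -h ~
```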
```bash
# making directory for genelab MG refs
# (yes probably some overlap, but don't feel like figuring that out right now, ha)
mkdir ~/genelab-MG-refs
# getting workflow
GL-get-Illumina-metagenomics-wf
# Pulled Illumina metagenomic workflow from:
# github.com/asaravia-butler/GeneLab_Data_Processing/tree/mikes-branch/Metagenomics/Illumina
```
The GeneLab workflow isn't currently built to work with single-end and paired-end data together, so, as with atlas above, only the 2 paired-end samples are used: SRR8859675 and SRR8859678.
```bash
# making file holding unique sample IDs
ls *_1.fastq | cut -f 1 -d "_" > Illumina-metagenomics-workflow/unique-sample-IDs.txt
```
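A quick double-check that the ID file came out as expected – a sketch with a made-up helper name:

```bash
# re-derive the unique sample IDs from the paired R1 files
list_paired_ids() {
    for f in *_1.fastq; do
        basename "$f" | cut -f 1 -d "_"
    done
}

# e.g., this should print SRR8859675 and SRR8859678 here:
# list_paired_ids
```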
Changing into the workflow directory and modifying the config.yaml as follows:
```bash
cd Illumina-metagenomics-workflow/
nano config.yaml
# raw_reads_dir:
# "../"
# raw_R1_suffix:
# "_1.fastq.gz"
# raw_R2_suffix:
# "_2.fastq.gz"
# REF_DB_ROOT_DIR:
# "~/genelab-MG-refs/"
# gzipping the input reads
gzip ../*_?.fastq
snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 8 --latency-wait 60 -p
```
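Before kicking off a long run like the above, it can be worth confirming the gzipped inputs are intact (a sketch; the helper name is made up):

```bash
# gzip -t test-decompresses each file without writing anything out,
# and returns non-zero if any file is truncated/corrupt
check_gzipped() {
    if gzip -t "$@" 2> /dev/null; then
        echo "all gzipped inputs look intact"
    else
        echo "at least one file failed the gzip test"
    fi
}

# e.g.:
# check_gzipped ../*.fastq.gz
```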
### Downloading GeneLab WF results
All files (not including the initial reads) were copied into `all-GL-files-and-outputs/`, tar'd/gzipped, and put on figshare. (My [bit](https://github.com/AstrobioMike/bit#conda-install) package has a script, `bit-figshare-upload`, that can do the upload from the command line, if that's ever helpful – it was here because downloading to local on MBLGuest was going to take forever...) It's ~1.4GB compressed and ~1.5GB uncompressed, and can be downloaded and unpacked with, e.g.:
```bash
curl -L -o all-GL-files-and-outputs.tar.gz https://figshare.com/ndownloader/files/36370509
tar -xzvf all-GL-files-and-outputs.tar.gz
```
---