Mike Lee

@AstrobioMike

microbialomics.org/research github.com/AstrobioMike

Joined on Jun 29, 2018

  • BRAILLE update (31-Mar-2021): Interactive function plots and tables are here. Previous update docs:
      https://hackmd.io/@astrobiomike/BRAILLE-notes-17-Mar-2021
      https://hackmd.io/@astrobiomike/BRAILLE-update-24-Feb-2021
      https://hackmd.io/@astrobiomike/BRAILLE-update-3-Feb-2021
      https://hackmd.io/@astrobiomike/BRAILLE-notes-12-Dec-2020
  • Overview: Metagenomics attempts to sequence all the DNA present in a sample. It can provide a window into the taxonomy and functional potential of a mixed community. There are a ton of things that can be done with metagenomics data, as this non-exhaustive overview figure begins to highlight: (overview figure: https://astrobiomike.github.io/images/metagenomics_overview.png) This page is an introduction to some concepts about one of the things we can try to do with metagenomics data: recovering metagenome-assembled genomes (MAGs). Key concepts
  • (GToTree logo: https://github.com/AstrobioMike/AstrobioMike.github.io/raw/master/images/GToTree-logo-1200px.png) GToTree is a user-friendly workflow for phylogenomics. This page is an example of using GToTree to make a phylogenomic tree incorporating newly recovered Metagenome-Assembled Genomes (MAGs) and references from the stellar Genome Taxonomy Database (GTDB). Contents NOTE: This page assumes some baseline familiarity with the Unix-like command line. You can find a Crash Course here if wanted 🙂
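As a rough sketch of the kind of GToTree run that note builds toward, assuming hypothetical input files (our-MAG-paths.txt listing the MAG fasta paths, and ref-accessions.txt listing reference genome accessions pulled from GTDB) and the pre-packaged bacterial single-copy gene set:

```bash
# hypothetical inputs: one MAG fasta path per line in our-MAG-paths.txt,
# and one reference accession per line in ref-accessions.txt
GToTree -f our-MAG-paths.txt \
        -a ref-accessions.txt \
        -H Bacteria \
        -j 4 \
        -o MAGs-plus-GTDB-refs
```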
  • GUI used was Jetstream2 exosphere: https://jetstream2.exosphere.app/ Summary info: The base image created below is publicly available as "STAMPS-2023" and includes: conda v23.5.2 / mamba v1.4.9; jupyterlab v3.6.3 in the base conda env; an anvio-dev conda environment; R v4.3.1 / RStudio Server (2023.06.1-524) with: BiocManager 1.30.21, remotes 2.4.2
  • This page gives an overview of how one could subset a pre-packaged HMM single-copy gene set. For example, you might have specific genomes you want to work with and want to use a pre-packaged HMM file (like Cyanobacteria.hmm, which holds 251 target genes at the time of putting this page together (12-Feb-2024)), but only want a subset of those target genes. General process (example code below):
      1. Run GToTree with everything (all wanted input genomes, pointing to the pre-packaged HMM file that is suitable)
      2. Use the output "SCG_hit_counts.tsv" table to figure out which target genes we want to use based on the hit-counts of each target gene per genome; put the wanted gene names (as they appear in the "SCG_hit_counts.tsv" file) into a plain-text file, one per line
      3. Make a subset SCG-targets hmm file with hmmfetch (which comes installed with the GToTree conda environment)
      4. Run GToTree again with all input genomes, but now pointing to the newly created subset hmm file passed to the -h argument
    Example code: The example code will be based on running the gtt-test.sh program to start (so there is standard data to be used for this example), but would work the same after adjusting for your situation. If you have trouble, feel free to reach out to me for help: MikeLee<at>bmsis<dot>org
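A minimal sketch of the hmmfetch step described in that list, assuming hypothetical file names (wanted-target-genes.txt holding the gene names pulled from SCG_hit_counts.tsv, and Cyanobacteria-subset.hmm as the output):

```bash
# index the pre-packaged HMM file if it isn't indexed yet (creates a .ssi file)
hmmfetch --index Cyanobacteria.hmm

# pull just the wanted target-gene profiles (one name per line in
# wanted-target-genes.txt) into a new, subset HMM file
hmmfetch -f Cyanobacteria.hmm wanted-target-genes.txt > Cyanobacteria-subset.hmm
```

GToTree can then be run again pointing at the new Cyanobacteria-subset.hmm file.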
  • Getting and entering the container: This starts off just on the regular log-in node, which is fine for getting the container at first and jumping into it to look around. But if trying to do any heavy processing, we'll want to be sure we are on a compute node (example in the next section for doing that). With singularity installed already (which is the case on our system):
      # pulling the image and building the container (only needs to be done once)
      # this will put it in the current working directory
      singularity pull docker://ugrbioinfo/srnatoolbox:latest
      # entering the container (this is if it's in our current working directory,
      # otherwise we'd need to provide a longer path to its location)
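The preview cuts off before the command that actually enters the container; a minimal sketch of what typically follows, assuming singularity's default image naming for that pull (srnatoolbox_latest.sif):

```bash
# open an interactive shell inside the container
singularity shell srnatoolbox_latest.sif

# or run a single command inside it without starting an interactive shell
singularity exec srnatoolbox_latest.sif ls /
```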
  • Align, trim, and tree example. Env setup:
      # installing mamba if not there yet (enables faster conda installations)
      conda install -n base -c conda-forge mamba
      mamba create -n align-trim-and-tree -c conda-forge -c bioconda -c defaults muscle=5.1 trimal=1.4.1 iqtree=2.2.0_beta
      conda activate align-trim-and-tree
    Test sequences: Making a file holding the test sequences (copy and paste the whole codeblock into the command line):
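The preview stops before the test sequences and the commands themselves; a minimal sketch of the align/trim/tree steps with those three tools, assuming a hypothetical input file test-seqs.fa (in that environment the IQ-TREE binary is likely named iqtree2):

```bash
# align with muscle v5
muscle -align test-seqs.fa -output test-seqs-aligned.fa

# trim the alignment with trimal's automated heuristic
trimal -in test-seqs-aligned.fa -out test-seqs-trimmed.fa -automated1

# build a tree with IQ-TREE 2 (model selection plus ultrafast bootstraps)
iqtree2 -s test-seqs-trimmed.fa -m MFP -B 1000 -T 2 --prefix test-tree
```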
  • Baseline steps for running human-read removal. Creating environment:
      mamba create -n kraken2 -c conda-forge -c bioconda -c defaults kraken2==2.1.1
    Or one with nextflow also:
      mamba create -n kraken2-nextflow -c conda-forge -c bioconda -c defaults kraken2==2.1.1 nextflow
    Setting up reference db: Run these wherever you want to keep the reference db (it's only about 3 GB):
      # here is where I'm putting mine (will need to match what is in the code below)
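The preview cuts off before the db-setup and run commands; a minimal sketch of the removal step itself, assuming the human reference db already sits in a hypothetical kraken2-human-db/ directory and the reads are gzipped paired-end fastq files:

```bash
# classify reads against the human db; anything left unclassified (i.e., not
# called human) is written out as the cleaned read files (the '#' in the
# filename becomes _1/_2 for the read pair)
kraken2 --db kraken2-human-db \
        --threads 4 \
        --gzip-compressed \
        --paired \
        --unclassified-out "sample-human-removed_R#.fastq" \
        --output sample-kraken2-output.txt \
        --report sample-kraken2-report.tsv \
        sample_R1.fastq.gz sample_R2.fastq.gz
```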
  • (GToTree logo: https://github.com/AstrobioMike/AstrobioMike.github.io/raw/master/images/GToTree-logo-1200px.png) GToTree is a user-friendly workflow for phylogenomics. This page is an example of using GToTree to make a phylogenomic tree incorporating newly recovered Metagenome-Assembled Genomes (MAGs) and references from the stellar Genome Taxonomy Database (GTDB). Contents Environment creation: This is already done on our instances, but if we wanted to install GToTree with conda/mamba in a new location, we would do so like this:
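The preview ends right at that install command; a minimal sketch of what a conda/mamba install of GToTree typically looks like (the channel list here mirrors the other notes on this page and may differ slightly from the note's original):

```bash
# create a dedicated environment holding GToTree
mamba create -n gtotree -c conda-forge -c bioconda -c defaults gtotree

# activate it and confirm it's callable
conda activate gtotree
GToTree -h
```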
  • Example general conda env setups. The below uses mamba in place of conda for installs – it's worth picking this up if you're not familiar with it yet! See here for a quick intro to mamba. Jupyter Lab conda environment creation (last updated 3-Oct-2022):
      mamba create -n jupyterlab -c conda-forge jupyterlab python=3
    Then we can launch jupyter lab like so:
      conda activate jupyterlab
      jupyter lab
  • Unix-like systems (which we'll define in a little bit) are extremely pervasive in our computational world. And fortunately, having just a baseline familiarity can often grant us access to this extremely helpful and empowering environment. For our two sessions, we are going to run through a Unix crash course that is designed for folks completely new to the Unix-like command line, or those who would like a little more exposure to it. No installation or prior experience is required 🙂 Page Contents Schedule Date
  • KOFamScan setup and example: This is a short page demonstrating setting up and running KOFamScan. Env creation: Note, bit is not required, but it has a filtering function I use after KO annotations to only keep significant hits, and if there is more than one significant KO assigned to a given protein (which is extremely rare, but it happens), it will keep only the most significant one. There are examples of both outputs near the end of the page.
      mamba create -n kofamscan -c conda-forge -c bioconda -c defaults -c astrobiomike kofamscan bit
    Downloading required KOFamScan HMM profiles and ref file: This only needs to be done once. Afterwards we need to point to these when running the program.
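The preview cuts off before the download and run commands; a minimal sketch, assuming the standard kofam distribution locations (check the KofamScan docs if these have moved) and a hypothetical input protein fasta, proteins.faa:

```bash
# download and unpack the KO profile HMMs and the ko_list reference (done once)
curl -LO ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
curl -LO ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
gunzip ko_list.gz
tar -xzf profiles.tar.gz

# run KofamScan on a protein fasta, pointing to what was just downloaded
exec_annotation -p profiles/ -k ko_list --cpu 4 -f detail-tsv -o proteins-KO-tab.tsv proteins.faa
```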
  • Using genelab-utils to download GeneLab workflows: This page demonstrates using a program in the genelab-utils package to programmatically download packaged workflows from our Data Processing github repository. Contact Mike.Lee@nasa.gov if having trouble. tl;dr example usage:
      conda activate genelab-utils
      GL-get-workflow MG-Illumina
  • Using genelab-utils to download GLDS data: This page demonstrates using programs in the genelab-utils package to programmatically download specific files from a specific OSD or GLDS ID. Contact Mike.Lee@nasa.gov if having trouble. tl;dr example usage:
      conda activate genelab-utils
      # get raw fastq files from OSD-170
      GL-download-GLDS-data -g OSD-170 -p raw.fastq.gz
  • GeneLab metagenomics workflow example run: General workflow usage info can be found in the NASA GeneLab repo here. This page streamlines that a little and includes explicitly grabbing and running a small example dataset. This workflow as currently written relies on conda and snakemake. The metagenomics workflow does rely on large reference databases. They will be installed and set up automatically by the workflow the first time it is run. All together, after being downloaded and unpacked, they take up about 240 GB of storage, but they may require up to 500 GB during installation and initial unpacking. Due to the long time required for downloading and setting up the reference databases, it can take 12 hours or longer to run the example data here the first time. The example data are roughly 800 MB (the relatively large size for example data is so that MAGs are still recovered). Installing conda, mamba, and genelab-utils: conda: If conda is not already present, we recommend miniconda. Installers can be found from conda here, and if helpful, command-line installation is walked through here.
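A minimal sketch of the genelab-utils setup that page builds on; the channel list is an assumption based on the other conda commands in these notes:

```bash
# create an environment holding the genelab-utils helper programs
mamba create -n genelab-utils -c conda-forge -c bioconda -c defaults -c astrobiomike genelab-utils
conda activate genelab-utils

# pull down the packaged metagenomics (Illumina) workflow
GL-get-workflow MG-Illumina
```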
  • GeneLab amplicon workflow example run: General workflow usage info can be found in the NASA GeneLab repo here. This page streamlines that a little and includes explicitly grabbing and running a tiny example dataset. This workflow as currently written relies on conda and snakemake. There are no large reference requirements, and it takes about 5 minutes to run the small example dataset used below on a standard laptop. Installing conda, mamba, and genelab-utils: conda: If conda is not already present, we recommend miniconda. Installers can be found from conda here, and if helpful, command-line installation is walked through here. mamba: Once conda is installed on your system, we recommend installing mamba, as it generally allows for much faster conda installations:
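The preview cuts off right at that command; the mamba install step looks like this (same command used in the align/trim/tree note above):

```bash
# install mamba into the base conda environment
conda install -n base -c conda-forge mamba
```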
  • Setting up some Mac keyboard shortcuts I like on a new computer. NOTE: Some things have changed in newer macOS versions. See the very bottom for the updated way of creating keyboard shortcuts (at least for the word/character count one; it's likely still different for these workflow types - they were in "General" in the Services menu after I created them, though new ones don't seem to be :( ). iTerm launch (or open a new window if already open): in Automator (that's the program), the document type to create is "Quick Action"; set up the top options like this:
  • Review info for GeneLab MethylSeq processing document. Explanation of what this is about: Hello friends! We at GeneLab are putting together our "standardized" pipeline for processing MethylSeq data. If you have experience processing this datatype, we'd appreciate your set of eyes on things and any input if you have it 🙂 Primarily we are asking for a high-level review of: the general steps/planned process
  • FastANI example: FastANI is a tool for generating a metric, called average nucleotide identity (ANI), that tells us something about how similar two genomes are. It can be run on many genomes, but each value tells us something about a pairwise comparison between two genomes. It is on a scale between 0 and 100 and is (roughly) the percent of nucleotide bases that are identical between the two genomes 👍 [toc] Conda install: Using mamba on top of conda, 'cause it's faster. Including my bit package to be able to download example genomes quickly.
      mamba create -n fastani -c conda-forge -c bioconda -c defaults -c astrobiomike fastani bit -y
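A minimal sketch of a FastANI run, assuming hypothetical genome fasta files genome-A.fa and genome-B.fa (the note itself uses bit to download example genomes first):

```bash
conda activate fastani

# pairwise comparison of two genomes; the output holds query, reference,
# ANI value, and fragment-mapping counts
fastANI -q genome-A.fa -r genome-B.fa -o genome-A-vs-B-fastANI.tsv

# many-vs-many comparisons take plain-text lists of fasta paths instead
fastANI --ql genome-paths.txt --rl genome-paths.txt -o all-vs-all-fastANI.tsv
```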
  • Command-line blast example [toc] Making conda env:
      conda create -n blast -c conda-forge -c bioconda -c defaults blast
      conda activate blast
    Making example files:
      printf ">NR_024570.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence
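The preview cuts off inside the example-file creation; a minimal sketch of the database and search steps that typically follow, assuming hypothetical files ref-seqs.fa and query.fa:

```bash
# build a nucleotide blast database from a reference fasta
makeblastdb -in ref-seqs.fa -dbtype nucl -out ref-seqs-blast-db

# search a query against it, writing tab-delimited output (outfmt 6)
blastn -query query.fa -db ref-seqs-blast-db -out query-vs-ref-blast-out.tsv -outfmt 6
```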