Parsing plaac results down by taxonomy

---
tags: prions-fo-life
---

# Parsing plaac results down by taxonomy

---

> **NOTE**
> This was initially set up for Tom on Mac OSX. I'm not sure how things would need to be adjusted if trying to use this on Windows. If that's needed, let me know and I'll try to figure out how to help out :)

> **NOTE 2**
> This page has been updated to pull and work with the results from 1-June-2021.

---

[toc]

## Setting up the first time
We have already set up [conda](https://astrobiomike.github.io/unix/conda-intro) on your computer, Tom, so all you should need to do is open a terminal window (you can find it by searching for "terminal"), then copy and paste the entire following codeblock into the terminal window and press enter:

```bash
# updating to my latest bit toolkit package
conda install -y -c conda-forge -c bioconda -c defaults -c astrobiomike bit=1.8.33

# installing an additional program needed
conda install -y -c anaconda wget

# making a new directory for doing this stuff
  # Tom, if doing this again on your computer, you should remove or rename the old directory - you can do that in the normal finder window
mkdir -p ~/prion-plaac-taxonomy-parsing

# moving into our new directory
cd ~/prion-plaac-taxonomy-parsing

# downloading the main program (updated for 1-Jun-2021 generated data)
curl -L -o ad-hoc-parse-out-taxa.sh https://ndownloader.figshare.com/files/28364475
```

## Example usage

### Getting to where we need to be from a fresh computer screen
The above only needs to be done once. Anytime you want to use this after that, just open up a terminal window (you can search for "terminal") and run the following command to put us in that same location:

```bash
cd ~/prion-plaac-taxonomy-parsing
```

### Parsing out wanted taxonomy

We need to give the program 2 things:
1. The domain we want to target (archaea, bacteria, or eukarya)
2. The taxon we want to target (e.g. Escherichia coli)

**Notes to keep in mind**
* The taxonomy we search for will need to match exactly (e.g. "Escherichia coli" will be found, but "Escerichia coli" will not). The program will tell us if it can't find what we asked for.
* If what we search for matches more than what we actually want, the program won't know this. So when it's done, we should check the genome summary output table to make sure it only grabbed what we wanted (this specific table is noted below in the example outputs).
* The first time we do a domain, it will need to download our results files for that domain. This only needs to happen the first time a domain is specified, as the files will be saved and re-used afterwards. The file sizes are roughly as follows:
    * Archaea: ~1.5 GB
    * Bacteria: ~10 GB
    * Eukarya: ~16 GB

So be sure you have the space available on your computer before running for one of the heftier domains.

#### Sulfolobus acidocaldarius example
This is how we run the command to subset out *Sulfolobus acidocaldarius* (**note** that the taxon we are searching for needs to be in quotes as done here; this tells the command line to treat it as 1 *thing* even though there is a space in there, which is what it usually uses to break things apart):

```bash
bash ad-hoc-parse-out-taxa.sh archaea "Sulfolobus acidocaldarius"
```

As noted above, the first time we run this for archaea, it needs to download our results files for the domain, but only the first time. Including the download, this takes about 2 minutes to run. At the end we'll see something like this:

<a href="https://i.imgur.com/AvpXNkw.png"><img src="https://i.imgur.com/AvpXNkw.png"></a>

We can open this location in our "finder" window from our terminal here if we run this line (it needs to include the space then the period, so be sure to copy the whole thing):

```bash
open .
```

> **NOTE**
> Images below are from previous data, not the current run done on 1-June-2021. So details may look different, but the overview is all that matters in this example.
<a href="https://i.imgur.com/1XCpiK9.png"><img src="https://i.imgur.com/1XCpiK9.png"></a>

And inside the "Sulfolobus-acidocaldarius" folder, we see the following files:

<a href="https://i.imgur.com/QATUg5L.png"><img src="https://i.imgur.com/QATUg5L.png"></a>

---

* Sulfolobus-acidocaldarius-genome-plaac-results.tsv
    * holds the genome-level plaac result summaries, e.g.:

<a href="https://i.imgur.com/w7fHLEz.png"><img src="https://i.imgur.com/w7fHLEz.png"></a>

> **NOTE**
> This is the file we want to look at first to make sure the program didn't grab more than we actually wanted based on our taxonomy search.

---

* Sulfolobus-acidocaldarius-protein-plaac-positive-plaac-output.tsv
    * holds the full plaac program's output for the plaac-positive proteins, e.g.:

<a href="https://i.imgur.com/EKAzg8Z.png"><img src="https://i.imgur.com/EKAzg8Z.png"></a>

---

* Sulfolobus-acidocaldarius-plaac-positive-protein-info.tsv
    * holds the protein-level info for the plaac-positive proteins (full protein sequences are in the last column of this file), e.g.:

<a href="https://i.imgur.com/tIYKD5b.png"><img src="https://i.imgur.com/tIYKD5b.png"></a>

---

* Sulfolobus-acidocaldarius-plaac-positive-protein-IDs.txt
    * holds just the plaac-positive unique protein IDs in a single column

---

* Sulfolobus-acidocaldarius-plaac-positive-seqs.faa
    * is a fasta-formatted file holding the plaac-positive full protein sequences

---

---
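
If we'd like a quick check that the output files described above agree with each other, here is a minimal sketch of one way to do it. This is not part of the `ad-hoc-parse-out-taxa.sh` program. The helper names (`count_fasta_seqs`, `count_nonempty_lines`, `compare_ids_to_seqs`) are made up for this example, and it assumes the protein-IDs file has one ID per line with no header line, so adjust if that turns out not to be the case:

```bash
# hypothetical helper (not part of ad-hoc-parse-out-taxa.sh): counts the
# sequences in a fasta file by counting the header lines that start with ">"
count_fasta_seqs() {
    grep -c "^>" "$1"
}

# hypothetical helper: counts the non-empty lines in a text file
count_nonempty_lines() {
    grep -c "." "$1"
}

# hypothetical helper: reports whether the number of IDs matches the number
# of fasta sequences (assumes one ID per line and no header line)
compare_ids_to_seqs() {
    local num_ids num_seqs
    num_ids=$(count_nonempty_lines "$1")
    num_seqs=$(count_fasta_seqs "$2")
    if [ "$num_ids" -eq "$num_seqs" ]; then
        echo "match: ${num_ids} IDs and ${num_seqs} sequences"
    else
        echo "mismatch: ${num_ids} IDs but ${num_seqs} sequences"
    fi
}

# example usage, run from ~/prion-plaac-taxonomy-parsing (this assumes the
# output folder is located there, as in the screenshots):
# compare_ids_to_seqs \
#     Sulfolobus-acidocaldarius/Sulfolobus-acidocaldarius-plaac-positive-protein-IDs.txt \
#     Sulfolobus-acidocaldarius/Sulfolobus-acidocaldarius-plaac-positive-seqs.faa
```

After pasting the helpers into the terminal once, the last lines can be run without the leading `# ` characters. A "mismatch" message doesn't necessarily mean something went wrong (for example, the IDs file may have a header line after all), but it is a hint to take a closer look at the two files.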