---
tags: prions-fo-life
---

# Searching for specific GO terms

[toc]

Example case is for your request:

> an estimate of prion candidates in archaea and bacteria with GO cc outer membrane; cell outer membrane; extracellular region; extracellular matrix; extracellular space

## Archaea
We can get all of the plaac positives for archaea using the [program we put together previously](https://hackmd.io/@astrobiomike/parsing-plaac-results-by-tax) by saying we want all of archaea, e.g., at the terminal:

```bash
cd ~/prion-plaac-taxonomy-parsing

bash ad-hoc-parse-out-taxa.sh archaea "Archaea"
```

When that's done, it will have created an "Archaea" folder, and in that is the `Archaea-plaac-positive-protein-info.tsv` file, which has the info for each of the plaac-positive proteins and which we can open in Excel if we want:

<a href="https://i.imgur.com/yu44QDO.png"><img src="https://i.imgur.com/yu44QDO.png"/></a>

This file has 3,270 rows (one for each of the 3,269 plaac-positives, plus the header). And in column "L" there is "GO_cellular_component_names", which has things like integral component of membrane, cytoplasm, ATP-binding cassette (ABC) transporter complex, but I'm not seeing any of the ones you listed above in there.

## Bacteria

```bash
# just making sure we're in the proper starting directory again
cd ~/prion-plaac-taxonomy-parsing

bash ad-hoc-parse-out-taxa.sh bacteria "Bacteria"
```

Same deal: it creates a "Bacteria" folder that has the `Bacteria-plaac-positive-protein-info.tsv` file, which is about 31MB, so it is still workable in Excel. There, we can search column "L" and find things like extracellular matrix, outer membrane, and cell outer membrane.

<a href="https://i.imgur.com/aRBZyag.png"><img src="https://i.imgur.com/aRBZyag.png"/></a>

(Note, there are 2 fewer proteins in this file than are reported in our total counts. It seems one archaeal genome was pulled along with the bacterial ones when we downloaded things from UniProt, and it had 2 plaac positives.
It is only 2 proteins, so it's not affecting anything, but I haven't worked it out of the whole system yet.)

# Command-line tricks
You might find it helpful to be able to quickly search for things at the command line (the terminal window we ran the `ad-hoc-parse-out-taxa.sh` program in). So here are some examples looking into the Bacteria file we just created above. If you want a more solid foundation on this stuff, running through my short [crash course here](https://astrobiomike.github.io/unix/unix-intro) would be helpful 🙂

```bash
# first we are changing into the folder that holds the new bacteria protein info file
cd ~/prion-plaac-taxonomy-parsing/Bacteria

# we can see what's here by running `ls` (for list)
ls
```

And we can peek at our file of interest with `head`, which prints the first 10 lines:

```bash
head Bacteria-plaac-positive-protein-info.tsv
```

<a href="https://i.imgur.com/qFTCpIA.png"><img src="https://i.imgur.com/qFTCpIA.png"/></a>

This looks like kind of a mess because the lines are soft-wrapping, but it will become clearer with further examples.

## Check the columns
I have a program in my toolkit (which you already have installed) that displays the column names:

```bash
bit-colnames Bacteria-plaac-positive-protein-info.tsv
```

<a href="https://i.imgur.com/jigua6r.png"><img src="https://i.imgur.com/jigua6r.png"/></a>

We can see that "GO_cellular_component_names" is column 12. For some of the examples below, we'll also use column 1 (unique_protein_ID), so you'll see how the same approach works for any of the columns of interest.
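
If `bit-colnames` isn't available on some other machine, a similar numbered listing can be approximated with standard tools. This is only a sketch: it builds a tiny stand-in table (the `col2` to `col11` names are placeholders I made up; only columns 1 and 12 match what is reported above for the real file), and `list_columns` is a hypothetical helper name, not part of any toolkit.

```bash
# build a tiny stand-in table; col2..col11 are made-up placeholder names
demo=$(mktemp)
printf 'unique_protein_ID\tcol2\tcol3\tcol4\tcol5\tcol6\tcol7\tcol8\tcol9\tcol10\tcol11\tGO_cellular_component_names\n' > "$demo"
printf 'P1\t.\t.\t.\t.\t.\t.\t.\t.\t.\t.\tcell outer membrane\n' >> "$demo"

# hypothetical helper: take the header line, put each tab-separated
# name on its own line, and number the lines
list_columns() {
    head -n 1 "$1" | tr '\t' '\n' | nl
}

list_columns "$demo"
```

On the real file, that would be `list_columns Bacteria-plaac-positive-protein-info.tsv`.
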
## Focusing on specific columns
We can cut out which columns we want to look at with the `cut` command, e.g.:

```bash
cut -f 1,12 Bacteria-plaac-positive-protein-info.tsv | head
```

<a href="https://i.imgur.com/PMA6O2D.png"><img src="https://i.imgur.com/PMA6O2D.png"/></a>

> The pipe (`|`) followed by the `head` command says to take the output from the `cut` command and send it into `head`, so we only see the first 10 lines instead of all 30,000+ printing to our terminal.

## Search for text
If we want to know if "cell outer membrane" is in the "GO_cellular_component_names" column (column 12), we can use `grep`, which is the command-line text search command.

```bash
cut -f 1,12 Bacteria-plaac-positive-protein-info.tsv | grep "cell outer membrane" | head
```

<a href="https://i.imgur.com/jn0fYME.png"><img src="https://i.imgur.com/jn0fYME.png"/></a>

`grep` is searching each line for the pattern we entered ("cell outer membrane" in this example), and it is printing to the screen each line that holds it. This is why we "piped" it into `head` again, to only print the first 10 matches instead of however many there are.

## Counting how many proteins have this text
If we want to know how many have this annotation in that column, we can have `grep` tell us how many lines had matching text (which in this case is how many proteins have the annotation), rather than printing all the matching lines to the screen. We do this by adding the optional argument `-c` to `grep`:

```bash
cut -f 1,12 Bacteria-plaac-positive-protein-info.tsv | grep -c "outer membrane"
```

<a href="https://i.imgur.com/dWRNwPN.png"><img src="https://i.imgur.com/dWRNwPN.png"/></a>

And it seems there are 1,209. Say we wanted to know about "extracellular matrix":

```bash
cut -f 1,12 Bacteria-plaac-positive-protein-info.tsv | grep -c "extracellular matrix"
```

<a href="https://i.imgur.com/Oq55Ni3.png"><img src="https://i.imgur.com/Oq55Ni3.png"/></a>

Which shows 13.
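
To run the same kind of count for every term in the request in one go, a small loop is one option. The sketch below uses a made-up four-protein table in place of the real file (the filler columns 2 to 11 and the `; ` separator between GO names are assumptions), and `count_term` is just an illustrative helper name.

```bash
# made-up stand-in table: awk pads each "ID|GO names" row out to 12
# tab-separated columns, using the numbers 2..11 as filler
demo=$(mktemp)
awk -F'|' 'BEGIN { OFS = "\t" } { print $1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, $2 }' > "$demo" <<'EOF'
unique_protein_ID|GO_cellular_component_names
P1|cell outer membrane
P2|extracellular matrix; cell outer membrane
P3|cytoplasm
P4|extracellular region
EOF

# hypothetical helper: count rows whose column 12 contains the given text
# (-F treats the text literally; "|| true" keeps a count of 0 from being an error)
count_term() {
    cut -f 12 "$1" | grep -c -F -- "$2" || true
}

for term in "outer membrane" "cell outer membrane" "extracellular region" \
            "extracellular matrix" "extracellular space"; do
    printf '%s\t%s\n' "$term" "$(count_term "$demo" "$term")"
done
```

Against the real file, a single call would look like `count_term Bacteria-plaac-positive-protein-info.tsv "extracellular matrix"`. Keep in mind that `grep` matches the text anywhere in the field, so "outer membrane" also counts every "cell outer membrane" line.
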
If we wanted to count lines matching either "outer membrane" or "extracellular matrix" in one search (the `\|` in the pattern means "or"):

```bash
cut -f 1,12 Bacteria-plaac-positive-protein-info.tsv | grep -c "outer membrane\|extracellular matrix"
```

<a href="https://i.imgur.com/Oq55Ni3.png"><img src="https://i.imgur.com/Oq55Ni3.png"/></a>

Which tells us 1,122. That is the sum of the two before; if one protein had both annotations, the total would have been 1,121, because that protein would have been counted only once.

---
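
As a small worked example of why a combined count can come out lower than the sum of the separate counts, here is a sketch on a made-up table in which one protein (P2) carries both annotations. The table contents, the filler columns, and the `; ` separator are all assumptions for illustration. It uses `grep -E`, where a plain `|` means "or".

```bash
# made-up stand-in table, padded out to 12 tab-separated columns
demo=$(mktemp)
awk -F'|' 'BEGIN { OFS = "\t" } { print $1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, $2 }' > "$demo" <<'EOF'
unique_protein_ID|GO_cellular_component_names
P1|cell outer membrane
P2|extracellular matrix; cell outer membrane
P3|extracellular matrix
P4|cytoplasm
EOF

# keep just column 12 so the searches only see the GO names
go=$(mktemp)
cut -f 12 "$demo" > "$go"

outer=$(grep -c "outer membrane" "$go" || true)
matrix=$(grep -c "extracellular matrix" "$go" || true)
either=$(grep -c -E "outer membrane|extracellular matrix" "$go" || true)

# proteins with both annotations = sum of the separate counts minus the combined count
both=$(( outer + matrix - either ))

printf 'outer membrane\t%s\n' "$outer"
printf 'extracellular matrix\t%s\n' "$matrix"
printf 'either\t%s\n' "$either"
printf 'both\t%s\n' "$both"
```

Here the separate counts are 2 and 2, but the combined count is 3, because P2 is only counted once.
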